December 7, 2024

Decoding Methods for LLMs

A verbose explanation of common auto-regressive decoding methods: temperature, top k and top p.

LLMs are, at their core, completely deterministic. In a controlled environment with fixed parameters, identical input, and a known random seed, an LLM will generate the same output every time; there is no “built-in” randomness. Yet we often want more interesting and varied outputs, rather than always the single most probable one, so we deliberately introduce methods that create the appearance of randomness and creativity. This article explains these methods and how they influence an LLM’s output distribution.

Note: confusion often arises because most users (and developers) encounter LLMs through proprietary APIs, which are, surprisingly, effectively impossible to make consistent even when all the top-level hyper-parameters are exposed. This is due to other factors such as GPU thread scheduling, rounding behaviour in floating-point arithmetic, opaque model versioning, and so on, and it warrants its own, separate discussion.

Setup

For this discussion I’ll assume a standard auto-regressive language modelling objective. Let $V = \{ v_1, v_2, \ldots, v_{|V|} \}$ be our vocabulary of discrete tokens. Parameters $\theta$ are static at inference time, so for any input sequence $x_1, \ldots, x_n$ the model produces a fixed probability distribution over the token vocabulary $V$ at each generation / decoding step $t$:

$$
\begin{aligned}
z_t &= f(x_1, \ldots, x_n, y_1, \ldots, y_{t-1}; \theta) \\
p(y_t \mid x_1, \ldots, x_n, y_1, \ldots, y_{t-1}; \theta) &= \text{Softmax}(z_t)
\end{aligned}
$$

Here, $z_t$ is a vector containing the model’s output logits, one for each token in $V$. The Softmax for the token at index $i$ (corresponding to $v_i \in V$) is given by:

$$
p(y_t = v_i \mid x_1, \ldots, x_n, y_1, \ldots, y_{t-1}; \theta) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{|V|} \exp(z_{t,j})}.
$$
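To make this concrete, here is a minimal sketch of the Softmax step in Python/NumPy; the language choice and the helper name are my own, not anything mandated by the maths above:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert a vector of logits z_t into a probability distribution over V."""
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the resulting probabilities.
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

# Toy example: a 5-token vocabulary.
z_t = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
p = softmax(z_t)
print(p, p.sum())  # probabilities sum to 1
```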

Temperature

Temperature ($T$) is a parameter used to control the amount of randomness during generation. It simply scales the logits before they are converted to probabilities:

$$
\text{Softmax}(z_{t,i}; T) = \frac{e^{z_{t,i} / T}}{\sum_{j=1}^{|V|} e^{z_{t,j} / T}}
$$

So for:

$T > 1$: the probability distribution becomes more uniform, leading to more “creative” sampling.

$T < 1$: the probability distribution becomes sharper, meaning more confident generations.

In production settings you’re typically quite worried about hallucinations, so you’ll likely set $T = 0$. However, in the equation above $T = 0$ is undefined (division by zero). In practice, $T = 0$ is implemented as greedy decoding: you take the most probable token directly and bypass the Softmax calculation entirely.
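As an illustration, here is a hedged sketch of temperature sampling in Python/NumPy; the function name and the explicit $T = 0$ convention are my own framing of the behaviour described above:

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float, rng=None) -> int:
    """Sample a token index from temperature-scaled logits."""
    if temperature == 0.0:
        # T = 0 convention: greedy decoding, bypassing the Softmax entirely.
        return int(np.argmax(logits))
    rng = rng or np.random.default_rng()
    scaled = logits / temperature   # T > 1 flattens the distribution, T < 1 sharpens it
    scaled -= scaled.max()          # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

z_t = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
print(sample_with_temperature(z_t, temperature=0.7))
```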

Top K Sampling

Top K sampling restricts the next-token selection to only the top $k$ most probable tokens at each decoding step. Starting with the full distribution:

$$
p(y_t = v_i \mid x_1, \ldots, x_n, y_1, \ldots, y_{t-1}; \theta)
$$

we first sort the tokens in $V$ according to their probabilities in descending order:

$$
p(y_t = v_{i_1}) \geq p(y_t = v_{i_2}) \geq \cdots \geq p(y_t = v_{i_{|V|}})
$$

We then take the top $k$ tokens:

$$
S_{\text{top } k} = \{ v_{i_1}, v_{i_2}, \ldots, v_{i_k} \}.
$$

To form the final probability distribution after top $k$ truncation, we renormalise the probabilities of only these selected tokens:

$$
p_{\text{top } k}(y_t = v_i) =
\begin{cases}
\dfrac{p(y_t = v_i)}{\sum_{v_j \in S_{\text{top } k}} p(y_t = v_j)} & \text{if } v_i \in S_{\text{top } k} \\
0 & \text{otherwise}
\end{cases}
$$
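A minimal sketch of this truncate-and-renormalise step in Python/NumPy (the helper name `top_k_filter` is illustrative, not a standard API), operating directly on a probability vector:

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Zero out everything outside the k most probable tokens, then renormalise."""
    top_indices = np.argsort(probs)[-k:]   # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[top_indices] = probs[top_indices]
    return filtered / filtered.sum()

# Example: keep only the 2 most probable of 4 tokens.
p = np.array([0.5, 0.3, 0.15, 0.05])
print(top_k_filter(p, k=2))  # -> [0.625, 0.375, 0.0, 0.0]
```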

Top P (nucleus) Sampling

Top P sampling selects the smallest set of tokens whose cumulative probability exceeds the specified $p$. So let $S_{\text{top } p}$ be the smallest set of tokens:

$$
S_{\text{top } p} = \{ v_{i_1}, v_{i_2}, \ldots, v_{i_r} \}
$$

such that:

$$
\sum_{k=1}^{r} p(y_t = v_{i_k}) \geq p \quad\text{and}\quad \sum_{k=1}^{r-1} p(y_t = v_{i_k}) < p.
$$
This set $S_{\text{top } p}$ is constructed by starting from the most likely token and adding tokens in descending order of their probability until the cumulative probability surpasses $p$. Once we have identified $S_{\text{top } p}$, we discard all tokens not in that set and renormalise the probabilities of the tokens in $S_{\text{top } p}$, just as we did for top $k$ sampling:

$$
p_{\text{top } p}(y_t = v_i) =
\begin{cases}
\dfrac{p(y_t = v_i)}{\sum_{v_j \in S_{\text{top } p}} p(y_t = v_j)} & \text{if } v_i \in S_{\text{top } p} \\
0 & \text{otherwise}
\end{cases}
$$
The idea here is to choose a variable-size set of candidate tokens at each decoding step. Instead of a fixed-size truncation like top $k$, top $p$ sampling adapts dynamically to the shape of the distribution. If the model is very confident about a few tokens, $S_{\text{top } p}$ might be small; if the probabilities are more spread out, $S_{\text{top } p}$ grows larger. By adjusting $p$, you control how “broad” the sampling distribution is: lower $p$ makes it more greedy, while higher $p$ yields more diverse outputs.
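Here is an analogous sketch for the nucleus case, again in Python/NumPy with an illustrative helper name of my own choosing:

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest prefix of the sorted distribution whose mass reaches p, then renormalise."""
    order = np.argsort(probs)[::-1]        # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # r = number of tokens needed for the cumulative probability to reach p
    r = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:r]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Example: with p = 0.75 the nucleus is the first two tokens
# (0.5 alone is below 0.75; 0.5 + 0.3 reaches it).
p_dist = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_filter(p_dist, p=0.75))  # -> [0.625, 0.375, 0.0, 0.0]
```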

Stop Sequences

Stop sequences are predefined token patterns that, if generated, immediately stop the decoding process. Unlike the previous methods that adjust token probabilities, stop sequences do not affect the model’s distribution. Instead, they provide a deterministic cutoff: once a stop sequence appears, no further tokens are produced. This helps with enforcing strict output formats.
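As a rough sketch, a hand-rolled decoding loop might implement the cutoff as a simple string truncation like the one below; real implementations typically check the growing token stream during generation rather than post-hoc, and the helper name is my own:

```python
def apply_stop_sequences(text: str, stop_sequences: list[str]) -> str:
    """Truncate generated text at the first occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)  # everything from the stop sequence onwards is dropped
    return text[:cut]

print(apply_stop_sequences("Answer: 42\n###\nScratchpad...", ["###"]))  # -> "Answer: 42\n"
```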

Final Remarks

I believe that having a precise mental model of these techniques is really important when you’re building with LLMs. We don’t have many tools for controlling proprietary LLMs, but these few parameters do let us shape the model’s output distribution in meaningful ways, and hopefully that lets us build more robust systems.
