December 7, 2024

Decoding Methods for LLMs

A verbose explanation of common auto-regressive decoding methods: temperature, top k and top p.

LLMs are, at their core, completely deterministic. In a controlled environment with fixed parameters, identical input, and a known random seed, an LLM will generate the same output every time; there is no “built-in” randomness. Yet we often want more interesting and varied outputs, rather than always the single most probable one, so we deliberately introduce methods that create the appearance of randomness and creativity. This article explains these methods and how they influence an LLM’s output distribution.

Note: confusion often arises because most users (and developers) encounter LLMs through proprietary APIs, which are, surprisingly, effectively impossible to make consistent even when all the top-level hyper-parameters are exposed. This is due to other factors such as GPU thread scheduling, rounding behaviour in floating-point arithmetic, opaque model versioning, and so on, and it warrants its own, separate discussion.

Setup

For this discussion I’ll assume a standard auto-regressive language modelling objective. Let $V = \{ v_1, v_2, \ldots, v_{|V|} \}$ be our vocabulary of discrete tokens. Parameters $\theta$ are static at inference time, so for any input sequence $x_1, \ldots, x_n$ the model produces a fixed probability distribution over the token vocabulary $V$ at each generation / decoding step $t$:

$$
\begin{aligned}
z_t &= f(x_1, \ldots, x_n, y_1, \ldots, y_{t-1}; \theta) \\
p(y_t \mid x_1, \ldots, x_n, y_1, \ldots, y_{t-1}; \theta) &= \text{Softmax}(z_t)
\end{aligned}
$$

Here, $z_t$ is a vector containing the model’s output logits, one for each token in $V$. The Softmax for the token at index $i$ (corresponding to $v_i \in V$) is given by:

$$
p(y_t = v_i \mid x_1, \ldots, x_n, y_1, \ldots, y_{t-1}; \theta) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{|V|} \exp(z_{t,j})}.
$$
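To make this concrete, here is a minimal sketch of the Softmax step in Python/NumPy; the language choice and the helper name are my own, not anything mandated by the maths above:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert a vector of logits z_t into a probability distribution over V."""
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the resulting probabilities.
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

# Toy example: a 5-token vocabulary.
z_t = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
p = softmax(z_t)
print(p, p.sum())  # probabilities sum to 1
```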

Temperature

Temperature ($T$) is a parameter used to control the amount of randomness during generation. It simply scales the logits before they are converted to probabilities:

$$
\text{Softmax}(z_{t,i}; T) = \frac{e^{z_{t,i} / T}}{\sum_{j=1}^{|V|} e^{z_{t,j} / T}}
$$

So for:

$T > 1$: the probability distribution becomes more uniform, leading to more “creative” sampling.

$T < 1$: the probability distribution becomes sharper, meaning more confident generations.

In production settings you’re typically quite worried about hallucinations, so you’ll likely set $T = 0$. However, in the equation above $T = 0$ is undefined (division by zero). In practice, $T = 0$ is implemented as greedy decoding: you take the most probable token directly and bypass the Softmax calculation entirely.
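As an illustration, here is a hedged sketch of temperature sampling in Python/NumPy; the function name and the explicit $T = 0$ convention are my own framing of the behaviour described above:

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float, rng=None) -> int:
    """Sample a token index from temperature-scaled logits."""
    if temperature == 0.0:
        # T = 0 convention: greedy decoding, bypassing the Softmax entirely.
        return int(np.argmax(logits))
    rng = rng or np.random.default_rng()
    scaled = logits / temperature   # T > 1 flattens the distribution, T < 1 sharpens it
    scaled -= scaled.max()          # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

z_t = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
print(sample_with_temperature(z_t, temperature=0.7))
```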

Top K Sampling

Top K sampling restricts the next-token selection to only the top $k$ most probable tokens at each decoding step. Starting with the full distribution:

$$
p(y_t = v_i \mid x_1, \ldots, x_n, y_1, \ldots, y_{t-1}; \theta)
$$

we first sort the tokens in $V$ according to their probabilities in descending order:

$$
p(y_t = v_{i_1}) \geq p(y_t = v_{i_2}) \geq \cdots \geq p(y_t = v_{i_{|V|}})
$$

We then take the top $k$ tokens:

$$
S_{\text{top } k} = \{ v_{i_1}, v_{i_2}, \ldots, v_{i_k} \}.
$$

To form the final probability distribution after top $k$ truncation, we renormalise the probabilities of only these selected tokens:

$$
p_{\text{top } k}(y_t = v_i) =
\begin{cases}
\dfrac{p(y_t = v_i)}{\sum_{v_j \in S_{\text{top } k}} p(y_t = v_j)} & \text{if } v_i \in S_{\text{top } k} \\
0 & \text{otherwise}
\end{cases}
$$
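A minimal sketch of this truncate-and-renormalise step in Python/NumPy (the helper name `top_k_filter` is illustrative, not a standard API), operating directly on a probability vector:

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Zero out everything outside the k most probable tokens, then renormalise."""
    top_indices = np.argsort(probs)[-k:]   # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[top_indices] = probs[top_indices]
    return filtered / filtered.sum()

# Example: keep only the 2 most probable of 4 tokens.
p = np.array([0.5, 0.3, 0.15, 0.05])
print(top_k_filter(p, k=2))  # -> [0.625, 0.375, 0.0, 0.0]
```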

Top P (nucleus) Sampling

Top P sampling selects the smallest set of tokens whose cumulative probability exceeds the specified $p$. So let $S_{\text{top } p}$ be the smallest set of tokens:

$$
S_{\text{top } p} = \{ v_{i_1}, v_{i_2}, \ldots, v_{i_r} \}
$$

such that:

$$
\sum_{k=1}^{r} p(y_t = v_{i_k}) \geq p \quad\text{and}\quad \sum_{k=1}^{r-1} p(y_t = v_{i_k}) < p.
$$
This set $S_{\text{top } p}$ is constructed by starting from the most likely token and adding tokens in descending order of their probability until the cumulative probability surpasses $p$. Once we have identified $S_{\text{top } p}$, we discard all tokens not in that set and renormalise the probabilities of the tokens in $S_{\text{top } p}$, just as we did for top $k$ sampling:

$$
p_{\text{top } p}(y_t = v_i) =
\begin{cases}
\dfrac{p(y_t = v_i)}{\sum_{v_j \in S_{\text{top } p}} p(y_t = v_j)} & \text{if } v_i \in S_{\text{top } p} \\
0 & \text{otherwise}
\end{cases}
$$
The idea here is to choose a variable-size set of candidate tokens at each decoding step. Instead of a fixed-size truncation like top $k$, top $p$ sampling adapts dynamically to the shape of the distribution. If the model is very confident about a few tokens, $S_{\text{top } p}$ might be small; if the probabilities are more spread out, $S_{\text{top } p}$ grows larger. By adjusting $p$, you control how “broad” the sampling distribution is: lower $p$ makes it more greedy, while higher $p$ yields more diverse outputs.
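Here is an analogous sketch for the nucleus case, again in Python/NumPy with an illustrative helper name of my own choosing:

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest prefix of the sorted distribution whose mass reaches p, then renormalise."""
    order = np.argsort(probs)[::-1]        # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # r = number of tokens needed for the cumulative probability to reach p
    r = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:r]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Example: with p = 0.75 the nucleus is the first two tokens
# (0.5 alone is below 0.75; 0.5 + 0.3 reaches it).
p_dist = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_filter(p_dist, p=0.75))  # -> [0.625, 0.375, 0.0, 0.0]
```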

Stop Sequences

Stop sequences are predefined token patterns that, if generated, immediately stop the decoding process. Unlike the previous methods that adjust token probabilities, stop sequences do not affect the model’s distribution. Instead, they provide a deterministic cutoff: once a stop sequence appears, no further tokens are produced. This helps with enforcing strict output formats.
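As a rough sketch, a hand-rolled decoding loop might implement the cutoff as a simple string truncation like the one below; real implementations typically check the growing token stream during generation rather than post-hoc, and the helper name is my own:

```python
def apply_stop_sequences(text: str, stop_sequences: list[str]) -> str:
    """Truncate generated text at the first occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)  # everything from the stop sequence onwards is dropped
    return text[:cut]

print(apply_stop_sequences("Answer: 42\n###\nScratchpad...", ["###"]))  # -> "Answer: 42\n"
```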

Final Remarks

I believe that having a precise mental model of these techniques is really important when you’re building with LLMs. We don’t have many tools for controlling proprietary LLMs, but these few parameters do let us shape the model’s output distribution in meaningful ways, and hopefully that lets us build more robust systems.
