December 7, 2024
Decoding Methods for LLMs
A verbose explanation of common auto-regressive decoding methods: temperature, top k and top p.
LLMs are, at their core, completely deterministic. In a controlled environment (fixed parameters, identical input, and a known random seed), an LLM will generate the same output every time; there is no built-in randomness. Yet we often want more interesting and varied outputs than just the most probable one, so we deliberately introduce sampling methods that create the illusion of randomness and creativity. This article explains these methods and how they influence an LLM's output distribution.
Note: confusion often arises because most users (and developers) encounter LLMs through proprietary APIs, which are, perhaps surprisingly, impossible to make fully consistent even when all of the top-level hyperparameters are exposed. This is due to other factors such as GPU thread scheduling, rounding behaviour in floating-point arithmetic, and opaque model versioning, and it warrants its own, separate discussion.
Setup
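The sections below all operate on the model's next-token distribution, so here is a minimal sketch of the setup assumed in the code examples that follow: a toy vocabulary, a vector of logits for a single decoding step, and a softmax that turns those logits into probabilities. The names and values (vocab, logits, softmax) are illustrative, not taken from any particular library.

```python
import numpy as np

vocab = ["the", "a", "cat", "dog", "pizza"]      # toy vocabulary
logits = np.array([2.0, 1.5, 0.3, 0.2, -1.0])    # unnormalized scores for one decoding step

def softmax(x: np.ndarray) -> np.ndarray:
    """Convert logits into a probability distribution."""
    z = x - np.max(x)                            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)
# probs ≈ [0.50, 0.30, 0.09, 0.08, 0.02]: "the" is the most likely next token,
# and greedy decoding would always pick it.
```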
Temperature
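Temperature rescales the logits before the softmax: dividing by a value below 1 sharpens the distribution toward the most likely tokens, a value above 1 flattens it, and as the temperature approaches 0 sampling degenerates to greedy (argmax) decoding. A minimal sketch, reusing the softmax and logits from the setup above:

```python
def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Return the next-token distribution after temperature scaling (T > 0)."""
    # T < 1 sharpens the distribution, T > 1 flattens it,
    # and T -> 0 approaches greedy decoding.
    return softmax(logits / temperature)

apply_temperature(logits, 0.5)   # peakier: most of the mass on "the"
apply_temperature(logits, 1.0)   # the model's unmodified distribution
apply_temperature(logits, 2.0)   # flatter: unlikely tokens gain probability
```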
Top K Sampling
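Top K keeps only the k highest-probability tokens, renormalizes the distribution over that set, and samples from it. A minimal sketch under the same toy setup (the helper name is mine, not any library's API):

```python
def top_k_sample(logits: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Sample a token index from only the k most likely tokens."""
    probs = softmax(logits)
    top_idx = np.argsort(probs)[-k:]      # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[top_idx] = probs[top_idx]    # zero out everything outside the top k
    filtered /= filtered.sum()            # renormalize over the surviving tokens
    return rng.choice(len(probs), p=filtered)

rng = np.random.default_rng(seed=0)       # fixed seed -> reproducible samples
token_id = top_k_sample(logits, k=2, rng=rng)
print(vocab[token_id])                    # always "the" or "a" when k=2
```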
Top P (nucleus) Sampling
Top P (nucleus) sampling keeps the smallest set of tokens $V^{(p)}$ such that:

$$\sum_{x \in V^{(p)}} P(x \mid x_{1:t-1}) \ge p$$

The probability mass is then renormalized over this set before sampling.
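A minimal sketch of nucleus sampling under the same toy setup (again, the helper is illustrative rather than a specific library's API):

```python
def top_p_sample(logits: np.ndarray, p: float, rng: np.random.Generator) -> int:
    """Sample from the smallest set of tokens whose cumulative probability >= p."""
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]               # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    nucleus_size = np.searchsorted(cumulative, p) + 1
    nucleus = order[:nucleus_size]
    filtered = np.zeros_like(probs)
    filtered[nucleus] = probs[nucleus]            # keep only the nucleus
    filtered /= filtered.sum()                    # renormalize over the nucleus
    return rng.choice(len(probs), p=filtered)

rng = np.random.default_rng(seed=0)
token_id = top_p_sample(logits, p=0.8, rng=rng)   # nucleus here is {"the", "a"}
print(vocab[token_id])
```

Unlike Top K's fixed k, the nucleus size adapts to the distribution: a confident prediction yields a small candidate pool, a flat one yields a large pool.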
Stop Sequences
Stop sequences are predefined token patterns that, if generated, immediately stop the decoding process. Unlike the previous methods, which adjust token probabilities, stop sequences do not affect the model's distribution. Instead, they provide a deterministic cutoff: once a stop sequence appears, no further tokens are produced. This helps with enforcing strict output formats.
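As a rough sketch of the mechanics, under the assumption of a simple decoding loop (the generate_next_token callback, the stop strings, and the character-level fake model are all hypothetical):

```python
def generate_with_stops(generate_next_token, stop_sequences, max_tokens=256):
    """Illustrative decoding loop: halt as soon as any stop sequence appears.

    generate_next_token is a stand-in for one model decoding step; it takes
    the text generated so far and returns the next piece of text.
    """
    text = ""
    for _ in range(max_tokens):
        text += generate_next_token(text)
        for stop in stop_sequences:
            if stop in text:
                # Truncate at the stop sequence and end decoding immediately.
                return text[: text.index(stop)]
    return text

# Toy usage: a fake "model" that emits a canned string one character at a time.
canned = '{"answer": 42}\nEND\nextra text that should never be returned'
fake_model = lambda text: canned[len(text)]
print(generate_with_stops(fake_model, stop_sequences=["\nEND"]))  # -> {"answer": 42}
```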
Final Remarks
I believe that having a precise mental model of these techniques is essential when you're building with LLMs. We don't have many tools for controlling proprietary LLMs, but these few parameters do let us shape the model's output distribution in meaningful ways and, hopefully, build more robust systems.