June 22, 2024

Structured Generation Shoot-out

A comparison of popular libraries for constraining LLM generations.

What is Structured Generation?

Structured generation (also sometimes referred to as constrained generation) aims to force a Large Language Model (LLM) to follow a certain schema or set of rules. This may be as simple as enforcing valid JSON but could be as complex as a totally custom context-free grammar. With open-weight models this works really well because you can directly control the token probabilities, which means you don’t even have to generate certain “fixed” tokens.

Think of getting an LLM to generate a list of strings in Python: ["a", "b", "c"]. We know that the first character has to be an opening square bracket ([) and that, before each entry in the list (besides the first), we have to have a comma. It makes sense then that we just fill in these tokens instead of hoping that the LLM will generate them correctly. This also speeds things up.
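A minimal sketch of that idea, assuming you have direct access to the model’s logits at each decoding step (the function and variable names here are illustrative, not from any particular library):

import torch

def constrain_logits(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    # logits is the 1-D vector of next-token scores; everything the schema
    # forbids at this step gets pushed to probability zero.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return logits + mask

# When the schema allows exactly one token (e.g. the opening "["), there is
# nothing to sample: the token id is appended directly instead of being
# generated, which is where the speed-up comes from.

This is roughly the trick the open-model libraries below build on; the hard part is computing allowed_token_ids efficiently for a full JSON schema or grammar.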

Why do LLMs Need Structured Generation?

LLMs are not perfect instruction following machines and humans are certainly not perfect instruction writing machines.

Sure, fine-tuning and other assorted alignment techniques make LLMs good at following instructions (especially when using well-chosen few-shot examples), but “good” is simply not good enough for generating output in formal languages like JSON and XML, which becomes necessary almost immediately when building anything beyond the most basic of chatbots.

With just prompting you might be able to get ~70% of the way there in terms of consistency. To get the rest you may be tempted to just write some defensive code to remove some random preamble here, or some rogue backticks there – but this path is treacherous. There’s a better way.

The Contenders

There’s certainly no lack of libraries, frameworks or APIs that are trying to make structured generation easy. But when the big AI labs like Anthropic and OpenAI change their interfaces for structured output / tool calling on what feels like a monthly basis, keeping up becomes a daunting task for any library maintainer. In the past year you may have heard of:

  • Function calling: an LLM that’s fine-tuned to generate function calls

  • Tool calling: same as above but with multiple functions

  • JSON mode: guaranteed JSON output

These are all different but can, in many cases, be used to achieve the same thing. In this section we’ll be discussing libraries that act as abstractions over the structured generation capabilities provided by black-box APIs and open-source models.
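For a concrete point of reference, here is roughly what one of those raw capabilities (JSON mode) looks like when calling the OpenAI API directly, i.e. what these libraries are abstracting over (a minimal sketch; the prompt is illustrative, and note that JSON mode only guarantees syntactically valid JSON, not any particular schema):

from openai import OpenAI

client = OpenAI()

# JSON mode: the response is guaranteed to parse as JSON, but the desired
# fields still have to be spelled out in the prompt yourself.
response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    response_format={"type": "json_object"},
    messages=[
        {"role": "user", "content": "Give me a joke as JSON with 'setup' and 'punchline' keys."}
    ],
)
print(response.choices[0].message.content)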

Langchain

Starting with the library that everyone loves to hate: Langchain.

Since Langchain covers a lot of ground in LLM orchestration, this clearly isn’t going to be their main focus (nothing is). Nevertheless, for basic usage there’s a relatively straightforward common interface to get structured output out of the models that Langchain supports: .with_structured_output.

So to get an OpenAI LLM to adhere to a Pydantic model, all you need to do is:

from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

class Joke(BaseModel):
    """A hilarious joke"""
    setup: str = Field(description="The setup of the joke")
    punchline: str = Field(description="The punchline to the joke")

model = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm = model.with_structured_output(Joke)
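Calling the wrapped model then works like any other Langchain runnable (a short usage sketch; the prompt is illustrative):

joke = structured_llm.invoke("Tell me a joke about cats")
# joke is a Joke instance, e.g. Joke(setup='...', punchline='...')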

There’s a bit of magic going on here as Langchain will inject the class name and doc string into the prompt to help steer the LLM.

If you prefer less magic you can also use regular JSON schemas instead of Pydantic models - one of the big caveats being that the JSON output won’t be fully validated, so you’ll need to write some checks yourself.
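In practice that means passing a plain dict in place of the class (a hedged sketch; with_structured_output accepts a JSON schema dict, and the result then comes back as a dict rather than a validated object):

json_schema = {
    "title": "Joke",
    "description": "A hilarious joke",
    "type": "object",
    "properties": {
        "setup": {"type": "string", "description": "The setup of the joke"},
        "punchline": {"type": "string", "description": "The punchline to the joke"},
    },
    "required": ["setup", "punchline"],
}

structured_llm = model.with_structured_output(json_schema)
# Returns a plain dict - no Pydantic validation, so check the keys yourself.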

If you’re already using Langchain this is a solid option. Beware though - if you need deeper control of the exact parameters the underlying API exposes (Vertex AI, Anthropic and OpenAI are all quite different) you’re likely going to run into some trouble.

LlamaIndex

LlamaIndex is another one of the big players in the “LLM abstraction layer” space. It has pretty basic native capabilities for structured generation but has solid integrations in the form of output parsing modules, which means you can use stand-alone libraries such as Guidance (discussed below) and Guardrails AI. It even integrates with Langchain (in case you wanted an ultra heavy-weight solution).

The native offering involves simply passing a Pydantic model into one of their query engines (a generic interface that allows you to ask questions over your data):

from llama_index.core import VectorStoreIndex
from typing import List
from pydantic import BaseModel

class Biography(BaseModel):
    """Data model for a biography."""
    name: str
    best_known_for: List[str]
    extra_info: str

# `documents` is assumed to be loaded already (e.g. via SimpleDirectoryReader)
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(
    response_mode="tree_summarize", output_cls=Biography
)

response = query_engine.query("Who is Paul Graham?")

Guidance

Guidance, developed by Microsoft, is one of the earliest constrained generation libraries. It had a brief hiatus in development but is now back in full force. It’s a really interesting and powerful library that provides support for custom CFGs and regex. It’s pretty complex under the hood - I recommend this article if you want to take a deeper look.

For most use-cases, however, you’ll probably be using the pre-built components / parsers. These include your standard JSON generation according to a schema (with guidance.json) but also some more unique capabilities like substring:

from guidance import substring

# define a set of possible statements
text = 'guidance is awesome. guidance is the best thing since sliced bread.'

# force the model to make an exact quote
# (`llama2` is a guidance model object loaded earlier, e.g. via guidance.models)
llama2 + f'Here is a true statement about the guidance library: "{substring(text)}"'

These types of guarantees are really great for grounding RAG pipelines in actual quotes.

Note: a lot of the more advanced functionality is dependent on having full control of the token decoding loop (i.e. an open-source model). This means black-box APIs don’t work as well with the more advanced capabilities.

Outlines

Outlines from .txt is another standalone solution that’s loved by the community. Like Guidance, it supports more advanced functionality like regex-based structured generation and CFGs (written in EBNF format) to guide generation. Other features include:

  • Structuring output based on a function signature (see code-snippet)

  • Following Pydantic Models

  • Following JSON schemas.
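The regex support mentioned above is similarly direct. A minimal sketch, assuming a local transformers model and the outlines.generate.regex API from the 0.x releases (the model and prompt here are illustrative):

import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

# Constrain the output to an ISO-8601 date and nothing else.
generator = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")
print(generator("When was the battle of Hastings? Answer with just the date: "))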

Here is how easy it is to call a tool using Outlines (it’s very Pythonic, which I love):

import outlines

def add(a: int, b: int):
    return a + b

model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")
generator = outlines.generate.json(model, add)
result = generator("Return json with two integers named a and b respectively. a is odd and b even.")

print(add(**result))
# 3

If you need something very specific then this is probably the library to go for.

Instructor

Built on top of Pydantic, Instructor is another solid open-source option. It patches the relevant API client, making it trivial to integrate into your existing code.

It’s especially relevant if you use a non-Python language as it supports quite a few languages out of the box (including TypeScript, Ruby, Go, and Elixir). It also has some useful quality of life features like:

  • Retry Management: configure the number of retry attempts for your requests

  • Streaming Support: works with Lists and Partial responses

  • Flexible Backends: integrate with various LLM providers beyond OpenAI

Here is how you constrain generation to a Pydantic model with Instructor:

import instructor
from pydantic import BaseModel
from openai import OpenAI

# Define your desired output structure
class UserInfo(BaseModel):
    name: str
    age: int

# Patch the OpenAI client
client = instructor.from_openai(OpenAI())

# Extract structured data from natural language
user_info = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserInfo,
    messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)

print(user_info.name)
#> John Doe
print(user_info.age)
#> 30
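The retry management listed above is just another keyword argument on the same call (a hedged sketch; instructor exposes this as max_retries and re-asks the model when Pydantic validation fails):

user_info = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserInfo,
    max_retries=3,  # re-ask the model up to 3 times if validation fails
    messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)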

Final Remarks

As you can see, there are a lot of different ways to achieve the same thing, which makes it very difficult to choose. In my opinion, the most important factor is which library is going to be supported the longest. Things can be pretty touch and go in the Gen AI world, and the last thing you want is for a library you’re depending on to stop receiving support, since that means it will probably break when newer models are released (so in about a week).