Function Calling Internals: Grammars and Constrained Sampling
In the previous posts in this series, we established that when you give an LLM a list of function tools, the model must interpret JSON schemas at inference time and produce structured output that conforms to them. We showed that built-in tools outperform function tools because they’re in-distribution, and we traced the token-level mechanics of how tool calls actually fire. But we glossed over something fundamental: how does the model produce valid JSON in the first place? When you send a function definition with a specific schema, what ensures the model’s output actually conforms to it?
The answer involves three concepts that connect in a pipeline: sampling, formal grammars, and constrained decoding. Understanding how they fit together gives you a precise picture of what happens between the moment a model computes its next-token probabilities and the moment a valid tool_use block appears in the API response.
How a Model Picks a Token
An LLM generates text one token at a time. At each step, the model’s final layer produces a vector of raw scores called logits – one score per token in the vocabulary. GPT-4o’s vocabulary has ~200,000 tokens; Claude’s has ~150,000. So at every generation step, the model outputs a vector of 200,000 numbers, each representing how strongly it “wants” to produce that token next.
These logits are not probabilities. They can be negative, they can exceed 1, and they don’t sum to anything meaningful. To convert them into a probability distribution, the inference engine applies a pipeline:
Step 1: Temperature scaling. Each logit is divided by the temperature parameter $T$:
$$z_i' = \frac{z_i}{T}$$

Temperature controls the shape of the distribution. At $T = 1$, the logits pass through unchanged. At $T < 1$ (say, 0.2), the distribution sharpens – the gap between high and low scores widens, making the model more deterministic. At $T > 1$, the distribution flattens – lower-probability tokens get a larger relative share, making the output more random.
Step 2: Softmax. The scaled logits are converted to probabilities:
$$P(x_i) = \frac{e^{z_i'}}{\sum_{j=1}^{|V|} e^{z_j'}}$$

where $|V|$ is the vocabulary size. After softmax, every value is between 0 and 1, and they sum to exactly 1. This is now a valid probability distribution over the vocabulary.
Step 3: Filtering. Two common filters narrow the candidate set before selection:
- Top-K: Keep only the $K$ highest-probability tokens, zero out everything else, and renormalize. If $K = 50$, the model only considers its 50 best guesses.
- Top-P (nucleus sampling): Starting from the highest-probability token, accumulate tokens until the cumulative probability exceeds threshold $p$ (e.g., 0.95). Only those tokens are candidates. Unlike Top-K, the number of candidates varies – when the model is confident, fewer tokens qualify; when it’s uncertain, more do.
Step 4: Sampling. A token is randomly drawn from the filtered distribution. Each candidate’s probability determines its chance of selection. This is multinomial sampling – the same mechanism as rolling a weighted die.
The selected token is appended to the sequence, and the entire process repeats for the next position. This is autoregressive generation: each token depends on all previous tokens.
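The four steps can be sketched in plain Python over a toy vocabulary. This is a minimal illustration, not any engine's actual implementation – the function name and the tiny four-token vocabulary are invented, and real inference engines do all of this over GPU tensors with ~200,000 entries:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Toy sampling pipeline: temperature -> softmax -> filter -> draw."""
    # Step 1: temperature scaling.
    scaled = {tok: z / temperature for tok, z in logits.items()}
    # Step 2: softmax (subtract the max for numerical stability).
    m = max(scaled.values())
    exps = {tok: math.exp(z - m) for tok, z in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)
    # Step 3a: top-k keeps only the K best guesses.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Step 3b: top-p keeps the smallest prefix whose cumulative mass reaches p.
    if top_p is not None:
        kept, cum = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:
                break
        ranked = kept
    # Renormalize over the surviving candidates.
    mass = sum(p for _, p in ranked)
    dist = {tok: p / mass for tok, p in ranked}
    # Step 4: multinomial sampling -- rolling a weighted die.
    r, acc = rng.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r <= acc:
            return tok, dist
    return tok, dist  # fallback for float rounding: last candidate

# Tiny invented vocabulary; a real model scores the full vocabulary here.
logits = {"{": 4.0, "The": 2.5, "Sure": 1.0, "<END>": -1.0}
token, dist = sample_next_token(logits, temperature=0.7, top_k=3)
```

Note that nothing in this function knows anything about JSON: the weighted die can land anywhere the filters allow.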
For most conversational use cases, this pipeline (temperature + top-p + sampling) is what’s running. It produces diverse, natural-sounding text. But it has a problem: nothing in this pipeline guarantees that the output will be valid JSON, match a schema, or conform to any structure at all. The model might produce {"city": "Tokyo"}, or it might produce {"city": "Tok followed by a stray newline and some prose. The sampling process is probabilistic – it respects the model’s learned preferences, but it doesn’t enforce structural rules.
This is where grammars come in.
What Is a Formal Grammar?
A formal grammar is a set of rules that defines which strings belong to a language. Not a natural language like English – a formal language, which is any set of strings over some alphabet. JSON is a formal language. So is Python, XML, regular expressions, and the set of all valid email addresses.
A grammar specifies the language through production rules that describe how to build valid strings from smaller pieces. The most common notation for writing grammars is EBNF (Extended Backus-Naur Form), which looks like this:
json ::= object | array
object ::= "{" (pair ("," pair)*)? "}"
pair ::= string ":" value
array ::= "[" (value ("," value)*)? "]"
value ::= string | number | object | array | "true" | "false" | "null"
string ::= '"' character* '"'
The rules have two kinds of symbols:
- Non-terminals (like json, object, pair) are variables that expand into other symbols. They represent structural concepts.
- Terminals (like "{", ",", "true") are literal characters or strings that appear in the final output.
The grammar above is a context-free grammar (CFG) – each rule has a single non-terminal on the left that can be expanded regardless of its surrounding context. CFGs are powerful enough to describe recursive, nested structures like JSON objects containing arrays containing objects. This is the level of expressiveness needed for function calling schemas.
There’s a hierarchy here. Regular grammars (which correspond to regular expressions and finite automata) can describe flat patterns like phone numbers or email formats, but they can’t handle arbitrary nesting. Context-free grammars (which correspond to pushdown automata – finite automata plus a stack) can handle nesting and recursion, which is why they’re the standard formalism for programming languages and structured data formats. JSON, with its arbitrarily nested objects and arrays, requires a context-free grammar.
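A quick way to feel the difference between the two levels, in stdlib Python (the helper names are invented for illustration): a flat pattern like a number needs only a regular expression, while checking arbitrary nesting requires remembering how deep you are – the "stack" that separates pushdown automata from finite ones.

```python
import re

def looks_like_number(s: str) -> bool:
    # A flat pattern: a regular expression (finite automaton) suffices.
    return re.fullmatch(r"-?[0-9]+(\.[0-9]+)?", s) is not None

def brackets_balanced(s: str) -> bool:
    # Arbitrary nesting needs memory of the current depth. A single counter
    # is enough here because we track only one bracket kind; mixed brackets
    # (as in real JSON) need a full stack.
    depth = 0
    for ch in s:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                return False  # closed a bracket that was never opened
    return depth == 0
```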
Lark: A Practical Grammar Format
Lark is a Python parsing library that implements context-free grammar parsing. Its grammar format has become a de facto standard in the LLM constrained-decoding ecosystem – Microsoft’s llguidance library, which powers grammar-constrained generation in several major serving frameworks, uses a variant of Lark’s syntax.
A Lark grammar has two building blocks:
Rules (lowercase names) define structure:
start: value
value: object | array | STRING | NUMBER | "true" | "false" | "null"
object: "{" [pair ("," pair)*] "}"
pair: STRING ":" value
array: "[" [value ("," value)*] "]"
Terminals (UPPERCASE names) define the alphabet – the actual characters that appear in the output:
STRING: "\"" /[^"\\]*/ "\""
NUMBER: /-?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?/
%ignore /\s+/
Terminals can use regular expressions (enclosed in /), string literals (enclosed in "), and character ranges. The %ignore directive tells the parser to skip whitespace.
Lark supports three parsing algorithms: Earley (handles any CFG, including ambiguous ones), LALR(1) (faster, handles a deterministic subset), and CYK. For constrained decoding, Earley with dynamic lexing is the typical choice because it handles the full range of grammars that might arise from arbitrary JSON schemas.
The reason Lark’s format matters for LLMs is that it provides a precise, machine-readable way to specify what the model is allowed to generate. A JSON schema like:
{
"type": "object",
"properties": {
"city": { "type": "string" },
"unit": { "enum": ["celsius", "fahrenheit"] }
},
"required": ["city"]
}
can be mechanically converted into a Lark grammar that describes exactly the set of valid JSON strings matching that schema. The unit field, for example, becomes a rule that only allows the literals "celsius" or "fahrenheit" – not any arbitrary string. The required constraint ensures the grammar only accepts objects containing a "city" key.
This conversion from JSON Schema to CFG is the first step in the constrained decoding pipeline.
From JSON Schema to Grammar
When you send a function tool definition to the API, the provider’s infrastructure converts the JSON schema into a context-free grammar. The conversion is mechanical:
- "type": "string" becomes a rule that matches any valid JSON string
- "type": "number" becomes a rule that matches JSON number literals
- "type": "object" with "properties" becomes a rule that matches an object with exactly those keys and value types
- "enum": [...] becomes an alternation rule listing only the allowed literals
- "required": [...] restricts which property combinations the grammar accepts
- Nested objects and arrays produce recursive grammar rules
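To show how mechanical the conversion is, here is a toy converter for a tiny subset of JSON Schema. The function name schema_to_grammar and the rule-naming scheme are invented for illustration – production converters cover far more of the spec (arrays, nested objects in any order, string formats, and so on):

```python
import json

def schema_to_grammar(schema, name="start", rules=None):
    """Toy JSON Schema -> CFG converter: string, number, enum, and flat
    objects with required keys. A sketch, not a real implementation."""
    if rules is None:
        rules = {}
    if "enum" in schema:
        # "enum": [...] -> alternation listing only the allowed literals.
        rules[name] = " | ".join(json.dumps(json.dumps(v)) for v in schema["enum"])
    elif schema.get("type") == "string":
        rules[name] = "STRING"
    elif schema.get("type") == "number":
        rules[name] = "NUMBER"
    elif schema.get("type") == "object":
        required, optional = [], []
        for key, sub in schema.get("properties", {}).items():
            sub_name = f"{name}_{key}"
            schema_to_grammar(sub, sub_name, rules)  # nesting -> recursion
            pair = f'{json.dumps(json.dumps(key))} ":" {sub_name}'
            (required if key in schema.get("required", []) else optional).append(pair)
        # Required pairs joined by commas; each optional pair gets its own
        # optional comma group, mirroring the maybe_unit trick below.
        body = ' "," '.join(required) + "".join(f' ("," {p})?' for p in optional)
        rules[name] = '"{" ' + body + ' "}"'
    return rules

weather_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}
grammar = schema_to_grammar(weather_schema)
```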
For the get_weather function schema above, the resulting grammar (expressed in a Lark-like notation) would look roughly like:
start: "{" ws pair_city ws maybe_unit "}"
pair_city: "\"city\"" ws ":" ws STRING
maybe_unit: ("," ws pair_unit)?
pair_unit: "\"unit\"" ws ":" ws UNIT_ENUM
UNIT_ENUM: "\"celsius\"" | "\"fahrenheit\""
STRING: "\"" /[^"\\]*/ "\""
ws: /\s*/
This grammar accepts {"city": "Tokyo"} and {"city": "Tokyo", "unit": "celsius"} but rejects {"city": 42} (wrong type), {"unit": "kelvin"} (not in enum), and {"temperature": 22} (wrong property name). The grammar is a precise specification of the schema’s valid outputs.
OpenAI described this conversion in their Structured Outputs announcement (August 2024): for each JSON schema, they compute a context-free grammar that represents the set of all valid strings conforming to the schema. This grammar is preprocessed and cached – the first request with a new schema incurs conversion latency, but subsequent requests reuse the cache.
Constrained Decoding
Here’s where it all connects. Constrained decoding (also called constrained sampling or grammar-guided decoding) modifies the sampling pipeline so that only tokens that maintain compliance with the grammar can be selected. Instead of hoping the model produces valid output, constrained decoding guarantees it.
The mechanism inserts a step between logit computation and sampling:
- The model produces its logit vector (200,000 scores).
- The grammar engine inspects the tokens generated so far and determines which vocabulary tokens are valid next, given the current position in the grammar.
- Every invalid token has its logit set to $-\infty$.
- Temperature scaling, softmax, and top-k/top-p filtering proceed over the remaining valid tokens.
- A token is sampled from this constrained distribution.
The effect is straightforward: at every step, the model can only pick tokens that keep the output on a path toward a complete, valid string in the grammar’s language. If the model has generated {"city": " so far, the grammar allows any token that continues a valid JSON string. If it’s generated {"city": "Tokyo", "unit": , the grammar only allows "celsius" or "fahrenheit" (or their constituent tokens). If it’s generated a complete valid object, the grammar allows only the closing }.
The constrained probability distribution is a renormalization of the original:
$$P_{\text{constrained}}(x_t) = \frac{P(x_t)}{\sum_{x \in \mathcal{V}_{\text{valid}}} P(x)}$$

where $\mathcal{V}_{\text{valid}}$ is the set of tokens that the grammar permits at position $t$. This preserves the model’s relative preferences among valid tokens. If the model thought “Tokyo” was twice as likely as “Osaka” before constraining, it’s still twice as likely after. The grammar only removes options that would violate the schema – it doesn’t change the model’s ranking of the remaining options.
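A minimal sketch of the masking step, assuming a toy vocabulary of whole-word tokens. The helper constrained_softmax is invented for illustration; real engines apply the mask directly to the logit tensor on the GPU:

```python
import math

def constrained_softmax(logits, valid):
    """Set forbidden tokens to -inf, then softmax over what remains."""
    masked = {t: (z if t in valid else float("-inf")) for t, z in logits.items()}
    finite = {t: z for t, z in masked.items() if z != float("-inf")}
    m = max(finite.values())  # subtract max for numerical stability
    exps = {t: math.exp(z - m) for t, z in finite.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# The model prefers "kelvin", but the enum only allows celsius/fahrenheit.
logits = {'"celsius"': 2.0, '"fahrenheit"': 1.0, '"kelvin"': 3.0}
probs = constrained_softmax(logits, valid={'"celsius"', '"fahrenheit"'})
# "kelvin" is gone; celsius vs fahrenheit keeps the ratio e^2 : e^1.
```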
The Subword Problem
There’s a subtlety that makes constrained decoding harder than it sounds. LLM tokenizers use subword tokenization (BPE), not character-level tokenization. The string "celsius" is not a single token – it might be tokenized as ["\"", "c", "els", "ius", "\""] or some other split depending on the tokenizer. A single token might span a grammar boundary (containing both the end of one field and the start of another), and a single grammar terminal might require multiple tokens to complete.
This means the grammar engine can’t simply check “is this token a valid terminal?” It must track partial matches – knowing that after generating "c", the tokens "els" and "elsius" are valid continuations but "ity" is not (because "city" is not in the enum). This requires the grammar engine to maintain parser state across token boundaries, which is where the choice of parsing algorithm matters.
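A toy version of that partial-match tracking, using the enum example from the schema above (valid_next_tokens is an invented helper; real engines also have to handle tokens that span grammar boundaries, which this sketch ignores):

```python
def valid_next_tokens(generated, literals, vocab):
    """Which vocabulary tokens can extend `generated` toward one of
    `literals`? A token is valid if appending it still leaves a prefix
    of some allowed literal. Toy version of a grammar engine's
    partial-match tracking across subword token boundaries."""
    valid = set()
    for tok in vocab:
        candidate = generated + tok
        if any(lit.startswith(candidate) for lit in literals):
            valid.add(tok)
    return valid

# After generating "c", which subword tokens may follow?
vocab = ["els", "elsius", "ius", "ity", "ahrenheit"]
allowed = valid_next_tokens("c", ["celsius", "fahrenheit"], vocab)
# "els" and "elsius" survive; "ity" is rejected ("city" is not in the enum).
```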
Two Implementation Approaches
Finite-state machines (FSMs) work for flat schemas. Libraries like Outlines compile regular expressions or simple schemas into deterministic finite automata (DFAs). For each DFA state, the set of valid next tokens can be precomputed – making the per-token cost $O(1)$ lookup. The limitation is that DFAs cannot handle recursion or arbitrary nesting, so they fall short for schemas with nested objects.
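The precomputation trick can be sketched for the simplest possible "grammar" – a fixed set of literal strings. The function names are invented for illustration, and Outlines actually works over regex-derived DFAs rather than literal lists, but the idea is the same: every mask is computed once, so the per-token cost at generation time is a dictionary lookup.

```python
def dfa_states(literals):
    """States of a toy DFA over a set of literals: every prefix of a literal."""
    return {lit[:i] for lit in literals for i in range(len(lit) + 1)}

def precompute_masks(literals, vocab):
    """For each DFA state, compute the valid-token set ahead of time.
    At generation time, masking is then an O(1) lookup per token."""
    masks = {}
    for state in dfa_states(literals):
        masks[state] = {t for t in vocab
                        if any(lit.startswith(state + t) for lit in literals)}
    return masks

masks = precompute_masks(["celsius", "fahrenheit"],
                         ["c", "f", "els", "elsius", "ity", "ahrenheit"])
```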
Pushdown automata handle the full power of context-free grammars. Libraries like XGrammar (used by vLLM and SGLang) and Microsoft’s llguidance use CFG parsers that maintain a stack to track nesting depth. The token mask cannot be fully precomputed (because the stack state is dynamic), but XGrammar found a powerful optimization: roughly 99% of vocabulary tokens are context-independent – their validity depends only on the grammar position, not the stack contents. Only ~1% of tokens need runtime stack inspection. By precomputing the context-independent masks and only dynamically checking the rest, XGrammar achieves token mask generation in under 40 microseconds per token, which is negligible compared to the model’s own inference latency.
Effectiveness
The combination of model fine-tuning and constrained decoding is remarkably effective. OpenAI reported that their model (GPT-4o) achieved 93% schema conformance from fine-tuning alone. Adding constrained decoding on top brought it to 100%. The model’s training gets it most of the way – it learns what valid JSON looks like, what schemas mean, and how to produce conforming output. Constrained decoding handles the remaining 7% where the model would otherwise produce an off-by-one bracket, a trailing comma, or an enum value that’s close but not exact.
The Distribution Distortion Problem
Constrained decoding guarantees structural correctness, but it introduces a subtle issue: masking tokens and renormalizing can distort the model’s probability distribution in ways that affect output quality.
Consider a simple example. Suppose the model’s unconstrained distribution assigns:
- 40% probability to a token that would produce valid JSON
- 30% probability to a token that’s grammatically invalid but semantically related
- 30% probability spread across other tokens
After masking the invalid token and renormalizing, the first token gets $\approx 57\%$ probability ($\frac{40}{70}$) instead of 40%. That’s fine for a single step. But these distortions compound across hundreds of tokens. The resulting distribution over complete strings can diverge significantly from what the model would have produced if it had only been “thinking in” valid strings from the start.
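The single-step arithmetic is easy to verify directly:

```python
# The toy distribution from the example above.
p = {"valid": 0.40, "invalid": 0.30, "other": 0.30}
surviving = {tok: q for tok, q in p.items() if tok != "invalid"}
mass = sum(surviving.values())                     # 0.70
constrained = {tok: q / mass for tok, q in surviving.items()}
# "valid" jumps from 40% to ~57% (0.40 / 0.70) after masking and renormalizing.
```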
Researchers have formalized this as the gap between grammar-constrained decoding (GCD) and the model’s true conditional distribution over grammatical strings. One proposed solution, Grammar-Aligned Decoding (NeurIPS 2024), uses an adaptive sampling algorithm that produces outputs that are both grammatically valid and faithful to the model’s actual preferences. The outputs are structurally correct but also semantically coherent – the grammar doesn’t accidentally steer the model toward low-quality completions that happen to be valid.
In practice, this distortion is most noticeable with complex schemas where many tokens get masked at each step. For simple function-call schemas with a handful of properties and basic types, the effect is minimal. But it’s worth understanding that constrained decoding is not a free lunch – it trades sampling fidelity for structural guarantees.
The Full Pipeline
Putting it all together, here’s what happens end-to-end when you send a function tool definition and the model decides to call it:
1. Schema conversion. Your JSON schema is converted to a context-free grammar. This grammar is preprocessed and cached.
2. Prompt construction. The tool definitions are serialized into the model’s prompt format – whether that’s Harmony tokens, XML-wrapped function definitions, or another provider-specific format. The model sees both the semantic description (so it knows what the tool does) and the structural specification (so the grammar engine knows how to constrain the output).
3. Decision to call. The model generates tokens freely until it decides to invoke a tool. This decision is influenced by training (the model learned when to use tools during post-training) and the prompt (the tool descriptions tell it what’s available). When the model emits the appropriate signal – a <|call|> token in Harmony, or an <invoke> tag in Claude’s format – the system knows a tool call is beginning.
4. Constrained generation. Once inside a tool call, the grammar engine activates. At each token position, it computes the set of valid next tokens given the grammar derived from the schema. Invalid tokens are masked. The model samples from the constrained distribution. This continues until the model produces a complete, valid JSON object that conforms to the schema.
5. Parsing. The generated output – guaranteed to be structurally valid – is parsed to extract the function name and arguments as a structured object.
6. Execution. The parsed tool call is returned to the developer (for function tools) or executed server-side (for built-in tools). The result flows back to the model, and generation continues.
The grammar is the bridge between the model’s probabilistic token generation and the deterministic structural requirements of a function call. Without it, you’d be relying entirely on the model’s training to produce valid output – which works most of the time, but not all of the time. With it, structural correctness is a hard guarantee, and the model’s training only needs to handle the semantic side: deciding which tool to call and what arguments to pass.
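The whole loop can be compressed into a toy end-to-end sketch. Everything here is invented for illustration – the two-string "language", the six-token vocabulary, and a stand-in logits function instead of a real model – but the control flow mirrors step 4 above: at each position, mask to the grammar-valid tokens, then let the model's scores pick among them.

```python
# The "language" of the grammar: all complete strings the schema allows.
ALLOWED = ['{"city": "Tokyo"}', '{"city": "Paris"}']
VOCAB = ['{"city": ', '"Tokyo"', '"Paris"', '"Osaka"', '}', 'The weather']

def toy_model_logits(prefix):
    # Stand-in for the LLM forward pass: fixed mild preferences.
    prefs = {'"Tokyo"': 2.0, '"Paris"': 1.0}
    return {t: prefs.get(t, 0.0) for t in VOCAB}

def valid_tokens(prefix):
    # Grammar engine: a token is valid if it keeps the output on a path
    # toward some complete string in the language.
    return [t for t in VOCAB if any(s.startswith(prefix + t) for s in ALLOWED)]

def constrained_generate():
    prefix = ""
    while prefix not in ALLOWED:
        allowed = valid_tokens(prefix)
        logits = toy_model_logits(prefix)
        # Mask, then pick (greedy here for determinism; sampling works
        # the same way over the constrained distribution).
        prefix += max(allowed, key=lambda t: logits[t])
    return prefix

result = constrained_generate()  # always a member of ALLOWED
```

Note that '"Osaka"' and 'The weather' can never be chosen, no matter how high the model scores them – the grammar removes them before the model's preferences are consulted.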
Conclusion
Function calling in LLMs is not magic, and it’s not just prompt engineering. It’s a pipeline where a JSON schema gets compiled into a formal grammar, and that grammar constrains the model’s sampling process so it can only produce valid output. The model still does the hard part – understanding the user’s intent, choosing the right tool, selecting the right argument values. But the structural scaffolding – the braces, the commas, the quotes, the property names, the type constraints – is enforced mechanically by the grammar engine at every token.
This is why function calls almost never return malformed JSON. It’s not because the model is that reliable. It’s because the sampling process won’t let it be unreliable.