The Tool Invocation Gap: From ChatML to the Responses API
In the first post of this series, we looked at how special tokens like <|im_start|> and <|im_end|> form the structural grammar of LLM conversations, and how the newer Harmony format extends this with tokens like <|call|> and <|return|> for tool invocations. In the second post, we established that built-in tools outperform function tools because they’re in-distribution – the model was trained on their exact invocation patterns during post-training, while custom function tools require the model to generalize from a schema it’s never seen before.
This raises a natural question: if tool calls are ultimately just special tokens in a sequence, can’t you just write the tokens yourself? The answer traces a path through three generations of OpenAI’s API surface, and it reveals something important about where tool execution is heading.
The Token-Level View of Tool Calls
At the token level, a tool call in OpenAI’s Harmony format looks like this:
```
<|start|>assistant<|channel|>commentary to=functions.get_weather
<|constrain|>json<|message|>{"city":"Tokyo"}<|call|>
```
The model emits <|call|> (token ID 200012) as a stop signal, analogous to how <|im_end|> signals the end of a message in ChatML. The tool result comes back in a structured frame:
```
<|start|>functions.get_weather to=assistant<|channel|>commentary
<|message|>{"temp":22}<|end|>
```
And the model continues:
```
<|start|>assistant<|channel|>final
<|message|>It's 22 degrees in Tokyo.<|return|>
```
This is not an abstraction. These are literal tokens in the model’s vocabulary – <|call|> is token 200012, <|return|> is 200002, <|channel|> is 200005. The model learned to emit them during post-training, and they’re what separate a tool-calling model from a text-completion model. Built-in tools like code_interpreter and web_search are invoked through this same token-level mechanism internally.
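You can check these IDs yourself: recent tiktoken releases ship an o200k_harmony encoding alongside the gpt-oss models. A minimal check, assuming your installed tiktoken version includes that encoding:

```python
import tiktoken

# Assumes a tiktoken release that ships the o200k_harmony encoding (added for gpt-oss)
enc = tiktoken.get_encoding("o200k_harmony")

for tok in ("<|channel|>", "<|return|>", "<|call|>"):
    # allowed_special="all" lets encode() return special-token IDs instead of raising
    print(tok, enc.encode(tok, allowed_special="all"))

# <|channel|> [200005]
# <|return|> [200002]
# <|call|> [200012]
```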
So in principle, if you could construct a token sequence that includes these special tokens, you could invoke tools the way the model was trained to – in-distribution, with no schema interpretation overhead. The question is whether any API lets you do that.
The Chat Completions Wall
The /chat/completions API is where most developers interact with OpenAI’s models. You send structured JSON – an array of message objects with role and content fields – and the API handles serialization to the model’s internal format. You never touch tokens directly.
For tool use, the API exposes a tools parameter. The documentation is explicit: “Currently, only function is supported” as a tool type. You define a name, a description, and a JSON schema:
```json
{
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {
          "city": { "type": "string" }
        },
        "required": ["city"]
      }
    }
  }]
}
```
This is the general-purpose tool mechanism. It works, and it’s the backbone of most agent frameworks. But as we discussed in the previous post, every function tool is out-of-distribution by definition – the model must interpret the schema at inference time using in-context learning rather than activating trained pathways.
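When the model decides to call a function tool, the response comes back as structured JSON too – roughly this shape (abridged; the id value here is illustrative):

```json
{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\":\"Tokyo\"}"
        }
      }]
    }
  }]
}
```

Under the hood, the model produced a token sequence to express this call; the API surfaces only the serialized JSON.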
There is no way to pass code_interpreter, web_search, file_search, or any other built-in tool through the /chat/completions endpoint. Those tool types simply don’t exist in its vocabulary. You can’t reference them, you can’t enable them, and you can’t construct a ChatML or Harmony payload that invokes them – the API accepts structured JSON, not raw token streams. The serialization boundary is absolute.
The implication is stark: the only tool mechanism available on the most widely-used OpenAI API endpoint is the one that’s out-of-distribution.
The Legacy Escape Hatch
There was, briefly, a way around this. The legacy /completions API – the original text completion endpoint – accepted the prompt parameter as a “string, array of strings, array of tokens, or array of token arrays” according to the API reference. That last option is the key: you could pass raw token IDs directly.
This meant you could, in theory, construct a raw ChatML sequence by hand:
```python
import tiktoken
import openai  # module-level client; reads OPENAI_API_KEY from the environment

enc = tiktoken.get_encoding("cl100k_base")

# Construct raw ChatML with special tokens as token IDs
prompt_tokens = [
    100264,  # <|im_start|>
    *enc.encode("system"),
    *enc.encode("\nYou are a helpful assistant."),
    100265,  # <|im_end|>
    100264,  # <|im_start|>
    *enc.encode("user"),
    *enc.encode("\nWhat's the weather in Tokyo?"),
    100265,  # <|im_end|>
    100264,  # <|im_start|>
    *enc.encode("assistant"),
]

# Pass token IDs directly to the API
response = openai.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_tokens,
)
```
By injecting special token IDs directly into the prompt, you bypassed the structured JSON layer entirely and spoke the model’s native language. In principle, you could have crafted Harmony-style tool invocation sequences this way – though in practice, the models available on this endpoint (gpt-3.5-turbo-instruct, davinci-002, babbage-002) predated Harmony and lacked the post-training for those tool tokens.
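As a thought experiment, a Harmony frame is just text plus special tokens, so tokenizing one is mechanical. A sketch using tiktoken's o200k_harmony encoding (assuming your tiktoken version ships it) – purely illustrative, since no model on the legacy endpoint was trained on these tokens:

```python
import tiktoken

# Illustrative only: the models on the legacy endpoint predate Harmony.
# Assumes a tiktoken release that ships the o200k_harmony encoding.
enc = tiktoken.get_encoding("o200k_harmony")

frame = (
    '<|start|>functions.get_weather to=assistant<|channel|>commentary'
    '<|message|>{"temp":22}<|end|>'
)
tokens = enc.encode(frame, allowed_special="all")
print(tokens)  # special tokens appear as literal IDs, e.g. 200005 for <|channel|>
```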
This endpoint received its final update in July 2023. No new models will be added. The window for raw token-level control over OpenAI models is closed.
The Responses API: Tools Move to the Middle Tier
In March 2025, OpenAI launched the Responses API as the successor to the Assistants API and, over time, the intended primary development surface in place of Chat Completions. It introduces a fundamentally different architecture for tool use.
The Responses API ships with a growing list of built-in tools:
| Tool | Description |
|---|---|
| web_search | Queries the internet and incorporates results into the response |
| file_search | Searches uploaded file contents for relevant context |
| code_interpreter | Executes code in a secure server-side container |
| image_generation | Generates or edits images using GPT Image |
| computer_use | Controls a computer interface for agentic workflows |
| shell | Runs shell commands in hosted or local environments |
| Remote MCP | Connects to external tools via Model Context Protocol |
To use one, you simply declare it by type:
```json
{
  "model": "gpt-4o",
  "tools": [{ "type": "web_search" }],
  "input": "What happened in the news today?"
}
```
The model decides whether to invoke the tool based on the input. But here’s what matters architecturally: the tool executes server-side, inside OpenAI’s infrastructure. When the model calls web_search, OpenAI’s servers perform the search, process the results, and feed them back to the model – all within a single API call. The client never sees the intermediate tool call or its result unless it inspects the response’s output items.
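In SDK terms, that looks roughly like the following – a minimal sketch, assuming an openai Python client recent enough to expose the Responses API:

```python
from openai import OpenAI

client = OpenAI()

# web_search runs server-side: the search, result processing, and synthesis
# all happen inside this single API call.
response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search"}],
    input="What happened in the news today?",
)

# The output items are the only visible trace of the middle tier –
# typically a web_search_call item followed by a message item.
for item in response.output:
    print(item.type)

print(response.output_text)  # the final synthesized answer
```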
OpenAI’s engineering blog is direct about the motivation: hosted tools execute server-side, ensuring “better latency and round-trip costs” compared to client-side implementations. The model can search the web, execute code, generate images, and access external services through MCP without round-tripping back to the developer’s backend.
This is the middle-tier pattern. Tool execution no longer happens at the edges (client-side function tools) or at the bottom (raw token sequences). It happens in the middle – between the model and the API response – managed entirely by OpenAI’s orchestration layer.
The Parity Problem
This architecture creates a fundamental asymmetry between what OpenAI can do with its models and what external developers can replicate. The gap has four layers:
1. No token-level access. The Chat Completions API serializes your JSON into the model’s token format behind a wall. You can’t inject Harmony tokens, you can’t construct <|call|> sequences, and you can’t trigger in-distribution tool pathways. The legacy Completions API that allowed raw token IDs is frozen. The only tool mechanism available on Chat Completions is function – which is, by design, out-of-distribution for every tool you define.
2. No server-side execution loop. When a built-in tool fires in the Responses API, the model-to-tool-to-model loop happens internally. The model calls code_interpreter, the code runs in a sandbox, the output flows back, the model reasons over it, potentially calls the tool again, and eventually returns a final response. This entire multi-turn execution cycle happens within a single API call. With function tools on Chat Completions, every tool call requires a round-trip to the client: the API returns a tool_calls response, your code executes the function, and you send the result back in a new request. Each round-trip adds latency and breaks the model's reasoning continuity (see the sketch after this list).
3. Training distribution mismatch. Built-in tools were part of post-training. The model was fine-tuned on thousands of examples of invoking code_interpreter with specific code patterns, handling execution errors, and iterating on results. It was trained on web_search queries and how to synthesize search results into coherent answers. These are not generic function calls – they're deeply trained behavioral patterns with specific token sequences the model has been rewarded for producing. A custom function tool that does the same thing relies on the model's general ability to interpret a schema, which, as research has shown, degrades as tool count increases and never reaches the reliability of trained behavior.
4. Internal capabilities. Built-in tools running server-side may have access to model internals that aren’t exposed through the API. The code_interpreter can stream intermediate results back to the model mid-execution. The web_search tool can inject results directly into the model’s context in the format it was trained on. These integration points exist inside the middle tier and have no equivalent in the client-side function tool protocol.
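To make the round-trip in point 2 concrete, here's what the client-side loop looks like with a function tool on Chat Completions – a minimal sketch, with a hypothetical get_weather standing in for whatever your function actually does:

```python
import json
from openai import OpenAI

client = OpenAI()

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Request 1: the model decides to call the tool, then stops.
first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]

# The round-trip: execution happens here, on the client, between two API calls.
city = json.loads(call.function.arguments)["city"]
result = {"city": city, "temp": 22}  # stand-in for a real weather lookup

# Request 2: send the result back so the model can finish its answer.
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
second = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(second.choices[0].message.content)
```

In the Responses API, this entire loop collapses into the middle tier: one request, no client-side execution, no re-serialization of the conversation.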
What This Means
The trajectory is clear. Tool execution is moving inward – from client-side function calls, past the API boundary, into OpenAI’s managed infrastructure. The Responses API is explicitly positioned as “the API we’ll be building on for years ahead.” The Assistants API is scheduled for sunset in August 2026. Chat Completions will remain supported, but new capabilities – built-in tools, server-side execution, reasoning persistence across turns – are landing on the Responses API first.
For developers, this creates a practical tension. The Chat Completions API is simple, well-understood, and portable across providers. But it’s frozen at the function-tool level – every tool is out-of-distribution, every execution requires a client round-trip, and the growing list of built-in capabilities in the Responses API has no equivalent. You can approximate web_search with a custom function tool that calls a search API, but you’ll never match the performance of the built-in version because the model wasn’t trained on your tool’s schema, your result format, or your execution semantics.
The story across these three posts follows a single thread: LLMs encode tool use as special tokens in a learned sequence grammar, and the models perform best when they can use the exact token patterns they were trained on. ChatML introduced the grammar. Harmony extended it to tools. The Chat Completions API hid it behind structured JSON. The legacy Completions API briefly exposed raw token access, then froze. And the Responses API moved the entire tool execution loop server-side, making the most capable tool patterns accessible only through OpenAI’s managed middle tier.
The special tokens are still there. <|call|> still fires when the model decides to use a tool. You just can’t see it anymore.