Why Built-in Tools Outperform Function Tools in LLMs
When you give an LLM a list of tools, two things can happen. Either the model recognizes the tool because it was part of its training data, or it encounters the tool for the first time and must figure out what to do from the name and schema alone. This distinction – whether a tool is in-distribution or out-of-distribution – has a measurable impact on how well the model uses it, and understanding why requires looking at how tool use is actually baked into these models during post-training.
The Two Types of Tool
Modern LLM APIs expose two fundamentally different kinds of tools:
Function tools are user-defined. You provide a name, a natural language description, and a JSON schema describing the parameters. The model generates a structured JSON call, your code executes it, and you pass the result back. This is the general-purpose mechanism that powers most agent frameworks.
```json
{
  "name": "execute_python",
  "description": "Executes Python code in a sandbox and returns stdout/stderr.",
  "input_schema": {
    "type": "object",
    "properties": {
      "code": { "type": "string", "description": "Python code to execute" }
    },
    "required": ["code"]
  }
}
```
Built-in tools are provider-defined and versioned. Anthropic ships code_execution_20250825, computer_20251124, text_editor_20250728, bash_20250124, and others. OpenAI ships code_interpreter, file_search, and web_search. You don’t provide a schema for these – you enable them by type identifier and the model already knows what to do.
```json
{
  "type": "code_execution_20250825",
  "name": "code_execution"
}
```
From an end-user perspective, a custom execute_python function tool and Anthropic’s built-in code_execution tool produce the same outcome: Python code gets written, executed in a sandbox, and the results come back. But the model will consistently perform better with the built-in version, and the reason is rooted in how LLMs learn to use tools in the first place.
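Concretely, the two approaches differ only in the request payload. A sketch of both request bodies is below; the model name and tool type strings are illustrative and version-dependent, so check your provider's current identifiers:

```python
# Two request bodies asking for the same capability.
# Model name and tool version strings are illustrative assumptions.
custom_tool_request = {
    "model": "claude-sonnet-4-5",  # assumed model identifier
    "max_tokens": 4096,
    "tools": [{
        "name": "execute_python",
        "description": "Executes Python code in a sandbox and returns stdout/stderr.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute"}
            },
            "required": ["code"],
        },
    }],
    "messages": [{"role": "user", "content": "Analyze /tmp/data.csv"}],
}

builtin_tool_request = {
    **custom_tool_request,
    # No schema here: the model already knows this tool from post-training.
    "tools": [{"type": "code_execution_20250825", "name": "code_execution"}],
}
```

The asymmetry is the point: the custom request must carry the full schema in every call, while the built-in request carries only a type identifier.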
How Models Learn Tool Use
Tool-calling ability is not something that emerges from pretraining alone. It is explicitly taught during post-training – the supervised fine-tuning (SFT) and reinforcement learning (RLHF/RLAIF/DPO) stages that happen after the base model is trained on internet text.
The training pipeline for tool use typically involves:
Synthetic data generation: Function definitions and their invocations are extracted from code corpora. LLMs then generate natural language queries where those function calls would be the correct response, producing (query, tool_call) training pairs.
Multi-turn conversation synthesis: Entire conversations are generated where the model must decide when to call a tool, interpret the result, and continue reasoning. These include examples of chaining multiple tool calls, handling errors, and recovering from failed executions.
Special token training: The model learns special tokens that gate its behavior. Research has shown that decision tokens like <|use_tool|> and <|answer|> improve tool relevance detection from ~50% to ~65% (arXiv:2412.01130). These tokens act as a learned classifier: before generating content, the model first decides whether it should be producing a tool call or a direct answer.
Reinforcement learning: The model is rewarded for correct tool invocations and penalized for hallucinated calls, malformed schemas, or unnecessary tool use. This shapes not just what the model generates, but when and how it chooses to invoke tools.
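The decision-token setup can be illustrated with a toy formatter for SFT pairs. The token strings follow the paper cited above; the surrounding target format is a simplified assumption, not the paper's actual chat template:

```python
import json

USE_TOOL, ANSWER = "<|use_tool|>", "<|answer|>"

def format_target(tool_call=None, answer=None):
    """Build the assistant-side training target: the model first emits a
    decision token, then either a JSON tool call or a direct answer."""
    if tool_call is not None:
        return USE_TOOL + json.dumps(tool_call)
    return ANSWER + answer

# (query, target) pairs as they might appear in SFT data
pairs = [
    ("What is 17 * 23?",
     format_target(tool_call={"name": "execute_python",
                              "arguments": {"code": "print(17 * 23)"}})),
    ("What is the capital of France?",
     format_target(answer="Paris")),
]
```

Because the decision token always comes first, the model's choice between tool call and direct answer is reduced to a single learned token prediction.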
The result is a model that has deeply internalized specific tool invocation patterns – the exact JSON structures, the expected response formats, the iterative execution-feedback-refinement loop – as part of its weight distribution.
The Distribution Gap
Here’s the key insight: built-in tools were part of this post-training data. Custom function tools were not.
When a model sees code_execution_20250825 in its tool list, it activates pathways that were reinforced thousands of times during training. The model knows:
- The exact output format (e.g., server_tool_use blocks with bash_code_execution sub-tools)
- When to write code versus give a direct answer
- How to handle execution errors and iterate
- The sandbox’s capabilities and limitations
- Patterns for data analysis, visualization, and computation
This is in-distribution behavior. The token sequences the model needs to generate are ones it has seen and been rewarded for producing many times before. The model’s internal representations have been specifically shaped to handle these exact patterns.
When the same model sees a custom execute_python function tool, it must rely on in-context learning – parsing the name, description, and JSON schema at inference time to figure out what the tool does and how to use it. This is out-of-distribution in the sense that while the model has general training on how to call any function tool from a schema, it has no specific training on this particular tool’s semantics, edge cases, or optimal usage patterns.
The gap is analogous to the difference between a trained mechanic using their own tools versus reading the manual for an unfamiliar tool that does the same thing. Both work. One is reliably better.
Where the Performance Gap Manifests
The difference between in-distribution and out-of-distribution tool use shows up in several concrete ways:
Schema Interpretation vs. Embedded Knowledge
For built-in tools, the schema is literally in the model weights. Anthropic’s documentation states that for tools like computer use, “the schema is built into Claude’s model and can’t be modified.” The model doesn’t need to parse anything at inference time – it already knows the parameters, their types, their valid ranges, and their interactions.
For function tools, the model must interpret a JSON schema from the context window. This interpretation is a form of reasoning that consumes attention, can be influenced by ambiguous descriptions, and degrades as the number of tools increases. Research from Anthropic’s engineering team showed that with 50+ tools, function tool accuracy drops to 49%, prompting them to develop a “tool search” mechanism that reduced token consumption by 85% and improved accuracy to 74%.
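The idea behind a tool-search mechanism can be sketched as a toy: instead of placing all schemas in context, expose one search tool and surface only the matching definitions on demand. The keyword scoring below is a stand-in for whatever retrieval the real mechanism uses:

```python
# Toy tool search: only definitions relevant to the query enter the context.
# Registry contents and the keyword-overlap ranking are illustrative.
TOOL_REGISTRY = {
    "get_weather": "Return the current weather for a given city.",
    "send_email": "Send an email message to a recipient.",
    "query_db": "Run a read-only SQL query against the analytics database.",
    # ...imagine dozens more entries here
}

def search_tools(query: str, top_k: int = 3) -> list[str]:
    """Rank registered tools by naive keyword overlap with the query."""
    words = set(query.lower().split())
    scored = sorted(
        TOOL_REGISTRY,
        key=lambda name: -len(words & set(TOOL_REGISTRY[name].lower().split())),
    )
    return scored[:top_k]
```

Only the top-ranked schemas are then attached to the next model call, which is how such a scheme cuts token consumption while improving accuracy.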
Error Recovery and Iteration
Built-in tools benefit from trained error-handling patterns. The model has seen thousands of examples during post-training where code execution failed, a timeout occurred, or output exceeded limits, and it learned specific recovery strategies for each. With Claude’s server-side code execution, the model can iterate multiple times within a single API call – executing code, observing errors, and retrying – all using deeply trained behavioral patterns.
A custom function tool gets none of this. The model must infer error semantics from whatever string your tool returns, reason about what went wrong using general knowledge, and decide on a recovery strategy from first principles. This works, but it requires more reasoning tokens and is more prone to giving up prematurely or retrying the same failed approach.
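One mitigation is to return structured, explicit errors from a custom tool instead of raw strings, so the model has less to infer. A sketch, where the field names (`status`, `hint`) are my own convention rather than any API standard:

```python
import json
import subprocess
import sys

def execute_python(code: str, timeout: int = 30) -> str:
    """Run code in a subprocess and return a structured JSON result,
    so the model parses an explicit status instead of guessing what a
    raw traceback string means."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        status = "ok" if proc.returncode == 0 else "error"
        return json.dumps({
            "status": status,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            "hint": None if status == "ok" else "Fix the code and call the tool again.",
        })
    except subprocess.TimeoutExpired:
        return json.dumps({
            "status": "timeout",
            "hint": f"Execution exceeded {timeout}s; try a smaller workload.",
        })
```

This does not close the distribution gap, but it converts "infer error semantics from an arbitrary string" into "branch on an explicit status field", which is a far easier in-context task.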
Code Quality and Idiom
This is perhaps the most subtle effect. When a model knows it’s writing code for the built-in code execution tool, it generates code in patterns it was trained on – patterns that were specifically selected and reinforced during post-training for correctness, efficiency, and completeness. It knows the exact Python environment available, what packages are installed, and how to structure output for display.
With a custom tool, the model must make assumptions about the execution environment. It may not know the Python version, available libraries, memory limits, or timeout constraints. These uncertainties manifest as more conservative code (excessive try/except blocks, redundant checks) or, conversely, as overly optimistic code that fails in the actual environment.
A Concrete Example
Consider asking the model to analyze a CSV file. With the built-in code execution tool, the model might generate:
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('/tmp/data.csv')
print(df.describe())
print(f"\nShape: {df.shape}")
print(f"\nMissing values:\n{df.isnull().sum()}")

numeric = df.select_dtypes(include='number')
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
numeric.plot(kind='hist', alpha=0.5, ax=axes[0], title='Distributions')
im = axes[1].imshow(numeric.corr(), cmap='coolwarm', vmin=-1, vmax=1)
axes[1].set_title('Correlations')
fig.colorbar(im, ax=axes[1])
plt.tight_layout()
plt.savefig('/tmp/analysis.png')
plt.show()
```
The model knows pandas and matplotlib are available, knows /tmp is writable, knows the output will be captured and displayed, and knows it can generate images that will be rendered. It writes this code with confidence because it has been trained on exactly this pattern.
Now consider the same task with a custom execute_python function tool that accepts a code string parameter. The model doesn’t know:
- Is pandas installed?
- Where can it write files?
- Will plt.show() work or hang?
- How is stdout captured?
- What's the timeout?
These unknowns lead to hedge-filled code, excessive error handling, or trial-and-error iterations that consume extra tokens and latency.
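The hedge-filled style looks something like the sketch below — a plausible reconstruction of defensive model output, not a transcript from any particular model:

```python
# Defensive code a model might write when the sandbox is unknown.
import os
import tempfile

def safe_analyze(path: str) -> str:
    """Analyze a CSV while trusting nothing about the environment."""
    try:
        import pandas as pd  # is pandas even installed here?
    except ImportError:
        return "pandas unavailable; cannot analyze."
    if not os.path.exists(path):
        return f"{path} not found; nothing to analyze."
    return str(pd.read_csv(path).describe())

# Don't assume /tmp exists or is writable; ask the platform instead.
print(safe_analyze(os.path.join(tempfile.gettempdir(), "data.csv")))
# No plotting at all: plt.show() might hang a headless executor.
```

Every guard here is rational given the model's uncertainty, but each one costs tokens and trims capability (no plots, no file output) relative to the confident built-in-tool version above.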
The Training Format Matters
Research on function-calling training provides direct evidence that the exact format used during training affects performance. The arXiv paper Enhancing Function-Calling Capabilities in LLMs (December 2024) found:
- Models trained with function definitions in a dedicated “tools” role achieved 49.58% relevance detection accuracy, compared to 39.58% when tools were embedded in the system prompt – a 25% relative improvement from format alone.
- Adding decision tokens (<|use_tool|> vs <|answer|>) further improved relevance detection to 65.42%.
- Removing instruction-following data from training caused AST accuracy to drop from 85.25% to 74.62%, showing that the model's general instruction-following capability directly impacts tool-use quality.
The implication is clear: the model’s performance with any tool is a function of how closely the inference-time format matches the training-time format. Built-in tools are a perfect match by definition. Custom tools are a partial match – they use the general function-calling format the model was trained on, but the specific tool semantics are novel.
Practical Implications
None of this means custom function tools are bad. They’re the backbone of LLM agent architectures and they work well for most use cases. But the performance gap has practical implications:
Use built-in tools when they exist for your use case. If you need code execution, prefer code_execution over a custom execute_python. If you need web search, prefer the built-in web_search tool. The performance difference is not marginal – it compounds across multi-step tasks where each tool call’s quality affects downstream reasoning.
When building custom tools, minimize the distribution gap. Write clear, specific descriptions. Use parameter names and types that align with common patterns in the training data. Include examples in the description if the tool has non-obvious behavior. Essentially, make it as easy as possible for the model’s in-context learning to approximate in-distribution behavior.
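In practice, minimizing the gap means putting the environmental facts the model would otherwise have to guess directly into the tool definition. A hypothetical example, with every environmental detail invented for illustration:

```python
# A custom tool definition that documents the execution environment
# explicitly, so the model need not guess. All details are illustrative.
EXECUTE_PYTHON_TOOL = {
    "name": "execute_python",
    "description": (
        "Execute Python 3.12 code in an ephemeral sandbox. "
        "Preinstalled: pandas, numpy, matplotlib (Agg backend; use "
        "savefig, not show). Writable directory: /tmp. Timeout: 30s. "
        "stdout and stderr are returned verbatim. "
        'Example input: {"code": "print(1 + 1)"}'
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "code": {"type": "string", "description": "Python source to run"}
        },
        "required": ["code"],
    },
}
```

Each stated fact removes one of the unknowns listed earlier, replacing defensive guesswork with the same kind of environmental certainty a built-in tool gets from training.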
Expect the gap to narrow over time. As models improve at in-context learning and tool-use training data grows, the relative advantage of built-in tools may shrink. But it will likely never disappear entirely, because post-training will always be able to optimize for known tools in ways that in-context learning cannot.
Conclusion
The distinction between built-in and function tools is not just an API design choice – it reflects a fundamental aspect of how LLMs work. Post-training creates strong distributional priors for specific tool patterns, and built-in tools benefit from being exactly in-distribution with those priors. Custom function tools rely on the model’s generalization ability, which is impressive but inherently less reliable than trained behavior.
When you’re choosing between functionally equivalent tools – one built-in, one custom – choose the built-in one. The model is, quite literally, trained for it.