Why Agents Hallucinate Tool Calls (and How to Stop It)

Tool call hallucination comes in three flavors. Your model calls search_orders when your tool is named get_orders. It passes user_id when your parameter is customer_id. Or it invokes web_search with full confidence even though you never registered that tool. Each failure looks different in the logs, but they share a root cause: the model is doing exactly what it was trained to do, and your tool list doesn’t match the patterns it learned.

Understanding why requires looking at what “tool selection” actually means at the token level.

Tool Selection Is Token Prediction

There is no router, no lookup table, no dispatch mechanism hidden inside the model. When an LLM decides to call a tool, it is generating the next tokens in a sequence — autoregressively, one token at a time, each conditioned on everything before it.

The tool name, parameter names, and parameter values are all generated the same way: by sampling from a probability distribution over the full vocabulary. At the moment the model emits the first token of a tool name, it is doing a soft selection over every token in its vocabulary — including tokens that spell out tool names it learned during pretraining but which are not in your current tool list.

This is not a bug in the tool-calling protocol. It is a direct consequence of how these models learn. Toolformer (arXiv 2302.04761), an early paper on training models to generate tool calls in-line, made this explicit: candidate [Calculator(400/1400)]-style calls were kept in the training text only where their results reduced the cross-entropy loss on subsequent tokens, and the model was then fine-tuned on that text with an ordinary language-modeling objective. Tool calling was taught as a language modeling task, and it remains one at inference time.

The probability mass from pretraining on API documentation, code repositories, and tool-use examples is not zeroed out because a given tool isn’t in your schema. If the conversation context strongly resembles a “web search” task, the model’s weights assign non-trivial probability to tokens like web_search, brave_search, search_web — regardless of what you’ve registered. This is the mechanism behind every hallucinated tool call.
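A toy calculation makes the point concrete. The sketch below is an illustration under simplifying assumptions, not a measurement: it scores whole tool names rather than individual tokens, and the logit values are invented. It shows only that a softmax leaves probability on names outside the registered set unless something removes them.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(score - peak) for name, score in logits.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}

# Hypothetical scores for a conversation that looks like a web-search task.
logits = {
    "get_orders": 1.0,
    "search_customers": 1.5,
    "web_search": 2.0,    # learned in pretraining, never registered
    "brave_search": 0.5,  # learned in pretraining, never registered
}
registered = {"get_orders", "search_customers"}

probs = softmax(logits)
ghost_mass = sum(p for name, p in probs.items() if name not in registered)
print(f"probability of sampling an unregistered name: {ghost_mass:.2f}")
```

Nothing in the sampling step knows about the registered set. That knowledge has to come from the prompt, from decoding constraints, or from validation after the fact.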

The Three Failure Modes

1. Ghost Tool Invocations

A ghost tool invocation is when the model calls a tool that isn’t in your list at all. The name isn’t misspelled — it’s a completely different tool that the model learned about during training.

The Gorilla paper (arXiv 2305.15334, NeurIPS 2024) measured this directly. GPT-4 hallucinated 78.65% of API calls in zero-shot evaluations against HuggingFace’s tool catalog. The model wasn’t making minor mistakes — it was fabricating plausible-sounding API names from its training distribution instead of selecting from the provided list.

The problem is worst when the correct tool is absent entirely. Relign (arXiv 2412.04141) tested models with “Unmatched Tools” — scenarios where the right tool had been removed and replaced with irrelevant ones. The baseline hallucination rate jumped to 91.1%: when the model couldn’t find what it was looking for, it almost always invented something rather than reporting that it couldn’t proceed.

This connects directly to the built-in tools post: built-in tools like code_execution are in-distribution because the model was post-trained on their exact invocation patterns. Custom tools with different names are out-of-distribution, and the model must generalize from the schema alone. When the schema doesn’t match any trained pattern closely enough, the model falls back on patterns it does know — even if those tools don’t exist in the current context.

2. Parameter Hallucination

Even when the model selects the right tool, it may generate the wrong parameter names or values. SpecTool (arXiv 2411.13547) catalogs these into a precise taxonomy:

IAN (Incorrect Argument Name): the model hallucinates a parameter name not in the schema — user_id instead of customer_id, filepath instead of path
IAV (Incorrect Argument Value): wrong value or data transformation — passing 5 where the schema expects 0.05 (an annual rate as a percentage vs. a decimal)
IAT (Incorrect Argument Type): wrong type — a string where an integer is expected
IFN (Incorrect Function Name): a tool name that’s close but wrong — a variant or synonym

Parameter hallucination is a schema interpretation problem. The model generates parameter names and values as tokens conditioned on the tool name and the conversation context. If your parameter names don’t match the vocabulary patterns the model was trained on — or if the description is ambiguous about units, formats, or valid ranges — the model fills in what it expects based on similar tools it’s seen before.
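Two of these categories can be detected mechanically before a call runs. The following is a minimal sketch, assuming a simplified schema that maps parameter names to Python types; the function name check_arguments and the schema format are illustrative and not taken from SpecTool. IAV is deliberately absent, because a wrong value with the right type looks valid to this kind of check.

```python
def check_arguments(schema, arguments):
    """Return (code, detail) pairs for name and type errors in a tool call."""
    errors = []
    for name, value in arguments.items():
        if name not in schema:
            errors.append(("IAN", f"unknown parameter '{name}'"))
            continue
        expected = schema[name]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        wrong_type = not isinstance(value, expected) or (
            isinstance(value, bool) and expected is not bool
        )
        if wrong_type:
            errors.append(("IAT", f"'{name}' should be {expected.__name__}"))
    return errors

# Hypothetical schema for a get_orders tool.
SCHEMA = {"customer_id": str, "limit": int}

print(check_arguments(SCHEMA, {"user_id": "c_1", "limit": "10"}))
```

Returning the error codes to the model, rather than only logging them, gives it a concrete signal to retry with the correct name or type.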

When structured input is already available to the caller — a form submission, a parsed record, a prior API response — IAV and IAT can be eliminated entirely through delayed binding: letting the LLM select which parameters to populate while the caller binds the actual values independently.

The pattern separates two concerns that tool calling normally conflates. The LLM is good at semantic routing — recognizing that create_refund is the right tool and that order_id and amount are the relevant fields. It is unreliable at transcription — reproducing "ord_456" accurately or knowing that amount must be a decimal rather than a string. Delayed binding assigns each job to the part of the system best suited for it.

In practice, tool descriptions advertise the structured input fields the caller holds, and the LLM generates symbolic references rather than literal values:

// Caller holds this structured input (already validated, typed correctly)
{ "order_id": "ord_456", "amount": 49.99 }

// LLM generates this — selecting tool and fields, not values
{
  "tool": "create_refund",
  "parameters": {
    "order_id": "$input.order_id",
    "amount": "$input.amount"
  }
}

// Caller resolves references before execution
{
  "tool": "create_refund",
  "parameters": { "order_id": "ord_456", "amount": 49.99 }
}

The binding step happens entirely outside the model. The caller validates that each $input.* reference resolves to a known field and enforces the correct type before the tool is ever invoked. IAV disappears because the model never generates concrete values; IAT disappears because the caller controls the type at binding time. The remaining risk is IAN — the model naming a field that doesn’t exist in the structured input — but that is a name-existence check, not a value-accuracy problem, and it can be caught with a simple validation pass before execution.
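The binding step described above might look like the following. This is a sketch under assumptions: the bind function and the expected_types mapping are hypothetical names introduced for illustration, and a production resolver would likely support nested fields and richer type rules.

```python
PREFIX = "$input."

def bind(call, structured_input, expected_types):
    """Replace $input.<field> references with values the caller already holds."""
    bound = {}
    for param, ref in call["parameters"].items():
        if param not in expected_types:
            raise KeyError(f"unknown parameter '{param}'")
        # The model must emit references only, never literal values.
        if not (isinstance(ref, str) and ref.startswith(PREFIX)):
            raise ValueError(f"{param}: expected a {PREFIX}* reference, got {ref!r}")
        field = ref[len(PREFIX):]
        if field not in structured_input:
            raise KeyError(f"{param}: no input field named '{field}'")
        value = structured_input[field]
        # The caller, not the model, enforces the type.
        if not isinstance(value, expected_types[param]):
            raise TypeError(f"{param}: expected {expected_types[param].__name__}")
        bound[param] = value
    return {"tool": call["tool"], "parameters": bound}

structured_input = {"order_id": "ord_456", "amount": 49.99}
expected_types = {"order_id": str, "amount": float}
llm_call = {
    "tool": "create_refund",
    "parameters": {"order_id": "$input.order_id", "amount": "$input.amount"},
}
print(bind(llm_call, structured_input, expected_types))
```

Raising on any literal value is a deliberate choice in this sketch: it keeps the model from quietly falling back to transcription for fields the caller already holds.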

This technique requires structured input to be available, which puts it in the same family as the broader question of how to pass well-typed, pre-validated data into an agent’s context — a topic worth its own treatment.

GPT-4o scores 37 out of 100 on ToolBeHonest (arXiv 2406.20015), a benchmark specifically designed to diagnose honest tool use. The primary failure mode isn’t selecting the wrong tool — it’s solvability detection: the model doesn’t recognize when a task cannot be completed with the available tools and instead hallucinates parameters to make an existing tool appear to fit.

3. The Tool Count Effect

Adding more tools makes all of the above worse. Multiple studies now document this.

RAG-MCP (arXiv 2505.03275) ran models against a large MCP tool catalog and found that baseline accuracy — providing all tools in context — was 13.62%. Using retrieval to narrow the candidate set to relevant tools brought accuracy to 43.13%, a 3x improvement, while cutting prompt tokens by more than half.

The numbers from a smaller-scale study (arXiv 2411.15399) are even more direct: Llama 3.1 8B failed entirely to select the correct tool from a set of 46. The same task succeeded when the tool list was reduced to 19.

The mechanism is straightforward: with more tools, the model must make a harder disambiguation decision at each token step. The probability mass gets spread across more plausible tool names, increasing the chance that sampling produces something other than the right answer. This is a fundamental constraint of next-token prediction with a large, overlapping candidate set.

The Reasoning Trap

There is a counterintuitive finding worth noting. You might expect that models with stronger reasoning capability would hallucinate less. The evidence suggests the opposite, at least for tool selection.

The Reasoning Trap paper (arXiv 2510.22977) tested reasoning-capable models against distractor scenarios, where only irrelevant tools were provided. DeepSeek-R1-Distill-Qwen-7B hallucinated 78.7% of the time when given only distractor tools. Qwen3-32B with thinking enabled hallucinated 50.7% of the time under the same conditions.

The mechanistic finding: reinforcement learning for reasoning “disproportionately destabilizes tool-related representations” in early and middle transformer layers. The cosine similarity of tool-related representations drops below 0.75 in reasoning-RL models, compared to above 0.9 in models without reasoning RL. The reasoning training optimizes for confident, multi-step problem-solving — which, in distractor scenarios, means reasoning confidently toward a hallucinated tool rather than concluding there’s nothing useful available.

Longer chains of reasoning can actually increase hallucination by giving the model more opportunities to construct a plausible-sounding justification for an incorrect tool call.

How to Stop It

Constrained Decoding

The most effective intervention is also the most direct: restrict which tokens the model is allowed to generate at each step.

ToolDec (arXiv 2310.07075) implements a finite-state machine constructed from your tool API signatures. During generation, only tokens that are valid prefixes of registered tool names are eligible to be sampled — so the model literally cannot emit a ghost tool name. The FSM expands to allow any valid parameter name after the tool name is confirmed, then any valid value structure, and so on.

The results are striking: on Mistral-Instruct, accuracy went from 0% to 52% — matching specialized fine-tuned models without any fine-tuning. On REST API tool evaluation, tool errors dropped from 39–47% to 0%. And because the FSM masks invalid tokens before the softmax step, inference was up to 50% faster: fewer candidates means less computation.

Constrained decoding solves ghost invocations entirely and catches many parameter errors. It cannot solve IAV (wrong values that are structurally valid), but it eliminates the failure mode of malformed or non-existent tool names.
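The core idea fits in a few lines. The sketch below is a character-level toy and not ToolDec's implementation: real systems work on tokenizer tokens and mask logits inside the inference engine. The function names and scores are invented for illustration.

```python
def allowed_next(prefix, tool_names, end="<end>"):
    """Characters that keep `prefix` a valid prefix of some registered tool name."""
    allowed = set()
    for name in tool_names:
        if name.startswith(prefix):
            # Either the next character of this name, or an end marker if complete.
            allowed.add(name[len(prefix)] if len(prefix) < len(name) else end)
    return allowed

def constrained_pick(scores, prefix, tool_names):
    """Greedy step: take the highest-scoring candidate among the allowed ones."""
    allowed = allowed_next(prefix, tool_names)
    return max(allowed, key=lambda tok: scores.get(tok, float("-inf")))

TOOLS = ["get_orders", "search_customers", "update_status"]

# Hypothetical model scores: "w" (as in web_search) is the unconstrained favorite.
scores = {"w": 5.0, "s": 2.0, "g": 1.0}
print(max(scores, key=scores.get))          # unconstrained choice
print(constrained_pick(scores, "", TOOLS))  # choice restricted to registered names
```

Because "w" never appears in the allowed set, no sequence of constrained steps can spell web_search, however strongly the model prefers it.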

Most model providers now offer structured output modes that implement a version of this. Anthropic’s constrained sampling (tool use with tool_choice: {"type": "any"}) ensures the model generates a tool call, not free text. Extending this to grammar-constrained tool name generation is the logical next step — some open-source inference frameworks support it natively via GBNF grammars or Outlines.

Reduce the Tool Set

If you have more than ~20 tools, don’t send them all in every request. Use retrieval to narrow the candidate set first.

The RAG-MCP approach — embed tool descriptions, retrieve the top-k most relevant tools for each query, send only those — is straightforward to implement and produces large accuracy gains. The 3x improvement from 13.62% to 43.13% in the RAG-MCP paper is not an edge case; it reflects the fundamental difficulty of disambiguation in a large candidate set.
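A minimal sketch of the retrieval step follows. It substitutes bag-of-words cosine similarity for a real embedding model so that it runs with no dependencies; that substitution is an assumption of this example and not what the RAG-MCP paper uses. The tool descriptions and the send_invoice tool are made up.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words vector; a stand-in for an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k_tools(query, tools, k=2):
    """Return the names of the k tools whose descriptions best match the query."""
    q = vectorize(query)
    ranked = sorted(tools, key=lambda name: cosine(q, vectorize(tools[name])), reverse=True)
    return ranked[:k]

# Hypothetical catalog: tool name -> description.
TOOLS = {
    "get_orders": "Look up orders by order id",
    "search_customers": "Find customers by name or email",
    "update_status": "Change the shipping status of an order",
    "send_invoice": "Email an invoice to a customer",
}

print(top_k_tools("find customers by email", TOOLS, k=2))
```

Only the returned names would be sent as tool definitions for that request. Swapping vectorize for a real embedding call leaves the rest of the structure unchanged.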

A more structured approach is hierarchical tool routing: maintain a small set of high-level tool categories, select the relevant category first, then expose only the tools in that category. This reduces the disambiguation problem at each step to a tractable size.
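As a rough sketch of the two-step flow, assuming a hand-maintained category map (the grouping and the tools_for_step helper are hypothetical):

```python
# Hypothetical grouping; the tool names reuse the examples from this post.
CATEGORIES = {
    "orders": ["get_orders", "update_status"],
    "customers": ["search_customers"],
}

def tools_for_step(selected_category=None):
    """Return the names to expose at the current routing step."""
    if selected_category is None:
        # Step 1: the model only chooses among category names.
        return sorted(CATEGORIES)
    if selected_category not in CATEGORIES:
        raise ValueError(
            f"unknown category '{selected_category}'; choose from {sorted(CATEGORIES)}"
        )
    # Step 2: expose only the tools inside the chosen category.
    return list(CATEGORIES[selected_category])

print(tools_for_step())
print(tools_for_step("orders"))
```

The cost is an extra model turn per request, so this tends to pay off only when the catalog is large enough that a flat list measurably hurts accuracy.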

Treat Tool Names as Semantic Anchors

The model generates token sequences. Tool names that align with patterns the model already knows get activated more reliably than novel naming conventions.

Prefer common, idiomatic names: search_web over internet_query_executor, read_file over retrieve_document_contents. Align parameter names with standard conventions: query, path, user_id, limit, offset. Avoid abbreviations and domain-specific naming that has no analog in training data.

This is the same principle as the built-in tools post: closing the distribution gap between your tool’s invocation pattern and what the model was trained on improves reliability. You can’t get to zero gap with a custom tool, but you can choose names and schemas that minimize it.

Write Descriptions That Include Negative Cases

The default instinct for tool descriptions is to describe what a tool does. It is equally important to describe when not to use it.

The solvability detection failure documented in ToolBeHonest — the model trying to fit an existing tool to an unsolvable task — is partly a description problem. If your tool descriptions don’t say “use this only when X is available” or “this tool does not handle Y”, the model will apply them to Y.

Each tool description functions as a mini-system-prompt for that tool’s selection. Concrete exclusions reduce the probability of incorrect selection: "Use this to look up orders by order_id. Do not use for customer lookups — use search_customers for those." The negative constraint gives the model something to discriminate on.
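Written out as a tool definition, that description might look like the following. The dictionary uses the name / description / input_schema shape from Anthropic's tool-use API as an assumed target; other providers use slightly different keys, and the order_id example value is invented.

```python
# Hypothetical definition for the get_orders tool discussed above.
GET_ORDERS = {
    "name": "get_orders",
    "description": (
        "Use this to look up orders by order_id. "
        "Do not use for customer lookups; use search_customers for those. "
        "Requires an order_id that the user has already provided."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order identifier, for example ord_456.",
            }
        },
        "required": ["order_id"],
        "additionalProperties": False,
    },
}

print(GET_ORDERS["description"])
```

The last sentence of the description states a precondition, which targets the solvability problem directly: it tells the model what must be true before the tool applies.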

Validate at the Boundary and Return Structured Errors

Never silently discard a tool call to a non-existent tool. Return a clear error message that the model can reason about:

Error: Tool 'web_search' is not available. Available tools: get_orders, search_customers, update_status.

This is the single cheapest intervention and it’s frequently skipped. When the model receives an empty response or a generic error, it has no signal to correct its behavior — it will often retry the same hallucinated call or proceed as if the call succeeded. An explicit error listing available tools gives the model exactly the in-context information it needs to course-correct.
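A boundary check of this kind is only a few lines. In the sketch below, the dispatch function and the registry of lambdas are hypothetical, and the near-match hint from the standard library's difflib is an optional extra beyond what the error above shows.

```python
import difflib

def dispatch(call, registry):
    """Run a tool call, or return an error the model can act on."""
    name = call.get("tool")
    if name not in registry:
        available = sorted(registry)
        message = (
            f"Error: Tool '{name}' is not available. "
            f"Available tools: {', '.join(available)}."
        )
        # Optional hint when the name is close to a registered one.
        close = difflib.get_close_matches(str(name), available, n=1)
        if close:
            message += f" Did you mean '{close[0]}'?"
        return {"ok": False, "error": message}
    return {"ok": True, "result": registry[name](**call.get("parameters", {}))}

# Hypothetical registry with stub implementations.
REGISTRY = {
    "get_orders": lambda customer_id: [f"order for {customer_id}"],
    "search_customers": lambda query: [],
    "update_status": lambda order_id, status: True,
}

print(dispatch({"tool": "web_search", "parameters": {"query": "refund policy"}}, REGISTRY))
```

The error dictionary would be returned to the model as the tool result for that turn, so the list of valid names sits in context for its next attempt.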

Putting It Together

Tool call hallucination is not random. The model is pattern-matching against its training distribution, and the failures are systematic: ghost invocations happen when trained tool patterns activate without a matching registered tool; parameter errors happen when the schema doesn’t align with trained parameter conventions; the distractor scenario produces near-certain hallucination because the model was optimized to always produce an answer.

Each mitigation targets a specific part of this mechanism:

Constrained decoding eliminates ghost invocations by restricting token generation to valid names
Tool count reduction reduces disambiguation difficulty by limiting the candidate set
Name and description alignment closes the distribution gap between your tools and trained patterns
Structured error messages give the model in-context recovery information when something goes wrong

The deepest fix is training the model on your specific tool schemas, which is what Gorilla demonstrated in 2023: where zero-shot GPT-4 hallucinated 78.65% of calls, a LLaMA-based model fine-tuned on tool-specific data brought the rate down to single digits. That’s not practical for most teams, but it establishes the ceiling. The production approximation is constrained decoding plus retrieval-narrowed tool lists, both of which are available today and together account for the majority of the improvement.
