thesubhstack

Tool Calling with LLMs


Introduction

LLMs have a fundamental limitation: their knowledge is frozen at training time. Yet when you ask ChatGPT "What's the weather in London?" or request current information, it often gives you accurate, real-time answers. How? Through tool-calling—a mechanism where the LLM doesn't access the world directly, but instead generates JSON instructions that tell an application layer which external tools to call (weather APIs, databases, search engines, etc.).

The app executes these tools, sends the results back to the LLM, and the LLM uses that fresh data to craft an intelligent response. It's elegantly simple: the LLM is a reasoning engine that decides what to ask for, while the app layer handles the actual execution. In this post, we'll break down exactly how this works—from the basic flow to the schema alignment that makes it reliable.

The sequential flow looks like this:

  1. User sends query to App Layer
  2. App Layer sends (Query + System Prompt + Tool Definitions) to LLM
  3. LLM analyzes and sends back a Tool Call (JSON)
  4. App Layer executes the tool and sends back the Result
  5. LLM generates the Final Answer
  6. App Layer sends the Response to the User

Steps 2-5 can loop if the LLM needs to call multiple tools before generating the final answer.
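The loop above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in (`call_llm`, `execute_tool`, the message roles); real provider SDKs differ in message format, but the shape of the loop is the same:

```python
import json

def call_llm(messages):
    # Stand-in for a real provider API call. This stub emits a tool call
    # on the first turn, then a final answer once a tool result is in context.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "weather_fetch", "params": {"location": "London"}}
    return {"answer": "The weather in London is currently rainy, 12°C."}

def execute_tool(name, params):
    # Stand-in for the app layer's real tool execution (e.g. a weather API).
    return {"location": params["location"], "temperature": 12, "condition": "Rainy"}

def run(query, system_prompt, tool_definitions):
    # Step 2: query + system prompt + tool definitions go to the LLM.
    messages = [
        {"role": "system", "content": system_prompt + json.dumps(tool_definitions)},
        {"role": "user", "content": query},
    ]
    while True:
        reply = call_llm(messages)
        if "tool" in reply:  # Step 3: the LLM emitted a tool call.
            result = execute_tool(reply["tool"], reply["params"])  # Step 4.
            messages.append({"role": "assistant", "content": json.dumps(reply)})
            messages.append({"role": "tool", "content": json.dumps(result)})
            continue  # Loop back: steps 2-5 repeat until no more tools are needed.
        return reply["answer"]  # Steps 5-6: final answer goes back to the user.
```

The `while True` is the "steps 2-5 can loop" part: each tool result is appended to the context before the next inference.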

How Does the LLM Decide Which Tool to Call?

Schema Standardization: The Foundation

Before we dive into how the LLM makes decisions, we need to understand something crucial: schema matters.

When you look at how tool-calling works across different LLM providers (OpenAI, Anthropic, Google, etc.), they all use similar structures to define tools. A tool definition typically looks like this:

json
{
  "name": "weather_fetch",
  "description": "Get current weather for a location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City and country"
      }
    },
    "required": ["location"]
  }
}

And when the LLM outputs a tool call, it follows this schema:

json
{
  "tool": "weather_fetch",
  "params": {
    "location": "London"
  }
}

Why does this matter?

Because LLMs are pre-trained on massive datasets—including millions of API calls, function definitions, and code examples that follow these patterns—and then specifically fine-tuned to handle tool-calling. The model learns that:

  • Tool names follow a verb_noun pattern (e.g., fetch_weather, search_web, calculate_sum)
  • Parameters are defined as JSON objects with properties
  • Tool calls output as JSON with a specific format

This learned pattern is baked into the model's weights through both pre-training and dedicated fine-tuning on tool-use datasets. When you give the LLM a tool definition in this standard schema, the model recognizes the pattern and knows how to respond.

How Training Increases Reliability

Because the LLM was trained on this schema pattern, it can reliably predict which tokens to output when it sees a similar schema at inference time.

Think of it this way: during fine-tuning, the model was given thousands of examples like:

"User: Get the weather in London"
→ Call weather API
→ Output: { "tool": "weather_fetch", "params": {"location": "London"} }

So when you give it a system prompt that says "You have access to weather_fetch", and the user asks "What's the weather in London?", the model's learned patterns activate. The probability that it will output a tool call in the correct JSON format is high.

If you gave it a tool with a completely novel name—something the model never saw during training—it can still work well as long as the tool description is clear. The model relies heavily on the description and parameter schema to understand what a tool does, not just on whether it's seen the exact name before. A well-written description matters more than a familiar name.

This is schema alignment: when your runtime schema matches the patterns the model learned during training, and your descriptions clearly convey what each tool does, predictions are reliable. When there's a mismatch—unusual formats, vague descriptions, or ambiguous tool boundaries—they're not.
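Schema alignment can also be enforced mechanically on the app side. Here's a minimal, stdlib-only sketch (a real system would likely use a full JSON Schema validator) that checks a model-emitted call against the declared tool definition from earlier in this post:

```python
# Map JSON Schema type names to Python types (a subset, for illustration).
JSON_TYPES = {"string": str, "number": (int, float), "integer": int,
              "boolean": bool, "object": dict, "array": list}

def conforms(call, definition):
    """Return True if a model-emitted tool call matches the declared schema."""
    if call.get("tool") != definition["name"]:
        return False
    schema = definition["parameters"]
    params = call.get("params", {})
    for name in schema.get("required", []):  # every required param must be present
        if name not in params:
            return False
    for name, value in params.items():
        declared = schema["properties"].get(name)
        if declared is None:  # the model invented an undeclared parameter
            return False
        if not isinstance(value, JSON_TYPES[declared["type"]]):
            return False
    return True

weather_def = {
    "name": "weather_fetch",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "City and country"}},
        "required": ["location"],
    },
}
```

When a call fails this check, the app can return the error to the model and ask it to retry, rather than executing a malformed call.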

Reading the System Prompt and Tool Definitions

Now, when the LLM receives your query, it doesn't receive it in isolation. It receives it alongside two critical pieces of information:

  1. System Prompt: Instructions like "You are a helpful assistant. You have access to the following tools..."
  2. Tool Definitions: The list of available tools with their descriptions and parameters

Together with the user's query, these fill the LLM's context window. The LLM reads all three pieces of information together:

  • The user's query
  • What tools exist (from the system prompt and definitions)
  • What each tool does (from the descriptions)

The LLM then makes a semantic match: "User wants weather → weather_fetch tool is available and does exactly this → I should call weather_fetch with location=London."

This matching happens through the same learned patterns. The model has seen countless examples where:

  • A description of what a user wants
  • Maps to a function with a matching description
  • Results in a specific tool call

One important detail: the model also has to decide whether to call a tool at all. If the user asks "What's 2 + 2?", the model should just answer directly. If the user asks "What's the weather right now in London?", a tool call is needed. This decision—tool call vs. direct answer—is part of the same prediction process.
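In code, the app layer's side of that decision is a simple branch on the model's output: does it parse as a tool call, or is it a direct answer? (This sketch sniffs raw JSON using the field names from this post's examples; real provider APIs mark tool calls explicitly in their response objects.)

```python
import json

def handle_model_output(raw_output):
    """Classify the model's output as a tool call or a direct answer."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return ("answer", raw_output)   # plain text: pass it to the user directly
    if isinstance(parsed, dict) and "tool" in parsed:
        return ("tool_call", parsed)    # structured call: hand off to the tool runtime
    return ("answer", raw_output)       # valid JSON but not a tool call

kind, payload = handle_model_output('{"tool": "weather_fetch", "params": {"location": "London"}}')
# kind == "tool_call"
kind2, payload2 = handle_model_output("2 + 2 = 4")
# kind2 == "answer"
```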

The Prediction: Which Tool Call to Output

Based on this analysis, the LLM predicts: "The next tokens I should generate are a JSON tool call for weather_fetch."

It doesn't "decide" in a conscious way. Instead, the transformer layers process all the context, and the probability distribution over possible next tokens is heavily biased toward outputting the tool call in the expected format.

The LLM then generates tokens sequentially:

  • Token 1: {
  • Token 2: "tool"
  • Token 3: :
  • Token 4: "weather_fetch"
  • ... and so on

Each token prediction is influenced by what came before and the context that was fed in. Because the schema was standard (matching what the model was trained on), these predictions are confident and reliable.

Handing Off to the Application Layer

Once the LLM finishes generating the complete tool call, it outputs:

json
{
  "tool": "weather_fetch",
  "params": {
    "location": "London"
  }
}

The application layer receives this JSON. It validates it, ensures the tool exists, checks the parameters are correct, and then executes the actual tool. In this case, it calls the real weather API for London and gets back actual, real-time data:
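A minimal sketch of that validate-then-execute step, assuming a registry keyed by tool name (the registry shape and the stubbed executor are illustrative, not any particular framework's API):

```python
def validate_tool_call(call, registry):
    """Check the tool exists and required params are present, then execute it."""
    name = call.get("tool")
    if name not in registry:
        raise ValueError(f"Unknown tool: {name!r}")
    definition = registry[name]
    missing = [p for p in definition["parameters"]["required"]
               if p not in call.get("params", {})]
    if missing:
        raise ValueError(f"Missing required params: {missing}")
    return definition["fn"](**call["params"])

registry = {
    "weather_fetch": {
        "parameters": {"required": ["location"]},
        # Stubbed executor: a real app would call a live weather API here.
        "fn": lambda location: {"location": location, "temperature": 12,
                                "condition": "Rainy", "humidity": 75},
    }
}

result = validate_tool_call(
    {"tool": "weather_fetch", "params": {"location": "London"}}, registry
)
```

Failing fast here matters: an unknown tool name or missing parameter should produce an error the app can surface (or feed back to the model), never a silent bad API call.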

json
{
  "location": "London, UK",
  "temperature": 12,
  "condition": "Rainy",
  "humidity": 75
}

Making Another Inference: Context + Results

Now comes the second inference. The application layer takes that real-time data and adds it back to the context. The LLM now sees:

Original Query: "What's the weather in London?"
System Prompt: [same as before]
Tool Definitions: [same as before]
LLM's Previous Tool Call: { "tool": "weather_fetch", "params": {"location": "London"} }
Tool Result: { "location": "London, UK", "temperature": 12, ... }

With this updated context, the LLM makes another inference. It reads the tool result and decides: "I have all the information the user asked for. I don't need to call any more tools. I should now generate a natural language response."
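Mechanically, "adding it back to the context" is just appending to the message list before the next inference. A sketch (the `tool` role and message shape are typical; exact field names vary by provider):

```python
import json

messages = [
    {"role": "system", "content": "You are a helpful assistant. Tools: weather_fetch"},
    {"role": "user", "content": "What's the weather in London?"},
    # The LLM's tool call from the first inference:
    {"role": "assistant",
     "content": json.dumps({"tool": "weather_fetch", "params": {"location": "London"}})},
]

tool_result = {"location": "London, UK", "temperature": 12,
               "condition": "Rainy", "humidity": 75}

# Append the real-time tool result so the second inference sees it in context.
messages.append({"role": "tool", "content": json.dumps(tool_result)})

# The next inference over `messages` can now ground its answer in real data.
```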

It then predicts tokens for the final answer:

"The weather in London is currently rainy with a temperature of 12°C and 75% humidity."

This answer is grounded in real data, not hallucinated from the training set. Because the LLM had access to the actual tool result, it can give an accurate, current answer.

The Full Loop: It's JSON All the Way Down

Tool-calling isn't magic—it's a well-defined loop. The LLM reads the user's query alongside the available tool definitions, decides whether a tool is needed (and if so, which one), and outputs a structured JSON call. The application layer takes over from there, executes the real tool, and feeds the result back into context. The LLM then makes a second inference to turn that real data into a natural language response.

The whole thing works reliably because of two things: standardized schemas that match what the model was trained on, and clear tool descriptions that help the model make the right match. Get those two things right, and the rest follows.