
AI Agents and Tool Use

ai · agents · tool-use · ai-fundamentals

This post explores how AI models take actions in the real world through tool use and agent architectures. It builds on concepts from across the series, particularly How LLMs Actually Work and Prompting and Inference.

Introduction

Every post in this series so far has focused on models that take text in and produce text out. That is useful, but limited. Real-world tasks require more than generating text. They require looking things up, running calculations, calling APIs, reading files, writing code, and chaining multiple steps together.

Agents are systems that give models the ability to take actions. Instead of just answering questions, an agent can search a database, send an email, create a ticket, run a query, or modify a file. This is the capability that turns LLMs from sophisticated text generators into systems that can do work.


Function Calling: The Foundation

Function calling (also called tool use) is the mechanism that allows a model to request the execution of external functions. The model does not execute anything itself. It generates a structured request describing which function to call and with what arguments. The system executes the function and returns the result to the model.

Here is the flow:

  1. You define a set of available functions (tools) with their names, descriptions, and parameter schemas
  2. The user sends a message
  3. The model decides whether to respond with text or call a function
  4. If calling a function, the model generates a structured object: function name + arguments
  5. Your system executes the function and returns the result
  6. The model uses the result to generate its final response

Example:

User: “What’s the weather in Chicago?”

Available tools:

{
  "name": "get_weather",
  "description": "Get current weather for a city",
  "parameters": {
    "city": { "type": "string" },
    "units": { "type": "string", "enum": ["fahrenheit", "celsius"] }
  }
}

Model decides to call:

{
  "function": "get_weather",
  "arguments": { "city": "Chicago", "units": "fahrenheit" }
}

System executes the function, returns: {"temp": 42, "condition": "cloudy"}

Model generates: “It’s currently 42°F and cloudy in Chicago.”

The model never accessed a weather API. It generated a structured request that your code executed. This distinction is critical for understanding the security and reliability characteristics of tool use.
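The separation between the model's request and your code's execution can be made concrete. The sketch below is illustrative, not any provider's API: `execute_tool_call`, the `TOOLS` registry, and the stubbed `get_weather` are all hypothetical names standing in for your own dispatch layer.

```python
import json

# Hypothetical tool implementation -- a real system would call a weather API here.
def get_weather(city: str, units: str = "fahrenheit") -> dict:
    return {"temp": 42, "condition": "cloudy"}

# Registry mapping tool names to callables. The model only ever sees the
# names and schemas; your code owns execution.
TOOLS = {"get_weather": get_weather}

def execute_tool_call(call: dict) -> str:
    """Execute a structured tool call emitted by the model and return a
    JSON string to feed back into the conversation."""
    func = TOOLS.get(call["function"])
    if func is None:
        return json.dumps({"error": f"unknown tool: {call['function']}"})
    try:
        result = func(**call["arguments"])
    except TypeError as exc:  # model generated bad or missing arguments
        return json.dumps({"error": str(exc)})
    return json.dumps(result)

# The model's structured request from the example above:
call = {"function": "get_weather",
        "arguments": {"city": "Chicago", "units": "fahrenheit"}}
print(execute_tool_call(call))  # {"temp": 42, "condition": "cloudy"}
```

Note that an unknown tool name or bad arguments produce an error payload rather than an exception: the error goes back to the model as an observation, which is what lets it recover or rephrase.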

When this matters in practice:

  • Function calling is how chatbots look up orders, check account balances, and perform real actions for users.
  • The model decides when to call a function based on the user’s message and the tool descriptions. Good tool descriptions are as important as good prompts.
  • The model can call multiple functions in sequence or in parallel. “What’s the weather in Chicago and New York?” can trigger two parallel function calls.

Designing Good Tools

The quality of an agent depends heavily on how its tools are designed.

Tool Design Principles

Single responsibility. Each tool should do one thing well. “search_database” is better than “search_and_format_and_email_results.”

Clear descriptions. The model uses tool descriptions to decide when to call them. Vague descriptions lead to wrong tool selection.

# Bad
"name": "process",
"description": "Processes data"

# Good
"name": "search_orders",
"description": "Search customer orders by order ID, customer
email, or date range. Returns order status, items, and
shipping details."

Explicit parameters. Define required vs. optional parameters, types, enums, and validation constraints. The model generates better arguments when the schema is precise.

Meaningful return values. Return data the model can use to formulate a response. Include relevant context, not just raw IDs or status codes.

Error handling. Return clear error messages when a tool call fails. The model can often recover or ask the user for clarification if it understands what went wrong.
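Precise schemas also let you validate model-generated arguments before executing anything. The checker below is a minimal sketch for JSON-Schema-like parameter dicts (a production system would more likely use a full validator such as the `jsonschema` library); the `WEATHER_TOOL` definition mirrors the example earlier in the post.

```python
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "city": {"type": "string"},
        "units": {"type": "string", "enum": ["fahrenheit", "celsius"]},
    },
    "required": ["city"],
}

# Map schema type names to Python types for the checks below.
TYPES = {"string": str, "number": (int, float), "boolean": bool}

def validate_arguments(schema: dict, args: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the
    call is safe to execute."""
    errors = []
    params = schema["parameters"]
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required parameter: {name}")
    for name, value in args.items():
        spec = params.get(name)
        if spec is None:
            errors.append(f"unexpected parameter: {name}")
            continue
        if not isinstance(value, TYPES[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: must be one of {spec['enum']}")
    return errors
```

Running the checker before dispatch turns a hallucinated `units="kelvin"` into a clear error message the model can act on, instead of a failed API call downstream.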

How Many Tools?

More tools give the agent more capability but also more opportunities to choose wrong.

Number of Tools | Considerations
1-5   | Easy for the model to select correctly. Low overhead.
5-15  | Works well with good descriptions. Group related tools logically.
15-30 | Model selection accuracy starts to drop. Consider categorization or routing.
30+   | Consider breaking into sub-agents or using a tool selection layer.

When this matters in practice:

  • Frontier models handle 10-20 tools reliably. Smaller models struggle with more than 5-10.
  • If you need many tools, consider a hierarchical approach: the main agent selects a category, then a specialized sub-agent selects the specific tool.
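The hierarchical approach can be as simple as grouping tools by category and narrowing the list before the sub-agent sees it. The category names and tool lists below are hypothetical, for illustration only.

```python
# Tools grouped by category. The top-level agent chooses among a short list
# of categories; only the chosen category's tools are exposed to the sub-agent.
TOOL_CATEGORIES = {
    "orders": ["search_orders", "process_return", "check_shipping"],
    "billing": ["get_invoice", "issue_refund"],
    "knowledge": ["search_knowledge_base"],
}

def tools_for_category(category: str) -> list[str]:
    """Return the narrowed tool list handed to the specialized sub-agent."""
    return TOOL_CATEGORIES.get(category, [])

# Instead of selecting among all tools at once, the sub-agent sees only three:
print(tools_for_category("orders"))  # ['search_orders', 'process_return', 'check_shipping']
```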

Agent Architectures

Simple function calling handles single-step tool use. Agents handle multi-step workflows.

ReAct (Reasoning + Acting)

The most common agent pattern. The model alternates between reasoning about what to do and taking action:

  1. Thought: “The user wants to know their order status. I need to look up their order.”
  2. Action: Call search_orders(email="[email protected]")
  3. Observation: Order #12345, shipped, tracking: 1Z999AA10123456784
  4. Thought: “I have the order details. I should check the tracking status.”
  5. Action: Call track_shipment(tracking="1Z999AA10123456784")
  6. Observation: In transit, estimated delivery March 21
  7. Response: “Your order #12345 has shipped and is expected to arrive on March 21.”

The model is planning, executing, observing, and adapting. Each step informs the next.
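The loop driving this pattern is small. The sketch below uses a scripted stand-in for the model (`scripted_model`) so it runs on its own; in practice `model_step` would be an API call that returns either a tool call or a final answer. All names are illustrative, and the hard step cap guards against the infinite-loop failure mode discussed later.

```python
import json

MAX_STEPS = 8  # hard cap so a stuck agent cannot loop forever

def run_agent(model_step, tools: dict, user_message: str) -> str:
    """Minimal ReAct loop: ask the model for its next step; if it is a tool
    call, execute it and append the observation; stop on a final answer."""
    transcript = [{"role": "user", "content": user_message}]
    for _ in range(MAX_STEPS):
        step = model_step(transcript)  # model decides: act or answer
        if step["type"] == "final":
            return step["content"]
        result = tools[step["function"]](**step["arguments"])
        transcript.append({"role": "tool", "content": json.dumps(result)})
    return "Step limit reached without a final answer."

# Stubbed tools matching the trace above.
def search_orders(email):
    return {"order": "#12345", "tracking": "1Z999AA10123456784"}

def track_shipment(tracking):
    return {"status": "in transit", "eta": "March 21"}

# Scripted stand-in for the model, following the thought/action trace above.
def scripted_model(transcript):
    observations = sum(1 for m in transcript if m["role"] == "tool")
    if observations == 0:
        return {"type": "call", "function": "search_orders",
                "arguments": {"email": "[email protected]"}}
    if observations == 1:
        return {"type": "call", "function": "track_shipment",
                "arguments": {"tracking": "1Z999AA10123456784"}}
    return {"type": "final",
            "content": "Order #12345 has shipped; estimated delivery March 21."}

print(run_agent(scripted_model,
                {"search_orders": search_orders, "track_shipment": track_shipment},
                "Where is my order?"))
```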

Planning Agents

For complex tasks, the agent creates a plan before acting:

  1. Break the task into steps
  2. Identify which tools are needed for each step
  3. Execute steps in order, adjusting the plan as results come in
  4. Synthesize results into a final output

This is how coding agents (like Claude Code, Cursor, GitHub Copilot) handle multi-file changes: analyze the request, identify affected files, plan the changes, execute them, verify the results.

Multi-Agent Systems

Multiple specialized agents collaborate on a task:

  • Orchestrator agent breaks the task into subtasks and assigns them
  • Research agent gathers information
  • Writing agent produces content
  • Review agent checks quality

Each agent has its own tools, instructions, and possibly its own model. The orchestrator coordinates their work.

When this matters in practice:

  • Start with simple function calling. Move to ReAct when you need multi-step reasoning. Move to multi-agent when individual agents hit complexity limits.
  • Planning agents are better for tasks where the full sequence matters (like code changes). ReAct agents are better for open-ended exploration (like research).
  • Multi-agent systems add coordination overhead. Use them when the complexity of a single agent’s tool set or instructions becomes unmanageable.

Practical Agent Use Cases

Customer Support

An agent that handles support requests end-to-end:

  • Tools: search_orders, check_shipping, process_return, create_ticket, search_knowledge_base
  • Flow: Understand the issue, look up relevant data, attempt resolution, escalate if needed
  • Oversight: Human review for refunds above a threshold, automatic handling for routine queries

Code Assistants

Agents that read, write, and modify code:

  • Tools: read_file, write_file, search_codebase, run_tests, execute_command
  • Flow: Understand the request, explore the codebase, plan changes, implement, verify with tests
  • Example: Claude Code, Cursor, GitHub Copilot Workspace

Data Analysis

Agents that answer questions by querying data:

  • Tools: run_sql_query, create_chart, search_documentation, calculate
  • Flow: Interpret the question, write and execute queries, analyze results, present findings
  • Oversight: Read-only database access, query complexity limits

Research and Summarization

Agents that gather and synthesize information:

  • Tools: web_search, read_url, search_internal_docs, summarize
  • Flow: Search for relevant sources, read and extract key information, synthesize into a coherent summary
  • Example: Perplexity AI, research-focused agent workflows

Workflow Automation

Agents that handle multi-system processes:

  • Tools: create_jira_ticket, send_slack_message, update_crm, deploy_to_staging
  • Flow: Receive a trigger, execute a sequence of actions across systems, report results
  • Oversight: Approval gates for high-impact actions

Safety and Oversight

Agents that take actions need guardrails. An agent with access to production databases, email systems, or financial tools can cause real harm if it makes wrong decisions.

Principles

Least privilege. Give agents only the tools and permissions they need. A support agent does not need access to production deployment tools.

Human-in-the-loop. For high-stakes actions (refunds, data deletion, external communications), require human approval before execution.

Reversibility. Prefer reversible actions. Create a draft email instead of sending it. Stage a change instead of deploying it.

Logging. Record every tool call, its arguments, and its result. This is essential for debugging, auditing, and understanding agent behavior.

Rate limiting. Prevent runaway agents from making excessive API calls or taking too many actions in a short period.

Sandboxing. Run code execution in isolated environments. Do not give agents access to production systems without safeguards.
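Several of these principles can live in one thin wrapper around tool execution. The class below is a sketch, not a production design: `GuardedExecutor` and its parameters are hypothetical, and a real system would persist the audit log and use a proper approval workflow rather than a callback.

```python
import time

class GuardedExecutor:
    """Wraps tool execution with an allowlist (least privilege), a human
    approval gate for sensitive tools, a rate limit, and an audit log."""

    def __init__(self, tools, sensitive, max_calls_per_minute=30, approve=input):
        self.tools = tools                  # least privilege: only these exist
        self.sensitive = set(sensitive)     # these require human approval
        self.max_calls = max_calls_per_minute
        self.approve = approve              # callable returning "y" to proceed
        self.call_times = []
        self.audit_log = []

    def execute(self, name, arguments):
        now = time.monotonic()
        # Rate limiting: drop timestamps older than a minute, then check.
        self.call_times = [t for t in self.call_times if now - t < 60]
        if len(self.call_times) >= self.max_calls:
            return {"error": "rate limit exceeded"}
        if name not in self.tools:
            return {"error": f"tool not permitted: {name}"}
        # Human-in-the-loop: sensitive tools need explicit sign-off.
        if name in self.sensitive and self.approve(f"Allow {name}? [y/N] ") != "y":
            self.audit_log.append((name, arguments, "denied"))
            return {"error": "denied by human reviewer"}
        self.call_times.append(now)
        result = self.tools[name](**arguments)
        self.audit_log.append((name, arguments, result))  # logging principle
        return result
```

Passing `approve=lambda prompt: "y"` (or `"n"`) makes the gate testable without a terminal; in production this would be a ticket or UI confirmation instead.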

Failure Modes

Tool misselection. The agent calls the wrong tool. Mitigation: better tool descriptions, confirmation prompts for ambiguous cases.

Argument hallucination. The agent generates plausible but incorrect arguments (wrong customer ID, fabricated order number). Mitigation: validate arguments before execution, use lookups instead of generation for IDs.

Infinite loops. The agent gets stuck retrying a failed action. Mitigation: step limits, timeout enforcement.

Scope creep. The agent takes actions beyond what was requested. Mitigation: explicit scope constraints in the system prompt, tool-level access controls.

When this matters in practice:

  • Start with read-only tools. Add write/action tools incrementally, with oversight at each step.
  • The cost of a bad agent action can far exceed the cost of a wrong text response. Design accordingly.
  • Users should always know when they are interacting with an agent and what actions the agent can take. Transparency builds trust.

Building Your First Agent

A practical starting path:

  1. Start with function calling. Define 2-3 tools. Build a system where the model can call them.
  2. Add a system prompt. Define the agent’s role, available tools, and constraints.
  3. Implement the loop. Model generates a response. If it includes a tool call, execute it and feed the result back. Repeat until the model generates a final text response.
  4. Add error handling. What happens when a tool call fails? When the model generates invalid arguments?
  5. Add oversight. Logging, rate limiting, and human approval for sensitive actions.
  6. Test adversarially. Try to confuse the agent, trigger wrong tool calls, and push it outside its defined scope.

Frameworks

Several frameworks simplify agent development:

  • LangChain / LangGraph: Popular Python framework with built-in agent patterns, tool management, and chain composition.
  • Claude Agent SDK: Anthropic’s SDK for building agents with Claude.
  • OpenAI Assistants API: Managed agent infrastructure from OpenAI.
  • CrewAI: Multi-agent framework for collaborative agent workflows.
  • AutoGen: Microsoft’s framework for multi-agent conversations.

Frameworks help with boilerplate but add abstraction. For simple agents, direct API integration is often clearer and easier to debug.


What Comes Next

This post covered how models move beyond text generation through tool use and agent architectures. The final post in this series presents The Business Case for AI Adoption: making the case for AI investment to organizational leadership.


Closing Thoughts

Agents represent the shift from AI that talks to AI that does. Function calling and tool use are mature capabilities available from every major model provider. The question is no longer whether models can take actions, but how to design systems that let them do so safely and effectively.

The key principle: start simple. A single tool call that saves a user three clicks is more valuable than an autonomous agent that occasionally makes the wrong decision. Build trust through reliability. Add capability incrementally. Keep humans in the loop for actions that matter.

The technology for agents is ready. The design challenge is deciding what to automate, what to assist, and what to leave to humans. That decision depends on the stakes, the reliability requirements, and the cost of getting it wrong.

Found this useful?

If this post helped you, consider buying me a coffee.
