Commander Track
Module 1 of 6

Multi-Agent Systems

Agent architectures, tool use, and orchestrating multiple AI agents together.

18 min read

What You'll Learn

  • Distinguish between AI chatbots and AI agents, and explain the agent loop that drives agentic behavior
  • Understand how tool use and function calling work at the API level
  • Compare single-agent and multi-agent architectures and identify when each is appropriate
  • Evaluate orchestration patterns (sequential, parallel, and hierarchical) and their practical tradeoffs
  • Recognize the key failure modes in multi-agent systems and apply strategies to manage cost and error propagation

Agents vs. Chatbots: A Meaningful Distinction

You have probably used a chatbot. You type something, it responds, and the interaction is complete. The model is stateless in any meaningful sense: it sees the conversation history you send it and produces one output. There is no planning, no tool use, no persistence beyond what you explicitly pass in.

AI agents are a different category of system. An agent is an AI that can take actions in the world, observe the results, and decide what to do next, iterating until it reaches a goal. The defining characteristic is not the model itself but the control loop wrapped around it.

Think about the difference between asking someone a question (chatbot) versus assigning them a project (agent). The project requires planning, breaking down into steps, using tools like email or spreadsheets, handling unexpected results, and course-correcting. An agent does the same thing, just with API calls instead of emails and code instead of spreadsheets.

This distinction matters because building agents requires thinking differently. You are not crafting a single prompt to get a single response. You are designing a system that will run autonomously over multiple steps, potentially for minutes or hours, consuming tokens and making decisions you cannot directly supervise at runtime.

The Agent Loop: Observe, Think, Act

Every agent architecture, regardless of the framework or model underneath it, runs some version of the same fundamental loop.

Observe: The agent receives input, whether an initial task, a user message, or the output of a previous action. That input is assembled into a context window: the system prompt, conversation history, available tools, and any retrieved information.

Think: The model processes the context and produces output. In a simple chatbot, this output is shown to the user. In an agent, the output might be a decision about what tool to call, a plan to break a problem into subtasks, or reasoning steps before taking action. Many modern agent implementations use explicit reasoning steps (often called "chain-of-thought" or "scratchpad") where the model works through the problem before committing to an action.

Act: The agent executes its chosen action. This might mean calling an external API, running code, reading a file, searching the web, writing to a database, or invoking another model. The result of that action feeds back into the next Observe step.

This loop continues until the agent determines it has completed the task, hits a stopping condition, or encounters an error it cannot recover from. The number of loop iterations is called the step count or turn count, and it directly determines cost. Each loop iteration consumes tokens.

Understanding the loop also tells you where things break. If the model makes a wrong decision in the "Think" step, the wrong action gets taken, the wrong result comes back, and the model now has to reason from a corrupted state. Error recovery is not guaranteed. This is why robust agents need careful guardrails around both the thinking and acting phases.

Set a maximum step limit

Every production agent should have a hard cap on the number of loop iterations. Without one, a confused agent can run indefinitely, burning API budget and doing nothing useful. A step limit of 10-20 is reasonable for most tasks; complex research agents might warrant 50. Always log which step triggered the stop.
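The loop and its step limit can be sketched in a few lines. This is a minimal illustration, not a production implementation: `call_model` and `run_tool` are stand-in stubs for a real LLM API call and your tool layer.

```python
# A minimal sketch of the observe-think-act loop with a hard step limit.
# `call_model` and `run_tool` are stubs; in a real agent they would wrap
# an LLM API and your actual tool implementations.

MAX_STEPS = 10  # hard cap: a confused agent must not loop forever

def call_model(context):
    # Stub "Think" step: decide on an action from the current context.
    if "search results" in context[-1]:
        return {"type": "final", "answer": "done"}
    return {"type": "tool", "name": "search_web",
            "args": {"query": "protein folding"}}

def run_tool(name, args):
    # Stub "Act" step: execute the requested tool and return its result.
    return f"search results for {args['query']}"

def run_agent(task):
    context = [task]                       # Observe: assemble initial context
    for step in range(1, MAX_STEPS + 1):
        decision = call_model(context)     # Think
        if decision["type"] == "final":
            return decision["answer"], step
        result = run_tool(decision["name"], decision["args"])  # Act
        context.append(result)             # result feeds the next Observe
    # Log which step triggered the stop, then fail loudly rather than spin.
    raise RuntimeError(f"step limit {MAX_STEPS} reached; aborting run")
```

Note that the stop condition is enforced by your code, not the model: the `for` loop bounds the run regardless of what the model decides.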

Tool Use and Function Calling

The mechanism that turns a language model into an agent is tool use, also called function calling. When you call a model API with tools enabled, you pass in a list of function definitions alongside your prompt. Each definition specifies the function name, a description in natural language, and the parameters it accepts (names, types, required vs. optional).

The model reads these definitions and, when it decides to use a tool, returns a structured response that says: "Call this function with these arguments." Your application code receives that structured response, executes the actual function (the model never calls anything directly; it only requests it), and sends the result back to the model in the next turn.

Here is a simplified example. You give the model a tool called search_web with a parameter query: string. The model, in the middle of answering a research question, decides it needs current information. It returns something like {"tool": "search_web", "arguments": {"query": "latest developments in protein folding 2025"}}. Your code runs the search, gets results back, appends them to the conversation, and calls the model again. The model now has the search results in context and can incorporate them into its answer.
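The round trip above can be shown concretely. This sketch uses a JSON-schema style tool definition similar to what most APIs accept, with `model_response` standing in for the structured output a real model would return.

```python
# Sketch of the tool-call round trip: the model requests a call,
# your code executes it, and the result goes back into the conversation.
import json

# Tool definition passed to the API alongside the prompt (JSON-schema style).
tools = [{
    "name": "search_web",
    "description": "Search the web for current information.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

# Stand-in for what the model might return when it decides to use the tool:
model_response = ('{"tool": "search_web", "arguments": '
                  '{"query": "latest developments in protein folding 2025"}}')

def execute_tool_call(raw, registry):
    # The model never calls anything directly; it only requests a call.
    # Your code parses the request, runs the real function, returns the result.
    call = json.loads(raw)
    return registry[call["tool"]](**call["arguments"])

registry = {"search_web": lambda query: f"3 results for: {query}"}
result = execute_tool_call(model_response, registry)
# `result` is then appended to the conversation and sent back to the model.
```

The registry pattern keeps the mapping from tool names to real functions in one place, which also gives you a single choke point for logging and validation.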

This pattern is powerful because it lets models do things they fundamentally cannot do with pure text prediction: retrieve live data, execute code, write to external systems, and interface with the real world. The model's role is decision-making and reasoning; your infrastructure's role is execution.

Parallel tool calls are supported by most modern APIs. The model can request multiple tool calls in a single response, which your code can execute concurrently and return together. This significantly reduces latency for tasks that require multiple independent lookups.
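Concurrent execution of a batch of independent tool calls can be as simple as a thread pool. This is a sketch; in practice each call would hit a real (slow) network service, which is where the latency win comes from.

```python
# Sketch: executing multiple independent tool calls concurrently,
# then returning all results together in one batch.
from concurrent.futures import ThreadPoolExecutor

def search_web(query):
    return f"results for {query}"  # stand-in for a real network call

# Suppose the model requested three independent lookups in one response:
requested_calls = [
    ("search_web", {"query": "q1"}),
    ("search_web", {"query": "q2"}),
    ("search_web", {"query": "q3"}),
]

registry = {"search_web": search_web}

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(registry[name], **args)
               for name, args in requested_calls]
    results = [f.result() for f in futures]  # preserves request order
```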

Inspect a raw tool call response

Call any LLM API with a simple tool defined, something like a calculator with add, subtract, multiply. Send a message that requires math. Inspect the raw JSON response before processing it. You will see the tool call structure the model returns, which demystifies how agents actually work under the hood. Most developers find this moment clarifying.

Single-Agent vs. Multi-Agent Architectures

The simplest agent is a single model in a loop with access to tools. For many tasks (browsing the web, writing and running code, managing files) a well-designed single agent is sufficient and considerably easier to debug than a multi-agent system.

But some problems genuinely benefit from multiple agents working together. Multi-agent systems split work across specialized models, run tasks in parallel, or use hierarchical structures where a coordinator delegates to workers.

When single-agent works best: The task fits within one context window. Steps are naturally sequential and interdependent. Debugging simplicity matters. Token costs are a concern.

When multi-agent is worth the complexity: Tasks can be parallelized with genuine independence between subtasks. Specialized agents (one for code, one for research, one for writing) outperform a generalist. The task is too long for any single context window. You need redundancy, with multiple agents cross-checking each other's outputs.

Three core orchestration patterns:

Sequential: Agent A completes its task and hands output to Agent B, which hands output to Agent C. Simple to reason about, but slow: each step blocks the next. Use this when outputs are tightly coupled.

Parallel: A coordinator dispatches multiple agents simultaneously, collects results, and synthesizes them. Faster for independent subtasks, but requires a synthesis step that can itself introduce errors. Use this for research aggregation, parallel data processing, or any task with clearly separable subproblems.

Hierarchical: An orchestrator agent decomposes a high-level goal into subtasks, dispatches them to worker agents, evaluates results, and decides what to do next. The most powerful pattern and the most complex. The orchestrator itself needs to be a capable model, because routing errors at this level cascade everywhere.
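The sequential and parallel patterns can be sketched side by side. Each "agent" here is a plain function standing in for a full model-plus-tools loop; the orchestration logic is the point, not the agents themselves.

```python
# Sketch of sequential vs. parallel orchestration. Agent names and the
# synthesis step are illustrative stand-ins.
from concurrent.futures import ThreadPoolExecutor

def research_agent(topic):
    return f"notes on {topic}"

def writing_agent(notes):
    return f"draft from [{notes}]"

def run_sequential(topic):
    # Agent A's output becomes Agent B's input; each step blocks the next.
    notes = research_agent(topic)
    return writing_agent(notes)

def run_parallel(topics, synthesize):
    # A coordinator dispatches independent subtasks concurrently,
    # then a synthesis step combines the results.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(research_agent, topics))
    return synthesize(results)

sequential_out = run_sequential("protein folding")
parallel_out = run_parallel(["folding", "docking"], synthesize=" | ".join)
```

A hierarchical orchestrator adds a decision layer on top of `run_parallel`: instead of a fixed task list, a capable model produces the subtasks, evaluates the results, and decides whether to dispatch another round.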

To see these patterns in action, look at real-world autonomous agents. Manus is an AI agent that can browse the web, write code, manage files, and complete complex multi-step tasks with minimal human supervision. It is a practical example of the hierarchical pattern where a planner decomposes goals and a worker executes them. Genspark takes a research-focused approach, deploying multiple specialized agents in parallel to gather, synthesize, and present information from across the web. Both demonstrate that multi-agent architectures are moving from research prototypes to production-ready products.

Frameworks: LangChain, CrewAI, AutoGen, and the Agent SDK

Frameworks abstract away the boilerplate of building agents so you can focus on what the agents actually do. Each has a different philosophy.

LangChain is the oldest and most comprehensive. It provides abstractions for chains, agents, memory, and tools, with integrations for virtually every LLM provider and data source. The tradeoff is complexity: LangChain's abstraction layers can make debugging difficult because errors surface far from their origin. It is best for teams that need breadth of integrations and are comfortable reading through framework internals.

CrewAI is built around the metaphor of a crew of agents with roles, goals, and backstories. It makes multi-agent collaboration feel more declarative: you define agents and tasks in a structured way, and the framework handles orchestration. CrewAI is easier to get started with for multi-agent scenarios but less flexible when you need fine-grained control over the agent loop.

AutoGen (from Microsoft Research) focuses on conversational multi-agent systems where agents talk to each other to solve problems. It is particularly strong for code-generation workflows where a coder agent and a reviewer agent iterate together. The conversational structure makes certain patterns natural but can feel awkward for non-dialogue tasks.

Claude Agent SDK (Anthropic's official SDK) provides a structured way to build agents using Claude models, with first-class support for tool use, subagent spawning, and context management. It is the most tightly integrated option for Claude-based systems and benefits from Anthropic's design philosophy around safety and predictable behavior.

The right choice depends on your model provider, your team's debugging tolerance, and how much flexibility you need. For new projects, start with the simplest thing that could work (often that is direct API calls with your own thin orchestration layer) before reaching for a framework.

Failure Modes: Error Cascading and Cost Management

Multi-agent systems fail in ways that single models do not, and the failure patterns are worth understanding before you build anything in production.

Error cascading is the most dangerous failure mode. When an early agent in a pipeline produces incorrect output, downstream agents receive that bad input and build on it. By the time you see the final output, the original error may be unrecognizable. Mitigation strategies include: validation steps between agents (have the model or a deterministic function check outputs before passing them on), explicit confidence signals (ask the model to rate its certainty), and breaking pipelines at natural checkpoints where a human or rule-based system can verify.
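A validation checkpoint between stages can be a deterministic function, as described above. This sketch simulates a failed upstream agent so the gate has something to catch; the length check is a placeholder for whatever domain-specific validation your pipeline needs.

```python
# Sketch of a validation checkpoint between pipeline stages: a deterministic
# check rejects bad output before a downstream agent can build on it.

def summarize_agent(text):
    return ""  # simulate a failed upstream step returning empty output

def validate(output):
    # Deterministic gate: downstream agents never see output that fails here.
    if not output or len(output) < 10:
        raise ValueError("upstream output failed validation; halting pipeline")
    return output

try:
    clean = validate(summarize_agent("long document..."))
except ValueError as err:
    caught = str(err)  # surface the failure instead of cascading it
```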

Context poisoning happens when incorrect information gets added to the shared context and subsequent steps treat it as ground truth. This is especially common in long-running agents that accumulate tool call results. Periodically summarizing and pruning context (keeping only the information actually needed for remaining steps) helps prevent this.
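Periodic pruning can be as simple as keeping the system prompt and the most recent turns while collapsing everything older. In this sketch the summary is a placeholder string; a real agent would have a model produce the summary of the pruned messages.

```python
# Sketch of periodic context pruning: keep the system prompt and the most
# recent turns, collapse everything older into a single summary entry.

def prune_context(messages, keep_recent=4):
    if len(messages) <= keep_recent + 1:
        return messages
    system = messages[0]
    old = messages[1:-keep_recent]
    recent = messages[-keep_recent:]
    # Placeholder summary; in practice a model would summarize `old`.
    summary = {"role": "system",
               "content": f"[summary of {len(old)} earlier messages]"}
    return [system, summary, *recent]

history = [{"role": "system", "content": "You are an agent."}] + [
    {"role": "tool", "content": f"result {i}"} for i in range(10)
]
pruned = prune_context(history)  # 11 messages collapse to 6
```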

Cost spirals are a real operational risk. An agent running 50 steps with a 200,000-token context window on each step, using a frontier model, can cost tens of dollars for a single task. Multiply by concurrent users and failed runs and the numbers add up fast. Strategies: use a cheaper model for orchestration and reserve expensive models for specialized tasks, set hard token budgets per run, cache deterministic tool call results, and always estimate cost per run before deploying.
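A hard per-run token budget is straightforward to enforce in code. This is a sketch: every model call reports its usage, and the run aborts the moment spend crosses the cap instead of spiraling.

```python
# Sketch of a hard per-run token budget.

class TokenBudget:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens):
        # Called after every model response with the reported usage.
        self.used += tokens
        if self.used > self.max_tokens:
            raise RuntimeError(
                f"token budget exceeded: {self.used}/{self.max_tokens}")

budget = TokenBudget(max_tokens=50_000)
budget.charge(30_000)      # step 1
budget.charge(15_000)      # step 2
try:
    budget.charge(10_000)  # step 3 would cross the cap
except RuntimeError as err:
    stopped = str(err)     # abort the run, log the overrun
```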

Tool failures (API timeouts, rate limits, malformed responses) need explicit handling. An agent that receives an error from a tool call should have a defined retry policy and a graceful degradation path if retries fail. Without this, agents either crash or silently continue with missing information.
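A defined retry policy with a degradation path might look like the following sketch. The flaky tool and the error payload format are illustrative; the key point is that exhausted retries produce a structured error the model can reason about rather than a crash or silent gap.

```python
# Sketch of a retry policy with exponential backoff and a graceful
# degradation path when retries are exhausted.
import time

def call_tool_with_retry(tool, args, retries=3, base_delay=0.01):
    for attempt in range(retries):
        try:
            return tool(**args)
        except TimeoutError:
            if attempt == retries - 1:
                # Graceful degradation: tell the model the tool failed
                # instead of crashing or silently dropping the result.
                return {"error": "tool unavailable after retries"}
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

attempts = {"n": 0}
def flaky_search(query):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError  # fail twice, then succeed
    return {"results": f"results for {query}"}

out = call_tool_with_retry(flaky_search, {"query": "x"})
```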

Design for failure from the start. The best multi-agent systems are not the ones that never fail. They are the ones that fail detectably and recover cleanly.

Always instrument your agents

Log every tool call, every model response, every token count, and every step duration. Without observability, debugging a failed multi-agent run is nearly impossible. Treat agent traces the way you treat application logs: they are not optional. They are how you know what actually happened.
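A per-step trace record covering the fields above can be a small helper. This sketch accumulates structured entries in memory and serializes them as JSON lines; in production you would write them to your existing log pipeline.

```python
# Sketch of per-step trace logging: every tool call, token count, and
# duration is recorded so a failed run can be reconstructed afterwards.
import json
import time

trace = []

def log_step(step, tool, tokens, started):
    trace.append({
        "step": step,
        "tool": tool,                  # None for a pure model turn
        "tokens": tokens,
        "duration_ms": round((time.monotonic() - started) * 1000, 1),
    })

started = time.monotonic()
log_step(step=1, tool="search_web", tokens=1200, started=started)
log_step(step=2, tool=None, tokens=800, started=started)

# Traces serialize cleanly to JSON lines for any log aggregator.
lines = [json.dumps(entry) for entry in trace]
```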

Key Takeaways

  • Agents differ from chatbots in their ability to loop: they observe results, make decisions, and take actions across multiple steps toward a goal
  • Tool use works by passing function definitions to the model; the model requests calls, your code executes them, and results feed back into the next loop iteration
  • Single-agent architectures are simpler to debug; multi-agent architectures are warranted when tasks can be parallelized, specialized, or are too long for one context window
  • Orchestration patterns (sequential, parallel, hierarchical) each have distinct latency and error propagation tradeoffs worth understanding before building
  • Production agents require hard step limits, cost budgets, tool failure handling, and structured logging; failure modes like error cascading and context poisoning need to be designed against explicitly