Why Single-Agent Architectures Break Down
When we first started building LLM-powered applications at TwilightCore, a single ReAct agent felt like magic. Give it tools, a system prompt, and watch it reason. Then we hit production traffic — and everything fell apart.
Single agents hit a task-complexity ceiling. Once you give an agent more than 5-7 tools, tool-selection accuracy degrades. Context windows fill with irrelevant tool outputs, and error recovery becomes a game of hope. We needed something better: multi-agent orchestration.
LangGraph gives us the primitives to build these systems properly — as stateful, observable graphs with explicit control flow.
Architectural Patterns
Supervisor Architecture
The supervisor pattern uses a central orchestrator agent that delegates to specialized worker agents. It is the simplest to reason about and the easiest to debug.
```python
from typing import Literal

from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import create_react_agent
from langgraph.types import Command
from pydantic import BaseModel

llm = ChatOpenAI(model="gpt-4o")

class RouterOutput(BaseModel):
    """Structured routing decision produced by the supervisor LLM."""
    next: Literal["researcher", "writer", "reviewer", "FINISH"]

def supervisor(
    state: MessagesState,
) -> Command[Literal["researcher", "writer", "reviewer", "__end__"]]:
    """Central orchestrator that routes to specialized agents."""
    system_prompt = """You are a supervisor managing three workers:
- researcher: finds and validates technical information
- writer: produces polished content from research
- reviewer: checks accuracy and suggests improvements
Based on the conversation, decide which worker should act next.
If the task is complete, respond with FINISH."""
    response = llm.with_structured_output(RouterOutput).invoke(
        [{"role": "system", "content": system_prompt}] + state["messages"]
    )
    if response.next == "FINISH":
        return Command(goto=END)
    return Command(goto=response.next)

def make_agent(name: str, tools: list, system_prompt: str):
    """Factory for building worker agents with consistent structure."""
    agent = create_react_agent(llm, tools, prompt=system_prompt)

    def node(state: MessagesState) -> Command[Literal["supervisor"]]:
        result = agent.invoke(state)
        return Command(
            update={"messages": result["messages"]},
            goto="supervisor",
        )

    return node

# research_tools, research_prompt, etc. are defined elsewhere in the codebase.
graph = StateGraph(MessagesState)
graph.add_node("supervisor", supervisor)
graph.add_node("researcher", make_agent("researcher", research_tools, research_prompt))
graph.add_node("writer", make_agent("writer", writing_tools, writer_prompt))
graph.add_node("reviewer", make_agent("reviewer", review_tools, reviewer_prompt))
graph.add_edge(START, "supervisor")
app = graph.compile()
```

Peer-to-Peer Architecture
In peer architectures, agents communicate directly. Each agent decides who to hand off to next. This works well when workflows are dynamic and the optimal path is not predictable.
Peer Architectures Need Guardrails
Without a supervisor, peer-to-peer systems can enter infinite loops. Always implement a maximum step count and cycle detection. We cap our graphs at 25 steps and track visited node sequences to break loops early.
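These two guardrails are easy to sketch outside any framework. Below is a minimal, stdlib-only illustration of a peer-to-peer driver loop; the agent functions and the `"END"` sentinel are hypothetical, and each agent returns a state update plus the name of the next agent:

```python
MAX_STEPS = 25
CYCLE_WINDOW = 4  # how many recent hops to compare when looking for a loop

def run_peer_graph(agents, start, state):
    """Run hand-off agents until END, a step cap, or a detected cycle."""
    history = []
    current = start
    for _ in range(MAX_STEPS):
        if current == "END":
            return state
        history.append(current)
        # Cycle detection: stop when the last CYCLE_WINDOW hops exactly
        # repeat the CYCLE_WINDOW hops before them.
        if (len(history) >= 2 * CYCLE_WINDOW
                and history[-CYCLE_WINDOW:] == history[-2 * CYCLE_WINDOW:-CYCLE_WINDOW]):
            state["terminated"] = "cycle_detected"
            return state
        update, current = agents[current](state)
        state.update(update)
    state["terminated"] = "max_steps_reached"
    return state
```

LangGraph enforces a similar global cap natively: the `recursion_limit` entry in the run config (which defaults to 25) aborts runaway graphs with a recursion error.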
When to Use Which
| Pattern | Best For | Downsides |
|---|---|---|
| Supervisor | Predictable workflows, clear task decomposition | Bottleneck at supervisor, higher latency |
| Peer-to-peer | Dynamic exploration, creative tasks | Harder to debug, loop risk |
| Hierarchical | Large systems (10+ agents) | Complex state management, slow cold starts |
| Map-reduce | Parallelizable subtasks (e.g., batch analysis) | Fan-out cost, result merging complexity |
State Management That Actually Works
The most underrated aspect of multi-agent systems is state design. We learned this the hard way: pass too much state and you blow context windows; pass too little and agents make bad decisions.
Typed State with Reducers
```python
import operator
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages

class ResearchState(TypedDict):
    messages: Annotated[list, add_messages]
    research_notes: Annotated[list[str], operator.add]
    current_phase: str
    confidence_score: float
    sources: Annotated[list[dict], operator.add]
    error_log: Annotated[list[str], operator.add]
    iteration_count: int

def researcher(state: ResearchState) -> dict:
    """Researcher agent that accumulates findings into typed state."""
    # research_agent, extract_notes, and evaluate_confidence are defined elsewhere.
    notes = state.get("research_notes", [])
    result = research_agent.invoke({
        "messages": state["messages"],
        "existing_research": "\n".join(notes),
    })
    new_notes = extract_notes(result)
    new_confidence = evaluate_confidence(notes + new_notes)
    return {
        "messages": result["messages"],
        "research_notes": new_notes,
        "confidence_score": new_confidence,
        "iteration_count": state.get("iteration_count", 0) + 1,
    }

def should_continue_research(state: ResearchState) -> str:
    if state["confidence_score"] >= 0.85:
        return "writer"
    if state["iteration_count"] >= 3:
        return "writer"  # move on with what we have
    return "researcher"
```

The `Annotated` type with `operator.add` is critical here. It tells LangGraph to append to lists rather than replace them, so each agent's contributions accumulate.
Error Recovery Patterns
In production, agents fail constantly. Tools return errors, LLMs hallucinate tool calls, rate limits hit at the worst possible moment. We use three patterns to handle this.
1. Retry with Backoff at the Node Level
Wrap individual agent nodes with retry logic rather than restarting the entire graph. LangGraph's checkpointing means you can resume from the last successful node.
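The wrapping itself is framework-agnostic. Here is a minimal sketch (not LangGraph's own API; the names are illustrative) of a retry decorator for a node function:

```python
import random
import time

def with_retry(node_fn, max_attempts=3, base_delay=1.0,
               retryable=(TimeoutError, ConnectionError)):
    """Wrap a graph node so transient failures retry with exponential backoff."""
    def wrapped(state):
        for attempt in range(1, max_attempts + 1):
            try:
                return node_fn(state)
            except retryable:
                if attempt == max_attempts:
                    raise  # out of attempts: let graph-level error handling take over
                # Exponential backoff (base, 2x, 4x, ...) plus jitter to avoid
                # synchronized retries across concurrent runs.
                time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))
    return wrapped
```

Newer LangGraph releases also expose a per-node retry policy you can attach when adding a node, which keeps the same behavior inside the framework.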
2. Fallback Chains
When a primary tool fails, route to an alternative. We maintain a tool priority list per agent — if the vector search fails, fall back to keyword search, then to cached results.
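A fallback chain reduces to a priority-ordered loop. The sketch below is illustrative (the tool names are hypothetical stand-ins for vector search, keyword search, and a cache):

```python
def search_with_fallback(query, tool_chain):
    """Try each (name, tool_fn) in priority order; fall through on failure."""
    errors = []
    for name, tool_fn in tool_chain:
        try:
            return {"source": name, "results": tool_fn(query)}
        except Exception as exc:  # real code would catch narrower tool errors
            errors.append(f"{name}: {exc}")
    # Every tool failed: surface the error trail instead of raising.
    return {"source": None, "results": [], "errors": errors}
```

Recording which source actually answered (`"source"`) is worth the extra field: it lets you alert when the primary tool's failure rate creeps up.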
3. Human-in-the-Loop Escalation
For high-stakes decisions, we use LangGraph's interrupt to pause execution and wait for human input.
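In recent LangGraph versions this is done with `interrupt()` from `langgraph.types` paired with a checkpointer. The framework-free sketch below shows the underlying shape of the pattern; the threshold values and state fields are hypothetical:

```python
class HumanReviewRequired(Exception):
    """Raised to pause the workflow until a human approves the pending action."""
    def __init__(self, payload):
        super().__init__("human review required")
        self.payload = payload

def risky_action_node(state):
    """Execute the proposed action, or escalate when the stakes are high."""
    if state.get("estimated_cost", 0) > 100 or state.get("confidence", 1.0) < 0.6:
        raise HumanReviewRequired({
            "action": state.get("proposed_action"),
            "reason": "high cost or low confidence",
        })
    return {"status": "executed", "action": state.get("proposed_action")}
```

The driver catches `HumanReviewRequired`, persists the payload for a reviewer, and resumes the graph once a decision comes back; LangGraph's `interrupt` does exactly this, with the checkpointer holding the paused state.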
Checkpointing Saves You in Production
Always compile your graph with a checkpointer. When a node fails at 3 AM, you can resume from the last checkpoint instead of replaying the entire conversation. We use PostgreSQL-backed checkpointers for durability.
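The mechanics are worth internalizing even without the framework. This stdlib sketch (a plain dict stands in for the durable store) shows why resume-from-checkpoint works: completed steps are skipped, so only the failed node is replayed:

```python
def run_with_checkpoints(nodes, state, store, thread_id):
    """Run nodes in order, saving state after each; resume past completed steps."""
    checkpoint = store.get(thread_id, {"step": 0, "state": state})
    state = checkpoint["state"]
    for i, (name, fn) in enumerate(nodes):
        if i < checkpoint["step"]:
            continue  # already completed in a previous run: skip, don't replay
        state = {**state, **fn(state)}
        store[thread_id] = {"step": i + 1, "state": state}  # durable write per node
    return state
```

In production the `store` would be a database table keyed by thread ID, which is what the Postgres-backed LangGraph checkpointer gives you.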
Tool Routing and Specialization
A common mistake is giving every agent access to every tool. This degrades performance significantly — models struggle to select from large tool sets. Instead, we scope tools tightly to each agent's role.
Our rule of thumb: no agent should have more than 5 tools. If an agent needs more, it is doing too much and should be split.
Dynamic Tool Selection
For agents that operate across domains, we dynamically select tools based on the current task context rather than loading everything upfront. A lightweight classifier examines the user query and attaches only the relevant tool subset before the agent executes.
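A lightweight classifier can be as simple as keyword overlap. The domains, keywords, and tool names below are made up for illustration; in practice we would more likely use embedding similarity or a small model:

```python
TOOL_GROUPS = {
    "billing": ["lookup_invoice", "refund_tool"],
    "infra": ["check_status", "restart_service"],
    "docs": ["search_docs", "summarize_page"],
}

DOMAIN_KEYWORDS = {
    "billing": {"invoice", "refund", "charge", "payment"},
    "infra": {"outage", "deploy", "server", "restart"},
    "docs": {"documentation", "how", "guide", "example"},
}

def select_tools(query: str, max_tools: int = 5) -> list[str]:
    """Score each domain by keyword overlap and return its tool subset."""
    words = set(query.lower().split())
    scores = {domain: len(words & kw) for domain, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return TOOL_GROUPS["docs"][:max_tools]  # no signal: fall back to a safe default
    return TOOL_GROUPS[best][:max_tools]
```

The agent is then constructed (or re-bound) with only the returned subset, keeping every invocation under the 5-tool ceiling.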
Production Deployment Considerations
Observability
Every agent invocation gets a trace. We instrument with LangSmith, tagging each trace with the graph name, run ID, and the node that produced it. When something goes wrong, we can replay the exact sequence of agent decisions.
Latency Budgets
Multi-agent systems are inherently slower than single agents. We set per-node latency budgets and use streaming to give users feedback while agents work. A typical supervisor graph with three workers takes 8-15 seconds end-to-end — acceptable for async workflows, painful for chat.
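Per-node budgets are cheap to enforce with a timing wrapper; this sketch (names are illustrative) simply records overruns for later inspection rather than aborting the run:

```python
import time

def with_latency_budget(node_fn, name, budget_s, overruns):
    """Time a node and record when it exceeds its per-node latency budget."""
    def wrapped(state):
        start = time.monotonic()
        result = node_fn(state)
        elapsed = time.monotonic() - start
        if elapsed > budget_s:
            overruns.append((name, round(elapsed, 3)))
        return result
    return wrapped
```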
Cost Control
Each agent call costs money. We track token usage per node and set alerts when a single graph run exceeds a cost threshold. Map-reduce patterns are particularly dangerous — fanning out to 20 parallel agents can burn through budget quickly.
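A per-node tracker needs only token counts and a price table. The per-1K prices below are placeholders, not real rates, and the threshold mirrors the cost target in the table that follows:

```python
PRICE_PER_1K = {"input": 0.0025, "output": 0.01}  # placeholder prices, not real rates

class CostTracker:
    """Accumulate estimated spend per node and flag runs over budget."""

    def __init__(self, alert_threshold_usd=0.15):
        self.alert_threshold = alert_threshold_usd
        self.per_node = {}

    def record(self, node, input_tokens, output_tokens):
        cost = (input_tokens / 1000) * PRICE_PER_1K["input"] \
             + (output_tokens / 1000) * PRICE_PER_1K["output"]
        self.per_node[node] = self.per_node.get(node, 0.0) + cost

    def total(self):
        return sum(self.per_node.values())

    def over_budget(self):
        return self.total() > self.alert_threshold
```

Token counts come from each model response's usage metadata; `over_budget()` is checked after every node so a fan-out can be halted mid-run instead of at the bill.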
| Metric | Our Target | Why It Matters |
|---|---|---|
| P95 latency | < 20s | User patience threshold for async tasks |
| Cost per run | < $0.15 | Sustainable at 10k daily runs |
| Success rate | > 95% | Below this, users lose trust |
| Max steps | 25 | Prevents runaway loops |
| Tool error rate | < 5% | Signal to fix tool reliability |
Lessons From 18 Months in Production
After running multi-agent systems across several client projects, our biggest lessons are:
- Start with a single agent and split only when you hit a wall. Most tasks do not need multi-agent orchestration.
- State schema is your most important design decision. Get it wrong and every agent suffers.
- Deterministic edges beat LLM routing for known workflows. Use conditional edges based on state, not another LLM call, whenever the logic is clear.
- Test agents in isolation before composing them. Each worker should have its own eval suite.
Multi-agent systems trade simplicity for capability. Use them when a single agent genuinely cannot handle the task — not because the architecture looks impressive. When you do reach for them, LangGraph's explicit graph structure and checkpointing make the complexity manageable in ways that pure prompt-chaining never could.