Why Single-Agent Architectures Break Down
When we first started building LLM-powered applications at TwilightCore, a single ReAct agent felt like magic. Give it tools, a system prompt, and watch it reason. Then we hit production traffic — and everything fell apart.
Single agents hit a task-complexity ceiling. Once you give an agent more than 5-7 tools, tool-selection accuracy degrades. Context windows fill with irrelevant tool outputs, and error recovery becomes a game of hope. We needed something better: multi-agent orchestration.
LangGraph gives us the primitives to build these systems properly — as stateful, observable graphs with explicit control flow.
Architectural Patterns
Supervisor Architecture
The supervisor pattern uses a central orchestrator agent that delegates to specialized worker agents. It is the simplest to reason about and the easiest to debug.
```python
from typing import Literal

from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import create_react_agent
from langgraph.types import Command
from pydantic import BaseModel

llm = ChatOpenAI(model="gpt-4o")

class RouterOutput(BaseModel):
    """Structured routing decision produced by the supervisor LLM."""
    next: Literal["researcher", "writer", "reviewer", "FINISH"]

def supervisor(
    state: MessagesState,
) -> Command[Literal["researcher", "writer", "reviewer", "__end__"]]:
    """Central orchestrator that routes to specialized agents."""
    system_prompt = """You are a supervisor managing three workers:
- researcher: finds and validates technical information
- writer: produces polished content from research
- reviewer: checks accuracy and suggests improvements
Based on the conversation, decide which worker should act next.
If the task is complete, respond with FINISH."""
    response = llm.with_structured_output(RouterOutput).invoke(
        [{"role": "system", "content": system_prompt}] + state["messages"]
    )
    if response.next == "FINISH":
        return Command(goto=END)
    return Command(goto=response.next)

def make_agent(name: str, tools: list, system_prompt: str):
    """Factory for building worker agents with consistent structure."""
    agent = create_react_agent(llm, tools, prompt=system_prompt)

    def node(state: MessagesState) -> Command[Literal["supervisor"]]:
        result = agent.invoke(state)
        return Command(
            update={"messages": result["messages"]},
            goto="supervisor",
        )

    return node

# research_tools, research_prompt, etc. are defined elsewhere in the codebase.
graph = StateGraph(MessagesState)
graph.add_node("supervisor", supervisor)
graph.add_node("researcher", make_agent("researcher", research_tools, research_prompt))
graph.add_node("writer", make_agent("writer", writing_tools, writer_prompt))
graph.add_node("reviewer", make_agent("reviewer", review_tools, reviewer_prompt))
graph.add_edge(START, "supervisor")
app = graph.compile()
```

Peer-to-Peer Architecture
In peer architectures, agents communicate directly. Each agent decides who to hand off to next. This works well when workflows are dynamic and the optimal path is not predictable.
Peer Architectures Need Guardrails
Without a supervisor, peer-to-peer systems can enter infinite loops. Always implement a maximum step count and cycle detection. We cap our graphs at 25 steps and track visited node sequences to break loops early.
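These two guardrails are easy to sketch outside any framework. Below is a minimal, stdlib-only illustration of a peer-to-peer driver loop; the agent functions and the `"END"` sentinel are hypothetical, and each agent returns a state update plus the name of the next agent:

```python
MAX_STEPS = 25
CYCLE_WINDOW = 4  # how many recent hops to compare when looking for a loop

def run_peer_graph(agents, start, state):
    """Run hand-off agents until END, a step cap, or a detected cycle."""
    history = []
    current = start
    for _ in range(MAX_STEPS):
        if current == "END":
            return state
        history.append(current)
        # Cycle detection: stop when the last CYCLE_WINDOW hops exactly
        # repeat the CYCLE_WINDOW hops before them.
        if (len(history) >= 2 * CYCLE_WINDOW
                and history[-CYCLE_WINDOW:] == history[-2 * CYCLE_WINDOW:-CYCLE_WINDOW]):
            state["terminated"] = "cycle_detected"
            return state
        update, current = agents[current](state)
        state.update(update)
    state["terminated"] = "max_steps_reached"
    return state
```

LangGraph enforces a similar global cap natively: the `recursion_limit` entry in the run config (which defaults to 25) aborts runaway graphs with a recursion error.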
When to Use Which
| Pattern | Best For | Downsides |
|---|---|---|
| Supervisor | Predictable workflows, clear task decomposition | Bottleneck at supervisor, higher latency |
| Peer-to-peer | Dynamic exploration, creative tasks | Harder to debug, loop risk |
| Hierarchical | Large systems (10+ agents) | Complex state management, slow cold starts |
| Map-reduce | Parallelizable subtasks (e.g., batch analysis) | Fan-out cost, result merging complexity |
State Management That Actually Works
The most underrated aspect of multi-agent systems is state design. We learned this the hard way: pass too much state and you blow context windows; pass too little and agents make bad decisions.
Typed State with Reducers
```python
import operator
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages

class ResearchState(TypedDict):
    messages: Annotated[list, add_messages]
    research_notes: Annotated[list[str], operator.add]
    current_phase: str
    confidence_score: float
    sources: Annotated[list[dict], operator.add]
    error_log: Annotated[list[str], operator.add]
    iteration_count: int

def researcher(state: ResearchState) -> dict:
    """Researcher agent that accumulates findings into typed state."""
    # research_agent, extract_notes, and evaluate_confidence are defined elsewhere.
    notes = state.get("research_notes", [])
    result = research_agent.invoke({
        "messages": state["messages"],
        "existing_research": "\n".join(notes),
    })
    new_notes = extract_notes(result)
    new_confidence = evaluate_confidence(notes + new_notes)
    return {
        "messages": result["messages"],
        "research_notes": new_notes,
        "confidence_score": new_confidence,
        "iteration_count": state.get("iteration_count", 0) + 1,
    }

def should_continue_research(state: ResearchState) -> str:
    if state["confidence_score"] >= 0.85:
        return "writer"
    if state["iteration_count"] >= 3:
        return "writer"  # move on with what we have
    return "researcher"
```

The `Annotated` type with `operator.add` is critical here. It tells LangGraph to append to lists rather than replace them, so each agent's contributions accumulate.
Error Recovery Patterns
In production, agents fail constantly. Tools return errors, LLMs hallucinate tool calls, rate limits hit at the worst possible moment. We use three patterns to handle this.
1. Retry with Backoff at the Node Level
Wrap individual agent nodes with retry logic rather than restarting the entire graph. LangGraph's checkpointing means you can resume from the last successful node.
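The wrapping itself is framework-agnostic. Here is a minimal sketch (not LangGraph's own API; the names are illustrative) of a retry decorator for a node function:

```python
import random
import time

def with_retry(node_fn, max_attempts=3, base_delay=1.0,
               retryable=(TimeoutError, ConnectionError)):
    """Wrap a graph node so transient failures retry with exponential backoff."""
    def wrapped(state):
        for attempt in range(1, max_attempts + 1):
            try:
                return node_fn(state)
            except retryable:
                if attempt == max_attempts:
                    raise  # out of attempts: let graph-level error handling take over
                # Exponential backoff (base, 2x, 4x, ...) plus jitter to avoid
                # synchronized retries across concurrent runs.
                time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))
    return wrapped
```

Newer LangGraph releases also expose a per-node retry policy you can attach when adding a node, which keeps the same behavior inside the framework.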
2. Fallback Chains
When a primary tool fails, route to an alternative. We maintain a tool priority list per agent — if the vector search fails, fall back to keyword search, then to cached results.
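A fallback chain reduces to a priority-ordered loop. The sketch below is illustrative (the tool names are hypothetical stand-ins for vector search, keyword search, and a cache):

```python
def search_with_fallback(query, tool_chain):
    """Try each (name, tool_fn) in priority order; fall through on failure."""
    errors = []
    for name, tool_fn in tool_chain:
        try:
            return {"source": name, "results": tool_fn(query)}
        except Exception as exc:  # real code would catch narrower tool errors
            errors.append(f"{name}: {exc}")
    # Every tool failed: surface the error trail instead of raising.
    return {"source": None, "results": [], "errors": errors}
```

Recording which source actually answered (`"source"`) is worth the extra field: it lets you alert when the primary tool's failure rate creeps up.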
3. Human-in-the-Loop Escalation
For high-stakes decisions, we use LangGraph's interrupt to pause execution and wait for human input.
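In recent LangGraph versions this is done with `interrupt()` from `langgraph.types` paired with a checkpointer. The framework-free sketch below shows the underlying shape of the pattern; the threshold values and state fields are hypothetical:

```python
class HumanReviewRequired(Exception):
    """Raised to pause the workflow until a human approves the pending action."""
    def __init__(self, payload):
        super().__init__("human review required")
        self.payload = payload

def risky_action_node(state):
    """Execute the proposed action, or escalate when the stakes are high."""
    if state.get("estimated_cost", 0) > 100 or state.get("confidence", 1.0) < 0.6:
        raise HumanReviewRequired({
            "action": state.get("proposed_action"),
            "reason": "high cost or low confidence",
        })
    return {"status": "executed", "action": state.get("proposed_action")}
```

The driver catches `HumanReviewRequired`, persists the payload for a reviewer, and resumes the graph once a decision comes back; LangGraph's `interrupt` does exactly this, with the checkpointer holding the paused state.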
Checkpointing Saves You in Production
Always compile your graph with a checkpointer. When a node fails at 3 AM, you can resume from the last checkpoint instead of replaying the entire conversation. We use PostgreSQL-backed checkpointers for durability.
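The mechanics are worth internalizing even without the framework. This stdlib sketch (a plain dict stands in for the durable store) shows why resume-from-checkpoint works: completed steps are skipped, so only the failed node is replayed:

```python
def run_with_checkpoints(nodes, state, store, thread_id):
    """Run nodes in order, saving state after each; resume past completed steps."""
    checkpoint = store.get(thread_id, {"step": 0, "state": state})
    state = checkpoint["state"]
    for i, (name, fn) in enumerate(nodes):
        if i < checkpoint["step"]:
            continue  # already completed in a previous run: skip, don't replay
        state = {**state, **fn(state)}
        store[thread_id] = {"step": i + 1, "state": state}  # durable write per node
    return state
```

In production the `store` would be a database table keyed by thread ID, which is what the Postgres-backed LangGraph checkpointer gives you.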
Tool Routing and Specialization
A common mistake is giving every agent access to every tool. This degrades performance significantly — models struggle to select from large tool sets. Instead, we scope tools tightly to each agent's role.
Our rule of thumb: no agent should have more than 5 tools. If an agent needs more, it is doing too much and should be split.
Dynamic Tool Selection
For agents that operate across domains, we dynamically select tools based on the current task context rather than loading everything upfront. A lightweight classifier examines the user query and attaches only the relevant tool subset before the agent executes.
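A lightweight classifier can be as simple as keyword overlap. The domains, keywords, and tool names below are made up for illustration; in practice we would more likely use embedding similarity or a small model:

```python
TOOL_GROUPS = {
    "billing": ["lookup_invoice", "refund_tool"],
    "infra": ["check_status", "restart_service"],
    "docs": ["search_docs", "summarize_page"],
}

DOMAIN_KEYWORDS = {
    "billing": {"invoice", "refund", "charge", "payment"},
    "infra": {"outage", "deploy", "server", "restart"},
    "docs": {"documentation", "how", "guide", "example"},
}

def select_tools(query: str, max_tools: int = 5) -> list[str]:
    """Score each domain by keyword overlap and return its tool subset."""
    words = set(query.lower().split())
    scores = {domain: len(words & kw) for domain, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return TOOL_GROUPS["docs"][:max_tools]  # no signal: fall back to a safe default
    return TOOL_GROUPS[best][:max_tools]
```

The agent is then constructed (or re-bound) with only the returned subset, keeping every invocation under the 5-tool ceiling.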
Production Deployment Considerations
Observability
Every agent invocation gets a trace. We instrument with LangSmith, tagging each trace with the graph name, run ID, and the node that produced it. When something goes wrong, we can replay the exact sequence of agent decisions.
Latency Budgets
Multi-agent systems are inherently slower than single agents. We set per-node latency budgets and use streaming to give users feedback while agents work. A typical supervisor graph with three workers takes 8-15 seconds end-to-end — acceptable for async workflows, painful for chat.
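Per-node budgets are cheap to enforce with a timing wrapper; this sketch (names are illustrative) simply records overruns for later inspection rather than aborting the run:

```python
import time

def with_latency_budget(node_fn, name, budget_s, overruns):
    """Time a node and record when it exceeds its per-node latency budget."""
    def wrapped(state):
        start = time.monotonic()
        result = node_fn(state)
        elapsed = time.monotonic() - start
        if elapsed > budget_s:
            overruns.append((name, round(elapsed, 3)))
        return result
    return wrapped
```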
Cost Control
Each agent call costs money. We track token usage per node and set alerts when a single graph run exceeds a cost threshold. Map-reduce patterns are particularly dangerous — fanning out to 20 parallel agents can burn through budget quickly.
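A per-node tracker needs only token counts and a price table. The per-1K prices below are placeholders, not real rates, and the threshold mirrors the cost target in the table that follows:

```python
PRICE_PER_1K = {"input": 0.0025, "output": 0.01}  # placeholder prices, not real rates

class CostTracker:
    """Accumulate estimated spend per node and flag runs over budget."""

    def __init__(self, alert_threshold_usd=0.15):
        self.alert_threshold = alert_threshold_usd
        self.per_node = {}

    def record(self, node, input_tokens, output_tokens):
        cost = (input_tokens / 1000) * PRICE_PER_1K["input"] \
             + (output_tokens / 1000) * PRICE_PER_1K["output"]
        self.per_node[node] = self.per_node.get(node, 0.0) + cost

    def total(self):
        return sum(self.per_node.values())

    def over_budget(self):
        return self.total() > self.alert_threshold
```

Token counts come from each model response's usage metadata; `over_budget()` is checked after every node so a fan-out can be halted mid-run instead of at the bill.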
| Metric | Our Target | Why It Matters |
|---|---|---|
| P95 latency | < 20s | User patience threshold for async tasks |
| Cost per run | < $0.15 | Sustainable at 10k daily runs |
| Success rate | > 95% | Below this, users lose trust |
| Max steps | 25 | Prevents runaway loops |
| Tool error rate | < 5% | Signal to fix tool reliability |
Lessons From 18 Months in Production
After running multi-agent systems across several client projects, our biggest lessons are:
- Start with a single agent and split only when you hit a wall. Most tasks do not need multi-agent orchestration.
- State schema is your most important design decision. Get it wrong and every agent suffers.
- Deterministic edges beat LLM routing for known workflows. Use conditional edges based on state, not another LLM call, whenever the logic is clear.
- Test agents in isolation before composing them. Each worker should have its own eval suite.
Multi-agent systems trade simplicity for capability. Use them when a single agent genuinely cannot handle the task — not because the architecture looks impressive. When you do reach for them, LangGraph's explicit graph structure and checkpointing make the complexity manageable in ways that pure prompt-chaining never could.