
Multi-Agent Debate System

A structured debate framework where three LLM agents are assigned roles (proponent, opponent, devil's advocate) and engage in multi-round argumentation. A judge agent scores arguments on logic, evidence, and novelty, then synthesizes a balanced conclusion. Built to stress-test reasoning capabilities across models.

Claude Sonnet · GPT-4o · Gemini 2.5 Pro
3 debate rounds · Real-time streaming · Cross-model comparison

Category: Agent
Status: Live
Tech Stack: LangGraph, Next.js, Redis, WebSockets
Models: Claude Sonnet, GPT-4o, Gemini 2.5 Pro

Overview

Can AI models reason better when they argue? This experiment pits three LLM agents against each other in structured multi-round debates, with a judge agent scoring arguments and synthesizing conclusions. The goal is to test whether adversarial discourse improves reasoning quality compared to single-model responses, and to compare reasoning capabilities across Claude, GPT-4o, and Gemini.

Methodology

I designed a debate framework with three roles (Proponent, Opponent, Devil's Advocate) plus a Judge. Each debate runs three rounds: opening statements, rebuttals, and closing arguments. The judge scores each argument on logic (0-10), evidence quality (0-10), and novelty (0-5), then synthesizes a balanced conclusion. I tested 30 debate topics spanning ethics, technology policy, and business strategy; each topic was debated three times with different model assignments to control for model-role bias.
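The judge's rubric can be made concrete as a small data structure. This is an illustrative sketch, not the project's actual code; the class and field names are assumptions, but the score ranges mirror the rubric above (logic 0-10, evidence 0-10, novelty 0-5, for a maximum of 25 points per argument).

```python
from dataclasses import dataclass

@dataclass
class ArgumentScore:
    """One judge score for a single argument, per the rubric above."""
    logic: int      # 0-10
    evidence: int   # 0-10
    novelty: int    # 0-5

    def __post_init__(self):
        # Reject scores outside the rubric's ranges.
        assert 0 <= self.logic <= 10
        assert 0 <= self.evidence <= 10
        assert 0 <= self.novelty <= 5

    @property
    def total(self) -> int:
        return self.logic + self.evidence + self.novelty

score = ArgumentScore(logic=8, evidence=7, novelty=4)
print(score.total)  # 19
```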

Tech Stack

LangGraph orchestrates the multi-agent debate flow with explicit state transitions between rounds. WebSockets enable real-time streaming of arguments to the frontend. Redis stores debate state for recovery from interruptions. Next.js renders the debate UI with live argument streaming.

Key Findings

The most important insights from this experiment.

1. Debate improves factual accuracy by 23%

When models argue opposing positions, the synthesized conclusion contains 23% fewer factual errors than a single model's response to the same question, because opposing agents actively challenge unsupported claims.

2. Claude excels at structured argumentation

Claude Sonnet consistently scored highest on logic (avg 8.2/10) and produced the most well-structured arguments. GPT-4o scored highest on evidence citation (avg 7.8/10). Gemini excelled at novel perspectives (avg 3.9/5).

3. The Devil's Advocate role is crucial

Removing the Devil's Advocate and running 2-agent debates reduced conclusion quality by 18%. The contrarian role forces both sides to address edge cases they would otherwise ignore.

4. Three rounds is the sweet spot

Testing 2, 3, and 5 rounds showed diminishing returns after round 3. Rounds 4-5 produced repetitive arguments and increased latency by 60% without improving conclusion quality.

Architecture

A LangGraph state machine manages the debate flow: Topic → Role Assignment → Round 1 (parallel opening statements) → Judge scores → Round 2 (sequential rebuttals with access to opponent arguments) → Judge scores → Round 3 (closing arguments) → Final synthesis. Each transition streams tokens to the frontend via WebSockets. The judge runs with a separate prompt at a higher temperature to avoid anchoring on any single agent's framing.

Results

Across 30 debate topics, debated conclusions scored 23% higher on factual accuracy than single-model responses (human-evaluated, n=5 evaluators). Average debate completion time was 45 seconds for three rounds, and real-time streaming achieved sub-200ms token-to-screen latency. Cross-model debates produced the highest-quality conclusions when Claude led the argumentation, GPT-4o handled evidence citation, and Gemini played devil's advocate.

Challenges

Key technical challenges encountered during this experiment.

Challenge 1: Models agreeing too readily in rebuttals

Agents would sometimes concede valid points too easily, collapsing the debate. This was solved by adding a "maintain your position" directive to the system prompt and penalizing concessions in the judge's scoring rubric.
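One way the concession penalty could work is a simple marker-based deduction. This is a hypothetical sketch: the marker phrases and penalty weight are assumptions, not the project's actual rubric; in practice the judge model itself would likely detect concessions.

```python
# Hypothetical concession markers and penalty weight (assumptions).
CONCESSION_MARKERS = ("you're right", "i concede", "fair point, i agree")
CONCESSION_PENALTY = 2  # points subtracted per detected concession

def apply_concession_penalty(score: int, argument: str) -> int:
    """Deduct points for each concession marker found, floored at 0."""
    text = argument.lower()
    hits = sum(marker in text for marker in CONCESSION_MARKERS)
    return max(0, score - hits * CONCESSION_PENALTY)

print(apply_concession_penalty(8, "Fair point, I agree, but..."))  # 6
```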

Challenge 2: Streaming coordination across 3 models

Three concurrent LLM streams with different response speeds caused rendering issues. I implemented a token buffer that batches and interleaves the streams at 100ms intervals for smooth UI updates.
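The buffering idea can be sketched as per-agent queues drained round-robin on each 100ms tick, so no single fast stream monopolizes the UI. This is a minimal illustration with invented names, omitting the async plumbing that feeds the queues.

```python
from collections import deque

class TokenBuffer:
    """Batches tokens per agent and interleaves them on each flush."""

    def __init__(self, agents, interval=0.1):
        self.queues = {a: deque() for a in agents}
        self.interval = interval  # flush cadence: every 100 ms

    def push(self, agent, token):
        self.queues[agent].append(token)

    def flush(self):
        """Drain all queues, taking one token per agent per pass."""
        batch = []
        while any(self.queues.values()):
            for agent, q in self.queues.items():
                if q:
                    batch.append((agent, q.popleft()))
        return batch

buf = TokenBuffer(["proponent", "opponent", "devil"])
buf.push("proponent", "The")
buf.push("proponent", "claim")
buf.push("opponent", "However")
print(buf.flush())
# [('proponent', 'The'), ('opponent', 'However'), ('proponent', 'claim')]
```

Round-robin draining is what keeps the slower streams visible: the fast stream's backlog is interleaved rather than dumped all at once.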
