
Multi-Agent Code Reviewer


2025 · 2.5 months
87% bug catch rate · 92% test coverage · 7x faster than manual review

My Role

Sole developer — designed the 3-agent review pipeline, integrated static analysis tools, and built the GitHub automation layer.

Duration

2.5 months

Year

2025

Tech Stack

LangGraph · Claude 3.5 Sonnet · Semgrep · CodeQL · GitHub API · Pytest · Neon Postgres

Status

Live in Production
Overview

Drop any GitHub PR link and watch three specialized AI agents collaborate: Planner creates review checklist, Analyzer finds bugs, Fixer writes fixes with unit tests.

The Challenge

Code review is one of the biggest bottlenecks in software development: senior engineers spend 8-12 hours per week reviewing PRs, and the average review turnaround is 2-3 days. Critical bugs slip through due to reviewer fatigue, and junior developers wait days for feedback, slowing their learning cycle.

The Approach

I built a 3-agent system in which specialized AI reviewers collaborate: a Planner creates a review strategy based on the PR's scope; an Analyzer catches security vulnerabilities, performance issues, and logic bugs using Claude plus static analysis tools; and a Fixer generates corrected code with unit tests. The system posts GitHub PR comments and even opens fix PRs automatically.
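The handoff pattern can be sketched in plain Python. This is a minimal illustration of the Planner → Analyzer → Fixer flow, not the production code; the real system coordinates these agents with a LangGraph supervisor, and the agent logic here (a keyword check standing in for Claude + static analysis) is purely illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewState:
    """Shared state passed between agents (illustrative shape)."""
    diff: str
    checklist: list = field(default_factory=list)
    issues: list = field(default_factory=list)
    fixes: list = field(default_factory=list)

def planner(state):
    # Decide what to review based on the PR's scope.
    state.checklist = ["logic", "security"] if ".py" in state.diff else ["style"]
    return "analyzer"

def analyzer(state):
    # Stand-in for Claude reasoning + Semgrep + CodeQL.
    if "security" in state.checklist and "eval(" in state.diff:
        state.issues.append({"rule": "dangerous-eval", "confidence": 0.9})
    return "fixer" if state.issues else "done"

def fixer(state):
    # Generate a corrected-code suggestion per issue.
    for issue in state.issues:
        state.fixes.append(f"fix for {issue['rule']}")
    return "done"

AGENTS = {"planner": planner, "analyzer": analyzer, "fixer": fixer}

def run_review(diff):
    """Drive the pipeline: each agent returns the name of the next step."""
    state = ReviewState(diff=diff)
    step = "planner"
    while step != "done":
        step = AGENTS[step](state)
    return state
```

Each agent mutates shared state and names its successor, which is the same shape the LangGraph supervisor enforces at a larger scale.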

Key Features
1

Three-Agent Review Pipeline

PlannerAgent creates a targeted review checklist, AnalyzerAgent identifies issues with severity and confidence scores, FixerAgent generates corrected code with unit tests — all coordinated by a LangGraph supervisor.

2

Security + Static Analysis

Combines Claude's reasoning with Semgrep (OWASP patterns) and CodeQL (data flow analysis) for comprehensive security scanning that catches both pattern-based and logic-based vulnerabilities.
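When multiple analyzers report on the same code, their findings have to be merged. A sketch of one reasonable strategy, deduplicating by location and treating agreement between tools as corroboration; the finding shape and the +0.2 confidence boost are assumptions, not the real tool output formats:

```python
def merge_findings(semgrep, codeql, llm):
    """Merge issue lists from several analyzers, deduplicating by location.

    Each finding is a dict with file, line, rule, severity, confidence
    (an illustrative shape, not Semgrep/CodeQL native output).
    """
    merged = {}
    for source, findings in (("semgrep", semgrep), ("codeql", codeql), ("claude", llm)):
        for f in findings:
            key = (f["file"], f["line"], f["rule"])
            if key in merged:
                # Corroborated by a second tool: boost confidence.
                merged[key]["confidence"] = min(1.0, merged[key]["confidence"] + 0.2)
                merged[key]["sources"].append(source)
            else:
                merged[key] = {**f, "sources": [source]}
    return sorted(merged.values(),
                  key=lambda f: (-f["confidence"], f["file"], f["line"]))
```

Ranking corroborated findings first is what lets pattern-based and logic-based results reinforce each other rather than duplicate noise.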

3

Automated Fix PRs

For high-confidence issues, the system creates fix PRs with corrected code, unit tests (92% minimum coverage), and explanatory commit messages — ready for one-click merge.
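Opening a fix PR comes down to a call to GitHub's create-pull-request endpoint (`POST /repos/{owner}/{repo}/pulls`). A sketch of assembling that request body; the top-level field names follow the REST API, while the issue dict and branch naming are assumptions for illustration:

```python
def build_fix_pr_payload(issue, fix_branch, base_branch="main"):
    """Assemble the JSON body for GitHub's create-pull-request endpoint.

    `issue` is an illustrative dict with rule, file, line, confidence.
    """
    return {
        "title": f"fix: {issue['rule']} in {issue['file']}",
        "head": fix_branch,       # branch holding the generated fix + tests
        "base": base_branch,
        "body": (
            f"Automated fix for **{issue['rule']}** "
            f"({issue['file']}:{issue['line']}, confidence {issue['confidence']:.0%}).\n\n"
            "Includes unit tests validated with pytest."
        ),
        "maintainer_can_modify": True,
    }
```

The explanatory body is what makes the PR reviewable at a glance and safe for one-click merge.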

4

Learning Audit Trail

Every review is stored in Neon Postgres with outcome tracking (was the suggestion accepted? was the fix correct?), enabling continuous improvement of the review quality over time.
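The audit trail reduces to a table of suggestions with nullable outcome columns that get filled in once the developer responds. A sketch of that shape, using sqlite3 here purely so the example is self-contained (the production store is Neon Postgres, and the table/column names are hypothetical):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS review_outcomes (
    id INTEGER PRIMARY KEY,
    pr_url TEXT NOT NULL,
    rule TEXT NOT NULL,
    confidence REAL NOT NULL,
    accepted INTEGER,          -- NULL until the developer responds
    fix_correct INTEGER,       -- NULL unless a fix PR was opened
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

def record_suggestion(conn, pr_url, rule, confidence):
    cur = conn.execute(
        "INSERT INTO review_outcomes (pr_url, rule, confidence) VALUES (?, ?, ?)",
        (pr_url, rule, confidence),
    )
    return cur.lastrowid

def mark_outcome(conn, row_id, accepted, fix_correct=None):
    conn.execute(
        "UPDATE review_outcomes SET accepted = ?, fix_correct = ? WHERE id = ?",
        (int(accepted), fix_correct, row_id),
    )

def acceptance_rate(conn, rule):
    """Per-rule accept rate over resolved suggestions, the signal used to
    recalibrate confidence over time."""
    row = conn.execute(
        "SELECT AVG(accepted) FROM review_outcomes "
        "WHERE rule = ? AND accepted IS NOT NULL",
        (rule,),
    ).fetchone()
    return row[0]
```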

Technical Decisions

Key technology choices and the reasoning behind each decision.

LangGraph Supervisor

AI / ML

Chose a supervisor graph over a sequential pipeline because different PRs need different review strategies. A CSS-only PR skips security analysis; a database migration PR gets extra scrutiny. The supervisor routes dynamically.
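The routing decision itself is simple once the changed file list is in hand. A sketch of the idea, with illustrative rules (the extension sets and pass names are assumptions, not the production routing table):

```python
import os

def plan_review_passes(changed_files):
    """Choose review passes from what the PR touches.

    Mirrors the supervisor's dynamic routing: CSS-only PRs skip security,
    database migrations get an extra pass.
    """
    exts = {os.path.splitext(f)[1] for f in changed_files}
    passes = ["style"]
    if exts & {".py", ".js", ".ts", ".go"}:
        passes += ["logic", "security"]
    if any("migrations/" in f or f.endswith(".sql") for f in changed_files):
        passes += ["schema-safety"]   # migrations get extra scrutiny
    if exts <= {".css", ".scss"}:
        passes = ["style"]            # CSS-only PRs skip security analysis
    return passes
```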

Semgrep + CodeQL

Backend

Layered two static analysis tools because they catch different vulnerability classes. Semgrep excels at pattern matching (SQL injection, XSS), while CodeQL's data flow analysis catches complex injection chains across function boundaries.

Claude 3.5 Sonnet

AI / ML

Selected Claude for code review because of its ability to understand intent, not just syntax. GPT-4o flagged more false positives (32% vs Claude's 11%), creating alert fatigue that would cause developers to ignore real issues.

GitHub Checks API

Infrastructure

Integrated with GitHub Checks rather than simple PR comments for proper CI/CD integration. Review results appear as check runs with line-level annotations, integrating naturally with existing developer workflows.
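A check run with line-level annotations is one request body to `POST /repos/{owner}/{repo}/check-runs`. A sketch of building it; the field names and the 50-annotations-per-request cap follow GitHub's Checks API, while the issue dict shape and check name are assumptions:

```python
def build_check_run(issues, head_sha):
    """Body for GitHub's create-check-run endpoint, with inline annotations.

    The Checks API accepts at most 50 annotations per request; larger
    result sets need follow-up update calls.
    """
    annotations = [
        {
            "path": i["file"],
            "start_line": i["line"],
            "end_line": i["line"],
            "annotation_level": "failure" if i["severity"] == "high" else "warning",
            "message": i["message"],
        }
        for i in issues[:50]
    ]
    return {
        "name": "ai-code-review",
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": "failure" if any(i["severity"] == "high" for i in issues)
                      else "neutral",
        "output": {
            "title": f"{len(issues)} issue(s) found",
            "summary": "Automated multi-agent review results.",
            "annotations": annotations,
        },
    }
```

Because the result is a check run rather than a comment, a high-severity finding can gate the merge through branch protection rules.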

Architecture

Multi-agent review pipeline with static analysis integration and automated fix generation.

01

PR Webhook

GitHub webhook triggers on PR open/update → Fetch diff, file tree, and commit history
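Any webhook receiver should verify GitHub's `X-Hub-Signature-256` header before trusting the payload. A minimal sketch using the documented `sha256=<hexdigest>` HMAC scheme (the function name is mine, not a library API):

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, payload: bytes,
                            signature_header: str) -> bool:
    """Validate GitHub's X-Hub-Signature-256 webhook header.

    Uses a constant-time comparison to avoid timing side channels.
    """
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```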

02

Planning

PlannerAgent analyzes PR scope → Generates review checklist with focus areas and skip rules

03

Analysis

AnalyzerAgent runs Claude reasoning + Semgrep + CodeQL in parallel → Ranked issue list with severity and confidence
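Since the three analyzers are independent, they can run concurrently. A sketch of the fan-out with a thread pool; the analyzers are plain callables here, standing in for the Claude, Semgrep, and CodeQL invocations:

```python
from concurrent.futures import ThreadPoolExecutor

def run_analyzers_parallel(diff, analyzers):
    """Run independent analyzers concurrently and collect their findings.

    `analyzers` maps a name to a callable taking the diff; threads suit
    this fan-out because each call is I/O-bound (API or subprocess).
    """
    with ThreadPoolExecutor(max_workers=len(analyzers)) as pool:
        futures = {name: pool.submit(fn, diff) for name, fn in analyzers.items()}
        return {name: f.result() for name, f in futures.items()}
```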

04

Fix Generation

FixerAgent creates corrected code for high-confidence issues → Generates unit tests → Validates with pytest (92% minimum coverage)

05

GitHub Output

PR comments with inline annotations + fix PRs created + GitHub Checks status updated

Challenges & Learnings

Key technical challenges I faced and how I solved them.

Challenge 1

False Positive Fatigue

Problem

Initial versions flagged 40+ issues per PR, with 35% being false positives. Developers started ignoring all suggestions, defeating the purpose of automated review.

Solution

Implemented a confidence calibration system using historical feedback. Each suggestion includes a confidence score trained on accept/reject patterns. Added a threshold filter (only show >75% confidence) with an expandable "lower confidence" section for thoroughness.
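The threshold filter itself is a small partition step. A sketch of splitting findings into the primary list and the collapsed "lower confidence" section at the 75% cutoff described above:

```python
def partition_by_confidence(issues, threshold=0.75):
    """Split findings into shown vs collapsed by confidence score.

    Issues above the threshold surface by default; the rest sit in an
    expandable section so thoroughness isn't lost.
    """
    shown = [i for i in issues if i["confidence"] > threshold]
    collapsed = [i for i in issues if i["confidence"] <= threshold]
    return shown, collapsed
```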

Outcome

False positive rate dropped from 35% to 8%. Developer engagement with suggestions increased from 23% to 71%.

Challenge 2

Context Window Limits on Large PRs

Problem

PRs with 50+ changed files exceeded Claude's context window, forcing truncation that caused the analyzer to miss cross-file issues — the exact bugs that are hardest for humans to catch.

Solution

Built a hierarchical analysis pipeline: first pass analyzes each file independently, second pass identifies cross-file dependencies and analyzes only the connected subgraphs together. Uses AST-based dependency resolution to determine which files actually interact.
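For Python files, the dependency resolution step can be done with the standard `ast` module: extract each module's imports of its siblings, then group modules into connected components of the import graph so each group fits one context window together. A simplified sketch (top-level imports only, module names as keys; the real system handles more languages and deeper resolution):

```python
import ast

def local_imports(source, module_names):
    """Names from `module_names` that `source` imports."""
    deps = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            deps |= {a.name for a in node.names if a.name in module_names}
        elif isinstance(node, ast.ImportFrom) and node.module in module_names:
            deps.add(node.module)
    return deps

def connected_groups(sources):
    """Group modules into connected components of the import graph."""
    names = set(sources)
    graph = {m: local_imports(src, names - {m}) for m, src in sources.items()}
    # Make the graph undirected: an import couples both modules for review.
    undirected = {m: set(deps) for m, deps in graph.items()}
    for m, deps in graph.items():
        for d in deps:
            undirected[d].add(m)
    # Flood-fill each unvisited module into its component.
    seen, groups = set(), []
    for m in graph:
        if m in seen:
            continue
        stack, group = [m], set()
        while stack:
            n = stack.pop()
            if n in group:
                continue
            group.add(n)
            stack.extend(undirected[n] - group)
        seen |= group
        groups.append(group)
    return groups
```

Only modules in the same component ever share a prompt, which is what keeps large PRs inside the context window without losing cross-file visibility.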

Outcome

Cross-file bug detection improved by 45%. Large PRs (100+ files) now complete review in 90 seconds vs previous timeout failures.


Interested in working with TwilightCore?

We build production systems like this for teams and founders who value quality engineering.