AI Lab

Local LLM Code Reviewer

A VS Code extension that runs code review entirely on your laptop using Ollama. Analyzes diffs, identifies bugs, suggests improvements, and generates inline comments — all without sending code to any external API.

Qwen 2.5 72B · Llama 3.3 70B · CodeLlama
85% bug detection rate · 2.9s avg response · zero API cost

Category

LLM

Status

Live

Tech Stack

Ollama · Python · AST · FastAPI · Tree-sitter

Models

Qwen 2.5 72B · Llama 3.3 70B · CodeLlama
Overview

This experiment explores whether locally hosted open-weight LLMs can deliver code review quality comparable to cloud APIs while keeping all source code on-device. The system runs as a VS Code extension backed by Ollama, parsing diffs with Tree-sitter ASTs, feeding structured context windows to the model, and rendering inline review comments — all without a single byte leaving the developer's machine.

Methodology

I benchmarked three local models (Qwen 2.5 72B Q4, Llama 3.3 70B Q5, and CodeLlama 34B) against GPT-4o on a curated dataset of 200 real pull requests spanning Python, TypeScript, and Go. Each PR was independently reviewed by a senior engineer to establish ground truth. I measured bug detection recall, false positive rate, suggestion quality (rated 1-5 by the engineer), and inference latency. Context windows were constructed by extracting changed functions via Tree-sitter, including surrounding scope and import context up to 4,096 tokens.
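The context-construction step can be sketched with Python's stdlib `ast` module standing in for Tree-sitter. This is a simplified illustration, not the experiment's exact code: the function name is hypothetical, and tokens are approximated as whitespace-separated words rather than real model tokens.

```python
import ast

def build_review_context(source: str, changed_line: int, token_budget: int = 4096) -> str:
    """Gather module imports plus the enclosing function for a changed line."""
    tree = ast.parse(source)
    lines = source.splitlines()

    # Include module-level imports so the model sees dependency context.
    imports = [
        "\n".join(lines[node.lineno - 1 : node.end_lineno])
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]

    # ast.walk yields outer nodes before nested ones, so the last
    # matching function is the innermost one containing the change.
    enclosing = None
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) \
                and node.lineno <= changed_line <= node.end_lineno:
            enclosing = node

    body = (
        "\n".join(lines[enclosing.lineno - 1 : enclosing.end_lineno])
        if enclosing
        else lines[changed_line - 1]  # fall back to the bare changed line
    )

    # Crude token cap; a real implementation would use the model's tokenizer.
    context = "\n".join(imports + [body])
    if len(context.split()) > token_budget:
        context = " ".join(context.split()[:token_budget])
    return context
```

Tree-sitter plays the same role for TypeScript and Go, where Python's `ast` module does not apply.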

Tech Stack

Ollama serves the quantized models with GPU offloading. Tree-sitter parses source files into ASTs for intelligent context extraction. FastAPI provides the local HTTP bridge between the VS Code extension and the model server. The extension itself uses the VS Code LSP protocol for inline diagnostics and code actions.
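For illustration, the bridge's request to Ollama's `/api/generate` endpoint might be assembled as below. The prompt template, option values, and function name are assumptions for this sketch, not the experiment's exact ones; the endpoint, `model`/`prompt`/`stream`/`options` fields, and default port are standard Ollama API.

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port

def build_ollama_request(model: str, context: str, diff: str) -> bytes:
    """Build the JSON body the local bridge POSTs to Ollama."""
    # Ask for machine-readable findings; this template is illustrative.
    prompt = (
        "You are a code reviewer. Using the function context and diff below, "
        "report each bug as a JSON object with line, severity, and suggestion.\n\n"
        f"### Context\n{context}\n\n### Diff\n{diff}\n"
    )
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return the full review in one response
        "options": {"temperature": 0.2},  # low temperature for consistent reviews
    }
    return json.dumps(body).encode("utf-8")
```

The FastAPI server forwards this body and hands the parsed reply back to the extension over localhost, so nothing leaves the machine.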

Key Findings

The most important insights from this experiment.

1

Qwen 2.5 72B matches GPT-4o at 85% recall

On our 200-PR benchmark, Qwen 2.5 72B achieved 85% bug detection recall versus GPT-4o's 89% — a surprisingly small gap considering it runs entirely locally on an RTX 4090.

2

AST-aware context halves false positives

Switching from naive diff-based prompts to Tree-sitter AST-extracted function context reduced false positive rate from 34% to 16%, because the model could see type signatures and import context.

3

Quantization matters less than context quality

Q4 vs Q8 quantization changed recall by only 2%, while improving context construction improved recall by 12%. Spending engineering effort on prompt context beats chasing model precision.

4

Sub-3-second response enables inline workflow

At 2.9s average response time on an RTX 4090, the review feels interactive enough to use as you write code, not just on PR submission.

Architecture

The pipeline starts when a developer saves a file or stages a diff. The VS Code extension detects the changed hunks and sends them to a local FastAPI server, which uses Tree-sitter to extract the full function context around each change. This structured context is formatted into a review prompt and sent to Ollama. The model's response is parsed for issue annotations (line number, severity, suggestion) and rendered as VS Code inline diagnostics with one-click "Apply Fix" code actions.
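The response-parsing step might look like the following sketch, assuming the model is prompted to emit a JSON array of `{line, severity, suggestion}` objects. The regex-based extraction, severity levels, and function name are illustrative; local models often wrap JSON in prose or a code fence, so defensive parsing is needed.

```python
import json
import re

def parse_review_response(raw: str) -> list[dict]:
    """Extract validated issue annotations from the model's reply."""
    # Locate the first JSON array anywhere in the reply text.
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if not match:
        return []
    try:
        issues = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []  # malformed JSON: surface no annotations rather than crash
    valid = []
    for issue in issues:
        # Keep only well-formed entries; anything else is dropped silently.
        if (isinstance(issue, dict)
                and isinstance(issue.get("line"), int)
                and issue.get("severity") in {"error", "warning", "info"}
                and isinstance(issue.get("suggestion"), str)):
            valid.append(issue)
    return valid
```

Each surviving entry maps directly onto a VS Code diagnostic at the given line and severity.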

Results

Across 200 test PRs: 85% bug detection recall, 16% false positive rate, 2.9s average response time, and zero API cost. Developer satisfaction surveys (n=12) rated the tool 4.1/5 for usefulness. Three engineers adopted it as their default pre-commit review step.

Challenges

Key technical challenges encountered during this experiment.

Challenge 1

Context window limitations with large diffs

PRs touching 10+ files exceeded the 4,096-token context budget. Solved by ranking changed functions by complexity score and reviewing the top-k, with a "more issues may exist" disclaimer.
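The ranking step can be sketched as a greedy selection under the token budget. The `(name, complexity_score, token_count)` tuple shape and the function name are assumptions for this illustration; the experiment's actual complexity metric is not specified here.

```python
def select_functions_for_review(functions, token_budget=4096):
    """Greedy top-k selection of changed functions under a token budget.

    `functions` holds (name, complexity_score, token_count) tuples.
    Returns the selected names plus a flag that drives the
    "more issues may exist" disclaimer.
    """
    # Review the most complex (most bug-prone) functions first.
    ranked = sorted(functions, key=lambda f: f[1], reverse=True)
    selected, used = [], 0
    for name, _score, tokens in ranked:
        if used + tokens > token_budget:
            continue  # this one doesn't fit; a smaller function still might
        selected.append(name)
        used += tokens
    return selected, len(selected) < len(functions)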

Challenge 2

Model hallucinating line numbers

Local models occasionally mapped suggestions to wrong line numbers. Added a post-processing validation step that cross-references suggested lines against the actual diff to filter phantom annotations.
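A minimal sketch of that validation step, assuming each annotation carries a `line` field and the diff's changed line numbers are known. The off-by-two tolerance and function name are illustrative choices, not the experiment's exact rule.

```python
def filter_phantom_annotations(issues, changed_lines):
    """Drop model suggestions pointing at lines the diff never touched."""
    TOLERANCE = 2  # allow suggestions on lines adjacent to a change
    kept = []
    for issue in issues:
        # Keep the annotation only if it lands on or near a changed line.
        if any(abs(issue["line"] - c) <= TOLERANCE for c in changed_lines):
            kept.append(issue)
    return kept
```

Annotations filtered out here simply never reach the editor, which is cheaper than trying to re-anchor a hallucinated line number.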
