
Local SLM Coding Assistant

Offline VS Code copilot that runs entirely on your laptop using Qwen 2.5, Llama 3.3, and Mistral Small models with zero API costs or internet dependency.

2025 · 2.5 months
85% HumanEval pass@1 · 120 tokens/sec · <8GB RAM usage

My Role

Solo developer — built the model comparison framework, VS Code extension, Tauri desktop wrapper, and performance benchmarking suite.

Duration

2.5 months

Year

2025

Tech Stack

Ollama · PyTorch · Qwen 2.5 · Llama 3.3 · Mistral · Tauri · WebGPU · VS Code LSP

Status

Live in Production
Overview

Offline VS Code copilot that runs entirely on your laptop using Qwen 2.5, Llama 3.3, and Mistral Small models with zero API costs or internet dependency.

The Challenge

Development teams face a privacy-versus-capability trade-off with AI coding tools. Cloud-based copilots send proprietary code to external servers, violating compliance requirements for healthcare, finance, and defense clients. Meanwhile, cloud API costs of $500+/month for large teams make AI-assisted development prohibitively expensive.

The Approach

I built an offline coding assistant that runs three leading small language models (Qwen 2.5, Llama 3.3, Mistral Small) entirely on the developer's laptop via Ollama. The system benchmarks each model in real-time and routes tasks to the optimal one — Qwen for speed-critical completions, Llama for complex reasoning, Mistral for memory-constrained environments. Zero API costs, zero data leaves the machine.
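A minimal sketch of that routing policy. The function and model names here (`pick_model`, `TaskKind`, the Ollama model tags) are illustrative, not the project's actual API:

```python
# Hypothetical sketch of the task-routing idea: pick a local model
# based on the kind of task and the RAM currently available.
from enum import Enum, auto

class TaskKind(Enum):
    COMPLETION = auto()   # latency-sensitive inline completions
    REASONING = auto()    # refactors, multi-file explanations
    GENERIC = auto()

def pick_model(task: TaskKind, free_ram_gb: float) -> str:
    """Route a task to one of the three local models."""
    if free_ram_gb < 16:
        return "mistral-small"      # smallest footprint for constrained machines
    if task is TaskKind.COMPLETION:
        return "qwen2.5-coder"      # fastest completions
    if task is TaskKind.REASONING:
        return "llama3.3"           # strongest reasoning
    return "qwen2.5-coder"

print(pick_model(TaskKind.COMPLETION, 32.0))  # qwen2.5-coder
```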

Key Features
1

Multi-Model Comparison Engine

Real-time benchmarking of Qwen 2.5 (72B), Llama 3.3 (70B), and Mistral Small (24B) across speed, accuracy, and memory usage — automatically routing each task to the optimal model.
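The throughput side of that benchmarking can be sketched as a small timing harness. The `fake_stream` stub stands in for a streaming Ollama call; everything here is illustrative, not the project's actual benchmarking code:

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    """Time a generate() callable that yields tokens; return throughput."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in generate(prompt))
    return n_tokens / (time.perf_counter() - start)

# Stub standing in for a streaming local-model call; the real harness
# would stream tokens from the Ollama HTTP API instead.
def fake_stream(prompt):
    for tok in prompt.split():
        yield tok

rate = tokens_per_second(fake_stream, "def add(a, b): return a + b")
print(rate > 0)  # True
```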

2

VS Code Deep Integration

LSP-based extension provides inline completions, code explanations, refactoring suggestions, and unit test generation directly in the editor with zero context switching.

3

Automated PR Creation

Highlight buggy code, describe the fix, and the assistant rewrites the code, generates tests, and creates a PR — all without leaving VS Code.

4

Privacy-First Architecture

All inference runs locally via Ollama with CUDA/AVX2 optimization. No network requests, no telemetry, no API keys. Code never leaves the developer's machine.
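The privacy guarantee follows from where requests go: Ollama serves its HTTP API on localhost. A sketch of such a request (payload built but not sent here; the model tag is an assumption):

```python
import json
from urllib.request import Request

# Fully local inference request: Ollama listens on localhost:11434,
# so no code, prompt, or telemetry ever leaves the machine.
payload = {
    "model": "qwen2.5-coder",   # illustrative model tag
    "prompt": "Write a Python function that reverses a string.",
    "stream": False,
}
req = Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(req.full_url)  # http://localhost:11434/api/generate
```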

Technical Decisions

Key technology choices and the reasoning behind each decision.

Ollama 0.3.8

AI / ML

Chose Ollama over llama.cpp directly for its model management and hot-swapping capabilities. Switching between Qwen, Llama, and Mistral takes <2 seconds vs 30+ seconds with manual model loading.

Tauri

Frontend

Selected Tauri over Electron for the desktop wrapper — 10x smaller binary (8MB vs 80MB) and native WebView2 rendering. Critical for a developer tool where every MB of RAM matters alongside the LLM.

WebGPU

Infrastructure

Implemented WebGPU inference as a fallback for systems without CUDA. While 40% slower than CUDA, it enables GPU acceleration on AMD and Intel GPUs that Ollama's default CUDA path doesn't support.
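The fallback chain can be sketched as a simple preference order. The probe and function names are illustrative assumptions, not the project's detection code:

```python
import shutil

def pick_backend(has_cuda: bool, has_webgpu: bool) -> str:
    """Prefer CUDA; fall back to WebGPU on AMD/Intel GPUs; else CPU AVX2."""
    if has_cuda:
        return "cuda"
    if has_webgpu:
        return "webgpu"
    return "cpu-avx2"

# A crude CUDA probe for illustration: nvidia-smi present on PATH.
has_cuda = shutil.which("nvidia-smi") is not None
print(pick_backend(has_cuda=False, has_webgpu=True))  # webgpu
```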

VS Code LSP

Frontend

Built on the Language Server Protocol rather than VS Code's proprietary extension API. LSP compatibility means the extension works in Cursor, Neovim, and any LSP-compatible editor without modification.

Architecture

Local-first architecture with multi-model routing and VS Code integration via LSP.

01

Input

VS Code extension captures code context + user instruction via LSP

02

Model Selection

Task classifier routes to optimal model based on complexity, speed, and available RAM

03

Inference

Ollama runs selected model (Qwen/Llama/Mistral) with CUDA or WebGPU acceleration

04

Post-processing

AST validation ensures generated code is syntactically correct before presentation

05

Delivery

Results streamed back to VS Code — inline completions, diff views, or PR creation
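Step 04 above is the cheapest and highest-value gate in the pipeline. For Python output it can be as small as a parse check (the project presumably validates per-language; this sketch covers Python only):

```python
import ast

def is_valid_python(code: str) -> bool:
    """Reject generated code that doesn't parse before showing it."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(is_valid_python("def f(x):\n    return x * 2"))  # True
print(is_valid_python("def f(x) return x"))            # False
```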

Challenges & Learnings

Key technical challenges I faced and how I solved them.

Challenge 1

Memory Management Across Models

Problem

Running a 70B parameter model alongside VS Code and a browser consumed 24GB+ RAM, causing swap thrashing on 16GB developer machines — making the tool unusable on standard hardware.

Solution

Implemented dynamic quantization selection based on available RAM. The system auto-detects free memory and loads Q4_K_M (4-bit) on 16GB machines, Q5_K_M on 32GB, and Q8 on 64GB+ systems. Added aggressive KV cache pruning for long context windows.
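The RAM-to-quantization mapping described above reduces to a small lookup. How free memory is detected is omitted here; the function name is illustrative:

```python
def pick_quantization(free_ram_gb: float) -> str:
    """Map available RAM to a GGUF quantization level, per the tiers above."""
    if free_ram_gb >= 64:
        return "Q8"        # 8-bit on 64GB+ systems
    if free_ram_gb >= 32:
        return "Q5_K_M"    # 5-bit on 32GB systems
    return "Q4_K_M"        # 4-bit on 16GB machines

print(pick_quantization(16))  # Q4_K_M
```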

Outcome

Runs smoothly on 16GB MacBooks with a model memory footprint under 8GB. Quantization costs only a 3% drop on HumanEval.

Challenge 2

Code Completion Latency

Problem

Initial completion latency was 800ms+ due to model cold-start and tokenization overhead, making the experience feel sluggish compared to cloud-based copilots that respond in 200-300ms.

Solution

Implemented speculative decoding with a small draft model (Mistral 7B) that runs continuously, pre-generating likely completions. The larger model then validates in a single forward pass, accepting 60-70% of draft tokens.
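The accept/reject loop at the heart of speculative decoding can be shown with toy deterministic models. This is a greatly simplified, greedy sketch of the technique, not the project's implementation (real speculative decoding verifies all draft tokens in one batched forward pass):

```python
def speculative_step(draft, target, prefix, k=4):
    """Draft model proposes k tokens; target verifies left to right,
    keeping the longest agreeing run plus one correction on mismatch.
    draft/target are deterministic next-token callables here."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in proposed:
        if target(ctx) == tok:
            accepted.append(tok)      # draft token accepted
            ctx.append(tok)
        else:
            accepted.append(target(ctx))  # target's correction
            break
    return accepted

# Toy models: draft guesses the next letter; target disagrees at position 2.
draft = lambda ctx: "abcd"[len(ctx)]
target = lambda ctx: "abXd"[len(ctx)]
print(speculative_step(draft, target, [], k=4))  # ['a', 'b', 'X']
```

With a 60-70% draft acceptance rate, most tokens cost only the cheap model's latency, which is what pulls the median from 800ms down toward 280ms.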

Outcome

Median completion latency dropped from 800ms to 280ms. Users report the experience feels "as fast as GitHub Copilot" in blind testing.


Interested in working with TwilightCore?

We build production systems like this for teams and founders who value quality engineering.