Local SLM Coding Assistant
Offline VS Code copilot that runs entirely on your laptop using Qwen 2.5, Llama 3.3, and Mistral Small models with zero API costs or internet dependency.

My Role
Solo developer — built the model comparison framework, VS Code extension, Tauri desktop wrapper, and performance benchmarking suite.
Duration
2.5 months
Year
2025
Tech Stack
Status
Live in Production
Development teams face a privacy vs capability trade-off with AI coding tools. Cloud-based copilots send proprietary code to external servers, violating compliance requirements for healthcare, finance, and defense clients. Meanwhile, $500+/month API costs for large teams make AI-assisted development prohibitively expensive.
I built an offline coding assistant that runs three leading small language models (Qwen 2.5, Llama 3.3, Mistral Small) entirely on the developer's laptop via Ollama. The system benchmarks each model in real-time and routes tasks to the optimal one — Qwen for speed-critical completions, Llama for complex reasoning, Mistral for memory-constrained environments. Zero API costs, zero data leaves the machine.
Multi-Model Comparison Engine
Real-time benchmarking of Qwen 2.5 (72B), Llama 3.3 (70B), and Mistral Small (24B) across speed, accuracy, and memory usage — automatically routing each task to the optimal model.
VS Code Deep Integration
LSP-based extension provides inline completions, code explanations, refactoring suggestions, and unit test generation directly in the editor with zero context switching.
Automated PR Creation
Highlight buggy code, describe the fix, and the assistant rewrites the code, generates tests, and creates a PR — all without leaving VS Code.
Privacy-First Architecture
All inference runs locally via Ollama with CUDA/AVX2 optimization. No network requests, no telemetry, no API keys. Code never leaves the developer's machine.
Key technology choices and the reasoning behind each decision.
Ollama 0.3.8
AI / ML
Chose Ollama over using llama.cpp directly for its model management and hot-swapping capabilities. Switching between Qwen, Llama, and Mistral takes <2 seconds vs 30+ seconds with manual model loading.
Tauri
Frontend
Selected Tauri over Electron for the desktop wrapper — 10x smaller binary (8MB vs 80MB) and native system webview rendering (WebView2 on Windows, WKWebView on macOS). Critical for a developer tool where every MB of RAM matters alongside the LLM.
WebGPU
Infrastructure
Implemented WebGPU inference as a fallback for systems without CUDA. While 40% slower than CUDA, it enables GPU acceleration on AMD and Intel GPUs that Ollama's default CUDA path doesn't support.
VS Code LSP
Frontend
Built on the Language Server Protocol rather than VS Code's proprietary extension API. LSP compatibility means the extension works in Cursor, Neovim, and any LSP-compatible editor without modification.
Local-first architecture with multi-model routing and VS Code integration via LSP.
Input
VS Code extension captures code context + user instruction via LSP
Model Selection
Task classifier routes to optimal model based on complexity, speed, and available RAM
Inference
Ollama runs selected model (Qwen/Llama/Mistral) with CUDA or WebGPU acceleration
Post-processing
AST validation ensures generated code is syntactically correct before presentation
Delivery
Results streamed back to VS Code — inline completions, diff views, or PR creation
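The routing and validation steps above can be sketched as follows. This is a minimal illustration, not the project's actual classifier: the model names, RAM figures, and keyword heuristic are assumptions standing in for the real task router, and the AST gate shows the post-processing check for Python output.

```python
import ast

# Hypothetical model registry: names and rough RAM needs (GB) are
# illustrative placeholders, not the project's real configuration.
MODELS = {
    "qwen2.5:72b":   {"ram_gb": 48, "strength": "speed"},
    "llama3.3:70b":  {"ram_gb": 46, "strength": "reasoning"},
    "mistral-small": {"ram_gb": 16, "strength": "low-memory"},
}

def classify_task(instruction: str) -> str:
    """Naive keyword classifier standing in for the real task router."""
    if any(k in instruction.lower() for k in ("refactor", "explain", "why")):
        return "reasoning"
    return "speed"

def select_model(instruction: str, free_ram_gb: float) -> str:
    """Route to the preferred model, falling back when RAM is tight."""
    want = classify_task(instruction)
    candidates = [n for n, m in MODELS.items() if m["ram_gb"] <= free_ram_gb]
    if not candidates:
        raise RuntimeError("no model fits in available RAM")
    for name in candidates:
        if MODELS[name]["strength"] == want:
            return name
    return candidates[-1]  # smallest viable model as fallback

def validate_output(code: str) -> bool:
    """AST gate: only syntactically valid code reaches the editor."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```

In this sketch the fallback simply takes the last (smallest) candidate; the described system also weighs latency and task complexity when no model matches the preferred strength.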
Key technical challenges I faced and how I solved them.
Memory Management Across Models
Running a 70B parameter model alongside VS Code and a browser consumed 24GB+ RAM, causing swap thrashing on 16GB developer machines — making the tool unusable on standard hardware.
Implemented dynamic quantization selection based on available RAM. The system auto-detects free memory and loads Q4_K_M (4-bit) on 16GB machines, Q5_K_M on 32GB, and Q8 on 64GB+ systems. Added aggressive KV cache pruning for long context windows.
Runs smoothly on 16GB MacBooks with <8GB model memory footprint. Performance degradation from quantization is only 3% on HumanEval.
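The quantization tiers described above reduce to a simple mapping from detected free RAM to a GGUF quantization level. A minimal sketch, assuming the thresholds cited in the write-up (the `Q8_0` name is the standard GGUF spelling of the "Q8" tier):

```python
def pick_quantization(free_ram_gb: float) -> str:
    """Map detected free RAM to a quantization level.

    Thresholds follow the write-up: 16GB machines get Q4_K_M (4-bit),
    32GB get Q5_K_M, and 64GB+ get 8-bit (Q8_0 in GGUF naming).
    """
    if free_ram_gb >= 64:
        return "Q8_0"
    if free_ram_gb >= 32:
        return "Q5_K_M"
    return "Q4_K_M"
```

In practice the detection step would query the OS for available (not total) memory, so a 64GB machine under heavy load can still drop to a smaller quant rather than swap.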
Code Completion Latency
Initial completion latency was 800ms+ due to model cold-start and tokenization overhead, making the experience feel sluggish compared to cloud-based copilots that respond in 200-300ms.
Implemented speculative decoding with a small draft model (Mistral 7B) that runs continuously, pre-generating likely completions. The larger model then validates in a single forward pass, accepting 60-70% of draft tokens.
Median completion latency dropped from 800ms to 280ms. Users report the experience feels "as fast as GitHub Copilot" in blind testing.
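The accept/reject logic of speculative decoding can be sketched as below. Note this is a simplified illustration: a real implementation verifies the whole draft run in a single forward pass of the target model, whereas this sketch calls a per-token oracle (`target_next`, a hypothetical stand-in) to show the acceptance rule.

```python
from typing import Callable, List

def speculative_step(
    prefix: List[str],
    draft_tokens: List[str],
    target_next: Callable[[List[str]], str],
) -> List[str]:
    """One round of speculative decoding: the draft model proposes a run
    of tokens; the target model checks them left to right, keeps the
    longest agreeing prefix, and substitutes its own token at the first
    mismatch."""
    accepted: List[str] = []
    for tok in draft_tokens:
        expected = target_next(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft token verified, keep it
        else:
            accepted.append(expected)  # mismatch: take target's token, stop
            break
    return accepted
```

Because the draft model agrees with the target 60-70% of the time here, most rounds commit several tokens for the cost of one large-model pass, which is where the 800ms-to-280ms latency drop comes from.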
Interested in working with TwilightCore?
We build production systems like this for teams and founders who value quality engineering.

