Production RAG Enterprise Search
Transform your company's scattered documents into an instant-answer engine that delivers 92% accurate responses with source citations in under 2 seconds.

My Role
Lead engineer — designed the hybrid retrieval pipeline, implemented eval-gated CI/CD, built the streaming frontend, and drove cost per query down to $0.02.
Duration
4 months
Year
2025
Tech Stack
FastAPI, Next.js, pgvector, Redis, Sentence Transformers, ColBERT, Unstructured.io, LlamaParse, PydanticAI, DeepEval, Claude 3.5 Sonnet
Status
Live in Production
Companies with 500+ page handbooks, scattered Notion wikis, and legacy PDF archives have no way to get instant, accurate answers. Employees waste 30+ minutes searching for policies, and new hires take weeks to ramp up. Generic ChatGPT solutions hallucinate 30%+ of the time with no source verification.
I built a production-grade RAG pipeline combining BM25 keyword search with dense vector retrieval and ColBERT reranking for 92% F1 accuracy. Documents are parsed with Unstructured.io and LlamaParse for table extraction, embedded via Sentence Transformers, and stored in pgvector. A DeepEval integration blocks deployments if accuracy drops below the 92% threshold — making this the first truly production-safe RAG system I've shipped.
Hybrid Retrieval Engine
Combines BM25 keyword search with dense vector similarity and MMR diversity scoring, followed by ColBERT reranking — outperforming pure semantic search by 18% on domain-specific queries.
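A minimal sketch of how the fusion and MMR steps could fit together. The helper names (`vector_search`, `bm25_search`), the 0.6/0.4 blend, and the candidate counts are illustrative assumptions, not the production values:

```python
import numpy as np

def hybrid_retrieve(query: str, k: int = 20, alpha: float = 0.6):
    dense = vector_search(query, limit=50)   # hypothetical: [(chunk_id, cosine), ...]
    sparse = bm25_search(query, limit=50)    # hypothetical: [(chunk_id, bm25), ...]

    # Min-max normalize each score list so the two scales are comparable.
    def normalize(results):
        scores = np.array([s for _, s in results], dtype=float)
        lo, hi = scores.min(), scores.max()
        return {cid: (s - lo) / (hi - lo + 1e-9)
                for (cid, _), s in zip(results, scores)}

    d, s = normalize(dense), normalize(sparse)
    fused = {cid: alpha * d.get(cid, 0.0) + (1 - alpha) * s.get(cid, 0.0)
             for cid in set(d) | set(s)}
    return sorted(fused, key=fused.get, reverse=True)[:k]

def mmr_select(query_vec, cand_ids, cand_vecs, lam: float = 0.7, k: int = 20):
    """Maximal Marginal Relevance: trade relevance against redundancy.
    Assumes all vectors are L2-normalized, so dot product == cosine."""
    selected: list[int] = []
    remaining = list(range(len(cand_ids)))
    while remaining and len(selected) < k:
        def mmr(i):
            rel = float(cand_vecs[i] @ query_vec)
            red = max((float(cand_vecs[i] @ cand_vecs[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return [cand_ids[i] for i in selected]
```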
Eval-Gated Deployments
CI/CD pipeline runs DeepEval test suites on every push. Deployments are automatically blocked if F1 score drops below 0.92, preventing accuracy regressions from reaching production.
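Conceptually, the gate is just a script in the deploy job that fails the build when the aggregate score dips. A sketch, assuming the eval suite writes per-question scores to a JSON file (that file format is an assumption here):

```python
# CI deploy gate: fail the pipeline if aggregate F1 drops below threshold.
# Assumes the eval suite wrote [{"question": ..., "f1": ...}, ...] to disk.
import json
import sys

F1_THRESHOLD = 0.92

def main() -> None:
    results = json.load(open("eval_results.json"))
    f1 = sum(r["f1"] for r in results) / len(results)
    print(f"aggregate F1 = {f1:.3f} (threshold {F1_THRESHOLD})")
    if f1 < F1_THRESHOLD:
        sys.exit(1)  # non-zero exit blocks the deploy step

if __name__ == "__main__":
    main()
```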
Streaming Responses with Citations
PydanticAI structures every response with inline source citations linking to specific page numbers, so users can verify answers against original documents in one click.
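A sketch of the citation schema. The field names are assumptions, and depending on the pydantic-ai version the Agent keyword is `output_type` (newer) or `result_type` (older):

```python
# Structured answers with inline citations via PydanticAI. Schema fields and
# the free-standing `context`/`question` variables are illustrative.
from pydantic import BaseModel, Field
from pydantic_ai import Agent

class Citation(BaseModel):
    document: str = Field(description="Source file name")
    page: int = Field(description="Page number the passage came from")
    quote: str = Field(description="Verbatim supporting excerpt")

class CitedAnswer(BaseModel):
    answer: str
    citations: list[Citation]

agent = Agent(
    "anthropic:claude-3-5-sonnet-latest",
    output_type=CitedAnswer,  # `result_type` on older pydantic-ai versions
    system_prompt="Answer only from the provided context; cite every claim.",
)

result = agent.run_sync(f"Context:\n{context}\n\nQuestion: {question}")
print(result.output.answer)  # validated against the schema before it reaches the UI
for c in result.output.citations:
    print(f"  [{c.document} p.{c.page}] {c.quote}")
```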
Drag-and-Drop Document Ingestion
Upload PDFs, Notion exports, and Google Drive files through a simple UI. Automatic chunking, table extraction, and embedding happen asynchronously with progress tracking.
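A minimal sketch of the upload-then-process shape using FastAPI's BackgroundTasks. The in-memory job dict and the `parse_and_embed` helper are stand-ins for the real queue and parsing pipeline:

```python
import uuid
from fastapi import FastAPI, UploadFile, BackgroundTasks

app = FastAPI()
JOBS: dict[str, str] = {}  # job_id -> status (in-memory stand-in for real tracking)

def parse_and_embed(job_id: str, data: bytes, filename: str) -> None:
    # Placeholder for: Unstructured.io / LlamaParse -> chunking -> embedding -> pgvector
    JOBS[job_id] = "done"

@app.post("/ingest")
async def ingest(file: UploadFile, tasks: BackgroundTasks):
    job_id = uuid.uuid4().hex
    JOBS[job_id] = "processing"
    tasks.add_task(parse_and_embed, job_id, await file.read(), file.filename)
    return {"job_id": job_id}

@app.get("/ingest/{job_id}")
async def progress(job_id: str):
    return {"status": JOBS.get(job_id, "unknown")}
```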
Cost Tracking Dashboard
Real-time monitoring of per-query costs ($0.02 average), token usage, and retrieval latency — giving full visibility into operational expenses.
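The per-query figure is straightforward token arithmetic. A sketch with illustrative prices (check your provider's current rates before relying on them):

```python
# Back-of-envelope per-query cost, as surfaced on the dashboard.
INPUT_PRICE = 3.00 / 1_000_000    # $/input token (assumed rate)
OUTPUT_PRICE = 15.00 / 1_000_000  # $/output token (assumed rate)

def query_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# e.g. a typical query: ~5k context tokens in, ~300 tokens out
print(f"${query_cost(5_000, 300):.4f}")  # ~$0.0195, in line with the $0.02 average
```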
Key technology choices and the reasoning behind each decision.
pgvector + BM25 Hybrid
Data
Chose pgvector over Pinecone to keep retrieval and structured data in a single database. Added pg_trgm for BM25-style keyword matching; hybrid search boosted accuracy 18% over pure vector retrieval on technical documentation.
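Keeping both signals in Postgres means one query can blend them. A sketch, assuming a `chunks` table with `content text` and `embedding vector(384)` columns plus the pg_trgm extension; the weights and the free-standing `query_embedding`/`query_text` variables are illustrative:

```python
import psycopg
from pgvector.psycopg import register_vector

SQL = """
SELECT id, content,
       0.6 * (1 - (embedding <=> %(qvec)s))      -- dense: cosine similarity
     + 0.4 * similarity(content, %(qtext)s)      -- sparse: pg_trgm keyword match
       AS score
FROM chunks
ORDER BY score DESC
LIMIT 20;
"""

with psycopg.connect("dbname=rag") as conn:
    register_vector(conn)  # lets psycopg pass numpy arrays as pgvector values
    rows = conn.execute(SQL, {"qvec": query_embedding, "qtext": query_text}).fetchall()
```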
ColBERT Reranker
AI / ML
Selected ColBERT over cross-encoder reranking for its 3x speed advantage at similar accuracy. Late-interaction scoring maintains token-level granularity without the latency penalty of full cross-attention.
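Late interaction in miniature: each query token is scored against its best-matching document token (MaxSim), and the per-token maxima are summed. A sketch with illustrative shapes:

```python
import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> float:
    """q_emb: (q_tokens, dim), d_emb: (d_tokens, dim), both L2-normalized."""
    sim = q_emb @ d_emb.T                       # (q_tokens, d_tokens) similarities
    return sim.max(dim=1).values.sum().item()   # best doc token per query token

# Rerank the top-20 hybrid candidates by MaxSim, keep the top 5:
# reranked = sorted(candidates, key=lambda d: maxsim_score(q, d.tokens), reverse=True)[:5]
```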
FastAPI + Streaming
Backend
Chose FastAPI over Django for native async streaming support. Server-Sent Events deliver tokens to the frontend in real time, matching ChatGPT's UX without WebSocket complexity.
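The streaming path is only a few lines in FastAPI. A sketch where `generate_tokens` stands in for the real Claude streaming call:

```python
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(question: str):
    for token in ["Employees", " accrue", " 15", " PTO", " days."]:  # stand-in stream
        yield f"data: {token}\n\n"   # SSE wire format: "data: ...\n\n" per event
        await asyncio.sleep(0.02)
    yield "data: [DONE]\n\n"

@app.get("/ask")
async def ask(q: str):
    return StreamingResponse(generate_tokens(q), media_type="text/event-stream")
```

On the frontend, a plain `EventSource` consumes this without any WebSocket handshake.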
DeepEval
Infrastructure
Chose DeepEval over RAGAS for its CI/CD-native test runner. Defining eval suites as pytest fixtures let us block deployments on accuracy regressions with zero custom infrastructure.
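A sketch of what an eval-as-test looks like in DeepEval's pytest style. The `ask_rag` helper, the `retrieved_chunks` variable, and the specific metric and threshold are illustrative stand-ins for the production F1 suite:

```python
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def test_pto_policy_answer():
    question = "How many PTO days do new hires get?"
    test_case = LLMTestCase(
        input=question,
        actual_output=ask_rag(question),      # hypothetical helper calling the pipeline
        retrieval_context=retrieved_chunks,   # assumed: the passages fed to the LLM
    )
    # assert_test raises on a sub-threshold score, failing the pytest run
    # and therefore the CI deploy step.
    assert_test(test_case, [FaithfulnessMetric(threshold=0.9)])
```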
Multi-stage retrieval pipeline with hybrid search, reranking, and eval-gated deployment.
Ingestion
Upload → Unstructured.io (PDF parsing) + LlamaParse (table extraction) → Sentence chunking
Embedding
Sentence Transformers (all-MiniLM-L6-v2) → pgvector storage with BM25 index (see the embedding sketch after this pipeline)
Retrieval
Query → Hybrid search (dense + BM25 + MMR) → Top-20 candidates
Reranking
ColBERT reranker scores top-20 → Top-5 most relevant passages
Synthesis
Claude 3.5 Sonnet generates response with PydanticAI structured citations
Delivery
FastAPI SSE streaming → Next.js frontend with inline source links
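The Embedding stage referenced above is close to a one-liner with sentence-transformers. A sketch, with table and column names assumed:

```python
# Encode chunks with all-MiniLM-L6-v2 (384-dim) and store them in pgvector.
from sentence_transformers import SentenceTransformer
import psycopg
from pgvector.psycopg import register_vector

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_and_store(chunks: list[str]) -> None:
    vectors = model.encode(chunks, normalize_embeddings=True)
    with psycopg.connect("dbname=rag") as conn:
        register_vector(conn)
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO chunks (content, embedding) VALUES (%s, %s)",
                list(zip(chunks, vectors)),
            )
```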
Key technical challenges I faced and how I solved them.
Table Extraction Accuracy
Standard PDF parsers destroyed table formatting, causing financial and policy tables to produce nonsensical text chunks. 35% of handbook queries involved tabular data that was being mangled.
Integrated LlamaParse specifically for table-heavy pages, using its vision model to reconstruct table structures. Added a table-aware chunking strategy that preserves row-column relationships in the embedding space.
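A sketch of the chunking idea: LlamaParse emits markdown in which tables survive as pipe-delimited blocks, so the chunker can keep each table whole instead of splitting mid-row. The parser options, file name, and size heuristics are illustrative:

```python
from llama_parse import LlamaParse

parser = LlamaParse(result_type="markdown")        # vision-assisted parsing
docs = parser.load_data("employee_handbook.pdf")   # hypothetical file name

def table_aware_chunks(markdown: str, max_lines: int = 12) -> list[str]:
    """Keep every markdown table in one unsplit chunk; split prose every max_lines."""
    chunks, current, in_table = [], [], False
    for line in markdown.splitlines():
        is_row = line.lstrip().startswith("|")     # markdown table rows start with |
        if in_table and not is_row:                # table just ended: flush it whole
            chunks.append("\n".join(current))
            current, in_table = [], False
        current.append(line)
        in_table = in_table or is_row
        if not in_table and len(current) >= max_lines:
            chunks.append("\n".join(current))
            current = []
    if current:
        chunks.append("\n".join(current))
    return chunks

chunks = [c for d in docs for c in table_aware_chunks(d.text)]
```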
Table-related query accuracy jumped from 61% to 89%. Finance teams reported the system finally "understood" compensation tables.
Retrieval Latency at Scale
With 50k+ document chunks, hybrid search queries were taking 4.2 seconds — far too slow for an interactive search experience where users expect sub-2-second responses.
Implemented HNSW indexing on pgvector with tuned ef_construction=200 and m=16 parameters. Added a query cache layer in Redis for repeated questions, which handled 40% of production traffic.
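Both fixes are small in code terms. A sketch of the tuned index DDL and a Redis read-through cache; the key scheme and TTL are assumptions:

```python
import hashlib
import json
import psycopg
import redis

# HNSW index on the embedding column with the tuned build parameters.
with psycopg.connect("dbname=rag") as conn:
    conn.execute(
        "CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw "
        "ON chunks USING hnsw (embedding vector_cosine_ops) "
        "WITH (m = 16, ef_construction = 200)"
    )

cache = redis.Redis()

def cached_answer(question: str, answer_fn):
    key = "rag:" + hashlib.sha256(question.lower().encode()).hexdigest()
    if (hit := cache.get(key)) is not None:
        return json.loads(hit)                   # cache hit: the sub-50ms path
    answer = answer_fn(question)                 # full retrieve -> rerank -> synthesize
    cache.setex(key, 3600, json.dumps(answer))   # assumed 1-hour TTL
    return answer
```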
P95 retrieval latency dropped from 4.2s to 380ms. Cached queries return in under 50ms.
Interested in working with TwilightCore?
We build production systems like this for teams and founders who value quality engineering.

