
RAG Evaluation Harness

An opinionated evaluation framework for testing RAG pipelines. Compare chunking strategies (fixed, semantic, recursive), embedding models (OpenAI, Cohere, open-source), and retrieval methods (dense, sparse, hybrid) against ground-truth Q&A datasets with 12 automated metrics.

Claude Sonnet · GPT-4o · all-MiniLM-L6
12 eval metrics · 6 chunking strategies · 4 embedding models compared

Category

RAG

Status

Live

Tech Stack

Python · DeepEval · RAGAS · pgvector · Streamlit

Models

Claude Sonnet · GPT-4o · all-MiniLM-L6

Overview

Most RAG tutorials evaluate with vibes. This experiment builds a rigorous, automated evaluation framework that benchmarks every variable in a RAG pipeline — chunking strategy, embedding model, retrieval method, and generation model — against ground-truth Q&A datasets with 12 standardized metrics. The goal: make RAG engineering empirical rather than anecdotal.

Methodology

I created 3 evaluation datasets (technical documentation, legal contracts, and academic papers) with 50 ground-truth Q&A pairs each. For each dataset, I tested 6 chunking strategies (fixed 512, fixed 1024, recursive, semantic, sentence-window, parent-document), 4 embedding models (OpenAI text-embedding-3-large, Cohere embed-v3, all-MiniLM-L6, BGE-large), and 3 retrieval methods (dense, sparse BM25, hybrid). Each combination was scored on 12 metrics from RAGAS and DeepEval including faithfulness, answer relevancy, context precision, context recall, and hallucination rate.
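The full sweep is just a Cartesian product over the three axes. A minimal sketch of the grid enumeration (the names below are illustrative labels, not the harness's actual identifiers):

```python
from itertools import product

# Illustrative labels for the three configuration axes described above.
CHUNKERS = ["fixed-512", "fixed-1024", "recursive", "semantic",
            "sentence-window", "parent-document"]
EMBEDDERS = ["text-embedding-3-large", "embed-v3", "all-MiniLM-L6", "bge-large"]
RETRIEVERS = ["dense", "bm25", "hybrid"]

def configurations():
    """Enumerate every (chunking, embedding, retrieval) combination."""
    return [
        {"chunker": c, "embedder": e, "retriever": r}
        for c, e, r in product(CHUNKERS, EMBEDDERS, RETRIEVERS)
    ]

configs = configurations()
print(len(configs))  # 6 x 4 x 3 = 72 combinations
```

Each of the 72 configurations is then scored on all 150 questions, so every metric comparison rests on the same ground truth.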

Tech Stack

DeepEval and RAGAS provide the 12 evaluation metrics. pgvector stores embeddings for dense retrieval. A custom BM25 implementation handles sparse retrieval. Streamlit powers the interactive dashboard for comparing results. Python orchestrates the evaluation pipeline with parallel execution across configurations.
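The sparse side can be as small as a single scoring function. A minimal Okapi BM25 sketch over pre-tokenized documents (not the project's actual implementation; the `k1` and `b` defaults are the conventional values):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25: score each tokenized document against the query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    # Document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores
```

Documents that never mention a query term contribute nothing, which is exactly why sparse retrieval shines on exact identifiers.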

Key Findings

The most important insights from this experiment.

1. Semantic chunking wins on technical docs, loses on legal

Semantic chunking achieved 0.91 context precision on technical docs but dropped to 0.74 on legal contracts where clause boundaries matter more than semantic similarity. Fixed 512 with overlap performed best on legal.
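For reference, the core of a semantic chunker is a similarity-based split between adjacent sentences: start a new chunk wherever cosine similarity drops. The sketch below is illustrative, and `threshold` is an assumed tuning knob rather than the harness's actual value:

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])
        chunks[-1].append(sentences[i])
    return [" ".join(c) for c in chunks]
```

This is also why the strategy underperforms on contracts: clause boundaries are structural, not semantic, so a similarity threshold splits in the wrong places.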

2. Hybrid retrieval is worth the complexity

Dense + BM25 hybrid retrieval improved context recall by 8-15% over pure dense retrieval across all datasets. The improvement is largest on queries containing specific identifiers (error codes, section numbers).
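The write-up doesn't specify how the two rankings are combined; reciprocal rank fusion (RRF) is a common choice for merging dense and BM25 result lists and would look roughly like:

```python
def rrf_fuse(dense_ranking, sparse_ranking, k=60):
    """Reciprocal rank fusion: merge two ranked lists of doc ids.

    Each doc gets 1/(k + rank) from every list it appears in; k=60 is the
    commonly used constant. Docs ranked well by either retriever surface.
    """
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.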

3. OpenAI embeddings lead, but BGE-large is 93% as good

text-embedding-3-large scored highest on all metrics, but BGE-large achieved 93% of its performance at zero API cost — a compelling tradeoff for cost-sensitive deployments.

4. Evaluation must be domain-specific

A chunking strategy that scores 0.91 on technical docs scores 0.74 on legal text. Universal "best" configurations don't exist — the harness lets you find yours.

Architecture

The harness runs in three phases: (1) Ingestion — documents are processed through all chunking strategies in parallel, embedded with all models, and stored in separate pgvector collections. (2) Retrieval — each query is run against all retrieval method + embedding model combinations, collecting top-k results. (3) Evaluation — retrieved contexts and generated answers are scored against ground-truth using DeepEval and RAGAS metrics, with results aggregated into a comparison matrix.
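The three phases can be sketched as one loop over configurations; every function name here is a hypothetical placeholder for the harness's actual components, wired up via a config dict:

```python
def run_harness(documents, queries, configs, top_k=5):
    """Illustrative three-phase flow: ingest, retrieve, evaluate per config."""
    results = {}
    for cfg in configs:
        # Phase 1: ingestion — chunk documents and embed/index under this config
        chunks = cfg["chunk"](documents)
        index = cfg["index"](cfg["embed"](chunks))
        # Phase 2: retrieval — collect top-k contexts for every query
        contexts = {q: cfg["retrieve"](index, q, top_k) for q in queries}
        # Phase 3: evaluation — score retrieved contexts against ground truth
        results[cfg["name"]] = cfg["evaluate"](contexts)
    return results
```

In the real harness, phase 1 runs once per chunking/embedding pair (stored in separate pgvector collections) rather than per configuration, so ingestion cost doesn't multiply by the retrieval axis.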

Results

Tested 72 unique pipeline configurations (6 chunking strategies × 4 embedding models × 3 retrieval methods) across 150 questions. Found that the optimal configuration varies by domain; no universal winner exists. Best overall: semantic chunking + OpenAI embeddings + hybrid retrieval (0.92 faithfulness, 0.89 context precision). Best cost-efficient: recursive chunking + BGE-large + hybrid retrieval (0.87 faithfulness, 0.83 context precision, $0 embedding cost).

Challenges

Key technical challenges encountered during this experiment.

Challenge 1: Evaluation metric disagreement

RAGAS faithfulness and DeepEval hallucination scores sometimes contradicted each other. Solved by establishing a meta-evaluation protocol: if metrics disagree, human judgment breaks the tie, and the more reliable metric is weighted higher in the composite score.
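One way to encode that protocol is a weighted composite that also flags metric pairs whose scores diverge past a threshold for human review. The weights and disagreement threshold below are illustrative assumptions, not the harness's calibrated values:

```python
def composite_score(metric_scores, weights, disagreement_threshold=0.3):
    """Weighted composite of metric scores, flagging strongly disagreeing pairs.

    metric_scores: {metric_name: score in [0, 1]}
    weights:       {metric_name: weight}, higher for more reliable metrics
    Returns (composite, flagged_pairs); flagged pairs go to human review.
    """
    total_w = sum(weights[m] for m in metric_scores)
    composite = sum(metric_scores[m] * weights[m] for m in metric_scores) / total_w
    names = list(metric_scores)
    flagged = [
        (a, b)
        for i, a in enumerate(names) for b in names[i + 1:]
        if abs(metric_scores[a] - metric_scores[b]) > disagreement_threshold
    ]
    return composite, flagged
```

Weighting the more reliable metric higher means a single noisy score can't swing the composite, while the flag list keeps humans in the loop on genuine conflicts.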

Challenge 2: Embedding cost at scale

Testing 4 embedding models × 150 questions × 6 chunking strategies generated 3,600 embedding API calls. Implemented aggressive caching so re-runs only embed new or changed chunks.
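A minimal version of that cache keys on the model name plus a content hash of the chunk, so only unseen (model, text) pairs hit the API. The class below is a sketch; a real harness would persist the store to disk between runs:

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings by (model, SHA-256 of chunk text).

    embed_fn(model, text) -> vector is the underlying (paid) embedding call;
    changed chunks get a new hash and are re-embedded, unchanged ones are free.
    """
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}
        self.misses = 0  # number of actual API calls made

    def embed(self, model, text):
        key = (model, hashlib.sha256(text.encode("utf-8")).hexdigest())
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(model, text)
        return self.store[key]
```

Hashing content rather than chunk position also means a re-chunking run only pays for chunks whose text actually changed.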


Interested in working with Forward?

We build production AI systems and run experiments like this for teams who value rigorous engineering.