
Prompt Optimization Lab

An experiment in automated prompt optimization. Takes a base prompt and a scoring function, then uses evolutionary strategies (mutation, crossover, selection) to improve prompt performance over generations. Includes A/B testing infrastructure to validate improvements with statistical significance.

Claude Sonnet · GPT-4o · Gemini 2.5 Pro
15% avg accuracy lift · Genetic algorithm evolution · Statistical significance testing

Category

LLM

Status

Research

Tech Stack

Python · DSPy · DeepEval · Streamlit · Redis

Models

Claude Sonnet · GPT-4o · Gemini 2.5 Pro
Overview

Writing good prompts is still an artisanal craft. This research experiment automates prompt engineering using evolutionary algorithms — starting from a base prompt and a scoring function, the system mutates, crosses over, and selects prompts over generations to maximize performance on your specific evaluation suite. Think genetic algorithms, but for natural language instructions.

Methodology

I implemented a genetic algorithm with a population of 20 prompt variants, running 15 generations per optimization cycle. Mutations include paraphrasing, instruction reordering, example swapping, constraint addition/removal, and format changes. Crossover combines high-scoring sections from two parent prompts. Selection uses tournament selection with elitism (top 2 always survive). I tested on 5 optimization tasks: classification accuracy, JSON schema compliance, code generation pass rate, summarization quality, and extraction precision. Each task had a held-out test set to measure generalization.
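The selection step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the population is assumed to be a list of prompt strings and `fitness` a callable returning a score, with the tournament size as a free parameter.

```python
import random

def select_parents(population, fitness, k=3, n_parents=18, n_elite=2):
    """Tournament selection with elitism: the top n_elite variants
    survive unchanged, and each parent slot is filled by a k-way
    tournament over the whole population."""
    ranked = sorted(population, key=fitness, reverse=True)
    elites = ranked[:n_elite]
    parents = []
    for _ in range(n_parents):
        contenders = random.sample(population, k)
        parents.append(max(contenders, key=fitness))
    return elites, parents
```

Tournament selection keeps selection pressure mild (weak variants can still win a small tournament), while elitism guarantees the best prompt found so far is never lost between generations.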

Tech Stack

Python implements the genetic algorithm with configurable mutation operators. DSPy provides the baseline comparison and some mutation primitives. DeepEval scores prompt variants against task-specific metrics. Streamlit visualizes the evolution process with generation-over-generation fitness plots. Redis queues parallel evaluation jobs.
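A minimal sketch of how evaluation jobs might be queued through Redis. The queue name and job payload shape here are assumptions for illustration, not the project's actual schema; workers would pop jobs with `BRPOP` and write scores back.

```python
import json

def eval_job(prompt_id: str, prompt_text: str, task: str) -> str:
    """Serialize one evaluation job (hypothetical payload shape)."""
    return json.dumps({"id": prompt_id, "prompt": prompt_text, "task": task})

def enqueue(client, job: str, queue: str = "prompt_evals") -> None:
    """Push a job onto a Redis list; parallel workers consume it."""
    client.lpush(queue, job)
```

Using a plain Redis list as a work queue keeps the fan-out simple: one producer pushes the 20 variants per generation, and 8 workers pull jobs concurrently.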

Key Findings

The most important insights from this experiment.

1

15% average accuracy improvement over human-written prompts

Across 5 tasks, the optimizer improved accuracy by 15% on average compared to carefully hand-written prompts. The largest gain was in JSON schema compliance (+23%), where the optimizer discovered that adding negative examples dramatically reduced formatting errors.

2

Prompt structure matters more than wording

The optimizer consistently converged on structural patterns — instruction ordering, example placement, constraint specificity — rather than clever wording. This suggests that prompt engineering is more about information architecture than creative writing.

3

Overfitting is a real risk

Without a held-out test set, optimized prompts achieved 28% improvement on the training eval but only 8% on unseen examples. The held-out set is essential — prompt overfitting mirrors model overfitting.
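The held-out protocol is simple to implement. A minimal sketch, assuming the eval suite is a flat list of examples: shuffle once with a fixed seed, reserve a slice the optimizer never sees, and compute fitness only on the rest.

```python
import random

def split_eval_set(examples, holdout_frac=0.3, seed=42):
    """Reserve a fixed held-out slice; the optimizer only ever
    scores prompts against the returned training portion."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_frac))
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

The fixed seed matters: reshuffling between generations would leak held-out examples into the fitness signal and reintroduce exactly the overfitting the split is meant to catch.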

4

DSPy provides a strong baseline

Compared against DSPy's automatic prompt optimization, the genetic approach achieved similar results on simple tasks but outperformed on complex multi-step prompts, where DSPy's optimizers struggled with discrete structural changes.


Architecture

The optimization loop runs: (1) Initialize population of 20 prompt variants from the base prompt via diverse mutations. (2) Evaluate each variant against the scoring function using DeepEval. (3) Select parents via tournament selection. (4) Generate offspring via crossover and mutation. (5) Replace bottom performers with offspring. (6) Repeat for 15 generations. The best prompt is validated against a held-out test set. The Streamlit dashboard shows real-time fitness curves, mutation type effectiveness, and prompt diff visualization.
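Steps (1) through (6) can be condensed into a loop skeleton. This is a sketch, not the production code: `score`, `mutate`, and `crossover` are stand-ins for the DeepEval scorer and the mutation/crossover operators described above.

```python
import random

def optimize(base_prompt, score, mutate, crossover,
             pop_size=20, generations=15, n_elite=2):
    """Skeleton of the optimization loop, steps (1)-(6)."""
    population = [mutate(base_prompt) for _ in range(pop_size)]   # (1)
    for _ in range(generations):                                  # (6)
        ranked = sorted(population, key=score, reverse=True)      # (2)
        elites = ranked[:n_elite]
        offspring = []
        while len(offspring) < pop_size - n_elite:
            p1 = max(random.sample(population, 3), key=score)     # (3)
            p2 = max(random.sample(population, 3), key=score)
            offspring.append(mutate(crossover(p1, p2)))           # (4)
        population = elites + offspring                           # (5)
    return max(population, key=score)
```

Because elites carry over untouched, the best fitness seen so far is monotonically non-decreasing across generations, which is what the dashboard's fitness curves track.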

Results

15% average accuracy improvement across 5 tasks. Best single improvement: +23% on JSON schema compliance. Held-out generalization gap: 7% average (28% training improvement, 21% test improvement with proper regularization). Each optimization run takes ~45 minutes and costs ~$12 in API calls. Outperformed DSPy on 3 of 5 tasks.

Challenges

Key technical challenges encountered during this experiment.

Challenge 1

Evaluation cost and latency

20 variants x 50 eval examples x 15 generations = 15,000 LLM calls per optimization run. Implemented smart caching (unchanged prompt sections reuse cached scores) and parallel evaluation (8 concurrent workers) to reduce wall time from 6 hours to 45 minutes.
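The section-level caching can be sketched as a hash-keyed memo. A simplified illustration under one assumption: sections that are byte-identical across variants would earn identical scores, so their evaluations can be reused.

```python
import hashlib

_score_cache = {}

def cached_score(section_text, task_id, score_fn):
    """Reuse scores for prompt sections unchanged between generations;
    the cache key hashes the section text together with the task."""
    key = hashlib.sha256(f"{task_id}:{section_text}".encode()).hexdigest()
    if key not in _score_cache:
        _score_cache[key] = score_fn(section_text)
    return _score_cache[key]
```

Since crossover and elitism copy many sections verbatim between generations, the cache hit rate grows as the population converges, which is where most of the 6-hour-to-45-minute reduction comes from.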

Challenge 2

Mutation quality variance

Random mutations often produce nonsensical prompts that waste evaluation budget. Implemented a pre-filter that uses a fast model (GPT-3.5) to score mutation coherence before expensive evaluation, rejecting ~30% of mutations early.
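The pre-filter reduces to a threshold gate. A minimal sketch: `coherence_fn` stands in for the cheap-model judge call (which would prompt GPT-3.5 to rate a mutant's coherence on a 0-1 scale); the threshold value is an assumption.

```python
def prefilter(mutants, coherence_fn, threshold=0.6):
    """Drop mutated prompts the cheap judge scores as incoherent,
    so the expensive evaluation budget is spent only on survivors."""
    kept, rejected = [], []
    for m in mutants:
        (kept if coherence_fn(m) >= threshold else rejected).append(m)
    return kept, rejected
```

Rejecting ~30% of mutants with a fast model is a good trade whenever the judge call costs far less than a full 50-example evaluation pass.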
