# The Decision We Keep Getting Wrong
The most expensive mistake we see teams make with LLMs isn't choosing the wrong model; it's fine-tuning when they should be prompt engineering, or prompt engineering when they should be fine-tuning. After shipping fine-tuned models for legal document analysis, customer support routing, and medical coding, we've developed a practical framework for making this decision.
## When Fine-Tuning Actually Makes Sense
Fine-tuning is justified when at least two of these conditions are true:
| Condition | Why It Matters |
|---|---|
| Consistent output format required | Prompt engineering struggles with strict schema adherence across edge cases |
| Domain-specific vocabulary or reasoning | General models hallucinate domain terms or apply incorrect reasoning patterns |
| Latency budget under 500ms | Smaller fine-tuned models can hit tight latency targets that larger prompted models cannot |
| Cost per request must decrease | A fine-tuned 8B model is 50-100x cheaper than prompting GPT-4o |
| Thousands of labeled examples available | Fine-tuning without sufficient data produces worse results than prompting |
If none of these apply, start with prompt engineering. If only one applies, try few-shot prompting with structured output first. We've seen teams burn months fine-tuning a model that a well-crafted system prompt could have handled.
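The two-condition rule above can be encoded as a quick triage helper. This is a hypothetical sketch; the condition names are ours, not part of any framework:

```python
# Conditions from the table above, as short labels.
CONDITIONS = {
    "strict_output_format",
    "domain_vocabulary",
    "latency_under_500ms",
    "cost_must_decrease",
    "thousands_of_labels",
}

def recommend_approach(met: set[str]) -> str:
    """Coarse recommendation based on how many conditions hold."""
    unknown = met - CONDITIONS
    if unknown:
        raise ValueError(f"Unknown conditions: {unknown}")
    if len(met) >= 2:
        return "fine-tune"
    if len(met) == 1:
        return "few-shot prompting with structured output"
    return "prompt engineering"
```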
## Fine-Tuning Doesn't Add Knowledge
Fine-tuning adjusts a model's behavior and style; it doesn't reliably inject new factual knowledge. If your use case requires the model to know about your proprietary product catalog or internal policies, RAG (retrieval-augmented generation) is the correct approach, potentially combined with fine-tuning for output formatting.
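As a minimal sketch of that combination: retrieved documents go into the prompt, while the model (fine-tuned or prompted) handles the answer formatting. The retrieval step itself is elided, and the function name is illustrative:

```python
def build_rag_messages(query: str, retrieved_chunks: list[str]) -> list[dict[str, str]]:
    """Assemble chat messages that ground the model in retrieved documents."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return [
        {"role": "system", "content": "Answer using only the provided context. "
                                      "Cite sources by their [n] markers."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
```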
## Data Preparation: The Unglamorous Foundation
Data quality determines fine-tuning success more than any hyperparameter choice. We spend 60-70% of total project time on data preparation.
### Building the Training Set
```python
import hashlib
import json
from dataclasses import dataclass, field
from pathlib import Path

from pydantic import BaseModel, validator


class TrainingExample(BaseModel):
    """A single training example in chat format."""

    messages: list[dict[str, str]]

    @validator("messages")
    def validate_messages(cls, v):
        roles = [m["role"] for m in v]
        # Must start with system or user
        if roles[0] not in ("system", "user"):
            raise ValueError("Conversation must start with system or user message")
        # Must end with assistant
        if roles[-1] != "assistant":
            raise ValueError("Conversation must end with assistant message")
        # No empty content
        for msg in v:
            if not msg["content"].strip():
                raise ValueError(f"Empty content in {msg['role']} message")
        return v


@dataclass
class DatasetBuilder:
    """Builds and validates a fine-tuning dataset with deduplication,
    quality filtering, and train/eval splitting."""

    examples: list[TrainingExample] = field(default_factory=list)
    seen_hashes: set[str] = field(default_factory=set)
    rejected: list[dict] = field(default_factory=list)

    def add_example(self, raw: dict) -> bool:
        # Deduplicate by content hash
        content_hash = hashlib.sha256(
            json.dumps(raw, sort_keys=True).encode()
        ).hexdigest()[:16]
        if content_hash in self.seen_hashes:
            self.rejected.append({"reason": "duplicate", "hash": content_hash})
            return False

        try:
            example = TrainingExample(**raw)
        except Exception as e:
            self.rejected.append({"reason": str(e), "data": raw})
            return False

        # Quality filters
        assistant_msgs = [m for m in example.messages if m["role"] == "assistant"]
        avg_response_len = sum(len(m["content"]) for m in assistant_msgs) / len(assistant_msgs)
        if avg_response_len < 20:
            self.rejected.append({"reason": "response_too_short", "hash": content_hash})
            return False
        if avg_response_len > 8000:
            self.rejected.append({"reason": "response_too_long", "hash": content_hash})
            return False

        self.seen_hashes.add(content_hash)
        self.examples.append(example)
        return True

    def export(self, output_dir: Path, eval_ratio: float = 0.1):
        """Export train/eval split as JSONL files."""
        import random

        random.seed(42)
        indices = list(range(len(self.examples)))
        random.shuffle(indices)
        split_point = int(len(indices) * (1 - eval_ratio))
        train_indices = indices[:split_point]
        eval_indices = indices[split_point:]

        output_dir.mkdir(parents=True, exist_ok=True)
        for name, idx_set in [("train", train_indices), ("eval", eval_indices)]:
            path = output_dir / f"{name}.jsonl"
            with open(path, "w") as f:
                for i in idx_set:
                    f.write(self.examples[i].json() + "\n")
            print(f"{name}: {len(idx_set)} examples -> {path}")

        # Write rejection log
        with open(output_dir / "rejected.json", "w") as f:
            json.dump(self.rejected, f, indent=2)
        print(f"Rejected: {len(self.rejected)} examples")
```

### Data Quality Checklist
Before every fine-tuning run, we verify:
- No data leakage: Eval examples must not appear in training data (we hash-check this).
- Label consistency: Two annotators should agree on at least 90% of examples. Below that, the labels are too noisy for the model to learn a coherent pattern.
- Distribution match: Training data should roughly match the production distribution of categories. Severe imbalance teaches the model to over-predict the majority class.
- No PII in training data: We run a PII detection pass and redact before training. Fine-tuned models can memorize and regurgitate training data.
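The leakage check in the first bullet can be as simple as comparing content hashes, using the same hashing scheme as the DatasetBuilder above:

```python
import hashlib
import json

def content_hash(example: dict) -> str:
    """Stable hash of an example's full content."""
    return hashlib.sha256(json.dumps(example, sort_keys=True).encode()).hexdigest()[:16]

def find_leaked(train: list[dict], eval_set: list[dict]) -> list[dict]:
    """Return eval examples whose content also appears in the training set."""
    train_hashes = {content_hash(e) for e in train}
    return [e for e in eval_set if content_hash(e) in train_hashes]
```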
## Training Strategies: LoRA vs. QLoRA vs. Full Fine-Tuning
| Method | VRAM Required (7B model) | Training Speed | Quality | Best For |
|---|---|---|---|---|
| Full fine-tune | 60+ GB | Slow | Highest | When you have budget and >50k examples |
| LoRA (r=16) | 16-24 GB | Fast | Very good | Most production use cases |
| QLoRA (4-bit) | 6-10 GB | Medium | Good | Experimentation, constrained hardware |
| Adapter fusion | 20-30 GB | Medium | Very good | Multi-task models |
We default to LoRA for production fine-tuning. The quality gap between LoRA and full fine-tuning is negligible for most tasks, and the resource savings are dramatic.
```python
from dataclasses import dataclass

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer


@dataclass
class FinetuneConfig:
    base_model: str = "meta-llama/Llama-3.1-8B-Instruct"
    dataset_path: str = "data/prepared/train.jsonl"
    eval_path: str = "data/prepared/eval.jsonl"
    output_dir: str = "outputs/legal-classifier-v3"
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    epochs: int = 3
    batch_size: int = 4
    gradient_accumulation: int = 8
    learning_rate: float = 2e-4
    max_seq_length: int = 2048
    use_4bit: bool = False


def train(config: FinetuneConfig):
    quantization_config = None
    if config.use_4bit:
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )

    model = AutoModelForCausalLM.from_pretrained(
        config.base_model,
        quantization_config=quantization_config,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    if config.use_4bit:
        model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=config.lora_r,
        lora_alpha=config.lora_alpha,
        lora_dropout=config.lora_dropout,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    tokenizer = AutoTokenizer.from_pretrained(config.base_model)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    training_args = TrainingArguments(
        output_dir=config.output_dir,
        num_train_epochs=config.epochs,
        per_device_train_batch_size=config.batch_size,
        gradient_accumulation_steps=config.gradient_accumulation,
        learning_rate=config.learning_rate,
        weight_decay=0.01,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        logging_steps=10,
        save_strategy="epoch",
        evaluation_strategy="epoch",
        bf16=True,
        gradient_checkpointing=True,
        report_to="wandb",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    )

    # Our JSONL files hold chat-format "messages"; recent trl versions apply
    # the tokenizer's chat template to conversational datasets automatically.
    train_dataset = load_dataset("json", data_files=config.dataset_path, split="train")
    eval_dataset = load_dataset("json", data_files=config.eval_path, split="train")

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        max_seq_length=config.max_seq_length,
    )
    trainer.train()
    trainer.save_model(f"{config.output_dir}/final")
    return trainer.state.best_metric
```

### Start With QLoRA for Experimentation
Use QLoRA (4-bit quantization) during experimentation to iterate quickly on data quality and hyperparameters. Once you've found a configuration that works, switch to LoRA or full fine-tuning for the production model. The quality difference between QLoRA and LoRA is usually 1-3% on task-specific benchmarks, but QLoRA uses 60% less VRAM.
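With the FinetuneConfig above, that switch is a one-field change. A sketch of the two-phase workflow, using a trimmed stand-in config for illustration:

```python
from dataclasses import dataclass, replace

@dataclass
class PhaseConfig:
    """Trimmed stand-in for the full FinetuneConfig; only the fields that differ."""
    output_dir: str = "outputs/experiment"
    use_4bit: bool = True   # QLoRA: 4-bit base weights for cheap iteration
    lora_r: int = 16

experiment = PhaseConfig()
# Same data, same LoRA rank; only the quantization and output path change.
production = replace(experiment, use_4bit=False, output_dir="outputs/production")
```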
## Evaluation Beyond Loss Curves
Training loss going down doesn't mean your model is good. We evaluate fine-tuned models on three axes:
### Task-specific accuracy
For classification tasks, we measure precision, recall, and F1 per class. For generation tasks, we use a combination of automated metrics (ROUGE, BERTScore) and LLM-as-judge evaluation where a stronger model grades the output.
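For the classification case, the per-class numbers need no library; a minimal dependency-free version:

```python
def per_class_f1(y_true: list[str], y_pred: list[str]) -> dict[str, dict[str, float]]:
    """Precision, recall, and F1 for each class that appears in labels or predictions."""
    scores = {}
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```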
### Regression testing
We maintain a set of "golden" examples that the base model handles correctly. If fine-tuning degrades performance on these, the model has overfit to the fine-tuning distribution at the expense of general capability. More than 5% regression on golden examples is a red flag.
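A sketch of that check, assuming per-example pass/fail results keyed by example id (names hypothetical):

```python
def golden_regression(base: dict[str, bool], tuned: dict[str, bool],
                      threshold: float = 0.05) -> tuple[float, bool]:
    """Fraction of base-model passes the tuned model now fails, plus a red-flag bool."""
    regressed = [k for k, ok in base.items() if ok and not tuned.get(k, False)]
    passed_before = sum(base.values())
    rate = len(regressed) / passed_before if passed_before else 0.0
    return rate, rate > threshold
```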
### Adversarial testing
We test with out-of-distribution inputs, prompt injections, and edge cases. Fine-tuned models can become brittle, confidently producing wrong outputs for inputs slightly outside the training distribution.
## Deployment and Cost Analysis
### Serving Infrastructure
We serve fine-tuned models using vLLM behind a FastAPI gateway. For models under 13B parameters, a single A10G instance handles 50-100 requests/second depending on sequence length.
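A sketch of the serving setup, assuming the LoRA adapter has already been merged into the base weights (paths are illustrative; check your vLLM version's flags):

```shell
# Serve the merged model with vLLM's OpenAI-compatible server;
# the FastAPI gateway proxies requests to this endpoint.
vllm serve outputs/legal-classifier-v3/merged \
    --max-model-len 2048 \
    --gpu-memory-utilization 0.90 \
    --port 8001
```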
### Cost Comparison: Real Numbers
These are actual costs from a legal document classification project processing 500k documents/month (pricing as of Q4 2024):
| Approach | Cost/Month | Accuracy | P95 Latency |
|---|---|---|---|
| GPT-4o with 8-shot prompt | $4,200 | 93.1% | 2,400ms |
| GPT-4o-mini with 8-shot prompt | $310 | 87.4% | 890ms |
| Fine-tuned Llama 3.1 8B (LoRA) | $180 (GPU hosting) | 94.7% | 120ms |
| Fine-tuned Llama 3.1 8B (QLoRA) | $180 (GPU hosting) | 93.2% | 125ms |
The fine-tuned 8B model is cheaper, faster, and more accurate than prompting GPT-4o for this specific task. That's the sweet spot for fine-tuning: narrow, well-defined tasks with sufficient training data.
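The per-document arithmetic behind that table, at 500k documents/month:

```python
MONTHLY_DOCS = 500_000

# Monthly costs from the comparison table above, in USD.
monthly_cost = {
    "gpt-4o_8shot": 4200.0,
    "gpt-4o-mini_8shot": 310.0,
    "finetuned_8b_lora": 180.0,
}

cost_per_doc = {name: cost / MONTHLY_DOCS for name, cost in monthly_cost.items()}
savings_vs_gpt4o = monthly_cost["gpt-4o_8shot"] / monthly_cost["finetuned_8b_lora"]
# GPU hosting is flat-rate, so the fine-tuned model's per-doc cost keeps
# falling as volume grows, while API costs scale linearly with requests.
```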
## When Our Fine-Tuning Failed
Not every project succeeded. We attempted to fine-tune a model for open-ended legal reasoning and got worse results than prompting a frontier model. The task required broad world knowledge and multi-step reasoning that a smaller model simply couldn't replicate, regardless of training data quality. We reverted to Claude 3.5 Sonnet with a carefully engineered prompt chain and RAG: more expensive per request, but actually correct.
## The Iterative Process
Fine-tuning is never one-and-done. Our production models go through a continuous improvement cycle:
- Monitor production predictions: log inputs, outputs, confidence scores, and user feedback.
- Identify failure modes: cluster low-confidence predictions and user corrections weekly.
- Augment training data: add corrected examples to the training set, re-validate distribution.
- Retrain and evaluate: run the full pipeline, compare against the current production model.
- Deploy behind feature flag: shadow test before promoting to primary.
Each cycle typically improves task accuracy by 1-3 percentage points until diminishing returns set in around 95-97% for most classification tasks.
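The failure-mode step of that cycle, sketched: "cluster" here just means grouping flagged predictions by label for weekly review (field names hypothetical):

```python
def select_for_review(logs: list[dict], confidence_threshold: float = 0.7) -> dict[str, list[dict]]:
    """Group low-confidence or user-corrected predictions by predicted label."""
    flagged: dict[str, list[dict]] = {}
    for record in logs:
        if record["confidence"] < confidence_threshold or record.get("user_corrected"):
            flagged.setdefault(record["prediction"], []).append(record)
    return flagged
```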
The models that work in production are built on clean data, rigorous evaluation, and disciplined deployment, not clever hyperparameter tuning. We've seen teams spend weeks optimizing learning rates while their training data contained 15% mislabeled examples. Fix the data first. The model will follow.