Document Vision Parser
A template-free document extraction system that uses vision LLMs to understand document layouts and extract structured fields. Handles receipts, invoices, tax forms, and business cards without pre-defined templates. Outputs clean JSON with confidence scores and bounding box annotations.
Category: Vision
Status: Live
Tech Stack: GPT-4o Vision, Claude Vision, FastAPI, Pydantic, React
Traditional OCR requires templates for every document type. This experiment replaces template-based extraction with vision LLMs that understand document layout and semantics natively. Upload a receipt, invoice, or tax form — get clean structured JSON with field labels, values, confidence scores, and bounding box coordinates, without ever defining a template.
I tested GPT-4o Vision and Claude Vision on 500 real-world documents across 8 categories: receipts (120), invoices (100), tax forms (80), business cards (60), bank statements (50), purchase orders (40), contracts (30), and shipping labels (20). Ground truth was manually annotated with field names, values, and bounding boxes. I compared against AWS Textract and Google Document AI as baselines. Extraction quality was measured on field-level F1 score, and I tracked processing time and cost per document.
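Field-level F1 here means a field counts as correct only when both its name and its value match the ground-truth annotation. A minimal sketch of that metric, assuming predicted and gold fields are plain name-to-value dicts (the real evaluation may normalize values before comparing):

```python
def field_f1(predicted: dict, gold: dict) -> float:
    """Field-level F1: a predicted field is a true positive only if
    the gold annotation has the same name with the same value."""
    tp = sum(1 for name, value in predicted.items() if gold.get(name) == value)
    fp = len(predicted) - tp                               # wrong or invented fields
    fn = sum(1 for name in gold if predicted.get(name) != gold[name])  # missed fields
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# One correct field, one wrong value: precision = recall = 0.5
score = field_f1({"total": "9.99", "date": "2024-01-01"},
                 {"total": "9.99", "date": "2024-01-02"})  # 0.5
```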
GPT-4o Vision and Claude Vision handle document understanding. FastAPI serves the extraction API with async processing for batch uploads. Pydantic enforces output schemas and validates extracted fields. React with Canvas API renders bounding box overlays on original documents for verification.
The most important insights from this experiment.
96% field accuracy without any templates
GPT-4o Vision achieved 96% field-level F1 across all document types with zero template configuration — matching Textract's accuracy on receipts while significantly outperforming it on unusual layouts.
Vision LLMs handle rotated and skewed documents
Unlike traditional OCR, vision models maintained >90% accuracy on documents photographed at angles up to 30 degrees. Textract accuracy dropped to 67% on the same rotated inputs.
Structured output schemas prevent hallucination
Using Pydantic response schemas with field-level validation reduced hallucinated fields from 8% to 1.2%. The model is less likely to invent data when constrained to a predefined output structure.
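A constrained schema of this kind might look like the following sketch. The model names and fields (`ReceiptFields`, `merchant`, `total`, `date`) are illustrative, not the project's actual schema; the key idea is that `extra="forbid"` makes any hallucinated field fail validation instead of silently passing through:

```python
from pydantic import BaseModel, Field, ValidationError

class ExtractedField(BaseModel):
    """One extracted field with a confidence score in [0, 1]."""
    value: str
    confidence: float = Field(ge=0.0, le=1.0)

class ReceiptFields(BaseModel):
    """Response schema for receipts (hypothetical field set)."""
    model_config = {"extra": "forbid"}  # invented fields raise ValidationError
    merchant: ExtractedField
    total: ExtractedField
    date: ExtractedField

raw = {
    "merchant": {"value": "ACME Corp", "confidence": 0.98},
    "total": {"value": "42.17", "confidence": 0.95},
    "date": {"value": "2024-06-01", "confidence": 0.91},
}
receipt = ReceiptFields.model_validate(raw)
```

If the model's JSON includes a key outside the schema, `model_validate` raises rather than returning the fabricated field, which is what drives the drop in hallucinated output.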
Cost competitive at moderate scale
At $0.03 per document, the vision approach costs roughly twice Textract's per-page price but requires zero template development time. Break-even lands at approximately 50 document types; beyond that, maintaining templates costs more than the per-document premium.
Documents are uploaded to the FastAPI backend, which preprocesses images (orientation detection, contrast normalization) and sends them to the vision model with a structured extraction prompt specifying the output schema. The model response is validated against the Pydantic schema. Confidence scores are computed by running extraction twice and comparing field agreement. Finally, bounding box coordinates are computed via a secondary vision prompt focused on spatial localization, and the result is returned with fields, confidences, and boxes together.
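The agreement-based confidence step can be sketched as below. This simplified version scores each field binary (stable across both runs or not); the actual system may grade agreement more finely, e.g. with fuzzy string matching:

```python
def agreement_confidence(run_a: dict, run_b: dict) -> dict:
    """Score each field by whether two independent extraction runs
    returned the same value: identical -> 1.0, disagreement (or a
    field present in only one run) -> 0.0, flagged for human review."""
    scores = {}
    for name in set(run_a) | set(run_b):
        scores[name] = 1.0 if run_a.get(name) == run_b.get(name) else 0.0
    return scores
```

A field the model hallucinates tends not to reproduce identically across runs, so this check doubles as a hallucination filter alongside schema validation.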
96% field-level F1 across 500 documents and 8 categories. 1.8s average processing time per document. $0.03 average cost per document. Zero template maintenance overhead. Confidence scoring correctly flagged 89% of extraction errors for human review.
Key technical challenges encountered during this experiment.
Bounding box precision
Vision models return approximate spatial descriptions, not pixel coordinates. Built a secondary pipeline that crops predicted regions and runs a focused extraction to refine bounding box coordinates to within 5px accuracy.
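The coordinate mapping in that refinement step can be sketched as pure geometry. The `locate_in_crop` callable stands in for the hypothetical focused vision call, which returns coordinates relative to the cropped region:

```python
def refine_bbox(image_size, approx, locate_in_crop, pad=40):
    """Clamp a padded crop window around the model's approximate box,
    ask a focused locator for coordinates relative to that crop, and
    map them back into full-page pixel coordinates."""
    width, height = image_size
    x0, y0, x1, y1 = approx
    # Padded crop window, clamped to the page bounds.
    cx0, cy0 = max(x0 - pad, 0), max(y0 - pad, 0)
    cx1, cy1 = min(x1 + pad, width), min(y1 + pad, height)
    # Second, focused vision call on the crop (stubbed via a callable).
    rx0, ry0, rx1, ry1 = locate_in_crop((cx0, cy0, cx1, cy1))
    # Translate crop-relative coordinates back to the full page.
    return (cx0 + rx0, cy0 + ry0, cx0 + rx1, cy0 + ry1)
```

Working on a small crop gives the model far fewer pixels to describe, which is what makes the within-5px refinement plausible.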
Multi-page document handling
Invoices spanning 3+ pages exceeded single-image context limits. Implemented page-level extraction with a merge step that deduplicates fields and resolves cross-page references (e.g., "continued on next page" line items).
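The merge step might look like the following sketch. It assumes scalar fields repeated across pages (invoice number in a running header, say) should keep the first non-empty value, and that exact-duplicate line items are artifacts of "continued" rows rather than legitimate repeats:

```python
def merge_pages(pages: list[dict]) -> dict:
    """Merge per-page extractions into one document record:
    scalar fields keep the first non-empty value seen; line items
    are concatenated with exact-duplicate rows dropped."""
    merged = {"line_items": []}
    for page in pages:
        for name, value in page.items():
            if name == "line_items":
                for item in value:
                    if item not in merged["line_items"]:
                        merged["line_items"].append(item)
            elif name not in merged or not merged[name]:
                merged[name] = value
    return merged
```

A page-spanning invoice whose header repeats on every page thus collapses to one record with a single concatenated item table.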