Document Vision Parser
A template-free document extraction system that uses vision LLMs to understand document layouts and extract structured fields. Handles receipts, invoices, tax forms, and business cards without pre-defined templates. Outputs clean JSON with confidence scores and bounding box annotations.
Category: Vision
Status: Live
Tech Stack: GPT-4o Vision, Claude Vision, FastAPI, Pydantic, React
Traditional OCR requires templates for every document type. This experiment replaces template-based extraction with vision LLMs that understand document layout and semantics natively. Upload a receipt, invoice, or tax form — get clean structured JSON with field labels, values, confidence scores, and bounding box coordinates, without ever defining a template.
I tested GPT-4o Vision and Claude Vision on 500 real-world documents across 8 categories: receipts (120), invoices (100), tax forms (80), business cards (60), bank statements (50), purchase orders (40), contracts (30), and shipping labels (20). Ground truth was manually annotated with field names, values, and bounding boxes. I compared against AWS Textract and Google Document AI as baselines. Extraction quality was measured on field-level F1 score, and I tracked processing time and cost per document.
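Field-level F1 here means a field counts as correct only when both its name and its value match the ground-truth annotation. A minimal sketch of that metric, assuming predicted and gold fields are plain name-to-value dicts (the real evaluation may normalize values before comparing):

```python
def field_f1(predicted: dict, gold: dict) -> float:
    """Field-level F1: a predicted field is a true positive only if
    the gold annotation has the same name with the same value."""
    tp = sum(1 for name, value in predicted.items() if gold.get(name) == value)
    fp = len(predicted) - tp                               # wrong or invented fields
    fn = sum(1 for name in gold if predicted.get(name) != gold[name])  # missed fields
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# One correct field, one wrong value: precision = recall = 0.5
score = field_f1({"total": "9.99", "date": "2024-01-01"},
                 {"total": "9.99", "date": "2024-01-02"})  # 0.5
```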
GPT-4o Vision and Claude Vision handle document understanding. FastAPI serves the extraction API with async processing for batch uploads. Pydantic enforces output schemas and validates extracted fields. React with Canvas API renders bounding box overlays on original documents for verification.
The most important insights from this experiment.
96% field accuracy without any templates
GPT-4o Vision achieved 96% field-level F1 across all document types with zero template configuration — matching Textract's accuracy on receipts while significantly outperforming it on unusual layouts.
Vision LLMs handle rotated and skewed documents
Unlike traditional OCR, vision models maintained >90% accuracy on documents photographed at angles up to 30 degrees. Textract accuracy dropped to 67% on the same rotated inputs.
Structured output schemas prevent hallucination
Using Pydantic response schemas with field-level validation reduced hallucinated fields from 8% to 1.2%. The model is less likely to invent data when constrained to a predefined output structure.
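A constrained schema of this kind might look like the following sketch. The model names and fields (`ReceiptFields`, `merchant`, `total`, `date`) are illustrative, not the project's actual schema; the key idea is that `extra="forbid"` makes any hallucinated field fail validation instead of silently passing through:

```python
from pydantic import BaseModel, Field, ValidationError

class ExtractedField(BaseModel):
    """One extracted field with a confidence score in [0, 1]."""
    value: str
    confidence: float = Field(ge=0.0, le=1.0)

class ReceiptFields(BaseModel):
    """Response schema for receipts (hypothetical field set)."""
    model_config = {"extra": "forbid"}  # invented fields raise ValidationError
    merchant: ExtractedField
    total: ExtractedField
    date: ExtractedField

raw = {
    "merchant": {"value": "ACME Corp", "confidence": 0.98},
    "total": {"value": "42.17", "confidence": 0.95},
    "date": {"value": "2024-06-01", "confidence": 0.91},
}
receipt = ReceiptFields.model_validate(raw)
```

If the model's JSON includes a key outside the schema, `model_validate` raises rather than returning the fabricated field, which is what drives the drop in hallucinated output.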
Cost competitive at moderate scale
At $0.03 per document, the vision approach costs roughly twice Textract's per-page price but requires zero template development time. Break-even lands at approximately 50 document types; beyond that, maintaining templates costs more than the per-document premium.
Documents are uploaded to the FastAPI backend, which preprocesses images (orientation detection, contrast normalization) and sends them to the vision model with a structured extraction prompt specifying the output schema. The model response is validated against the Pydantic schema. Confidence scores are computed by running extraction twice and comparing field agreement. Finally, bounding box coordinates are computed via a secondary vision prompt focused on spatial localization, and the result is returned with fields, confidences, and boxes together.
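The agreement-based confidence step can be sketched as below. This simplified version scores each field binary (stable across both runs or not); the actual system may grade agreement more finely, e.g. with fuzzy string matching:

```python
def agreement_confidence(run_a: dict, run_b: dict) -> dict:
    """Score each field by whether two independent extraction runs
    returned the same value: identical -> 1.0, disagreement (or a
    field present in only one run) -> 0.0, flagged for human review."""
    scores = {}
    for name in set(run_a) | set(run_b):
        scores[name] = 1.0 if run_a.get(name) == run_b.get(name) else 0.0
    return scores
```

A field the model hallucinates tends not to reproduce identically across runs, so this check doubles as a hallucination filter alongside schema validation.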
96% field-level F1 across 500 documents and 8 categories. 1.8s average processing time per document. $0.03 average cost per document. Zero template maintenance overhead. Confidence scoring correctly flagged 89% of extraction errors for human review.
Key technical challenges encountered during this experiment.
Bounding box precision
Vision models return approximate spatial descriptions, not pixel coordinates. Built a secondary pipeline that crops predicted regions and runs a focused extraction to refine bounding box coordinates to within 5px accuracy.
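The coordinate mapping in that refinement step can be sketched as pure geometry. The `locate_in_crop` callable stands in for the hypothetical focused vision call, which returns coordinates relative to the cropped region:

```python
def refine_bbox(image_size, approx, locate_in_crop, pad=40):
    """Clamp a padded crop window around the model's approximate box,
    ask a focused locator for coordinates relative to that crop, and
    map them back into full-page pixel coordinates."""
    width, height = image_size
    x0, y0, x1, y1 = approx
    # Padded crop window, clamped to the page bounds.
    cx0, cy0 = max(x0 - pad, 0), max(y0 - pad, 0)
    cx1, cy1 = min(x1 + pad, width), min(y1 + pad, height)
    # Second, focused vision call on the crop (stubbed via a callable).
    rx0, ry0, rx1, ry1 = locate_in_crop((cx0, cy0, cx1, cy1))
    # Translate crop-relative coordinates back to the full page.
    return (cx0 + rx0, cy0 + ry0, cx0 + rx1, cy0 + ry1)
```

Working on a small crop gives the model far fewer pixels to describe, which is what makes the within-5px refinement plausible.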
Multi-page document handling
Invoices spanning 3+ pages exceeded single-image context limits. Implemented page-level extraction with a merge step that deduplicates fields and resolves cross-page references (e.g., "continued on next page" line items).
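The merge step might look like the following sketch. It assumes scalar fields repeated across pages (invoice number in a running header, say) should keep the first non-empty value, and that exact-duplicate line items are artifacts of "continued" rows rather than legitimate repeats:

```python
def merge_pages(pages: list[dict]) -> dict:
    """Merge per-page extractions into one document record:
    scalar fields keep the first non-empty value seen; line items
    are concatenated with exact-duplicate rows dropped."""
    merged = {"line_items": []}
    for page in pages:
        for name, value in page.items():
            if name == "line_items":
                for item in value:
                    if item not in merged["line_items"]:
                        merged["line_items"].append(item)
            elif name not in merged or not merged[name]:
                merged[name] = value
    return merged
```

A page-spanning invoice whose header repeats on every page thus collapses to one record with a single concatenated item table.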