PDF → Knowledge Graph
An automated pipeline that extracts entities, relationships, and claims from PDF documents, constructs a navigable knowledge graph, and provides a visual explorer. It uses LLM-powered entity extraction with Neo4j graph storage and D3.js visualization, and it works well for research papers, legal documents, and technical specs.
Category
RAG
Status
Live
Tech Stack
Models
Reading a 50-page PDF is slow. Navigating a knowledge graph is fast. This experiment builds an automated pipeline that extracts entities, relationships, and claims from PDF documents, constructs a queryable knowledge graph in Neo4j, and renders an interactive D3.js visualization. Upload a research paper and explore its concepts, methodologies, and findings as a connected graph rather than linear text.
I tested the pipeline on 40 documents: 15 research papers (ML/AI), 10 legal contracts, 10 technical specifications, and 5 annual reports. LlamaParse extracts structured text from PDFs. Claude Sonnet performs entity extraction (people, organizations, concepts, methods, metrics) and relationship extraction (uses, cites, compares, contradicts, extends) in a two-pass approach. Neo4j stores the graph with entity and relationship type indexing. I measured entity extraction F1 against human-annotated ground truth.
LlamaParse handles PDF extraction with layout-aware parsing. Claude Sonnet performs entity and relationship extraction via structured prompts. Neo4j stores the knowledge graph with full-text search and Cypher query support. D3.js renders the interactive force-directed graph visualization. Next.js serves the frontend with server-side graph queries.
The most important insights from this experiment.
Two-pass extraction improves F1 from 0.79 to 0.89
A single LLM pass that extracted entities and relationships together scored 0.79 F1. Splitting the work into entity extraction (pass 1) and relationship extraction over the already-known entities (pass 2) raised F1 to 0.89 by reducing missed relationships.
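The two-pass idea can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the prompt wording, the JSON response shape, and the `llm` callable (any function that takes a prompt string and returns the model's text) are all assumptions. Only the entity and relationship type lists come from the write-up.

```python
from __future__ import annotations

import json
from typing import Callable

# Type vocabularies taken from the write-up.
ENTITY_TYPES = ["person", "organization", "concept", "method", "metric"]
RELATION_TYPES = ["uses", "cites", "compares", "contradicts", "extends"]


def extract_entities(text: str, llm: Callable[[str], str]) -> list[dict]:
    """Pass 1: ask only for entities (hypothetical prompt and JSON shape)."""
    prompt = (
        "List the entities in the text as a JSON array of objects with "
        f"'name' and 'type' (one of {ENTITY_TYPES}).\n\nText:\n{text}"
    )
    entities = json.loads(llm(prompt))
    # Drop anything outside the allowed type vocabulary.
    return [e for e in entities if e.get("type") in ENTITY_TYPES]


def extract_relationships(
    text: str, entities: list[dict], llm: Callable[[str], str]
) -> list[dict]:
    """Pass 2: ask for relationships between the already-known entities."""
    names = [e["name"] for e in entities]
    prompt = (
        "Given these known entities, return a JSON array of objects with "
        f"'source', 'type' (one of {RELATION_TYPES}) and 'target'.\n\n"
        f"Entities: {json.dumps(names)}\n\nText:\n{text}"
    )
    triples = json.loads(llm(prompt))
    known = set(names)
    # Keep only triples whose endpoints were found in pass 1.
    return [
        t for t in triples
        if t.get("source") in known
        and t.get("target") in known
        and t.get("type") in RELATION_TYPES
    ]


def two_pass_extract(text: str, llm: Callable[[str], str]) -> dict:
    entities = extract_entities(text, llm)
    relationships = extract_relationships(text, entities, llm)
    return {"entities": entities, "relationships": relationships}
```

Passing the entity list into the second prompt lets the model focus on linking entities it has already been shown, and the final filter discards any triple that refers to an entity pass 1 did not produce.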
Knowledge graphs reveal cross-document insights
When multiple papers are ingested, the graph reveals connections the authors never made explicit: shared methodologies, contradictory findings, and citation chains that highlight research lineages.
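As a small illustration of the cross-document idea, the sketch below finds entities mentioned in more than one document. The `(doc_id, entity)` mention format and the function name are assumptions for the example. In the real pipeline this kind of question would more likely be a Cypher query against Neo4j.

```python
from collections import defaultdict


def shared_entities(mentions):
    """Map each entity to the documents mentioning it, keeping only
    entities that appear in at least two documents.

    `mentions` is an iterable of (doc_id, entity_name) pairs
    (a hypothetical input format).
    """
    docs_by_entity = defaultdict(set)
    for doc_id, entity in mentions:
        docs_by_entity[entity].add(doc_id)
    return {
        entity: sorted(docs)
        for entity, docs in docs_by_entity.items()
        if len(docs) > 1
    }
```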
Interactive visualization beats search for exploration
In user testing (n=10), participants found answers to open-ended questions 2.3x faster with the graph visualization than with keyword search in the original PDFs.
Citation linking enables provenance tracking
Every node and edge in the graph links back to the specific page and paragraph in the source PDF. Clicking a relationship opens the relevant passage, building trust in the extracted knowledge.
PDFs are uploaded and processed by LlamaParse into structured text blocks with page and position metadata. Each block is sent to Claude Sonnet for entity extraction; the extracted entities are then sent back along with the text for relationship extraction. Extracted triples (entity-relationship-entity) are written to Neo4j with source citations. The D3.js frontend queries Neo4j via a Cypher API, rendering nodes as entities and edges as relationships with type-based coloring and interactive filtering.
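Writing a triple with its source citation could look something like the sketch below, which builds a parameterized Cypher `MERGE` statement. The `Entity` label, the property names (`doc_id`, `page`, `paragraph`), and the `Triple` class are assumptions for illustration, since the actual graph schema is not described here. Values are passed as query parameters. The relationship type is interpolated into the query text, so it is checked against an allow-list first.

```python
from __future__ import annotations

from dataclasses import dataclass

# Relationship types from the write-up, upper-cased by Cypher convention.
ALLOWED_REL_TYPES = {"USES", "CITES", "COMPARES", "CONTRADICTS", "EXTENDS"}


@dataclass(frozen=True)
class Triple:
    """Hypothetical record for one extracted triple plus its citation."""
    source: str
    rel_type: str
    target: str
    doc_id: str
    page: int
    paragraph: int


def triple_to_cypher(t: Triple) -> tuple[str, dict]:
    """Return (query, params) that upsert both entities and the cited edge."""
    rel = t.rel_type.upper()
    if rel not in ALLOWED_REL_TYPES:
        # The type is spliced into the query text, so reject unknown values.
        raise ValueError(f"unsupported relationship type: {t.rel_type!r}")
    query = (
        "MERGE (s:Entity {name: $source}) "
        "MERGE (o:Entity {name: $target}) "
        f"MERGE (s)-[r:{rel} {{doc_id: $doc_id, page: $page, "
        "paragraph: $paragraph}]->(o)"
    )
    params = {
        "source": t.source,
        "target": t.target,
        "doc_id": t.doc_id,
        "page": t.page,
        "paragraph": t.paragraph,
    }
    return query, params
```

With the official Neo4j Python driver, the pair would typically be passed to `session.run(query, params)`. Using `MERGE` rather than `CREATE` keeps re-ingesting the same document from duplicating nodes or edges.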
Entity extraction F1: 0.89 across 40 documents. Relationship extraction F1: 0.82. Average processing time: 45 seconds per 20-page document. Graph visualization renders up to 500 nodes at 60fps. User testing showed 2.3x faster exploratory analysis vs keyword search.
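For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it over sets of predicted and gold items. Treating an entity as a `(name, type)` pair that must match exactly is an assumption. The write-up does not state the matching rule or how scores were averaged across documents.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Exact-match precision, recall and F1 between two sets of items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 2 of 3 predictions are correct, 2 of 4 gold entities found.
predicted = {("GPT-4", "concept"), ("BERT", "method"), ("OpenAI", "organization")}
gold = {
    ("GPT-4", "concept"),
    ("BERT", "method"),
    ("GLUE", "metric"),
    ("Google", "organization"),
}
p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```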
Key technical challenges encountered during this experiment.
Entity coreference resolution
The same entity appears under different names ("GPT-4", "GPT4", "OpenAI's latest model"). I implemented a fuzzy-matching and embedding-similarity step that merges entity variants into canonical nodes with alias lists.
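A merge step of this kind might be sketched as follows. The thresholds, the greedy first-match strategy, and the precomputed `embeddings` dictionary are all assumptions. The write-up does not say which fuzzy matcher or embedding model was used, so the standard library's `difflib.SequenceMatcher` stands in for the fuzzy part.

```python
import math
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lower-case and strip punctuation/whitespace: 'GPT-4' -> 'gpt4'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_entities(names, embeddings, fuzzy_threshold=0.9, embed_threshold=0.85):
    """Greedily merge name variants into canonical nodes with alias lists.

    A name joins the first canonical node whose representative is close
    either by string similarity or by embedding similarity.
    Both thresholds are illustrative values, not tuned ones.
    """
    canonical = []
    for name in names:
        for node in canonical:
            rep = node["name"]
            fuzzy = SequenceMatcher(None, normalize(name), normalize(rep)).ratio()
            semantic = cosine(embeddings[name], embeddings[rep])
            if fuzzy >= fuzzy_threshold or semantic >= embed_threshold:
                node["aliases"].append(name)
                break
        else:
            # No existing node was close enough: start a new canonical node.
            canonical.append({"name": name, "aliases": []})
    return canonical
```

String similarity catches spelling variants such as "GPT4", while embedding similarity is what would catch descriptive mentions such as "OpenAI's latest model", which share almost no characters with the canonical name.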
Graph layout readability at scale
Graphs with 300+ nodes became visually overwhelming. I added hierarchical clustering that groups related entities into collapsible super-nodes, with semantic zoom that reveals detail as users focus on specific regions.
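The collapsing half of that approach can be sketched as below. The sketch assumes a cluster assignment has already been computed elsewhere and shows only how collapsed clusters turn into super-nodes with aggregated edges. The function name, the `cluster:` id prefix, and the input shapes are hypothetical, and the real frontend would do this in JavaScript alongside D3.js rather than in Python.

```python
from collections import Counter


def collapse_clusters(edges, cluster_of, expanded=frozenset()):
    """Compute the visible graph for the current zoom state.

    edges      -- iterable of (source, target) node ids
    cluster_of -- dict mapping every node id to its cluster id
    expanded   -- cluster ids the user has zoomed into (shown in full)

    Nodes in collapsed clusters are replaced by one super-node per cluster.
    Edges between visible nodes are counted, and edges that fall entirely
    inside a collapsed cluster are dropped.
    """
    def visible(node):
        cluster = cluster_of[node]
        return node if cluster in expanded else f"cluster:{cluster}"

    weights = Counter()
    for source, target in edges:
        vs, vt = visible(source), visible(target)
        if vs != vt:
            weights[(vs, vt)] += 1

    nodes = sorted({visible(n) for n in cluster_of})
    return nodes, dict(weights)
```

Calling the function again with a cluster id added to `expanded` mimics semantic zoom: that cluster's members reappear as individual nodes while the rest of the graph stays collapsed.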