Voice-to-SQL Agent
A conversational interface that converts spoken questions into SQL queries, executes them against your database, and explains the results in natural language. Uses Whisper for speech recognition and GPT-4o function-calling for SQL generation with schema-aware validation.
Category: Agent
Status: Live
Tech Stack
Models
This experiment tests whether a voice-first interface to databases can make data exploration accessible to non-technical stakeholders. Speak a question like "What were our top 5 customers by revenue last quarter?" and get the SQL, the results, and a natural language explanation — all in under 2 seconds from end of speech.
I built a pipeline connecting Whisper Large-v3 for speech-to-text, GPT-4o with function-calling for SQL generation, and a React frontend for result visualization. The system was tested against a 150-question benchmark covering single-table lookups, multi-join aggregations, date-range filters, and ambiguous natural language. Each question was tested with 3 different speakers to evaluate robustness to accents and speech patterns. Schema awareness was implemented by injecting CREATE TABLE statements and sample rows into the system prompt.
Whisper Large-v3 handles speech recognition with word-level timestamps. GPT-4o generates SQL through structured function-calling with Pydantic-validated schemas. FastAPI serves the backend with connection pooling to SQLite (extendable to Postgres). The React frontend uses Web Audio API for recording and Recharts for result visualization.
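The four-stage pipeline can be sketched as a single orchestration function. This is a minimal illustration, not the production code: the stage functions (`transcribe`, `generate_sql`, `execute`, `summarize`) are hypothetical names standing in for the Whisper client, the GPT-4o function-calling request, the database executor, and the second LLM summarization call.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QueryResult:
    transcript: str
    sql: str
    rows: list
    summary: str

def voice_to_result(
    audio_bytes: bytes,
    transcribe: Callable[[bytes], str],     # e.g. a Whisper Large-v3 client
    generate_sql: Callable[[str], str],     # e.g. GPT-4o with function-calling
    execute: Callable[[str], list],         # read-only database executor
    summarize: Callable[[str, list], str],  # second LLM call for the explanation
) -> QueryResult:
    """Chain the four pipeline stages: speech -> SQL -> rows -> explanation."""
    transcript = transcribe(audio_bytes)
    sql = generate_sql(transcript)
    rows = execute(sql)
    summary = summarize(transcript, rows)
    return QueryResult(transcript, sql, rows, summary)
```

Keeping the stages as injected callables also makes each one easy to stub out in tests.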
The most important insights from this experiment.
94% query accuracy on structured questions
GPT-4o with schema injection correctly generated SQL for 94% of benchmark questions. Failures concentrated on ambiguous temporal references ("recently", "a while ago") and implicit joins.
Schema injection outperforms fine-tuning
Injecting CREATE TABLE statements with 3 sample rows per table achieved higher accuracy than a fine-tuned model trained on 500 query pairs, while being instantly adaptable to new schemas.
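Schema injection of this kind can be built directly from SQLite's `sqlite_master` catalog. A minimal sketch (function name and output format are illustrative, not the production prompt):

```python
import sqlite3

def build_schema_context(conn: sqlite3.Connection, sample_rows: int = 3) -> str:
    """Render each table's CREATE TABLE statement plus a few sample rows,
    for injection into the LLM system prompt."""
    parts = []
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for name, create_sql in tables:
        parts.append(create_sql + ";")
        # A handful of real rows shows the model value formats and conventions.
        for row in conn.execute(f"SELECT * FROM {name} LIMIT {sample_rows}"):
            parts.append(f"-- sample: {row}")
    return "\n".join(parts)
```

Because the context is rebuilt from the live database on each request, a schema change is picked up immediately, with no retraining step.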
Voice adds 300ms but improves accessibility
Whisper transcription adds ~300ms latency, but user testing showed non-technical users asked 3x more exploratory queries via voice compared to a text SQL interface.
Function-calling prevents injection attacks
Using structured function-calling instead of raw SQL string generation eliminated SQL injection vectors entirely — the model can only call pre-defined query functions with validated parameters.
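The shape of such a pre-defined query function looks roughly like this. The tool name, table names, and parameter bounds are hypothetical; the production system validates parameters with Pydantic schemas, which the plain checks below stand in for:

```python
import sqlite3

# Hypothetical tool schema exposed to the model via function-calling:
# the model supplies only a bounded integer, never SQL text.
TOP_CUSTOMERS_TOOL = {
    "name": "top_customers_by_revenue",
    "parameters": {
        "type": "object",
        "properties": {
            "limit": {"type": "integer", "minimum": 1, "maximum": 100},
        },
        "required": ["limit"],
    },
}

def top_customers_by_revenue(conn: sqlite3.Connection, limit: int) -> list:
    """Execute the fixed query behind the tool, with a validated parameter."""
    if not isinstance(limit, int) or not 1 <= limit <= 100:
        raise ValueError("limit must be an integer in [1, 100]")
    # Parameter binding, not string interpolation: no injection surface.
    return conn.execute(
        "SELECT name, SUM(amount) AS revenue FROM orders "
        "JOIN customers USING (customer_id) "
        "GROUP BY name ORDER BY revenue DESC LIMIT ?",
        (limit,),
    ).fetchall()
```

The SQL text is fixed at authoring time; the model's only degree of freedom is the bound parameter, which is why adversarial inputs have nowhere to land.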
Audio is captured in the browser via Web Audio API and streamed to the FastAPI backend. Whisper transcribes the audio, then the transcript is combined with the database schema context and sent to GPT-4o with function-calling tools defined for SELECT queries. The generated SQL is validated against an allowlist of operations (read-only), executed against the database, and results are sent back alongside a natural language summary generated by a second LLM call.
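The read-only validation step might look like the sketch below; this is a simplified stand-in for the production allowlist (restricted here to single `SELECT` statements), and in practice it would be paired with a database connection opened in read-only mode as defense in depth:

```python
import re

def validate_read_only(sql: str) -> str:
    """Reject anything that is not a single SELECT statement."""
    stripped = sql.strip().rstrip("; \n")
    if ";" in stripped:
        raise ValueError("only a single statement is allowed")
    if not re.match(r"(?i)^\s*SELECT\b", stripped):
        raise ValueError("only SELECT statements are allowed")
    return stripped
```

With SQLite, the second layer is as simple as `sqlite3.connect("file:analytics.db?mode=ro", uri=True)`, so even a query that slipped past the allowlist could not write.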
94% query accuracy across 150 benchmark questions. 1.2s average voice-to-result latency (300ms STT + 600ms LLM + 200ms execution + 100ms rendering). 100% of test users (n=8) preferred voice over writing raw SQL for exploratory analysis. Zero successful injection attempts in adversarial testing.
Key technical challenges encountered during this experiment.
Ambiguous temporal references
"Last quarter" means different things depending on fiscal calendar. Solved by adding a configurable fiscal calendar context and clarification prompts when temporal ambiguity is detected.
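Resolving "last quarter" against a configurable fiscal calendar reduces to date arithmetic on month indices. A sketch (the function name is illustrative; `fy_start_month=2` would model a February fiscal year start):

```python
from datetime import date, timedelta

def last_fiscal_quarter(today: date, fy_start_month: int = 1) -> tuple[date, date]:
    """Return (start, end) of the previous fiscal quarter relative to `today`."""
    # Absolute month index: year * 12 + zero-based month.
    am = today.year * 12 + (today.month - 1)
    # Months elapsed since the fiscal year started, aligned to quarter starts.
    fy_offset = (today.month - fy_start_month) % 12
    cur_q_start = am - (fy_offset % 3)
    prev_q_start = cur_q_start - 3
    start = date(prev_q_start // 12, prev_q_start % 12 + 1, 1)
    # End of the previous quarter = day before the current quarter starts.
    end = date(cur_q_start // 12, cur_q_start % 12 + 1, 1) - timedelta(days=1)
    return start, end
```

The resolved date range is then injected into the SQL generation prompt, so the model never has to guess what "last quarter" means.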
Multi-turn conversation state
Follow-up questions like "Now break that down by region" need to reference the previous query. Implemented a conversation memory that maintains the last 3 query contexts for pronoun resolution.
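A bounded conversation memory of this kind is naturally a fixed-length deque. A minimal sketch (class and method names are hypothetical):

```python
from collections import deque

class QueryMemory:
    """Keep the last N (question, sql) turns so follow-ups like
    "now break that down by region" can be resolved against them."""

    def __init__(self, max_turns: int = 3):
        # deque with maxlen silently evicts the oldest turn on overflow.
        self.turns = deque(maxlen=max_turns)

    def remember(self, question: str, sql: str) -> None:
        self.turns.append({"question": question, "sql": sql})

    def context_prompt(self) -> str:
        """Render prior turns for inclusion in the SQL-generation prompt."""
        return "\n".join(
            f"Q: {t['question']}\nSQL: {t['sql']}" for t in self.turns
        )
```

Rendering prior turns back into the prompt lets the model rewrite "that" into a concrete reference to the previous query's tables and filters.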
Interested in working with Forward?
We build production AI systems and run experiments like this for teams who value rigorous engineering.