Voice-to-SQL Agent
A conversational interface that converts spoken questions into SQL queries, executes them against your database, and explains the results in natural language. Uses Whisper for speech recognition and GPT-4o function-calling for SQL generation with schema-aware validation.
Category: Agent
Status: Live
Tech Stack
Models
This experiment tests whether a voice-first interface to databases can make data exploration accessible to non-technical stakeholders. Speak a question like "What were our top 5 customers by revenue last quarter?" and get the SQL, the results, and a natural language explanation — all in under 2 seconds from end of speech.
I built a pipeline connecting Whisper Large-v3 for speech-to-text, GPT-4o with function-calling for SQL generation, and a React frontend for result visualization. The system was tested against a 150-question benchmark covering single-table lookups, multi-join aggregations, date-range filters, and ambiguous natural language. Each question was tested with 3 different speakers to evaluate robustness to accents and speech patterns. Schema awareness was implemented by injecting CREATE TABLE statements and sample rows into the system prompt.
Whisper Large-v3 handles speech recognition with word-level timestamps. GPT-4o generates SQL through structured function-calling with Pydantic-validated schemas. FastAPI serves the backend with connection pooling to SQLite (extendable to Postgres). The React frontend uses Web Audio API for recording and Recharts for result visualization.
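The four-stage pipeline can be sketched as a single orchestration function. This is a minimal illustration, not the production code: the stage functions (`transcribe`, `generate_sql`, `execute`, `summarize`) are hypothetical names standing in for the Whisper client, the GPT-4o function-calling request, the database executor, and the second LLM summarization call.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QueryResult:
    transcript: str
    sql: str
    rows: list
    summary: str

def voice_to_result(
    audio_bytes: bytes,
    transcribe: Callable[[bytes], str],     # e.g. a Whisper Large-v3 client
    generate_sql: Callable[[str], str],     # e.g. GPT-4o with function-calling
    execute: Callable[[str], list],         # read-only database executor
    summarize: Callable[[str, list], str],  # second LLM call for the explanation
) -> QueryResult:
    """Chain the four pipeline stages: speech -> SQL -> rows -> explanation."""
    transcript = transcribe(audio_bytes)
    sql = generate_sql(transcript)
    rows = execute(sql)
    summary = summarize(transcript, rows)
    return QueryResult(transcript, sql, rows, summary)
```

Keeping the stages as injected callables also makes each one easy to stub out in tests.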
The most important insights from this experiment.
94% query accuracy on structured questions
GPT-4o with schema injection correctly generated SQL for 94% of benchmark questions. Failures concentrated on ambiguous temporal references ("recently", "a while ago") and implicit joins.
Schema injection outperforms fine-tuning
Injecting CREATE TABLE statements with 3 sample rows per table achieved higher accuracy than a fine-tuned model trained on 500 query pairs, while being instantly adaptable to new schemas.
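Schema injection of this kind can be built directly from SQLite's `sqlite_master` catalog. A minimal sketch (function name and output format are illustrative, not the production prompt):

```python
import sqlite3

def build_schema_context(conn: sqlite3.Connection, sample_rows: int = 3) -> str:
    """Render each table's CREATE TABLE statement plus a few sample rows,
    for injection into the LLM system prompt."""
    parts = []
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for name, create_sql in tables:
        parts.append(create_sql + ";")
        # A handful of real rows shows the model value formats and conventions.
        for row in conn.execute(f"SELECT * FROM {name} LIMIT {sample_rows}"):
            parts.append(f"-- sample: {row}")
    return "\n".join(parts)
```

Because the context is rebuilt from the live database on each request, a schema change is picked up immediately, with no retraining step.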
Voice adds 300ms but improves accessibility
Whisper transcription adds ~300ms latency, but user testing showed non-technical users asked 3x more exploratory queries via voice compared to a text SQL interface.
Function-calling prevents injection attacks
Using structured function-calling instead of raw SQL string generation eliminated SQL injection vectors entirely — the model can only call pre-defined query functions with validated parameters.
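The shape of such a pre-defined query function looks roughly like this. The tool name, table names, and parameter bounds are hypothetical; the production system validates parameters with Pydantic schemas, which the plain checks below stand in for:

```python
import sqlite3

# Hypothetical tool schema exposed to the model via function-calling:
# the model supplies only a bounded integer, never SQL text.
TOP_CUSTOMERS_TOOL = {
    "name": "top_customers_by_revenue",
    "parameters": {
        "type": "object",
        "properties": {
            "limit": {"type": "integer", "minimum": 1, "maximum": 100},
        },
        "required": ["limit"],
    },
}

def top_customers_by_revenue(conn: sqlite3.Connection, limit: int) -> list:
    """Execute the fixed query behind the tool, with a validated parameter."""
    if not isinstance(limit, int) or not 1 <= limit <= 100:
        raise ValueError("limit must be an integer in [1, 100]")
    # Parameter binding, not string interpolation: no injection surface.
    return conn.execute(
        "SELECT name, SUM(amount) AS revenue FROM orders "
        "JOIN customers USING (customer_id) "
        "GROUP BY name ORDER BY revenue DESC LIMIT ?",
        (limit,),
    ).fetchall()
```

The SQL text is fixed at authoring time; the model's only degree of freedom is the bound parameter, which is why adversarial inputs have nowhere to land.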
Audio is captured in the browser via Web Audio API and streamed to the FastAPI backend. Whisper transcribes the audio, then the transcript is combined with the database schema context and sent to GPT-4o with function-calling tools defined for SELECT queries. The generated SQL is validated against an allowlist of operations (read-only), executed against the database, and results are sent back alongside a natural language summary generated by a second LLM call.
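The read-only validation step might look like the sketch below; this is a simplified stand-in for the production allowlist (restricted here to single `SELECT` statements), and in practice it would be paired with a database connection opened in read-only mode as defense in depth:

```python
import re

def validate_read_only(sql: str) -> str:
    """Reject anything that is not a single SELECT statement."""
    stripped = sql.strip().rstrip("; \n")
    if ";" in stripped:
        raise ValueError("only a single statement is allowed")
    if not re.match(r"(?i)^\s*SELECT\b", stripped):
        raise ValueError("only SELECT statements are allowed")
    return stripped
```

With SQLite, the second layer is as simple as `sqlite3.connect("file:analytics.db?mode=ro", uri=True)`, so even a query that slipped past the allowlist could not write.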
94% query accuracy across 150 benchmark questions. 1.2s average voice-to-result latency (300ms STT + 600ms LLM + 200ms execution + 100ms rendering). 100% of test users (n=8) preferred voice over writing raw SQL for exploratory analysis. Zero successful injection attempts in adversarial testing.
Key technical challenges encountered during this experiment.
Ambiguous temporal references
"Last quarter" means different things depending on fiscal calendar. Solved by adding a configurable fiscal calendar context and clarification prompts when temporal ambiguity is detected.
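Resolving "last quarter" against a configurable fiscal calendar reduces to date arithmetic on month indices. A sketch (the function name is illustrative; `fy_start_month=2` would model a February fiscal year start):

```python
from datetime import date, timedelta

def last_fiscal_quarter(today: date, fy_start_month: int = 1) -> tuple[date, date]:
    """Return (start, end) of the previous fiscal quarter relative to `today`."""
    # Absolute month index: year * 12 + zero-based month.
    am = today.year * 12 + (today.month - 1)
    # Months elapsed since the fiscal year started, aligned to quarter starts.
    fy_offset = (today.month - fy_start_month) % 12
    cur_q_start = am - (fy_offset % 3)
    prev_q_start = cur_q_start - 3
    start = date(prev_q_start // 12, prev_q_start % 12 + 1, 1)
    # End of the previous quarter = day before the current quarter starts.
    end = date(cur_q_start // 12, cur_q_start % 12 + 1, 1) - timedelta(days=1)
    return start, end
```

The resolved date range is then injected into the SQL generation prompt, so the model never has to guess what "last quarter" means.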
Multi-turn conversation state
Follow-up questions like "Now break that down by region" need to reference the previous query. Implemented a conversation memory that maintains the last 3 query contexts for pronoun resolution.
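A bounded conversation memory of this kind is naturally a fixed-length deque. A minimal sketch (class and method names are hypothetical):

```python
from collections import deque

class QueryMemory:
    """Keep the last N (question, sql) turns so follow-ups like
    "now break that down by region" can be resolved against them."""

    def __init__(self, max_turns: int = 3):
        # deque with maxlen silently evicts the oldest turn on overflow.
        self.turns = deque(maxlen=max_turns)

    def remember(self, question: str, sql: str) -> None:
        self.turns.append({"question": question, "sql": sql})

    def context_prompt(self) -> str:
        """Render prior turns for inclusion in the SQL-generation prompt."""
        return "\n".join(
            f"Q: {t['question']}\nSQL: {t['sql']}" for t in self.turns
        )
```

Rendering prior turns back into the prompt lets the model rewrite "that" into a concrete reference to the previous query's tables and filters.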
Interested in working with Forward?
We build production AI systems and run experiments like this for teams who value rigorous engineering.