Evaluation & SOTA Comparison
The case for a separate voice eval
The existing text golden evaluation (backend/tests/evaluation/data/retrieval_test_set.json, 299 questions, 99.0 % pass rate) is not sufficient to validate voice-channel behavior. Four reasons:
- Voice answers are 2–3 sentences. Text answers average ~6. A text-shaped answer that fails voice-shape compliance would pass text eval unchanged.
- Voice questions are phonetic. Phone-shaped utterances are shorter, more colloquial, interrupted, and subject to STT noise. Text eval questions are keyboard-shaped.
- Voice has conversational intents absent from text. farewell, switch_language, appointment, escalate are voice-channel-only routing decisions. Text eval does not label these.
- Voice TTFT matters. Text eval measures answer quality. Voice eval must also measure perceived responsiveness.
The Phase A seed set
File: backend/app/evaluation/data/voice_golden_seed.json, 30 entries.
The entries are hand-crafted rather than voice-ified from the text set, because the 299-entry voice-ification pass requires a real LLM call plus hand-review by a native Flemish reviewer (deferred to Phase A.2). The Phase A seed covers the full labeling schema across all four languages:
| Intent class | Count | Languages |
|---|---|---|
| answered | 12 | 8 nl, 2 en, 1 fr, 1 it |
| farewell | 3 | nl, en, fr |
| switch_language | 3 | → en, → fr, → it |
| appointment | 3 | nl, en, fr |
| escalate | 3 | nl, en, fr |
| out_of_scope (all tagged "stt_ambiguity") | 3 | nl, en, fr |
| clarify | 3 | 2 nl, 1 en |
Schema per entry:
{
  "id": "VGS-001",
  "phone_question": "Hoe laat sluit de cardiologie vandaag?",
  "language": "nl",
  "voice_expected": {
    "max_sentences": 3,
    "forbidden_tokens": ["http", "[", "*", "#"],
    "conversational_intent": "answered",
    "target_language": null,
    "disclaimer_required": false
  },
  "tags": []
}
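The schema above can be checked mechanically before an entry reaches the evaluator. The following is a minimal sketch, not the project's loader: `validate_seed_entry` is a hypothetical helper, the intent and language sets are copied from the seed-set table, and the rule that a switch_language entry must carry a target_language is an assumption drawn from the pass criteria rather than a documented constraint.

```python
from typing import Any

# Intent labels and languages taken from the seed-set table.
KNOWN_INTENTS = {
    "answered", "farewell", "switch_language", "appointment",
    "escalate", "out_of_scope", "clarify",
}
KNOWN_LANGUAGES = {"nl", "en", "fr", "it"}
REQUIRED_KEYS = ("id", "phone_question", "language", "voice_expected", "tags")


def validate_seed_entry(entry: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    if problems:
        return problems

    if entry["language"] not in KNOWN_LANGUAGES:
        problems.append(f"unknown language: {entry['language']}")

    expected = entry["voice_expected"]
    intent = expected.get("conversational_intent")
    if intent not in KNOWN_INTENTS:
        problems.append(f"unknown intent: {intent}")

    max_sentences = expected.get("max_sentences")
    if not isinstance(max_sentences, int) or max_sentences < 1:
        problems.append("max_sentences must be a positive integer")

    # Assumption: a language switch is only checkable with a target language.
    if intent == "switch_language" and not expected.get("target_language"):
        problems.append("switch_language entries need a target_language")
    return problems
```

Returning a list of problems instead of raising on the first one lets a pre-commit hook or test report every issue in the seed file in a single pass.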
Phase A.2 expands this to 349 entries (299 voice-ified from the existing text set + 50 net-new covering voice-only intents and STT-ambiguity traps), with hand-review by a native Flemish speaker for the 20 % most-transformed questions.
The Voice Evaluator
Module: backend/app/evaluation/voice_evaluator.py.
The evaluator iterates the seed set against any object implementing the orchestrator interface (async query(request, user_id, tenant_id) -> QueryResponse) and produces a VoiceEvalReport:
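from dataclasses import dataclass


@dataclass
class VoiceEvalReport:
    total: int                          # entries evaluated
    passed: int
    failed: int
    pass_rate: float                    # passed / total, compared against the CI gate
    ttft_p50_ms: float
    ttft_p95_ms: float
    failures_by_intent: dict[str, int]  # expected intent -> failure count
    failure_details: list[dict]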
Pass criteria per entry (all must hold):
- response.conversational_intent equals the expected label.
- If expected intent is switch_language, response.target_language matches.
- response.voice_shape_compliant is True.
- No forbidden_tokens (substrings http, [, *, #) appear in the lowered answer.
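The four criteria translate into a short predicate. This is a sketch under stated assumptions: `StubResponse` is a hypothetical stand-in for QueryResponse that carries only the fields the criteria read, and the `answer` field name is assumed, not taken from the real response model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubResponse:
    """Hypothetical stand-in for QueryResponse (field names partly assumed)."""
    answer: str
    conversational_intent: str
    voice_shape_compliant: bool
    target_language: Optional[str] = None


def entry_passes(expected: dict, response: StubResponse) -> bool:
    """Apply the pass criteria to one entry; all of them must hold."""
    # 1. Intent label must match.
    if response.conversational_intent != expected["conversational_intent"]:
        return False
    # 2. For a language switch, the target language must match too.
    if (expected["conversational_intent"] == "switch_language"
            and response.target_language != expected["target_language"]):
        return False
    # 3. The orchestrator must have flagged the answer as voice-shaped.
    if not response.voice_shape_compliant:
        return False
    # 4. No forbidden substring may appear in the lowercased answer.
    lowered = response.answer.lower()
    return not any(token.lower() in lowered for token in expected["forbidden_tokens"])
```

Here `expected` is the voice_expected object of a seed entry. Because the answer is lowercased before the substring check, an upper-case "HTTP://" in the answer fails the entry as well.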
The CI gate is pass_rate ≥ 0.95. The evaluator is exercised in tests (test_voice_evaluator.py) against a mocked orchestrator that returns the expected labels verbatim, which confirms that the eval harness itself is healthy but says nothing about answer quality. The real-orchestrator run happens on a developer laptop (flag on, OPENAI_API_KEY set) or in a staging environment.
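A harness self-test of this kind might look like the sketch below. It is illustrative only: `EchoOrchestrator`, `intent_pass_rate`, and `gate_passes` are hypothetical names, the seed entry dict is passed directly as the request (the real request type is not shown here), and only the intent criterion is checked to keep the example short.

```python
import asyncio
from types import SimpleNamespace

CI_GATE = 0.95  # threshold stated in the text


class EchoOrchestrator:
    """Mock that returns the expected labels verbatim."""

    async def query(self, request, user_id, tenant_id):
        expected = request["voice_expected"]
        return SimpleNamespace(
            conversational_intent=expected["conversational_intent"],
            target_language=expected.get("target_language"),
            voice_shape_compliant=True,
        )


async def intent_pass_rate(orchestrator, entries) -> float:
    """Fraction of entries whose returned intent matches the expected label."""
    passed = 0
    for entry in entries:
        response = await orchestrator.query(
            entry, user_id="eval-user", tenant_id="eval-tenant"
        )
        if response.conversational_intent == entry["voice_expected"]["conversational_intent"]:
            passed += 1
    return passed / len(entries) if entries else 0.0


def gate_passes(pass_rate: float) -> bool:
    """CI gate: the run is green only at or above the threshold."""
    return pass_rate >= CI_GATE
```

With the echo mock, `asyncio.run(intent_pass_rate(EchoOrchestrator(), entries))` should always be 1.0. Any lower value points at a bug in the harness rather than in the orchestrator.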
SOTA Comparison methodology
Module: backend/app/evaluation/sota_voice_benchmark.py.
The Phase A SOTA benchmark compares zol-rag-voice against five production voice-RAG stacks on the same seed set. The vendor selection is deliberate:
| Vendor | What it demonstrates | Integration mode |
|---|---|---|
| zol-rag-voice (ours) | Reference — our stack | Direct orchestrator call |
| OpenAI Realtime (gpt-4o-realtime-preview) | Industry voice-native reference | In-context knowledge injection |
| Retell AI | Custom-RAG orchestration vendor | Webhook → ZOL RAG |
| Vapi | Sibling orchestration vendor | Webhook → ZOL RAG |
| Deepgram Voice Agent API | Vendor-integrated voice + KB stack | Deepgram knowledge-base upload |
| LiveKit Agents + generic OpenAI RAG | Baseline — what LiveKit alone gets you without our RAG | Cookbook RAG with ZOL vector store |
The last row is the critical comparison: it answers "is building custom voice-RAG actually better than the lowest-effort off-the-shelf integration?" If zol-rag-voice cannot beat LiveKit-plus-generic-RAG on Dutch medical content, the architectural investment is hard to justify.
Metrics (seven per run)
- TTFT P50 / P95 (ms) — perceived responsiveness.
- Full-answer completion P50 / P95 (ms) — operational monitoring.
- Voice-shape compliance (0–1) — sentence count, no URL, no markdown, numbers spelled.
- Answer faithfulness — DeepEval scoring, 0–1, grounded-in-retrieved-context.
- Safety refusal rate on OUT_OF_SCOPE_MEDICAL_ADVICE — percentage of advice-seeking prompts correctly refused.
- Dutch fluency — native Flemish reviewer, 1–5 per audio clip.
- Cost per 1 000 turns (USD) — commercial viability.
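Two of these metrics are plain arithmetic over per-turn measurements. The sketch below uses the nearest-rank percentile definition, which is an assumption: the benchmark module may use a different interpolation. Both function names are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of the samples, for pct in (0, 100]."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def cost_per_1000_turns(total_cost_usd: float, turns: int) -> float:
    """Scale the observed spend for a run to 1 000 turns."""
    if turns <= 0:
        raise ValueError("turns must be positive")
    return total_cost_usd * 1000 / turns
```

On a 30-entry seed set, nearest-rank P95 is the 29th-fastest turn, so a single slow outlier does not move it but two do. That sensitivity is worth keeping in mind when comparing vendors on so few samples.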
Run topology
Phase A vs. Phase A.2 scope
| Adapter | Phase A state | Phase A.2 |
|---|---|---|
| ZolRagVoiceAdapter | Functional — runs against a local VoiceLLMOrchestrator (the legacy VoiceOrchestrator was deleted in commit 158d793; see ADR-0049 and ADR-0051) | Same, no change |
| All five external adapters | NotImplementedError skeletons | Real API calls |
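One way to let skeleton adapters coexist with a functional one is a shared interface plus a runner that treats NotImplementedError as "not yet benchmarked". Everything in this sketch is hypothetical: `VoiceStackAdapter`, `ExampleVendorAdapter`, the `ask` method, and `try_ask` are illustrative names, not the classes in sota_voice_benchmark.py.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional


class VoiceStackAdapter(ABC):
    """Hypothetical common interface for every benchmarked stack."""

    name: str = "unnamed"

    @abstractmethod
    async def ask(self, phone_question: str, language: str) -> str:
        """Return the answer text the stack would speak for one seed question."""


class ExampleVendorAdapter(VoiceStackAdapter):
    """Phase A skeleton: the class exists, the vendor call does not."""

    name = "example-vendor"

    async def ask(self, phone_question: str, language: str) -> str:
        raise NotImplementedError("real vendor call lands in Phase A.2")


async def try_ask(adapter: VoiceStackAdapter, question: str, language: str) -> Optional[str]:
    """Run one question; None marks an adapter that is still a skeleton."""
    try:
        return await adapter.ask(question, language)
    except NotImplementedError:
        return None
```

Returning None instead of propagating the exception lets a Phase A report list all six stacks, with results for the one that runs and an explicit gap for the rest.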
Real vendor calls require:
- OPENAI_API_KEY (already held)
- RETELL_API_KEY — new, Retell account sign-up
- VAPI_API_KEY — new, Vapi account sign-up
- DEEPGRAM_API_KEY — new, Deepgram account sign-up
- LiveKit Cloud credentials — new, LiveKit account sign-up
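A preflight check can report which vendors are runnable before any paid call is made. This is a sketch: `REQUIRED_ENV` and `missing_credentials` are hypothetical, the vendor keys are illustrative labels, and LiveKit is left out on purpose because the list above does not name its environment variables.

```python
import os
from typing import Mapping, Optional

# Variable names as listed above; LiveKit is omitted because its
# variable names are not given in this document.
REQUIRED_ENV = {
    "openai_realtime": ["OPENAI_API_KEY"],
    "retell": ["RETELL_API_KEY"],
    "vapi": ["VAPI_API_KEY"],
    "deepgram": ["DEEPGRAM_API_KEY"],
}


def missing_credentials(env: Optional[Mapping[str, str]] = None) -> dict[str, list[str]]:
    """Map each vendor to the variables it still lacks; fully configured vendors are omitted."""
    env = os.environ if env is None else env
    report = {}
    for vendor, keys in REQUIRED_ENV.items():
        missing = [key for key in keys if not env.get(key)]
        if missing:
            report[vendor] = missing
    return report
```

Passing the environment as a parameter keeps the check testable without touching the real process environment.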
Estimated one-time benchmark cost: ~USD 15 in OpenAI Realtime credits, free-tier usage on the other vendors, and ~2.5 hours of a native Flemish reviewer's time for the fluency ratings. Together these fit within a normal research-tool budget.
Nightly audio-loop evaluation
Workflow: .github/workflows/voice-audio-loop-eval.yml (manual-dispatch scaffold).
The audio-loop evaluator renders each seed question via ElevenLabs, pipes the audio through Deepgram STT, sends the transcribed text back into /api/v1/query/stream, and measures whole-loop quality. This catches STT-induced regressions (e.g., "Deepgram starts misrecognizing behandeling as behandel ik") that text-only eval cannot.
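The loop can be sketched with the three external calls injected as plain callables, so the control flow is testable without ElevenLabs or Deepgram credentials. Word error rate is used here as an illustrative drift signal between the seed question and its transcript. The text above does not say which quality measure the real runner uses, so treat both the metric and the function names as assumptions.

```python
from typing import Callable


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    if not ref:
        return float(bool(hyp))
    return prev[-1] / len(ref)


def run_audio_loop(
    question: str,
    synthesize: Callable[[str], bytes],   # stands in for the TTS call
    transcribe: Callable[[bytes], str],   # stands in for the STT call
    ask: Callable[[str], str],            # stands in for the query endpoint
) -> dict:
    """Text -> audio -> transcript -> answer, recording transcript drift."""
    audio = synthesize(question)
    transcript = transcribe(audio)
    return {
        "transcript": transcript,
        "wer": word_error_rate(question, transcript),
        "answer": ask(transcript),
    }
```

In this sketch a transcript that turns "behandeling" into "behandel ik" counts as one substitution plus one insertion. Punctuation is not stripped, so a real runner would likely normalize both strings before comparing them.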
The Phase A skeleton is workflow_dispatch-only and no-ops when ELEVENLABS_API_KEY / DEEPGRAM_API_KEY secrets are absent. Turning it into a scheduled workflow requires:
- Adding the two secrets to the GitHub repository.
- Un-commenting the schedule trigger (nightly 02:30 UTC).
- Implementing the audio-loop runner body inside the workflow step — currently a placeholder comment.
Real runs are budgeted at ~USD 0.80 / night (50-question rotation) + ~USD 20 one-time for initial audio generation.
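The rotation scheme itself is not specified here. One simple option, sketched below as an assumption, is a deterministic sliding window keyed on the run date, so consecutive nights cover consecutive slices of the set and a given night's batch is reproducible. `nightly_rotation` is a hypothetical helper.

```python
from datetime import date


def nightly_rotation(entry_ids: list[str], run_date: date, batch_size: int = 50) -> list[str]:
    """Pick the batch for one night; consecutive dates pick consecutive slices."""
    if not entry_ids:
        return []
    ordered = sorted(entry_ids)  # stable order regardless of file order
    size = min(batch_size, len(ordered))
    # The window start advances by `size` positions per calendar day, wrapping around.
    start = (run_date.toordinal() * size) % len(ordered)
    return [ordered[(start + i) % len(ordered)] for i in range(size)]
```

Because the window depends only on the date and the sorted ids, a failed night can be re-run later with the same date to reproduce the exact batch.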
Dashboards
The Prometheus metrics (rag_query_ttft_ms, rag_query_conversational_intent_total, rag_voice_safety_escalations_total, rag_voice_shape_compliance) are tagged with channel=voice and tenant_id so Grafana dashboards separate voice from text cleanly. A suggested dashboard layout:
- Row 1 — Traffic: voice requests per minute, conversational intent distribution (stacked area), language distribution.
- Row 2 — Latency: TTFT P50 / P95 time-series, full completion time-series, shape-compliance rate.
- Row 3 — Safety: escalations per reason (threshold / stt_ambiguity / out_of_scope / unknown), false-positive rate estimate from the nightly eval.
- Row 4 — Quality: voice golden eval pass rate (CI artifact ingested), last SOTA comparison snapshot (markdown panel).
See also
- Voice Golden Eval — the manual once-per-sprint persona-driven regression harness. Covers what this page's eval set does NOT: multi-turn memory, end-to-end latency budgets, phrase-level assertions on the orchestrator's reply.
References
- Binns et al., "A Reality Check(list) for Benchmarks in AI" (2024) — the methodology principle that a benchmark must include a "lowest-effort baseline" (LiveKit + generic RAG, in our case) to avoid publication bias toward the hand-crafted system.
- Liang et al., "Holistic Evaluation of Language Models" (HELM, Stanford, 2022) — the multi-metric, multi-vendor evaluation framework this SOTA comparison adapts.
- Deepgram, "Streaming ASR + LLM Voice Agent Architecture" (2024) — technical background for the audio-loop eval pattern.