Evaluation & SOTA Comparison

The case for a separate voice eval

The existing text golden evaluation (backend/tests/evaluation/data/retrieval_test_set.json, 299 questions, 99.0 % pass rate) is not sufficient to validate voice-channel behavior. Four reasons:

  1. Voice answers are 2–3 sentences. Text answers average ~6. A text-shaped answer that fails voice-shape compliance would pass text eval unchanged.
  2. Voice questions are phonetic. Phone-shaped utterances are shorter, more colloquial, interrupted, and subject to STT noise. Text eval questions are keyboard-shaped.
  3. Voice has conversational intents absent from text. The intents farewell, switch_language, appointment, and escalate are voice-channel-only routing decisions, and the text eval does not label them.
  4. Voice TTFT matters. Text eval measures answer quality. Voice eval must also measure perceived responsiveness.

The Phase A seed set

File: backend/app/evaluation/data/voice_golden_seed.json, 30 entries.

The seed set is hand-crafted rather than voice-ified from the text set, because the 299-entry voice-ification pass requires a real LLM call plus hand-review by a native Flemish speaker (deferred to Phase A.2). The Phase A seed covers the full labeling schema across all four languages:

Intent class | Count | Languages
answered | 12 | 8 nl, 2 en, 1 fr, 1 it
farewell | 3 | nl, en, fr
switch_language | 3 | → en, → fr, → it
appointment | 3 | nl, en, fr
escalate | 3 | nl, en, fr
out_of_scope (all tagged "stt_ambiguity") | 3 | nl, en, fr
clarify | 3 | 2 nl, 1 en

Schema per entry:

{
  "id": "VGS-001",
  "phone_question": "Hoe laat sluit de cardiologie vandaag?",
  "language": "nl",
  "voice_expected": {
    "max_sentences": 3,
    "forbidden_tokens": ["http", "[", "*", "#"],
    "conversational_intent": "answered",
    "target_language": null,
    "disclaimer_required": false
  },
  "tags": []
}
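
As an illustration of how this schema could be checked before an entry enters the seed file, here is a minimal sketch. The validator and its rules are hypothetical and not part of the repository; only the field names and labels come from this page. In particular, requiring a target_language on switch_language entries is an assumption.

```python
import json

# Intent and language labels are copied from the seed-set table on this page.
KNOWN_INTENTS = {
    "answered", "farewell", "switch_language", "appointment",
    "escalate", "out_of_scope", "clarify",
}
KNOWN_LANGUAGES = {"nl", "en", "fr", "it"}
REQUIRED_KEYS = ("id", "phone_question", "language", "voice_expected", "tags")


def validate_entry(entry: dict) -> list[str]:
    """Return human-readable problems; an empty list means the entry looks valid."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    if problems:
        return problems
    expected = entry["voice_expected"]
    if entry["language"] not in KNOWN_LANGUAGES:
        problems.append(f"unknown language: {entry['language']}")
    intent = expected.get("conversational_intent")
    if intent not in KNOWN_INTENTS:
        problems.append(f"unknown intent: {intent}")
    # Assumption: a switch_language entry must name its target language.
    if intent == "switch_language" and not expected.get("target_language"):
        problems.append("switch_language entry needs a target_language")
    max_sentences = expected.get("max_sentences")
    if not isinstance(max_sentences, int) or max_sentences < 1:
        problems.append("max_sentences must be a positive integer")
    return problems


EXAMPLE = json.loads("""
{
  "id": "VGS-001",
  "phone_question": "Hoe laat sluit de cardiologie vandaag?",
  "language": "nl",
  "voice_expected": {
    "max_sentences": 3,
    "forbidden_tokens": ["http", "[", "*", "#"],
    "conversational_intent": "answered",
    "target_language": null,
    "disclaimer_required": false
  },
  "tags": []
}
""")

print(validate_entry(EXAMPLE))
```

Returning a list of problems rather than raising on the first one lets a reviewer fix an entry in a single pass.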

Phase A.2 expands this to 349 entries (299 voice-ified from the existing text set + 50 net-new covering voice-only intents and STT-ambiguity traps), with hand-review by a native Flemish speaker for the 20 % most-transformed questions.

The Voice Evaluator

Module: backend/app/evaluation/voice_evaluator.py.

The evaluator iterates the seed set against any object implementing the orchestrator interface (async query(request, user_id, tenant_id) -> QueryResponse) and produces a VoiceEvalReport:

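from dataclasses import dataclass


@dataclass
class VoiceEvalReport:
    total: int                          # entries evaluated
    passed: int                         # entries meeting every pass criterion
    failed: int                         # total - passed
    pass_rate: float                    # passed / total, compared against the CI gate
    ttft_p50_ms: float                  # median time to first token, in ms
    ttft_p95_ms: float                  # 95th-percentile time to first token, in ms
    failures_by_intent: dict[str, int]  # failure count keyed by intent label
    failure_details: list[dict]         # one record per failed entry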

Pass criteria per entry (all must hold):

  1. response.conversational_intent equals the expected label.
  2. If expected intent is switch_language, response.target_language matches.
  3. response.voice_shape_compliant is True.
  4. No entry in forbidden_tokens (the substrings http, [, *, #) appears in the lowercased answer.
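
The four criteria above can be sketched as a single predicate. This is an illustrative reading of the list, not the code in voice_evaluator.py: FakeResponse is a stand-in for QueryResponse, and the `answer` field name is an assumption, since the page does not name the field that holds the answer text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResponse:
    """Stand-in for QueryResponse with only the fields the criteria read."""
    answer: str  # assumed field name for the spoken answer text
    conversational_intent: str
    voice_shape_compliant: bool
    target_language: Optional[str] = None


def entry_passes(voice_expected: dict, response: FakeResponse) -> bool:
    expected_intent = voice_expected["conversational_intent"]
    # 1. The intent label must match exactly.
    if response.conversational_intent != expected_intent:
        return False
    # 2. The target language is only compared for switch_language entries.
    if expected_intent == "switch_language":
        if response.target_language != voice_expected["target_language"]:
            return False
    # 3. The orchestrator's own shape flag must be True.
    if response.voice_shape_compliant is not True:
        return False
    # 4. No forbidden substring may occur in the lowercased answer.
    lowered = response.answer.lower()
    return not any(tok.lower() in lowered for tok in voice_expected["forbidden_tokens"])


expected = {
    "conversational_intent": "answered",
    "target_language": None,
    "forbidden_tokens": ["http", "[", "*", "#"],
}
good = FakeResponse("De cardiologie sluit vandaag om vijf uur.", "answered", True)
leaky = FakeResponse("Zie HTTP://zol.be voor de uren.", "answered", True)
print(entry_passes(expected, good), entry_passes(expected, leaky))
```

Lowercasing the answer before the substring test means an upper-case `HTTP://` is caught as well.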

The CI gate is pass_rate ≥ 0.95. The evaluator is exercised in tests (test_voice_evaluator.py) against a mocked orchestrator that returns the expected labels verbatim, which confirms only that the eval harness itself is healthy. The real-orchestrator run happens on a developer laptop (flag on, OPENAI_API_KEY set) or in a staging environment.
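
To make the gate concrete, the sketch below aggregates per-entry results into the report fields and applies the 0.95 threshold. The result-record shape, the `summarize` helper, and the nearest-rank percentile method are assumptions for illustration; the real module may compute percentiles differently.

```python
import math
from collections import Counter


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; returns 0.0 for an empty list."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results: list[dict], gate: float = 0.95) -> dict:
    """Each result is {"intent": str, "passed": bool, "ttft_ms": float} (hypothetical shape)."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    pass_rate = passed / total if total else 0.0
    failures = Counter(r["intent"] for r in results if not r["passed"])
    ttfts = [r["ttft_ms"] for r in results]
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": pass_rate,
        "ttft_p50_ms": percentile(ttfts, 50),
        "ttft_p95_ms": percentile(ttfts, 95),
        "failures_by_intent": dict(failures),
        "gate_ok": pass_rate >= gate,
    }


# Thirty synthetic results with exactly one failure.
results = [
    {"intent": "answered", "passed": i != 0, "ttft_ms": 100.0 + 10 * i}
    for i in range(30)
]
print(summarize(results)["gate_ok"])
```

With the 30-entry seed set, the gate tolerates one failure (29/30 ≈ 0.967) but not two (28/30 ≈ 0.933).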

SOTA Comparison methodology

Module: backend/app/evaluation/sota_voice_benchmark.py.

The Phase A SOTA benchmark compares zol-rag-voice against five production voice-RAG stacks on the same seed set. The vendor selection is deliberate:

Vendor | What it demonstrates | Integration mode
zol-rag-voice (ours) | Reference: our stack | Direct orchestrator call
OpenAI Realtime (gpt-4o-realtime-preview) | Industry voice-native reference | In-context knowledge injection
Retell AI | Custom-RAG orchestration vendor | Webhook → ZOL RAG
Vapi | Sibling orchestration vendor | Webhook → ZOL RAG
Deepgram Voice Agent API | Vendor-integrated voice + KB stack | Deepgram knowledge-base upload
LiveKit Agents + generic OpenAI RAG | Baseline: what LiveKit alone gets you without our RAG | Cookbook RAG with ZOL vector store

The last row is the critical comparison: it answers "is building custom voice-RAG actually better than the lowest-effort off-the-shelf integration?" If zol-rag-voice cannot beat LiveKit-plus-generic-RAG on Dutch medical content, the architectural investment is hard to justify.

Metrics (seven per run)

  1. TTFT P50 / P95 (ms): perceived responsiveness.
  2. Full-answer completion P50 / P95 (ms): operational monitoring.
  3. Voice-shape compliance (0–1): sentence count, no URLs, no markdown, numbers spelled out.
  4. Answer faithfulness: DeepEval scoring, 0–1, grounded in the retrieved context.
  5. Safety refusal rate on OUT_OF_SCOPE_MEDICAL_ADVICE: percentage of advice-seeking prompts correctly refused.
  6. Dutch fluency: native Flemish reviewer, 1–5 per audio clip.
  7. Cost per 1 000 turns (USD): commercial viability.
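
One possible reading of the voice-shape compliance metric is sketched below: four independent checks, scored as the fraction that pass. The equal weighting, the regular expressions, and the function names are assumptions. The sentence splitter is deliberately naive and will over-count on abbreviations such as "dr.".

```python
import re

_SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")
_URL = re.compile(r"https?://|www\.", re.IGNORECASE)
_MARKDOWN = re.compile(r"[\[\]*#`]")
_DIGIT = re.compile(r"\d")


def count_sentences(text: str) -> int:
    """Count sentence terminators that are followed by whitespace or end of text."""
    stripped = text.strip()
    if not stripped:
        return 0
    count = len(_SENTENCE_END.findall(stripped))
    # Trailing words without closing punctuation still form a sentence.
    if stripped[-1] not in ".!?":
        count += 1
    return count


def shape_score(answer: str, max_sentences: int = 3) -> float:
    """Fraction of the four shape checks that pass (equal weights assumed)."""
    checks = [
        1 <= count_sentences(answer) <= max_sentences,  # sentence count
        _URL.search(answer) is None,                    # no URL
        _MARKDOWN.search(answer) is None,               # no markdown characters
        _DIGIT.search(answer) is None,                  # numbers spelled out
    ]
    return sum(checks) / len(checks)


print(shape_score("De cardiologie sluit om vijf uur. U kunt morgen terugbellen."))
print(shape_score("Zie [link](http://zol.be) om 17 uur."))
```

The second call fails the URL, markdown, and digit checks but passes the sentence-count check, since the dot inside `zol.be` is not followed by whitespace.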

Run topology

Phase A vs. Phase A.2 scope

Adapter | Phase A state | Phase A.2
ZolRagVoiceAdapter | Functional: runs against a local VoiceLLMOrchestrator (the legacy VoiceOrchestrator was deleted in commit 158d793; see ADR-0049 and ADR-0051) | Same, no change
All five external adapters | NotImplementedError skeletons | Real API calls
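
The split between one functional adapter and five skeletons might look roughly like the sketch below. The base class, the `ask` method, and the adapter names here are hypothetical; the actual interface in sota_voice_benchmark.py is not shown on this page. EchoAdapter stands in for a functional adapter so the example runs without any orchestrator.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional


class VoiceBenchmarkAdapter(ABC):
    """Hypothetical common interface for all benchmarked stacks."""

    name: str

    @abstractmethod
    async def ask(self, phone_question: str, language: str) -> dict:
        """Return at least {"answer": str} for one seed question."""


class EchoAdapter(VoiceBenchmarkAdapter):
    """Stand-in for a functional adapter; it simply echoes the question."""

    name = "echo"

    async def ask(self, phone_question: str, language: str) -> dict:
        return {"answer": phone_question, "language": language}


class RetellSkeletonAdapter(VoiceBenchmarkAdapter):
    """Phase A skeleton: the real API call is deferred to Phase A.2."""

    name = "retell"

    async def ask(self, phone_question: str, language: str) -> dict:
        raise NotImplementedError("real vendor call lands in Phase A.2")


async def run_available(adapters, question: str, language: str) -> dict[str, Optional[dict]]:
    """Query every adapter; skeletons are recorded as None instead of aborting the run."""
    results: dict[str, Optional[dict]] = {}
    for adapter in adapters:
        try:
            results[adapter.name] = await adapter.ask(question, language)
        except NotImplementedError:
            results[adapter.name] = None
    return results


outcome = asyncio.run(
    run_available([EchoAdapter(), RetellSkeletonAdapter()], "Hoe laat sluit de cardiologie?", "nl")
)
print(outcome)
```

Recording a skeleton as None keeps the benchmark table complete in Phase A while making it obvious which vendors have no data yet.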

Real vendor calls require:

  • OPENAI_API_KEY (already held)
  • RETELL_API_KEY — new, Retell account sign-up
  • VAPI_API_KEY — new, Vapi account sign-up
  • DEEPGRAM_API_KEY — new, Deepgram account sign-up
  • LiveKit Cloud credentials — new, LiveKit account sign-up

Estimated one-time benchmark cost: ~USD 15 in OpenAI Realtime credits, free-tier usage on the other vendors, and ~2.5 hours of a native Flemish reviewer's time for the fluency ratings. Together these fit within a normal research-tool budget.

Nightly audio-loop evaluation

Workflow: .github/workflows/voice-audio-loop-eval.yml (manual-dispatch scaffold).

The audio-loop evaluator renders each seed question via ElevenLabs, pipes the audio through Deepgram STT, sends the transcribed text back into /api/v1/query/stream, and measures whole-loop quality. This catches STT-induced regressions (e.g., "Deepgram starts misrecognizing behandeling as behandel ik") that text-only eval cannot.
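
The loop described above can be sketched with the three external services injected as callables, so the control flow can be tested without ElevenLabs, Deepgram, or a running API. All function names and the result shape are hypothetical; the fakes below only simulate the misrecognition example from the paragraph.

```python
import asyncio
from typing import Awaitable, Callable

Synthesize = Callable[[str, str], Awaitable[bytes]]  # (text, language) -> audio
Transcribe = Callable[[bytes, str], Awaitable[str]]  # (audio, language) -> transcript
Query = Callable[[str, str], Awaitable[str]]         # (transcript, language) -> answer


async def audio_loop_once(
    question: str,
    language: str,
    synthesize: Synthesize,
    transcribe: Transcribe,
    query: Query,
) -> dict:
    """Run one seed question through TTS, STT, and the query endpoint."""
    audio = await synthesize(question, language)
    transcript = await transcribe(audio, language)
    answer = await query(transcript, language)
    return {
        "question": question,
        "transcript": transcript,
        # Flag any difference between what was asked and what STT heard.
        "stt_drift": transcript.strip().lower() != question.strip().lower(),
        "answer": answer,
    }


# Fakes that stand in for the real services.
async def fake_tts(text: str, language: str) -> bytes:
    return text.encode("utf-8")


async def fake_stt(audio: bytes, language: str) -> str:
    return audio.decode("utf-8").replace("behandeling", "behandel ik")


async def fake_query(text: str, language: str) -> str:
    return f"echo: {text}"


result = asyncio.run(
    audio_loop_once("Wanneer start mijn behandeling?", "nl", fake_tts, fake_stt, fake_query)
)
print(result["stt_drift"], result["transcript"])
```

An exact-match drift flag is the crudest possible signal; a real runner would more likely track word error rate per question over time.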

The Phase A skeleton is workflow_dispatch-only and no-ops when ELEVENLABS_API_KEY / DEEPGRAM_API_KEY secrets are absent. Turning it into a scheduled workflow requires:

  1. Adding the two secrets to the GitHub repository.
  2. Un-commenting the schedule trigger (nightly 02:30 UTC).
  3. Implementing the audio-loop runner body inside the workflow step — currently a placeholder comment.
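
The no-op behavior when secrets are absent could be expressed in the runner itself, as in this minimal sketch. The `plan_run` helper and its return values are hypothetical; only the two secret names come from this page.

```python
import os
from typing import Mapping, Optional

REQUIRED_SECRETS = ("ELEVENLABS_API_KEY", "DEEPGRAM_API_KEY")


def missing_secrets(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required secret names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


def plan_run(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "skip" when any secret is missing, otherwise "run"."""
    missing = missing_secrets(env)
    if missing:
        print(f"audio-loop eval skipped; missing secrets: {', '.join(missing)}")
        return "skip"
    return "run"


print(plan_run({"ELEVENLABS_API_KEY": "set"}))
```

Treating an empty string as missing matters on GitHub Actions, where an undefined secret expands to an empty value rather than an unset variable.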

Real runs are budgeted at ~USD 0.80 / night (50-question rotation) + ~USD 20 one-time for initial audio generation.

Dashboards

The Prometheus metrics (rag_query_ttft_ms, rag_query_conversational_intent_total, rag_voice_safety_escalations_total, rag_voice_shape_compliance) are tagged with channel=voice and tenant_id so Grafana dashboards separate voice from text cleanly. A suggested dashboard layout:

  • Row 1 — Traffic: voice requests per minute, conversational intent distribution (stacked area), language distribution.
  • Row 2 — Latency: TTFT P50 / P95 time-series, full completion time-series, shape-compliance rate.
  • Row 3 — Safety: escalations per reason (threshold / stt_ambiguity / out_of_scope / unknown), false-positive rate estimate from the nightly eval.
  • Row 4 — Quality: voice golden eval pass rate (CI artifact ingested), last SOTA comparison snapshot (markdown panel).

See also

  • Voice Golden Eval — the manual once-per-sprint persona-driven regression harness. Covers what this page's eval set does NOT: multi-turn memory, end-to-end latency budgets, phrase-level assertions on the orchestrator's reply.

References

  • Binns et al., "A Reality Check(list) for Benchmarks in AI" (2024): the methodology principle that a benchmark must include a "lowest-effort baseline" (LiveKit + generic RAG, in our case) to avoid publication bias toward the hand-crafted system.
  • Liang et al., "Holistic Evaluation of Language Models" (HELM; Stanford, 2022): the multi-metric, multi-vendor evaluation framework this SOTA comparison adapts.
  • Deepgram, "Streaming ASR + LLM Voice Agent Architecture" (2024): technical background for the audio-loop eval pattern.