Evaluation & SOTA Comparison

The case for a separate voice eval

The existing text golden evaluation (backend/tests/evaluation/data/retrieval_test_set.json, 299 questions, 99.0 % pass rate) is not sufficient to validate voice-channel behavior. Four reasons:

Voice answers are 2–3 sentences. Text answers average ~6. A text-shaped answer that fails voice-shape compliance would pass text eval unchanged.
Voice questions are phonetic. Phone-shaped utterances are shorter, more colloquial, interrupted, and subject to STT noise. Text eval questions are keyboard-shaped.
Voice has conversational intents absent from text. The intents farewell, switch_language, appointment, and escalate are voice-channel-only routing decisions. Text eval does not label these.
Voice TTFT matters. Text eval measures answer quality. Voice eval must also measure perceived responsiveness.

The Phase A seed set

File: backend/app/evaluation/data/voice_golden_seed.json, 30 entries.

The seed entries are hand-crafted rather than voice-ified from the text set, because voice-ifying all 299 text entries requires real LLM calls plus hand-review by a native Flemish reviewer (deferred to Phase A.2). The Phase A seed covers the full labeling schema across all four languages (nl, en, fr, it):

Intent class	Count	Languages
`answered`	12	8 nl, 2 en, 1 fr, 1 it
`farewell`	3	nl, en, fr
`switch_language`	3	→ en, → fr, → it
`appointment`	3	nl, en, fr
`escalate`	3	nl, en, fr
`out_of_scope` (all tagged `"stt_ambiguity"`)	3	nl, en, fr
`clarify`	3	2 nl, 1 en

Schema per entry:

{
  "id": "VGS-001",
  "phone_question": "Hoe laat sluit de cardiologie vandaag?",
  "language": "nl",
  "voice_expected": {
    "max_sentences": 3,
    "forbidden_tokens": ["http", "[", "*", "#"],
    "conversational_intent": "answered",
    "target_language": null,
    "disclaimer_required": false
  },
  "tags": []
}
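
A typed loader makes schema drift in the seed file visible early. The following is a minimal sketch, not the project's loader: the class names, `parse_entry`, `load_seed`, the assumption that the file is a JSON array, and the rule that a switch_language entry must carry a target_language are all illustrative.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

# Intent labels copied from the seed-set table.
KNOWN_INTENTS = {
    "answered", "farewell", "switch_language", "appointment",
    "escalate", "out_of_scope", "clarify",
}


@dataclass(frozen=True)
class VoiceExpected:
    max_sentences: int
    forbidden_tokens: list[str]
    conversational_intent: str
    target_language: Optional[str]
    disclaimer_required: bool


@dataclass(frozen=True)
class SeedEntry:
    id: str
    phone_question: str
    language: str
    voice_expected: VoiceExpected
    tags: list[str] = field(default_factory=list)


def parse_entry(raw: dict) -> SeedEntry:
    """Turn one raw JSON object into a typed entry, rejecting bad labels."""
    expected = VoiceExpected(**raw["voice_expected"])
    if expected.conversational_intent not in KNOWN_INTENTS:
        raise ValueError(f"{raw['id']}: unknown intent {expected.conversational_intent!r}")
    # Assumption: a switch_language entry must name its target language.
    if expected.conversational_intent == "switch_language" and not expected.target_language:
        raise ValueError(f"{raw['id']}: switch_language needs target_language")
    return SeedEntry(
        id=raw["id"],
        phone_question=raw["phone_question"],
        language=raw["language"],
        voice_expected=expected,
        tags=list(raw.get("tags", [])),
    )


def load_seed(text: str) -> list[SeedEntry]:
    """Assumption: the seed file is a JSON array of entry objects."""
    return [parse_entry(item) for item in json.loads(text)]
```

Because `VoiceExpected(**...)` rejects unknown or missing keys with a `TypeError`, a renamed field in the JSON fails at load time rather than silently during evaluation.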

Phase A.2 expands this to 349 entries (299 voice-ified from the existing text set + 50 net-new covering voice-only intents and STT-ambiguity traps), with hand-review by a native Flemish speaker for the 20 % most-transformed questions.

The Voice Evaluator

Module: backend/app/evaluation/voice_evaluator.py.

The evaluator iterates the seed set against any object implementing the orchestrator interface (async query(request, user_id, tenant_id) -> QueryResponse) and produces a VoiceEvalReport:

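from dataclasses import dataclass


@dataclass
class VoiceEvalReport:
    total: int                          # entries evaluated
    passed: int
    failed: int
    pass_rate: float                    # passed / total
    ttft_p50_ms: float                  # median time to first token
    ttft_p95_ms: float
    failures_by_intent: dict[str, int]  # failure count per intent label
    failure_details: list[dict]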
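
To show how the report fields relate to one another, here is a hedged sketch of the aggregation step. The per-entry result dictionaries (keys `intent`, `passed`, `ttft_ms`), the `build_report` helper, and the nearest-rank percentile are assumptions made for illustration; the real evaluator may compute percentiles differently. The dataclass is repeated only so that the sketch runs on its own.

```python
import math
from dataclasses import dataclass


@dataclass
class VoiceEvalReport:
    total: int
    passed: int
    failed: int
    pass_rate: float
    ttft_p50_ms: float
    ttft_p95_ms: float
    failures_by_intent: dict[str, int]
    failure_details: list[dict]


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; returns 0.0 for an empty sample."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return float(ordered[rank - 1])


def build_report(results: list[dict]) -> VoiceEvalReport:
    """Aggregate hypothetical per-entry results into a report."""
    failures = [r for r in results if not r["passed"]]
    by_intent: dict[str, int] = {}
    for failure in failures:
        by_intent[failure["intent"]] = by_intent.get(failure["intent"], 0) + 1
    total = len(results)
    passed = total - len(failures)
    ttfts = [r["ttft_ms"] for r in results]
    return VoiceEvalReport(
        total=total,
        passed=passed,
        failed=len(failures),
        pass_rate=passed / total if total else 0.0,
        ttft_p50_ms=percentile(ttfts, 50),
        ttft_p95_ms=percentile(ttfts, 95),
        failures_by_intent=by_intent,
        failure_details=failures,
    )
```

With only 30 seed entries, nearest-rank P95 is simply the 29th-slowest sample, so a single slow call can move it noticeably.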

Pass criteria per entry (all must hold):

response.conversational_intent equals the expected label.
If expected intent is switch_language, response.target_language matches.
response.voice_shape_compliant is True.
None of the forbidden_tokens (substrings http, [, *, #) appears in the lowercased answer.
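
The four criteria can be sketched as a single check that returns the reasons for a failure, which is handy for populating failure_details. This is an illustration under assumptions: `FakeResponse` is a hypothetical stand-in for QueryResponse, and the `answer` field name is not confirmed by the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResponse:
    # Hypothetical stand-in for QueryResponse; `answer` is an assumed field name.
    answer: str
    conversational_intent: str
    target_language: Optional[str]
    voice_shape_compliant: bool


def entry_failures(expected: dict, response) -> list[str]:
    """Return the failed criteria; an empty list means the entry passes."""
    reasons = []
    if response.conversational_intent != expected["conversational_intent"]:
        reasons.append("intent_mismatch")
    # The target language is only checked for switch_language entries.
    if (expected["conversational_intent"] == "switch_language"
            and response.target_language != expected["target_language"]):
        reasons.append("target_language_mismatch")
    if response.voice_shape_compliant is not True:
        reasons.append("shape_noncompliant")
    lowered = response.answer.lower()
    hits = [tok for tok in expected["forbidden_tokens"] if tok.lower() in lowered]
    if hits:
        reasons.append("forbidden_tokens:" + ",".join(hits))
    return reasons
```

Returning reasons instead of a bare boolean keeps every failed criterion visible when an entry breaks more than one rule.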

The CI gate is pass_rate ≥ 0.95. The evaluator is exercised in tests (test_voice_evaluator.py) against a mocked orchestrator that returns the expected labels verbatim, which confirms that the eval harness itself is healthy but says nothing about answer quality. The real-orchestrator run happens on a developer laptop (flag on, OPENAI_API_KEY set) or in a staging environment.
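
A rough sketch of the mocked-orchestrator idea and the gate follows. It is not the contents of test_voice_evaluator.py: `StubResponse`, `EchoOrchestrator`, the request shape (`{"question": ...}`), and the `user_id` / `tenant_id` values are all assumptions, and only the intent and target-language criteria are checked here.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubResponse:
    # Hypothetical stand-in for QueryResponse with only the checked fields.
    answer: str
    conversational_intent: str
    target_language: Optional[str]
    voice_shape_compliant: bool


class EchoOrchestrator:
    """Test double that returns each entry's expected labels verbatim."""

    def __init__(self, entries: list[dict]):
        self._by_question = {e["phone_question"]: e["voice_expected"] for e in entries}

    async def query(self, request, user_id, tenant_id):
        expected = self._by_question[request["question"]]
        return StubResponse(
            answer="Dat is in orde.",
            conversational_intent=expected["conversational_intent"],
            target_language=expected["target_language"],
            voice_shape_compliant=True,
        )


async def pass_rate(entries: list[dict], orchestrator) -> float:
    passed = 0
    for entry in entries:
        expected = entry["voice_expected"]
        response = await orchestrator.query(
            {"question": entry["phone_question"]}, user_id="eval", tenant_id="eval"
        )
        intent_ok = response.conversational_intent == expected["conversational_intent"]
        language_ok = (
            expected["conversational_intent"] != "switch_language"
            or response.target_language == expected["target_language"]
        )
        if intent_ok and language_ok and response.voice_shape_compliant:
            passed += 1
    return passed / len(entries) if entries else 0.0


def gate(rate: float, threshold: float = 0.95) -> bool:
    """CI gate: the run passes when pass_rate >= threshold."""
    return rate >= threshold
```

On the 30-entry seed set the gate tolerates at most one failing entry: 29/30 ≈ 0.967 clears 0.95, while 28/30 ≈ 0.933 does not.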

SOTA Comparison methodology

Module: backend/app/evaluation/sota_voice_benchmark.py.

The Phase A SOTA benchmark compares zol-rag-voice against five production voice-RAG stacks on the same seed set. The vendor selection is deliberate:

Vendor	What it demonstrates	Integration mode
zol-rag-voice (ours)	Reference — our stack	Direct orchestrator call
OpenAI Realtime (`gpt-4o-realtime-preview`)	Industry voice-native reference	In-context knowledge injection
Retell AI	Custom-RAG orchestration vendor	Webhook → ZOL RAG
Vapi	Sibling orchestration vendor	Webhook → ZOL RAG
Deepgram Voice Agent API	Vendor-integrated voice + KB stack	Deepgram knowledge-base upload
LiveKit Agents + generic OpenAI RAG	Baseline — what LiveKit alone gets you without our RAG	Cookbook RAG with ZOL vector store

The last row is the critical comparison: it answers "is building custom voice-RAG actually better than the lowest-effort off-the-shelf integration?" If zol-rag-voice cannot beat LiveKit-plus-generic-RAG on Dutch medical content, the architectural investment is hard to justify.

Metrics (seven per run)

TTFT P50 / P95 (ms) — perceived responsiveness.
Full-answer completion P50 / P95 (ms) — operational monitoring.
Voice-shape compliance (0–1) — sentence count, no URL, no markdown, numbers spelled.
Answer faithfulness — DeepEval scoring, 0–1, grounded-in-retrieved-context.
Safety refusal rate on OUT_OF_SCOPE_MEDICAL_ADVICE — percentage of advice-seeking prompts correctly refused.
Dutch fluency — native Flemish reviewer, 1–5 per audio clip.
Cost per 1 000 turns (USD) — commercial viability.
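
As an illustration of the voice-shape compliance metric, the sketch below scores an answer against the four listed properties and averages over a run. The regular expressions are heuristics chosen for this example, not the project's rules: the sentence splitter miscounts abbreviations such as "dr.", and "numbers spelled" is approximated as "no digits present".

```python
import re

SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")
URL = re.compile(r"https?://|www\.", re.IGNORECASE)
MARKDOWN = re.compile(r"[*#\[\]`]")
DIGIT = re.compile(r"\d")


def count_sentences(text: str) -> int:
    """Heuristic count: split on terminal punctuation followed by space or end."""
    return len([part for part in SENTENCE_END.split(text.strip()) if part.strip()])


def shape_violations(answer: str, max_sentences: int = 3) -> list[str]:
    """Return the violated properties; an empty list means compliant."""
    violations = []
    if count_sentences(answer) > max_sentences:
        violations.append("too_many_sentences")
    if URL.search(answer):
        violations.append("url")
    if MARKDOWN.search(answer):
        violations.append("markdown")
    if DIGIT.search(answer):  # assumption: any digit means a number was not spelled out
        violations.append("digits")
    return violations


def compliance_rate(answers: list[str], max_sentences: int = 3) -> float:
    """Fraction of answers with no violations, in the 0-1 range."""
    if not answers:
        return 0.0
    compliant = sum(1 for a in answers if not shape_violations(a, max_sentences))
    return compliant / len(answers)
```

For example, "De cardiologie sluit om vijf uur. Tot ziens!" has no violations, whereas "Zie https://example.org om 17 uur." is flagged for both the URL and the digits.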

Run topology

Phase A vs. Phase A.2 scope

Adapter	Phase A state	Phase A.2
`ZolRagVoiceAdapter`	Functional — runs against a local `VoiceLLMOrchestrator` (the legacy `VoiceOrchestrator` was deleted in commit `158d793`; see ADR-0049 and ADR-0051)	Same, no change
All five external adapters	`NotImplementedError` skeletons	Real API calls
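
One way to picture the Phase A state is a shared adapter interface in which only the local adapter does real work. The sketch below is an assumption-heavy illustration, not the code in sota_voice_benchmark.py: `VoiceStackAdapter`, `LocalOrchestratorAdapter` (a stand-in for `ZolRagVoiceAdapter`), `ExternalSkeletonAdapter`, `run_all`, the request shape, and the `answer` field are all hypothetical. The timing shown is full-call latency, not true TTFT.

```python
import asyncio
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdapterResult:
    answer: str
    latency_ms: float  # full-call latency, not time to first token


class VoiceStackAdapter(ABC):
    name: str = "unnamed"

    @abstractmethod
    async def ask(self, question: str, language: str) -> AdapterResult:
        ...


class LocalOrchestratorAdapter(VoiceStackAdapter):
    """Functional in Phase A: calls a local orchestrator directly."""

    name = "zol-rag-voice"

    def __init__(self, orchestrator):
        self._orchestrator = orchestrator

    async def ask(self, question: str, language: str) -> AdapterResult:
        started = time.perf_counter()
        response = await self._orchestrator.query(
            {"question": question, "language": language},  # assumed request shape
            user_id="benchmark",
            tenant_id="benchmark",
        )
        elapsed_ms = (time.perf_counter() - started) * 1000
        return AdapterResult(answer=response.answer, latency_ms=elapsed_ms)


class ExternalSkeletonAdapter(VoiceStackAdapter):
    """Phase A placeholder for a vendor adapter."""

    def __init__(self, name: str):
        self.name = name

    async def ask(self, question: str, language: str) -> AdapterResult:
        raise NotImplementedError(f"{self.name}: real API call lands in Phase A.2")


async def run_all(adapters, question: str, language: str) -> dict[str, Optional[AdapterResult]]:
    """Query every adapter; skeletons are recorded as None instead of aborting."""
    results: dict[str, Optional[AdapterResult]] = {}
    for adapter in adapters:
        try:
            results[adapter.name] = await adapter.ask(question, language)
        except NotImplementedError:
            results[adapter.name] = None
    return results

# Usage: asyncio.run(run_all([LocalOrchestratorAdapter(orch), ExternalSkeletonAdapter("retell")], "Hallo", "nl"))
```

Recording skeletons as `None` lets the benchmark table render with empty vendor rows in Phase A, so the report layout does not change when the real calls arrive.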

Real vendor calls require:

OPENAI_API_KEY (already held)
RETELL_API_KEY — new, Retell account sign-up
VAPI_API_KEY — new, Vapi account sign-up
DEEPGRAM_API_KEY — new, Deepgram account sign-up
LiveKit Cloud credentials — new, LiveKit account sign-up

Estimated one-time benchmark cost: ~USD 15 in OpenAI Realtime credits, free-tier usage on the other vendors, and ~2.5 hours of a native Flemish reviewer's time for the fluency ratings. The total fits within a normal research-tool budget.

Nightly audio-loop evaluation

Workflow: .github/workflows/voice-audio-loop-eval.yml (manual-dispatch scaffold).

The audio-loop evaluator renders each seed question via ElevenLabs, pipes the audio through Deepgram STT, sends the transcribed text back into /api/v1/query/stream, and measures whole-loop quality. This catches STT-induced regressions (e.g., "Deepgram starts misrecognizing behandeling as behandel ik") that text-only eval cannot.
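
The loop can be sketched with the three external services injected as plain callables, which keeps the example free of vendor SDK calls. Everything here is hypothetical: `run_audio_loop`, `LoopResult`, the callable signatures, and the use of a character-level similarity ratio as a drift signal are illustrative choices, not the planned runner.

```python
import difflib
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    question: str
    transcript: str
    answer: str
    transcript_similarity: float  # 1.0 means STT reproduced the question exactly


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def run_audio_loop(
    question: str,
    language: str,
    synthesize: Callable[[str, str], bytes],  # text, language -> audio (TTS)
    transcribe: Callable[[bytes, str], str],  # audio, language -> text (STT)
    ask: Callable[[str], str],                # transcript -> answer (query endpoint)
) -> LoopResult:
    audio = synthesize(question, language)
    transcript = transcribe(audio, language)
    answer = ask(transcript)
    similarity = difflib.SequenceMatcher(
        None, normalise(question), normalise(transcript)
    ).ratio()
    return LoopResult(question, transcript, answer, similarity)
```

A falling `transcript_similarity` on a fixed question set would point at the STT stage (the behandeling / behandel ik case) before answer quality is even considered.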

The Phase A skeleton is workflow_dispatch-only and no-ops when ELEVENLABS_API_KEY / DEEPGRAM_API_KEY secrets are absent. Turning it into a scheduled workflow requires:

Adding the two secrets to the GitHub repository.
Un-commenting the schedule trigger (nightly 02:30 UTC).
Implementing the audio-loop runner body inside the workflow step — currently a placeholder comment.
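
The no-op behavior can be pictured as a small guard that the workflow step would call before any paid API is touched. This is a sketch under assumptions: the `main` entry point and its messages are hypothetical, and the source describes the current step as a placeholder comment, not as this script.

```python
import os

# Secret names taken from the workflow description above.
REQUIRED_SECRETS = ("ELEVENLABS_API_KEY", "DEEPGRAM_API_KEY")


def missing_secrets(env) -> list[str]:
    """Names of required secrets that are unset or empty in the given mapping."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


def main(env=None) -> int:
    """Return a process exit code; absent secrets mean a successful no-op."""
    env = os.environ if env is None else env
    missing = missing_secrets(env)
    if missing:
        print("Skipping audio-loop eval, missing secrets: " + ", ".join(missing))
        return 0
    print("Secrets present; the audio-loop runner body would start here.")
    return 0
```

Returning 0 in the skip path keeps forks and pull requests without secrets green. Treating empty strings as missing matters because GitHub Actions substitutes an empty value for an unset secret.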

Real runs are budgeted at ~USD 0.80 / night (50-question rotation) + ~USD 20 one-time for initial audio generation.

Dashboards

The Prometheus metrics (rag_query_ttft_ms, rag_query_conversational_intent_total, rag_voice_safety_escalations_total, rag_voice_shape_compliance) are tagged with channel=voice and tenant_id so Grafana dashboards separate voice from text cleanly. A suggested dashboard layout:

Row 1 — Traffic: voice requests per minute, conversational intent distribution (stacked area), language distribution.
Row 2 — Latency: TTFT P50 / P95 time-series, full completion time-series, shape-compliance rate.
Row 3 — Safety: escalations per reason (threshold / stt_ambiguity / out_of_scope / unknown), false-positive rate estimate from the nightly eval.
Row 4 — Quality: voice golden eval pass rate (CI artifact ingested), last SOTA comparison snapshot (markdown panel).

References

Binns et al., "A Reality Check(list) for Benchmarks in AI" (2024): the methodology principle that a benchmark must include a "lowest-effort baseline" (LiveKit + generic RAG, in our case) to avoid publication bias toward the hand-crafted system.
Liang et al., "HELM: Holistic Evaluation of Language Models" (Stanford, 2022): the multi-metric, multi-vendor evaluation framework this SOTA comparison adapts.
Deepgram, "Streaming ASR + LLM Voice Agent Architecture" (2024): technical background for the audio-loop eval pattern.
