Voice Golden Eval
The voice golden eval is a manually-run integration harness that drives multi-turn persona scenarios through VoiceLLMOrchestrator.query_stream(...) and asserts on the response. It is the voice-channel sibling of the text golden eval (backend/tests/evaluation/run_evaluation.py).
It is NOT wired into CI. Eval runs are LLM-cost-heavy and slow (live tool-use round trips per turn, plus the live RAG path). Operators run it manually:
- before merging a voice-orchestrator change
- after a pilot deploy that touched voice
- once per sprint as a sanity sweep
For a quick smoke gate during development, see the deterministic sub-tests under backend/tests/unit/services/voice/ — those run on every PR.
What it is, and what it isn't
It is a phrase / intent / language / latency / citation regression harness for the voice agent's reply shape across a curated set of caller-personas. Each persona is a multi-turn conversation; each turn carries explicit assertions ("expected_phrases", "must_not_phrases", "expected_language", "citations_min", "latency_budget_ms", "expected_intent", "expected_safety_verdict").
It isn't an STT or TTS quality benchmark — STT is bypassed entirely (text-as-if-transcribed). It also isn't a full-conversation realism benchmark — that is the SOTA evaluator (evaluation.md) and the phone smoke set, which exercise STT/TTS too.
How to run
cd backend
source venv/bin/activate
# Run every persona in tests/evaluation/voice_scenarios/
python -m tests.evaluation.run_voice_evaluation
# Run a specific persona
python -m tests.evaluation.run_voice_evaluation --persona persona_03_sofie_peters
# Multiple personas (repeat the flag)
python -m tests.evaluation.run_voice_evaluation \
--persona persona_02_dr_janssens \
--persona persona_06_mariam_yusuf
# Live single-line TTY status while it runs
python -m tests.evaluation.run_voice_evaluation --watch
# Tag the run for traceability in the report JSON
python -m tests.evaluation.run_voice_evaluation --label "post-deploy-d215a0a0"
# Resume an interrupted run (skip personas that fully passed)
python -m tests.evaluation.run_voice_evaluation --resume voice-eval-2026-05-10-143000
# Custom output path
python -m tests.evaluation.run_voice_evaluation \
--output ./reports/post-deploy.json
CLI flags
| Flag | Purpose |
|---|---|
| --scenarios-dir <path> | Override the persona JSON directory (default: tests/evaluation/voice_scenarios/). |
| --persona <id> | Run one persona (repeatable). Matches by persona_id. |
| --run-id <id> | Explicit run-id slug. Default: voice-eval-YYYY-MM-DD-HHMMSS. |
| --resume <run-id> | Skip personas that passed in the prior progress.jsonl; re-run anything else. |
| --watch | Live single-line TTY status (qN/TOTAL ... pass=P fail=F err=E ... elapsed=MM:SS ETA=MM:SS). No-ops when stdout is piped. |
| --no-progress | Disable the progress.jsonl + per-failure JSONs. |
| --output <path> | Final report JSON path. Default: <results>/<run-id>-report.json. |
| --label <text> | Free-form run label, e.g. "post-deploy-d215a0a0". |
| --user-id, --tenant-id, --tenant-slug | Override the synthetic user/tenant IDs. Defaults to fresh uuid4s. |
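The --resume row above can be pictured with a small sketch. The record layout is an assumption: the field names kind, persona_id and status are hypothetical, since the real progress.jsonl schema is not shown on this page.

```python
import json
from pathlib import Path


def personas_to_skip(progress_path: Path) -> set[str]:
    """Collect persona ids whose summary line in a prior progress.jsonl is a pass.

    Hypothetical record layout (not taken from the harness):
      {"kind": "persona_summary", "persona_id": "...", "status": "pass"}
    """
    passed: set[str] = set()
    if not progress_path.exists():
        return passed  # nothing to resume from, so nothing is skipped

    with progress_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # an interrupted run can leave a truncated last line
            if not isinstance(record, dict) or record.get("kind") != "persona_summary":
                continue  # per-turn lines are ignored here
            persona_id = record.get("persona_id")
            if not persona_id:
                continue
            if record.get("status") == "pass":
                passed.add(persona_id)
            else:
                passed.discard(persona_id)  # a later non-pass summary wins
    return passed
```

Anything not in the returned set, including personas that never reached their summary line, would be run again.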
Interpreting results
After the run finishes, the harness writes:
- <results>/<run-id>-progress.jsonl: one line per turn (pass/fail/error) plus one persona-summary line.
- <results>/<run-id>-failures/<persona>_<turn>.json: full request + response + expected + actual for every failed turn.
- <results>/<run-id>-report.json: final aggregate (per-persona, per-issue-kind).
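As a rough illustration of how these paths relate to the run-id, the sketch below derives them from a results directory. The helper names are hypothetical and the harness may build its paths differently; only the naming patterns come from this page.

```python
from datetime import datetime
from pathlib import Path


def default_run_id(now: datetime) -> str:
    """Build a slug in the documented voice-eval-YYYY-MM-DD-HHMMSS shape."""
    return now.strftime("voice-eval-%Y-%m-%d-%H%M%S")


def artifact_paths(results_dir: Path, run_id: str) -> dict[str, Path]:
    """Map each documented artifact to its location under the results directory."""
    return {
        "progress": results_dir / f"{run_id}-progress.jsonl",
        "failures_dir": results_dir / f"{run_id}-failures",
        "report": results_dir / f"{run_id}-report.json",
    }


def failure_path(results_dir: Path, run_id: str, persona_id: str, turn_id: str) -> Path:
    """Locate the per-failure JSON for one persona turn."""
    failures_dir = artifact_paths(results_dir, run_id)["failures_dir"]
    return failures_dir / f"{persona_id}_{turn_id}.json"


run_id = default_run_id(datetime(2026, 5, 10, 14, 30, 0))
print(run_id)
print(failure_path(Path("results"), run_id, "persona_03_sofie_peters", "T2"))
```

Passing the timestamp in explicitly keeps the sketch neutral on whether the harness uses local time or UTC, which this page does not state.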
The console also prints a markdown summary table:
# Voice Eval Report: voice-eval-2026-05-10-143000
**Personas:** 6/7 passed
**Turns:** 71/74 passed
## Per-persona results
| Persona | Turns | Passed | Failed | Status |
|---|---|---|---|---|
| persona_02_dr_janssens | 10 | 10 | 0 | PASS |
| persona_03_sofie_peters | 10 | 7 | 3 | FAIL |
| persona_04_mevrouw_maeyens | 8 | 8 | 0 | PASS |
| ...
## Per-issue breakdown
- expected_phrase_missing: 2
- language_mismatch: 1
A persona passes only if ALL its turns pass. Any failed or errored turn fails the persona.
When you see a FAIL row, open the matching <persona>_<turn>.json under the failures directory — it contains the user_text, the orchestrator's reply, and the assertion-kind breakdown so you can diagnose without re-running the eval.
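The all-turns-must-pass rule can be written down in a few lines. This is a sketch only: summarize_persona is a hypothetical name, and the status strings are taken from the pass/fail/error values mentioned for progress.jsonl.

```python
def summarize_persona(turn_statuses: list[str]) -> dict:
    """Roll per-turn statuses ('pass', 'fail', 'error') up to one persona result."""
    passed = sum(1 for status in turn_statuses if status == "pass")
    # An empty persona is treated as a failure here; that choice is an assumption.
    all_passed = bool(turn_statuses) and passed == len(turn_statuses)
    return {
        "turns": len(turn_statuses),
        "passed": passed,
        "status": "PASS" if all_passed else "FAIL",
    }


print(summarize_persona(["pass"] * 7 + ["fail"] * 3))
```

A single errored turn is enough to flip the persona to FAIL, the same as a failed assertion.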
Assertion-kind taxonomy
Every turn-level failure is tagged with one or more assertion_kind labels:
| Kind | Meaning |
|---|---|
| expected_phrase_missing | None of expected_phrases appeared in the response (case-insensitive). |
| forbidden_phrase_present | At least one must_not_phrases entry appeared. |
| intent_mismatch | expected_intent did not match the orchestrator's reported intent. |
| language_mismatch | expected_language did not match the response's target_language. |
| citations_below_min | Citations count was below citations_min. |
| latency_over_budget | End-to-end turn latency exceeded latency_budget_ms. |
| safety_verdict_mismatch | expected_safety_verdict did not match (when not "n/a"). |
The per-issue-kind aggregator counts each kind once per failing turn; the per-issue breakdown in the report is the sum of those counts across all turns.
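A minimal sketch of that counting rule, assuming each failing turn is represented as a list of its assertion_kind labels (the function name and input shape are hypothetical):

```python
from collections import Counter


def count_issue_kinds(failing_turns: list[list[str]]) -> Counter:
    """Count each assertion kind once per failing turn."""
    counts: Counter = Counter()
    for kinds in failing_turns:
        # set() de-duplicates, so a kind repeated within one turn still adds 1
        counts.update(set(kinds))
    return counts


breakdown = count_issue_kinds(
    [
        ["expected_phrase_missing", "language_mismatch"],
        ["expected_phrase_missing"],
    ]
)
for kind, total in sorted(breakdown.items()):
    print(f"- {kind}: {total}")
```

With this input the totals come out as expected_phrase_missing: 2 and language_mismatch: 1, the same shape as the breakdown in the sample report.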
Persona JSON schema
Each persona is a single JSON file under tests/evaluation/voice_scenarios/:
{
  "persona_id": "persona_03_sofie_peters",
  "title": "Sofie Peters — neighborhood Dutch caller",
  "description": "...",
  "language_expected": "nl",
  "tone_register": "informal",
  "turns": [
    {
      "turn_id": "T1",
      "user_text": "Welke artsen werken op cardiologie?",
      "expected_intent": "answered",
      "expected_safety_verdict": "safe",
      "expected_phrases": ["cardiologie", "arts"],
      "must_not_phrases": ["medisch advies"],
      "expected_language": "nl",
      "citations_min": 0,
      "latency_budget_ms": 8000,
      "notes": "first-turn lookup"
    },
    {
      "turn_id": "T2",
      "user_text": "En waar zit dat in het ziekenhuis?",
      "expected_intent": "any",
      "expected_safety_verdict": "n/a",
      "expected_phrases": ["campus"],
      "must_not_phrases": [],
      "expected_language": "nl",
      "citations_min": 0,
      "latency_budget_ms": 8000,
      "notes": "memory-dependent follow-up"
    }
  ],
  "pass_criteria": ["..."]
}
Schema is validated with Pydantic v2 (extra="forbid") — typos in field names are rejected, not silently ignored.
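To show what rejecting unknown fields looks like in practice, here is a dependency-free sketch that mimics only that one behaviour on plain dicts. It is a stand-in for illustration, not the harness's Pydantic model; the allowed-key sets are copied from the example above and may not be exhaustive.

```python
# Keys copied from the example persona; assumed, not confirmed, to be the full set.
ALLOWED_PERSONA_KEYS = {
    "persona_id", "title", "description", "language_expected",
    "tone_register", "turns", "pass_criteria",
}
ALLOWED_TURN_KEYS = {
    "turn_id", "user_text", "expected_intent", "expected_safety_verdict",
    "expected_phrases", "must_not_phrases", "expected_language",
    "citations_min", "latency_budget_ms", "notes",
}


def find_unknown_fields(persona: dict) -> list[str]:
    """List every key that the schema would reject as an unknown field."""
    problems = []
    for key in sorted(set(persona) - ALLOWED_PERSONA_KEYS):
        problems.append(f"persona: unknown field {key!r}")
    for index, turn in enumerate(persona.get("turns", [])):
        for key in sorted(set(turn) - ALLOWED_TURN_KEYS):
            problems.append(f"turns[{index}]: unknown field {key!r}")
    return problems


# A typo such as "expected_phrase" (missing the trailing "s") is reported
# instead of silently leaving the turn without a phrase assertion.
draft = {
    "persona_id": "persona_NN_my_persona",
    "turns": [{"turn_id": "T1", "user_text": "Hallo", "expected_phrase": ["hallo"]}],
}
print(find_unknown_fields(draft))
```

The value of the strict mode is that a misspelled assertion field fails loudly at load time rather than turning into a turn that can never fail.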
Field semantics
- persona_id: filesystem-safe identifier; used as the question_id prefix in progress.jsonl and as the failure JSON filename slug.
- language_expected / expected_language: ISO 639-1 code (nl / en / fr / it).
- turns: at least one turn; turn_id values must be unique within a persona.
- expected_intent: "any" skips the assertion. Otherwise an exact match against the orchestrator's reported conversational_intent.
- expected_safety_verdict: "n/a" skips the assertion. Otherwise one of safe, informational_only_with_disclaimer, safety_refusal_with_redirect, prompt_injection_refused.
- expected_phrases: empty list = no requirement; non-empty = at least ONE must appear (case-insensitive substring).
- must_not_phrases: empty list = no exclusion; non-empty = NONE may appear (case-insensitive substring).
- citations_min: response must surface at least this many citations (default 0).
- latency_budget_ms: end-to-end turn latency budget; ≤ budget passes (default 8000).
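The field semantics above map fairly directly onto a per-turn check. The sketch below is one way to express them, not the harness's actual code: check_turn and the keys of the actual dict (reply_text, intent, language, citation_count, latency_ms, safety_verdict) are hypothetical names.

```python
def check_turn(expected: dict, actual: dict) -> list[str]:
    """Return the assertion_kind labels a turn violates; an empty list means pass."""
    kinds: list[str] = []
    reply = actual["reply_text"].casefold()

    # expected_phrases: empty list means no requirement, otherwise at least one must appear
    phrases = expected["expected_phrases"]
    if phrases and not any(p.casefold() in reply for p in phrases):
        kinds.append("expected_phrase_missing")

    # must_not_phrases: none may appear
    if any(p.casefold() in reply for p in expected["must_not_phrases"]):
        kinds.append("forbidden_phrase_present")

    # "any" skips the intent assertion; otherwise the match is exact
    if expected["expected_intent"] != "any" and expected["expected_intent"] != actual["intent"]:
        kinds.append("intent_mismatch")

    # language has no wildcard value
    if expected["expected_language"] != actual["language"]:
        kinds.append("language_mismatch")

    if actual["citation_count"] < expected.get("citations_min", 0):
        kinds.append("citations_below_min")

    # a latency exactly equal to the budget still passes
    if actual["latency_ms"] > expected.get("latency_budget_ms", 8000):
        kinds.append("latency_over_budget")

    # "n/a" skips the safety assertion
    verdict = expected["expected_safety_verdict"]
    if verdict != "n/a" and verdict != actual["safety_verdict"]:
        kinds.append("safety_verdict_mismatch")

    return kinds
```

Returning every violated kind, rather than stopping at the first, keeps the output compatible with a turn being tagged with one or more assertion_kind labels.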
Conversation continuity
All turns within a persona share a single conversation_id minted at turn 1. This means turns 2+ exercise the orchestrator's memory path (e.g. "and that doctor we just discussed" only resolves correctly if the prior turn's context survived).
When the persona ends, that conversation_id is forgotten — the next persona starts a fresh call.
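The lifetime of the conversation_id can be sketched as a simple loop. Here send_turn is a hypothetical stand-in for the call into VoiceLLMOrchestrator.query_stream(...), whose real signature is not shown on this page, and minting the id with uuid4 is an assumption.

```python
import uuid
from typing import Callable


def run_persona(
    persona: dict,
    send_turn: Callable[[str, str], str],
) -> list[tuple[str, str]]:
    """Drive every turn of one persona through a single shared conversation."""
    conversation_id = str(uuid.uuid4())  # minted once, before turn 1
    transcript = []
    for turn in persona["turns"]:
        # Every turn reuses the same id, so turns 2+ can rely on earlier context.
        reply = send_turn(conversation_id, turn["user_text"])
        transcript.append((turn["turn_id"], reply))
    # The id is local to this call, so the next persona starts from a fresh one.
    return transcript
```

Keeping the id local to the function is one simple way to guarantee that no context leaks from one persona into the next.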
Adding a new persona
- Pick a free persona_id (suggested format persona_NN_<short_slug>).
- Create a new JSON file in tests/evaluation/voice_scenarios/ matching the schema above.
- Validate locally:
  python -c "from pathlib import Path; from tests.evaluation.run_voice_evaluation import load_persona; load_persona(Path('tests/evaluation/voice_scenarios/persona_NN_my_persona.json'))"
- Smoke-run just the new persona:
  python -m tests.evaluation.run_voice_evaluation --persona persona_NN_my_persona --watch
For more guidance on what makes a good persona (linguistic register, language drift, edge cases), see the persona pages under voice/test-scenarios/.
Why it isn't in CI
Each turn is one or more LLM tool-use round trips against GPT-4.1, plus the live RAG path. A 7-persona, 74-turn eval costs roughly the same as the text golden eval — a few dollars in API costs, ~5–10 minutes wall-clock. Running it on every PR would be wasteful and would add LLM-API flakiness to the CI signal.
The deterministic sub-checks that DO run in CI live in backend/tests/unit/services/voice/ (e.g. test_voice_llm_orchestrator.py, test_voice_answer_shaper.py, test_voice_thin_pre_filter.py). Those mock the LLM and pin the orchestrator's control flow / pre-filter behavior / shaper output, which is the bulk of what can regress without a live model call.
The voice golden eval is the integration-level check that catches what unit tests can't: the LLM picking the wrong tool, the answer shape drifting, intent classification regressing under realistic phrasings, latency creeping over budget after a refactor.
Cross-references
- Evaluation & SOTA Comparison: the broader voice evaluation landscape (golden seed, SOTA benchmark, phone smoke set).
- Architecture: the orchestrator + tool-use surface this harness drives.
- Conversational intent: semantics of the conversational_intent field the harness asserts on.
- Language locking: why expected_language is a hard mismatch (no wildcard).