Voice Golden Eval

The voice golden eval is a manually-run integration harness that drives multi-turn persona scenarios through VoiceLLMOrchestrator.query_stream(...) and asserts on the response. It is the voice-channel sibling of the text golden eval (backend/tests/evaluation/run_evaluation.py).

It is NOT wired into CI. Eval runs are LLM-cost-heavy and slow (live tool-use round trips per turn, plus the live RAG path). Operators run it manually:

before merging a voice-orchestrator change
after a pilot deploy that touched voice
once per sprint as a sanity sweep

For a quick smoke gate during development, see the deterministic sub-tests under backend/tests/unit/services/voice/ — those run on every PR.

What it is, and what it isn't

It is a regression harness for the voice agent's reply shape, covering phrases, intent, language, latency, and citations across a curated set of caller personas. Each persona is a multi-turn conversation; each turn carries explicit assertions ("expected_phrases", "must_not_phrases", "expected_language", "citations_min", "latency_budget_ms", "expected_intent", "expected_safety_verdict").

It isn't an STT or TTS quality benchmark: STT is bypassed entirely (the harness feeds text as if it had been transcribed). It also isn't a full-conversation realism benchmark; that role belongs to the SOTA evaluator (evaluation.md) and the phone smoke set, which exercise STT/TTS too.

How to run

cd backend
source venv/bin/activate

# Run every persona in tests/evaluation/voice_scenarios/
python -m tests.evaluation.run_voice_evaluation

# Run a specific persona
python -m tests.evaluation.run_voice_evaluation --persona persona_03_sofie_peters

# Multiple personas (repeat the flag)
python -m tests.evaluation.run_voice_evaluation \
    --persona persona_02_dr_janssens \
    --persona persona_06_mariam_yusuf

# Live single-line TTY status while it runs
python -m tests.evaluation.run_voice_evaluation --watch

# Tag the run for traceability in the report JSON
python -m tests.evaluation.run_voice_evaluation --label "post-deploy-d215a0a0"

# Resume an interrupted run (skip personas that fully passed)
python -m tests.evaluation.run_voice_evaluation --resume voice-eval-2026-05-10-143000

# Custom output path
python -m tests.evaluation.run_voice_evaluation \
    --output ./reports/post-deploy.json

CLI flags

| Flag | Purpose |
|---|---|
| `--scenarios-dir <path>` | Override the persona JSON directory (default: `tests/evaluation/voice_scenarios/`). |
| `--persona <id>` | Run one persona (repeatable). Matches by `persona_id`. |
| `--run-id <id>` | Explicit run-id slug. Default: `voice-eval-YYYY-MM-DD-HHMMSS`. |
| `--resume <run-id>` | Skip personas that passed in the prior progress.jsonl; re-run anything else. |
| `--watch` | Live single-line TTY status (`qN/TOTAL ... pass=P fail=F err=E ... elapsed=MM:SS ETA=MM:SS`). No-ops when stdout is piped. |
| `--no-progress` | Disable the progress.jsonl + per-failure JSONs. |
| `--output <path>` | Final report JSON path. Default: `<results>/<run-id>-report.json`. |
| `--label <text>` | Free-form run label, e.g. `"post-deploy-d215a0a0"`. |
| `--user-id`, `--tenant-id`, `--tenant-slug` | Override the synthetic user/tenant IDs. Defaults to fresh `uuid4`s. |
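
To make the flag semantics concrete, here is a minimal argparse sketch that mirrors the flags listed above. It is not the harness's actual parser: the function names (`build_parser`, `default_run_id`, `watch_enabled`) are hypothetical, and the real defaults and help strings may differ.

```python
import argparse
import sys
from datetime import datetime


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    p = argparse.ArgumentParser(prog="run_voice_evaluation")
    p.add_argument("--scenarios-dir", default="tests/evaluation/voice_scenarios/")
    # action="append" is what makes a flag repeatable; None means "run all".
    p.add_argument("--persona", action="append", metavar="ID")
    p.add_argument("--run-id", help="explicit run-id slug")
    p.add_argument("--resume", metavar="RUN_ID")
    p.add_argument("--watch", action="store_true")
    p.add_argument("--no-progress", action="store_true")
    p.add_argument("--output")
    p.add_argument("--label")
    for flag in ("--user-id", "--tenant-id", "--tenant-slug"):
        p.add_argument(flag)
    return p


def default_run_id(now: datetime) -> str:
    """Build the documented default slug: voice-eval-YYYY-MM-DD-HHMMSS."""
    return now.strftime("voice-eval-%Y-%m-%d-%H%M%S")


def watch_enabled(args: argparse.Namespace) -> bool:
    """--watch is a no-op when stdout is piped, i.e. not a TTY."""
    return args.watch and sys.stdout.isatty()


args = build_parser().parse_args(
    ["--persona", "persona_02_dr_janssens",
     "--persona", "persona_06_mariam_yusuf",
     "--watch"]
)
print(args.persona)
print(default_run_id(datetime(2026, 5, 10, 14, 30, 0)))
```

Repeating `--persona` accumulates IDs into a list, and the fixed timestamp above yields `voice-eval-2026-05-10-143000`, the same shape as the run-ids used elsewhere on this page.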

Interpreting results

After the run finishes, the harness writes:

<results>/<run-id>-progress.jsonl — one line per turn (pass / fail / error) plus one persona-summary line.
<results>/<run-id>-failures/<persona>_<turn>.json — full request + response + expected + actual for every failed turn.
<results>/<run-id>-report.json — final aggregate (per-persona, per-issue-kind).
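
Because progress.jsonl holds one JSON object per line, a `--resume` run can decide what to skip by scanning the persona-summary lines. The sketch below illustrates that idea only. The record fields (`record_type`, `persona_id`, `status`) and the function name are assumptions made for illustration, not the harness's real schema.

```python
import json
from pathlib import Path


def personas_to_skip(progress_path: Path) -> set[str]:
    """Return persona IDs whose summary line reports a full pass.

    Assumed (hypothetical) summary record shape:
    {"record_type": "persona_summary", "persona_id": "...", "status": "pass"}
    """
    skip: set[str] = set()
    if not progress_path.exists():
        return skip
    with progress_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # An interrupted run may leave a truncated final line.
                continue
            if (record.get("record_type") == "persona_summary"
                    and record.get("status") == "pass"):
                skip.add(record["persona_id"])
    return skip
```

Personas with no summary line, or with a summary that is not a pass, stay out of the skip set and are re-run.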

The console also prints a markdown summary table:

# Voice Eval Report: voice-eval-2026-05-10-143000

**Personas:** 6/7 passed
**Turns:**    71/74 passed

## Per-persona results

| Persona | Turns | Passed | Failed | Status |
|---|---|---|---|---|
| persona_02_dr_janssens | 10 | 10 | 0 | PASS |
| persona_03_sofie_peters | 10 | 7 | 3 | FAIL |
| persona_04_mevrouw_maeyens | 8 | 8 | 0 | PASS |
| ...

## Per-issue breakdown

- expected_phrase_missing: 2
- language_mismatch: 1

A persona passes only if ALL its turns pass. Any failed or errored turn fails the persona.
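
The all-or-nothing rule can be sketched in a few lines. `TurnResult` and `persona_passed` are hypothetical names used for illustration; the harness's own result types are not shown on this page.

```python
from dataclasses import dataclass


@dataclass
class TurnResult:
    turn_id: str
    status: str  # "pass", "fail", or "error"


def persona_passed(turns: list[TurnResult]) -> bool:
    """A persona passes only when every turn has status "pass"."""
    return all(t.status == "pass" for t in turns)


# Ten turns, three of them failed, as in the persona_03 row above.
sofie = [TurnResult(f"T{i}", "pass") for i in range(1, 8)]
sofie += [TurnResult(f"T{i}", "fail") for i in range(8, 11)]
print("PASS" if persona_passed(sofie) else "FAIL")
```

A single errored turn has the same effect as a failed one, since only `"pass"` satisfies the check.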

When you see a FAIL row, open the matching <persona>_<turn>.json under the failures directory — it contains the user_text, the orchestrator's reply, and the assertion-kind breakdown so you can diagnose without re-running the eval.

Assertion-kind taxonomy

Every turn-level failure is tagged with one or more assertion_kind labels:

| Kind | Meaning |
|---|---|
| `expected_phrase_missing` | None of `expected_phrases` appeared in the response (case-insensitive). |
| `forbidden_phrase_present` | At least one `must_not_phrases` entry appeared. |
| `intent_mismatch` | `expected_intent` did not match the orchestrator's reported intent. |
| `language_mismatch` | `expected_language` did not match the response's `target_language`. |
| `citations_below_min` | Citations count was below `citations_min`. |
| `latency_over_budget` | End-to-end turn latency exceeded `latency_budget_ms`. |
| `safety_verdict_mismatch` | `expected_safety_verdict` did not match (when not `"n/a"`). |

The report's per-issue breakdown comes from an aggregator that counts each kind once per failing turn.
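
As a rough illustration of how the taxonomy maps onto a turn, the sketch below checks one turn's expectations against an observed response and tallies the resulting kinds. The `expected` keys follow the persona schema. The `actual` keys (`text`, `intent`, `target_language`, `citations`, `latency_ms`, `safety_verdict`) and the function name are assumptions for illustration, not the harness's real response shape.

```python
from collections import Counter


def failed_kinds(expected: dict, actual: dict) -> list[str]:
    """Return the assertion_kind labels that fail for one turn."""
    kinds: list[str] = []
    text = actual["text"].lower()

    phrases = expected.get("expected_phrases", [])
    if phrases and not any(p.lower() in text for p in phrases):
        kinds.append("expected_phrase_missing")
    if any(p.lower() in text for p in expected.get("must_not_phrases", [])):
        kinds.append("forbidden_phrase_present")
    # "any" skips the intent assertion.
    if expected["expected_intent"] not in ("any", actual["intent"]):
        kinds.append("intent_mismatch")
    if expected["expected_language"] != actual["target_language"]:
        kinds.append("language_mismatch")
    if actual["citations"] < expected.get("citations_min", 0):
        kinds.append("citations_below_min")
    if actual["latency_ms"] > expected.get("latency_budget_ms", 8000):
        kinds.append("latency_over_budget")
    # "n/a" skips the safety-verdict assertion.
    if expected["expected_safety_verdict"] not in ("n/a", actual["safety_verdict"]):
        kinds.append("safety_verdict_mismatch")
    return kinds


expected = {
    "expected_intent": "answered",
    "expected_safety_verdict": "safe",
    "expected_phrases": ["cardiologie", "arts"],
    "must_not_phrases": ["medisch advies"],
    "expected_language": "nl",
    "citations_min": 0,
    "latency_budget_ms": 8000,
}
good = {"text": "Op cardiologie werken drie artsen.", "intent": "answered",
        "target_language": "nl", "citations": 0, "latency_ms": 5200,
        "safety_verdict": "safe"}
drifted = dict(good, text="Three doctors work in cardiology.", target_language="en")

# Each kind appears at most once per turn, so the tally is per failing turn.
per_issue = Counter(k for turn in (good, drifted) for k in failed_kinds(expected, turn))
print(dict(per_issue))
```

Here the Dutch reply passes every check, while the English reply fails on both the phrase and the language assertion, so each of those kinds is counted once.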

Persona JSON schema

Each persona is a single JSON file under tests/evaluation/voice_scenarios/:

{
  "persona_id": "persona_03_sofie_peters",
  "title": "Sofie Peters — neighborhood Dutch caller",
  "description": "...",
  "language_expected": "nl",
  "tone_register": "informal",
  "turns": [
    {
      "turn_id": "T1",
      "user_text": "Welke artsen werken op cardiologie?",
      "expected_intent": "answered",
      "expected_safety_verdict": "safe",
      "expected_phrases": ["cardiologie", "arts"],
      "must_not_phrases": ["medisch advies"],
      "expected_language": "nl",
      "citations_min": 0,
      "latency_budget_ms": 8000,
      "notes": "first-turn lookup"
    },
    {
      "turn_id": "T2",
      "user_text": "En waar zit dat in het ziekenhuis?",
      "expected_intent": "any",
      "expected_safety_verdict": "n/a",
      "expected_phrases": ["campus"],
      "must_not_phrases": [],
      "expected_language": "nl",
      "citations_min": 0,
      "latency_budget_ms": 8000,
      "notes": "memory-dependent follow-up"
    }
  ],
  "pass_criteria": ["..."]
}

Schema is validated with Pydantic v2 (extra="forbid") — typos in field names are rejected, not silently ignored.
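
For readers unfamiliar with `extra="forbid"`, the following is a minimal Pydantic v2 sketch of models shaped like the JSON above. The class names (`TurnSpec`, `PersonaSpec`) are hypothetical, and which fields are required is an assumption; only the `citations_min` and `latency_budget_ms` defaults come from the field semantics on this page.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field, ValidationError, model_validator

Language = Literal["nl", "en", "fr", "it"]
SafetyVerdict = Literal[
    "n/a",
    "safe",
    "informational_only_with_disclaimer",
    "safety_refusal_with_redirect",
    "prompt_injection_refused",
]


class TurnSpec(BaseModel):
    model_config = ConfigDict(extra="forbid")  # unknown keys raise an error

    turn_id: str
    user_text: str
    expected_intent: str
    expected_safety_verdict: SafetyVerdict
    expected_phrases: list[str]
    must_not_phrases: list[str]
    expected_language: Language
    citations_min: int = 0
    latency_budget_ms: int = 8000
    notes: str = ""  # assumed optional


class PersonaSpec(BaseModel):
    model_config = ConfigDict(extra="forbid")

    persona_id: str
    title: str
    description: str
    language_expected: Language
    tone_register: str
    turns: list[TurnSpec] = Field(min_length=1)  # at least one turn
    pass_criteria: list[str]

    @model_validator(mode="after")
    def _unique_turn_ids(self) -> "PersonaSpec":
        ids = [t.turn_id for t in self.turns]
        if len(ids) != len(set(ids)):
            raise ValueError("turn_id values must be unique within a persona")
        return self


turn = {
    "turn_id": "T1", "user_text": "Welke artsen werken op cardiologie?",
    "expected_intent": "answered", "expected_safety_verdict": "safe",
    "expected_phrases": ["cardiologie", "arts"], "must_not_phrases": [],
    "expected_language": "nl",
}
persona = {
    "persona_id": "persona_03_sofie_peters", "title": "Sofie Peters",
    "description": "...", "language_expected": "nl",
    "tone_register": "informal", "turns": [turn], "pass_criteria": ["..."],
}
PersonaSpec.model_validate(persona)  # valid

typo = dict(persona, turns=[dict(turn, expected_phrase=["campus"])])
try:
    PersonaSpec.model_validate(typo)
except ValidationError as exc:
    print(exc.error_count(), "validation error(s)")
```

The misspelled `expected_phrase` key is rejected outright instead of being dropped, which is the behavior the sentence above describes.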

Field semantics

persona_id: filesystem-safe identifier; used as the question_id prefix in progress.jsonl and as the failure JSON filename slug.
language_expected / expected_language: ISO 639-1 code (nl/en/fr/it).
turns: at least one turn; turn_id values must be unique within a persona.
expected_intent: "any" skips the assertion. Otherwise an exact match against the orchestrator's reported conversational_intent.
expected_safety_verdict: "n/a" skips the assertion. Otherwise one of safe, informational_only_with_disclaimer, safety_refusal_with_redirect, prompt_injection_refused.
expected_phrases: empty list = no requirement; non-empty = at least ONE must appear (case-insensitive substring).
must_not_phrases: empty list = no exclusion; non-empty = NONE may appear (case-insensitive substring).
citations_min: response must surface at least this many citations (default 0).
latency_budget_ms: end-to-end turn latency budget; ≤ budget passes (default 8000).

Conversation continuity

All turns within a persona share a single conversation_id minted at turn 1. This means turns 2+ exercise the orchestrator's memory path (e.g. "and that doctor we just discussed" only resolves correctly if the prior turn's context survived).

When the persona ends, that conversation_id is forgotten — the next persona starts a fresh call.
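
The continuity rule can be illustrated with a stand-in orchestrator. Everything in this sketch is hypothetical: `FakeOrchestrator` only records what it is sent, and the real harness drives VoiceLLMOrchestrator.query_stream(...), whose signature is not shown on this page.

```python
import uuid
from collections import defaultdict


class FakeOrchestrator:
    """Stand-in that remembers user turns per conversation_id."""

    def __init__(self) -> None:
        self.history: dict[str, list[str]] = defaultdict(list)

    def reply(self, conversation_id: str, user_text: str) -> int:
        self.history[conversation_id].append(user_text)
        # Report how many turns of context this conversation now holds.
        return len(self.history[conversation_id])


def run_persona(orchestrator: FakeOrchestrator, user_texts: list[str]):
    """Mint one conversation_id at turn 1 and reuse it for every turn."""
    conversation_id = str(uuid.uuid4())
    seen = [orchestrator.reply(conversation_id, text) for text in user_texts]
    return conversation_id, seen


orch = FakeOrchestrator()
first_id, first_seen = run_persona(
    orch, ["Welke artsen werken op cardiologie?", "En waar zit dat in het ziekenhuis?"]
)
second_id, second_seen = run_persona(orch, ["Hello, who works in cardiology?"])
print(first_seen, second_seen)
```

The second turn of the first persona sees two turns of context, while the next persona starts again at one because it runs under a fresh conversation_id.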

Adding a new persona

Pick a free persona_id (suggested format persona_NN_<short_slug>).
Create a new JSON file in tests/evaluation/voice_scenarios/ matching the schema above.

Validate locally:

python -c "from pathlib import Path; from tests.evaluation.run_voice_evaluation import load_persona; load_persona(Path('tests/evaluation/voice_scenarios/persona_NN_my_persona.json'))"

Smoke-run just the new persona:

python -m tests.evaluation.run_voice_evaluation --persona persona_NN_my_persona --watch

For more guidance on what makes a good persona (linguistic register, language drift, edge cases), see the persona pages under voice/test-scenarios/.

Why it isn't in CI

Each turn is one or more LLM tool-use round trips against GPT-4.1, plus the live RAG path. A 7-persona, 74-turn eval costs roughly the same as the text golden eval — a few dollars in API costs, ~5–10 minutes wall-clock. Running it on every PR would be wasteful and would add LLM-API flakiness to the CI signal.

The deterministic sub-checks that DO run in CI live in backend/tests/unit/services/voice/ (e.g. test_voice_llm_orchestrator.py, test_voice_answer_shaper.py, test_voice_thin_pre_filter.py). Those mock the LLM and pin the orchestrator's control flow / pre-filter behavior / shaper output, which is the bulk of what can regress without a live model call.

The voice golden eval is the integration-level check that catches what unit tests can't: the LLM picking the wrong tool, the answer shape drifting, intent classification regressing under realistic phrasings, latency creeping over budget after a refactor.

Cross-references

Evaluation & SOTA Comparison — the broader voice evaluation landscape (golden seed, SOTA benchmark, phone smoke set).
Architecture — the orchestrator + tool-use surface this harness drives.
Conversational intent — semantics of the conversational_intent field the harness asserts on.
Language locking: why an expected_language mismatch is a hard failure (the field has no wildcard value).
