A Voice Call, End-to-End
The voice channel is documented one concept at a time: architecture, language-locking, the Value Framework, citation pipeline, answer-shaping, triple-defense. This page is the connective tissue for the spoken channel: it follows one real four-turn phone call (a caller looking for a cardiology appointment) through every stage, so you can see how the pieces compose across a conversation. It is the voice sibling of the text channel's A Query, End-to-End.
The intents, confidences, follow-up resolutions, routing decisions, capability hits, enrichment counts, and spoken answers below are not illustrative placeholders. They were captured live from the pilot on 2026-06-01 by driving the public WebSocket with channel="voice" for the four turns (conversation 24f9d157…) and reading the backend's own structlog output by conversation id.
This page is an honest hybrid: the backend cognition is live-captured; the agent-side stages with no backend log — Deepgram STT confidence, language voting, the three-tier filler ladder, ElevenLabs TTS prosody — are sourced from real code constants and documented pilot traces (e.g. the a0757271 grace-tuning chain), and are labeled as such rather than dramatized.
A voice turn at a glance
Every turn flows through this pipeline. The fork in the middle is the part the text channel doesn't have: a turn either matches a registry capability and is answered deterministically (no LLM), or falls through to the agentic GPT-4.1 tool loop that calls search_hospital_kb → RAG.
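The fork can be pictured as a dictionary lookup that runs before any LLM call. The following is a minimal sketch under stated assumptions, not the pilot's code: `Classified`, `CAPABILITIES`, `agentic_tool_loop`, and `route` are hypothetical names, and the handlers return placeholder strings instead of real answers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Classified:
    """Output of the shared intent classifier (field names are assumed)."""
    intent: str
    confidence: float
    rewrite: str


# Hypothetical registry: intent name -> deterministic handler (no LLM involved).
CAPABILITIES: Dict[str, Callable[[Classified], str]] = {
    "doctor_schedule_query": lambda c: f"[deterministic answer for: {c.rewrite}]",
}


def agentic_tool_loop(c: Classified) -> str:
    # Stand-in for the GPT-4.1 tool loop that would call search_hospital_kb.
    return f"[agentic RAG answer for: {c.rewrite}]"


def route(c: Classified) -> Tuple[str, str]:
    """Return (path, answer): a capability hit wins, otherwise fall through."""
    handler = CAPABILITIES.get(c.intent)
    if handler is not None:
        return "capability", handler(c)
    return "agentic", agentic_tool_loop(c)
```

In this sketch a `doctor_lookup` intent falls through to the agentic path, while `doctor_schedule_query` is answered by the registry handler, which mirrors the split between turns 1–3 and turn 4 in the call below.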
Stage legend
Each stage links to the page that explains it in depth, plus the glossary terms it uses.
| Stage | What happens | Concept page | Glossary |
|---|---|---|---|
| STT | Deepgram Nova-3 streams the caller's audio to a transcript; doctor names bias the recognizer. | Architecture | — |
| Language lock | The first confirmed utterance pins the call's language; later turns don't re-detect (locked-out languages produce silent transcripts). | Language Locking | — |
| Orchestrator entry | The transcript reaches VoiceLLMOrchestrator over the WS with channel="voice". classify_terminal short-circuits greetings/farewells/safety with zero LLM cost. | Architecture, Triple Defense | intent classifier |
| Capability fork | The shared classifier runs first; if the intent maps to a registry capability (doctor-schedule, billing, address) it is answered deterministically. Otherwise the agentic GPT-4.1 tool loop runs. | Architecture | intent classifier |
| Retrieve (agentic path) | The search_hospital_kb tool calls the same RAGService as chat, with channel="voice": hybrid retrieval + the department doctor-list injection. | A Query, End-to-End, What is RAG | RAG, hybrid |
| Rerank | The Value Framework reranks chunks by intent × content-category affinity before context assembly. | Value Framework | Value Framework, reranking |
| Safety + shape | A regex post-filter detects medical content; VoiceAnswerShaper strips markdown / URLs / citation markers, formats phone numbers for prosody, caps the answer at two sentences, and prepends the disclaimer once per call. | Triple Defense, Answer Shaping, Safety | grounding, medical-advice refusal |
| Citations | Voice answers carry no inline [N] markers (TTS would read them aloud); citations are derived from the retrieved chunks instead. | Citation Pipeline | — |
| TTS | ElevenLabs Multilingual v2 streams the answer back in the locked voice; punctuation drives prosody (commas chunk a phone number). The filler ladder masks backend latency. | Prosody Injection, Adaptive TTS Speed, Context-Aware Filler | — |
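The "Safety + shape" row packs several transformations into one stage. A rough sketch of how they might compose is shown here; it is an illustration only, not the real VoiceAnswerShaper. The class name `VoiceShaperSketch`, the regexes, and the abbreviation handling are all assumptions, and the medical-content decision is passed in as a flag rather than detected.

```python
import re


class VoiceShaperSketch:
    """Illustrative only: strip text-channel markup, cap sentences, add disclaimer once."""

    def __init__(self, disclaimer: str, max_sentences: int = 2):
        self.disclaimer = disclaimer
        self.max_sentences = max_sentences
        self._disclaimer_spoken = False  # per-call state: one instance per call

    def shape(self, text: str, medical: bool = False) -> str:
        text = re.sub(r"\s*\[\d+\]", "", text)    # inline [N] citation markers
        text = re.sub(r"https?://\S+", "", text)  # URLs are unreadable aloud
        text = re.sub(r"[*_`#]+", "", text)       # common markdown symbols
        text = re.sub(r"\s+", " ", text).strip()
        # Split on sentence-final punctuation, but not after the titles
        # "Dr." and "Prof." (an assumed, deliberately tiny abbreviation list).
        sentences = re.split(r"(?<=[.!?])(?<!Dr\.)(?<!Prof\.)\s+", text)
        text = " ".join(sentences[: self.max_sentences])
        if medical and not self._disclaimer_spoken:
            self._disclaimer_spoken = True
            text = f"{self.disclaimer} {text}"
        return text
```

Keeping the disclaimer flag on the instance is one simple way to get "once per call" behaviour, assuming one shaper object is created per call.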
Driving four turns through the live pilot surfaced three things the architecture diagram alone does not:
- One conversation, two cognition paths. Turns 1–3 (ambiguous_symptom_description, doctor_lookup) have no registry capability, so they fall through to the agentic GPT-4.1 → search_hospital_kb → RAG path. Turn 4 (doctor_schedule_query) matches a capability and short-circuits the LLM entirely (voice_classify_first_capability_hit), with no retrieval at all.
- Follow-up resolution carries the thread. "professor Dens" becomes the full name on turn 3, and the bare pronoun "hij" resolves to "Prof. Dr. Jo Dens" on turn 4, each done by the intent classifier reading 2–3 prior turns of history.
- The voice shape diverges from chat on identical retrieval. The same doctor_lookup that the chat channel renders as a full 23-name roster, the voice channel compresses to two names plus "andere cardiologen" to honor the two-sentence TTS cap.
The call, turn by turn
A caller dials in with a symptom and walks all the way to a specific doctor's consultation hours. All values are from the live pilot trace (conversation 24f9d157).
Turn 1 — symptom → department (agentic RAG path)
Star stage: the intent rewrite turns the lay symptom into a clinical term, and the safety layer prepends a spoken disclaimer before routing the caller to a department.
| Stage | Real trace | Reference |
|---|---|---|
| Raw | "Ik heb last van hartkloppingen, bij welke dienst moet ik zijn?" | — |
| Intent | ambiguous_symptom_description (0.90), is_followup=False; rewrite "Welke afdelingen bij zol behandelen Palpitaties?" — hartkloppingen → Palpitaties (lay → clinical); latency=3643ms | intent classifier |
| Path | no registry capability for this intent → agentic search_hospital_kb → RAG ([RAG] Intent: … [pre-classified, LLM skipped]) | Architecture |
| Retrieve | specialty_resolved_via_taxonomy key='hartkloppingen' specialty='cardiologie'; top_k=5; Value Framework primary=clinical_info, mismatch 0.000 | hybrid |
| Safety + shape | medical content detected → disclaimer prepended; phone formatted for prosody | Answer Shaping |
| Spoken | "Daar kan ik geen specifiek medisch advies over geven; voor hartkloppingen kunt u terecht bij de afdeling Cardiologie van ZOL via het nummer nul-acht-negen, tweeëndertig, eenenvijftig, eenenvijftig." (089 32 51 51) | medical-advice refusal |
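The spoken form of 089 32 51 51 in this turn can be reproduced with a small formatter. The sketch below is a reconstruction from that one rendering, not the pilot's formatter: `two_digits_nl` and `speak_phone` are hypothetical names, and the grouping rule (area prefix digit by digit, two-digit groups as Dutch numbers, commas between groups) is inferred from the output. Turn 4 reads 089 80 80 80 differently, so the real formatter evidently has cases this sketch does not cover.

```python
UNITS = ["nul", "een", "twee", "drie", "vier", "vijf", "zes", "zeven", "acht",
         "negen", "tien", "elf", "twaalf", "dertien", "veertien", "vijftien",
         "zestien", "zeventien", "achttien", "negentien"]
TENS = {2: "twintig", 3: "dertig", 4: "veertig", 5: "vijftig",
        6: "zestig", 7: "zeventig", 8: "tachtig", 9: "negentig"}


def two_digits_nl(n: int) -> str:
    """Spell 0..99 in Dutch (unit first, then 'en', then the tens word)."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    # 'twee' and 'drie' end in 'e', so Dutch spelling needs a diaeresis: 'ën'.
    joiner = "ën" if UNITS[unit].endswith("e") else "en"
    return f"{UNITS[unit]}{joiner}{TENS[tens]}"


def speak_phone(number: str) -> str:
    """Turn a space-grouped number into comma-separated chunks for TTS prosody."""
    spoken = []
    for group in number.split():
        if len(group) == 2 and group[0] != "0":
            spoken.append(two_digits_nl(int(group)))
        else:
            # Prefixes and groups with a leading zero are read digit by digit.
            spoken.append("-".join(UNITS[int(d)] for d in group))
    return ", ".join(spoken)
```

The commas are the point: they give the TTS engine explicit pause boundaries between digit groups.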
Turn 2 — doctor list (agentic RAG path, doctor-list injection)
Star stage: the department doctor-list injection feeds the full roster into context, but the voice answer-shaper compresses it to fit the two-sentence cap.
| Stage | Real trace | Reference |
|---|---|---|
| Raw | "Welke artsen werken er op de dienst cardiologie?" | — |
| Intent | doctor_lookup (0.95), is_followup=True, history_turns=1; rewrite "Welke artsen werken bij de afdeling Cardiologie…" | intent classifier |
| Path | doctor_lookup has no registry capability → agentic search_hospital_kb → RAG | Architecture |
| Retrieve | _inject_department_doctor_list dept='Cardiologie' found 25 doctor(s); taxonomy_enrichment_applied=True doctors=23; Value Framework primary=general | Stage 5c injection |
| Shape | two-sentence cap → two named doctors + "andere cardiologen" (the chat channel lists all 23) | Answer Shaping |
| Spoken | "Op de afdeling Cardiologie werken onder andere Dr. Koen Ameloot en Prof. Dr. Jo Dens, samen met andere cardiologen. Wenst u informatie over een specifieke arts of wilt u een afspraak maken?" | grounding |
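The roster compression in the Shape row amounts to "name a few, summarize the rest". The source does not say whether this happens in code or in the model's answer, so the following is only an illustration of the resulting shape: `spoken_roster`, its parameters, and the sentence template are assumptions modeled on the first spoken sentence above.

```python
from typing import List


def spoken_roster(department: str, doctors: List[str], keep: int = 2,
                  rest_label: str = "andere cardiologen") -> str:
    """Name at most `keep` doctors and fold the remainder into one phrase."""
    named = doctors[:keep]
    if len(named) > 1:
        head = ", ".join(named[:-1]) + " en " + named[-1]
    else:
        head = "".join(named)
    if len(doctors) > keep:
        return (f"Op de afdeling {department} werken onder andere {head}, "
                f"samen met {rest_label}.")
    return f"Op de afdeling {department} werken {head}."
```

The chat channel would simply render the whole list; the voice channel spends its sentence budget on two names and an offer to continue.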
Turn 3 — doctor profile (agentic RAG path, follow-up resolution)
Star stage: follow-up resolution expands the bare reference "professor Dens" into the canonical full name using two turns of history.
| Stage | Real trace | Reference |
|---|---|---|
| Raw | "Vertel me meer over professor Dens." | — |
| Intent | doctor_lookup (0.95), is_followup=True, history_turns=2; rewrite "Wie is Prof. Dr. Jo Dens en op welke afdeling werkt hij bij zol" — "professor Dens" resolved to the full name | intent classifier |
| Path | agentic search_hospital_kb → RAG; taxonomy_enrichment doctors=24; Value Framework primary=appointments | Architecture |
| Spoken | "Prof. Dr. Jo Dens is cardioloog op de afdeling Cardiologie van Ziekenhuis Oost-Limburg. Hij houdt raadplegingen op campus Sint-Jan en in het Medisch Centrum André Dumont. Wenst u een afspraak of meer details?" | grounding |
Turn 4 — schedule (deterministic capability short-circuit)
Star stage: the bare pronoun "hij" resolves to the doctor, the intent matches a registry capability, and the structured consultation_schedule lookup answers without any RAG retrieval.
| Stage | Real trace | Reference |
|---|---|---|
| Raw | "Wanneer werkt hij?" | — |
| Intent | doctor_schedule_query (0.95), is_followup=True, history_turns=3; rewrite "Wanneer heeft Prof. Dr. Jo Dens raadplegingen bij de afdeling…" — pronoun "hij" → "Prof. Dr. Jo Dens" | intent classifier |
| Path | intent matches a registry capability → voice_classify_first_capability_hit → deterministic DoctorScheduleCapability; the agentic orchestrator and RAG are skipped | Architecture |
| Lookup | capability_doctor_schedule {doctor: 'Prof. Dr. Jo Dens', day: None, cited: True}; structured consultation_schedule read from the doctor's profile; tenant fallback phone (per-doctor phone empty) | Citation Pipeline |
| Spoken | "Prof. Dr. Jo Dens houdt raadpleging op woensdagvoormiddag (wekelijks), donderdagnamiddag (wekelijks), vrijdagvoormiddag (wekelijks) en vrijdagnamiddag (wekelijks). … U kan een afspraak maken via 089/acht nul, … acht nul, … acht nul." (089 80 80 80) | grounding |
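The Lookup row describes a structured read rather than a retrieval. A minimal sketch of such a lookup is given below, assuming a hypothetical profile layout: `Slot`, `DoctorProfile`, `TENANT_PHONE`, and `answer_schedule` are illustrative names, and the day/part/frequency fields are guessed from the spoken answer, not taken from the real consultation_schedule schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Slot:
    day: str        # e.g. "woensdag"
    part: str       # "voormiddag" or "namiddag"
    frequency: str  # e.g. "wekelijks"


@dataclass
class DoctorProfile:
    name: str
    consultation_schedule: List[Slot]
    phone: str = ""  # empty means: fall back to the tenant-level number


TENANT_PHONE = "089 80 80 80"  # the fallback number spoken in turn 4


def answer_schedule(profile: DoctorProfile, day: Optional[str] = None) -> Dict[str, str]:
    """Build the spoken schedule sentence from structured data only."""
    slots = [s for s in profile.consultation_schedule if day is None or s.day == day]
    parts = [f"{s.day}{s.part} ({s.frequency})" for s in slots]
    if not parts:
        text = f"Voor {profile.name} zijn geen raadplegingen gevonden."
    elif len(parts) == 1:
        text = f"{profile.name} houdt raadpleging op {parts[0]}."
    else:
        listed = ", ".join(parts[:-1]) + " en " + parts[-1]
        text = f"{profile.name} houdt raadpleging op {listed}."
    return {"text": text, "phone": profile.phone or TENANT_PHONE}
```

Because every word of the answer comes from stored fields, there is nothing for a model to hallucinate, which is the appeal of the deterministic path for schedule questions.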
What's voice-specific
The same RAGService powers chat and voice, but channel="voice" flips a set of behaviours visible in the trace above:
- Language is locked at the first confirmed utterance: every turn above stayed nl (Language Locking).
- No inline [N] markers: TTS would read "open bracket one"; citations are derived from chunks instead.
- Two-sentence cap: turn 2 lists two cardiologists + "andere cardiologen" rather than the chat channel's full roster (Answer Shaping).
- Disclaimer once: turn 1's medical content triggers the spoken disclaimer; later turns don't repeat it.
- Phone prosody: 089 32 51 51 and 089 80 80 80 are spoken with comma boundaries so ElevenLabs chunks the digits (Prosody Injection).
- Filler ladder: while the backend thinks, tiers at 1.5 s / 4 s / 10 s mask the silence and cancel the moment the first sentence streams back (Context-Aware Filler).
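The filler ladder in the last bullet is essentially a timer task racing the backend. One way to sketch that race with asyncio is shown here. Only the 1.5 s / 4 s / 10 s thresholds come from this page; `run_with_fillers`, the `speak` callback, the placeholder filler strings, and the `scale` knob (used to shrink the delays for experiments) are assumptions.

```python
import asyncio
import contextlib

FILLER_TIERS_S = (1.5, 4.0, 10.0)  # thresholds quoted on this page


async def run_with_fillers(backend, speak, tiers=FILLER_TIERS_S, scale=1.0):
    """Await backend(); speak a filler at each tier reached before it returns."""

    async def ladder():
        waited = 0.0
        for tier, threshold in enumerate(tiers, start=1):
            # Sleep only the gap between the previous tier and this one.
            await asyncio.sleep((threshold - waited) * scale)
            waited = threshold
            speak(f"<filler tier {tier}>")

    task = asyncio.create_task(ladder())
    try:
        first_sentence = await backend()
    finally:
        # Cancel the ladder as soon as the backend returns or fails.
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
    speak(first_sentence)
    return first_sentence
```

With these thresholds, a backend that takes about 2.4 s would trigger only the first filler; the second and third tiers are cancelled before they fire.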
Where to go next
- A Query, End-to-End — the same treatment for the text channel (five one-shot queries).
- Voice Channel — Architecture — the orchestrator composition, the full dial-through-BYE sequence diagram, and the latency budget.
- Glossary — canonical definition of every term used above.