A Voice Call, End-to-End

The voice channel is documented one concept at a time — architecture, language-locking, the Value Framework, citation pipeline, answer-shaping, triple-defense. This page is the connective tissue for the spoken channel: it follows one real four-turn phone call — a caller booking a cardiology appointment — through every stage, so you can see how the pieces compose across a conversation. It is the voice sibling of the text channel's A Query, End-to-End.

Every value here is real

The intents, confidences, follow-up resolutions, routing decisions, capability hits, enrichment counts, and spoken answers below are not illustrative placeholders. They were captured live from the pilot on 2026-06-01 by driving the public WebSocket with channel="voice" for the four turns (conversation 24f9d157…) and reading the backend's own structlog output by conversation id.
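Reading a trace "by conversation id" can be pictured as filtering JSON log lines. The sketch below is a minimal illustration, assuming the backend emits one JSON object per line (as structlog's JSON renderer does) and that the id lives under a key named `conversation_id`. That key name, the helper name, and the sample lines are assumptions for illustration, not taken from the pilot. Because the id is only given here as a prefix, the filter matches on the prefix.

```python
import json
from typing import Dict, Iterable, Iterator


def events_for_conversation(lines: Iterable[str], id_prefix: str) -> Iterator[Dict]:
    """Yield parsed log events whose conversation id starts with id_prefix."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as tracebacks or banners
        if not isinstance(event, dict):
            continue
        # "conversation_id" is an assumed key name, not confirmed by the source.
        if str(event.get("conversation_id", "")).startswith(id_prefix):
            yield event


# Hypothetical sample lines; the id suffixes are made up for the demo.
sample = [
    '{"event": "voice_classify_first_capability_hit", "conversation_id": "24f9d157-demo"}',
    '{"event": "unrelated", "conversation_id": "ffffffff-demo"}',
    "not json",
]
hits = list(events_for_conversation(sample, "24f9d157"))
```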

This page is an honest hybrid: the backend cognition is live-captured; the agent-side stages with no backend log — Deepgram STT confidence, language voting, the three-tier filler ladder, ElevenLabs TTS prosody — are sourced from real code constants and documented pilot traces (e.g. the a0757271 grace-tuning chain), and are labeled as such rather than dramatized.

A voice turn at a glance

Every turn flows through this pipeline. The fork in the middle is the part the text channel doesn't have: a turn either matches a registry capability and is answered deterministically (no LLM), or falls through to the agentic GPT-4.1 tool loop that calls search_hospital_kb → RAG.
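The fork can be sketched as a registry lookup keyed by intent. This is a simplified illustration, not the orchestrator's code: `Classification`, `route_turn`, the registry shape, and both handlers are hypothetical names introduced here. Only the intent names come from the trace on this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Classification:
    intent: str
    confidence: float
    query: str


# Hypothetical handler type: a deterministic answer built without an LLM.
CapabilityHandler = Callable[[Classification], str]


def route_turn(
    classification: Classification,
    registry: Dict[str, CapabilityHandler],
    agentic_loop: Callable[[Classification], str],
) -> Tuple[str, str]:
    """Return (path, answer). A registry hit skips the LLM tool loop."""
    handler = registry.get(classification.intent)
    if handler is not None:
        return "capability", handler(classification)
    return "agentic", agentic_loop(classification)


# Stub handlers standing in for the real capability and the real tool loop.
registry = {"doctor_schedule_query": lambda c: f"schedule lookup for: {c.query}"}


def agentic(c: Classification) -> str:
    return f"agentic RAG for: {c.query}"


turn2 = route_turn(
    Classification("doctor_lookup", 0.95, "Welke artsen werken er op de dienst cardiologie?"),
    registry,
    agentic,
)
turn4 = route_turn(
    Classification("doctor_schedule_query", 0.95, "Wanneer werkt hij?"),
    registry,
    agentic,
)
```

With this registry, `doctor_lookup` takes the agentic path and `doctor_schedule_query` takes the deterministic one, matching the paths of turns 2 and 4 in the walkthrough.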

Stage legend

Each stage links to the page that explains it in depth, plus the glossary terms it uses.

Stage	What happens	Concept page	Glossary
STT	Deepgram Nova-3 streams the caller's audio to a transcript; doctor names bias the recognizer.	Architecture	—
Language lock	The first confirmed utterance pins the call's language; later turns don't re-detect (locked-out languages produce silent transcripts).	Language Locking	—
Orchestrator entry	The transcript reaches `VoiceLLMOrchestrator` over the WS with `channel="voice"`. `classify_terminal` short-circuits greetings/farewells/safety with zero LLM cost.	Architecture, Triple Defense	intent classifier
Capability fork	The shared classifier runs first; if the intent maps to a registry capability (doctor-schedule, billing, address) it is answered deterministically. Otherwise the agentic GPT-4.1 tool loop runs.	Architecture	intent classifier
Retrieve (agentic path)	The `search_hospital_kb` tool calls the same `RAGService` as chat, with `channel="voice"`: hybrid retrieval + the department doctor-list injection.	A Query, End-to-End, What is RAG	RAG, hybrid
Rerank	The Value Framework reranks chunks by intent × content-category affinity before context assembly.	Value Framework	Value Framework, reranking
Safety + shape	A regex post-filter detects medical content; `VoiceAnswerShaper` strips markdown / URLs / citation markers, formats phone numbers for prosody, caps the answer at two sentences, and prepends the disclaimer once per call.	Triple Defense, Answer Shaping, Safety	grounding, medical-advice refusal
Citations	Voice answers carry no inline `[N]` markers (TTS would read them aloud); citations are derived from the retrieved chunks instead.	Citation Pipeline	—
TTS	ElevenLabs Multilingual v2 streams the answer back in the locked voice; punctuation drives prosody (commas chunk a phone number). The filler ladder masks backend latency.	Prosody Injection, Adaptive TTS Speed, Context-Aware Filler	—
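The language-lock row describes a small piece of per-call state: the first confirmed utterance pins the language and later detections are ignored. A minimal sketch of that behaviour follows. The class name and the `confirmed` flag are assumptions; how an utterance becomes "confirmed" (STT confidence, language voting) is not modelled.

```python
from typing import Optional


class LanguageLock:
    """Per-call state: pins the language once and ignores later detections."""

    def __init__(self) -> None:
        self.locked: Optional[str] = None

    def observe(self, detected: str, confirmed: bool) -> Optional[str]:
        # Only the first *confirmed* utterance can set the lock.
        if self.locked is None and confirmed:
            self.locked = detected
        return self.locked


lock = LanguageLock()
before = lock.observe("nl", confirmed=False)  # unconfirmed: still unlocked
first = lock.observe("nl", confirmed=True)    # pins the call to "nl"
later = lock.observe("fr", confirmed=True)    # ignored: lock already set
```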

What the real trace reveals

Driving four turns through the live pilot surfaced three things the architecture diagram alone does not:

One conversation, two cognition paths. Turns 1–3 carry intents (ambiguous_symptom_description, doctor_lookup) that have no registry capability, so they fall through to the agentic GPT-4.1 → search_hospital_kb → RAG path. Turn 4 (doctor_schedule_query) matches a capability and short-circuits the LLM entirely (voice_classify_first_capability_hit), with no retrieval at all.
Follow-up resolution carries the thread. "professor Dens" becomes the full name on turn 3, and the bare pronoun "hij" resolves to "Prof. Dr. Jo Dens" on turn 4 — each done by the intent classifier reading 2–3 prior turns of history.
The voice shape diverges from chat on identical retrieval. For the same doctor_lookup, the chat channel renders a full 23-name roster, while the voice channel compresses it to two names plus "andere cardiologen" to honor the two-sentence TTS cap.

The call, turn by turn

A caller dials in with a symptom and walks all the way to a specific doctor's consultation hours. All values are from the live pilot trace (conversation 24f9d157).

Turn 1 — symptom → department (agentic RAG path)

Star stage: the intent rewrite turns the lay symptom into a clinical term, and the safety layer prepends a spoken disclaimer before routing the caller to a department.

Stage	Real trace	Reference
Raw	`"Ik heb last van hartkloppingen, bij welke dienst moet ik zijn?"`	—
Intent	`ambiguous_symptom_description` (0.90), `is_followup=False`; rewrite `"Welke afdelingen bij zol behandelen Palpitaties?"` — hartkloppingen → Palpitaties (lay → clinical); `latency=3643ms`	intent classifier
Path	no registry capability for this intent → agentic `search_hospital_kb` → RAG (`[RAG] Intent: … [pre-classified, LLM skipped]`)	Architecture
Retrieve	`specialty_resolved_via_taxonomy key='hartkloppingen' specialty='cardiologie'`; `top_k=5`; Value Framework `primary=clinical_info`, mismatch 0.000	hybrid
Safety + shape	medical content detected → disclaimer prepended; phone formatted for prosody	Answer Shaping
Spoken	"Daar kan ik geen specifiek medisch advies over geven; voor hartkloppingen kunt u terecht bij de afdeling Cardiologie van ZOL via het nummer nul-acht-negen, tweeëndertig, eenenvijftig, eenenvijftig." (089 32 51 51)	medical-advice refusal
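The spoken phone number in this turn can be reproduced with a small formatter: the area code is read digit by digit, each two-digit group is read as a Dutch number, and commas separate the groups so the TTS engine pauses between them. This is an illustrative sketch, not the pilot's formatter. The function names and the grouping rule (split on spaces) are assumptions.

```python
DIGITS = ["nul", "een", "twee", "drie", "vier", "vijf", "zes", "zeven", "acht", "negen"]
TEENS = ["tien", "elf", "twaalf", "dertien", "veertien",
         "vijftien", "zestien", "zeventien", "achttien", "negentien"]
TENS = {2: "twintig", 3: "dertig", 4: "veertig", 5: "vijftig",
        6: "zestig", 7: "zeventig", 8: "tachtig", 9: "negentig"}


def two_digit_words(group: str) -> str:
    """Read a two-digit group as a Dutch number, e.g. '32' -> 'tweeëndertig'."""
    if group[0] == "0":
        # Leading zero: fall back to digit-by-digit reading.
        return "-".join(DIGITS[int(d)] for d in group)
    n = int(group)
    if n < 20:
        return TEENS[n - 10]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    # Dutch puts the unit first; after "twee" and "drie" the joiner takes a diaeresis.
    joiner = "ën" if unit in (2, 3) else "en"
    return DIGITS[unit] + joiner + TENS[tens]


def speak_phone(number: str) -> str:
    """Turn a space-grouped phone number into comma-separated spoken groups."""
    spoken = []
    for group in number.split():
        if not group.isdigit():
            raise ValueError(f"unexpected group: {group!r}")
        if len(group) == 2:
            spoken.append(two_digit_words(group))
        else:
            spoken.append("-".join(DIGITS[int(d)] for d in group))
    return ", ".join(spoken)


turn1_phone = speak_phone("089 32 51 51")
```

For `089 32 51 51` this yields the same wording as the turn 1 trace. The turn 4 trace reads its number digit by digit instead, so the deterministic capability path evidently formats numbers differently; this sketch does not cover that variant.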

Turn 2 — doctor list (agentic RAG path, doctor-list injection)

Star stage: the department doctor-list injection feeds the full roster into context, but the voice answer-shaper compresses it to fit the two-sentence cap.

Stage	Real trace	Reference
Raw	`"Welke artsen werken er op de dienst cardiologie?"`	—
Intent	`doctor_lookup` (0.95), `is_followup=True`, `history_turns=1`; rewrite `"Welke artsen werken bij de afdeling Cardiologie…"`	intent classifier
Path	`doctor_lookup` has no registry capability → agentic `search_hospital_kb` → RAG	Architecture
Retrieve	`_inject_department_doctor_list dept='Cardiologie' found 25 doctor(s)`; `taxonomy_enrichment_applied=True doctors=23`; Value Framework `primary=general`	Stage 5c injection
Shape	two-sentence cap → two named doctors + "andere cardiologen" (the chat channel lists all 23)	Answer Shaping
Spoken	"Op de afdeling Cardiologie werken onder andere Dr. Koen Ameloot en Prof. Dr. Jo Dens, samen met andere cardiologen. Wenst u informatie over een specifieke arts of wilt u een afspraak maken?"	grounding
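The shaping steps named in the stage legend (strip markdown, URLs, and citation markers; cap at two sentences; prepend the disclaimer once per call) can be sketched as below. This is not `VoiceAnswerShaper` itself: the class name, the regexes, the abbreviation list, and the disclaimer wording are assumptions. Choosing which two doctors to name presumably happens upstream and is not modelled. The sentence splitter skips titles such as "Dr." and "Prof.", which a naive split on periods would break.

```python
import re
from typing import List

ABBREVIATIONS = {"dr", "prof", "mevr", "dhr"}  # assumed list of titles


def split_sentences(text: str) -> List[str]:
    """Split on ., ! or ? followed by whitespace, but not after a title."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if text[match.start()] == "." and last_word in ABBREVIATIONS:
            continue
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


class VoiceShaperSketch:
    _MD_LINK = re.compile(r"\[([^\]]+)\]\([^)]*\)")
    _CITATION = re.compile(r"\s*\[\d+\]")
    _URL = re.compile(r"https?://\S+?(?=[.,;!?]?(?:\s|$))")
    _MD_MARKS = re.compile(r"[*_`#]+")

    def __init__(self, disclaimer: str, max_sentences: int = 2) -> None:
        self.disclaimer = disclaimer
        self.max_sentences = max_sentences
        self._disclaimer_spoken = False  # per-call state

    def shape(self, text: str, medical: bool = False) -> str:
        text = self._MD_LINK.sub(r"\1", text)   # keep the link label only
        text = self._CITATION.sub("", text)     # TTS would read "[1]" aloud
        text = self._URL.sub("", text)
        text = self._MD_MARKS.sub("", text)
        text = re.sub(r"\s+", " ", text).strip()
        text = " ".join(split_sentences(text)[: self.max_sentences])
        if medical and not self._disclaimer_spoken:
            self._disclaimer_spoken = True
            text = f"{self.disclaimer} {text}"
        return text


raw = ("**Cardiologie**: Dr. Koen Ameloot en Prof. Dr. Jo Dens werken hier [1]. "
       "Bel ons gerust [2]. Zie ook https://example.org/artsen voor de volledige lijst.")
shaper = VoiceShaperSketch(disclaimer="Daar kan ik geen specifiek medisch advies over geven.")
first = shaper.shape(raw, medical=True)   # disclaimer prepended
second = shaper.shape(raw, medical=True)  # same call: disclaimer not repeated
```

In this sketch the disclaimer is added after the cap, so it does not count toward the two sentences. Whether the real shaper counts it is not stated here.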

Turn 3 — doctor profile (agentic RAG path, follow-up resolution)

Star stage: follow-up resolution expands the bare reference "professor Dens" into the canonical full name using two turns of history.

Stage	Real trace	Reference
Raw	`"Vertel me meer over professor Dens."`	—
Intent	`doctor_lookup` (0.95), `is_followup=True`, `history_turns=2`; rewrite `"Wie is Prof. Dr. Jo Dens en op welke afdeling werkt hij bij zol"` — "professor Dens" resolved to the full name	intent classifier
Path	agentic `search_hospital_kb` → RAG; `taxonomy_enrichment doctors=24`; Value Framework `primary=appointments`	Architecture
Spoken	"Prof. Dr. Jo Dens is cardioloog op de afdeling Cardiologie van Ziekenhuis Oost-Limburg. Hij houdt raadplegingen op campus Sint-Jan en in het Medisch Centrum André Dumont. Wenst u een afspraak of meer details?"	grounding

Turn 4 — schedule (deterministic capability short-circuit)

Star stage: the bare pronoun "hij" resolves to the doctor, the intent matches a registry capability, and the structured consultation_schedule lookup answers without any RAG retrieval.

Stage	Real trace	Reference
Raw	`"Wanneer werkt hij?"`	—
Intent	`doctor_schedule_query` (0.95), `is_followup=True`, `history_turns=3`; rewrite `"Wanneer heeft Prof. Dr. Jo Dens raadplegingen bij de afdeling…"` — pronoun "hij" → "Prof. Dr. Jo Dens"	intent classifier
Path	intent matches a registry capability → `voice_classify_first_capability_hit` → deterministic `DoctorScheduleCapability`; the agentic orchestrator and RAG are skipped	Architecture
Lookup	`capability_doctor_schedule {doctor: 'Prof. Dr. Jo Dens', day: None, cited: True}`; structured `consultation_schedule` read from the doctor's profile; tenant fallback phone (per-doctor phone empty)	Citation Pipeline
Spoken	"Prof. Dr. Jo Dens houdt raadpleging op woensdagvoormiddag (wekelijks), donderdagnamiddag (wekelijks), vrijdagvoormiddag (wekelijks) en vrijdagnamiddag (wekelijks). … U kan een afspraak maken via 089/acht nul, … acht nul, … acht nul." (089 80 80 80)	grounding
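The deterministic path can be pictured as formatting structured profile data into a sentence, with the tenant's phone number used when the doctor has none. The sketch below assumes a hypothetical data model (`Slot`, `DoctorProfile`) and sentence template. Only the field name `consultation_schedule`, the fallback rule, and the sample values come from the trace, and the elided parts of the real spoken answer are not reconstructed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    day: str        # e.g. "woensdag"
    part: str       # "voormiddag" or "namiddag"
    frequency: str  # e.g. "wekelijks"


@dataclass
class DoctorProfile:
    name: str
    consultation_schedule: List[Slot]
    phone: Optional[str] = None  # empty in the trace above


def join_dutch(items: List[str]) -> str:
    """Join items as 'a, b en c'."""
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " en " + items[-1]


def answer_schedule(profile: DoctorProfile, tenant_phone: str) -> str:
    slots = [f"{s.day}{s.part} ({s.frequency})" for s in profile.consultation_schedule]
    phone = profile.phone or tenant_phone  # tenant fallback when per-doctor phone is empty
    return (f"{profile.name} houdt raadpleging op {join_dutch(slots)}. "
            f"U kan een afspraak maken via {phone}.")


dens = DoctorProfile(
    name="Prof. Dr. Jo Dens",
    consultation_schedule=[
        Slot("woensdag", "voormiddag", "wekelijks"),
        Slot("donderdag", "namiddag", "wekelijks"),
        Slot("vrijdag", "voormiddag", "wekelijks"),
        Slot("vrijdag", "namiddag", "wekelijks"),
    ],
)
schedule_answer = answer_schedule(dens, tenant_phone="089 80 80 80")
```

No LLM call or retrieval is involved: the answer is a pure function of the profile and the tenant configuration, which is what makes this path deterministic.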

What's voice-specific

The same RAGService powers chat and voice, but channel="voice" flips a set of behaviours visible in the trace above:

Language is locked at the first confirmed utterance — every turn above stayed nl (Language Locking).
No inline [N] markers — TTS would read "open bracket one"; citations are derived from chunks instead.
Two-sentence cap — turn 2 lists two cardiologists + "andere cardiologen" rather than the chat channel's full roster (Answer Shaping).
Disclaimer once — turn 1's medical content triggers the spoken disclaimer; later turns don't repeat it.
Phone prosody — 089 32 51 51 and 089 80 80 80 are spoken with comma boundaries so ElevenLabs chunks the digits (Prosody Injection).
Filler ladder — while the backend thinks, tiers at 1.5 s / 4 s / 10 s mask the silence and cancel the moment the first sentence streams back (Context-Aware Filler).
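The filler ladder from the last item can be sketched with `asyncio`: a background task speaks a filler at each tier and is cancelled as soon as the backend answers. The tier times (1.5 s, 4 s, 10 s) come from this page. The function name, the `speak_filler` callback, and the `time_scale` knob (which shrinks the delays for the demo) are assumptions, and the sketch treats the completed backend call as the moment "the first sentence streams back".

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple

FILLER_TIERS_S = (1.5, 4.0, 10.0)  # tier deadlines from the page


async def run_with_fillers(
    backend: Awaitable[str],
    speak_filler: Callable[[int], None],
    tiers: Sequence[float] = FILLER_TIERS_S,
    time_scale: float = 1.0,
) -> str:
    """Await the backend while a ladder task fires fillers at each tier."""

    async def ladder() -> None:
        elapsed = 0.0
        for tier, deadline in enumerate(tiers, start=1):
            await asyncio.sleep((deadline - elapsed) * time_scale)
            elapsed = deadline
            speak_filler(tier)

    ladder_task = asyncio.create_task(ladder())
    try:
        return await backend
    finally:
        # Stop any pending filler the moment the answer is available.
        ladder_task.cancel()
        await asyncio.gather(ladder_task, return_exceptions=True)


async def _demo() -> Tuple[str, List[int]]:
    spoken: List[int] = []

    async def slow_backend() -> str:
        await asyncio.sleep(6.0 * 0.02)  # a "6 s" backend at 2% time scale
        return "klaar"

    answer = await run_with_fillers(slow_backend(), spoken.append, time_scale=0.02)
    return answer, spoken


answer, spoken = asyncio.run(_demo())
```

In the demo a backend that takes the equivalent of 6 s triggers tiers 1 and 2, and the 10 s tier is cancelled before it fires.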

Where to go next

A Query, End-to-End — the same treatment for the text channel (five one-shot queries).
Voice Channel — Architecture — the orchestrator composition, the full dial-through-BYE sequence diagram, and the latency budget.
Glossary — canonical definition of every term used above.
