
A Voice Call, End-to-End

The voice channel is documented one concept at a time — architecture, language-locking, the Value Framework, citation pipeline, answer-shaping, triple-defense. This page is the connective tissue for the spoken channel: it follows one real four-turn phone call — a caller booking a cardiology appointment — through every stage, so you can see how the pieces compose across a conversation. It is the voice sibling of the text channel's A Query, End-to-End.

Every value here is real

The intents, confidences, follow-up resolutions, routing decisions, capability hits, enrichment counts, and spoken answers below are not illustrative placeholders. They were captured live from the pilot on 2026-06-01 by driving the public WebSocket with channel="voice" for the four turns (conversation 24f9d157…) and then reading the backend's own structlog output, filtered by conversation id.

This page is an honest hybrid: the backend cognition is live-captured; the agent-side stages with no backend log — Deepgram STT confidence, language voting, the three-tier filler ladder, ElevenLabs TTS prosody — are sourced from real code constants and documented pilot traces (e.g. the a0757271 grace-tuning chain), and are labeled as such rather than dramatized.
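
Reading the backend log by conversation id, as described above, can be sketched roughly as follows. This is a minimal sketch under two assumptions that the page does not confirm: that structlog is configured to emit one JSON object per line, and that each record carries a field named conversation_id. The sample ids in the demo are made up.

```python
import json
from typing import Dict, Iterable, Iterator


def events_for_conversation(lines: Iterable[str], id_prefix: str) -> Iterator[Dict]:
    """Yield parsed log records whose conversation id starts with id_prefix."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as tracebacks or banners
        if not isinstance(record, dict):
            continue
        # "conversation_id" is an assumed field name, not taken from the source.
        if str(record.get("conversation_id", "")).startswith(id_prefix):
            yield record


# Made-up sample lines; only the event name is borrowed from the trace on this page.
sample_log = [
    '{"event": "voice_classify_first_capability_hit", "conversation_id": "aaaa1111-demo"}',
    '{"event": "unrelated_event", "conversation_id": "bbbb2222-demo"}',
    "plain text line that is not JSON",
]
matches = list(events_for_conversation(sample_log, "aaaa1111"))
print([m["event"] for m in matches])
```

Matching on a prefix mirrors how the conversation is referenced on this page, where only the first characters of the id are shown.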

A voice turn at a glance

Every turn flows through this pipeline. The fork in the middle is the part the text channel doesn't have: a turn either matches a registry capability and is answered deterministically (no LLM), or falls through to the agentic GPT-4.1 tool loop that calls search_hospital_kb → RAG.
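
The fork can be pictured as a lookup in a registry keyed by intent, with the agentic loop as the fallback. The sketch below is illustrative only: CAPABILITY_REGISTRY, answer_schedule, agentic_tool_loop, and route_turn are hypothetical names, and the handlers are stubs rather than the pilot's real capability or tool-loop code.

```python
from typing import Callable, Dict, Tuple

CapabilityHandler = Callable[[str], str]


def answer_schedule(query: str) -> str:
    # Stub for a deterministic capability: no LLM, no retrieval.
    return f"[deterministic schedule answer for: {query}]"


def agentic_tool_loop(query: str) -> str:
    # Stub for the LLM tool loop that would call search_hospital_kb.
    return f"[agentic answer for: {query}]"


# Hypothetical registry: intent name -> deterministic handler.
CAPABILITY_REGISTRY: Dict[str, CapabilityHandler] = {
    "doctor_schedule_query": answer_schedule,
}


def route_turn(intent: str, rewritten_query: str) -> Tuple[str, str]:
    """Return (path, answer): 'capability' on a registry hit, else 'agentic'."""
    handler = CAPABILITY_REGISTRY.get(intent)
    if handler is not None:
        return "capability", handler(rewritten_query)
    return "agentic", agentic_tool_loop(rewritten_query)


print(route_turn("doctor_schedule_query", "Wanneer werkt hij?")[0])
print(route_turn("doctor_lookup", "Welke artsen werken er?")[0])
```

Classifying first and consulting the registry second is what lets a capability hit skip the LLM entirely.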

Stage legend

Each stage links to the page that explains it in depth, plus the glossary terms it uses.

| Stage | What happens | Concept page | Glossary |
| --- | --- | --- | --- |
| STT | Deepgram Nova-3 streams the caller's audio to a transcript; doctor names bias the recognizer. | Architecture | |
| Language lock | The first confirmed utterance pins the call's language; later turns don't re-detect (locked-out languages produce silent transcripts). | Language Locking | |
| Orchestrator entry | The transcript reaches VoiceLLMOrchestrator over the WS with channel="voice". classify_terminal short-circuits greetings/farewells/safety with zero LLM cost. | Architecture, Triple Defense | intent classifier |
| Capability fork | The shared classifier runs first; if the intent maps to a registry capability (doctor-schedule, billing, address) it is answered deterministically. Otherwise the agentic GPT-4.1 tool loop runs. | Architecture | intent classifier |
| Retrieve (agentic path) | The search_hospital_kb tool calls the same RAGService as chat, with channel="voice": hybrid retrieval + the department doctor-list injection. | A Query, End-to-End, What is RAG | RAG, hybrid |
| Rerank | The Value Framework reranks chunks by intent × content-category affinity before context assembly. | Value Framework | Value Framework, reranking |
| Safety + shape | A regex post-filter detects medical content; VoiceAnswerShaper strips markdown / URLs / citation markers, formats phone numbers for prosody, caps the answer at two sentences, and prepends the disclaimer once per call. | Triple Defense, Answer Shaping, Safety | grounding, medical-advice refusal |
| Citations | Voice answers carry no inline [N] markers (TTS would read them aloud); citations are derived from the retrieved chunks instead. | Citation Pipeline | |
| TTS | ElevenLabs Multilingual v2 streams the answer back in the locked voice; punctuation drives prosody (commas chunk a phone number). The filler ladder masks backend latency. | Prosody Injection, Adaptive TTS Speed, Context-Aware Filler | |

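
The "Safety + shape" stage in the legend can be approximated in a few lines. This is a hedged sketch, not the real VoiceAnswerShaper: the regexes, the CallState class, the abbreviation handling, and the choice to place the disclaimer outside the two-sentence cap are all assumptions. The disclaimer wording is borrowed from the turn 1 answer on this page.

```python
import re
from dataclasses import dataclass

MAX_SENTENCES = 2
# Wording borrowed from the turn 1 answer; the real constant is not shown in the source.
DISCLAIMER = "Daar kan ik geen specifiek medisch advies over geven."

# Split after ., ! or ? but not after the titles "Dr." and "Prof.".
SENTENCE_SPLIT = re.compile(r"(?<!Dr\.)(?<!Prof\.)(?<=[.!?])\s+")


@dataclass
class CallState:
    disclaimer_spoken: bool = False


def strip_for_speech(text: str) -> str:
    """Remove markup that a TTS engine would read aloud."""
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)  # markdown link -> link text
    text = re.sub(r"\[\d+\]", "", text)                    # inline citation markers like [1]
    text = re.sub(r"https?://\S+", "", text)               # bare URLs
    text = re.sub(r"[*_`#]+", "", text)                    # emphasis, code and heading marks
    text = re.sub(r"\s+", " ", text).strip()               # collapse leftover whitespace
    return re.sub(r"\s+([.,;:!?])", r"\1", text)           # no space before punctuation


def shape(answer: str, state: CallState, medical: bool) -> str:
    """Cap the answer at two sentences and add the disclaimer once per call."""
    sentences = SENTENCE_SPLIT.split(strip_for_speech(answer))
    spoken = " ".join(sentences[:MAX_SENTENCES])
    if medical and not state.disclaimer_spoken:
        state.disclaimer_spoken = True
        spoken = f"{DISCLAIMER} {spoken}"
    return spoken


state = CallState()
first = shape("Zie **Cardiologie** [1]. Bel ons. Meer op https://example.org nu.", state, medical=True)
second = shape("Nog een vraag?", state, medical=True)
print(first)
print(second)
```

Keeping the disclaimer flag on a per-call state object is one simple way to get "once per call" behaviour without the shaper needing to inspect earlier answers.
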
What the real trace reveals

Driving four turns through the live pilot surfaced three things the architecture diagram alone does not:

  1. One conversation, two cognition paths. Turns 1–3 (ambiguous_symptom_description, doctor_lookup) have no registry capability, so they fall through to the agentic GPT-4.1 → search_hospital_kb → RAG path. Turn 4 (doctor_schedule_query) matches a capability and short-circuits the LLM entirely (voice_classify_first_capability_hit) — no retrieval at all.
  2. Follow-up resolution carries the thread. "professor Dens" becomes the full name on turn 3, and the bare pronoun "hij" resolves to "Prof. Dr. Jo Dens" on turn 4 — each done by the intent classifier reading 2–3 prior turns of history.
  3. The voice shape diverges from chat on identical retrieval. The same doctor_lookup that the chat channel renders as a full 23-name roster, the voice channel compresses to two names plus "andere cardiologen" to honor the two-sentence TTS cap.

The call, turn by turn

A caller dials in with a symptom and walks all the way to a specific doctor's consultation hours. All values are from the live pilot trace (conversation 24f9d157).

Turn 1 — symptom → department (agentic RAG path)

Star stage: the intent rewrite turns the lay symptom into a clinical term, and the safety layer prepends a spoken disclaimer before routing the caller to a department.

| Stage | Real trace | Reference |
| --- | --- | --- |
| Raw | "Ik heb last van hartkloppingen, bij welke dienst moet ik zijn?" | |
| Intent | ambiguous_symptom_description (0.90), is_followup=False; rewrite "Welke afdelingen bij zol behandelen Palpitaties?"; hartkloppingen → Palpitaties (lay → clinical); latency=3643ms | intent classifier |
| Path | no registry capability for this intent → agentic search_hospital_kb → RAG ([RAG] Intent: … [pre-classified, LLM skipped]) | Architecture |
| Retrieve | specialty_resolved_via_taxonomy key='hartkloppingen' specialty='cardiologie'; top_k=5; Value Framework primary=clinical_info, mismatch 0.000 | hybrid |
| Safety + shape | medical content detected → disclaimer prepended; phone formatted for prosody | Answer Shaping |
| Spoken | "Daar kan ik geen specifiek medisch advies over geven; voor hartkloppingen kunt u terecht bij de afdeling Cardiologie van ZOL via het nummer nul-acht-negen, tweeëndertig, eenenvijftig, eenenvijftig." (089 32 51 51) | medical-advice refusal |
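
The phone formatting in the spoken answer above (area code digit by digit, then two-digit groups as Dutch numerals, separated by commas) can be reproduced with a small helper. This is a sketch of a rule inferred from that one answer, not the pilot's formatter; the function names are hypothetical. It does not cover the digit-by-digit style heard in turn 4.

```python
ONES = ["nul", "een", "twee", "drie", "vier", "vijf", "zes", "zeven", "acht", "negen"]
TEENS = ["tien", "elf", "twaalf", "dertien", "veertien",
         "vijftien", "zestien", "zeventien", "achttien", "negentien"]
TENS = {2: "twintig", 3: "dertig", 4: "veertig", 5: "vijftig",
        6: "zestig", 7: "zeventig", 8: "tachtig", 9: "negentig"}


def two_digits_nl(pair: str) -> str:
    """Spell a two-digit group as a Dutch numeral."""
    tens, ones = int(pair[0]), int(pair[1])
    if tens == 0:
        return f"nul {ONES[ones]}"  # leading zero: read the digits separately
    if tens == 1:
        return TEENS[ones]
    if ones == 0:
        return TENS[tens]
    # Dutch says the unit first; after "twee" and "drie" the joiner takes a diaeresis.
    joiner = "ën" if ONES[ones].endswith("e") else "en"
    return f"{ONES[ones]}{joiner}{TENS[tens]}"


def speak_phone(number: str) -> str:
    """Turn a space-grouped number into comma-separated spoken groups."""
    spoken = []
    for group in number.split():
        if len(group) == 2 and group.isdigit():
            spoken.append(two_digits_nl(group))
        else:
            # Area codes and odd-length groups are read digit by digit.
            spoken.append("-".join(ONES[int(d)] for d in group if d.isdigit()))
    return ", ".join(spoken)


print(speak_phone("089 32 51 51"))
```

The commas between groups are the part that matters for TTS, since punctuation is what drives the pauses.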

Turn 2 — doctor list (agentic RAG path, doctor-list injection)

Star stage: the department doctor-list injection feeds the full roster into context, but the voice answer-shaper compresses it to fit the two-sentence cap.

| Stage | Real trace | Reference |
| --- | --- | --- |
| Raw | "Welke artsen werken er op de dienst cardiologie?" | |
| Intent | doctor_lookup (0.95), is_followup=True, history_turns=1; rewrite "Welke artsen werken bij de afdeling Cardiologie…" | intent classifier |
| Path | doctor_lookup has no registry capability → agentic search_hospital_kb → RAG | Architecture |
| Retrieve | _inject_department_doctor_list dept='Cardiologie' found 25 doctor(s); taxonomy_enrichment_applied=True doctors=23; Value Framework primary=general | Stage 5c injection |
| Shape | two-sentence cap → two named doctors + "andere cardiologen" (the chat channel lists all 23) | Answer Shaping |
| Spoken | "Op de afdeling Cardiologie werken onder andere Dr. Koen Ameloot en Prof. Dr. Jo Dens, samen met andere cardiologen. Wenst u informatie over een specifieke arts of wilt u een afspraak maken?" | grounding |
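
The compression in the Shape row can be sketched as "name the first few, summarize the rest". The page does not say how the two spoken names are chosen, so taking the first two entries is an assumption. summarize_roster and join_nl are hypothetical helpers, and the roster in the demo is padded with placeholder names.

```python
from typing import List


def join_nl(items: List[str]) -> str:
    """Join items the Dutch way: 'A, B en C'."""
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " en " + items[-1]


def summarize_roster(doctors: List[str], colleagues_noun: str, max_named: int = 2) -> str:
    """Name at most max_named doctors and fold the rest into one phrase."""
    named = join_nl(doctors[:max_named])
    if len(doctors) > max_named:
        return f"{named}, samen met andere {colleagues_noun}"
    return named


# Two names from the trace plus placeholder entries to reach a 23-name roster.
roster = ["Dr. Koen Ameloot", "Prof. Dr. Jo Dens"] + [f"Arts {i}" for i in range(3, 24)]
print(summarize_roster(roster, "cardiologen"))
```

A chat renderer could pass the same list through join_nl unchanged, which is the sense in which the two channels diverge only at the shaping step.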

Turn 3 — doctor profile (agentic RAG path, follow-up resolution)

Star stage: follow-up resolution expands the bare reference "professor Dens" into the canonical full name using two turns of history.

| Stage | Real trace | Reference |
| --- | --- | --- |
| Raw | "Vertel me meer over professor Dens." | |
| Intent | doctor_lookup (0.95), is_followup=True, history_turns=2; rewrite "Wie is Prof. Dr. Jo Dens en op welke afdeling werkt hij bij zol"; "professor Dens" resolved to the full name | intent classifier |
| Path | agentic search_hospital_kb → RAG; taxonomy_enrichment doctors=24; Value Framework primary=appointments | Architecture |
| Spoken | "Prof. Dr. Jo Dens is cardioloog op de afdeling Cardiologie van Ziekenhuis Oost-Limburg. Hij houdt raadplegingen op campus Sint-Jan en in het Medisch Centrum André Dumont. Wenst u een afspraak of meer details?" | grounding |

Turn 4 — schedule (deterministic capability short-circuit)

Star stage: the bare pronoun "hij" resolves to the doctor, the intent matches a registry capability, and the structured consultation_schedule lookup answers without any RAG retrieval.

| Stage | Real trace | Reference |
| --- | --- | --- |
| Raw | "Wanneer werkt hij?" | |
| Intent | doctor_schedule_query (0.95), is_followup=True, history_turns=3; rewrite "Wanneer heeft Prof. Dr. Jo Dens raadplegingen bij de afdeling…"; pronoun "hij" → "Prof. Dr. Jo Dens" | intent classifier |
| Path | intent matches a registry capability → voice_classify_first_capability_hit; deterministic DoctorScheduleCapability; the agentic orchestrator and RAG are skipped | Architecture |
| Lookup | capability_doctor_schedule {doctor: 'Prof. Dr. Jo Dens', day: None, cited: True}; structured consultation_schedule read from the doctor's profile; tenant fallback phone (per-doctor phone empty) | Citation Pipeline |
| Spoken | "Prof. Dr. Jo Dens houdt raadpleging op woensdagvoormiddag (wekelijks), donderdagnamiddag (wekelijks), vrijdagvoormiddag (wekelijks) en vrijdagnamiddag (wekelijks). … U kan een afspraak maken via 089/acht nul, … acht nul, … acht nul." (089 80 80 80) | grounding |
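
Because the schedule is structured data, the first spoken sentence can be produced by plain string templating, with no model in the loop. The sketch below assumes a schedule shaped as (slot, recurrence) pairs; that shape and the helper names are hypothetical, since the real consultation_schedule schema is not shown here. The phone fallback mirrors the "per-doctor phone empty" note in the Lookup row.

```python
from typing import List, Optional, Tuple


def join_nl(items: List[str]) -> str:
    """Join items the Dutch way: 'A, B en C'."""
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " en " + items[-1]


def render_schedule(doctor: str, slots: List[Tuple[str, str]]) -> str:
    """Render a spoken sentence from (slot, recurrence) pairs."""
    parts = [f"{slot} ({recurrence})" for slot, recurrence in slots]
    return f"{doctor} houdt raadpleging op {join_nl(parts)}."


def pick_phone(doctor_phone: Optional[str], tenant_phone: str) -> str:
    """Fall back to the tenant-level number when the doctor has none."""
    return doctor_phone or tenant_phone


# Slots copied from the spoken answer in the trace above.
slots = [
    ("woensdagvoormiddag", "wekelijks"),
    ("donderdagnamiddag", "wekelijks"),
    ("vrijdagvoormiddag", "wekelijks"),
    ("vrijdagnamiddag", "wekelijks"),
]
sentence = render_schedule("Prof. Dr. Jo Dens", slots)
print(sentence)
print(pick_phone("", "089 80 80 80"))
```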

What's voice-specific

The same RAGService powers chat and voice, but channel="voice" flips a set of behaviours visible in the trace above:

  • Language is locked at the first confirmed utterance — every turn above stayed nl (Language Locking).
  • No inline [N] markers — TTS would read "open bracket one"; citations are derived from chunks instead.
  • Two-sentence cap — turn 2 lists two cardiologists + "andere cardiologen" rather than the chat channel's full roster (Answer Shaping).
  • Disclaimer once — turn 1's medical content triggers the spoken disclaimer; later turns don't repeat it.
  • Phone prosody: 089 32 51 51 and 089 80 80 80 are spoken with comma boundaries so ElevenLabs chunks the digits (Prosody Injection).
  • Filler ladder — while the backend thinks, tiers at 1.5 s / 4 s / 10 s mask the silence and cancel the moment the first sentence streams back (Context-Aware Filler).
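
The filler ladder in the last bullet maps naturally onto a cancellable background task. This is a minimal asyncio sketch under stated assumptions: only the 1.5 s / 4 s / 10 s delays come from this page. The filler phrases, the function names, and the time_scale knob (used here so the demo finishes quickly) are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple

# Delays are from the text above; the phrases are placeholders.
FILLER_TIERS: Sequence[Tuple[float, str]] = (
    (1.5, "tier-1 filler"),
    (4.0, "tier-2 filler"),
    (10.0, "tier-3 filler"),
)


async def run_filler_ladder(speak: Callable[[str], None],
                            tiers: Sequence[Tuple[float, str]],
                            time_scale: float) -> None:
    """Speak each tier's filler once its delay since turn start has elapsed."""
    elapsed = 0.0
    for delay, phrase in tiers:
        await asyncio.sleep((delay - elapsed) * time_scale)
        elapsed = delay
        speak(phrase)


async def answer_with_fillers(backend: Callable[[], Awaitable[str]],
                              speak: Callable[[str], None],
                              tiers: Sequence[Tuple[float, str]] = FILLER_TIERS,
                              time_scale: float = 1.0) -> str:
    """Run the backend while the ladder fills silence, then cancel the ladder."""
    ladder = asyncio.create_task(run_filler_ladder(speak, tiers, time_scale))
    try:
        first_sentence = await backend()
    finally:
        ladder.cancel()  # stop pending fillers as soon as the answer is ready
        await asyncio.gather(ladder, return_exceptions=True)
    speak(first_sentence)
    return first_sentence


async def slow_backend() -> str:
    # Simulates a backend that needs 2.5 s, scaled down for the demo.
    await asyncio.sleep(2.5 * 0.05)
    return "first sentence of the answer"


spoken: List[str] = []
asyncio.run(answer_with_fillers(slow_backend, spoken.append, time_scale=0.05))
print(spoken)
```

With a simulated 2.5 s backend, only the first tier has time to fire before the answer arrives and the remaining tiers are cancelled.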

Where to go next