ADR-0052: Voice channel language locks at first utterance — no mid-call switching
Master record: docs/ADR/0052-voice-language-locked-at-first-utterance.md. The master is canonical; this Docusaurus rendering is for in-site navigation.
Date: 2026-05-07
Status: Accepted
Deciders: Tsunami-max (operator), Claude
Supersedes: the mid-call language-switch sub-decisions in ADR-0051
Relates to: ADR-0049 (voice pipeline), ADR-0050 (master record) (Twilio/LiveKit SIP)
Context
ADR-0051 made the agentic VoiceLLMOrchestrator the only voice path and added a switch_language LLM tool plus a regex fast-path to support mid-call language switching. Two pilot calls in the following days surfaced a structural problem with the design:
- 2026-05-07, conv 5c81a578: caller said "Do you speak English?" in English. Deepgram-NL phonetically hallucinated "Duursteking licht spelen". The fast-path eventually caught it, but the LLM loop had already burned 47 s.
- 2026-05-07, conv fb4b4bae: caller said "Do you speak English?" mid-call, after 8 successful Dutch turns. Deepgram-NL produced NO transcript at all for the English speech. The fast-path could not fire because there was nothing for the regex to match. The caller waited 30 s and gave up.
The second case is the structural one: once a voice call has locked to a single language at the STT layer, Deepgram in single-language mode silently emits zero transcripts on speech in other languages. There is no signal, neither gibberish nor a partial transcript, for any mid-call detector (regex, lexical, or LLM) to act on. Multi-language mode would solve this, but it badly degrades Flemish accuracy ("Wat zijn de bezoekuren" → "Hå at zen de bezukjuren" in the prior team's measurements). That trade-off would penalise the 95 % of callers on the happy path to serve a < 1 % mid-call-switch case.
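To make the failure mode concrete, here is a minimal sketch of a transcript-based fast-path. The pattern table and the function name `detect_switch_request` are hypothetical stand-ins, not the removed `_LANG_REQUEST_PATTERNS`; the sketch only shows that a detector keyed on transcript text has nothing to evaluate when STT emits nothing.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; the real patterns are not reproduced here.
_EXAMPLE_PATTERNS = {
    "en": re.compile(r"\b(do you speak|in) english\b", re.IGNORECASE),
    "nl": re.compile(r"\bin het nederlands\b", re.IGNORECASE),
}


def detect_switch_request(transcript: str) -> Optional[str]:
    """Return the requested language code, or None if no pattern matches."""
    for lang, pattern in _EXAMPLE_PATTERNS.items():
        if pattern.search(transcript):
            return lang
    return None


# A clean transcript is detectable.
print(detect_switch_request("Do you speak English?"))
# An empty transcript gives the detector nothing to match.
print(detect_switch_request(""))
```

In the second pilot call the situation is arguably worse than the empty-string case: if no transcript event arrives at all, a detector like this is never invoked in the first place.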
Decision
The voice channel's language is locked at the first utterance for the duration of the call. Mid-call switching is unsupported.
The design is hospital-agnostic: tenants configure their supported language set and default language via the existing get_taxonomy(slug) infrastructure, but the lock-and-stay policy applies regardless of which languages a tenant supports — whether that's a single-language tenant (nl-only) or a multi-language tenant (nl/fr/en/it).
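As a rough illustration of how the tenant's language set could feed the lock, the sketch below resolves a detected language against a tenant configuration. The `TenantLanguages` shape and `resolve_call_language` are hypothetical; the actual return type of get_taxonomy(slug) is not shown in this ADR, and the fallback to the tenant default for an undetected or unsupported language is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TenantLanguages:
    """Hypothetical shape of the language settings a tenant might expose."""
    supported: tuple[str, ...]
    default: str


def resolve_call_language(detected: Optional[str], tenant: TenantLanguages) -> str:
    """Pick the language to lock for the call.

    Assumption: an undetected or unsupported language falls back to the tenant default.
    """
    if detected in tenant.supported:
        return detected
    return tenant.default


nl_only = TenantLanguages(supported=("nl",), default="nl")
multi = TenantLanguages(supported=("nl", "fr", "en", "it"), default="nl")

print(resolve_call_language("en", nl_only))
print(resolve_call_language("en", multi))
```

The same function serves both tenant types, which is the sense in which the policy is independent of the configured language set.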
The first-utterance probe in voice_agent continues to detect the caller's actual spoken language at call start (multi-mode STT for the first turn, then locks). After that, the call stays in that language even if the caller asks to switch. The agent's response when asked to switch mid-call: politely offer transfer_to_helpdesk.
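The lock-and-stay behaviour can be sketched as a set-once holder plus a small decision function. `CallLanguageLock` and `handle_switch_request` are hypothetical names, not the voice_agent implementation; only the transfer_to_helpdesk outcome comes from this ADR.

```python
from typing import Optional


class CallLanguageLock:
    """Holds a call's language; settable once, read-only afterwards."""

    def __init__(self) -> None:
        self._language: Optional[str] = None

    @property
    def language(self) -> Optional[str]:
        return self._language

    def lock(self, language: str) -> None:
        # A second attempt to set the language is a programming error.
        if self._language is not None:
            raise RuntimeError("language already locked for this call")
        self._language = language


def handle_switch_request(lock: CallLanguageLock, requested: str) -> str:
    """Return the action to take when a language request is detected."""
    if lock.language is None:
        # Probe time: the first utterance decides the call language.
        lock.lock(requested)
        return "locked"
    if requested == lock.language:
        return "continue"
    # Mid-call request for another language: no switch, offer a transfer.
    return "transfer_to_helpdesk"


call = CallLanguageLock()
print(handle_switch_request(call, "nl"))
print(handle_switch_request(call, "en"))
print(call.language)
```

Raising on a second `lock()` call is a design choice in this sketch: it turns any leftover mid-call switching code into a loud failure instead of a silent state change.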
Consequences
Removed
- switch_language tool from the agentic orchestrator's _TOOLS list and the corresponding handler in _execute_tool
- Mid-call regex fast-path block in query_stream (_detect_language_request + _LANG_REQUEST_PATTERNS, the backend mirror of voice_agent's probe regex)
- In-loop switch_language short-circuit handling
- System-prompt tool description for switch_language
- Two integration tests for the tool's success + invalid-code paths
- The TestLanguageFastPath unit-test class for the orchestrator-level fast-path
- test_voice_llm_language_fast_path.py (the helper-level tests for the now-deleted backend mirror)
Net: ~80 LOC removed.
Preserved
- voice_agent's first-utterance probe (still the primary language-detection mechanism)
- voice_agent.language_detection.detect_language_request: the client-side regex still feeds the probe at call start
- voice_agent.agent._switch_language(): used internally by the probe to reconfigure STT + TTS once at call start
- The conversational_intent enum value switch_language in the response schema (preserved for backwards compat; no orchestrator path emits it currently, but voice_agent's probe-time event uses it)
- The system-prompt instruction to politely transfer if the caller asks to switch language mid-call
Positive
- Single source of truth for language state: voice_agent owns it, sets it once, never mutates it. The backend orchestrator is language-aware (via QueryRequest.detected_language) but never changes it.
- Eliminates the silent-failure class caused by Deepgram's single-mode-no-transcript behaviour on cross-language speech.
- Simpler agentic surface — 3 tools instead of 4, fewer short-circuit paths in the loop, fewer integration tests to maintain.
- Hospital-agnostic by construction — no language-set hardcoded in the policy. Each tenant configures via taxonomy.
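One way to express the "language-aware but never changes it" property on the backend side is an immutable request object. The sketch below uses a hypothetical stand-in, `QueryRequestSketch`; only the detected_language field name comes from this ADR, and the real QueryRequest may be defined differently.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class QueryRequestSketch:
    """Hypothetical stand-in for QueryRequest; only detected_language is from the ADR."""
    text: str
    detected_language: str


def build_system_hint(request: QueryRequestSketch) -> str:
    """Read the language to shape the reply; never write it back."""
    return f"Respond in language: {request.detected_language}"


request = QueryRequestSketch(text="Wat zijn de bezoekuren", detected_language="nl")
print(build_system_hint(request))

try:
    request.detected_language = "en"  # any attempt to mutate fails
except FrozenInstanceError:
    print("detected_language is read-only")
```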
Negative / Trade-offs
- Caller who starts in EN and needs a Dutch medical term mid-call cannot have the agent switch language. Mitigation: the agent transfers to helpdesk on explicit ask. Frequency: < 1 % of calls per pilot data.
- If a tenant wants real multi-language switching mid-call in the future (e.g., a medical-tourism hospital where bilingual conversations are common), this ADR will need to be revisited. At that point we'd reconsider Deepgram multi-mode plus a per-tenant flag, or evaluate a different STT vendor.
Implementation
Single commit (this batch). Files touched:
| File | Change |
|---|---|
| backend/app/services/voice/voice_llm_orchestrator.py | Tool removed; helper + patterns removed; in-loop short-circuit simplified |
| backend/app/prompts.py | System-prompt tool list updated |
| backend/tests/unit/services/voice/test_voice_llm_orchestrator.py | TestLanguageFastPath class removed |
| backend/tests/unit/services/voice/test_voice_llm_language_fast_path.py | Deleted |
| backend/tests/integration/services/voice/test_voice_llm_orchestrator_integration.py | Two switch_language tool tests removed |
| docs/ADR/0052-voice-language-locked-at-first-utterance.md | Master record (this Docusaurus page is its mirror) |
Follow-ups
- The remaining script-2 (English) test deck still exercises a language-switch turn ("Kan je in het Nederlands verder gaan?"). That turn's expectation needs to change from "agent switches" to "agent politely declines and offers transfer". It will be updated when the next test deck is generated.
- Voice eval harness (next batch) — with mid-call switching out of scope, the harness can assume single-language conversations, which simplifies its scoring rubric.
References
- ADR-0049: voice pipeline lineage.
- ADR-0051: the immediate predecessor; this ADR removes the language-switch tool and patterns it added.
- Deepgram Nova-3 announcement — STT model spec, language coverage, and the single-language-mode silence behaviour that motivates the lock-and-stay policy.
- LiveKit Agents documentation — runtime where the first-utterance probe and the orchestrator co-exist.