Voice Language Locking
ADR: ADR-0052 · Date: 2026-05-07 · Status: Accepted
Deepgram Nova-3 is the production STT model; this page documents how the voice channel adapts to its single-language vs multi-language operating modes.
The problem
ADR-0051 added a switch_language tool to the agentic orchestrator and a regex fast-path for mid-call language detection. Two pilot calls within days proved the design structurally broken:
- Conv 5c81a578: The caller said "Do you speak English?" in English. Deepgram, configured for Dutch (nl), phonetically hallucinated "Duursteking licht spelen" because the Dutch model has no English phoneme coverage. The fast-path regex eventually matched, but the LLM loop had already run for 47 seconds before the switch fired.
- Conv fb4b4bae: The caller said "Do you speak English?" mid-call, after 8 successful Dutch turns. Deepgram in single-language mode produced no transcript at all for the English speech: not gibberish, not a partial transcription, nothing. The fast-path could not fire because there was no text for the regex to match. The caller waited 30 seconds in silence and hung up.
The second case is structural: once Deepgram is configured for a single language at the STT layer, it can emit no transcript at all for speech in another language. When that happens there is no signal, neither garbage text nor an error event, that any downstream detector (regex, lexical, or LLM) can act on.
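To make the failure mode concrete, here is a minimal sketch of a transcript-driven detector. The pattern and the function name `detect_switch_request` are hypothetical stand-ins, not the project's actual fast-path regex; the only point is that any text-based detector returns nothing when STT emits nothing.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration only; not the project's real regex.
_ENGLISH_REQUEST = re.compile(r"\bspeak\s+english\b", re.IGNORECASE)


def detect_switch_request(transcript: str) -> Optional[str]:
    """Return a target language code if the transcript asks for a switch."""
    if _ENGLISH_REQUEST.search(transcript):
        return "en"
    return None


# A clean transcript is detected.
clean = detect_switch_request("Do you speak English?")
# The hallucinated text from conv 5c81a578 does not match this pattern.
garbled = detect_switch_request("Duursteking licht spelen")
# conv fb4b4bae: no transcript at all, so there is nothing to match.
silent = detect_switch_request("")

print(clean, garbled, silent)
```

Swapping the regex for a lexical or LLM-based detector would not change the third case, since all of them consume the same empty transcript.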
Multi-language Deepgram mode exists and would solve the silence problem, but it badly degrades Flemish accuracy. Prior team measurements showed Dutch queries going from "Wat zijn de bezoekuren" to "Hå at zen de bezukjuren" under multi-language mode. Enabling it for the whole call would degrade recognition for every Dutch caller (95%+ of calls) in order to handle a mid-call-switch case that occurs in under 1% of calls.
Deepgram operating-point comparison
The team evaluated three Deepgram operating points before the decision. The trade-off table below records what was observed. The entries are qualitative; cells marked * rest on prior team measurements that would benefit from a structured re-measurement pass.
| Operating point | Flemish (nl-BE) accuracy | Cross-language coverage | Mid-call switching |
|---|---|---|---|
| Single-language (nl) | High*: "Wat zijn de bezoekuren" recognised cleanly | Zero: silent on English / French / Italian speech (no transcript emitted at all) | Impossible: no signal for any detector to act on |
| Multi-language (multi) | Degraded: "Wat zijn de bezoekuren" → "Hå at zen de bezukjuren" (prior team measurement) | Yes: English / French / Italian transcribed | Possible: both languages parsed |
| First-utterance probe + lock | High*: same as single-language for 99%+ of the call | Detected once at call start; locked thereafter | Not supported (deliberate trade-off) |
The starred entries above merit a fresh measurement pass against the current Deepgram Nova-3 (@deepgram_nova3) build; the qualitative ranking has held in pilot calls.
The decision (ADR-0052)
The voice channel's language is locked at the first utterance for the duration of the call. Mid-call language switching is not supported.
The voice_agent worker uses multi-language STT for the very first utterance only, detects the caller's actual spoken language, then locks Deepgram to that language for the remainder of the call. If the caller asks to switch language mid-call, the agent's response is a polite transfer offer:
"Ik kan u helaas niet omschakelen naar een andere taal. Ik kan u wel doorverbinden met een medewerker die u kan helpen. Wilt u dat ik dat doe?"
(Translated: "Unfortunately I cannot switch to another language. I can transfer you to a colleague who can help. Would you like me to do that?")
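The lock-and-decline policy can be sketched as a small state holder. This is an illustration under stated assumptions, not the voice_agent code: `CallLanguageLock` and `on_switch_request` are hypothetical names, and only the Dutch wording quoted above is used because the wording for other locked languages is not given here.

```python
from dataclasses import dataclass
from typing import Optional

# Dutch wording quoted from the decision text; other languages are not specified.
TRANSFER_OFFER_NL = (
    "Ik kan u helaas niet omschakelen naar een andere taal. "
    "Ik kan u wel doorverbinden met een medewerker die u kan helpen. "
    "Wilt u dat ik dat doe?"
)


@dataclass
class CallLanguageLock:
    """Hypothetical per-call language holder: set once, then read-only."""

    language: Optional[str] = None

    def lock(self, detected: str) -> None:
        # The first utterance decides the language for the whole call.
        if self.language is not None:
            raise RuntimeError("language already locked for this call")
        self.language = detected

    def on_switch_request(self, requested: str) -> str:
        # A mid-call request never changes the lock; the caller is offered
        # a transfer instead. Assumes a Dutch-locked call for the wording.
        return TRANSFER_OFFER_NL


lock = CallLanguageLock()
lock.lock("nl")                       # result of the first-utterance probe
reply = lock.on_switch_request("en")  # caller asks for English mid-call
print(lock.language)                  # still the locked language
```

Raising on a second `lock()` call is one way to make an accidental re-lock loud during development; silently ignoring it would be an equally valid choice.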
Implementation
The lock is owned entirely by voice_agent — the backend never changes language state, it only reads it:
voice_agent._current_language = None # before first utterance
voice_agent.probe_first_utterance() # multi-language STT
→ detect spoken language
→ voice_agent._switch_language(detected) # reconfigure STT + TTS once
→ voice_agent._current_language = "nl" # locked for call duration
# Every subsequent turn:
voice_agent → backend: QueryRequest{detected_language: "nl", ...}
backend VoiceLLMOrchestrator: reads detected_language, never modifies it
QueryRequest.detected_language carries the locked language on every backend call. The orchestrator is language-aware (uses it to select the right tenant FAQ overlay and voice system prompt variant) but has no mechanism to change it.
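The read-only relationship between the backend and the locked language can be illustrated with an immutable request object. This is a simplified sketch: the real QueryRequest has more fields, and `PROMPT_VARIANTS` and `select_prompt_variant` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryRequest:
    """Simplified stand-in for the real request; frozen so it cannot be mutated."""

    query: str
    detected_language: str


# Hypothetical prompt variants keyed by language code.
PROMPT_VARIANTS = {
    "nl": "voice_system_prompt_nl",
    "en": "voice_system_prompt_en",
}


def select_prompt_variant(request: QueryRequest) -> str:
    """Use the locked language as given; never infer it from the query text."""
    return PROMPT_VARIANTS[request.detected_language]


# The query text is English, but the call is locked to Dutch,
# so the Dutch variant is selected.
request = QueryRequest(query="Do you speak English?", detected_language="nl")
variant = select_prompt_variant(request)
print(variant)
```

Freezing the dataclass is only one way to express "reads, never modifies"; the property that matters is that no code path on the backend side assigns to the field.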
What was removed
The following were deleted in the ADR-0052 commit:
| Artifact | Why removed |
|---|---|
| switch_language tool from VoiceLLMOrchestrator._TOOLS | Language switching is no longer an orchestrator responsibility |
| Mid-call regex fast-path (_detect_language_request, _LANG_REQUEST_PATTERNS) | Covered a case that Deepgram's silence makes undetectable |
| In-loop switch_language short-circuit handling | No tool to dispatch to |
| switch_language system-prompt tool description | Removed from tool list |
| TestLanguageFastPath unit-test class | Tests the deleted path |
| test_voice_llm_language_fast_path.py | Deleted entirely |
| 2 integration tests for switch_language tool success + invalid-code paths | Deleted |
Net: ~80 LOC removed, 4 tools → 3 tools (search_hospital_kb, transfer_to_helpdesk, end_call remain; switch_language is gone).
What was preserved
| Artifact | Why kept |
|---|---|
| voice_agent.language_detection.detect_language_request | Still used for the first-utterance probe |
| voice_agent.agent._switch_language() | Called once at call start by the probe |
| conversational_intent enum value switch_language | Backwards compatibility; no orchestrator path emits it currently |
| System-prompt instruction to politely decline and offer transfer on mid-call switch requests | The agent still needs to handle the request gracefully |
Hospital-agnostic parameterisation
The lock-and-stay policy is universal — it applies to every tenant regardless of their supported language set. A single-language tenant (ZOL: Dutch) and a multi-language tenant (e.g., a medical-tourism hospital: nl/fr/en/it) both use the lock. The difference is what language the first-utterance probe detects and locks to.
Tenant language configuration is read via get_taxonomy(slug), the same DB-driven path as all other tenant data. No language is hardcoded in source.
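One way to combine the probe result with a tenant's supported language set is sketched below. The return shape of get_taxonomy(slug) is not shown on this page, so the example takes a plain list of language codes; `resolve_lock_language` and its fallback rule are assumptions for illustration, not behaviour specified by the ADR.

```python
from typing import Sequence


def resolve_lock_language(detected: str, supported: Sequence[str]) -> str:
    """Pick the language to lock to, given the tenant's supported set.

    Assumption: when the detected language is unsupported, fall back to the
    tenant's first listed language. The ADR does not specify this rule.
    """
    if not supported:
        raise ValueError("tenant must support at least one language")
    return detected if detected in supported else supported[0]


# Single-language tenant: an English first utterance still locks to Dutch.
single = resolve_lock_language("en", ["nl"])
# Multi-language tenant: a French first utterance locks to French.
multi = resolve_lock_language("fr", ["nl", "fr", "en", "it"])
print(single, multi)
```

Because the supported set is passed in as data, the same function serves every tenant without a language code appearing in source.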
If a future tenant requires genuine mid-call language switching (bilingual conversations are common in their population), this ADR will need revisiting. The options at that point are: Deepgram multi-language mode with a per-tenant Flemish-accuracy trade-off accepted, or an alternative STT vendor that handles code-switching natively.
Contract test
backend/tests/integration/services/voice/test_voice_llm_orchestrator_integration.py — the contract test that pins the language plumbing across the voice_agent → backend handoff:
async def test_detected_language_from_voice_agent_is_respected_by_orchestrator(
    orchestrator, make_request
):
    """voice_agent locks language at first utterance and sends
    detected_language on every turn. The orchestrator must read and
    use it, never infer or override."""
    request = make_request(query="Parkeren bij ZOL", detected_language="nl")
    # Pull the first streamed response from the async generator.
    response = await orchestrator.query_stream(request).__anext__()
    assert response.detected_language == "nl"
    # No switch_language tool call in the response conversational_intent
    assert response.conversational_intent != "switch_language"
References
- ADR-0052: Voice Language Locked at First Utterance
- ADR-0051: Agentic VoiceLLMOrchestrator (the ADR this one supersedes on language switching)
- backend/app/services/voice/voice_llm_orchestrator.py: tools list (3 tools: search_hospital_kb, transfer_to_helpdesk, end_call)
- Deepgram Nova-3: production STT model; the single-language vs multi-language trade-off is the empirical input to this ADR
- Pilot call transcripts: conv 5c81a578 (47s gibberish loop), conv fb4b4bae (30s silent hang-up), both in backend/app/services/voice/log archive
- Radford et al. 2023