
Voice Language Locking

ADR: ADR-0052 · Date: 2026-05-07 · Status: Accepted

Deepgram Nova-3 is the production STT model; this page documents how the voice channel adapts to its single-language vs multi-language operating modes.

The problem

ADR-0051 added a switch_language tool to the agentic orchestrator and a regex fast-path for mid-call language detection. Two pilot calls within days proved the design structurally broken:

Conv 5c81a578 — Caller said "Do you speak English?" in English. Deepgram (configured for Dutch/NL) phonetically hallucinated "Duursteking licht spelen" — no English phoneme coverage. The fast-path regex eventually matched, but the LLM loop had already run for 47 seconds before the switch fired.

Conv fb4b4bae — Caller said "Do you speak English?" mid-call, after 8 successful Dutch turns. Deepgram in single-language mode produced no transcript at all for the English speech — not gibberish, not a partial transcription, nothing. The fast-path could not fire because there was no text for the regex to match. The caller waited 30 seconds in silence and hung up.

The second case is structural: once Deepgram is configured for a single language at the STT layer, it silently emits zero transcripts on speech in any other language. There is no signal — neither garbage text nor an error event — that any downstream detector (regex, lexical, or LLM) can act on.
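
The failure mode can be made concrete with a small sketch of a transcript-driven detector. The pattern and function name here are hypothetical stand-ins, not the project's deleted `_LANG_REQUEST_PATTERNS`; the only point is that any detector that consumes transcripts has nothing to act on when the STT layer emits none.

```python
import re
from typing import Iterable, Optional

# Hypothetical pattern for illustration; not the project's real regex list.
LANG_REQUEST = re.compile(r"\bspeak english\b", re.IGNORECASE)


def detect_switch_request(transcripts: Iterable[str]) -> Optional[str]:
    """Return "en" if any transcript contains an English-switch request."""
    for text in transcripts:
        if LANG_REQUEST.search(text):
            return "en"
    return None


# Speech transcribed correctly: the detector has text to match.
print(detect_switch_request(["Do you speak English?"]))     # en
# Phonetic hallucination (conv 5c81a578): text exists but does not match.
print(detect_switch_request(["Duursteking licht spelen"]))  # None
# Single-language silence (conv fb4b4bae): no transcript events at all.
print(detect_switch_request([]))                            # None
```

The last two cases return the same value, which is why no downstream detector can tell "caller said nothing relevant" from "caller spoke a language the STT layer dropped".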

Multi-language Deepgram mode exists and would solve the silence problem, but it badly degrades Flemish accuracy. Prior team measurements showed the Dutch query "Wat zijn de bezoekuren" being transcribed as "Hå at zen de bezukjuren" under multi-language mode. Enabling it would therefore degrade recognition for every Dutch caller (95%+ of calls) in order to handle a mid-call switch that occurs in fewer than 1% of calls.

Deepgram operating-point comparison

The team evaluated three Deepgram operating points before the decision. The table below records what was observed. The entries are qualitative and come from prior team measurements; cells marked * would benefit from a structured re-measurement pass.

| Operating point | Flemish (nl-BE) accuracy | Cross-language coverage | Mid-call switching |
| --- | --- | --- | --- |
| Single-language (nl) | High*: "Wat zijn de bezoekuren" recognised cleanly | Zero: silent on English / French / Italian speech (no transcript emitted at all) | Impossible: no signal for any detector to act on |
| Multi-language (multi) | Degraded: "Wat zijn de bezoekuren" → "Hå at zen de bezukjuren" (prior team measurement) | Yes: English / French / Italian transcribed | Possible: both languages parsed |
| First-utterance probe + lock | High*: same as single-language for 99%+ of the call | Detected once at call start; locked thereafter | Not supported (deliberate trade-off) |

The starred entries above merit an updated measurement pass against the current Deepgram Nova-3 (@deepgram_nova3) build; the qualitative ranking has held in pilot calls.
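
A structured re-measurement pass needs a number per operating point. One common choice is word error rate (WER), sketched below in plain Python. The metric is an assumption: the source does not say how the prior measurements were scored, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words consumed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "Wat zijn de bezoekuren"
print(word_error_rate(reference, "Wat zijn de bezoekuren"))   # 0.0
print(word_error_rate(reference, "Hå at zen de bezukjuren"))  # 1.0
```

On the one example pair recorded in this page, the multi-language transcript needs four word edits against a four-word reference, so WER is 1.0. A real pass would average over a labelled set of nl-BE utterances per operating point.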

The decision (ADR-0052)

The voice channel's language is locked at the first utterance for the duration of the call. Mid-call language switching is not supported.

The voice_agent worker uses multi-language STT for the very first utterance only, detects the caller's actual spoken language, then locks Deepgram to that language for the remainder of the call. If the caller asks to switch language mid-call, the agent's response is a polite transfer offer:

"Ik kan u helaas niet omschakelen naar een andere taal. Ik kan u wel doorverbinden met een medewerker die u kan helpen. Wilt u dat ik dat doe?"

(Translated: "Unfortunately I cannot switch to another language. I can transfer you to a colleague who can help. Would you like me to do that?")

Implementation

The lock is owned entirely by voice_agent — the backend never changes language state, it only reads it:

voice_agent._current_language = None           # before first utterance
voice_agent.probe_first_utterance()            # multi-language STT
    → detect spoken language
    → voice_agent._switch_language(detected)   # reconfigure STT + TTS once
    → voice_agent._current_language = detected # e.g. "nl"; locked for call duration

# Every subsequent turn:
voice_agent → backend: QueryRequest{detected_language: "nl", ...}
backend VoiceLLMOrchestrator: reads detected_language, never modifies it

QueryRequest.detected_language carries the locked language on every backend call. The orchestrator is language-aware (uses it to select the right tenant FAQ overlay and voice system prompt variant) but has no mechanism to change it.
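
The lock-once, read-only-downstream behaviour described above can be sketched as a small state holder. This is a minimal illustration, not the voice_agent implementation: `LanguageLock`, `LanguageAlreadyLocked`, and `build_request` are hypothetical names, and `QueryRequest` is reduced to two fields.

```python
from dataclasses import dataclass
from typing import Optional


class LanguageAlreadyLocked(RuntimeError):
    """Raised when something tries to change the language after the probe."""


@dataclass(frozen=True)  # frozen: the receiver can read but not reassign fields
class QueryRequest:
    # Simplified stand-in; the real QueryRequest carries more fields.
    query: str
    detected_language: str


class LanguageLock:
    """Holds the call language: unset until the probe, immutable afterwards."""

    def __init__(self) -> None:
        self._current_language: Optional[str] = None

    @property
    def current_language(self) -> Optional[str]:
        return self._current_language

    def lock(self, detected: str) -> None:
        if self._current_language is not None:
            raise LanguageAlreadyLocked(
                f"language already locked to {self._current_language!r}"
            )
        self._current_language = detected

    def build_request(self, query: str) -> QueryRequest:
        if self._current_language is None:
            raise RuntimeError("first-utterance probe has not run yet")
        return QueryRequest(query=query, detected_language=self._current_language)


lock = LanguageLock()
lock.lock("nl")                                    # result of the first-utterance probe
request = lock.build_request("Wat zijn de bezoekuren")
print(request.detected_language)                   # nl
```

Raising on a second `lock()` call is one way to make an accidental mid-call switch fail loudly instead of silently reconfiguring STT.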

What was removed

The following were deleted in the ADR-0052 commit:

| Artifact | Why removed |
| --- | --- |
| switch_language tool from VoiceLLMOrchestrator._TOOLS | Language switching is no longer an orchestrator responsibility |
| Mid-call regex fast-path (_detect_language_request, _LANG_REQUEST_PATTERNS) | Covered a case that Deepgram's silence makes undetectable |
| In-loop switch_language short-circuit handling | No tool to dispatch to |
| switch_language system-prompt tool description | Removed from tool list |
| TestLanguageFastPath unit-test class | Tests the deleted path |
| test_voice_llm_language_fast_path.py | Deleted entirely |
| 2 integration tests for switch_language tool success + invalid-code paths | Deleted |

Net: ~80 LOC removed, 4 tools → 3 tools (search_hospital_kb, transfer_to_helpdesk, end_call remain; switch_language is gone).

What was preserved

| Artifact | Why kept |
| --- | --- |
| voice_agent.language_detection.detect_language_request | Still used for the first-utterance probe |
| voice_agent.agent._switch_language() | Called once at call start by the probe |
| conversational_intent enum value switch_language | Backwards compatibility; no orchestrator path emits it currently |
| System-prompt instruction to politely decline and offer transfer on mid-call switch requests | The agent still needs to handle the request gracefully |

Hospital-agnostic parameterisation

The lock-and-stay policy is universal — it applies to every tenant regardless of their supported language set. A single-language tenant (ZOL: Dutch) and a multi-language tenant (e.g., a medical-tourism hospital: nl/fr/en/it) both use the lock. The difference is what language the first-utterance probe detects and locks to.

Tenant language configuration is read via get_taxonomy(slug), the same DB-driven path as all other tenant data. No languages are hardcoded in source.
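
As a rough sketch of how the probe result and tenant configuration could combine, the example below uses an in-memory stand-in for `get_taxonomy(slug)`. Everything beyond that function name is an assumption: the field names, the slugs, and in particular the fallback to a tenant default when the detected language is unsupported, which the ADR does not specify.

```python
from typing import TypedDict


class Taxonomy(TypedDict):
    supported_languages: list[str]  # hypothetical field name
    default_language: str           # hypothetical field name


# In-memory stand-in for the DB-backed lookup; slugs and fields are invented.
_FAKE_DB: dict[str, Taxonomy] = {
    "zol": {"supported_languages": ["nl"], "default_language": "nl"},
    "example-multilingual": {
        "supported_languages": ["nl", "fr", "en", "it"],
        "default_language": "nl",
    },
}


def get_taxonomy(slug: str) -> Taxonomy:
    return _FAKE_DB[slug]


def resolve_lock_language(slug: str, detected: str) -> str:
    """Pick the language to lock: the probe result if the tenant supports it,
    otherwise the tenant default (assumed fallback, not taken from the ADR)."""
    taxonomy = get_taxonomy(slug)
    if detected in taxonomy["supported_languages"]:
        return detected
    return taxonomy["default_language"]


print(resolve_lock_language("zol", "en"))                   # nl
print(resolve_lock_language("example-multilingual", "en"))  # en
```

The same function serves both tenants; only the data returned by the lookup differs, which is the sense in which the policy is hospital-agnostic.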

If a future tenant requires genuine mid-call language switching (bilingual conversations are common in their population), this ADR will need revisiting. The options at that point are: Deepgram multi-language mode with a per-tenant Flemish-accuracy trade-off accepted, or an alternative STT vendor that handles code-switching natively.

Contract test

backend/tests/integration/services/voice/test_voice_llm_orchestrator_integration.py — the contract test that pins the language plumbing across the voice_agent → backend handoff:

async def test_detected_language_from_voice_agent_is_respected_by_orchestrator(
    orchestrator, make_request
):
    """voice_agent locks language at first utterance and sends
    detected_language on every turn. The orchestrator must read and
    use it — never infer or override."""
    request = make_request(query="Parkeren bij ZOL", detected_language="nl")
    response = await orchestrator.query_stream(request).__anext__()
    assert response.detected_language == "nl"
    # No switch_language tool call in the response conversational_intent
    assert response.conversational_intent != "switch_language"
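
The contract test depends on project fixtures (`orchestrator`, `make_request`). To show the same access pattern in isolation, the sketch below swaps them for a hypothetical `FakeOrchestrator` and simplified request/response dataclasses; none of these reflect the real class definitions.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional


@dataclass(frozen=True)
class QueryRequest:
    query: str
    detected_language: str


@dataclass(frozen=True)
class QueryResponse:
    text: str
    detected_language: str
    conversational_intent: Optional[str] = None


class FakeOrchestrator:
    """Stand-in that echoes the request language and never emits switch_language."""

    async def query_stream(self, request: QueryRequest) -> AsyncIterator[QueryResponse]:
        yield QueryResponse(text="...", detected_language=request.detected_language)


async def first_chunk(orchestrator: FakeOrchestrator, request: QueryRequest) -> QueryResponse:
    # Same access pattern as the contract test: take the first streamed item.
    return await orchestrator.query_stream(request).__anext__()


response = asyncio.run(
    first_chunk(FakeOrchestrator(), QueryRequest("Parkeren bij ZOL", "nl"))
)
assert response.detected_language == "nl"
assert response.conversational_intent != "switch_language"
```

Any orchestrator that infers its own language or emits a switch intent on the first chunk would fail these two assertions, which is the property the contract test pins.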

References

  • ADR-0052: Voice Language Locked at First Utterance
  • ADR-0051: Agentic VoiceLLMOrchestrator (the ADR this one supersedes on language switching)
  • backend/app/services/voice/voice_llm_orchestrator.py — tools list (3 tools: search_hospital_kb, transfer_to_helpdesk, end_call)
  • Deepgram Nova-3 — production STT model; the single-language vs multi-language trade-off is the empirical input to this ADR
  • Pilot call transcripts: conv 5c81a578 (47s gibberish loop), conv fb4b4bae (30s silent hang-up) — both in backend/app/services/voice/ log archive
  • Radford et al. 2023