ADR-0061: Multilingual handling via query-rewriting to the corpus language
Date: 2026-06-07 | Status: Accepted (principle in force for ZOL; per-tenant canonical_language parameterization planned) | Relates to: ADR-0030 — LLM Entity Extraction, ADR-0031 — Semantic Query Cache, Voice Language Locking (ADR-0052), Multilingual Prompts
In one sentence: we make a multilingual product work against a single-language knowledge base by rewriting every inbound query into the corpus language once, then running all retrieval and symbolic resolution against that canonical form — rather than teaching the knowledge base to recognise every language.
This is the detailed companion to the pipeline page Query Rewriting → Why the corpus language?. It exists because "why rewrite to Dutch?" is a question we are asked often — by stakeholders, by a future maintainer, and by anyone onboarding a non-Dutch hospital — and the answer deserves a precise, citable home.
The question that surfaced it
A French voice caller said "je crois que j'ai une syphilis" (2026-06-06). The assistant recommended seeing a urologist — ungrounded, and clinically wrong: syphilis is handled by dermatology-venereology / infectious diseases. The condition never resolved to a department because the lookup ran against the caller's raw French utterance, and the taxonomy is keyed in Dutch (syfilis). That raised the design question this record settles: when a non-Dutch query must resolve against a Dutch knowledge base, do we (A) enumerate per-language spellings, or (B) normalize the query to the corpus language first?
Two layers, two very different language needs
| Layer | What it is | Does it need the corpus language? |
|---|---|---|
| Vector layer | embeddings → cosine retrieval | Partly. The embedding model is multilingual, so "syphilis" (fr) and "syfilis" (nl) already land near each other, and cross-lingual retrieval works partially even without a rewrite. |
| Symbolic layer | taxonomy CONDITION_TO_DEPT_MAP, condition/treatment aliases, SNOMED lookups, keyword-rescue safe_contains, graph resolution | Yes, absolutely. This is exact / fuzzy string matching against Dutch surface forms. "syphilis" never matches the key "syfilis" — vector closeness is irrelevant to a dict lookup. |
The syphilis failure lived entirely in the symbolic layer. That is the crux: the layer that broke is the one that cannot be fixed by better embeddings.
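A minimal sketch of why the symbolic layer fails where the vector layer only degrades. The map below is a hypothetical one-entry stand-in, not the real CONDITION_TO_DEPT_MAP; only the Dutch key syfilis comes from the incident above, and the department label is an assumption.

```python
from typing import Optional

# Hypothetical, trimmed-down stand-in for the Dutch-keyed taxonomy map.
EXAMPLE_CONDITION_TO_DEPT_MAP = {
    "syfilis": "dermatologie-venereologie",  # assumed department label
}


def lookup_department(condition: str) -> Optional[str]:
    """Exact-match lookup, as a dict-backed symbolic layer would perform it."""
    return EXAMPLE_CONDITION_TO_DEPT_MAP.get(condition.strip().lower())


french_hit = lookup_department("syphilis")  # raw French surface form: no key matches
dutch_hit = lookup_department("syfilis")    # corpus-language surface form: resolves
```

However close the two spellings sit in embedding space, the first lookup returns nothing, because a dict key either matches or it does not.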
Decision
Multilingual handling is done by rewriting every inbound query to the corpus language, and authoring the knowledge base once in that language. We do not enumerate per-language spellings, nor maintain per-language regex/alias tables for the symbolic layer.
The rewrite already happens: the intent-classification step (ADR-0030) instructs "ALWAYS rewrite to Dutch" and emits rewritten_query (standalone Dutch) plus an extracted Dutch entities.condition. Every symbolic-layer consumer must resolve against that normalized signal, preferring entities.condition → rewritten_query → raw input, so the raw multilingual utterance is used only as a last resort when no normalized form is available.
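The preference order can be sketched as a small helper. The function name and signature are hypothetical, and the Dutch rewrite in the usage example is an illustrative translation rather than captured model output.

```python
from typing import Optional


def resolution_text(raw_input: str,
                    rewritten_query: Optional[str],
                    entity_condition: Optional[str]) -> str:
    """Pick the text a symbolic-layer consumer should resolve against.

    Preference: entities.condition, then rewritten_query, then raw input.
    """
    for candidate in (entity_condition, rewritten_query):
        if candidate and candidate.strip():
            return candidate.strip()
    # Last resort only: the raw utterance may not be in the corpus language.
    return raw_input.strip()


chosen = resolution_text(
    raw_input="je crois que j'ai une syphilis",
    rewritten_query="ik denk dat ik syfilis heb",  # assumed rewrite
    entity_condition="syfilis",
)
```

Here chosen is the extracted Dutch condition; if the extractor returned no condition, the helper would fall back to the rewritten query before ever touching the French utterance.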
Why the corpus language, in priority order
- Symbolic resolution — decisive. The Dutch-authored taxonomy/aliases/graph only match Dutch surface forms. Normalization is what makes them resolve at all. This is the reason the syphilis bug needed it; it is not a vector concern.
- Author the knowledge base once — decisive for scale. One canonical language means the taxonomy, aliases, prompts, and safety heuristics are written a single time, not once per supported language.
- Single-language downstream — operational. The semantic cache key, the reranker, and the Dutch-tuned answer-shaping / number-normalization / safety heuristics all assume one canonical language.
- Tighter vectors — secondary bonus. Multilingual embedding spaces cluster somewhat by language, so a same-language query↔document pair scores higher cosine than an equal-meaning cross-language pair. Rewriting tightens those scores — which matters for threshold-based gates (the retrieval-confidence abstain floor, rerank cutoffs, keyword-rescue scoring), not for whether retrieval works at all.
Tighter vectors are therefore a real but secondary benefit, not the reason for the rewrite. We rewrite to the corpus language primarily so that the deterministic knowledge layer resolves at all, and so that we author that knowledge once. Tighter vectors sharpen ranking thresholds; they are not the foundation.
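To make the threshold point concrete, here is a toy sketch of a cosine-based gate. The two-dimensional vectors and the 0.75 floor are invented for illustration; they are not real embeddings or the production threshold, and passes_gate is a hypothetical name.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


ABSTAIN_FLOOR = 0.75  # made-up value, not the production abstain floor


def passes_gate(query_vec: Sequence[float], doc_vec: Sequence[float],
                floor: float = ABSTAIN_FLOOR) -> bool:
    """True when the query-document similarity clears the confidence floor."""
    return cosine(query_vec, doc_vec) >= floor


doc_nl = [1.0, 0.0]    # toy Dutch document vector
query_nl = [0.9, 0.1]  # toy rewritten (same-language) query
query_fr = [0.7, 0.7]  # toy raw (cross-language) query

same_language_ok = passes_gate(query_nl, doc_nl)
cross_language_ok = passes_gate(query_fr, doc_nl)
```

Both toy queries have positive similarity to the document, so retrieval would still surface it, but only the same-language query clears the floor. That is the sense in which rewriting affects gates rather than whether retrieval works.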
The rejected alternative: per-language enumeration
Adding syphilis / sifilide / gonorrhée / … to the taxonomy and aliases (plus per-language regex tables elsewhere) scales as conditions × languages and never ends — "chasing our tail." We reject it. The rewrite collapses all input languages to one canonical form before the symbolic layer runs, so the knowledge base only ever needs the corpus language. A multilingual-spelling patch to the taxonomy was, in fact, started during the syphilis fix and deliberately abandoned in favour of resolving against the already-rewritten Dutch query.
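A back-of-the-envelope sketch of the scaling argument. The counts are invented round numbers, not measurements of the real taxonomy, and both function names are hypothetical.

```python
def enumeration_entries(conditions: int, languages: int) -> int:
    """Entries to maintain if every condition is spelled out per language."""
    return conditions * languages


def rewrite_entries(conditions: int) -> int:
    """Entries to maintain when queries are rewritten to one corpus language."""
    return conditions


conditions = 500  # illustrative count only
enumerated = enumeration_entries(conditions, languages=4)
rewritten = rewrite_entries(conditions)
# Supporting a fifth input language adds a full extra column under enumeration
# and nothing under the rewrite approach.
extra_for_new_language = enumeration_entries(conditions, 5) - enumerated
```

The simplification here is one spelling per condition per language; aliases and regex variants only widen the gap.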
What a French / English / Romanian hospital changes
Almost nothing structural — the design holds one canonical language per tenant: the corpus language. Onboarding a non-Dutch hospital changes which language is canonical, not how the machinery works:
| Layer | ZOL (today) | A French / English / Romanian tenant |
|---|---|---|
| Rewrite target | Dutch (hardcoded in the prompt) | the tenant's canonical_language (fr / en / ro) |
| Taxonomy & condition→dept maps | authored in Dutch | authored in the corpus language (already per-tenant under tenant_overlays/) |
| Symbolic resolution | against the Dutch taxonomy | against that tenant's corpus-language taxonomy |
| Cross-language reach | multilingual embeddings + rewrite | unchanged — same mechanism |
The intent prompt is already built per-tenant (build_intent_and_rewrite_prompt(ctx)), so the hook is half-present; the planned change replaces the hardcoded "Dutch" with the tenant's canonical_language. No per-language enumeration is ever introduced — only the single normalization target moves.
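A sketch of what the planned parameterization might look like. TenantContext, LANGUAGE_NAMES, and rewrite_instruction are hypothetical names introduced for illustration; this is not the actual build_intent_and_rewrite_prompt(ctx), whose internals are not shown in this record.

```python
from dataclasses import dataclass

# Assumed mapping from language code to the name used in the prompt text.
LANGUAGE_NAMES = {"nl": "Dutch", "fr": "French", "en": "English", "ro": "Romanian"}


@dataclass(frozen=True)
class TenantContext:
    """Hypothetical per-tenant context; the real ctx object may differ."""
    tenant_id: str
    canonical_language: str = "nl"  # default mirrors today's hardcoded target


def rewrite_instruction(ctx: TenantContext) -> str:
    """Build the rewrite directive for the tenant's corpus language."""
    try:
        language = LANGUAGE_NAMES[ctx.canonical_language]
    except KeyError:
        raise ValueError(f"unsupported canonical_language: {ctx.canonical_language!r}")
    return f"ALWAYS rewrite to {language}"


zol_instruction = rewrite_instruction(TenantContext("zol"))
french_instruction = rewrite_instruction(TenantContext("example-fr-tenant", "fr"))
```

Only the normalization target changes between the two tenants; nothing in the sketch enumerates spellings per input language.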
Not the same as the language lock
This is a different concern from voice/chat language locking. The lock decides what language we answer in (the caller's, pinned per conversation). Rewriting-to-corpus-language decides what language we retrieve and resolve in (always the corpus language). A French call is answered in French while its knowledge-base lookups run in Dutch. The two compose; they do not conflict.
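The separation can be pictured as two independent fields on a turn. The TurnLanguages type and languages_for_turn function are hypothetical illustrations, not names from the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnLanguages:
    answer_language: str     # decided by the language lock (the caller's language)
    retrieval_language: str  # decided by the rewrite (the corpus language)


def languages_for_turn(locked_language: str, canonical_language: str) -> TurnLanguages:
    """Each input feeds exactly one field, so neither setting overrides the other."""
    return TurnLanguages(answer_language=locked_language,
                         retrieval_language=canonical_language)


french_call = languages_for_turn(locked_language="fr", canonical_language="nl")
```

For the French call, the answer language is fr while lookups run in nl, matching the behaviour described above.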
Consequences
- + The symbolic layer resolves for any input language with zero per-language tables.
- + The knowledge base is authored once; onboarding a non-Dutch hospital is a canonical_language + corpus-language-taxonomy change, not a combinatorial spelling effort.
- + Consistent cache keys, reranking, and Dutch-tuned heuristics.
- − Correctness depends on an accurate rewrite; a rare mis-rewrite degrades that turn's symbolic resolution. Mitigated by fail-open guards that abstain rather than guess.
- − The per-tenant canonical_language parameterization is not yet implemented; ZOL's rewrite target is hardcoded Dutch.
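One way to read the abstain mitigation, sketched under assumptions: the guard returns a department only on a taxonomy hit and otherwise flags the turn as abstained, so a mis-rewrite yields no recommendation instead of a wrong one. Resolution and resolve_or_abstain are hypothetical names, and the taxonomy entry is an assumed example.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Resolution:
    department: Optional[str]
    abstained: bool


def resolve_or_abstain(condition: Optional[str],
                       taxonomy: Mapping[str, str]) -> Resolution:
    """Return a grounded department, or abstain when nothing matches."""
    key = (condition or "").strip().lower()
    department = taxonomy.get(key) if key else None
    return Resolution(department=department, abstained=department is None)


taxonomy = {"syfilis": "dermatologie-venereologie"}  # assumed entry
grounded = resolve_or_abstain("syfilis", taxonomy)
mis_rewritten = resolve_or_abstain("syphilis", taxonomy)  # simulated bad rewrite
```

The simulated mis-rewrite produces an abstention rather than a guessed department, which is the failure mode the guard is meant to enforce.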
See also
- Query Rewriting — the pipeline page; "Why the corpus language?" section mirrors this record.
- ADR-0030 — LLM Entity Extraction — emits {intent, rewritten_query, entities} in one call.
- ADR-0031 — Semantic Query Cache — keys on the rewritten query.
- Voice Language Locking — the distinct answer-language concern.