ADR-0061: Multilingual handling via query-rewriting to the corpus language

Date: 2026-06-07 | Status: Accepted (principle in force for ZOL; per-tenant canonical_language parameterization planned) | Relates to: ADR-0030 — LLM Entity Extraction, ADR-0031 — Semantic Query Cache, Voice Language Locking (ADR-0052), Multilingual Prompts

In one sentence: we make a multilingual product work against a single-language knowledge base by rewriting every inbound query into the corpus language once, then running all retrieval and symbolic resolution against that canonical form — rather than teaching the knowledge base to recognise every language.

This is the detailed companion to the pipeline page "Query Rewriting → Why the corpus language?". It exists because "why rewrite to Dutch?" is a question we are asked often (by stakeholders, by future maintainers, and by anyone onboarding a non-Dutch hospital), and the answer deserves a precise, citable home.

The question that surfaced it

A French voice caller said "je crois que j'ai une syphilis" (2026-06-06). The assistant recommended seeing a urologist — ungrounded, and clinically wrong: syphilis is handled by dermatology-venereology / infectious diseases. The condition never resolved to a department because the lookup ran against the caller's raw French utterance, and the taxonomy is keyed in Dutch (syfilis). That raised the design question this record settles: when a non-Dutch query must resolve against a Dutch knowledge base, do we (A) enumerate per-language spellings, or (B) normalize the query to the corpus language first?

Two layers, two very different language needs

Layer	What it is	Does it need the corpus language?
Vector layer	embeddings → cosine retrieval	Partly. The embedding model is multilingual, so "syphilis" (fr) and "syfilis" (nl) already land near each other — cross-lingual retrieval half-works untouched.
Symbolic layer	taxonomy `CONDITION_TO_DEPT_MAP`, condition/treatment aliases, SNOMED lookups, keyword-rescue `safe_contains`, graph resolution	Yes, absolutely. This is exact / fuzzy string matching against Dutch surface forms. "syphilis" never matches the key "syfilis" — vector closeness is irrelevant to a dict lookup.

The syphilis failure lived entirely in the symbolic layer. That is the crux: the layer that broke is the one that cannot be fixed by better embeddings.
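A minimal sketch of why the symbolic layer fails here, assuming the taxonomy behaves like a plain Dutch-keyed dictionary. The one-entry map, the department label, and the `resolve_department` helper are illustrative assumptions, not the production code; only the exact-match path is shown.

```python
from typing import Optional

# Hypothetical one-entry slice of a Dutch-keyed taxonomy.
# The department label is a placeholder, not the real ZOL value.
CONDITION_TO_DEPT_MAP = {"syfilis": "dermatologie-venereologie"}


def resolve_department(condition: str) -> Optional[str]:
    """Exact-match lookup against Dutch surface forms."""
    return CONDITION_TO_DEPT_MAP.get(condition.strip().lower())


print(resolve_department("syphilis"))  # None: the French spelling is not a key
print(resolve_department("syfilis"))   # dermatologie-venereologie
```

However close the two spellings sit in embedding space, the lookup sees two different strings, so only the Dutch form resolves.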

Decision

Multilingual handling is done by rewriting every inbound query to the corpus language, and authoring the knowledge base once in that language. We do not enumerate per-language spellings, nor maintain per-language regex/alias tables for the symbolic layer.

The rewrite already happens: the intent-classification step (ADR-0030) instructs "ALWAYS rewrite to Dutch" and emits rewritten_query (standalone Dutch) plus an extracted Dutch entities.condition. Every symbolic-layer consumer must resolve against that normalized signal — preferring entities.condition → rewritten_query → raw input — never the raw multilingual utterance.
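The preference order can be sketched as a small selector. This is an illustration under assumptions: the `IntentResult` shape and the `symbolic_signal` helper are hypothetical, and only the field names `rewritten_query` and `entities.condition` come from this record. The Dutch rewrite in the usage example is likewise made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IntentResult:
    """Hypothetical shape of the intent step's output."""
    raw_input: str
    rewritten_query: Optional[str] = None
    entities: dict = field(default_factory=dict)


def symbolic_signal(result: IntentResult) -> str:
    """Pick the text that symbolic lookups should run against.

    Preference: entities.condition, then rewritten_query, then raw input
    as a last resort.
    """
    candidates = (
        result.entities.get("condition"),
        result.rewritten_query,
        result.raw_input,
    )
    for candidate in candidates:
        if candidate and candidate.strip():
            return candidate.strip()
    return ""


turn = IntentResult(
    raw_input="je crois que j'ai une syphilis",
    rewritten_query="ik denk dat ik syfilis heb",  # illustrative rewrite
    entities={"condition": "syfilis"},
)
print(symbolic_signal(turn))  # syfilis
```

Routing every symbolic consumer through one selector like this keeps the fallback order in a single place instead of re-deciding it per call site.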

Why the corpus language, in priority order

1. Symbolic resolution: decisive. The Dutch-authored taxonomy, aliases, and graph only match Dutch surface forms. Normalization is what makes them resolve at all. This is why the syphilis bug required normalization; it is not a vector concern.
2. Author the knowledge base once: decisive for scale. One canonical language means the taxonomy, aliases, prompts, and safety heuristics are written a single time, not once per supported language.
3. Single-language downstream: operational. The semantic cache key, the reranker, and the Dutch-tuned answer-shaping, number-normalization, and safety heuristics all assume one canonical language.
4. Tighter vectors: secondary bonus. Multilingual embedding spaces cluster somewhat by language, so a same-language query↔document pair scores a higher cosine than an equal-meaning cross-language pair. Rewriting tightens those scores, which matters for threshold-based gates (the retrieval-confidence abstain floor, rerank cutoffs, keyword-rescue scoring), not for whether retrieval works at all.

So, "is it because of closer vectors?"

Partly, but that is the secondary benefit. We rewrite to the corpus language primarily so the deterministic knowledge layer resolves at all, and so we author that knowledge once. Tighter vectors sharpen ranking thresholds; they are not the foundation.

The rejected alternative: per-language enumeration

Adding syphilis / sifilide / gonorrhée / … to the taxonomy and aliases (plus per-language regex tables elsewhere) scales as conditions × languages and never ends — "chasing our tail." We reject it. The rewrite collapses all input languages to one canonical form before the symbolic layer runs, so the knowledge base only ever needs the corpus language. A multilingual-spelling patch to the taxonomy was, in fact, started during the syphilis fix and deliberately abandoned in favour of resolving against the already-rewritten Dutch query.
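The scaling argument is simple arithmetic. The sketch below uses made-up counts (2,000 conditions, 4 input languages) purely to illustrate the shape of the growth; neither number comes from the real taxonomy, and `alias_entries` is a hypothetical helper.

```python
def alias_entries(conditions: int, languages: int, rewrite_first: bool) -> int:
    """Surface forms the symbolic layer must carry, at one spelling per condition per language.

    With rewrite-first, only the corpus language is authored, so the
    number of input languages drops out of the count.
    """
    return conditions if rewrite_first else conditions * languages


# Illustrative numbers only.
print(alias_entries(2000, 4, rewrite_first=False))  # 8000
print(alias_entries(2000, 4, rewrite_first=True))   # 2000
```

Each added input language grows the enumerated table by another full copy of the condition list, while the rewrite-first count stays flat.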

What a French / English / Romanian hospital changes

Almost nothing structural — the design holds one canonical language per tenant: the corpus language. Onboarding a non-Dutch hospital changes which language is canonical, not how the machinery works:

Layer	ZOL (today)	A French / English / Romanian tenant
Rewrite target	Dutch (hardcoded in the prompt)	the tenant's `canonical_language` (fr / en / ro)
Taxonomy & condition→dept maps	authored in Dutch	authored in the corpus language (already per-tenant under `tenant_overlays/`)
Symbolic resolution	against the Dutch taxonomy	against that tenant's corpus-language taxonomy
Cross-language reach	multilingual embeddings + rewrite	unchanged — same mechanism

The intent prompt is already built per-tenant (build_intent_and_rewrite_prompt(ctx)), so the hook is half-present; the planned change replaces the hardcoded "Dutch" with the tenant's canonical_language. No per-language enumeration is ever introduced — only the single normalization target moves.
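A minimal sketch of what the planned parameterization might look like, assuming the tenant context carries a `canonical_language` code. `TenantContext`, `LANGUAGE_NAMES`, and `rewrite_instruction` are hypothetical names for illustration; the real build_intent_and_rewrite_prompt(ctx) is not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical mapping from language codes to the name used in the prompt.
LANGUAGE_NAMES = {"nl": "Dutch", "fr": "French", "en": "English", "ro": "Romanian"}


@dataclass(frozen=True)
class TenantContext:
    """Hypothetical per-tenant context; defaults to Dutch, matching ZOL today."""
    tenant_id: str
    canonical_language: str = "nl"


def rewrite_instruction(ctx: TenantContext) -> str:
    """Build the rewrite instruction from the tenant's canonical language.

    Raises KeyError for an unconfigured language code rather than
    silently falling back to another target.
    """
    return f"ALWAYS rewrite to {LANGUAGE_NAMES[ctx.canonical_language]}"


print(rewrite_instruction(TenantContext("zol")))                  # ALWAYS rewrite to Dutch
print(rewrite_instruction(TenantContext("example-fr", "fr")))     # ALWAYS rewrite to French
```

Only the normalization target varies per tenant; nothing in the sketch adds per-language tables.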

Not the same as the language lock

This is a different concern from voice/chat language locking. The lock decides what language we answer in (the caller's, pinned per conversation). Rewriting-to-corpus-language decides what language we retrieve and resolve in (always the corpus language). A French call is answered in French while its knowledge-base lookups run in Dutch. The two compose; they do not conflict.

Consequences

+ The symbolic layer resolves for any input language with zero per-language tables.
+ The knowledge base is authored once; onboarding a non-Dutch hospital is a canonical_language + corpus-language-taxonomy change, not a combinatorial spelling effort.
+ Cache keys, reranking, and the Dutch-tuned heuristics all operate on one consistent language.
− Correctness depends on an accurate rewrite; a rare mis-rewrite degrades that turn's symbolic resolution. Mitigated by fail-open guards that abstain rather than guess.
− The per-tenant canonical_language parameterization is not yet implemented — ZOL's rewrite target is hardcoded Dutch.
