Lingua Language Detection Validation

ADR-0037 | Accepted | 2026-02-20

Problem

The chatbot responds in the user's detected language, a requirement of the cross-lingual RAG pipeline (Lewis et al., 2020). The LLM performs language detection as part of intent classification. However, short Dutch queries such as "welke arts bij psoriasis" were occasionally misclassified as Romanian, so the system responded in the wrong language.

Solution

We added lingua-language-detector as a statistical confidence check that validates the language detected by the LLM.

How It Works

User query → LLM classifies intent + language → Lingua validates → Final language

Scenario	Action
Lingua agrees with LLM	Use LLM's detection
Lingua disagrees (confidence >= 50%)	Override with Lingua's detection
Lingua uncertain (confidence < 50%, short/ambiguous input)	Trust LLM
Lingua unavailable	Trust LLM (graceful degradation)
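
The decision table above can be sketched as a small pure function. This is a minimal illustration, not the actual body of validate_detected_language(): the name resolve_language, its parameters, and the convention of passing None when Lingua is unavailable are assumptions made for the example. Only the 50% threshold comes from the table.

```python
from typing import Optional

# The 50% override threshold from the decision table.
CONFIDENCE_THRESHOLD = 0.5


def resolve_language(
    llm_language: str,
    lingua_language: Optional[str],
    lingua_confidence: float = 0.0,
) -> str:
    """Return the final language code (hypothetical helper).

    lingua_language is None when Lingua is unavailable or returned no result.
    """
    # Lingua unavailable: graceful degradation, trust the LLM.
    if lingua_language is None:
        return llm_language
    # Lingua agrees with the LLM: use the LLM's detection.
    if lingua_language == llm_language:
        return llm_language
    # Lingua disagrees and is confident enough: override.
    if lingua_confidence >= CONFIDENCE_THRESHOLD:
        return lingua_language
    # Lingua disagrees but is uncertain: trust the LLM.
    return llm_language


# Confidence values below are illustrative inputs, not measured results.
print(resolve_language("ro", "nl", 0.9))   # nl
print(resolve_language("ro", "nl", 0.3))   # ro
print(resolve_language("ro", None))        # ro
```

Keeping the rule free of any Lingua import makes it easy to unit test each row of the table in isolation.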

Supported Languages

Dutch, English, French, German, Romanian, Turkish, Italian, Greek, Arabic, Polish, Russian, Spanish.
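
Since the verification cases refer to languages by two-letter codes such as "nl" and "ro", the supported set could be kept as a mapping from ISO 639-1 codes to names. The names SUPPORTED_LANGUAGES and is_supported are hypothetical and only illustrate one possible representation.

```python
from typing import Dict

# ISO 639-1 codes for the twelve supported languages (hypothetical constant).
SUPPORTED_LANGUAGES: Dict[str, str] = {
    "nl": "Dutch",
    "en": "English",
    "fr": "French",
    "de": "German",
    "ro": "Romanian",
    "tr": "Turkish",
    "it": "Italian",
    "el": "Greek",
    "ar": "Arabic",
    "pl": "Polish",
    "ru": "Russian",
    "es": "Spanish",
}


def is_supported(code: str) -> bool:
    """Check whether a language code is in the supported set (case-insensitive)."""
    return code.strip().lower() in SUPPORTED_LANGUAGES


print(is_supported("NL"))  # True
print(is_supported("ja"))  # False
```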

Performance

One-time initialization: ~50ms (lazy, on first query)
Per-query validation: <1ms
Negligible impact on overall pipeline latency (LLM calls take seconds)
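
One way to get lazy, one-time initialization together with graceful degradation is to cache a builder function and treat a missing package as "no detector". The sketch below assumes the lingua-language-detector package exposes Language and LanguageDetectorBuilder as in its public Python API; the function name get_detector is hypothetical and the real service may be structured differently.

```python
from functools import lru_cache
from typing import Any, Optional


@lru_cache(maxsize=1)
def get_detector() -> Optional[Any]:
    """Build the Lingua detector on first call; return None if Lingua is missing."""
    try:
        from lingua import Language, LanguageDetectorBuilder
    except ImportError:
        # Graceful degradation: callers fall back to the LLM's detection.
        return None
    languages = [
        Language.DUTCH, Language.ENGLISH, Language.FRENCH, Language.GERMAN,
        Language.ROMANIAN, Language.TURKISH, Language.ITALIAN, Language.GREEK,
        Language.ARABIC, Language.POLISH, Language.RUSSIAN, Language.SPANISH,
    ]
    # Restricting the detector to the supported languages narrows the candidates.
    return LanguageDetectorBuilder.from_languages(*languages).build()


# The first call pays the initialization cost; later calls reuse the cached object.
first = get_detector()
second = get_detector()
print(first is second)  # True
```

Because lru_cache also caches the None result, a missing installation is checked only once rather than on every query.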

Why Lingua?

Library	Short text accuracy	Deterministic	Speed
Lingua	Best	Yes	Good
langdetect	Poor	No	Slow
fast-langdetect	Good	Yes	Fastest

Lingua is specifically optimized for short text detection, using character n-grams of sizes 1 to 5, which makes it well suited to our use case of 3-6 word hospital search queries.
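
To give a feel for why n-grams of sizes 1 to 5 help on short input, the snippet below extracts character n-grams from a single-word query. It is an illustration of the general technique only and does not reproduce Lingua's internal tokenization or scoring; the function names are hypothetical.

```python
from typing import Dict, List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all contiguous character n-grams of length n."""
    cleaned = text.lower()
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def ngram_profile(text: str, max_n: int = 5) -> Dict[int, List[str]]:
    """Collect n-grams for every size from 1 to max_n."""
    return {n: char_ngrams(text, n) for n in range(1, max_n + 1)}


print(char_ngrams("arts", 3))  # ['art', 'rts']

profile = ngram_profile("hartchirurgie")
# Even a single 13-letter word yields 13 + 12 + 11 + 10 + 9 features.
print(sum(len(grams) for grams in profile.values()))  # 55
```

A detector limited to one n-gram size would see far fewer features for the same query, which is one plausible reason short queries are harder for such models.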

Verification

"welke arts bij psoriasis" + LLM says "ro" → Lingua overrides to "nl" ✓
"Cum pot face o programare?" + LLM says "ro" → Lingua agrees: "ro" ✓
"What are visiting hours?" + LLM says "en" → Lingua agrees: "en" ✓
"hartchirurgie" + LLM says "ro" → Lingua uncertain, trusts LLM: "ro" ✓

Related

ADR-0037
Multilingual Prompts
Implementation: backend/app/services/intent_classification_service.py → validate_detected_language()
