Lingua Language Detection Validation
ADR-0037 | Accepted | 2026-02-20
Problem
The chatbot responds in the user's detected language, a requirement of the cross-lingual RAG pipeline (Lewis et al., 2020). Language detection is performed by the LLM as part of intent classification. However, short Dutch queries like "welke arts bij psoriasis" were occasionally misclassified as Romanian, causing the system to respond in the wrong language.
Solution
Added lingua-language-detector as a statistical confidence check that validates the LLM's language detection.
How It Works
User query → LLM classifies intent + language → Lingua validates → Final language
| Scenario | Action |
|---|---|
| Lingua agrees with LLM | Use LLM's detection |
| Lingua disagrees (confidence >= 50%) | Override with Lingua's detection |
| Lingua uncertain (confidence < 50%, e.g. short or ambiguous input) | Trust LLM |
| Lingua unavailable | Trust LLM (graceful degradation) |
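The decision table above can be sketched as a small pure function. This is a minimal illustration, not the actual body of `validate_detected_language()`: the name `resolve_language`, its parameters, and the convention of passing `None` when Lingua is unavailable are all assumptions made for the example.

```python
from typing import Optional

# The ">= 50%" cut-off from the decision table.
CONFIDENCE_THRESHOLD = 0.5


def resolve_language(
    llm_lang: str,
    lingua_lang: Optional[str],
    lingua_confidence: float = 0.0,
    threshold: float = CONFIDENCE_THRESHOLD,
) -> str:
    """Return the final language code (hypothetical helper).

    lingua_lang is None when Lingua is unavailable or returned nothing.
    """
    # Lingua unavailable: trust the LLM (graceful degradation).
    if lingua_lang is None:
        return llm_lang
    # Lingua agrees: keep the LLM's detection.
    if lingua_lang == llm_lang:
        return llm_lang
    # Lingua disagrees with enough confidence: override.
    if lingua_confidence >= threshold:
        return lingua_lang
    # Lingua disagrees but is uncertain: trust the LLM.
    return llm_lang
```

For example, `resolve_language("ro", "nl", 0.93)` returns `"nl"`, while `resolve_language("ro", "nl", 0.30)` keeps `"ro"`. The confidence values here are made up for illustration.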
Supported Languages
Dutch, English, French, German, Romanian, Turkish, Italian, Greek, Arabic, Polish, Russian, Spanish.
Performance
- One-time initialization: ~50ms (lazy, on first query)
- Per-query validation: <1ms
- Negligible impact on overall pipeline latency, since LLM calls take seconds
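One way to get the lazy, one-time initialization and the graceful degradation described above is to cache the detector behind a function. The sketch below assumes the 2.x Python API of `lingua-language-detector`, where confidence values expose `.language` and `.value`. The helper names `get_detector` and `lingua_guess` are hypothetical and not taken from the codebase.

```python
from functools import lru_cache
from typing import Optional, Tuple

# The twelve languages listed under "Supported Languages".
SUPPORTED = (
    "DUTCH", "ENGLISH", "FRENCH", "GERMAN", "ROMANIAN", "TURKISH",
    "ITALIAN", "GREEK", "ARABIC", "POLISH", "RUSSIAN", "SPANISH",
)


@lru_cache(maxsize=1)
def get_detector():
    """Build the detector on first use; return None if lingua is missing."""
    try:
        from lingua import Language, LanguageDetectorBuilder
    except ImportError:
        return None  # caller falls back to the LLM's detection
    languages = [getattr(Language, name) for name in SUPPORTED]
    return LanguageDetectorBuilder.from_languages(*languages).build()


def lingua_guess(text: str) -> Optional[Tuple[str, float]]:
    """Return (ISO 639-1 code, confidence) for the top candidate, or None."""
    detector = get_detector()
    if detector is None:
        return None
    values = detector.compute_language_confidence_values(text)
    if not values:
        return None
    top = values[0]  # values are sorted by confidence, highest first
    return top.language.iso_code_639_1.name.lower(), top.value
```

Because of `lru_cache`, the build cost is paid only on the first call. Later calls reuse the same detector, or the same `None` when the library is not installed.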
Why Lingua?
| Library | Short text accuracy | Deterministic | Speed |
|---|---|---|---|
| Lingua | Best | Yes | Good |
| langdetect | Poor | No | Slow |
| fast-langdetect | Good | Yes | Fastest |
Lingua is specifically optimized for short text detection using n-grams of sizes 1-5, making it ideal for our use case of 3-6 word hospital search queries.
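To make "n-grams of sizes 1-5" concrete, the following sketch extracts character n-grams from a short query. It only illustrates the idea and is not Lingua's internal implementation: the function `char_ngrams` is hypothetical, and splitting on whitespace is a simplifying assumption.

```python
from typing import Dict, List


def char_ngrams(text: str, max_n: int = 5) -> Dict[int, List[str]]:
    """Collect character n-grams of sizes 1..max_n from each word."""
    result: Dict[int, List[str]] = {n: [] for n in range(1, max_n + 1)}
    for word in text.lower().split():
        for n in range(1, max_n + 1):
            # A word shorter than n contributes no n-grams of that size.
            result[n].extend(word[i:i + n] for i in range(len(word) - n + 1))
    return result


grams = char_ngrams("bij arts")
print(grams[3])  # ['bij', 'art', 'rts']
```

Even a two-word query yields several n-grams at each size. Features like these are the kind of evidence a statistical detector can score when there are too few whole words to go on.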
Verification
"welke arts bij psoriasis" + LLM says "ro" → lingua overrides to "nl" ✓
"Cum pot face o programare?" + LLM says "ro" → lingua agrees: "ro" ✓
"What are visiting hours?" + LLM says "en" → lingua agrees: "en" ✓
"hartchirurgie" + LLM says "ro" → lingua uncertain, trusts LLM: "ro" ✓
Related
- ADR-0037
- Multilingual Prompts
- Implementation: backend/app/services/intent_classification_service.py → validate_detected_language()