
Lingua Language Detection Validation

ADR-0037 | Accepted | 2026-02-20

Problem

The chatbot must respond in the user's detected language, a requirement of the cross-lingual RAG pipeline (Lewis et al., 2020). The LLM detects the language as part of intent classification. However, short Dutch queries such as "welke arts bij psoriasis" were occasionally misclassified as Romanian, so the system responded in the wrong language.

Solution

We added lingua-language-detector as a statistical confidence check that validates the language detected by the LLM.

How It Works

User query → LLM classifies intent + language → Lingua validates → Final language

Scenario                              | Action
Lingua agrees with LLM                | Use LLM's detection
Lingua disagrees (confidence >= 50%)  | Override with lingua
Lingua uncertain (short/ambiguous)    | Trust LLM
Lingua unavailable                    | Trust LLM (graceful degradation)
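
The four scenarios above can be sketched as a small pure function. This is a minimal illustration, not the project's actual code: the name resolve_language, its parameters, and the use of lowercase ISO 639-1 strings are assumptions made for the example. Only the 50% threshold and the four outcomes come from the table.

```python
from typing import Optional

# The 50% confidence cut-off from the decision table.
OVERRIDE_THRESHOLD = 0.5


def resolve_language(
    llm_lang: str,
    lingua_lang: Optional[str],
    lingua_confidence: float = 0.0,
    threshold: float = OVERRIDE_THRESHOLD,
) -> str:
    """Return the final language code (hypothetical helper).

    llm_lang          -- code reported by the LLM, e.g. "ro"
    lingua_lang       -- lingua's top candidate, or None if lingua is
                         unavailable or returned nothing
    lingua_confidence -- lingua's confidence for that candidate (0.0 to 1.0)
    """
    # Lingua unavailable or no result: graceful degradation, trust the LLM.
    if lingua_lang is None:
        return llm_lang
    # Lingua agrees with the LLM: use the LLM's detection.
    if lingua_lang == llm_lang:
        return llm_lang
    # Lingua disagrees and is confident enough: override.
    if lingua_confidence >= threshold:
        return lingua_lang
    # Lingua disagrees but is uncertain (short/ambiguous input): trust the LLM.
    return llm_lang
```

Keeping the decision separate from the detector call means the rule can be unit-tested without loading any language models.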

Supported Languages

Dutch, English, French, German, Romanian, Turkish, Italian, Greek, Arabic, Polish, Russian, Spanish.

Performance

  • One-time initialization: ~50ms (lazy, on first query)
  • Per-query validation: <1ms
  • Negligible impact on overall pipeline latency, since LLM calls take seconds
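
One way to get lazy, one-time initialization together with graceful degradation is to build the detector behind a cached accessor. The sketch below is an assumption about how this could be wired up, not the project's implementation: get_detector, lingua_guess, and SUPPORTED are hypothetical names, and the code assumes the Language enum members and the ConfidenceValue fields (language, value) of recent lingua-language-detector releases.

```python
from functools import lru_cache
from typing import Optional, Tuple

# Names assumed to match members of lingua's Language enum.
SUPPORTED = (
    "DUTCH", "ENGLISH", "FRENCH", "GERMAN", "ROMANIAN", "TURKISH",
    "ITALIAN", "GREEK", "ARABIC", "POLISH", "RUSSIAN", "SPANISH",
)


@lru_cache(maxsize=1)
def get_detector():
    """Build the detector on first use; return None if lingua is missing."""
    try:
        from lingua import Language, LanguageDetectorBuilder
    except ImportError:
        # Graceful degradation: callers fall back to the LLM's detection.
        return None
    languages = [getattr(Language, name) for name in SUPPORTED]
    return LanguageDetectorBuilder.from_languages(*languages).build()


def lingua_guess(text: str) -> Tuple[Optional[str], float]:
    """Return (ISO 639-1 code, confidence) for lingua's top candidate."""
    detector = get_detector()
    if detector is None:
        return None, 0.0
    # Assumed to be sorted by confidence, highest first.
    values = detector.compute_language_confidence_values(text)
    if not values:
        return None, 0.0
    top = values[0]
    return top.language.iso_code_639_1.name.lower(), top.value
```

Because of the cache, the import and build cost is paid once, on the first query; later calls reuse the same detector or the same None result.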

Why Lingua?

Library          | Short text accuracy | Deterministic | Speed
Lingua           | Best                | Yes           | Good
langdetect       | Poor                | No            | Slow
fast-langdetect  | Good                | Yes           | Fastest

Lingua is specifically optimized for short text detection using n-grams of sizes 1-5, making it ideal for our use case of 3-6 word hospital search queries.
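
To make the n-gram idea concrete, the sketch below extracts character n-grams of sizes 1 to 5 from a short query. It only illustrates what such n-grams look like; it is not Lingua's internal implementation, and char_ngrams is a hypothetical helper.

```python
from typing import Dict, List


def char_ngrams(text: str, min_n: int = 1, max_n: int = 5) -> Dict[int, List[str]]:
    """Map each n to the character n-grams found inside each word."""
    words = text.lower().split()
    return {
        # A word shorter than n contributes nothing for that n.
        n: [w[i:i + n] for w in words for i in range(len(w) - n + 1)]
        for n in range(min_n, max_n + 1)
    }


grams = char_ngrams("welke arts")
print(grams[4])  # ['welk', 'elke', 'arts']
print(grams[5])  # ['welke']
```

A two-word query yields only a handful of long n-grams, so the shorter ones carry much of the evidence. That is why combining all sizes from 1 to 5 matters for 3-6 word queries.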

Verification

"welke arts bij psoriasis" + LLM says "ro" → lingua overrides to "nl" ✓
"Cum pot face o programare?" + LLM says "ro" → lingua agrees: "ro" ✓
"What are visiting hours?" + LLM says "en" → lingua agrees: "en" ✓
"hartchirurgie" + LLM says "ro" → lingua uncertain, trusts LLM: "ro" ✓