Retrieval Improvements Roadmap
This document describes improvements to the retrieval pipeline based on findings from the A/B experiment and gap analysis. Query decomposition (item 3) was implemented on 2026-02-17 behind a feature flag -- see ADR-0032. The embedding-migration item (item 1) has now seen two completed migrations: BGE-M3 in February 2026 (ADR-0033) and text-embedding-3-large in April 2026 (ADR-0048). Item 2 remains planned.
1. Embedding Model Migrations
The original BGE-M3 migration was completed in February 2026 (ADR-0033). It was subsequently superseded by a second migration to OpenAI text-embedding-3-large (1,536-dim, OpenAI API) in April 2026 — see ADR-0048 and @openai2024embeddings. See Embedding Models for the current state. BGE-M3 still survives in the stack as the optional ColBERT reranker model only.
The original embedding model (nomic-embed-text, 768 dimensions) was primarily trained on English data with limited multilingual coverage. It was not benchmarked on MTEB-NL (the Dutch embedding benchmark), making its Dutch retrieval quality unknown. The A/B experiment revealed that structurally similar Dutch medical queries could produce dangerously similar embeddings, contributing to semantic cache contamination.
Migration Summary
| Property | nomic-embed-text (initial) | BGE-M3 (Feb–Apr 2026) | text-embedding-3-large (current — ADR-0048) |
|---|---|---|---|
| MTEB-NL Retrieval Score | Not benchmarked | 60.0 | ~64.6 |
| Dimensions | 768 | 1,024 | 1,536 (truncated from 3,072) |
| Context Window | 8,192 tokens | 8,192 tokens | 8,191 tokens |
| Languages | English-primary | 100+ languages | Strong multilingual |
| Provider | Ollama | Ollama | OpenAI API |
| Cost | Free | Free | $0.13 / 1M tokens |
| Retrieval Modes | Dense only | Dense + Sparse + ColBERT | Dense only (ColBERT delegated to BGE-M3) |
| Architecture | nomic-bert | XLM-RoBERTa | OpenAI proprietary |
| Status | Replaced Feb 2026 | Replaced Apr 2026 (still used for ColBERT) | Current |
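The "1,536 (truncated from 3,072)" entry in the Dimensions row can be illustrated with a minimal sketch: keep the leading components of a vector and rescale it to unit length so cosine similarity still behaves as expected. This page does not say whether the pipeline shortens vectors client-side or requests the shorter size from the API, so the function name `truncate_and_normalize` and the toy 6-dimensional vector are assumptions for illustration only.

```python
import math


def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    if dims <= 0 or dims > len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy stand-in for a full-size embedding (hypothetical values).
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.1]
# Toy stand-in for the shortened vector that would be stored in the index.
short = truncate_and_normalize(full, 4)
```

Renormalizing after truncation matters because the shortened vector no longer has unit length, and an index that assumes normalized vectors would otherwise compute skewed similarities.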
Outcomes
- Better Dutch retrieval: BGE-M3 scored 60.0 on MTEB-NL and text-embedding-3-large scores ~64.6; nomic-embed-text was never benchmarked
- Improved semantic cache discrimination: Higher-dimensional embeddings produce more distinctive vectors for similar-but-different queries
- Better multilingual support: Superior cross-lingual embedding quality
- ColBERT support: Future option for late interaction retrieval (multi-vector matching per token)
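To make the cache-discrimination point concrete, here is a minimal sketch of a similarity-gated cache lookup. It is not the project's cache implementation: the function names, the 0.95 threshold, and the 2-dimensional toy vectors are all assumptions. It only shows why two queries whose embeddings are nearly identical would share a cached answer, and why more distinctive vectors reduce that risk.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cache_lookup(query_vec, cache, threshold=0.95):
    """Return the best-matching cached answer, or None if below the threshold."""
    best_answer, best_score = None, -1.0
    for key_vec, answer in cache:
        score = cosine(query_vec, key_vec)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer if best_score >= threshold else None


# Hypothetical cache with a single entry.
cache = [([1.0, 0.0], "cached answer A")]
hit = cache_lookup([0.99, 0.10], cache)   # nearly the same direction
miss = cache_lookup([0.60, 0.80], cache)  # clearly different direction
```

If two clinically different questions land as close together as the first pair, the cache returns the wrong answer; that is the contamination risk described for the original model.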
2. UMCU Dutch Medical Terminology Enrichment
Problem
The current taxonomy contains approximately 55 condition aliases, 20 treatment aliases, and 20 examination aliases -- all manually curated. While these cover the most common patient queries, they miss thousands of Dutch medical terms, patient-friendly synonyms, and colloquial expressions. The golden question gap analysis revealed that major conditions (astma, COPD, epilepsie, endometriose, Crohn, Alzheimer) and departments (Gastro-enterologie, Reumatologie, Infectiologie, Vaatchirurgie) had zero test coverage, partly because the taxonomy lacks the aliases patients would use to describe these conditions.
Proposed Solution
Integrate Dutch medical terminology from the UMCU Dutch Medical Concepts repository, which provides structured access to:
| Source | Concepts | Dutch Names | Semantic Types |
|---|---|---|---|
| UMLS (MeSH, MedDRA, ICD-10, ICPC) | 254,835 | 574,475 | Diseases, procedures, anatomy |
| SNOMED CT (Dutch edition) | 230,277 | 521,118 | Clinical terms, findings, procedures |
| HPO (Human Phenotype Ontology) | 13,360 | 29,164 | Rare diseases, phenotypes |
What This Enables
The UMCU data includes patient-friendly Dutch synonyms -- exactly the vocabulary gap our system needs to bridge. For example:
| Patient types... | UMCU provides... | Maps to... |
|---|---|---|
| "zuurbranden" | pyrosis, brandend maagzuur, gastro-oesofageale reflux | Gastro-enterologie |
| "spataders" | varices, varicosis, varikeuze venen | Vaatchirurgie |
| "vergeetachtig" | geheugenstoornis, cognitieve achteruitgang, dementie | Geriatrie / Neurologie |
| "benauwdheid" | dyspnoe, kortademigheid, respiratoire insufficiëntie | Pneumologie |
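The mapping in the table can be sketched as a plain alias lookup. The dictionary below copies the four rows above; the function `departments_for` is hypothetical and is not the signature of `resolve_search_query()`, which this page does not show.

```python
import re

# Patient term -> department, copied from the table above.
PATIENT_TERM_TO_DEPARTMENT = {
    "zuurbranden": "Gastro-enterologie",
    "spataders": "Vaatchirurgie",
    "vergeetachtig": "Geriatrie / Neurologie",
    "benauwdheid": "Pneumologie",
}


def departments_for(query, aliases=PATIENT_TERM_TO_DEPARTMENT):
    """Return departments for every known patient term in the query, in order."""
    found = []
    for word in re.findall(r"\w+", query.lower()):
        department = aliases.get(word)
        if department is not None and department not in found:
            found.append(department)
    return found


matches = departments_for("Ik heb last van zuurbranden en spataders")
```

A single-word lookup like this misses multi-word synonyms such as "brandend maagzuur"; a real integration would need phrase matching as well.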
Integration Architecture
UMCU Repository (GitHub)
└── create_concept_table.py
├── UMLS Metathesaurus (requires free UTS account)
└── SNOMED CT Dutch (requires free MLDS registration)
│
▼
Dutch Concept Tables (CSV)
│
▼
Filter Script (keep: diseases, procedures, examinations)
│
▼
Curated Subset (~2,000 most relevant terms)
│
▼
zol.yaml search_aliases + CONDITION_ALIASES + TREATMENT_ALIASES
│
▼
resolve_search_query() enhanced with 10x more aliases
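The filter step in the diagram could look roughly like the sketch below. The column names (`concept_id`, `term`, `semantic_type`), the concept IDs, and the sample rows are assumptions; the actual layout of the generated concept tables should be checked against the UMCU repository before relying on this. The semantic type labels are standard UMLS semantic type names.

```python
import csv
import io

# UMLS semantic types to keep (diseases, treatments, examinations).
KEEP_TYPES = {
    "Disease or Syndrome",
    "Therapeutic or Preventive Procedure",
    "Diagnostic Procedure",
}

# Hypothetical sample in an assumed column layout.
SAMPLE_CSV = """concept_id,term,semantic_type
X001,Varices,Disease or Syndrome
X001,spataders,Disease or Syndrome
X002,Femur,Body Location or Region
X003,Gastroscopie,Diagnostic Procedure
"""


def filter_concepts(csv_text, keep_types=KEEP_TYPES):
    """Group lowercased terms by concept, keeping only the wanted semantic types."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["semantic_type"] in keep_types:
            term = row["term"].strip().lower()
            grouped.setdefault(row["concept_id"], []).append(term)
    return grouped


concepts = filter_concepts(SAMPLE_CSV)
```

Grouping by concept keeps all Dutch names for the same concept together, which makes the later manual curation pass easier to review.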
Licensing
Both data sources are free:
- UMLS: Free license from NLM (US National Library of Medicine). Individual registration at UTS.
- SNOMED CT: Free for Belgian healthcare organizations via MLDS (Belgium is a member of SNOMED International, formerly IHTSDO).
Implementation Steps
- Register for UMLS (UTS account) and SNOMED CT (MLDS Belgian affiliate) -- ~30 minutes
- Clone the UMCU repository and generate concept tables
- Filter for relevant semantic types: diseases/conditions, procedures/treatments, examinations
- Curate a subset of ~2,000 most relevant terms (cross-reference with ZOL department list)
- Merge into zol.yaml search aliases and taxonomy alias maps
- Validate with golden evaluation (ensure no false-positive routing)
- Document the data lineage and update this page
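The merge and validation steps above can be sketched as a conflict-aware merge: imported aliases are added only when they do not contradict a manually curated entry, and contradictions are collected for human review instead of being overwritten. The function name and the flat alias-to-department dictionaries are assumptions, not the actual structure of zol.yaml or the taxonomy alias maps.

```python
def merge_aliases(curated, imported):
    """Merge imported aliases into curated ones; curated entries always win.

    Assumes curated keys are already lowercased.
    Returns (merged, conflicts), where each conflict is a tuple of
    (alias, curated_target, imported_target).
    """
    merged = dict(curated)
    conflicts = []
    for alias, target in imported.items():
        key = alias.strip().lower()
        existing = merged.get(key)
        if existing is None:
            merged[key] = target
        elif existing != target:
            conflicts.append((key, existing, target))
    return merged, conflicts


# Hypothetical inputs for illustration.
curated = {"spataders": "Vaatchirurgie"}
imported = {
    "Varices": "Vaatchirurgie",
    "spataders": "Dermatologie",  # disagrees with the curated entry
}
merged, conflicts = merge_aliases(curated, imported)
```

Surfacing conflicts rather than silently resolving them targets the false-positive routing risk directly: every disagreement becomes a reviewable item before the golden evaluation runs.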
Expected Impact
- 10-20x more condition/treatment aliases: From ~55 conditions to 500+ with patient-friendly Dutch synonyms
- Better entity resolution: More queries correctly route to the right department
- Reduced "information not found" responses: Patient vocabulary matches expanded taxonomy
- Improved query enrichment: More terms available for the existing resolve_search_query() pipeline
Risks
- False positives: Overly broad matching could route queries to wrong departments
- Maintenance burden: Periodic updates needed when UMLS/SNOMED releases new versions
- Curation effort: Raw data needs manual filtering to avoid irrelevant medical jargon
Effort Estimate
4-6 hours for registration + data import + curation + integration + validation.
3. Query Decomposition for Multi-Hop Reasoning
This improvement was implemented on 2026-02-17. See ADR-0032: Query Decomposition for full implementation details. Feature flag: query_decomposition_enabled (default: false).
Problem
Multi-hop queries require traversing multiple entity relationships to construct an answer. For example:
"Welke dokter behandelt rugpijn op campus Sint-Jan?"
This requires three traversals: rugpijn → Fysische Geneeskunde (condition→department) → Dr. X (department→doctor) → Sint-Jan (doctor→campus). The current pipeline rewrites this into a single query template, which may lose specificity or fail to capture all required entities.
The A/B experiment showed that multi-hop queries (2+ hops) had the lowest entity recall among non-safety categories, and the knowledge graph improved 2-hop queries by +9.4pp -- but there is still room for improvement.
Proposed Solution
Implement query decomposition: detect multi-hop queries during intent classification and split them into sequential sub-queries, each targeting a single relationship traversal.
How It Works
Original query:
"Welke dokter behandelt rugpijn op campus Sint-Jan?"
Decomposition (gpt-4.1-mini, via `structured_call(output_model=DecompositionOutput)` with retries):
Sub-query 1: "Welke afdeling behandelt rugpijn?"
Sub-query 2: "Welke dokters werken bij die afdeling?"
Sub-query 3: "Welke van die dokters werkt op campus Sint-Jan?"
Execution:
Sub-query 1 → Graph: rugpijn → Fysische Geneeskunde ✓
Sub-query 2 → Graph: Fysische Geneeskunde → [Dr. A, Dr. B, Dr. C] ✓
Sub-query 3 → Graph: filter by campus Sint-Jan → [Dr. A] ✓
Context assembly:
Merge all sub-query results → feed to LLM for response generation
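The chained execution can be sketched with in-memory dictionaries standing in for the knowledge graph. Everything here is a toy assumption (the graph contents, the campus name "Campus X", and the function name); the real implementation is described in ADR-0032. The sketch shows how each hop consumes the previous hop's result and how an empty first hop stops the chain early.

```python
# Toy stand-ins for graph edges (hypothetical data).
CONDITION_TO_DEPARTMENT = {"rugpijn": "Fysische Geneeskunde"}
DEPARTMENT_TO_DOCTORS = {"Fysische Geneeskunde": ["Dr. A", "Dr. B", "Dr. C"]}
DOCTOR_TO_CAMPUS = {"Dr. A": "Sint-Jan", "Dr. B": "Campus X", "Dr. C": "Campus X"}


def answer_multi_hop(condition, campus):
    """Run the three hops in sequence and return (doctors, trace)."""
    trace = []

    # Hop 1: condition -> department
    department = CONDITION_TO_DEPARTMENT.get(condition)
    trace.append(("condition->department", department))
    if department is None:
        return [], trace  # stop early so later hops cannot chain an error

    # Hop 2: department -> doctors
    doctors = DEPARTMENT_TO_DOCTORS.get(department, [])
    trace.append(("department->doctor", doctors))

    # Hop 3: keep only doctors on the requested campus
    on_campus = [d for d in doctors if DOCTOR_TO_CAMPUS.get(d) == campus]
    trace.append(("doctor->campus", on_campus))
    return on_campus, trace


doctors, trace = answer_multi_hop("rugpijn", "Sint-Jan")
```

Keeping a trace of intermediate results is one way to assemble the merged context for the final LLM call and to debug which hop went wrong.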
Implementation Architecture
Intent Classification
│
├── Simple query (0-1 hops) → existing pipeline
│
└── Multi-hop query (2+ hops) → Decomposition
│
├── gpt-4.1-mini generates sub-queries (structured output via structured_call)
│
├── Each sub-query executes independently
│ (graph lookup OR vector search)
│
├── Results merged with deduplication
│
└── Combined context → LLM generation
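The routing and merge steps in the diagram can be sketched as follows. The result shape (dictionaries with an `id` field) and both function names are assumptions; the sketch only illustrates the two-way routing on hop count and order-preserving deduplication across sub-query results.

```python
def route(hop_count):
    """Pick the pipeline branch based on the detected number of hops."""
    return "decomposition" if hop_count >= 2 else "existing_pipeline"


def merge_results(result_lists):
    """Flatten sub-query results, dropping duplicates by id and keeping order."""
    seen = set()
    merged = []
    for results in result_lists:
        for item in results:
            if item["id"] not in seen:
                seen.add(item["id"])
                merged.append(item)
    return merged


# Hypothetical sub-query outputs that overlap on one document.
sub_query_results = [
    [{"id": "doc-1", "text": "Fysische Geneeskunde behandelt rugpijn"}],
    [{"id": "doc-1", "text": "Fysische Geneeskunde behandelt rugpijn"},
     {"id": "doc-2", "text": "Dr. A werkt op campus Sint-Jan"}],
]
context = merge_results(sub_query_results)
```

Preserving first-seen order keeps the context roughly aligned with the order of the sub-queries, which tends to read more naturally to the generating model.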
When to Decompose
Not all queries benefit from decomposition. The system should decompose when:
- Multiple entities detected: Query contains 2+ entity types (e.g., condition + campus)
- Graph hops > 1: Intent classification detects a multi-hop pattern
- Compound question structure: Question contains "en", "welke...op welke", "waar...bij wie"
Simple queries (single entity, direct lookup) should bypass decomposition entirely.
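The three triggers above can be expressed as a small predicate. This is a sketch under assumptions: the function name, the argument shapes, and the regular expressions are illustrative and not taken from the intent classifier.

```python
import re

# Compound-structure markers from the list above, matched as whole words.
COMPOUND_PATTERNS = [
    re.compile(r"\ben\b"),
    re.compile(r"\bwelke\b.*\bop welke\b"),
    re.compile(r"\bwaar\b.*\bbij wie\b"),
]


def should_decompose(query, entity_types, graph_hops):
    """Return True if any of the three decomposition triggers fires."""
    if len(set(entity_types)) >= 2:  # multiple entity types detected
        return True
    if graph_hops > 1:               # multi-hop pattern detected
        return True
    text = query.lower()
    return any(pattern.search(text) for pattern in COMPOUND_PATTERNS)


multi_hop = should_decompose(
    "Welke dokter behandelt rugpijn op campus Sint-Jan?",
    ["condition", "campus"],
    3,
)
simple = should_decompose("Wat zijn de openingsuren?", [], 0)
compound = should_decompose(
    "Waar is radiologie en hoe maak ik een afspraak?", ["department"], 1
)
```

Matching a bare "en" is deliberately broad here and would fire on many simple questions; in practice that trigger probably needs to be combined with the entity or hop signals to avoid over-decomposition.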
Expected Impact
- +10-15% entity recall on multi-hop queries (current: 0.88 with hybrid)
- Better 3-hop coverage: Currently the weakest hop category (0.857)
- More complete answers: Each sub-query captures entities that a single query might miss
- Composable reasoning: Future foundation for agentic RAG patterns
Risks
- Latency: +500-800ms per query (additional LLM call + multiple graph lookups)
- Error propagation: If sub-query 1 returns wrong department, sub-queries 2-3 chain the error
- Over-decomposition: Simple queries might be unnecessarily split, reducing quality
- Complexity: Requires robust fallback logic when sub-queries fail
Effort Estimate
4-6 hours for implementation + testing + integration.
Priority Order
Based on impact-to-effort ratio and the principle of fixing fundamentals first:
| Priority | Improvement | Impact | Effort | Prerequisite |
|---|---|---|---|---|
| Implemented (Feb 2026, ADR-0033) | 1a. Embedding migration to BGE-M3 | | | |
| Implemented (Apr 2026, ADR-0048) | 1b. Embedding migration to text-embedding-3-large | | | |
| 2 | UMCU terminology enrichment | Medium-High | L | UMLS/SNOMED registration |
| Implemented (2026-02-17; structured_call helper since 2026-05-12) | 3. Query decomposition | | | |
Dependency Chain
Embedding Migrations (1a, 1b — both complete)
└── BGE-M3 (Feb 2026) replaced nomic-embed-text → +13% retrieval quality
└── text-embedding-3-large (Apr 2026, ADR-0048) replaced BGE-M3 → +~5% additional
└── BGE-M3 retained as ColBERT reranker model only
UMCU Terminology (2)
└── enhances existing query rewriting + taxonomy resolution
└── independent of embedding model choice
└── can run in parallel with (1) but validate after
Query Decomposition (3)
└── depends on accurate entity resolution (benefits from 2)
└── depends on reliable graph traversal
└── originally planned last; shipped 2026-02-17, ahead of (2)
Success Criteria
The criteria below were drafted before the embedding migration and the latency-optimization sprint; they have since been overtaken by reality. The current measured baseline (definitive run 2026-03-21, 302-question set v3.6, see thesis Chapter 4) is 99.0% (296/299) with median end-to-end latency 7.8 s (@beyer2016sre tail-reporting). The remaining roadmap goal is therefore the UMCU enrichment item; the embedding and decomposition items have shipped.
| Original criterion | Status | Current measured value |
|---|---|---|
| Golden evaluation pass rate ≥ 90 % (146 questions, v2.4) | Met (overtaken by v3.6) | 99.0 % (296/299) on 302-q v3.6 |
| Entity recall ≥ 0.92 (from current 0.915 with hybrid) | Met | 0.932 (95% CI [0.916, 0.965]) |
| Multi-hop entity recall ≥ 0.93 (from current 0.88) | Met | multi_hop_graph 100.0 % (37/37) |
| Multilingual entity recall maintained at 1.00 | Met | multilingual 100.0 % (16/16) |
| No safety regressions (100 % refusal accuracy) | Met | safety_refusal 100.0 % (14/14), adversarial_gcg 100.0 % (12/12) |
| Median response time < 16 seconds | Met (post ADR-0034) | P50 7,829 ms, P90 12,182 ms, P99 20,925 ms |
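For readers reproducing the figures in this table, the sketch below shows how a pass rate and latency percentiles of this kind can be computed. The nearest-rank percentile method and the toy latency samples are assumptions; this page does not state which percentile method the evaluation harness uses. Only the 296/299 pass count is taken from the table.

```python
import math


def nearest_rank_percentile(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def pass_rate_percent(passed, total):
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


# Pass count from the table above.
golden_pass_rate = pass_rate_percent(296, 299)

# Hypothetical latency samples in milliseconds (not measured data).
toy_latencies_ms = [5000, 6000, 7000, 8000, 9000,
                    10000, 12000, 15000, 18000, 21000]
p50 = nearest_rank_percentile(toy_latencies_ms, 50)
p90 = nearest_rank_percentile(toy_latencies_ms, 90)
p99 = nearest_rank_percentile(toy_latencies_ms, 99)
```

Reporting P90 and P99 next to the median matters because a small share of slow requests can exceed the 16-second target even when the median sits well below it.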
References
- Banar, N., & Lotfi, E. (2025). MTEB-NL and E5-NL: Embedding benchmark and models for Dutch. arXiv preprint, arXiv:2509.12340. https://arxiv.org/abs/2509.12340
- Chen, J., et al. (2024). BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint, arXiv:2402.03216. https://arxiv.org/abs/2402.03216
- Gao, L., et al. (2022). Precise zero-shot dense retrieval without relevance labels (HyDE). arXiv preprint, arXiv:2212.10496. https://arxiv.org/abs/2212.10496
- Tan, S., et al. (2024). UMCU Dutch Medical Concepts. GitHub. https://github.com/umcu/dutch-medical-concepts