Retrieval Improvements Roadmap

Status

This document describes improvements to the retrieval pipeline based on findings from the A/B experiment and gap analysis. Query decomposition (item 3) was implemented on 2026-02-17 behind a feature flag -- see ADR-0032. The embedding-migration item (item 1) has now seen two completed migrations: BGE-M3 in February 2026 (ADR-0033) and text-embedding-3-large in April 2026 (ADR-0048). Item 2 remains planned.

1. Embedding Model Migrations

Completed (twice)

The original BGE-M3 migration was completed in February 2026 (ADR-0033). It was superseded in April 2026 by a second migration to OpenAI text-embedding-3-large (1,536 dimensions, served through the OpenAI API); see ADR-0048 and @openai2024embeddings. See Embedding Models for the current state. BGE-M3 remains in the stack only as the model behind the optional ColBERT reranker.

The original embedding model (nomic-embed-text, 768 dimensions) was primarily trained on English data with limited multilingual coverage. It was not benchmarked on MTEB-NL (the Dutch embedding benchmark), making its Dutch retrieval quality unknown. The A/B experiment revealed that structurally similar Dutch medical queries could produce dangerously similar embeddings, contributing to semantic cache contamination.

Migration Summary

Property	nomic-embed-text (initial)	BGE-M3 (Feb–Apr 2026)	text-embedding-3-large (current — ADR-0048)
MTEB-NL Retrieval Score	Not benchmarked	60.0	~64.6
Dimensions	768	1,024	1,536 (truncated from 3,072)
Context Window	8,192 tokens	8,192 tokens	8,191 tokens
Languages	English-primary	100+ languages	Strong multilingual
Provider	Ollama	Ollama	OpenAI API
Cost	Free	Free	$0.13 / 1M tokens
Retrieval Modes	Dense only	Dense + Sparse + ColBERT	Dense only (ColBERT delegated to BGE-M3)
Architecture	nomic-bert	XLM-RoBERTa	OpenAI proprietary
Status	Replaced Feb 2026	Replaced Apr 2026 (still used for ColBERT)	Current
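
The "1,536 (truncated from 3,072)" entry can be illustrated with a minimal sketch. It assumes truncation means keeping the leading components and rescaling to unit length, which is the usual way to shorten such embeddings by hand. The helper name truncate_embedding and the toy vector are hypothetical; the pipeline may instead request the shorter size directly from the API.

```python
import math


def truncate_embedding(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit L2 norm."""
    if dims <= 0 or dims > len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a 3,072-dimensional embedding.
full = [3.0, 4.0, 12.0, 84.0]
short = truncate_embedding(full, 2)
```

Rescaling matters because cosine and dot-product comparisons only agree on unit-length vectors.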

Outcomes

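Better Dutch retrieval: BGE-M3 scored 60.0 on MTEB-NL and text-embedding-3-large scores ~64.6, whereas nomic-embed-text was never benchmarked
Improved semantic cache discrimination: Higher-dimensional embeddings (768 → 1,024 → 1,536) produce more distinctive vectors for similar-but-different queries
Better multilingual support: Both replacement models are multilingual, unlike the English-primary nomic-embed-text
ColBERT support: Late interaction retrieval (multi-vector matching per token) is available through BGE-M3, which is retained as the optional ColBERT reranker model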
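
The cache-discrimination point can be made concrete with a small sketch. It assumes the semantic cache accepts a hit when the cosine similarity between the new query embedding and a cached one reaches a threshold. The function names, the 0.95 threshold, and the two-dimensional vectors are illustrative assumptions, not values taken from the pipeline.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec: list[float], cached_vec: list[float],
                 threshold: float = 0.95) -> bool:
    """Hypothetical cache rule: reuse the cached answer above the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold


near_duplicate = is_cache_hit([1.0, 0.0], [0.99, 0.05])  # almost parallel
different = is_cache_hit([1.0, 0.0], [0.6, 0.8])         # cosine is 0.6
```

Contamination occurs when two queries with different correct answers land above the threshold; a more discriminative embedding pushes such pairs below it.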

2. UMCU Dutch Medical Terminology Enrichment

Problem

The current taxonomy contains approximately 55 condition aliases, 20 treatment aliases, and 20 examination aliases -- all manually curated. While these cover the most common patient queries, they miss thousands of Dutch medical terms, patient-friendly synonyms, and colloquial expressions. The golden question gap analysis revealed that major conditions (astma, COPD, epilepsie, endometriose, Crohn, Alzheimer) and departments (Gastro-enterologie, Reumatologie, Infectiologie, Vaatchirurgie) had zero test coverage, partly because the taxonomy lacks the aliases patients would use to describe these conditions.

Proposed Solution

Integrate Dutch medical terminology from the UMCU Dutch Medical Concepts repository, which provides structured access to:

Source	Concepts	Dutch Names	Semantic Types
UMLS (MeSH, MedDRA, ICD-10, ICPC)	254,835	574,475	Diseases, procedures, anatomy
SNOMED CT (Dutch edition)	230,277	521,118	Clinical terms, findings, procedures
HPO (Human Phenotype Ontology)	13,360	29,164	Rare diseases, phenotypes

What This Enables

The UMCU data includes patient-friendly Dutch synonyms -- exactly the vocabulary gap our system needs to bridge. For example:

Patient types...	UMCU provides...	Maps to...
"zuurbranden"	pyrosis, brandend maagzuur, gastro-oesofageale reflux	Gastro-enterologie
"spataders"	varices, varicosis, varikeuze venen	Vaatchirurgie
"vergeetachtig"	geheugenstoornis, cognitieve achteruitgang, dementie	Geriatrie / Neurologie
"benauwdheid"	dyspnoe, kortademigheid, respiratoire insufficiëntie	Pneumologie

Integration Architecture

UMCU Repository (GitHub)
  └── create_concept_table.py
       ├── UMLS Metathesaurus (requires free UTS account)
       └── SNOMED CT Dutch (requires free MLDS registration)
            │
            ▼
     Dutch Concept Tables (CSV)
            │
            ▼
  Filter Script (keep: diseases, procedures, examinations)
            │
            ▼
  Curated Subset (~2,000 most relevant terms)
            │
            ▼
  zol.yaml search_aliases + CONDITION_ALIASES + TREATMENT_ALIASES
            │
            ▼
  resolve_search_query() enhanced with 10x more aliases
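
The filter step in the diagram above could look roughly like the following sketch. The column names (concept_id, name, semantic_group), the group labels, and the sample rows are assumptions made for illustration; the actual layout of the generated concept tables should be checked against the UMCU repository before use.

```python
import csv
import io

# Hypothetical labels for the semantic groups worth keeping.
KEEP_GROUPS = {"disease", "procedure", "examination"}


def filter_concepts(csv_text: str, keep: set[str] = KEEP_GROUPS) -> list[dict]:
    """Return only the rows whose semantic group is in `keep`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["semantic_group"] in keep]


# Made-up sample rows; the identifiers are placeholders, not real concept IDs.
SAMPLE = """concept_id,name,semantic_group
C001,pyrosis,disease
C002,varices,disease
C003,femur,anatomy
"""

kept = filter_concepts(SAMPLE)
```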

Licensing

Both data sources are free:

UMLS: Free license from NLM (US National Library of Medicine). Individual registration at UTS.
SNOMED CT: Free for Belgian healthcare organizations via MLDS (Belgium is an IHTSDO member state).

Implementation Steps

Register for UMLS (UTS account) and SNOMED CT (MLDS Belgian affiliate) -- ~30 minutes
Clone the UMCU repository and generate concept tables
Filter for relevant semantic types: diseases/conditions, procedures/treatments, examinations
Curate a subset of ~2,000 most relevant terms (cross-reference with ZOL department list)
Merge into zol.yaml search aliases and taxonomy alias maps
Validate with golden evaluation (ensure no false-positive routing)
Document the data lineage and update this page

Expected Impact

10-20x more condition/treatment aliases: From ~55 conditions to 500+ with patient-friendly Dutch synonyms
Better entity resolution: More queries correctly route to the right department
Reduced "information not found" responses: Patient vocabulary matches expanded taxonomy
Improved query enrichment: More terms available for the existing resolve_search_query() pipeline

Risks

False positives: Overly broad matching could route queries to wrong departments
Maintenance burden: Periodic updates needed when UMLS/SNOMED releases new versions
Curation effort: Raw data needs manual filtering to avoid irrelevant medical jargon
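
One way to surface the false-positive risk before merging is to flag aliases that more than one department would claim. The sketch below assumes the candidate aliases are held as a department-to-aliases mapping; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def find_ambiguous_aliases(alias_map: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each alias claimed by two or more departments to those departments."""
    owners: dict[str, set[str]] = defaultdict(set)
    for department, aliases in alias_map.items():
        for alias in aliases:
            # Compare case-insensitively so "Vergeetachtig" and "vergeetachtig" collide.
            owners[alias.strip().lower()].add(department)
    return {alias: deps for alias, deps in owners.items() if len(deps) > 1}


candidate = {
    "Geriatrie": ["vergeetachtig", "geheugenstoornis"],
    "Neurologie": ["Vergeetachtig", "epilepsie"],
    "Vaatchirurgie": ["spataders"],
}
ambiguous = find_ambiguous_aliases(candidate)
```

Flagged aliases can then be reviewed by hand, or mapped to several departments deliberately, as with the Geriatrie / Neurologie example above.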

Effort Estimate

4-6 hours for registration + data import + curation + integration + validation.

3. Query Decomposition for Multi-Hop Reasoning

Implemented

This improvement was implemented on 2026-02-17. See ADR-0032: Query Decomposition for full implementation details. Feature flag: query_decomposition_enabled (default: false).

Problem

Multi-hop queries require traversing multiple entity relationships to construct an answer. For example:

"Welke dokter behandelt rugpijn op campus Sint-Jan?"

This requires three traversals: rugpijn → Fysische Geneeskunde (condition→department) → Dr. X (department→doctor) → Sint-Jan (doctor→campus). The current pipeline rewrites this into a single query template, which may lose specificity or fail to capture all required entities.

The A/B experiment showed that multi-hop queries (2+ hops) had the lowest entity recall among non-safety categories, and the knowledge graph improved 2-hop queries by +9.4pp -- but there is still room for improvement.

Proposed Solution

Implement query decomposition: detect multi-hop queries during intent classification and split them into sequential sub-queries, each targeting a single relationship traversal.

How It Works

Original query:
  "Welke dokter behandelt rugpijn op campus Sint-Jan?"

Decomposition (gpt-4.1-mini, via `structured_call(output_model=DecompositionOutput)` with retries):
  Sub-query 1: "Welke afdeling behandelt rugpijn?"
  Sub-query 2: "Welke dokters werken bij die afdeling?"
  Sub-query 3: "Welke van die dokters werkt op campus Sint-Jan?"

Execution:
  Sub-query 1 → Graph: rugpijn → Fysische Geneeskunde ✓
  Sub-query 2 → Graph: Fysische Geneeskunde → [Dr. A, Dr. B, Dr. C] ✓
  Sub-query 3 → Graph: filter by campus Sint-Jan → [Dr. A] ✓

Context assembly:
  Merge all sub-query results → feed to LLM for response generation
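
The execution trace above can be sketched over a toy in-memory graph. The dictionaries stand in for the knowledge graph, the doctor names are the placeholders from the trace, and "andere campus" is an invented placeholder; the function name doctors_for is hypothetical.

```python
# Toy relations standing in for the knowledge graph.
CONDITION_TO_DEPARTMENT = {"rugpijn": "Fysische Geneeskunde"}
DEPARTMENT_TO_DOCTORS = {"Fysische Geneeskunde": ["Dr. A", "Dr. B", "Dr. C"]}
DOCTOR_TO_CAMPUSES = {
    "Dr. A": {"Sint-Jan"},
    "Dr. B": {"andere campus"},  # placeholder campus
    "Dr. C": {"andere campus"},  # placeholder campus
}


def doctors_for(condition: str, campus: str) -> list[str]:
    """Chain the three sub-queries: condition -> department -> doctors -> campus."""
    department = CONDITION_TO_DEPARTMENT.get(condition)      # sub-query 1
    if department is None:
        return []  # stop early instead of propagating a wrong hop
    doctors = DEPARTMENT_TO_DOCTORS.get(department, [])      # sub-query 2
    return [d for d in doctors
            if campus in DOCTOR_TO_CAMPUSES.get(d, set())]   # sub-query 3


result = doctors_for("rugpijn", "Sint-Jan")
```

Each step consumes the previous step's result, which is also why an error in the first hop carries through to the later ones.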

Implementation Architecture

Intent Classification
  │
  ├── Simple query (0-1 hops) → existing pipeline
  │
  └── Multi-hop query (2+ hops) → Decomposition
       │
       ├── gpt-4.1-mini generates sub-queries (structured output via structured_call)
       │
       ├── Each sub-query executes independently
       │   (graph lookup OR vector search)
       │
       ├── Results merged with deduplication
       │
       └── Combined context → LLM generation
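
The routing, merging, and fallback behaviour in the diagram above can be sketched as follows. The callables decompose and run_single are hypothetical stand-ins for the LLM decomposition call and the existing single-query pipeline; this does not reflect the actual implementation described in ADR-0032.

```python
from typing import Callable


def answer_context(
    query: str,
    is_multi_hop: bool,
    decompose: Callable[[str], list[str]],
    run_single: Callable[[str], list[str]],
) -> list[str]:
    """Return deduplicated context passages for `query`."""
    if not is_multi_hop:
        return run_single(query)  # simple queries bypass decomposition
    try:
        merged: list[str] = []
        seen: set[str] = set()
        for sub_query in decompose(query):
            for passage in run_single(sub_query):
                if passage not in seen:  # deduplicate, keep first-seen order
                    seen.add(passage)
                    merged.append(passage)
        if merged:
            return merged
    except Exception:
        # Sketch-level fallback: any decomposition failure falls through
        # to the existing single-query pipeline below.
        pass
    return run_single(query)
```

Falling back to the undecomposed query keeps the worst case no worse than the existing pipeline, apart from the added latency.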

When to Decompose

Not all queries benefit from decomposition. The system should decompose when:

Multiple entities detected: Query contains 2+ entity types (e.g., condition + campus)
Graph hops > 1: Intent classification detects a multi-hop pattern
Compound question structure: Question contains "en", "welke...op welke", "waar...bij wie"

Simple queries (single entity, direct lookup) should bypass decomposition entirely.
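
The three triggers above can be combined into one predicate, sketched below. The function name, the parameter shapes, and the regular expressions are assumptions; in particular, matching a bare "en" is very broad and would need tuning against the over-decomposition risk.

```python
import re

# Hypothetical patterns for the compound question structures listed above.
COMPOUND_PATTERNS = [
    re.compile(r"\ben\b"),
    re.compile(r"\bwelke\b.*\bop welke\b"),
    re.compile(r"\bwaar\b.*\bbij wie\b"),
]


def should_decompose(query: str, entity_types: set[str], graph_hops: int) -> bool:
    """Return True when any of the three decomposition triggers fires."""
    if len(entity_types) >= 2:  # multiple entity types detected
        return True
    if graph_hops > 1:          # multi-hop pattern from intent classification
        return True
    text = query.lower()
    return any(pattern.search(text) for pattern in COMPOUND_PATTERNS)


multi_hop = should_decompose(
    "Welke dokter behandelt rugpijn op campus Sint-Jan?",
    {"condition", "campus"},
    3,
)
simple = should_decompose("Wat is astma?", {"condition"}, 0)
```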

Expected Impact

+10-15% entity recall on multi-hop queries (current: 0.88 with hybrid)
Better 3-hop coverage: Currently the weakest hop category (0.857)
More complete answers: Each sub-query captures entities that a single query might miss
Composable reasoning: Future foundation for agentic RAG patterns

Risks

Latency: +500-800ms per query (additional LLM call + multiple graph lookups)
Error propagation: If sub-query 1 returns wrong department, sub-queries 2-3 chain the error
Over-decomposition: Simple queries might be unnecessarily split, reducing quality
Complexity: Requires robust fallback logic when sub-queries fail

Effort Estimate

4-6 hours for implementation + testing + integration.

Priority Order

Based on impact-to-effort ratio and the principle of fixing fundamentals first:

Priority	Improvement	Impact	Effort (L = low, M = medium)	Prerequisite / Status
1a	~~BGE-M3 embedding migration~~	~~High~~	L	Implemented (Feb 2026, ADR-0033)
1b	~~text-embedding-3-large migration~~	~~High~~	M	Implemented (Apr 2026, ADR-0048)
2	UMCU terminology enrichment	Medium-High	L	UMLS/SNOMED registration
3	~~Query decomposition~~	~~Medium~~	L	Implemented (2026-02-17; `structured_call` helper since 2026-05-12)

Dependency Chain

Embedding Migrations (1a, 1b — both complete)
  └── BGE-M3 (Feb 2026) replaced nomic-embed-text → +13% retrieval quality
  └── text-embedding-3-large (Apr 2026, ADR-0048) replaced BGE-M3 → +~5% additional
  └── BGE-M3 retained as ColBERT reranker model only

UMCU Terminology (2)
  └── enhances existing query rewriting + taxonomy resolution
  └── independent of embedding model choice
  └── can run in parallel with (1) but validate after

Query Decomposition (3)
  └── depends on accurate entity resolution (benefits from 2)
  └── depends on reliable graph traversal
  └── originally planned last (builds on improved foundation); shipped on 2026-02-17 behind a feature flag, ahead of (2)

Success Criteria

The criteria below were drafted before the embedding migrations and the latency-optimization sprint, and measured results have since overtaken them. The current baseline (definitive run 2026-03-21, 302-question set v3.6, see thesis Chapter 4) is a 99.0% golden evaluation pass rate (296/299) with a median end-to-end latency of 7.8 s (tail percentiles reported following @beyer2016sre). The only remaining roadmap goal is therefore the UMCU enrichment item; the embedding and decomposition items have shipped.

Original criterion	Status	Current measured value
Golden evaluation pass rate ≥ 90 % (146 questions, v2.4)	Met (overtaken by v3.6)	99.0 % (296/299) on 302-q v3.6
Entity recall ≥ 0.92 (from current 0.915 with hybrid)	Met	0.932 (95% CI [0.916, 0.965])
Multi-hop entity recall ≥ 0.93 (from current 0.88)	Met	multi_hop_graph 100.0 % (37/37)
Multilingual entity recall maintained at 1.00	Met	multilingual 100.0 % (16/16)
No safety regressions (100 % refusal accuracy)	Met	safety_refusal 100.0 % (14/14), adversarial_gcg 100.0 % (12/12)
Median response time < 16 seconds	Met (post ADR-0034)	P50 7,829 ms, P90 12,182 ms, P99 20,925 ms
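
For readers reproducing the latency row, the sketch below computes P50, P90, and P99 with the nearest-rank method. The source does not state which percentile method was used, so nearest-rank is an assumption, and the sample latencies are made up.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in milliseconds, for illustration only.
latencies_ms = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
```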

References

Banar, N., & Lotfi, E. (2025). MTEB-NL and E5-NL: Embedding benchmark and models for Dutch. arXiv preprint, arXiv:2509.12340. https://arxiv.org/abs/2509.12340
Chen, J., et al. (2024). BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint, arXiv:2402.03216. https://arxiv.org/abs/2402.03216
Gao, L., et al. (2022). Precise zero-shot dense retrieval without relevance labels (HyDE). arXiv preprint, arXiv:2212.10496. https://arxiv.org/abs/2212.10496
Tan, S., et al. (2024). UMCU Dutch Medical Concepts. GitHub. https://github.com/umcu/dutch-medical-concepts
