Chapter 2: Literature Review
2.1 Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation, introduced by Lewis et al. (2020), addresses a fundamental limitation of large language models: their knowledge is frozen at training time. RAG augments generation by first retrieving relevant documents from an external knowledge base, then conditioning the language model's output on this retrieved context. This combines the broad language understanding of LLMs with the factual grounding of an information-retrieval system. Earlier, Guu et al. (2020) introduced REALM, demonstrating that retrieval-augmented pre-training could improve language-model factuality. Gao et al. (2024) survey the RAG landscape, identifying three generations (Naive, Advanced, and Modular RAG), with current production systems increasingly combining multiple retrieval and filtering techniques. The system described in this thesis is a Modular RAG system in that typology.
The canonical RAG architecture follows a retrieve-then-generate pattern:
- Query encoding: The user's question is embedded into a dense vector representation.
- Retrieval: The query vector is compared against a document store to find semantically similar passages.
- Context assembly: Retrieved passages are concatenated and formatted as context for the LLM.
- Generation: The LLM produces a response conditioned on both the query and the retrieved context.
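The four stages above can be sketched end to end as follows. This is a minimal illustration rather than the thesis system: `embed` is a toy bag-of-words stand-in for a learned dense encoder, the document store is an in-memory list, and `llm` is any caller-supplied function from prompt to text. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list[str], top_k: int = 2) -> list[str]:
    """Stages 1-2: encode the query and rank stored passages by similarity."""
    q = embed(query)
    ranked = sorted(store, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:top_k]


def assemble_context(passages: list[str]) -> str:
    """Stage 3: concatenate passages into a numbered context block."""
    return "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))


def answer(query: str, store: list[str], llm: Callable[[str], str],
           top_k: int = 2) -> str:
    """Stage 4: condition the LLM on both the query and the retrieved context."""
    context = assemble_context(retrieve(query, store, top_k))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)
```

Passing `llm=lambda prompt: prompt` returns the assembled prompt unchanged, which is a convenient way to inspect what the generator would see.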
This simple architecture has proven remarkably effective across domains but suffers from well-documented failure modes. When retrieval returns irrelevant or partially relevant documents, the LLM may hallucinate plausible-sounding but unsupported claims (Ji et al. 2023). When the correct information exists in the knowledge base but is not retrieved (recall failure), the system cannot answer correctly regardless of LLM capability. Liu et al. (2024) further showed that LLMs systematically under-attend to information in the middle of long contexts, a "lost-in-the-middle" effect that motivates careful chunk ordering and a bounded context budget. These challenges are particularly acute in medical information systems, where a hallucinated answer can misinform patients.
2.1.1 Hybrid Search and Reciprocal Rank Fusion
Dense retrieval (using learned embeddings) and sparse retrieval (using keyword matching such as BM25, Robertson and Zaragoza 2009) have complementary strengths. Dense retrieval excels at semantic matching — finding documents that discuss the same concept using different terminology — while sparse retrieval is superior for exact entity matching, such as doctor names or specific medical codes (Karpukhin et al. 2020).
Reciprocal Rank Fusion (RRF; Cormack et al. 2009) combines ranked lists from multiple retrieval systems without requiring score calibration. For each document, RRF computes a fused score as the sum of reciprocal ranks across all systems:
RRF(d) = Σᵢ 1 / (k + rankᵢ(d))
where rankᵢ(d) is the rank of document d in the i-th ranked list and k is a constant (typically 60). The application of RRF to multilingual medical information retrieval is detailed in Section 3.2.
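The formula can be illustrated with a short sketch. It assumes ranks start at 1 and that a document missing from a list simply contributes nothing for that list; the function name and document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs into (doc_id, score) pairs, best first.

    Each list contributes 1 / (k + rank) for every document it contains,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["d1", "d2", "d3"]   # e.g. ranking from an embedding index
sparse = ["d3", "d1", "d4"]  # e.g. ranking from BM25
fused = reciprocal_rank_fusion([dense, sparse])
```

Here `d1` scores 1/61 + 1/62 and `d3` scores 1/63 + 1/61, so `d1` ranks first even though each system placed a different document on top. Only rank positions enter the computation, which is why no score calibration between the two systems is needed.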
2.1.2 Embedding Models for Multilingual Retrieval
Embedding model selection is critical for retrieval quality in multilingual settings. General-purpose English embedding models (e.g., nomic-embed-text) often underperform on non-English content due to limited multilingual training data.
BGE-M3 (Chen et al. 2024) addresses this limitation by supporting over 100 languages with a single model, producing 1024-dimensional embeddings trained on multilingual retrieval data. Its multi-functionality, combining dense, sparse, and multi-vector retrieval in one model, makes it particularly suitable for hybrid search architectures. The model's multilingual capability is essential for domains where content exists in one language (e.g., Dutch medical documentation) but queries arrive in multiple languages. The thesis system as evaluated used BGE-M3 via Ollama; subsequent operational experience with voice-channel latency motivated a migration to OpenAI text-embedding-3-large (1536 dimensions, hosted); see ADR-0048. The literature comparison in this section nevertheless remains the relevant academic backdrop for the BGE-M3 class of models.
2.1.3 Reranking
Cross-encoder reranking (Nogueira and Cho 2019) is a well-established technique for improving retrieval precision. Unlike bi-encoder retrieval, which encodes query and document independently, cross-encoders jointly process the query-document pair, enabling fine-grained interaction modelling at higher computational cost. Lightweight cross-encoder models such as ms-marco-MiniLM-L-6-v2 offer a practical trade-off between reranking quality and latency, making always-on reranking feasible in production pipelines. ColBERT (Khattab and Zaharia 2020) provides a third architectural option, late-interaction multi-vector matching, at a cost between that of bi-encoders and cross-encoders.
While Lewis et al. (2020) established the foundational RAG paradigm, and subsequent work has refined retrieval quality through hybrid approaches (Karpukhin et al. 2020) and reranking (Nogueira and Cho 2019), these works primarily target English-language corpora and benchmarks such as MS MARCO (Bajaj et al. 2016) and Natural Questions (Kwiatkowski et al. 2019); see also BEIR (Thakur et al. 2021). The application of RAG to multilingual medical information retrieval, where vocabulary mismatch between patient terminology and clinical nomenclature compounds the standard retrieval challenge, remains underexplored in the literature.
2.2 Corrective RAG (CRAG)
Yan et al. (2024) introduce Corrective Retrieval-Augmented Generation (CRAG), which addresses the brittleness of standard RAG when retrieval quality is uncertain. The key insight is that retrieval results fall on a spectrum from clearly relevant to clearly irrelevant, with a significant "ambiguous" middle zone that standard binary quality gates handle poorly.
CRAG introduces a ternary classification of retrieval quality:
Table 2.1. CRAG ternary retrieval quality classification and corresponding actions.
| Classification | Action |
|---|---|
| Correct | Proceed with generation using retrieved context |
| Ambiguous | Refine retrieval with relaxed parameters, then re-assess |
| Incorrect | Abstain from answering |
The refinement step for ambiguous queries is what distinguishes CRAG from simpler quality gates. Rather than immediately refusing a query when retrieval confidence is moderate, CRAG attempts recovery by broadening the search — lowering similarity thresholds, expanding the candidate set, or relaxing filters. This is particularly valuable in medical information retrieval, where a patient's colloquial phrasing may not closely match clinical documentation but where relevant content nonetheless exists.
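The ternary gate and its refinement step can be sketched as below. This is an illustration under stated assumptions, not the procedure of Yan et al. or of the thesis system: the thresholds (0.75 and 0.40), the `search` and `evaluate` callables, and the choice to abstain when the result is still ambiguous after one refinement are all hypothetical.

```python
from enum import Enum
from typing import Callable, Optional


class Quality(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INCORRECT = "incorrect"


def classify(best_score: float, upper: float = 0.75, lower: float = 0.40) -> Quality:
    """Map the best relevance score onto the three CRAG classes."""
    if best_score >= upper:
        return Quality.CORRECT
    if best_score >= lower:
        return Quality.AMBIGUOUS
    return Quality.INCORRECT


def crag_gate(
    query: str,
    search: Callable[[str, int], list[str]],   # (query, top_k) -> passages
    evaluate: Callable[[str, str], float],     # (query, passage) -> relevance in [0, 1]
    top_k: int = 3,
    relaxed_top_k: int = 10,
) -> Optional[list[str]]:
    """Return passages to generate from, or None to abstain."""

    def assess(k: int):
        passages = search(query, k)
        best = max((evaluate(query, p) for p in passages), default=0.0)
        return passages, classify(best)

    passages, quality = assess(top_k)
    if quality is Quality.AMBIGUOUS:
        # Refinement: expand the candidate set once, then re-assess.
        passages, quality = assess(relaxed_top_k)
    # Assumption: anything short of CORRECT after refinement means abstain.
    return passages if quality is Quality.CORRECT else None
```

Expanding the candidate set is only one of the relaxations mentioned above; lowering a similarity threshold or dropping metadata filters would slot into `assess` in the same way.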
The application of CRAG to domain-specific RAG systems is discussed in Section 3.2, Stage 9.
2.3 Context Filtering (FILCO)
Wang et al. (2024) propose FILCO (Filtering Context), a technique for improving RAG generation quality by removing irrelevant sentences from retrieved passages before they are presented to the LLM. The observation is that retrieved documents, even when topically relevant, often contain sentences not directly pertinent to the query. These irrelevant sentences consume context-window tokens, dilute the signal-to-noise ratio, and can cause the LLM to generate responses that drift from the query's intent, a generative analogue of the "lost-in-the-middle" effect documented by Liu et al. (2024).
FILCO operates at sentence granularity:
- Each retrieved passage is segmented into sentences.
- Each sentence is scored for relevance to the query (using embedding similarity or a lightweight classifier).
- Sentences below a relevance threshold are removed.
- The filtered context is passed to the LLM for generation.
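The steps above can be sketched as follows. As a simplification, relevance is scored here by lexical overlap with the query instead of embedding similarity or a trained classifier; the regex-based sentence splitter, the function names, and the 0.3 threshold are hypothetical choices for illustration.

```python
import re


def split_sentences(passage: str) -> list[str]:
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]


def tokens(text: str) -> set[str]:
    """Lower-cased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def relevance(query: str, sentence: str) -> float:
    """Fraction of query tokens that also occur in the sentence."""
    q = tokens(query)
    return len(q & tokens(sentence)) / len(q) if q else 0.0


def filter_context(query: str, passages: list[str], threshold: float = 0.3) -> list[str]:
    """Keep only sentences at or above the threshold; drop passages left empty."""
    filtered = []
    for passage in passages:
        kept = [s for s in split_sentences(passage) if relevance(query, s) >= threshold]
        if kept:
            filtered.append(" ".join(kept))
    return filtered
```

The splitter will mis-segment abbreviations such as "Dr." or "e.g.", so a production pipeline would likely substitute a proper sentence tokenizer and a semantic scorer while keeping the same overall shape.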
The application of FILCO to domain-specific RAG systems is discussed in Section 3.2, Stage 9.
2.4 Knowledge Graphs in Healthcare
Knowledge graphs represent structured relationships between entities and have been applied extensively in biomedical domains (Chandak et al. 2023; for work combining knowledge graphs with RAG see Sarmah et al. 2024 and Soman et al. 2024). In healthcare information systems, knowledge graphs enable reasoning about entity relationships that flat document retrieval cannot capture:
- Doctor–Department relationships: "Which doctors work in Cardiology?"
- Condition–Department mappings: "Which department treats diabetes?"
- Multi-hop reasoning: "What treatments does the department that treats heart failure offer?" (condition → department → treatments)
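The multi-hop example can be made concrete with a small in-memory triple store. This is a sketch only: the relation names (`TREATED_BY`, `OFFERS`) and the treatment entities are hypothetical and do not reflect the schema described later in the thesis.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Minimal triple store indexed by (subject, relation)."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self._edges[(subject, relation)].add(obj)

    def neighbours(self, subject: str, relation: str) -> set[str]:
        """Objects reachable from subject over one edge of the given relation."""
        return set(self._edges.get((subject, relation), ()))

    def follow(self, start: str, relations: list[str]) -> set[str]:
        """Traverse a fixed path of relations, one hop per relation."""
        frontier = {start}
        for relation in relations:
            frontier = {o for s in frontier for o in self.neighbours(s, relation)}
        return frontier


kg = KnowledgeGraph()
kg.add("Heart failure", "TREATED_BY", "Cardiology")
kg.add("Cardiology", "OFFERS", "Echocardiography")
kg.add("Cardiology", "OFFERS", "Pacemaker implantation")

# condition -> department -> treatments
treatments = kg.follow("Heart failure", ["TREATED_BY", "OFFERS"])
```

A vector index would have to find a single passage that happens to mention both the condition and the treatments, whereas the graph composes two independently stored facts.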
The integration of knowledge graphs with RAG systems has emerged as a significant research direction. GraphRAG (Edge et al. 2024) demonstrated that graph-derived context can improve LLM response quality for questions requiring relationship reasoning. HybridRAG (Sarmah et al. 2024) formalises the pattern of fusing knowledge-graph retrieval with vector retrieval. Soman et al. (2024) demonstrate the value of biomedical knowledge-graph-grounded prompt generation for medical-domain LLM applications. However, these approaches generally assume unconditional fusion of structured and unstructured knowledge. Healthcare information systems benefit from graph schemas designed for the entity relationships of the medical domain and, as the experiments in Chapter 4 show, from a conditional injection policy that selects when graph context helps and when it harms.
The application to hospital entity relationships is detailed in Section 3.3.
2.5 SNOMED CT Medical Terminology
SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms) is the most comprehensive multilingual clinical-terminology system, maintained by SNOMED International. The Belgian Edition contains approximately 280 000 concepts with 580 000 Dutch descriptions, including synonyms (SNOMED International 2024). SNOMED CT builds on decades of biomedical-terminology work; Bodenreider (2004) provides a comprehensive overview of the Unified Medical Language System (UMLS), the broader terminology framework within which SNOMED CT operates.
For hospital information retrieval systems, SNOMED CT addresses a fundamental challenge: medical synonym resolution. A hand-maintained taxonomy of medical term mappings (e.g., "suikerziekte" → "Diabetes Mellitus") does not scale: every new document ingested surfaces terminology not yet in the taxonomy — a "whack-a-mole" pattern that worsens with corpus growth.
SNOMED CT provides:
- Validated Dutch synonyms: Each concept has a Preferred Term and multiple Acceptable Synonyms, eliminating the need for hand-curated alias dictionaries.
- Hierarchical relationships: The IS-A hierarchy enables queries like "find all types of cancer" (Borstkanker IS_A Kanker IS_A Clinical Finding).
- Cross-terminology mappings: Official maps to ICD-10, LOINC, and ATC facilitate interoperability.
- Institutional mandate: Belgium mandates SNOMED CT for primary diagnoses by 2027, making integration a strategic investment.
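The first two capabilities, synonym resolution and IS-A traversal, can be illustrated with toy data. The tables below are hypothetical stand-ins: real SNOMED CT identifies concepts by numeric identifiers and ships descriptions and relationships as release files, and "Longkanker" is added here purely to give the hierarchy a second leaf.

```python
# Hypothetical excerpt: lower-cased description -> preferred term.
PREFERRED_TERM = {
    "suikerziekte": "Diabetes mellitus",
    "diabetes mellitus": "Diabetes mellitus",
    "borstkanker": "Borstkanker",
}

# Hypothetical excerpt: concept -> direct parents (a concept may have several).
IS_A = {
    "Borstkanker": {"Kanker"},
    "Longkanker": {"Kanker"},
    "Kanker": {"Clinical finding"},
    "Diabetes mellitus": {"Clinical finding"},
}


def resolve(term: str):
    """Map a colloquial or synonymous term to its preferred term, if known."""
    return PREFERRED_TERM.get(term.strip().casefold())


def ancestors(concept: str) -> set[str]:
    """All concepts reachable by following IS-A edges upwards."""
    seen, stack = set(), [concept]
    while stack:
        for parent in IS_A.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def descendants(concept: str) -> set[str]:
    """All concepts that have the given concept among their ancestors."""
    return {c for c in IS_A if concept in ancestors(c)}
```

With these tables, `resolve("Suikerziekte")` yields the preferred term for diabetes, and `descendants("Kanker")` answers the "find all types of cancer" query without any hand-maintained list of cancer types.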
The application of SNOMED CT to knowledge graph enrichment is detailed in Section 3.3.3.
2.6 Safety in Medical NLP Systems
Deploying NLP systems in healthcare contexts requires rigorous safety considerations. The distinction between information provision and medical advice is legally and ethically critical: a system that tells a patient "Cardiology treats heart conditions at ZOL" is providing navigational information, while a system that says "You should see a cardiologist" is providing medical advice. The European AI Act (Regulation (EU) 2024/1689) classifies healthcare AI systems as high-risk in Annex III, imposing requirements on risk management, data governance, technical documentation, transparency, and human oversight (Articles 9–15). While the Act does not explicitly distinguish information provision from medical advice, its risk-based classification framework implies that hospital information-retrieval systems must carefully define their intended purpose: a system that navigates patients to services carries different obligations than one recommending diagnoses or treatments. The negative classification of the system as a non-medical-device under the EU Medical Device Regulation (Regulation (EU) 2017/745) Article 2(1) and Annex VIII Rule 11 is a load-bearing argument for the system's safety posture, and is documented in Section 3.7. The HLEG Ethics Guidelines for Trustworthy AI (European Commission HLEG 2019) provide the policy lineage that informed the AI Act's ethics framing.
Several safety mechanisms are relevant to hospital information retrieval:
2.6.1 Intent Classification
Intent classification serves as a pre-retrieval safety gate, categorizing user queries into intent types (informational, navigational, medical_advice, greeting, etc.) and blocking unsafe categories before any retrieval or generation occurs. This is the most cost-effective safety layer because it prevents expensive LLM calls for queries that should never receive a generated response.
2.6.2 Adversarial Attack Detection
Zou et al. (2023) demonstrate that short gibberish-token suffixes (GCG attacks) can bypass LLM safety alignment, with reported success rates around 88% on contemporaneous models. These attacks are invisible to regex-based injection filters because they exploit model-internal token patterns rather than semantic content. Liao et al. (2024) generalise the threat with a generative model of adversarial suffixes (AmpleGCG), demonstrating that the threat class is not static and that defences must assume a steady stream of new suffix patterns. The OWASP Top 10 for LLM Applications (OWASP 2025) classifies prompt injection (LLM01) as the top risk for LLM applications. Statistical detection approaches, which measure perplexity, dictionary-word ratios, and character entropy, can identify GCG-style inputs in under 5 ms without requiring LLM inference; this is the basis for the system's first-line defence (Section 3.4, Layer 2).
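Two of the three statistical signals, character entropy and dictionary-word ratio, need no model at all and can be sketched in a few lines. Perplexity is omitted because it requires a language model. The thresholds, the vocabulary, and the example strings below are hypothetical and are not the values used by the system.

```python
import math
import re
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def dictionary_ratio(text: str, vocabulary: set[str]) -> float:
    """Fraction of alphabetic words found in the vocabulary.

    Returns 0.0 when the text contains no alphabetic words at all.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    return sum(w in vocabulary for w in words) / len(words)


def looks_adversarial(text: str, vocabulary: set[str],
                      min_dict_ratio: float = 0.5,
                      max_entropy: float = 5.0) -> bool:
    """Flag input that has too few known words or unusually high entropy."""
    return (dictionary_ratio(text, vocabulary) < min_dict_ratio
            or char_entropy(text) > max_entropy)


vocab = {"which", "doctors", "work", "in", "cardiology"}
benign = "Which doctors work in cardiology?"
gibberish = "xq!! zvb]]( ;;kwp describ}{ ~~ttq"
```

In this toy setting `looks_adversarial(benign, vocab)` is false and `looks_adversarial(gibberish, vocab)` is true. For multilingual input the vocabulary would need to cover every supported language, otherwise legitimate non-English queries would be flagged.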
2.6.3 LLM-as-Judge Safety Validation
Post-generation safety validation uses a secondary LLM call to evaluate whether the generated response contains medical advice, regardless of what the safety filters detected in the input. This defence-in-depth approach catches cases where safe-looking queries elicit unsafe responses through subtle prompt manipulation.
2.6.4 Guardrails
Meta's Llama Guard (Inan et al. 2023) provides input/output safety classification using a fine-tuned LLM. It categorises content across multiple safety dimensions and can be applied both to user inputs and to generated outputs, providing an independent safety assessment from a different model architecture than the generation LLM.
2.7 Evaluation Frameworks
Evaluating RAG systems is challenging because multiple components — retrieval, context assembly, and generation — must work in concert. The RAGAS framework (Es et al. 2024) proposes four metrics for RAG evaluation:
- Faithfulness: The proportion of claims in the generated response that are supported by the retrieved context. Measures hallucination.
- Answer Relevancy: How well the generated response addresses the original question. Measures relevance.
- Context Precision: The proportion of retrieved passages that are relevant to the question. Measures retrieval noise.
- Context Recall: The proportion of information needed to answer the question that appears in the retrieved context. Measures retrieval coverage.
Additionally, entity recall — the fraction of expected entities mentioned in the response — provides a concrete, deterministic metric for evaluating factual completeness. The evaluation framework design is described in Section 3.5.
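Entity recall is simple enough to state directly in code. The sketch below assumes case-insensitive substring matching and treats an empty expectation list as trivially satisfied; both choices, and the example entities, are assumptions for illustration.

```python
def entity_recall(response: str, expected_entities: list[str]) -> float:
    """Fraction of expected entities that appear in the response.

    Matching is a case-insensitive substring test, so the metric is
    deterministic but blind to paraphrases and spelling variants.
    """
    if not expected_entities:
        return 1.0
    text = response.casefold()
    found = sum(entity.casefold() in text for entity in expected_entities)
    return found / len(expected_entities)


score = entity_recall(
    "Dr. Peeters works in Cardiology.",
    ["Dr. Peeters", "Cardiology", "Campus A"],
)
```

Two of the three expected entities occur in the response, so `score` is 2/3. Unlike the LLM-judged RAGAS metrics, repeated runs on the same response always give the same value.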
A gold-standard evaluation framework can define questions across multiple categories (entity lookups, relationship mappings, multi-hop reasoning, safety refusals, adversarial inputs, and multilingual queries), each specifying expected entities, expected behaviour (answer or refuse), and metadata for stratified analysis. The design and application of such a framework are detailed in Section 3.5.
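One possible shape for such a question record, together with a per-category breakdown, is sketched below. The field names, category labels, and example questions are hypothetical and are not taken from the framework described in Section 3.5.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenQuestion:
    question: str
    category: str                 # e.g. "entity_lookup", "safety_refusal"
    expected_behavior: str        # "answer" or "refuse"
    expected_entities: tuple[str, ...] = ()
    language: str = "nl"          # metadata for stratified analysis


def pass_rate_by_category(results):
    """Aggregate (GoldenQuestion, passed) pairs into a pass rate per category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for question, passed in results:
        totals[question.category][0] += int(passed)
        totals[question.category][1] += 1
    return {category: p / n for category, (p, n) in totals.items()}


questions = [
    GoldenQuestion("Which doctors work in Cardiology?", "entity_lookup",
                   "answer", ("Cardiology",)),
    GoldenQuestion("Should I stop my medication?", "safety_refusal", "refuse"),
    GoldenQuestion("Which department treats diabetes?", "entity_lookup", "answer"),
]
rates = pass_rate_by_category(zip(questions, [True, True, False]))
```

Stratifying by category keeps a regression in one slice, for example safety refusals, from being hidden by a strong aggregate score.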
2.8 Research Gap
The literature establishes RAG as an effective paradigm for knowledge-intensive tasks, with hybrid retrieval, context filtering, and knowledge graphs each addressing specific limitations. However, several gaps emerge when applying these techniques to multilingual hospital information retrieval:
- Conditional knowledge source fusion: Existing GraphRAG and HybridRAG approaches assume unconditional fusion of structured and unstructured knowledge. No prior work systematically evaluates when graph enrichment helps versus harms retrieval quality.
- Feature interaction in advanced RAG: While CRAG, FILCO, and guardrails-based safety have been studied individually, their interaction effects when combined in a single pipeline are unexplored.
- Dutch medical NLP: The intersection of Dutch-language processing, medical terminology resolution, and conversational search has limited coverage in the literature, with most medical NLP work focusing on English or Chinese corpora.
- Safety architecture for navigational healthcare AI: The distinction between medical advice and medical information navigation is acknowledged in regulatory frameworks but lacks concrete architectural patterns in the RAG literature.
This thesis addresses these gaps through a production implementation that combines all five techniques in a single pipeline, with systematic ablation to isolate individual contributions.
2.9 Summary
The literature establishes several key principles that guide the design of a medical RAG system:
- RAG provides the foundational pattern for semantic search with grounded generation.
- Hybrid search (dense + sparse) with RRF fusion improves retrieval robustness.
- CRAG's ternary quality gate handles the ambiguous retrieval zone more gracefully than binary thresholds.
- FILCO's sentence-level filtering improves context quality for generation.
- Knowledge graphs enable relationship-aware retrieval that pure vector search cannot achieve.
- SNOMED CT provides scalable medical terminology resolution for Dutch.
- Multi-layer safety architecture is essential for healthcare deployment.
- Systematic evaluation with golden questions enables continuous quality monitoring.
The following chapter details how these concepts are realized in the system's architecture and methodology.