{/* Import from thesis-chapters/01-introduction.md */}
Chapter 1: Introduction
1.1 Context and Motivation
Ziekenhuis Oost-Limburg (ZOL) is the largest hospital in the Belgian province of Limburg, serving a diverse population across multiple campuses: Sint-Jan, Sint-Barbara, Munsterbilzen, and the Revalidation campus. With over 100 000 monthly visitors to its website and approximately 25 000 search queries per month (representing 25% of all web traffic), the hospital's online search function is a critical touchpoint between patients and healthcare services (ZOL 2025).
The existing search infrastructure relies on Elasticsearch keyword matching, which produces consistently frustrating results for patients attempting to navigate the hospital's extensive content library. The library comprises over 1 000 patient brochures, 700+ condition descriptions, and hundreds of pages detailing doctors, departments, treatments, and examinations. Despite this wealth of information, users frequently fail to find answers to straightforward questions such as "Which doctor treats herniated discs?" or "Where do I go for a heart scan?" — questions that require understanding relationships between medical entities rather than matching keywords. This is a manifestation of the vocabulary-mismatch problem documented in classical information-retrieval research (Manning et al. 2008).
The consequences are tangible. ZOL's helpdesk and call centre are overwhelmed with inquiries that could in principle be answered through the website. Patients resort to phone calls because the search function cannot bridge the gap between their natural-language questions and the structured medical information available online.
This challenge is compounded by linguistic complexity. The Belgian Limburg region is multilingual, with patients querying in Dutch, English, Turkish, French, German, Italian, Romanian, and Greek. Medical terminology itself poses additional barriers: a patient searching for "hoge bloeddruk" (high blood pressure) needs to find information filed under "hypertensie" (hypertension), while "suikerziekte" (sugar disease) must resolve to "Diabetes Mellitus." This terminology gap is a well-known challenge in medical information retrieval, where patient-facing vocabulary diverges significantly from clinical nomenclature (Bodenreider 2004).
1.2 Partnership and Stakeholders
This project is a collaboration between three parties. Novation, a marketing agency, is building ZOL's new Drupal-based website and requires an intelligent search solution that integrates with its front-end. ZOL provides the domain expertise, content, and clinical validation requirements. The author, an AI Technology Architect graduation candidate at PXL University College, designed and implemented the technical solution.
The partnership structure imposes practical constraints. The search system must operate as a standalone service that Novation can embed via API integration. It must respect ZOL's data governance requirements. And above all, it must adhere to a non-negotiable safety constraint: the system must never provide medical advice.
1.3 Problem Statement
The core problem can be stated concisely: ZOL's website contains comprehensive healthcare information, but patients cannot find it. Keyword search fails because patients think in natural language and colloquial terms, while content is organized using medical terminology and institutional structure. A semantic gap exists between how patients ask questions and how information is stored.
Retrieval-Augmented Generation (RAG) (Lewis et al. 2020) offers a promising approach. By combining semantic vector search with large-language-model generation, RAG systems can understand the intent behind natural-language queries and generate grounded, cited responses (Gao et al. 2024). However, deploying RAG in a hospital context introduces unique challenges that go beyond standard information retrieval:
- Safety: The system must assist patients in finding information without crossing the line into medical advice — a distinction that requires careful architectural safeguards.
- Entity relationships: Understanding that "Dr. Janssen works in Cardiology, which treats heart failure" requires structured knowledge that flat document retrieval cannot provide.
- Terminology: Dutch medical terminology includes extensive synonymy and patient-friendly variants that must be resolved at query time.
- Evaluation: Quality must be measured systematically, not anecdotally, with automated evaluation that can run after every code change.
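The terminology challenge listed above can be illustrated with a minimal sketch of query-time synonym resolution. The lookup table, the function names `resolve_terms` and `expand_query`, and the substring-matching strategy are illustrative assumptions rather than the system's actual implementation; the two term pairs are the examples given in Section 1.1.

```python
import re

# Hypothetical lookup from patient-friendly Dutch terms to clinical terms.
# Only the two pairs mentioned in Section 1.1 are included.
PATIENT_TO_CLINICAL = {
    "hoge bloeddruk": "hypertensie",
    "suikerziekte": "Diabetes Mellitus",
}


def resolve_terms(query: str) -> list[str]:
    """Return clinical terms whose patient-friendly variant occurs in the query."""
    # Lower-case and collapse whitespace so spacing and casing do not matter.
    normalized = re.sub(r"\s+", " ", query.lower()).strip()
    return [
        clinical
        for colloquial, clinical in PATIENT_TO_CLINICAL.items()
        if colloquial in normalized
    ]


def expand_query(query: str) -> str:
    """Append resolved clinical terms so downstream retrieval can match them."""
    extra = resolve_terms(query)
    return query if not extra else f"{query} ({', '.join(extra)})"
```

A query such as "wat is suikerziekte?" would be expanded with the clinical term before retrieval, while a query without known colloquial terms passes through unchanged. A production system would presumably draw on a curated terminology source rather than a hard-coded dictionary.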
1.4 Research Questions
This thesis addresses one central research question:
How can a RAG system with a knowledge graph improve the search experience on the ZOL hospital website while maintaining safety guarantees?
This question is decomposed into five sub-questions:
- Architecture: What RAG architecture is suitable for a multilingual medical information environment?
- Knowledge graph: How can a medical knowledge graph add structured relationships to search results?
- Safety: What safety layers are needed to prevent the generation of medical advice?
- Evaluation: How can RAG answer quality be systematically evaluated?
- Advanced techniques: What is the impact of advanced RAG techniques (CRAG, FILCO) on answer quality?
1.4.1 Hypotheses
Based on the research questions and the capabilities documented in the RAG literature, this thesis tests the following hypotheses:
- H1: A hybrid RAG architecture with conditional knowledge graph enrichment achieves at least 90% entity recall on a 302-question golden evaluation set spanning 21 query categories.
- H2: Each advanced RAG technique (CRAG, FILCO, Guardrails) individually improves pass rate over the baseline, but their combined activation does not yield additive improvement due to feature interaction effects.
- H3: Conditional graph injection (applied only when the query contains recognized medical entities) outperforms unconditional graph injection for overall answer quality.
- H4: A five-layer defense-in-depth safety architecture achieves zero medical advice incidents across all evaluation configurations.
These hypotheses are evaluated in Chapter 6 based on the experimental results presented in Chapter 4.
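To make the entity-recall criterion in H1 concrete, the following sketch shows one plausible way to compute it. It assumes that recall is the fraction of a question's expected entities that appear in the answer (matched case-insensitively as substrings) and that the 90% threshold applies to the mean over all questions; the class and function names are hypothetical, and the matching rule actually used in the evaluation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenQuestion:
    question: str
    expected_entities: tuple[str, ...]


def entity_recall(answer: str, expected: tuple[str, ...]) -> float:
    """Fraction of expected entities mentioned in the answer (case-insensitive)."""
    if not expected:
        return 1.0  # nothing to recall; treated as satisfied (assumption)
    text = answer.lower()
    found = sum(1 for entity in expected if entity.lower() in text)
    return found / len(expected)


def mean_entity_recall(pairs: list[tuple[GoldenQuestion, str]]) -> float:
    """Average per-question recall over (question, answer) pairs."""
    if not pairs:
        raise ValueError("no evaluation pairs supplied")
    return sum(entity_recall(a, q.expected_entities) for q, a in pairs) / len(pairs)
```

Under these assumptions, H1 would hold if `mean_entity_recall` over the 302 golden questions is at least 0.90.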
1.5 Scope and Evolution
The project began as a proof of concept and evolved into a production-grade system during an intensive, full-time development effort, building on prior experience with the technology stack and leveraging established open-source frameworks (FastAPI, React, PostgreSQL with pgvector, Redis, Docker Compose). At the time of writing, the development process has produced more than 335 git commits and 50 Architecture Decision Records (ADRs), which document every significant technical choice.
1.5.1 Codebase Metrics
The project's scale provides context for the engineering effort:
Table 1.1. Codebase metrics at the time of thesis submission.
| Metric | Value |
|---|---|
| Total lines of code | 188 835 |
| Backend Python (app/) | 65 075 |
| Backend tests | 84 467 |
| Frontend TypeScript/TSX | 39 293 |
| Total files | 448 |
| Git commits | 335+ |
| Architecture Decision Records | 50 |
| Database migrations | 60+ |
| Frontend components | 106 |
| API route modules | 21 |
| i18n translation keys | 51 |
| Configuration parameters | 507 lines |
The test-to-production code ratio of approximately 1.3 : 1 (84 K test code vs 65 K application code) reflects the no-mocking policy (ADR-0002), which requires comprehensive integration tests against real infrastructure via testcontainers.
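The total and the ratio quoted above follow directly from the line counts in Table 1.1, as this short check shows (variable names are illustrative):

```python
# Line counts copied from Table 1.1.
backend_app = 65_075
backend_tests = 84_467
frontend = 39_293

total = backend_app + backend_tests + frontend
ratio = backend_tests / backend_app

print(total)            # 188835
print(round(ratio, 2))  # 1.3
```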
The scope encompasses:
- Ingestion pipeline: automated crawling of the ZOL website, content extraction, chunking, and embedding generation.
- Entity taxonomy: a PostgreSQL-based taxonomy with 1 564 entities (including 352 doctors, 64 departments, 169 conditions, 207 treatments, and 103 examinations) and 3 029 relationships at the time of evaluation. Subsequent fuzzy-deduplication work (ADR-0028 SP-7) and enrichment with SNOMED CT Belgian Edition terminology brought this to 2 663 entities and 3 591 relationships.
- Query pipeline: an eleven-stage retrieval-and-generation pipeline with hybrid search, reranking, graph enrichment, and safety validation.
- Safety architecture: five independent safety layers — intent classification, adversarial-input detection, quality gates, LLM-as-judge validation, and output guardrails (Llama Guard 3).
- Evaluation framework: 302 golden questions across 21 categories with automated LLM-based judging.
- Management interface: admin dashboard with analytics, pipeline debugging, feature flags, and graph exploration.
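The defense-in-depth idea behind the safety architecture listed above is that each layer can veto independently, so a failure in one layer does not disable the others. The sketch below illustrates only that control flow. It is a simplification under stated assumptions: the real layers are model-based and act at different pipeline stages (input and output), whereas here every layer is a plain function over one text, and the two toy layers are placeholders invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A layer inspects the text and returns a refusal reason, or None to let it pass.
SafetyLayer = Callable[[str], Optional[str]]


@dataclass(frozen=True)
class SafetyVerdict:
    allowed: bool
    blocked_by: Optional[str] = None
    reason: Optional[str] = None


def run_layers(text: str, layers: list[tuple[str, SafetyLayer]]) -> SafetyVerdict:
    """Apply layers in order; the first layer that objects blocks the text."""
    for name, layer in layers:
        reason = layer(text)
        if reason is not None:
            return SafetyVerdict(allowed=False, blocked_by=name, reason=reason)
    return SafetyVerdict(allowed=True)


# Placeholder layers for illustration only; the real layers are model-based.
def toy_intent_layer(text: str) -> Optional[str]:
    return "asks for dosage advice" if "dosage" in text.lower() else None


def toy_output_layer(text: str) -> Optional[str]:
    return "empty response" if not text.strip() else None
```

Recording which layer blocked a request, as `SafetyVerdict.blocked_by` does here, is one way to make per-layer safety metrics available for later analysis.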
Out of scope are real user studies, integration with ZOL's internal clinical systems, and multi-hospital federation. The system has been deployed to a pre-production pilot environment but has not yet served production visitors.
1.6 Methodology
The development methodology combines agile iteration with systematic documentation and evidence-based decision-making:
- ADR-driven development: every significant technical choice, from embedding-model selection to safety-layer design, is captured in an Architecture Decision Record. The 50 ADRs serve as a living audit trail of context, alternatives considered, decisions made, and consequences observed (ADR-0001 through ADR-0053; see docs/ADR/).
- Golden Standard evaluation: a framework of 302 golden questions, each with expected entities, safety requirements, and category labels, enables automated regression testing after every code change. An LLM-based judge (GPT-5.2 in mid-pilot evaluation runs) computes entity recall, faithfulness, and answer relevancy following the RAGAS evaluation pattern (Es et al. 2023).
- Ablation studies: controlled experiments isolating the contribution of individual features (CRAG, FILCO, Guardrails) through fractional-factorial experiment design (Wohlin et al. 2012).
- No-mocking test policy (ADR-0002): all tests run against real infrastructure (PostgreSQL, Redis) via testcontainers; mocking is forbidden by default to ensure tests reflect production behaviour. The policy is informed by the long-running practitioner argument against mock-based test fragility (Fowler 2007).
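For the ablation studies mentioned above, three on/off features give eight configurations in a full factorial design, and a half-fraction reduces this to four. The sketch below enumerates both. The choice of generator (third factor equal to the product of the first two in -1/+1 coding) is an assumption made for illustration; the fraction actually used in the experiments is not specified in this chapter.

```python
from itertools import product

FEATURES = ("CRAG", "FILCO", "Guardrails")


def full_factorial() -> list[dict[str, bool]]:
    """All 2^3 on/off combinations of the three features."""
    return [dict(zip(FEATURES, levels)) for levels in product((False, True), repeat=3)]


def half_fraction() -> list[dict[str, bool]]:
    """2^(3-1) design: choose the first two factors freely, derive the third.

    In -1/+1 coding the generator is C = A*B, i.e. the third feature is on
    exactly when the first two are at the same level.
    """
    runs = []
    for a, b in product((False, True), repeat=2):
        c = a == b  # product of two +/-1 levels is +1 when they agree
        runs.append(dict(zip(FEATURES, (a, b, c))))
    return runs
```

Each feature is enabled in exactly two of the four half-fraction runs, which keeps the main-effect estimates balanced. Note that this particular fraction does not contain the all-off baseline; the complementary fraction (third feature on when the first two differ) does.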
1.6.1 Ethical Review and Research-Subject Status
This thesis describes the development and offline evaluation of an information-retrieval system. No human subjects are involved in any of the empirical work reported here. All evaluation results derive from a synthetic golden-question set authored by the development team and applied programmatically against the system. No real users were recruited; no patient data was processed for evaluation purposes.
For the project as a whole, the ethical-review status is therefore exempt, because no human subjects are involved. The PXL University College ethics-committee submission for the deferred real-user study (Section 6.4) is identified as future work. When such a study is undertaken, it will be subject to ethics review on its own terms.
This does not exempt the system from data-protection obligations once it serves real ZOL website visitors. Section 3.7 documents the privacy and data-protection architecture under GDPR (Regulation (EU) 2016/679) and the EU AI Act (Regulation (EU) 2024/1689), together with the assessment that the system is not a medical device under the MDR (Regulation (EU) 2017/745).
1.7 Thesis Structure
The remainder of this thesis is organized as follows:
- Chapter 2 (Literature Review) surveys the academic foundations: RAG, Corrective RAG, context filtering, knowledge graphs in healthcare, SNOMED CT, medical NLP safety, and evaluation frameworks.
- Chapter 3 (Methodology) details the system architecture, the eleven-stage query pipeline, knowledge graph design, safety layers, and evaluation methodology.
- Chapter 4 (Results) presents quantitative findings: golden evaluation pass rates, ablation study outcomes, graph enrichment impact, pipeline performance, and safety metrics.
- Chapter 5 (Discussion) answers each research question with evidence, discusses strengths and limitations, and reflects on lessons learned.
- Chapter 6 (Conclusion) summarizes contributions, provides recommendations for ZOL, and outlines future work.