
Chapter 6: Conclusion

6.1 Summary of Contributions

This thesis presents the design, implementation, and evaluation of a production-ready RAG system with knowledge graph for the ZOL hospital website. The work makes five key contributions:

Contribution 1: Production-Ready Medical RAG Pipeline

The system replaces keyword-based search with semantic understanding, processing natural-language queries in eight languages and generating grounded responses with source citations. The eleven-stage pipeline architecture demonstrates that production medical RAG requires significantly more complexity than the canonical retrieve-then-generate pattern — including hybrid retrieval, reranking, conditional graph enrichment, multi-tier LLM routing, and comprehensive safety layers. The implementation spans 188 835 lines of code, documented through 50 Architecture Decision Records and 335+ git commits of iterative refinement.

Contribution 2: Conditional Knowledge Graph Injection

The integration of a PostgreSQL-based entity taxonomy with 1 564 entities and 3 029 relationships at the time of evaluation (consolidated to 2 663 entities and 3 591 relationships post-deduplication, ADR-0028) demonstrates the value of structured entity relationships for navigational and multi-hop queries. Critically, the evaluation revealed that graph enrichment should be conditional: injecting graph context only when the query contains recognised medical entities improves quality, while unconditional injection can harm factual queries by introducing noise. We present this finding as a working hypothesis that, once independently replicated in other hospital-RAG settings, may generalise to other domains where structured and unstructured knowledge sources are fused.
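
The conditional-injection rule can be illustrated with a minimal sketch. It assumes a toy in-memory index and naive substring matching; `GraphIndex`, `match_entities`, and `build_context` are hypothetical names chosen for illustration and are not taken from the system's codebase.

```python
from dataclasses import dataclass, field


@dataclass
class GraphIndex:
    """Toy stand-in for the entity taxonomy: entity name -> related facts."""
    facts: dict[str, list[str]] = field(default_factory=dict)

    def match_entities(self, query: str) -> list[str]:
        # Naive case-insensitive substring match (an assumption for this
        # sketch); a real system would use entity recognition and aliases.
        lowered = query.lower()
        return [name for name in self.facts if name.lower() in lowered]


def build_context(query: str, passages: list[str], graph: GraphIndex) -> list[str]:
    """Return the passages, plus graph facts only if an entity is recognised."""
    entities = graph.match_entities(query)
    if not entities:
        # No recognised entity: skip enrichment so no graph noise is added.
        return list(passages)
    graph_facts = [fact for name in entities for fact in graph.facts[name]]
    return list(passages) + graph_facts


# Illustrative data only; the fact string is made up for the example.
graph = GraphIndex({"cardiology": ["cardiology -> treats -> heart failure"]})
enriched = build_context("Which doctors work in cardiology?", ["passage 1"], graph)
plain = build_context("What are the visiting hours?", ["passage 1"], graph)
```

In this sketch the second query receives only the retrieved passages, which mirrors the observation that factual queries are better served without graph context.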

Contribution 3: Five-Layer Safety Architecture with Zero Incidents

The defence-in-depth safety architecture achieves its zero-medical-advice-incidents KPI in golden-evaluation runs through five independent layers: intent classification, GCG adversarial detection, CRAG quality gate, LLM safety judge, and Llama Guard guardrails. The GCG detection layer demonstrates that safety must address not just obvious misuse but also sophisticated adversarial attacks documented in the academic literature (Zou et al. 2023; generalised by Liao et al. 2024), detecting such attacks in under 5 ms using statistical heuristics without requiring an LLM call. The ablation study showed that Guardrails alone achieves a 99.4% pass rate, but for a production hospital system, redundancy across all layers is the defensible choice. Threat-model coverage follows the OWASP Top 10 for LLM Applications (2025).
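
This section does not list which statistics the GCG layer computes, so the following is only a sketch of one plausible LLM-free heuristic: flagging queries whose tail is unusually dense in punctuation, as optimised adversarial suffixes often are. The function name, window size, and threshold are assumptions made for illustration, not the system's actual values.

```python
import string


def looks_like_gcg_suffix(
    query: str, tail_chars: int = 60, max_symbol_ratio: float = 0.25
) -> bool:
    """Flag queries whose final characters are unusually symbol-heavy."""
    tail = query[-tail_chars:]
    visible = [ch for ch in tail if not ch.isspace()]
    if len(visible) < 20:
        # Too short to judge statistically; leave it to the later layers.
        return False
    symbols = sum(1 for ch in visible if ch in string.punctuation)
    return symbols / len(visible) > max_symbol_ratio


benign = "Where can I find the cardiology department and what are the opening hours?"
# Synthetic symbol-dense suffix, written for this example only.
suspicious = "What dose should I take? " + "}{)!! ;;-- ==>> ]]([ @@## $$%% ^^&& **(( ))__ ++||"

benign_flag = looks_like_gcg_suffix(benign)
suspicious_flag = looks_like_gcg_suffix(suspicious)
```

A check of this kind is a single pass over a short string, which is consistent with running it before any model call; it would still need the remaining layers behind it, since a fixed threshold can be evaded.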

Contribution 4: Systematic Evaluation with Golden Standard

The golden evaluation framework with 302 questions across 21 categories enables automated regression testing after every code change. The ablation study methodology isolates individual feature contributions, revealing that the combined activation of CRAG, FILCO, and Guardrails (96.3%) performs worse than any of the three features enabled individually (98.2–99.4%), a finding that would not have been discoverable without systematic experimentation. The latest evaluation achieves a 99.0% pass rate with a 95% bootstrap confidence interval of [0.972, 1.000].
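
How a bootstrap interval for a pass rate is obtained can be sketched as follows. The sketch assumes a percentile bootstrap over per-question pass/fail outcomes and 299 passes out of 302 (the count that a 99.0% pass rate implies); the resampling scheme actually used in the evaluation may differ, and `bootstrap_pass_rate_ci` is a hypothetical helper name.

```python
import random


def bootstrap_pass_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample the question set with replacement and recompute the pass rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return rates[lower_idx], rates[upper_idx]


# Assumed outcome vector: 299 passes and 3 failures out of 302 questions.
outcomes = [1] * 299 + [0] * 3
lower, upper = bootstrap_pass_rate_ci(outcomes)
```

With so few failures, many resamples contain none at all, which is why an upper bound at or near 1.000 is to be expected for an interval of this kind.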

Contribution 5: SNOMED CT Integration for Medical Terminology

The three-phase SNOMED CT integration demonstrates a practical approach to medical synonym resolution at scale. By loading the Belgian Edition RF2 dataset (356K concepts, 656K descriptions) into PostgreSQL and enriching taxonomy entities with SNOMED concept IDs and synonyms, the system moves from 50 hand-maintained alias entries to comprehensive medical terminology coverage. The 15 SNOMED-specific golden questions achieve a 93.3% pass rate, validating the integration's effectiveness.

6.1.1 Hypothesis Evaluation

The four hypotheses stated in Section 1.4.1 are evaluated against the experimental evidence:

Table 6.1. Hypothesis evaluation summary.

| Hypothesis | Outcome | Evidence |
|---|---|---|
| H1: Hybrid RAG achieves ≥90% entity recall | Supported | Entity recall 0.932 [95% CI: 0.910–0.955]; pass rate 99.0% [95% CI: 0.972–1.000] (Section 4.1.2) |
| H2: Individual features improve; combined does not add | Supported | Individual: CRAG 98.2%, FILCO 98.2%, Guardrails 99.4%. Combined: 96.3% with 4 regressions vs. 5 improvements (Section 4.2) |
| H3: Conditional graph injection outperforms unconditional | Supported | Unconditional injection reduced pass rate by 0.6 pp vs. graph-off; conditional injection improved it by 1.7 pp (Section 4.3) |
| H4: Five-layer safety achieves zero incidents | Supported | Zero medical advice incidents across all evaluation configurations; 100% safety refusal accuracy; 100% GCG detection (Section 4.5) |

All four hypotheses are supported by the experimental evidence on the synthetic golden-question set. H2 is particularly noteworthy: the finding that feature-interaction effects can negate individual improvements challenges the common assumption that combining beneficial techniques yields cumulative benefit. H3 offers a practical architectural pattern — conditional knowledge-source fusion — that we hypothesise may generalise beyond the hospital-search domain; independent replication is required before it can be claimed as a general result.

6.2 Answering the Central Research Question

How can a RAG system with knowledge graph improve the search experience on the ZOL hospital website while maintaining safety guarantees?

The answer is multi-faceted:

A RAG system improves hospital search by bridging the semantic gap between patient language and medical terminology. The combination of hybrid retrieval (vector + BM25), cross-encoder reranking, and conditional knowledge graph enrichment enables the system to understand queries that keyword matching cannot handle — from colloquial Dutch medical terms to multi-hop relationship questions.
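
One common way to combine a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). This section does not state which fusion method the system uses, so the sketch below should be read as an illustration of the general idea rather than a description of the implementation; the document identifiers and the constant `k = 60` are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # hypothetical ids from semantic search
bm25_hits = ["b", "d", "a"]    # hypothetical ids from keyword search
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that both retrievers rank highly rise to the top of the fused list, and a cross-encoder reranker would then rescore only those leading candidates.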

The knowledge graph adds structured reasoning about entity relationships — doctor-department affiliations, condition-department mappings, treatment-condition associations — that pure document retrieval cannot surface. However, graph enrichment must be applied selectively based on query intent, as unconditional injection introduces noise for factual queries.

Safety guarantees are achieved through defence in depth: five independent layers that address different attack vectors, validated by a zero-incident record across 302 golden questions and multiple ablation study configurations. No single safety layer is sufficient; redundancy is the principled architectural choice for healthcare deployment.

6.3 Recommendations for ZOL

Production Deployment

The system is technically ready for production deployment. The recommended rollout strategy is:

  1. Shadow deployment: Run alongside existing Elasticsearch for 2–4 weeks, logging queries and responses without replacing the current search.
  2. A/B testing: Serve RAG results to a percentage of users, measuring satisfaction and helpdesk call deflection.
  3. Full rollout: Replace Elasticsearch search after validating real-user metrics.
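
For the A/B testing step, a deterministic hash-based split is one simple way to serve RAG results to a fixed percentage of visitors while keeping each visitor in the same group across requests. The sketch below assumes an anonymous visitor or session identifier is available; `assign_variant` and the salt value are hypothetical.

```python
import hashlib


def assign_variant(visitor_id: str, rag_percentage: int, salt: str = "rag-rollout") -> str:
    """Map a visitor to 'rag' or 'elasticsearch', stable across requests."""
    digest = hashlib.sha256(f"{salt}:{visitor_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "rag" if bucket < rag_percentage else "elasticsearch"


# Example: route roughly 20% of visitors to the RAG search.
variants = [assign_variant(f"visitor-{i}", rag_percentage=20) for i in range(10_000)]
rag_share = variants.count("rag") / len(variants)
```

Setting `rag_percentage` to 0 corresponds to the shadow phase from the visitor's point of view, and raising it to 100 corresponds to full rollout, so the same switch could cover all three steps.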

Content Optimization

The evaluation identified content gaps that ZOL can address:

  • General phone number visibility: The GQ-062 failure suggests the general appointment line (089 32 50 50) should be more prominently embedded in department pages.
  • Department-condition mappings: Ensuring each condition page explicitly names the handling department would improve condition_department query performance.
  • SNOMED-aligned terminology: Using SNOMED-preferred Dutch terms in patient-facing content would improve query-time matching.

SNOMED CT Expansion

The Phase 1 integration loads reference tables and enriches graph nodes. Future phases could:

  • Use SNOMED's IS-A hierarchy for automatic condition taxonomy expansion (e.g., automatically linking all cancer subtypes to Oncology).
  • Leverage FINDING_SITE relationships for anatomical routing (condition → body structure → department).
  • Cross-reference ICD-10 codes from ZOL's clinical systems for bi-directional integration.
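
The IS-A expansion idea can be sketched as a breadth-first traversal over parent-to-child edges. The concept labels below are illustrative placeholders rather than actual SNOMED CT concept identifiers, and `descendants` and `link_subtypes_to_department` are hypothetical helper names.

```python
from collections import deque


def descendants(concept, children):
    """Return every concept reachable from `concept` via IS-A child edges."""
    seen = set()
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:  # also guards against cycles in bad data
                seen.add(child)
                queue.append(child)
    return seen


def link_subtypes_to_department(root, department, children):
    """Map the root concept and all of its subtypes to one department."""
    return {concept: department for concept in {root} | descendants(root, children)}


# Illustrative hierarchy: parent -> direct subtypes (labels are placeholders).
is_a_children = {
    "cancer": ["lung cancer", "breast cancer"],
    "lung cancer": ["small cell lung cancer"],
}
routing = link_subtypes_to_department("cancer", "Oncology", is_a_children)
```

In a PostgreSQL-backed setup the same traversal could presumably be expressed as a recursive query over the relationship table, but that choice depends on how the RF2 data is indexed.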

6.4 Future Work

Agentic RAG

ADR-0021 (Self-RAG) and ADR-0022 (Dynamic Retrieval) were deferred as future considerations. An agentic approach where the system can reason about its own retrieval quality and autonomously decide to refine, expand, or redirect queries would further improve the ambiguous query handling that CRAG addresses partially.

Multi-Hospital Federation

The architecture's multi-tenancy support (ADR-002) and site configuration system lay the groundwork for serving multiple hospitals. Each hospital would maintain its own taxonomy, knowledge graph, and content corpus while sharing the pipeline infrastructure. The SNOMED CT integration provides a common medical terminology layer that bridges hospital-specific terminologies.

Real User Study

The most critical next step is evaluating the system with real ZOL website visitors. The golden evaluation framework provides a strong quality baseline, but only real-user interaction data can reveal the true distribution of query types, failure modes, and user satisfaction patterns. This study should measure:

  • Search success rate (compared to Elasticsearch baseline)
  • Helpdesk call deflection rate
  • User satisfaction (post-interaction survey)
  • Time-to-answer for common query types
  • Safety incident monitoring

Patient Portal Integration

Extending the system beyond public website search to authenticated patient portal queries would enable personalized information retrieval — e.g., "When is my next appointment?" or "What should I bring for my procedure tomorrow?" This would require integration with ZOL's clinical appointment system and additional privacy safeguards.

6.5 Final Reflection

This project demonstrates that AI-powered semantic search for a hospital domain is technically feasible at production-grade quality, with the safety guarantees that the regulatory environment (GDPR, AI Act, MDR negative classification) and the ethical posture of healthcare information delivery require. The key lessons — conditional knowledge-graph injection, ablation-driven feature selection, defence-in-depth safety, and ADR-driven development — are applicable beyond hospital search to any RAG system operating in a safety-critical information environment.

The development process itself illustrates the pace of change in AI engineering: the project went through three embedding-model generations (nomic-embed-text → BGE-M3 → OpenAI text-embedding-3-large), multiple LLM generations, and dozens of architectural iterations, each documented in an ADR and validated against the golden evaluation framework. The ability to make evidence-based decisions quickly — changing the embedding model and validating its retrieval impact within a single working day — was enabled by the automated evaluation pipeline and the testcontainer-backed integration-test suite.

The system awaits its most important evaluation: deployment to real ZOL website visitors. The 99.0% golden-evaluation pass rate, zero recorded safety incidents in evaluation, and production-grade infrastructure provide a strong foundation. The ultimate measure of success will be whether patients find the information they need, and whether the call-centre phones ring a little less.