Chapter 5: Discussion

5.1 Answering the Research Questions

5.1.1 RQ1: What RAG architecture is suitable for a multilingual medical information environment?

The 11-stage pipeline architecture demonstrates that a multilingual medical RAG system requires significantly more complexity than the canonical retrieve-then-generate pattern. Several architectural choices proved essential:

Hybrid retrieval with RRF fusion (ADR-0020) addresses the complementary failure modes of dense and sparse retrieval. Dense retrieval (BGE-M3, Chen et al. 2024, at the time of evaluation; subsequently OpenAI text-embedding-3-large, OpenAI 2024, per ADR-0048) excels at semantic matching across languages — a Turkish query about "kalp ameliyatı" (heart surgery) retrieves Dutch content about "hartchirurgie" — while BM25 (Robertson and Zaragoza 2009) ensures exact entity names (doctor names, department codes) are not lost in embedding space. RRF (Cormack et al. 2009) combines the two without requiring score calibration. The 70/30 weight split was determined empirically.
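The fusion step can be illustrated with a minimal sketch of weighted reciprocal rank fusion. It assumes that the 70/30 split means 0.7 for the dense ranking and 0.3 for the BM25 ranking (the text does not say which side receives which weight) and uses the conventional smoothing constant k = 60; the function name and document ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(dense: List[str], sparse: List[str],
                 w_dense: float = 0.7, w_sparse: float = 0.3,
                 k: int = 60) -> List[str]:
    """Fuse two ranked lists of document ids by weighted reciprocal rank.

    Only ranks are used, so the raw dense and BM25 scores never need to be
    brought onto a common scale. The 0.7/0.3 assignment is an assumption.
    """
    scores: Dict[str, float] = defaultdict(float)
    for weight, ranking in ((w_dense, dense), (w_sparse, sparse)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical ids: one page is found by both retrievers,
# one exact-name match is found only by BM25.
dense_hits = ["page-hartchirurgie", "page-cardiologie", "page-bezoekuren"]
sparse_hits = ["page-doctor-profile", "page-hartchirurgie"]
fused = weighted_rrf(dense_hits, sparse_hits)
```

A document found by both retrievers accumulates two contributions and rises to the top, while a document found only by BM25 still enters the fused list rather than being dropped.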

Cross-lingual generation (ADR-0027) separates retrieval language from response language. All retrieval operates in Dutch (the language of the content), while the LLM generates responses in the user's detected language. This approach leverages the LLM's multilingual capabilities without requiring multilingual embeddings to perfectly capture cross-language semantic similarity.

4-tier LLM routing optimizes cost without sacrificing quality. The vast majority of LLM calls (intent classification, followup detection) use the cheapest model (gpt-4.1-nano), while only complex generation tasks use the full model. The escalation tier (gpt-5.2) is reserved for "Think Harder" queries where the primary model's response fails the quality gate.
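The routing rule described above can be sketched as a small selection function. This is an illustration only: it covers the three roles the text describes, the task labels are hypothetical, and `PRIMARY_MODEL` is a placeholder because the primary generation model is not named here.

```python
# Placeholder: the primary generation model is not named in this section.
PRIMARY_MODEL = "primary-generation-model"
UTILITY_MODEL = "gpt-4.1-nano"
ESCALATION_MODEL = "gpt-5.2"

# Hypothetical task labels for the cheap, high-volume calls.
UTILITY_TASKS = {"intent_classification", "followup_detection"}


def select_model(task: str, think_harder: bool = False,
                 passed_quality_gate: bool = True) -> str:
    """Pick a model for one LLM call.

    Utility tasks always use the cheapest model. Escalation happens only for
    a "Think Harder" request whose primary response failed the quality gate.
    """
    if task in UTILITY_TASKS:
        return UTILITY_MODEL
    if think_harder and not passed_quality_gate:
        return ESCALATION_MODEL
    return PRIMARY_MODEL
```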

Always-on reranking (ADR-0026) proved to be a non-optional component. Cross-encoder reranking (Nogueira and Cho 2019) consistently improved retrieval precision across all evaluation runs, and the ~300 ms latency cost is justified by the quality gain.

The BGE-M3 embedding model (ADR-0033) was the correct choice for multilingual retrieval at the time of the thesis evaluation, superseding the initial nomic-embed-text selection (ADR-0005). The migration required re-embedding the entire corpus but produced measurable improvements in Dutch retrieval quality. Subsequent operational experience, specifically Ollama's CPU-bound serialization tax on voice-channel turns, motivated a second migration to a third embedding model, OpenAI text-embedding-3-large (1536 dim, hosted); see ADR-0048. The retrieval-quality conclusions of this thesis remain valid for the BGE-M3 class of multilingual embedders; the production system simply now uses a hosted equivalent with substantially lower per-call latency.

5.1.2 RQ2: How can a medical knowledge graph add structured relationships to search results?

The knowledge graph's contribution is real but conditional — a finding that emerged from systematic evaluation rather than assumption.

The graph excels for navigational and relationship queries: "Which doctor treats herniated discs?", "Where is Cardiology located?", "What examinations does Radiology perform?" These queries require traversing entity relationships that no amount of vector similarity can surface. The multi_hop_graph category achieved a 100% pass rate (19/19) in every evaluation run, validating the graph's value for relationship reasoning.

However, the graph can harm factual queries where the answer exists entirely in document text. When graph context is injected for a query like "What are the visiting hours?", the entity relationship data consumes context window tokens and can cause the LLM to produce a response that emphasizes relationships rather than the specific factual content requested.

This led to conditional graph injection — the system's key architectural finding. Graph context is only injected when the query contains recognized medical entities that would benefit from graph traversal. The decision is made at strategy selection (Stage 5) based on taxonomy resolution results. This pattern may generalize to other RAG systems that combine structured and unstructured knowledge sources.
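A minimal sketch of the injection decision follows. It assumes that taxonomy resolution yields a list of recognized entities and that an empty list means "do not inject"; the class and function names are hypothetical and the real Stage 5 logic may weigh more signals.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaxonomyResolution:
    """Hypothetical result of taxonomy resolution for one query."""
    entities: List[str] = field(default_factory=list)


def build_context(passages: List[str], graph_facts: List[str],
                  resolution: TaxonomyResolution) -> List[str]:
    """Assemble the LLM context, adding graph facts only when warranted.

    Graph facts are appended only if the query resolved to at least one
    recognized entity; otherwise the context window is left to the
    retrieved document passages alone.
    """
    context = list(passages)
    if resolution.entities:
        context.extend(graph_facts)
    return context


passages = ["Visiting hours are listed on the department page."]
graph_facts = ["Cardiology -[LOCATED_IN]-> (hypothetical location)"]

factual = build_context(passages, graph_facts, TaxonomyResolution())
navigational = build_context(passages, graph_facts,
                             TaxonomyResolution(entities=["Cardiology"]))
```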

The taxonomy-driven normalization architecture (ADR-0014) proved essential for graph quality. Without the plausibility guards (positive domain maps, negative maps, entity type overrides), the extracted graph contained significant noise: departments linked to unrelated conditions, doctors associated with incorrect specialties, and scope leakage from co-occurring entities on the same page. The frozen taxonomy provides a curated quality layer that LLM extraction alone cannot guarantee.
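The guard types named above can be pictured with the following sketch. The map contents, the strict "reject unless explicitly allowed" policy, and all names are assumptions made for illustration; the frozen taxonomy's actual structure is not described in this section.

```python
from typing import Dict, Set

# Hypothetical guard data. A positive map lists conditions a department is
# known to handle; a negative map lists pairings that must never be linked.
POSITIVE_DOMAIN: Dict[str, Set[str]] = {
    "Cardiology": {"arrhythmia", "heart failure"},
}
NEGATIVE_DOMAIN: Dict[str, Set[str]] = {
    "Cardiology": {"herniated disc"},
}
# Entity type overrides correct a type the LLM extraction got wrong.
TYPE_OVERRIDES: Dict[str, str] = {"Radiology": "department"}


def resolve_type(name: str, extracted_type: str) -> str:
    """Prefer the curated type over the LLM-extracted one when both exist."""
    return TYPE_OVERRIDES.get(name, extracted_type)


def is_plausible(department: str, condition: str) -> bool:
    """Accept a department-condition edge only if the taxonomy supports it."""
    if condition in NEGATIVE_DOMAIN.get(department, set()):
        return False
    allowed = POSITIVE_DOMAIN.get(department)
    # Assumed policy: edges for unknown or unlisted pairings are rejected.
    return allowed is not None and condition in allowed
```

Under this assumed policy, an edge that merely co-occurs on a page (for example a condition mentioned in a sidebar) is dropped unless the curated maps explicitly support it.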

5.1.3 RQ3: What safety layers are needed to prevent medical advice?

The five-layer safety architecture achieves its zero-incident KPI through defense in depth. Each layer addresses a different attack vector:

Table 5.1. Safety layer effectiveness by attack vector.

| Layer | Attack Vector | Effectiveness |
| --- | --- | --- |
| Intent classification | Direct medical advice requests | Blocks ~95% of unsafe queries |
| GCG detection | Adversarial suffix attacks | 100% detection (12/12), under 5 ms |
| Quality gate (CRAG) | Low-confidence generation | Prevents unsupported responses |
| LLM safety judge | Subtle prompt manipulation | Post-generation catch-all |
| Guardrails (Llama Guard) | Model-independent safety | Independent validation |

The key insight is that no single layer is sufficient. Intent classification is the most cost-effective (blocks queries before retrieval), but it cannot catch adversarial inputs designed to bypass classification. GCG detection handles adversarial patterns but cannot evaluate semantic safety. The quality gate prevents hallucination from poor retrieval but does not assess the safety of well-retrieved content. The LLM safety judge and Guardrails provide post-generation validation but add latency and cost.
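The layering argument can be made concrete with a short-circuiting chain of checks. The two predicates below are deliberately crude stand-ins (a keyword test and a symbol-density test), not the system's actual classifier or GCG detector; they only show why a query that passes one layer can still be stopped by the next.

```python
from typing import Callable, List, Optional, Tuple

Layer = Tuple[str, Callable[[str], bool]]


def first_blocking_layer(text: str, layers: List[Layer]) -> Optional[str]:
    """Run the layers in order and return the name of the first that blocks."""
    for name, is_safe in layers:
        if not is_safe(text):
            return name
    return None


def toy_intent_is_safe(query: str) -> bool:
    # Stand-in for intent classification: flags one advice-seeking phrase.
    return "should i take" not in query.lower()


def toy_suffix_is_safe(query: str) -> bool:
    # Stand-in for adversarial-suffix detection: flags symbol-heavy input.
    symbols = sum(not (c.isalnum() or c.isspace()) for c in query)
    return symbols / max(len(query), 1) < 0.3


input_layers: List[Layer] = [
    ("intent_classification", toy_intent_is_safe),
    ("gcg_detection", toy_suffix_is_safe),
]

advice = first_blocking_layer("Should I take ibuprofen?", input_layers)
suffix = first_blocking_layer("hours ?!}{)(*&^%$#@][", input_layers)
benign = first_blocking_layer("Where is Cardiology located?", input_layers)
```

The post-generation layers (quality gate, safety judge, Guardrails) would follow the same pattern, applied to the generated response instead of the query.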

In practice, the ablation study showed that Guardrails alone achieves the highest individual pass rate (99.4%), suggesting that a strong output classifier can compensate for gaps in input filtering. However, for a production hospital system where the cost of a single medical advice incident is potentially severe, redundancy across all layers is the defensible architectural choice.

5.1.4 RQ4: How can RAG answer quality be systematically evaluated?

The golden standard framework with automated LLM-based judging provides a practical and effective evaluation methodology:

  1. Entity recall as the primary metric is fast (~25 minutes for 302 questions), deterministic, and directly measures the information the user needs. It avoids the subjectivity of human evaluation while capturing factual completeness.

  2. Category-stratified analysis reveals which query types benefit from specific features, enabling targeted optimization rather than aggregate metric chasing.

  3. Ablation studies isolate individual feature contributions, preventing the false assumption that "more features = better quality" (as the all-three-on result demonstrates).

  4. Bootstrap confidence intervals provide statistical rigor, confirming that observed improvements are not artifacts of question sampling.
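As an illustration of the fourth point, a percentile bootstrap over per-question pass/fail outcomes can be sketched as below. The resample count, the percentile method, and the toy outcome vector are assumptions for the example and are not taken from the thesis evaluation.

```python
import random
from typing import List, Tuple


def bootstrap_ci(outcomes: List[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for a pass rate.

    `outcomes` holds 1 for a passed question and 0 for a failed one.
    Questions are resampled with replacement, so the interval reflects how
    much the pass rate depends on which questions happen to be in the set.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up outcomes for illustration only: 95 passes and 5 failures.
toy_outcomes = [1] * 95 + [0] * 5
ci_low, ci_high = bootstrap_ci(toy_outcomes)
```

If the intervals of two configurations overlap heavily, the observed difference between them may be an artifact of question sampling rather than a real effect.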

The limitation of entity recall is that it does not capture response quality dimensions such as coherence, helpfulness, or appropriate level of detail. The RAGAS metrics (faithfulness, relevancy) address some of these dimensions but require expensive LLM-based evaluation (~90 minutes for full metrics).
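For reference, entity recall can be computed along the following lines. The sketch assumes case-insensitive substring matching against a list of expected entities per golden question; the matching rule actually used by the evaluation framework is not specified here, and the example answer is invented.

```python
from typing import List


def entity_recall(answer: str, expected_entities: List[str]) -> float:
    """Fraction of expected entities that appear in the answer text.

    Deterministic by construction: the same answer always gets the same
    score, which is what makes the metric cheap to run on every change.
    """
    if not expected_entities:
        return 1.0
    text = answer.casefold()
    found = [e for e in expected_entities if e.casefold() in text]
    return len(found) / len(expected_entities)


# Invented example: two of the three expected entities are present.
score = entity_recall(
    "Cardiology is on route 42 in building A.",
    ["Cardiology", "route 42", "building B"],
)
```

The example also shows the limitation discussed above: the score says nothing about whether the answer is coherent or appropriately detailed, only whether the expected entities occur.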

5.1.5 RQ5: What is the impact of advanced RAG techniques (CRAG, FILCO) on answer quality?

The ablation study provides clear evidence:

  • Individual features help: CRAG (+2.5pp), FILCO (+2.5pp), and Guardrails (+3.7pp) each improve pass rate over the baseline when applied individually.
  • Feature interactions exist: The all-three-on configuration (96.3%) performs worse than any individual feature, regressing 4 questions while improving only 5. This suggests the features conflict in their modification of the retrieval-generation pipeline.
  • FILCO reduces latency: By filtering irrelevant sentences, FILCO reduces the context sent to the LLM, yielding a 29% reduction in average response time with no quality loss.
  • Feature selection is empirical: There is no theoretical basis for predicting which feature combination will work best for a given domain. Ablation studies are essential.
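The reported figures can be cross-checked with a few lines of arithmetic. The sketch assumes that the 99.4% Guardrails-only pass rate quoted in Section 5.1.3 and the +3.7pp gain above come from the same ablation run, which lets the baseline be derived rather than read from a table.

```python
# Figures as reported in this chapter (percent and percentage points).
guardrails_only = 99.4
gains_pp = {"CRAG": 2.5, "FILCO": 2.5, "Guardrails": 3.7}
all_three = 96.3

# Derived, not reported: baseline = Guardrails-only rate minus its gain.
baseline = round(guardrails_only - gains_pp["Guardrails"], 1)

# Implied pass rate for each single-feature configuration.
single_feature = {name: round(baseline + gain, 1)
                  for name, gain in gains_pp.items()}

# Margin of the combined configuration over the derived baseline.
all_three_vs_baseline = round(all_three - baseline, 1)
```

Under that assumption the combined configuration sits below every single-feature configuration but slightly above the derived baseline, which is consistent with a net gain of one question (5 improved, 4 regressed).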

The practical recommendation for production deployment is to enable features individually based on empirical evidence from the golden evaluation, rather than assuming that combining all available techniques will yield the best result.

Direct comparison with other hospital RAG systems is limited by the domain-specific nature of the golden evaluation set. However, several reference points provide context:

Retrieval quality: The system's entity recall of 0.932 compares favourably with self-reported RAG evaluation benchmarks. Gao et al. 2024 survey results in which typical RAG systems achieve faithfulness scores in the 0.7–0.9 range on general knowledge tasks. The ZOL system's faithfulness of 0.941 (ablation baseline) sits just above the upper end of this range, likely benefiting from the domain-specific knowledge graph and taxonomy-driven normalisation. Direct comparison is limited because the ZOL system has not yet been evaluated on external benchmarks (BEIR, Thakur et al. 2021; MTEB, Muennighoff et al. 2022); this is acknowledged as a methodology gap in the academic critical assessment.

Safety architecture: The five-layer defence-in-depth approach goes beyond what is documented in most medical NLP literature, where safety typically relies on a single classifier or post-hoc filtering. The zero-incident record across 302 questions and multiple ablation configurations validates this architectural investment, though production deployment with adversarial real users would provide a more rigorous test. The threat-model coverage maps to the OWASP 2025 LLM Top 10 practitioner taxonomy, and the gradient-based suffix detector follows the threat-class identified by Zou et al. 2023 and generalised by Liao et al. 2024.

Response latency: The median response time of 7.8 s is comparable to other RAG systems reported in the literature, where LLM generation dominates pipeline latency. The FILCO-enabled configuration achieves a 29 % latency reduction through context filtering, demonstrating that context quality and response speed need not be in tension. The latency profile sits within the Nielsen 1993 10 s attention bound, with streaming keeping time-to-first-token under the 1 s flow threshold.

Knowledge graph integration: The conditional-graph-injection finding — that unconditional graph enrichment can harm quality — has not, to our knowledge, been reported in prior GraphRAG (Edge et al. 2024) or HybridRAG (Sarmah et al. 2024) literature, which generally assume that additional structured context is beneficial. This is offered as a working hypothesis (per the academic-bar guidance: a claim is not a contribution until it has survived independent replication) that subsequent work in other hospital-RAG settings can test.

5.2 Strengths

Production-Quality Engineering

The system is not a research prototype but a production-ready application with 188K lines of code, comprehensive test coverage, Docker containerization, and an admin interface for non-technical configuration. The 44 ADRs provide a complete decision audit trail that would support handoff to a different development team.

ADR-Driven Development

The practice of recording every significant technical decision in an Architecture Decision Record created a valuable feedback loop. ADRs that were later superseded (e.g., ADR-0005 nomic-embed-text → ADR-0033 BGE-M3 → ADR-0048 OpenAI text-embedding-3-large) documented the reasoning for the original choice and the evidence that motivated each change, enabling future developers to understand not just what was built but why each decision was later revised.

Comprehensive Safety Architecture

The zero-incident safety record across all testing validates the defense-in-depth approach. No single layer is relied upon; the system continues to function safely even if individual layers fail. The GCG adversarial detection (ADR-0036) demonstrates that safety must address not just obvious misuse but also sophisticated attacks documented in the academic literature.

Systematic Evaluation

The golden evaluation framework with 302 questions across 21 categories provides a quantitative foundation for every technical decision. The ablation study methodology, borrowed from machine learning research, ensures that feature decisions are evidence-based rather than intuition-driven.

5.3 Limitations

Single Hospital Scope

The system is built for and tested against ZOL's content exclusively. While the architecture supports multi-tenancy (ADR-002) and the site configuration system is designed for extensibility, the taxonomy, plausibility guards, and golden questions are ZOL-specific. Generalization to other hospitals would require rebuilding these domain-specific components.

No Real User Study

All evaluation uses synthetic golden questions designed by the development team. While these questions are based on real search patterns observed in ZOL's analytics, they do not capture the full diversity of actual patient queries. A production deployment with real users would likely surface failure modes not represented in the golden evaluation.

LLM API Cost Dependency

The system relies on external LLM APIs (OpenRouter routing to OpenAI models) for generation, intent classification, entity validation, and safety judging. This creates both a cost dependency (approximately $0.01-0.05 per query depending on configuration) and a latency dependency (API response times are not under the system's control). The 4-tier routing mitigates cost but cannot eliminate the dependency.

Dutch-Only Content

While the system generates responses in 8 languages, all content is in Dutch. Cross-lingual retrieval works because the chosen multilingual embedder (BGE-M3 in the evaluated build; OpenAI text-embedding-3-large post-migration) captures cross-language semantic similarity, but the quality of cross-lingual matching is inherently lower than same-language matching. The multilingual category's 87.5% pass rate (7/8) is the lowest of any non-safety category, which is consistent with this limitation, although eight questions is a small sample.

Feature Interaction Complexity

The ablation study revealed that feature interactions are non-trivial: CRAG + FILCO + Guardrails combined performs worse than any individual feature. This suggests that the pipeline's behavior under multiple simultaneous modifications is not well understood. A more comprehensive experiment design (full factorial with pairwise interactions) would be needed to identify optimal configurations.
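The size of such a design is easy to enumerate. The sketch below lists every on/off combination of the three features and picks out the pairwise configurations; the feature names follow the text, and the remark on coverage is inferred from the configurations discussed in this chapter.

```python
from itertools import product
from typing import Dict, List

FEATURES = ("CRAG", "FILCO", "Guardrails")

# Full factorial: every on/off combination of the three features.
configs: List[Dict[str, bool]] = [
    dict(zip(FEATURES, flags))
    for flags in product((False, True), repeat=len(FEATURES))
]

# Configurations with exactly two features enabled.
pairwise = [c for c in configs if sum(c.values()) == 2]
pairwise_names = [
    "+".join(name for name, on in c.items() if on) for c in pairwise
]
```

The results discussed in this chapter cover the baseline, the three single-feature runs, and the all-three-on run, so the three pairwise configurations are the ones a full factorial would add.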

5.4 Lessons Learned

Conditional Graph Injection

The conditional graph injection finding is discussed in detail in Section 5.1.2.

Taxonomy-Driven Quality

A curated taxonomy with explicit plausibility guards produces higher-quality graph extraction than unconstrained LLM extraction. The LLM is good at identifying potential entities but poor at judging whether a relationship is plausible in the domain context (e.g., whether a department actually handles a specific condition). The taxonomy provides the domain knowledge that the LLM lacks.

No-Mocking Test Policy

The decision to ban mocking (ADR-0002) and require all tests to run against real infrastructure via testcontainers was initially controversial due to slower test execution. In practice, it caught numerous bugs that mocked tests would have missed — particularly in database queries, Redis caching logic, and taxonomy query patterns. The test suite's reliability as a regression detector justified the execution time cost.

Cost Tracking From Day One

Instrumenting LLM costs per query from the beginning of the project (including per-model breakdown and cached token tracking) enabled data-driven decisions about model routing. Without this instrumentation, the 4-tier routing optimization would have been based on guesswork rather than measured cost data.
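A per-query cost ledger of this kind can be sketched as follows. The record fields, the model names, and above all the prices are hypothetical placeholders chosen to make the arithmetic easy to follow; they are not the system's schema or any provider's actual rates.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LLMCall:
    """One LLM call within a query (hypothetical record layout)."""
    model: str
    prompt_tokens: int      # total prompt tokens, including cached ones
    cached_tokens: int      # portion of the prompt served from cache
    completion_tokens: int


# Hypothetical prices in USD per one million tokens.
PRICES = {
    "model-small": {"input": 1.0, "cached": 0.5, "output": 2.0},
    "model-large": {"input": 10.0, "cached": 5.0, "output": 20.0},
}


def call_cost(call: LLMCall) -> float:
    """Cost of one call, billing cached prompt tokens at the cached rate."""
    price = PRICES[call.model]
    uncached = call.prompt_tokens - call.cached_tokens
    total = (uncached * price["input"]
             + call.cached_tokens * price["cached"]
             + call.completion_tokens * price["output"])
    return total / 1_000_000


def cost_by_model(calls: List[LLMCall]) -> Dict[str, float]:
    """Per-model cost breakdown for all calls made while answering a query."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call.model] += call_cost(call)
    return dict(totals)


example = LLMCall("model-small", prompt_tokens=1_000_000,
                  cached_tokens=500_000, completion_tokens=250_000)
breakdown = cost_by_model([example])
```

With a breakdown like this recorded for every query, the share of spend attributable to each routing tier can be measured directly instead of estimated.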

Ablation Studies Are Essential

The assumption that enabling all available features produces the best result was disproven by the ablation study. Without systematic experimentation, the all-three-on configuration would likely have been deployed as the default, yielding worse results than any single-feature configuration with significantly higher latency. This reinforces the importance of evidence-based feature selection over intuition.