Research Conclusion — ZOL Intelligent Search
Date: 2026-02-23 Project: Ziekenhuis Oost-Limburg (ZOL) Intelligent Search Function Context: PXL AI Technology Architect graduation project
Executive Summary
This document concludes the research phase of the ZOL Intelligent Search project. Over 330 commits, 44 Architecture Decision Records, 43 evaluation reports, and 4,161+ unit tests, we have built and validated a production-ready intelligent search system for ZOL's hospital website.
The core research question — Can RAG with a knowledge graph replace keyword search for a hospital website while maintaining medical safety? — is answered with a definitive yes, subject to the architectural insights discovered during this research.
Final Metrics (2026-02-23, GPT-5.2 Judge)
| Metric | Value | Target | Status |
|---|---|---|---|
| Pass rate | 100.0% (178/178) | ≥ 95% | Exceeded |
| Faithfulness | 0.989 | ≥ 0.90 | Exceeded |
| Answer relevancy | 0.950 | ≥ 0.85 | Exceeded |
| Entity recall | 0.956 | ≥ 0.90 | Exceeded |
| Safety refusal accuracy | 100.0% | 100.0% | Met |
| Medical advice incidents | 0 | 0 | Met |
| Avg response time | 10.0s | < 15s | Met |
Research Journey
Phase 1: Foundation (Feb 1–10)
Objective: Build baseline RAG pipeline with semantic search.
- Ingested ~1,000 ZOL brochures and web pages via custom crawlers
- Implemented pgvector embeddings (BAAI/bge-m3, 1024d) with hybrid BM25 search
- Built Neo4j knowledge graph with entity extraction: doctors ↔ departments ↔ conditions ↔ treatments ↔ campuses
- Established evaluation framework: 178 golden questions across 20 categories
- Implemented multi-layer safety: content filtering, prompt injection detection, LLM safety judge
Key decisions:
- ADR-0001: Tiktoken + markdown chunking
- ADR-0002: No-mocking test policy (testcontainers)
- ADR-0014: Taxonomy-driven graph with LLM validation
- ADR-0029: Neo4j service naming (GraphitiService → Neo4jService)
Phase 2: Graph Quality (Feb 8–14)
Objective: Fix extraction errors and establish data quality.
Nine iterations of graph quality fixes (v1–v9) addressing:
- Cross-product bugs: departments linked to ALL campuses instead of correct ones
- Garbage entity names: body parts and job titles parsed as doctor names
- Self-referential relationships and hub concept inflation
- Entity type collisions (treatment vs. service vs. examination)
Result: Database doctor score 91/100. 109 departments, 120 conditions, 528 doctors searchable.
Key innovations:
- Frozen taxonomy architecture with scraper-driven entity registry
- SNOMED CT integration for medical terminology normalization
- Comprehensive domain knowledge maps (DEPT_CONDITION_MAP, SPECIALTY_DEPARTMENT_MAP, etc.)
- LLM entity validation with cross-page caching
Phase 3: Retrieval Quality (Feb 14–21)
Objective: Optimize answer quality through retrieval pipeline improvements.
Implemented and evaluated through systematic ablation studies:
| Feature | Impact on Pass Rate | Decision |
|---|---|---|
| CRAG (web search fallback) | +0.6% | Enabled |
| FILCO (context filtering) | +1.1% | Enabled |
| Safety guardrails | No regression | Enabled |
| Query decomposition | -0.6% | Disabled |
| Embedding upgrade (bge-m3) | Baseline established | Production ready |
Key evaluation runs:
- Ablation study: 8 configurations tested (baseline, each feature solo, all combinations)
- Root cause analysis: Fixed FILCO regression, optimized CRAG thresholds
- SNOMED integration: 15 medical terminology questions added, all passing
Phase 4: Graph Optimization (Feb 22–23)
Objective: Determine optimal knowledge graph utilization strategy.
Graph Value Assessment (90 questions, GPT-4.1 judge, randomized A/B):
| Finding | Detail |
|---|---|
| Graph hurts overall quality | -0.067 average (4.609 vs 4.676) |
| Judge prefers Graph OFF | 61% vs 23% |
| Graph provides critical rescue | GQ-088: +2.8 (4.8 vs 2.0) |
| Graph helps entity lookups | doctor_department: +0.10 |
| Graph hurts content queries | ambiguous_symptom: -0.32 |
Solution: Conditional Graph Injection — gate that only injects graph context when intent or retrieval signals indicate it will help.
Final validation: 100% pass rate with premium judge (GPT-5.2), 0 regressions.
Architecture
User Query (Dutch natural language)
│
▼
┌─────────────────────────────────────────────────┐
│ Safety Layer (prompt injection detection) │
│ → Blocks adversarial/harmful queries │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Intent Classification (gpt-4.1-nano) │
│ → 10 intent categories + entity extraction │
│ → Retrieval strategy selection │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Parallel Retrieval │
│ ┌──────────────┐ ┌─────────────────────────┐ │
│ │ Vector Search │ │ Knowledge Graph Search │ │
│ │ (pgvector + │ │ (PostgreSQL taxonomy │ │
│ │ BM25 hybrid) │ │ tables + SQL queries) │ │
│ └──────────────┘ └─────────────────────────┘ │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ ★ Conditional Graph Injection Gate │
│ → Intent-based: doctor/dept → always inject │
│ → Sparse results: < 3 vectors → inject │
│ → Low similarity: < 0.65 → inject (rescue) │
│ → Default: suppress graph (avoid dilution) │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Context Assembly │
│ → FILCO filtering (remove irrelevant chunks) │
│ → Token budget management │
│ → Citation generation │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ CRAG (Corrective RAG) │
│ → Evaluates retrieval quality │
│ → Web search fallback if insufficient │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ LLM Generation (gpt-4.1-mini) │
│ → Grounded answer with citations │
│ → Safety guardrails in system prompt │
│ → Dutch medical disclaimer │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Post-Generation Safety │
│ → LLM safety judge (medical advice detection) │
│ → Content policy enforcement │
│ → Citation verification │
└─────────────────────────────────────────────────┘
Key Research Findings
1. Knowledge Graphs Are Conditionally Valuable
The most significant finding is that knowledge graphs in RAG systems are not universally beneficial. They help for structural queries (entity lookups, relationship traversals) but hurt for content-rich queries (symptoms, treatments, conditions) where document chunks already contain comprehensive information.
Implication: Always-on graph injection is an anti-pattern. A gate based on intent classification and retrieval quality signals achieves the best of both worlds.
2. Eval Model Quality Matters
Using GPT-4.1-mini as the evaluation judge produced 5 false negatives (97.2% pass rate). GPT-5.2 correctly scored all 178 questions as passing (100.0%). The premium model is worth the cost for final validation runs.
| Judge Model | Pass Rate | False Negatives | Cost per Run |
|---|---|---|---|
| gpt-4.1-mini | 97.2% | 5 | ~$0.36 |
| gpt-5.2 | 100.0% | 0 | ~$14.24 |
3. Medical Safety Is Achievable at Zero Incidents
Multi-layer safety (content filtering + prompt injection detection + LLM safety judge + system prompt guardrails) achieves 100% refusal accuracy across 21 safety and adversarial test cases, including:
- Direct medical advice requests
- Prompt injection attempts (GCG, role hijacking, system prompt extraction)
- Edge cases (urgent scenarios, dosage questions)
4. Dutch Medical NLP Is Tractable
Despite the limited Dutch medical NLP ecosystem, the combination of:
- SNOMED CT synonyms (400+ medical terms)
- Handcrafted taxonomy (222 aliases)
- LLM entity validation
- Fuzzy matching fallbacks
...achieves robust entity resolution for Dutch medical queries.
5. Hybrid Search Outperforms Pure Vector
BM25 hybrid search (0.3 weight) consistently improves retrieval for:
- Exact medical terms (drug names, procedure codes)
- Proper nouns (doctor names, department names)
- Compound Dutch words (spoedgevallendienst, kinderpsychiatrie)
What Remains (Engineering, Not Research)
The following items are production engineering tasks, not research questions:
| Item | Type | Priority |
|---|---|---|
| Load testing and performance optimization | Engineering | High |
| Monitoring, alerting, and observability | Operations | High |
| A/B testing framework for production | Engineering | Medium |
| Threshold tuning with real user signals | Optimization | Medium |
| UI/UX refinement (chat interface) | Design | Medium |
| Embedding model upgrade (bge-m3 deployed) | Incremental | Low |
| Agentic RAG for multi-step reasoning | Future research | Deferred |
Evaluation Coverage
Golden Questions: 178 across 20 categories
| Category | Count | Description |
|---|---|---|
| condition_department | 19 | "Where do I go for condition X?" |
| multi_hop_graph | 19 | Multi-step entity traversals |
| snomed_terminology | 15 | SNOMED medical terms in Dutch |
| adversarial_gcg | 12 | Prompt injection attacks |
| out_of_scope | 12 | Non-ZOL and irrelevant questions |
| practical_info | 12 | Visiting hours, parking, phone |
| safety_refusal | 9 | Medical advice, dosage requests |
| service_info | 9 | "Does ZOL offer service X?" |
| entity_disambiguation | 8 | Ambiguous entity references |
| multilingual | 8 | French, English, German queries |
| treatment_info | 8 | Treatment procedures and details |
| campus_info | 6 | Campus locations and services |
| compound_word | 6 | Dutch compound word handling |
| doctor_department | 6 | Doctor ↔ department lookups |
| followup_chain | 6 | Multi-turn conversation chains |
| ambiguous_symptom | 5 | Vague symptom descriptions |
| navigation | 5 | Wayfinding and practical navigation |
| emergency | 3 | Emergency and urgent care |
| referral | 3 | Referral process questions |
Evaluation Methodology
- Entity recall: Rule-based metric — expected entities present in answer
- Faithfulness: LLM judge — answer supported by retrieved context
- Answer relevancy: LLM judge — answer addresses the question
- Context precision/recall: LLM judge — retrieved context quality
- Safety refusal: Deterministic — blocked queries return refusal message
- A/B assessment: Randomized LLM-as-judge comparison (90 questions)
Historical Pass Rates
| Date | Configuration | Pass Rate | Judge |
|---|---|---|---|
| Feb 17 | v25 baseline | 93.3% | gpt-4.1-mini |
| Feb 20 | All features on | 97.8% | gpt-4.1-mini |
| Feb 21 | SNOMED integration | 97.8% | gpt-4.1-mini |
| Feb 22 | Post-safety fixes | 98.9% | o4-mini |
| Feb 22 | Phase C (SNOMED aliases) | 98.9% | o4-mini |
| Feb 23 | Conditional graph injection | 100.0% | gpt-5.2 |
Conclusion
The ZOL Intelligent Search system is research-complete and production-ready for the core use case: replacing keyword search with an intelligent, safety-aware, grounded search function for the ZOL hospital website.
The key architectural insight — conditional graph injection — resolves the tension between knowledge graph utility and answer quality dilution, achieving a system that:
- Answers structural queries (doctors, departments, campuses) with graph-sourced authority
- Answers content queries (conditions, treatments, symptoms) with rich document-sourced detail
- Rescues failed queries where vector search returns low-quality results
- Refuses unsafe queries with 100% accuracy across all tested attack vectors
- Handles Dutch medical terminology via SNOMED CT integration and fuzzy matching
The remaining work is engineering (production hardening, monitoring, A/B testing) — not research.
Addendum (March 2026)
Since the research conclusion was written, the system has been deployed to pilot at test.medchat.health. Several architectural evolutions have taken place: Neo4j was fully removed and replaced by PostgreSQL-based taxonomy tables (taxonomy_entities, taxonomy_relationships) with the HospitalTaxonomy class, authentication was migrated from custom cookie-based JWT to Keycloak OIDC, and the golden evaluation set was expanded to 271 questions (268 content + 3 cache). The latest pilot evaluation achieves 268/268 (100%) pass rate with faithfulness 0.959 and entity recall 0.902, confirming that the architectural changes preserved and improved upon the research-phase quality baselines.