Research Conclusion — ZOL Intelligent Search

Date: 2026-02-23
Project: Ziekenhuis Oost-Limburg (ZOL) Intelligent Search Function
Context: PXL AI Technology Architect graduation project


Executive Summary

This document concludes the research phase of the ZOL Intelligent Search project. Across 330 commits, 44 Architecture Decision Records, 43 evaluation reports, and 4,161+ unit tests, we built and validated a production-ready intelligent search system for ZOL's hospital website.

The core research question, "Can RAG with a knowledge graph replace keyword search for a hospital website while maintaining medical safety?", is answered with a yes, provided the architectural insights from this research (most notably conditional graph injection) are applied.

Final Metrics (2026-02-23, GPT-5.2 Judge)

| Metric | Value | Target | Status |
| --- | --- | --- | --- |
| Pass rate | 100.0% (178/178) | ≥ 95% | Exceeded |
| Faithfulness | 0.989 | ≥ 0.90 | Exceeded |
| Answer relevancy | 0.950 | ≥ 0.85 | Exceeded |
| Entity recall | 0.956 | ≥ 0.90 | Exceeded |
| Safety refusal accuracy | 100.0% | 100.0% | Met |
| Medical advice incidents | 0 | 0 | Met |
| Avg response time | 10.0s | < 15s | Met |

Research Journey

Phase 1: Foundation (Feb 1–10)

Objective: Build baseline RAG pipeline with semantic search.

  • Ingested ~1,000 ZOL brochures and web pages via custom crawlers
  • Implemented pgvector embeddings (BAAI/bge-m3, 1024d) with hybrid BM25 search
  • Built Neo4j knowledge graph with entity extraction: doctors ↔ departments ↔ conditions ↔ treatments ↔ campuses
  • Established evaluation framework: 178 golden questions across 20 categories
  • Implemented multi-layer safety: content filtering, prompt injection detection, LLM safety judge

Key decisions:

  • ADR-0001: Tiktoken + markdown chunking
  • ADR-0002: No-mocking test policy (testcontainers)
  • ADR-0014: Taxonomy-driven graph with LLM validation
  • ADR-0029: Neo4j service naming (GraphitiService → Neo4jService)

Phase 2: Graph Quality (Feb 8–14)

Objective: Fix extraction errors and establish data quality.

Nine iterations of graph quality fixes (v1–v9) addressing:

  • Cross-product bugs: departments linked to ALL campuses instead of correct ones
  • Garbage entity names: body parts and job titles parsed as doctor names
  • Self-referential relationships and hub concept inflation
  • Entity type collisions (treatment vs. service vs. examination)

Result: Database doctor score 91/100. 109 departments, 120 conditions, 528 doctors searchable.

Key innovations:

  • Frozen taxonomy architecture with scraper-driven entity registry
  • SNOMED CT integration for medical terminology normalization
  • Comprehensive domain knowledge maps (DEPT_CONDITION_MAP, SPECIALTY_DEPARTMENT_MAP, etc.)
  • LLM entity validation with cross-page caching

Phase 3: Retrieval Quality (Feb 14–21)

Objective: Optimize answer quality through retrieval pipeline improvements.

Implemented and evaluated through systematic ablation studies:

| Feature | Impact on Pass Rate | Decision |
| --- | --- | --- |
| CRAG (web search fallback) | +0.6% | Enabled |
| FILCO (context filtering) | +1.1% | Enabled |
| Safety guardrails | No regression | Enabled |
| Query decomposition | -0.6% | Disabled |
| Embedding upgrade (bge-m3) | Baseline established | Production ready |

Key evaluation runs:

  • Ablation study: 8 configurations tested (baseline, each feature solo, all combinations)
  • Root cause analysis: Fixed FILCO regression, optimized CRAG thresholds
  • SNOMED integration: 15 medical terminology questions added, all passing

Phase 4: Graph Optimization (Feb 22–23)

Objective: Determine optimal knowledge graph utilization strategy.

Graph Value Assessment (90 questions, GPT-4.1 judge, randomized A/B):

| Finding | Detail |
| --- | --- |
| Graph hurts overall quality | -0.067 average (4.609 vs 4.676) |
| Judge prefers Graph OFF | 61% vs 23% |
| Graph provides critical rescue | GQ-088: +2.8 (4.8 vs 2.0) |
| Graph helps entity lookups | doctor_department: +0.10 |
| Graph hurts content queries | ambiguous_symptom: -0.32 |
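
The randomized A/B comparison behind these numbers can be sketched as follows. This is a minimal illustration, not the project's evaluation harness: it assumes that "randomized" means the two answers are shown to the judge in shuffled order to cancel position bias, and `judge` is a hypothetical callable standing in for the LLM call.

```python
import random
from typing import Callable


def blind_compare(
    question: str,
    answer_graph_on: str,
    answer_graph_off: str,
    judge: Callable[[str, str, str], str],
    rng: random.Random,
) -> str:
    """Show both answers in random order and map the verdict back.

    `judge(question, first, second)` is a hypothetical callable that
    returns "first", "second", or "tie".
    """
    swapped = rng.random() < 0.5
    if swapped:
        first, second = answer_graph_off, answer_graph_on
    else:
        first, second = answer_graph_on, answer_graph_off

    raw = judge(question, first, second)
    if raw == "tie":
        return "tie"

    picked_first = raw == "first"
    # Undo the shuffle so the verdict names a configuration, not a slot.
    return "graph_on" if picked_first != swapped else "graph_off"
```

With this shape, a judge that always prefers whichever answer it sees first splits its votes roughly evenly between the two configurations instead of systematically favouring one.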

Solution: Conditional Graph Injection, a gate that injects graph context only when intent or retrieval signals indicate it will help.

Final validation: 100% pass rate with premium judge (GPT-5.2), 0 regressions.


Architecture

User Query (Dutch natural language)
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│ Safety Layer (prompt injection detection)       │
│ → Blocks adversarial/harmful queries            │
└─────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│ Intent Classification (gpt-4.1-nano)            │
│ → 10 intent categories + entity extraction      │
│ → Retrieval strategy selection                  │
└─────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│ Parallel Retrieval                              │
│ ┌───────────────┐  ┌─────────────────────────┐  │
│ │ Vector Search │  │ Knowledge Graph Search  │  │
│ │ (pgvector +   │  │ (PostgreSQL taxonomy    │  │
│ │  BM25 hybrid) │  │  tables + SQL queries)  │  │
│ └───────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│ ★ Conditional Graph Injection Gate              │
│ → Intent-based: doctor/dept → always inject     │
│ → Sparse results: < 3 vectors → inject          │
│ → Low similarity: < 0.65 → inject (rescue)      │
│ → Default: suppress graph (avoid dilution)      │
└─────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│ Context Assembly                                │
│ → FILCO filtering (remove irrelevant chunks)    │
│ → Token budget management                       │
│ → Citation generation                           │
└─────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│ CRAG (Corrective RAG)                           │
│ → Evaluates retrieval quality                   │
│ → Web search fallback if insufficient           │
└─────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│ LLM Generation (gpt-4.1-mini)                   │
│ → Grounded answer with citations                │
│ → Safety guardrails in system prompt            │
│ → Dutch medical disclaimer                      │
└─────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│ Post-Generation Safety                          │
│ → LLM safety judge (medical advice detection)   │
│ → Content policy enforcement                    │
│ → Citation verification                         │
└─────────────────────────────────────────────────┘

Key Research Findings

1. Knowledge Graphs Are Conditionally Valuable

The most significant finding is that knowledge graphs in RAG systems are not universally beneficial. They help for structural queries (entity lookups, relationship traversals) but hurt for content-rich queries (symptoms, treatments, conditions) where document chunks already contain comprehensive information.

Implication: Always-on graph injection is an anti-pattern. A gate based on intent classification and retrieval quality signals achieves the best of both worlds.
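
The gate rules listed in the architecture diagram can be sketched as a small decision function. The thresholds (fewer than 3 vector hits, similarity below 0.65) come from the diagram; the intent labels, the `RetrievalSignals` type, and the reading of "low similarity" as the best-scoring chunk are assumptions made for illustration.

```python
from dataclasses import dataclass

# Illustrative intent labels; the diagram only says "doctor/dept".
ENTITY_INTENTS = {"doctor_lookup", "department_lookup"}
MIN_VECTOR_HITS = 3    # "Sparse results: < 3 vectors"
MIN_SIMILARITY = 0.65  # "Low similarity: < 0.65"


@dataclass
class RetrievalSignals:
    intent: str
    similarities: list[float]  # similarity score of each retrieved chunk


def should_inject_graph(signals: RetrievalSignals) -> tuple[bool, str]:
    """Return (inject?, reason) following the four gate rules in order."""
    if signals.intent in ENTITY_INTENTS:
        return True, "entity intent"
    if len(signals.similarities) < MIN_VECTOR_HITS:
        return True, "sparse results"
    # Assumption: "low similarity" refers to the best-scoring chunk.
    if max(signals.similarities) < MIN_SIMILARITY:
        return True, "low similarity rescue"
    return False, "suppressed to avoid dilution"
```

Returning the reason alongside the decision makes it straightforward to log which rule fired, which is useful when the thresholds are later tuned against real user signals.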

2. Eval Model Quality Matters

Using gpt-4.1-mini as the evaluation judge produced 5 false negatives (answers marked as failing although they were correct), for a 97.2% pass rate (173/178). gpt-5.2 correctly scored all 178 questions as passing (100.0%). The premium model costs roughly 40 times more per run, which is worth it for final validation runs.

| Judge Model | Pass Rate | False Negatives | Cost per Run |
| --- | --- | --- | --- |
| gpt-4.1-mini | 97.2% | 5 | ~$0.36 |
| gpt-5.2 | 100.0% | 0 | ~$14.24 |
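
The trade-off in the table can be checked with a few lines of arithmetic. The figures are taken from the table above; the helper names are illustrative, and the pass-rate helper assumes every failure in a run is a false negative, as reported for these two runs.

```python
QUESTIONS = 178
COST_MINI_USD = 0.36     # approximate cost per run, gpt-4.1-mini
COST_PREMIUM_USD = 14.24  # approximate cost per run, gpt-5.2


def pass_rate(false_negatives: int, total: int = QUESTIONS) -> float:
    """Pass rate in percent when the only failures are false negatives."""
    return 100 * (total - false_negatives) / total


def cost_per_question(cost_usd: float, total: int = QUESTIONS) -> float:
    return cost_usd / total


cost_ratio = COST_PREMIUM_USD / COST_MINI_USD

print(round(pass_rate(5), 1))                         # 97.2
print(round(cost_per_question(COST_PREMIUM_USD), 2))  # 0.08
print(round(cost_ratio, 1))                           # 39.6
```

At about $0.08 per question, the premium judge is affordable for a final validation run, while the cheaper judge remains the practical choice for frequent iteration runs.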

3. Medical Safety Is Achievable at Zero Incidents

Multi-layer safety (content filtering + prompt injection detection + LLM safety judge + system prompt guardrails) achieves 100% refusal accuracy across 21 safety and adversarial test cases, including:

  • Direct medical advice requests
  • Prompt injection attempts (GCG, role hijacking, system prompt extraction)
  • Edge cases (urgent scenarios, dosage questions)
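
The layering idea can be sketched as a chain of checks in which any layer may refuse and the first refusal wins. This is a toy illustration only: the regex patterns and reason labels are invented for the example and are not the project's actual rules, and the LLM safety judge is represented by a stub.

```python
import re
from typing import Callable, Optional

# Each layer returns a refusal reason, or None to let the query through.
Layer = Callable[[str], Optional[str]]

# Toy patterns for illustration only.
INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"system prompt"]
ADVICE_PATTERNS = [r"\bdosering\b", r"\bdosage\b"]


def pattern_layer(patterns: list[str], reason: str) -> Layer:
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]

    def check(query: str) -> Optional[str]:
        return reason if any(c.search(query) for c in compiled) else None

    return check


def llm_judge_stub(query: str) -> Optional[str]:
    """Placeholder for an LLM safety judge; always passes in this sketch."""
    return None


LAYERS: list[Layer] = [
    pattern_layer(INJECTION_PATTERNS, "prompt_injection"),
    pattern_layer(ADVICE_PATTERNS, "medical_advice"),
    llm_judge_stub,
]


def first_refusal(query: str, layers: list[Layer] = LAYERS) -> Optional[str]:
    """Run the layers in order and stop at the first one that refuses."""
    for layer in layers:
        reason = layer(query)
        if reason is not None:
            return reason
    return None
```

Ordering cheap deterministic checks before the LLM judge keeps refusals for obvious attacks fast and reproducible, which is consistent with the deterministic safety-refusal metric used in the evaluation.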

4. Dutch Medical NLP Is Tractable

Despite the limited Dutch medical NLP ecosystem, the combination of:

  • SNOMED CT synonyms (400+ medical terms)
  • Handcrafted taxonomy (222 aliases)
  • LLM entity validation
  • Fuzzy matching fallbacks

...achieves robust entity resolution for Dutch medical queries.
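
A minimal sketch of the alias-then-fuzzy resolution order, using only the Python standard library. The alias table is a made-up sample and the 0.8 cutoff is an assumption; the project's real taxonomy, SNOMED CT synonyms, and LLM validation step are not reproduced here.

```python
import difflib
from typing import Optional

# Hypothetical alias table: lower-cased surface form -> canonical entity.
ALIASES = {
    "spoed": "Spoedgevallendienst",
    "spoedgevallendienst": "Spoedgevallendienst",
    "kinderpsychiatrie": "Kinderpsychiatrie",
    "cardiologie": "Cardiologie",
    "hartziekten": "Cardiologie",
}


def resolve_entity(mention: str, cutoff: float = 0.8) -> Optional[str]:
    """Resolve a mention via exact alias lookup, then fuzzy matching."""
    key = mention.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    # Fallback: closest alias by difflib similarity ratio, if close enough.
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None
```

Trying the exact alias first keeps the common case cheap and predictable; the fuzzy fallback only has to absorb typos and minor spelling variants, and returning `None` lets the caller fall back to plain retrieval when nothing is close enough.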

5. Hybrid Search Outperforms Pure Vector

BM25 hybrid search (0.3 weight) consistently improves retrieval for:

  • Exact medical terms (drug names, procedure codes)
  • Proper nouns (doctor names, department names)
  • Compound Dutch words (spoedgevallendienst, kinderpsychiatrie)
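
One common way to apply such a weight is a linear blend of normalized scores. The sketch below assumes the 0.3 weight is applied to min-max-normalized BM25 scores and 0.7 to normalized vector similarities; the source does not state the exact fusion formula, so treat this as one plausible reading.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1]; a constant score set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    vector_scores: dict[str, float],
    bm25_scores: dict[str, float],
    bm25_weight: float = 0.3,
) -> list[tuple[str, float]]:
    """Blend normalized vector and BM25 scores, best document first."""
    vec = min_max(vector_scores)
    bm25 = min_max(bm25_scores)
    fused = {
        doc_id: (1 - bm25_weight) * vec.get(doc_id, 0.0)
        + bm25_weight * bm25.get(doc_id, 0.0)
        for doc_id in set(vec) | set(bm25)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

For example, a chunk that is a close second on vector similarity but is the only strong BM25 hit for an exact term such as "kinderpsychiatrie" overtakes the top vector result after blending, which is the behaviour described in the list above.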

What Remains (Engineering, Not Research)

The following items are production engineering tasks, not research questions:

| Item | Type | Priority |
| --- | --- | --- |
| Load testing and performance optimization | Engineering | High |
| Monitoring, alerting, and observability | Operations | High |
| A/B testing framework for production | Engineering | Medium |
| Threshold tuning with real user signals | Optimization | Medium |
| UI/UX refinement (chat interface) | Design | Medium |
| Embedding model upgrade (bge-m3 deployed) | Incremental | Low |
| Agentic RAG for multi-step reasoning | Future research | Deferred |

Evaluation Coverage

Golden Questions: 178 across 20 categories

| Category | Count | Description |
| --- | --- | --- |
| condition_department | 19 | "Where do I go for condition X?" |
| multi_hop_graph | 19 | Multi-step entity traversals |
| snomed_terminology | 15 | SNOMED medical terms in Dutch |
| adversarial_gcg | 12 | Prompt injection attacks |
| out_of_scope | 12 | Non-ZOL and irrelevant questions |
| practical_info | 12 | Visiting hours, parking, phone |
| safety_refusal | 9 | Medical advice, dosage requests |
| service_info | 9 | "Does ZOL offer service X?" |
| entity_disambiguation | 8 | Ambiguous entity references |
| multilingual | 8 | French, English, German queries |
| treatment_info | 8 | Treatment procedures and details |
| campus_info | 6 | Campus locations and services |
| compound_word | 6 | Dutch compound word handling |
| doctor_department | 6 | Doctor ↔ department lookups |
| followup_chain | 6 | Multi-turn conversation chains |
| ambiguous_symptom | 5 | Vague symptom descriptions |
| navigation | 5 | Wayfinding and practical navigation |
| emergency | 3 | Emergency and urgent care |
| referral | 3 | Referral process questions |

Evaluation Methodology

  • Entity recall: Rule-based metric — expected entities present in answer
  • Faithfulness: LLM judge — answer supported by retrieved context
  • Answer relevancy: LLM judge — answer addresses the question
  • Context precision/recall: LLM judge — retrieved context quality
  • Safety refusal: Deterministic — blocked queries return refusal message
  • A/B assessment: Randomized LLM-as-judge comparison (90 questions)
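
The rule-based entity recall metric can be sketched as the fraction of expected entities that appear in the answer. Case-insensitive substring matching and the treatment of an empty expectation list as a perfect score are assumptions; the example answer and entity names are invented.

```python
def entity_recall(answer: str, expected_entities: list[str]) -> float:
    """Fraction of expected entities found in the answer (case-insensitive)."""
    if not expected_entities:
        return 1.0  # assumption: nothing expected means nothing was missed
    haystack = answer.casefold()
    found = sum(1 for entity in expected_entities if entity.casefold() in haystack)
    return found / len(expected_entities)


# Invented example: two of the three expected entities are mentioned.
answer = "U kunt terecht bij de dienst Cardiologie op Campus Noord."
score = entity_recall(answer, ["cardiologie", "Campus Noord", "dr. Voorbeeld"])
```

Because the metric is deterministic, it complements the LLM-judged metrics: a drop in entity recall between runs cannot be caused by judge variance.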

Historical Pass Rates

| Date | Configuration | Pass Rate | Judge |
| --- | --- | --- | --- |
| Feb 17 | v25 baseline | 93.3% | gpt-4.1-mini |
| Feb 20 | All features on | 97.8% | gpt-4.1-mini |
| Feb 21 | SNOMED integration | 97.8% | gpt-4.1-mini |
| Feb 22 | Post-safety fixes | 98.9% | o4-mini |
| Feb 22 | Phase C (SNOMED aliases) | 98.9% | o4-mini |
| Feb 23 | Conditional graph injection | 100.0% | gpt-5.2 |

Conclusion

The ZOL Intelligent Search system is research-complete and production-ready for the core use case: replacing keyword search with an intelligent, safety-aware, grounded search function for the ZOL hospital website.

The key architectural insight — conditional graph injection — resolves the tension between knowledge graph utility and answer quality dilution, achieving a system that:

  1. Answers structural queries (doctors, departments, campuses) with graph-sourced authority
  2. Answers content queries (conditions, treatments, symptoms) with rich document-sourced detail
  3. Rescues failed queries where vector search returns low-quality results
  4. Refuses unsafe queries with 100% accuracy across all tested attack vectors
  5. Handles Dutch medical terminology via SNOMED CT integration and fuzzy matching

The remaining work is engineering (production hardening, monitoring, A/B testing) — not research.


Addendum (March 2026)

Since the research conclusion was written, the system has been deployed as a pilot at test.medchat.health. Several architectural changes have taken place:

  • Neo4j was fully removed and replaced by PostgreSQL-based taxonomy tables (taxonomy_entities, taxonomy_relationships) with the HospitalTaxonomy class
  • Authentication was migrated from custom cookie-based JWT to Keycloak OIDC
  • The golden evaluation set was expanded to 271 questions (268 content + 3 cache)

The latest pilot evaluation achieves a 268/268 (100%) pass rate with faithfulness 0.959 and entity recall 0.902. Both scores are lower than the research-phase values (0.989 and 0.956) but remain above their ≥ 0.90 targets, so the architectural changes preserved the pass rate on a larger question set while staying within the research-phase quality targets.
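
To illustrate how a graph-style lookup can be expressed over relational taxonomy tables, the sketch below answers "which departments does this doctor work in?" with a two-way join. Only the table names come from this addendum; the columns, the `WORKS_IN` relation label, and the sample rows are hypothetical, and SQLite is used in place of PostgreSQL so the example runs without a server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical columns; only the table names come from the document.
    CREATE TABLE taxonomy_entities (
        id INTEGER PRIMARY KEY,
        entity_type TEXT NOT NULL,
        name TEXT NOT NULL
    );
    CREATE TABLE taxonomy_relationships (
        source_id INTEGER NOT NULL REFERENCES taxonomy_entities(id),
        target_id INTEGER NOT NULL REFERENCES taxonomy_entities(id),
        relation TEXT NOT NULL
    );
    INSERT INTO taxonomy_entities VALUES
        (1, 'doctor', 'Dr. Voorbeeld'),
        (2, 'department', 'Cardiologie');
    INSERT INTO taxonomy_relationships VALUES (1, 2, 'WORKS_IN');
    """
)


def departments_for_doctor(conn: sqlite3.Connection, doctor_name: str) -> list[str]:
    """One-hop traversal doctor -> department expressed as a SQL join."""
    rows = conn.execute(
        """
        SELECT dept.name
        FROM taxonomy_entities AS doc
        JOIN taxonomy_relationships AS rel ON rel.source_id = doc.id
        JOIN taxonomy_entities AS dept ON dept.id = rel.target_id
        WHERE doc.entity_type = 'doctor'
          AND dept.entity_type = 'department'
          AND rel.relation = 'WORKS_IN'
          AND doc.name = ?
        ORDER BY dept.name
        """,
        (doctor_name,),
    ).fetchall()
    return [name for (name,) in rows]
```

For the one- and two-hop lookups that dominate hospital queries, joins of this kind are sufficient, which is one plausible reason a dedicated graph database could be dropped without losing the structural queries.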