Medical Knowledge Architecture

A central design challenge in building a hospital knowledge graph is separating universal medical knowledge from hospital-specific organisational data. Which department handles which condition is a fact about medicine; which doctors work in that department at a particular hospital is a fact about that hospital. Conflating these two layers produces a system that cannot be transferred, audited, or maintained.

This page documents the Three-Source Knowledge Architecture, a principled separation of automated knowledge derivation from human curation. The architecture organises all knowledge inputs by their provenance: automated web scraping (Source 1), standards-based medical ontology (Source 2: SNOMED CT), and irreducible human judgment (Source 3: curated configuration). Prior work reports that ontology-enhanced knowledge graphs improve retrieval accuracy by 22–40% (Jimeno-Yepes et al., 2012; Soman et al., 2024), and the Belgian government mandates SNOMED CT for primary diagnoses by 2027 (FPS Public Health, 2024). Grounding the knowledge graph in these foundations is intended to give the system both academic rigour and practical reliability.

The Problem: Coupled Domain Knowledge

In early iterations of the graph extraction pipeline, medical relationships were encoded as hardcoded Python constants scattered across multiple source files. A dictionary in medical_extraction.py mapped conditions to departments; another dictionary in typed_nodes.py mapped treatments to conditions; a third in zol_taxonomy.py defined department aliases. This arrangement had three consequences:

  1. No portability. Hospital-specific data (department names, campus locations, doctor rosters) was interleaved with general medical knowledge (which conditions a cardiology department typically handles). Deploying the system for a second hospital would require disentangling hundreds of constants.
  2. No auditability. When a clinician asked "why does the graph say Neurochirurgie handles Aneurysma?", the answer was buried in a Python dictionary with no provenance, no generation date, and no model attribution.
  3. Inconsistent coverage. Each constant was maintained by hand. Some departments had 20 condition mappings; others had none. There was no systematic process to ensure completeness.

The Three-Source Knowledge Architecture resolves all three problems by cleanly separating knowledge inputs by provenance: what the hospital website says (Source 1), what the medical ontology says (Source 2), and what requires human judgment (Source 3).

Architectural Overview: Three-Source Knowledge Architecture

The system organises knowledge inputs by their provenance — the mechanism by which the knowledge was derived. This separation enables automated verification: scraped data can be re-scraped, SNOMED relationships can be validated against the ontology, and curated overrides are explicitly flagged as human judgment.

Why Three Sources?

An audit of the knowledge pipeline identified 57 hardcoded data structures totalling ~3,400 entries across 10 source files. Classification by derivability:

| Category | Constants | Entries | % |
|---|---|---|---|
| Web-scrapable (Source 1) | 12 | ~500 | 15% |
| SNOMED-derivable (Source 2) | 11 | ~700 | 20% |
| LLM-derivable (model-inferable) | 7 | ~260 | 8% |
| Curated (Source 3) | 12 | ~300 | 9% |
| Hybrid (partially derivable) | 17 | ~1,600 | 47% |

85% of hardcoded data is theoretically derivable from automated sources. For an academic project evaluated on architectural quality, hand-maintained Python dictionaries encoding medical knowledge are a defensibility risk, especially when SNOMED CT, a standards-based medical ontology mandated for Belgian primary diagnoses by 2027, is already imported into PostgreSQL.

Source Priority and Confidence

Each knowledge source carries a provenance tag and confidence score:

| Source | Provenance Tag | Confidence | Priority |
|---|---|---|---|
| Curated configuration | source: "curated" | 1.0 | Highest: human judgment overrides all |
| Web scraper | source: "scraper" | 1.0 | High: directly observed on hospital website |
| SNOMED CT | source: "snomed" | 0.7 | Lower: ontology-derived, may not reflect local practice |
| LLM enrichment | source: "llm_enrichment" | 0.8 | Medium: LLM-classified, human-reviewed |

Curated negative maps (e.g., DEPT_CONDITION_NEGATIVE_MAP) filter relationships from all sources equally, ensuring that plausibility guards apply regardless of provenance.
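
A minimal sketch of how provenance tags, confidence scores, and the curated negative map could interact. The tag names and confidence values come from the table above; the `Relationship` dataclass, the `SOURCE_PRIORITY` ordering, the `resolve()` helper, and the sample content of `DEPT_CONDITION_NEGATIVE_MAP` are illustrative assumptions, not the project's actual code.

```python
from dataclasses import dataclass

# Confidence per provenance tag, taken from the table above.
SOURCE_CONFIDENCE = {"curated": 1.0, "scraper": 1.0, "llm_enrichment": 0.8, "snomed": 0.7}
# Assumed tie-break order (lower number wins); hypothetical, not from the codebase.
SOURCE_PRIORITY = {"curated": 0, "scraper": 1, "llm_enrichment": 2, "snomed": 3}

# Hypothetical sample content; the real map is curated by hand.
DEPT_CONDITION_NEGATIVE_MAP = {"Dermatologie": {"Hartfalen"}}


@dataclass(frozen=True)
class Relationship:
    department: str
    condition: str
    source: str

    @property
    def confidence(self) -> float:
        return SOURCE_CONFIDENCE[self.source]


def resolve(candidates: list[Relationship]) -> list[Relationship]:
    """Drop blocked pairs from every source, then keep the best-priority copy of each pair."""
    best: dict[tuple[str, str], Relationship] = {}
    for rel in candidates:
        if rel.condition in DEPT_CONDITION_NEGATIVE_MAP.get(rel.department, set()):
            continue  # the negative map applies regardless of provenance
        key = (rel.department.casefold(), rel.condition.casefold())
        current = best.get(key)
        if current is None or SOURCE_PRIORITY[rel.source] < SOURCE_PRIORITY[current.source]:
            best[key] = rel
    return list(best.values())


candidates = [
    Relationship("Cardiologie", "Hartfalen", "snomed"),
    Relationship("Cardiologie", "Hartfalen", "scraper"),
    Relationship("Dermatologie", "Hartfalen", "curated"),
]
resolved = resolve(candidates)
```

In this sketch the scraped copy of the Cardiologie link replaces the SNOMED-derived one, and the Dermatologie link is removed even though it carries the highest-priority tag, because the negative map is checked before any priority comparison.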

Implementation: Code Organisation (Three Layers)

The Three-Source Architecture is implemented through three code layers that mirror the source separation. Each layer has a strict dependency direction: imports flow from general to specific, never the reverse.

Layer 1: Universal Medical Knowledge (medical_knowledge/)

Layer 1 is the portable foundation, implementing the vocabulary and relationship mappings shared by Sources 1, 2, and 3. It contains general Dutch medical knowledge applicable to any Belgian hospital — entity aliases, classification rules, plausibility guards, and LLM-generated relationship mappings. Nothing in this layer references ZOL-specific entities (campuses, department rosters, department-campus assignments). The layer is composed of three distinct module groups:

1a. Dutch Medical Vocabulary (dutch_medical_vocabulary.py)

The vocabulary module (988 lines, 12 sections) encodes general medical-domain knowledge in Dutch. It has zero imports from any other part of the codebase — only Python standard library (logging, re). This makes it trivially portable to any Dutch-speaking hospital deployment.

The 12 sections:

| Section | Contents | Purpose |
|---|---|---|
| 1. Utility Functions | safe_contains(), _strip_punctuation() | Dutch compound word matching, fuzzy text matching |
| 2. Doctor Name Cleanup | Role tokens, blocklists, clean_doctor_name() | Filter non-physician names (job titles, body parts) |
| 3. Entity Type Classification | ENTITY_TYPE_OVERRIDES, DUAL_ENTITY_MAP | 63 type overrides, 4 dual-entity maps; resolve ambiguous entities (e.g., "radiotherapie" is both a department and a treatment) |
| 4. Condition Normalization | CONDITION_ALIASES, NOT_CONDITIONS, OVERLY_BROAD_CONDITIONS | 108 patient-friendly Dutch aliases ("hoge bloeddruk" → "Hypertensie"), 12 overly-broad guards, noise filtering |
| 5. Treatment Normalization | TREATMENT_ALIASES, NOT_TREATMENTS | 60 aliases, 12 noise terms filtered |
| 6. Examination Normalization | EXAMINATION_CASING, EXAMINATION_ALIASES, lookup_examination_aliases() | 32 casing rules (incl. RX→Röntgen), 22 exam alias groups, case-insensitive lookup |
| 7. Service Aliases | SERVICE_ALIASES | Deduplication map for hospital services |
| 8. Specialty Aliases | SPECIALTY_ALIASES | Medical specialty name normalization |
| 9. Noise & Hub Guards | HUB_CONDITIONS, HUB_TREATMENTS, ENTITY_BLOCKLIST | Prevent over-connected nodes ("pijn", "tumor", "allergie") from corrupting the graph |
| 10. Domain Plausibility Maps | EXAM_DOMAIN_MAP, CONDITION_DOMAIN_MAP, IMAGING_EXAMS | Domain-group-based plausibility rules |
| 11. Plausibility Guard Functions | is_plausible_used_for(), is_plausible_performs() | Runtime validation of extracted relationships |
| 12. Resolver Functions | resolve_condition(), resolve_treatment(), resolve_specialty(), resolve_entity_type() | Query-time and extraction-time alias resolution |

Key design property: Because this module has zero project imports, it can be extracted into a standalone Python package and shared across multiple hospital deployments without modification. The separation was validated by verifying that the import graph flows strictly one-way: zol_taxonomy.py imports from dutch_medical_vocabulary.py, never the reverse.

1b. Standard Belgian Hospital Departments (belgian_hospital_departments.py)

This module defines 56 standard Belgian hospital department names — the canonical department vocabulary shared across Belgian hospitals. It serves two critical roles:

  1. Enrichment portability. The LLM enrichment script uses STANDARD_DEPARTMENTS as the allowed-values list in its classification prompts. This ensures that generated relationship mappings reference standard department names (e.g., "Cardiologie", "Neurologie") rather than hospital-specific names (e.g., "Hartcentrum Genk", "Beroertecentrum").

  2. Hospital-specific resolution. The ZOL_TO_STANDARD_MAP dictionary (38 entries) maps ZOL's branded center and program names back to their standard equivalents. During graph seeding, this mapping resolves standard names to hospital-specific canonical names. For a new hospital deployment, only this mapping needs to be replaced.

    # Example: ZOL-specific names → standard Belgian equivalents
    ZOL_TO_STANDARD_MAP = {
        "Hartcentrum Genk": "Cardiologie",
        "Limburgs Vaatcentrum": "Vaatchirurgie",
        "Beroertecentrum": "Neurologie",
        "Slaapcentrum": "Slaapgeneeskunde",
        # ...
    }

1c. Enrichment Modules (Relationship Mappings)

Five Python modules contain pure data dictionaries mapping medical entities to each other. These mappings represent standard medical knowledge — they encode facts like "Cardiologie handles Hartfalen" or "CT Onderzoeken diagnoses Aneurysma" that are true regardless of which hospital's website is being indexed.

Key properties:

  • LLM-generated, human-reviewed. The dictionaries are produced by scripts/enrich_taxonomy_llm.py, which calls a Tier 3 (reasoning/advanced) model via OpenAI with the full entity inventories from the hospital's taxonomy database. The LLM's task is not to invent relationships but to map known entities to known departments — a constrained classification task where the LLM's medical knowledge is most reliable.
  • Standard department names. The enrichment prompt uses STANDARD_DEPARTMENTS from belgian_hospital_departments.py, ensuring portable output. Post-generation resolution maps standard names back to hospital-specific canonical names during graph seeding.
  • Provenance metadata. Each file records the generation date, model used, and a note that human review is required before deployment.
  • Five relationship types:
| Module | Relationship | Description | Entities mapped | Total mappings |
|---|---|---|---|---|
| department_conditions.py | HANDLES | Which departments handle which conditions | 148 | 232 |
| department_treatments.py | OFFERS | Which departments offer which treatments | 205 | 327 |
| department_examinations.py | PERFORMS | Which departments perform which examinations | 101 | 164 |
| condition_examinations.py | DIAGNOSES | Which examinations diagnose which conditions | 132 | 308 |
| treatment_conditions.py | TREATS | Which treatments treat which conditions | 67 | 92 |

The distinction between "entities mapped" and "total mappings" reflects the many-to-many nature of medical relationships: a single condition may be handled by multiple departments, and a single department handles multiple conditions.

Layer 2: Hospital-Specific Taxonomy (zol_taxonomy.py, taxonomy/)

The second layer contains everything specific to ZOL as an organisation. Following the layer separation refactoring, zol_taxonomy.py was reduced from approximately 2,300 lines to 1,071 lines — all general Dutch medical vocabulary was extracted to Layer 1. The hospital-specific configuration is driven by a 2,294-line YAML file (zol.yaml) validated through a Pydantic schema. What remains is strictly ZOL-specific:

  • Campus definitions (exactly 4): Sint-Jan, André Dumont, Sint-Barbara, and Maas en Kempen, each with addresses, aliases, and contact information.
  • Department-campus mappings (DEPARTMENT_CAMPUS_MAP, 149 entries): Which departments operate at which campuses — a fact about ZOL's organisational structure, not about medicine.
  • Department roster (DEPARTMENTS, 67 entries): The canonical list of ZOL departments with their aliases, including ZOL-specific branded names (e.g., "Limburgs Vaatcentrum", "Beweegsaam").
  • Center-department mappings (CENTER_DEPARTMENT_MAP, 21 entries, 50 links): Maps multidisciplinary centres to their constituent departments, enabling transitive doctor inference (see Center–Doctor Inference).
  • ZOL-specific domain knowledge maps: DEPT_CONDITION_MAP, DEPT_TREATMENT_MAP, EXAM_PERFORMS_MAP — overrides derived from ZOL's hub pages that take precedence over universal knowledge.
  • Search aliases (SEARCH_ALIASES, 239 entries): Query-time resolution rules that map user search terms to ZOL department names (e.g., "hartfilmpje" to "ECG", "NMR" to "MRI").
  • Doctor page classification: URL patterns and rules for identifying authoritative doctor profile pages on ZOL's website.

The taxonomy/ package provides a FrozenTaxonomyRegistry scraped from ZOL's hub pages (authoritative listing pages for all doctors, departments, conditions, and treatments). The registry offers O(1) lookups by canonical name, with pre-built indexes for 359 doctors, 85 department aliases, 168 condition aliases, 217 treatment aliases, and 104 examination aliases.

Import direction: zol_taxonomy.py imports from dutch_medical_vocabulary.py (8 symbols: aliases, guards, utility functions). The reverse import does not exist. This enforces the architectural constraint that general medical knowledge never depends on hospital-specific data.

Layer 3: Graph Seeding (GoldenPageSeeder)

The third layer combines the previous two. The GoldenPageSeeder merges universal medical knowledge with hospital-specific taxonomy data and seeds the resulting relationships into PostgreSQL taxonomy tables:

    def _merge_knowledge_maps(
        universal: dict[str, list[str]],
        hospital: dict[str, list[str]],
    ) -> dict[str, list[str]]:
        """Merge universal medical knowledge with hospital-specific overrides.

        Hospital-specific values take precedence. Deduplication is case-insensitive.
        """

The merge follows a simple precedence rule: hospital-specific data takes priority. If the taxonomy's hub pages say that "Hartfalen" is handled by "Cardiologie" and "Geriatrie", and the universal knowledge layer also maps "Hartfalen" to "Cardiologie" and "Interne Geneeskunde", the merged result includes all three departments — but "Cardiologie" and "Geriatrie" (from the hospital's own data) are listed first, and duplicate entries are removed via case-insensitive deduplication.
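
The precedence rule can be sketched as follows. This is not the project's implementation: the function name `merge_knowledge_maps` is a stand-in, and matching the dictionary keys case-insensitively is an assumption (the docstring above only states that deduplication is case-insensitive).

```python
def merge_knowledge_maps(
    universal: dict[str, list[str]],
    hospital: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Hospital values first, then universal values not already present (case-insensitive)."""
    merged: dict[str, list[str]] = {}
    canonical_key: dict[str, str] = {}  # casefolded key -> key spelling used in the output
    # Process the hospital map first so that its spelling and ordering win.
    for source in (hospital, universal):
        for key, values in source.items():
            out_key = canonical_key.setdefault(key.casefold(), key)
            bucket = merged.setdefault(out_key, [])
            seen = {v.casefold() for v in bucket}
            for value in values:
                if value.casefold() not in seen:
                    seen.add(value.casefold())
                    bucket.append(value)
    return merged


hospital = {"Hartfalen": ["Cardiologie", "Geriatrie"]}
universal = {"hartfalen": ["cardiologie", "Interne Geneeskunde"]}
merged = merge_knowledge_maps(universal, hospital)
# merged == {"Hartfalen": ["Cardiologie", "Geriatrie", "Interne Geneeskunde"]}
```

The result reproduces the Hartfalen example: the two hospital departments come first, the duplicate "cardiologie" is dropped, and "Interne Geneeskunde" is appended from the universal layer.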

Layer Separation: Implementation Details

The layer separation refactoring was the most significant architectural change to the knowledge architecture since its initial design. This section documents how the separation was executed and what it enables.

Before: Monolithic Taxonomy

Before the refactoring, zol_taxonomy.py was a single 2,300+ line file containing both general Dutch medical knowledge and ZOL-specific data. The file had grown organically through iterative graph quality fixes (v1 through v7), accumulating condition aliases, treatment aliases, examination casing rules, entity type overrides, plausibility guards, hub condition blocklists, and domain plausibility maps — all interleaved with ZOL campus definitions and department mappings.

The dependency graph was flat: every module that needed medical knowledge imported from zol_taxonomy.py, making it impossible to reuse medical knowledge without also importing ZOL-specific data.

After: Three-Layer Dependency Graph

The refactoring enforced a strict dependency rule: imports flow from general to specific, never the reverse. The dutch_medical_vocabulary.py module sits at the bottom of the dependency graph with zero project imports. zol_taxonomy.py imports 8 symbols from it (aliases, guards, utility functions). Consumer modules (medical_extraction.py, typed_nodes.py) import from both layers but never create circular dependencies.
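
One way such a dependency rule could be checked automatically is to parse a Layer 1 module and list any imports of Layer 2 modules. The sketch below uses Python's standard `ast` module; the helper name `layer2_imports` and the `LAYER2_MODULES` set are assumptions for illustration, and the page does not state how the project itself verified the import graph.

```python
import ast

# Module names treated as hospital-specific (Layer 2) in this sketch.
LAYER2_MODULES = {"zol_taxonomy", "taxonomy"}


def layer2_imports(source: str) -> list[str]:
    """Return the dotted names of Layer 2 modules imported by the given source text."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        found.extend(
            name for name in names
            if any(part in LAYER2_MODULES for part in name.split("."))
        )
    return found


clean_source = "import logging\nimport re\n"
dirty_source = "import re\nfrom zol_taxonomy import DEPARTMENTS\n"
```

In a test suite, the source text would be read from dutch_medical_vocabulary.py (for example with `pathlib.Path.read_text()`) and the test would assert that the returned list is empty.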

What Moved Where

| Content | Before (location) | After (location) | Lines moved |
|---|---|---|---|
| Condition aliases (108) | zol_taxonomy.py | dutch_medical_vocabulary.py | ~80 |
| Treatment aliases (60) | zol_taxonomy.py | dutch_medical_vocabulary.py | ~40 |
| Examination aliases (22 groups) | zol_taxonomy.py | dutch_medical_vocabulary.py | ~60 |
| Entity type overrides | zol_taxonomy.py | dutch_medical_vocabulary.py | ~50 |
| Hub condition/treatment guards | zol_taxonomy.py | dutch_medical_vocabulary.py | ~80 |
| Domain plausibility maps | zol_taxonomy.py | dutch_medical_vocabulary.py | ~100 |
| Doctor name cleanup rules | zol_taxonomy.py | dutch_medical_vocabulary.py | ~120 |
| Plausibility functions | zol_taxonomy.py | dutch_medical_vocabulary.py | ~70 |
| Resolver functions | zol_taxonomy.py | dutch_medical_vocabulary.py | ~50 |
| Service/specialty aliases | zol_taxonomy.py | dutch_medical_vocabulary.py | ~30 |
| Noise guards & blocklists | zol_taxonomy.py | dutch_medical_vocabulary.py | ~50 |
| Standard dept names | (did not exist) | belgian_hospital_departments.py | 138 (new) |

The total migration moved approximately 830 lines of general medical knowledge out of the hospital-specific module. Subsequent growth from iterative quality fixes brought the current counts to dutch_medical_vocabulary.py (988 lines) and zol_taxonomy.py (1,071 lines, down from ~2,300). The reduction in zol_taxonomy.py is even more dramatic because the YAML-driven hospital configuration (zol.yaml, 2,294 lines) now carries the bulk of hospital-specific data declarations.

Registry Fallback Enhancement

The FrozenTaxonomyRegistry (Layer 2) provides hospital-specific entity resolution at query time and extraction time. When the registry does not contain a match for an entity, the system now falls back to the general medical knowledge maps in Layer 1 (dutch_medical_vocabulary.py) rather than to ZOL-specific maps. This ensures that fallback resolution remains portable and does not introduce hospital-specific assumptions into the general resolution path.
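
The fallback order can be illustrated with a small stand-in. `FrozenRegistrySketch` is not the real FrozenTaxonomyRegistry; its constructor, the `lookup_condition()` method, and the sample alias "hartzwakte" are hypothetical, chosen only to show a read-only index with dictionary lookups followed by a Layer 1 fallback.

```python
from types import MappingProxyType
from typing import Mapping, Optional

# Stand-in for a Layer 1 alias map (general Dutch medical vocabulary).
CONDITION_ALIASES = {"hoge bloeddruk": "Hypertensie"}


class FrozenRegistrySketch:
    """Read-only alias index with dictionary (average O(1)) lookups."""

    def __init__(self, condition_index: Mapping[str, str]) -> None:
        # Normalise keys once at build time; expose only a read-only view afterwards.
        self._conditions = MappingProxyType(
            {alias.casefold(): name for alias, name in condition_index.items()}
        )

    def lookup_condition(self, term: str) -> Optional[str]:
        return self._conditions.get(term.strip().casefold())


def resolve_condition_with_fallback(term: str, registry: FrozenRegistrySketch) -> Optional[str]:
    """Try the hospital registry first, then the general Layer 1 aliases."""
    hit = registry.lookup_condition(term)
    if hit is not None:
        return hit
    return CONDITION_ALIASES.get(term.strip().casefold())


# Hypothetical registry content.
registry = FrozenRegistrySketch({"Hartfalen": "Hartfalen", "hartzwakte": "Hartfalen"})
```

Because the fallback consults only the general vocabulary, the same function would behave identically for another hospital whose registry lacks the term.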

Design Rationale

Why Separate Universal Knowledge from Hospital Data?

The separation is motivated by three engineering concerns and one epistemological principle.

1. Portability. If the system is deployed for a second hospital (e.g., Novation's other clients), Layer 1 transfers unchanged. Only Layer 2 needs to be rebuilt — scrape the new hospital's hub pages, define its campuses, and configure its department aliases. The medical knowledge that "Cardiologie handles Hartfalen" remains valid.

2. Testability. Layer 1 modules are pure data with zero side effects. They can be validated by a domain expert reading a Python dictionary — no database connections, no API calls, no runtime state. Layer 2 can be tested against the live hospital website. Layer 3 can be tested with mocked inputs from both layers.

3. Auditability. Each layer has a clear provenance chain. Layer 1 dictionaries record their generation model and date. Layer 2 data is scraped from URLs that can be revisited. Layer 3's merge logic is a deterministic function. When a clinician questions why the graph contains a particular relationship, the provenance chain identifies exactly where it came from.

4. The epistemological principle. Medical knowledge and organisational knowledge are fundamentally different kinds of facts. "Cardiologie handles Hartfalen" is a fact about medicine — it is true at ZOL, at UZ Leuven, and at any hospital with a cardiology department. "Dr. Peeters works in Cardiologie on Monday and Wednesday at campus Sint-Jan" is a fact about ZOL's staffing schedule. Mixing these in a single data structure obscures the difference and makes it impossible to reason about what is universally true versus what is locally configured.

Why LLM-Generated Knowledge?

The relationship mappings in Layer 1 could have been authored manually by domain experts. The decision to use LLM generation instead was driven by three factors:

Coverage. The ZOL taxonomy contains 65 departments, 168 conditions, 217 treatments, and 104 examinations. Manually mapping the cross-product of these entities is a combinatorial task: 65 × 168 = 10,920 potential HANDLES relationships alone. An LLM processes the full matrix in minutes.

Medical accuracy. Modern LLMs trained on medical literature encode extensive knowledge of which departments handle which conditions. For the constrained task of classifying known entities into known departments, LLM accuracy is high — the model is not generating novel medical knowledge but recognising established associations.

Human review as the quality gate. The LLM output is written to Python files that are committed to version control. A domain expert reviews the diff before deployment. The LLM handles the bulk classification; the human handles the edge cases. This division of labour is more efficient than either approach alone.

SNOMED CT as Source 2: From Query-Time to Seeding-Time

SNOMED CT is the international clinical-terminology standard; its full role, the 5-tier matcher, and the query-time mechanism are documented canonically on the SNOMED CT Terminology page. This section covers only how it participates as Source 2 of the three-source merge. Its integration evolved through two phases:

Phase 1 (Implemented): Query-time synonym expansion — patient terms resolve to clinical synonyms and unknown conditions route to departments via FINDING_SITE. See SNOMED CT Terminology for the mechanism.

Phase 2 (Approved Design): Seeding-time graph enrichment. SNOMED CT becomes a first-class knowledge source at graph seeding time, not just a query-time fallback. This phase addresses the root cause of poor performance on SNOMED-specific queries (4/15 pass rate, 26.7%): conditions exist as graph nodes but lack HANDLES relationships to the correct departments. SNOMED FINDING_SITE data auto-creates these missing relationships.

| Enrichment | Mechanism | Impact |
|---|---|---|
| Concept IDs on nodes | Match entity names against snomed_descriptions | Deterministic concept-level matching; language-independent identity |
| Dutch synonyms as properties | Fetch all Dutch descriptions for matched concept | Replaces ~250 hand-maintained aliases with 656K SNOMED descriptions |
| IS_A hierarchy relationships | Traverse snomed_transitive_closure between existing graph nodes | "Find all subtypes" queries (diabetes → type 1, type 2, gestational) |
| FINDING_SITE → HANDLES | Condition concept → body structure → department | Auto-creates missing condition→department links (target: +7 SNOMED golden questions) |
| PROCEDURE_SITE → OFFERS | Treatment concept → body structure → department | Auto-creates missing treatment→department links |

Scoping guard: Only the ~260 entities already in the taxonomy are enriched. SNOMED's 356K concepts are NOT bulk-imported. The taxonomy grows by ~50–100 IS_A relationships and ~100–200 SNOMED-derived HANDLES/OFFERS — negligible growth with significant search quality improvement.

Why SNOMED CT complements (not replaces) LLM enrichment: SNOMED CT maps clinical concepts, not hospital workflows. It encodes that "Heart failure" IS_A "Disorder of cardiovascular system" and has FINDING_SITE "Heart structure." It does NOT encode that heart failure patients in a Belgian hospital should be directed to the Cardiology department. The bridge from body structures to hospital departments requires an organisational mapping layer (BODY_STRUCTURE_TO_DEPARTMENT, 47 entries) that is curated but universal across Belgian hospitals. For the department→condition/treatment/examination relationships that lack anatomical grounding, the LLM enrichment pipeline remains the primary source.
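
The two-hop derivation (condition → body structure → department) might look like the sketch below. All sample rows are hypothetical: in the described design the finding sites would come from the SNOMED CT tables in PostgreSQL and the bridge from the curated BODY_STRUCTURE_TO_DEPARTMENT map, and the `derive_handles()` helper is an illustrative name.

```python
# Hypothetical sample rows standing in for SNOMED FINDING_SITE data.
FINDING_SITE = {"Hartfalen": ["Heart structure"], "Niersteen": ["Kidney structure"]}
# Hypothetical excerpt of the curated bridge (the page cites 47 entries).
BODY_STRUCTURE_TO_DEPARTMENT = {"Heart structure": ["Cardiologie"]}
# Links that already exist in the graph and must not be duplicated.
EXISTING_HANDLES = {("Cardiologie", "Hartritmestoornis")}


def derive_handles(taxonomy_conditions: list[str]) -> list[dict]:
    """Propose HANDLES links for taxonomy conditions that have a mapped finding site."""
    proposals = []
    for condition in taxonomy_conditions:  # scoping guard: taxonomy entities only
        for site in FINDING_SITE.get(condition, []):
            for dept in BODY_STRUCTURE_TO_DEPARTMENT.get(site, []):
                if (dept, condition) in EXISTING_HANDLES:
                    continue
                proposals.append(
                    {"department": dept, "condition": condition,
                     "source": "snomed", "confidence": 0.7}
                )
    return proposals


proposals = derive_handles(["Hartfalen", "Niersteen"])
```

"Niersteen" yields no proposal here because its body structure has no entry in the bridge map, which is the kind of gap the text assigns to the LLM enrichment pipeline.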

The Dutch Medical Terminology Challenge

Building a medical search system in Dutch presents unique challenges that do not arise in English-language systems. This section documents the linguistic phenomena that drive the vocabulary architecture.

Morphological Complexity

Dutch is a head-final compounding language (Booij, 2012) in which medical terms routinely appear as single compounds without spaces: hartchirurgie (heart surgery), bloedonderzoek (blood test), ruggenmergtumor (spinal cord tumour). A keyword search for "hart" (heart) will not match "hartchirurgie" unless the system explicitly decomposes the compound or maintains an alias map. The safe_contains() utility function in the vocabulary module implements substring matching that handles this phenomenon, allowing "hart" to match "hartchirurgie", "hartritmestoornissen", and "hartfalen" without requiring morphological decomposition.
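
A minimal sketch of substring-based compound matching. The actual signature and behaviour of safe_contains() are not shown on this page, so the function below uses a different name, `compound_contains`, and its minimum-length guard is an assumption.

```python
def compound_contains(needle: str, haystack: str, min_length: int = 3) -> bool:
    """Case-insensitive substring test with a minimum needle length.

    The length guard is an assumption: very short needles would match
    almost any Dutch compound and are therefore rejected.
    """
    needle = needle.strip().casefold()
    if len(needle) < min_length:
        return False
    return needle in haystack.casefold()
```

For example, `compound_contains("hart", "Hartchirurgie")` is true, while `compound_contains("hart", "bloedonderzoek")` is false. Plain substring matching can also match unrelated words that happen to contain the needle, which is one reason alias maps remain useful alongside it.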

Register Gap: Patient vs. Clinical Terminology

Hospital websites simultaneously serve two audiences — patients and referring physicians — who use fundamentally different vocabularies for the same concepts:

| Patient Dutch | Clinical Dutch | Latin/International |
|---|---|---|
| hoge bloeddruk | hypertensie | hypertensio arterialis |
| suikerziekte | diabetes mellitus | diabetes mellitus type 2 |
| beroerte | cerebrovasculair accident | CVA |
| spataders | varices | varicosis |
| grijze staar | cataract | cataracta senilis |
| open rug | spina bifida | myelomeningocele |
| huidkanker | melanoom | melanoma malignum |
| nierstenen | urolithiasis | nephrolithiasis |

The CONDITION_ALIASES map (108 entries) bridges this gap by mapping patient-friendly terms to their canonical clinical equivalents. Similarly, TREATMENT_ALIASES (60 entries) normalises treatment terminology, and EXAMINATION_ALIASES (22 groups) maps colloquial examination names to canonical forms (e.g., "NMR" → "MRI", "bloedafname" → "Bloedonderzoek").

Ambiguity: Departments, Treatments, and Conditions

In Dutch hospital terminology, a single term frequently denotes multiple entity types simultaneously. "Radiotherapie" is both a department (an organisational unit with staff) and a treatment (a therapeutic modality). "Orthopedie" is a department and a medical specialty. "Dialyse" is a treatment but "Nefrologie" (the department that provides it) is a specialty.

The ENTITY_TYPE_OVERRIDES map (63 entries) and DUAL_ENTITY_MAP (4 entries) resolve these ambiguities at extraction time. The DUAL_ENTITY_MAP is specifically designed for cases where both interpretations are correct — "Radiotherapie" genuinely needs to exist as both a department node and a treatment node in the knowledge graph:

    DUAL_ENTITY_MAP = {
        "radiotherapie": ["Bestralingstherapie", "Uitwendige bestraling", "Brachytherapie"],
        "nucleaire geneeskunde": ["Radioisotopentherapie", "PET-CT"],
        "neonatologie": ["Neonatale Zorg"],
        "klinische biologie": ["Laboratoriumonderzoek"],
    }

Hub Node Prevention

Certain medical terms, such as "pijn" (pain), "tumor", "allergie" (allergy), and "infectie" (infection), are so broadly applicable that they would create hub nodes connecting to nearly every department in the graph if treated as conditions. Such hubs degrade search specificity: a query for "pijn" would return every department rather than routing to the Multidisciplinair Pijncentrum (Pain Centre) where it belongs.

The HUB_CONDITIONS (8 entries) and HUB_TREATMENTS (5 entries) blocklists prevent these terms from creating HANDLES/TREATS relationships. The OVERLY_BROAD_CONDITIONS (12 entries) extends this guard to terms like "koorts" (fever), "vermoeidheid" (fatigue), and "hoofdpijn" (headache) that are symptoms rather than conditions.
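
A sketch of how the two blocklists could be applied when candidate HANDLES pairs are produced. The sets below contain only the terms quoted on this page, not the full lists, and `filter_handles()` is an illustrative helper name.

```python
# Terms quoted on this page; the real blocklists hold 8 and 12 entries respectively.
HUB_CONDITIONS = {"pijn", "tumor", "allergie", "infectie"}
OVERLY_BROAD_CONDITIONS = {"koorts", "vermoeidheid", "hoofdpijn"}


def filter_handles(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Drop (department, condition) pairs whose condition is a hub or overly broad term."""
    blocked = HUB_CONDITIONS | OVERLY_BROAD_CONDITIONS
    return [(dept, cond) for dept, cond in pairs if cond.casefold() not in blocked]


pairs = [("Neurologie", "Hoofdpijn"), ("Oncologie", "Tumor"), ("Cardiologie", "Hartfalen")]
kept = filter_handles(pairs)
# kept == [("Cardiologie", "Hartfalen")]
```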

Examination Normalisation

Medical imaging examinations present a specific normalisation challenge. The ZOL website uses inconsistent naming for the same examination: "RX", "RX-onderzoek", "RX Onderzoeken", "RX-beeld", and "Röntgen" all refer to X-ray imaging. Similarly, "NMR", "MRI", and "Kernspinresonantie" all refer to magnetic resonance imaging.

The EXAMINATION_CASING map (32 entries) canonicalises these variants to a single display name. The v18 update added 4 RX→Röntgen consolidation entries to eliminate fragmented examination nodes in the graph:

    "rx": "Röntgen",
    "rx-onderzoek": "Röntgen",
    "rx onderzoeken": "Röntgen",
    "rx-beeld": "Röntgen",

Specific RX procedures (RX Arthrografie, RX Colon) retain their identity as distinct examinations — the normalisation only collapses generic RX references.
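
The difference between generic and specific RX names follows naturally if the lookup is an exact match on the normalised name rather than a prefix match. The sketch below assumes that behaviour; `normalise_examination()` is an illustrative name and the dictionary holds only the four entries shown above.

```python
EXAMINATION_CASING = {
    "rx": "Röntgen",
    "rx-onderzoek": "Röntgen",
    "rx onderzoeken": "Röntgen",
    "rx-beeld": "Röntgen",
}


def normalise_examination(name: str) -> str:
    """Exact-match lookup on the casefolded name; unknown names pass through unchanged."""
    cleaned = " ".join(name.split())  # collapse repeated whitespace
    return EXAMINATION_CASING.get(cleaned.casefold(), cleaned)
```

With this lookup, "RX" and "RX Onderzoeken" both become "Röntgen", while "RX Arthrografie" is returned as-is because it has no entry of its own.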

Current Vocabulary Coverage

| Category | Entries | Examples |
|---|---|---|
| Condition aliases | 108 | "hoge bloeddruk" → Hypertensie, "suikerziekte" → Diabetes Mellitus |
| Treatment aliases | 60 | "hartoperatie" → Hartchirurgie, "nierdialyse" → Hemodialyse |
| Examination alias groups | 22 | Röntgen (4 aliases), MRI (3 aliases), Echografie (2 aliases) |
| Examination casing rules | 32 | "rx" → Röntgen, "nmr" → MRI, "pet" → PET-CT |
| Entity type overrides | 63 | "dialyse" → treatment, "biopsie" → examination |
| Service aliases | 17 | "afspraak" → Afspraken, "comfort care" → Comfortzorg |
| Search aliases | 239 | "hartfilmpje" → ECG, "scan" → Radiologie |
| Hub/noise guards | 46 | 8 hub conditions + 5 hub treatments + 12 overly broad + 21 blocklist |

Together with the SNOMED CT synonym expansion layer (656K Dutch descriptions), the system resolves patient-facing Dutch, clinical Dutch, and international medical terminology to the same canonical entities.

The LLM Enrichment Pipeline

The scripts/enrich_taxonomy_llm.py script implements the generation pipeline for Layer 1's relationship mappings. Its design reflects the principle that LLMs are classifiers, not inventors: the script provides the full list of entities and asks the LLM to classify relationships between them, rather than asking it to generate entities from scratch.

Standard Department Names for Portability

A key design decision in the enrichment pipeline is the use of standard Belgian hospital department names rather than hospital-specific names in the LLM classification prompt. The pipeline loads STANDARD_DEPARTMENTS (55 entries) from belgian_hospital_departments.py as the allowed-values list for the "department" field in LLM output. This ensures that the generated relationship mappings are portable across hospitals.

The resolution flow:

When the GoldenPageSeeder seeds the taxonomy, a post-generation resolution step maps standard names back to hospital-specific canonical names using the ZOL_TO_STANDARD_MAP. For a new hospital deployment, only the *_TO_STANDARD_MAP dictionary needs to be replaced — the enrichment modules themselves remain unchanged.
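
Since ZOL_TO_STANDARD_MAP points from hospital names to standard names, resolving in the other direction requires an inverse lookup. The sketch below shows one possible form; both helper names are hypothetical, and falling back to the standard name when no hospital-specific name exists is an assumption.

```python
from collections import defaultdict

# Entries quoted from this page; the real map has 38.
ZOL_TO_STANDARD_MAP = {
    "Hartcentrum Genk": "Cardiologie",
    "Limburgs Vaatcentrum": "Vaatchirurgie",
    "Beroertecentrum": "Neurologie",
    "Slaapcentrum": "Slaapgeneeskunde",
}


def build_standard_to_hospital(hospital_to_standard: dict[str, str]) -> dict[str, list[str]]:
    """Invert the map; one standard name may correspond to several hospital names."""
    inverse: dict[str, list[str]] = defaultdict(list)
    for hospital_name, standard_name in hospital_to_standard.items():
        inverse[standard_name].append(hospital_name)
    return dict(inverse)


def resolve_department(standard_name: str, inverse: dict[str, list[str]]) -> list[str]:
    """Return hospital-specific names, or the standard name itself if none is mapped."""
    return inverse.get(standard_name, [standard_name])


inverse = build_standard_to_hospital(ZOL_TO_STANDARD_MAP)
```

The inverse holds lists because several branded names can share one standard department; a deployment for another hospital would pass its own map to the same two functions.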

Pipeline Stages

Stage 1: Entity Filtering. Raw entity names from the taxonomy database include noise — section headers from web pages ("Brochures 2", "Dag 1"), generic medical terms ("Aandoeningen", "Behandeling"), and very short strings. The filtering stage applies:

  • Section header detection (pattern-based)
  • Generic term removal (curated blocklist of ~60 terms)
  • Minimum length threshold (3 characters)
  • Case-insensitive deduplication
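
The four filters can be sketched as a single pass over the raw names. The header pattern and the two-term blocklist are hypothetical, derived only from the examples given above; the real blocklist has about 60 terms and the real pattern is not shown on this page.

```python
import re

# Hypothetical pattern covering the examples "Brochures 2" and "Dag 1".
SECTION_HEADER = re.compile(r"^(brochures?|dag)\s+\d+$", re.IGNORECASE)
# Hypothetical excerpt of the curated generic-term blocklist.
GENERIC_TERMS = {"aandoeningen", "behandeling"}
MIN_LENGTH = 3


def filter_entities(raw_names: list[str]) -> list[str]:
    """Apply the four filters and keep the first spelling of each entity."""
    kept: list[str] = []
    seen: set[str] = set()
    for name in raw_names:
        cleaned = " ".join(name.split())
        key = cleaned.casefold()
        if len(cleaned) < MIN_LENGTH:
            continue  # minimum length threshold
        if SECTION_HEADER.match(cleaned) or key in GENERIC_TERMS:
            continue  # section headers and generic terms
        if key in seen:
            continue  # case-insensitive deduplication
        seen.add(key)
        kept.append(cleaned)
    return kept


raw = ["Brochures 2", "Dag 1", "Aandoeningen", "CT", "Hartfalen", "hartfalen", "Aneurysma"]
filtered = filter_entities(raw)
# filtered == ["Hartfalen", "Aneurysma"]
```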

Stage 2: LLM Classification. Five separate LLM calls, one per relationship type. Each call provides the full list of filtered entities and the full list of standard Belgian departments (not hospital-specific names), asking the LLM to produce a JSON mapping. The prompt is structured to minimise hallucination: the LLM can only use entity names that appear in the provided lists.

Stage 3: Output Validation. The LLM occasionally maps entities to department names that do not exist in the standard list (misspellings, synonyms). The validation stage removes any mapping whose target is not in the known department set, with case-insensitive matching and deduplication.
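
A sketch of the validation step, assuming the LLM output has already been parsed into a dictionary from entity name to department list. `validate_mapping()` is an illustrative name, the department list is a three-item excerpt, and the invalid targets in the sample are made up.

```python
STANDARD_DEPARTMENTS = ["Cardiologie", "Neurologie", "Vaatchirurgie"]  # excerpt only


def validate_mapping(raw: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep only allowed targets, restoring canonical casing and removing duplicates."""
    canonical = {name.casefold(): name for name in STANDARD_DEPARTMENTS}
    validated: dict[str, list[str]] = {}
    for entity, departments in raw.items():
        kept: list[str] = []
        for dept in departments:
            match = canonical.get(dept.strip().casefold())
            if match is not None and match not in kept:
                kept.append(match)
        if kept:  # entities left without any valid target are dropped
            validated[entity] = kept
    return validated


# Hypothetical LLM output containing a casing duplicate and two unknown targets.
llm_output = {
    "Hartfalen": ["cardiologie", "Cardiologie", "Hartziekten"],
    "Aneurysma": ["Vaatheelkunde"],
}
validated = validate_mapping(llm_output)
# validated == {"Hartfalen": ["Cardiologie"]}
```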

Stage 4: Hub Page Merge. The hospital's hub pages contain authoritative relationship data scraped directly from the website (e.g., a department page that lists the conditions it handles). This data is merged with the LLM output using case-insensitive key matching, with hub page data taking precedence. This ensures that hospital-verified relationships are preserved even if the LLM disagrees.

Stage 5: Post-Merge Cleanup. A final pass removes any remaining noise keys that survived into the merged output (e.g., section headers that appeared as condition names on hub pages).

Stage 6: Output. The pipeline writes five Python modules (one per relationship type), an enrichment review report (Markdown), and the raw JSON output for auditing.

Quality Assurance

The enrichment review report (enrichment_review.md) provides transparency into the generation process:

## HANDLES
- Total entities mapped: 149
- Total mappings: 233
- Confirmed by hub pages: 139
- New from LLM: 10

This shows that 93% of the mapped HANDLES entities (139 of 149) were confirmed by the hospital's own hub page data. Only 10 came exclusively from the LLM; these are the ones requiring the most careful human review.

For cross-entity relationships (DIAGNOSES, TREATS), the hub page confirmation rate is 0% because these relationships are not explicitly stated on hospital web pages. They represent general medical knowledge that only the LLM can provide. This is expected and acceptable: the relationships are medically standard (e.g., "CT Onderzoeken diagnoses Aneurysma") and can be validated by any clinician.

Center–Doctor Inference

A specific challenge in hospital knowledge graphs is linking doctors to multidisciplinary centres. Centres like the Borstcentrum (Breast Centre) or Beroertecentrum (Stroke Centre) are not departments with their own staff rosters — they are cross-departmental programs whose doctors are employed by constituent departments (Oncologie, Neurologie, Chirurgie, etc.). The ZOL website lists doctors under departments, not centres. Without explicit centre–doctor relationships, a patient searching for "borstcentrum dokter" would find no results.

The Transitive Inference Pattern

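The CENTER_DEPARTMENT_MAP (YAML-driven, 21 centres, 50 department links) defines the constituent departments for each centre. At seeding time, _infer_center_doctor_links() applies a one-step transitive inference: if a doctor works in a department and that department belongs to a centre, the doctor is linked to the centre.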

For each centre C:
    For each constituent department D in CENTER_DEPARTMENT_MAP[C]:
        For each doctor who WORKS_IN D:
            Create WORKS_IN(doctor, C) with confidence=0.7

The confidence=0.7 (vs. 1.0 for direct WORKS_IN from hub pages) signals that the relationship is inferred rather than directly observed. This distinction is available to downstream consumers (e.g., search ranking) but is not currently surfaced to users.
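
The inference can be sketched in Python as follows. This is an illustration under stated assumptions, not the body of _infer_center_doctor_links(): the WorksIn record, the function signature, and the doctor names are hypothetical, and the sketch additionally assumes that a doctor reachable through two constituent departments should receive only one inferred link per centre.

```python
from dataclasses import dataclass

INFERRED_CONFIDENCE = 0.7  # direct hub page links use 1.0


@dataclass(frozen=True)
class WorksIn:
    doctor: str
    unit: str  # a department or a centre
    confidence: float


def infer_center_doctor_links(center_department_map, direct_links):
    """Derive centre-level WORKS_IN links from department-level ones."""
    doctors_by_department = {}
    for link in direct_links:
        doctors_by_department.setdefault(link.unit, []).append(link.doctor)

    inferred = {}  # (doctor, centre) -> link, so duplicates collapse
    for center, departments in center_department_map.items():
        for department in departments:
            for doctor in doctors_by_department.get(department, []):
                inferred[(doctor, center)] = WorksIn(
                    doctor, center, INFERRED_CONFIDENCE
                )
    return list(inferred.values())


center_map = {"Borstcentrum": ["Oncologie", "Chirurgie"]}
direct = [
    WorksIn("Dr. A", "Oncologie", 1.0),
    WorksIn("Dr. A", "Chirurgie", 1.0),
    WorksIn("Dr. B", "Neurologie", 1.0),
]
print(infer_center_doctor_links(center_map, direct))
```

Here "Dr. A" is linked to the Borstcentrum once, and "Dr. B" is not linked at all because Neurologie is not a constituent department in this sample map.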

Coverage

| Category | Count |
| --- | --- |
| Total centres | 21 |
| Centres with department mappings | 21 (100%) |
| Total department→centre links | 50 |
| Centres with campus entries | 21 (100%) |

The mapping covers all centre types: clinical centres (Borstcentrum, Beroertecentrum), rehabilitation programs (BeweegSaam, Revalidatie, Cognitieve Revalidatie), specialised clinics (Endometriosekliniek, Diabetische Voetkliniek), and support programs (Infectiepreventie, Zorgeenheid Gaudium).

Integration with the RAG Pipeline

The medical knowledge layer enhances the RAG pipeline at two points: graph seeding (ingestion time) and graph querying (query time).

At Ingestion Time

When the GoldenPageSeeder runs, it merges Layer 1 (universal knowledge) with Layer 2 (hospital taxonomy) and stores relationships in PostgreSQL taxonomy tables. The resulting data quality is monitored by an automated Database Doctor service:

| Metric | Score |
| --- | --- |
| Naming Consistency | 94/100 |
| Search Effectiveness | 92/100 |
| Doctors with department links | 100% |
| Departments with campus links | 100% |
| Centres with inferred doctors | 100% (via CENTER_DEPARTMENT_MAP) |
| Orphan nodes | 0 |

The Database Doctor evaluates naming consistency (canonical name adherence, alias coverage, casing uniformity) and search effectiveness (relationship connectivity, entity-to-document coverage, cross-entity navigability). Both scores are in the diminishing-returns zone, indicating a production-ready knowledge base.

At Query Time

When a user asks "Welke onderzoeken worden gebruikt om hartfalen vast te stellen?" (Which examinations are used to diagnose heart failure?), the query pipeline:

  1. Intent classification identifies this as a condition-information query
  2. Entity extraction identifies "hartfalen" as a condition
  3. Taxonomy resolution resolves "hartfalen" to the canonical condition name (with SNOMED synonym fallback)
  4. DIAGNOSES traversal queries the taxonomy tables for examinations linked to this condition via the DIAGNOSES relationship
  5. SNOMED synonym matching (Phase B): queries can also match via snomed_synonyms property on nodes — e.g., "suikerziekte" matches Diabetes Mellitus without requiring a static alias entry
  6. Response generation formats the results with source citations

Without the knowledge graph relationships, this query would fall through to vector search and return generic text passages that mention heart failure — useful but imprecise. With the graph, the system returns a structured list of specific diagnostic examinations (Echocardiografie, Bloedafname, Thorax RX) with the departments that perform them.
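
The graph-first behaviour described above can be sketched as a lookup that returns structured results when DIAGNOSES links exist and signals a fallback otherwise. The function name, the in-memory dictionaries standing in for the taxonomy tables, and the performing departments are all illustrative assumptions; only the three examination names come from the example above.

```python
def diagnostic_examinations(term, canonical_by_term, diagnoses, performed_by):
    """Return [(examination, departments)] for a condition term.

    Returns None when the graph has no answer, which tells the caller
    to fall back to vector search.
    """
    condition = canonical_by_term.get(term.casefold())
    examinations = diagnoses.get(condition, [])
    if not examinations:
        return None
    return [(exam, performed_by.get(exam, [])) for exam in examinations]


canonical_by_term = {"hartfalen": "Hartfalen"}
diagnoses = {"Hartfalen": ["Echocardiografie", "Bloedafname", "Thorax RX"]}
performed_by = {  # hypothetical department assignments
    "Echocardiografie": ["Cardiologie"],
    "Bloedafname": ["Laboratorium"],
    "Thorax RX": ["Radiologie"],
}

print(diagnostic_examinations("hartfalen", canonical_by_term, diagnoses, performed_by))
print(diagnostic_examinations("onbekend", canonical_by_term, diagnoses, performed_by))
```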

Query Pipeline Simplification (Phase B)

With SNOMED synonyms stored directly on taxonomy entities, queries become synonym-aware without requiring query-time SNOMED lookups for known entities:

-- Before: string match on canonical_name only
SELECT * FROM app.taxonomy_entities WHERE canonical_name = :term;

-- After: synonym-aware match (Phase B)
SELECT * FROM app.taxonomy_entities
WHERE canonical_name = :term
   OR metadata->'snomed_synonyms' ? :term;

This reduces query-time latency by pre-computing synonym matches at seeding time, while preserving the query-time SNOMED fallback for terms not in the taxonomy.
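
For readers less familiar with the jsonb `?` operator, which tests whether a string is an element of a JSON array (or a top-level key of an object), the two predicates can be mirrored in plain Python. The row layout below is an assumption modelled on the column names in the queries above.

```python
def matches_before(entity, term):
    """Mirror of: canonical_name = :term"""
    return entity["canonical_name"] == term


def matches_after(entity, term):
    """Mirror of: canonical_name = :term OR metadata->'snomed_synonyms' ? :term"""
    synonyms = entity.get("metadata", {}).get("snomed_synonyms", [])
    return entity["canonical_name"] == term or term in synonyms


row = {
    "canonical_name": "Diabetes Mellitus",
    "metadata": {"snomed_synonyms": ["suikerziekte"]},
}
print(matches_before(row, "suikerziekte"))  # False
print(matches_after(row, "suikerziekte"))   # True
```

Like the SQL, this comparison is exact and case-sensitive, so any normalisation of the user's term has to happen before the lookup.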

Multi-Tenancy Path

The layer separation is not merely an engineering exercise in clean code — it is the architectural foundation for evolving from a single-hospital deployment into a multi-tenant hospital search product. This section covers multi-tenancy from the knowledge-layer angle — which medical-vocabulary code is portable versus hospital-specific; for the system-level onboarding pathway, tenant isolation, and routing, see Multi-Tenancy Architecture.

What Transfers Unchanged (Layer 1)

The following components work for any Dutch-speaking hospital without modification:

| Component | Lines | Content |
| --- | --- | --- |
| dutch_medical_vocabulary.py | 988 | 108 condition aliases, 60 treatment aliases, 22 examination alias groups, 63 entity type overrides, plausibility guards, resolver functions, Dutch compound word matching |
| belgian_hospital_departments.py | 138 | 56 standard Belgian department names, normalize_to_standard() function |
| Enrichment modules (5 files) | ~1,200 | 1,123 total mappings across 5 relationship types (HANDLES, OFFERS, PERFORMS, DIAGNOSES, TREATS) |
| enrich_taxonomy_llm.py | ~600 | LLM enrichment pipeline script |
| SNOMED CT reference tables | 4 tables | 356K concepts, 656K descriptions, 1.2M relationships, 4.7M transitive closure entries (see SNOMED CT integration) |

Total portable code: approximately 2,930 lines (excluding SNOMED CT data, which is loaded from the Belgian Edition RF2 distribution).

What Needs Replacement (Layer 2)

For a new hospital deployment, only the following need to be created or replaced:

| Component | Effort | Description |
| --- | --- | --- |
| {hospital}_taxonomy.py | Medium | Campus definitions, department-campus mappings, department aliases specific to the new hospital |
| taxonomy/ scrape | Automated | Run the hub page scraper against the new hospital's website |
| {HOSPITAL}_TO_STANDARD_MAP | Low | Map the new hospital's branded department names to standard Belgian department names (analogous to ZOL_TO_STANDARD_MAP) |
| Site configuration | Low | Update site_config.py with the new hospital's domain and configuration |

The Enrichment Pipeline as a Portable Generator

The enrichment pipeline's use of standard Belgian department names means that its output is already portable. When onboarding a new hospital:

  1. Scrape the new hospital's hub pages to populate the taxonomy database.
  2. Run enrich_taxonomy_llm.py — the pipeline uses STANDARD_DEPARTMENTS, not hospital-specific names.
  3. Create a {HOSPITAL}_TO_STANDARD_MAP mapping the new hospital's branded names to standard equivalents.
  4. Seed the taxonomy — the GoldenPageSeeder resolves standard names to hospital-specific names automatically.

The enrichment modules do not need to be regenerated for each hospital. The same "Cardiologie handles Hartfalen" mapping applies universally. Only the resolution from "Cardiologie" to a hospital's specific department name (which may be "Hartcentrum", "Cardiologische Dienst", or simply "Cardiologie") happens at seeding time.
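
The seeding-time resolution can be sketched as inverting the hospital's name map and re-keying the portable mappings. The function names are hypothetical and this is not the GoldenPageSeeder's actual logic. The sketch assumes the map runs from branded name to standard name, as the name {HOSPITAL}_TO_STANDARD_MAP suggests, and that standard departments with no counterpart at the hospital are simply skipped.

```python
def invert_to_standard_map(hospital_to_standard):
    """Build {standard name: [hospital-specific names]}."""
    standard_to_hospital = {}
    for hospital_name, standard_name in hospital_to_standard.items():
        standard_to_hospital.setdefault(standard_name, []).append(hospital_name)
    return standard_to_hospital


def localise_handles(handles, hospital_to_standard):
    """Re-key portable {standard department: [conditions]} mappings."""
    standard_to_hospital = invert_to_standard_map(hospital_to_standard)
    localised = {}
    for standard_department, conditions in handles.items():
        for hospital_name in standard_to_hospital.get(standard_department, []):
            localised.setdefault(hospital_name, []).extend(conditions)
    return localised


portable_handles = {"Cardiologie": ["Hartfalen"], "Neurologie": ["Migraine"]}
example_map = {"Hartcentrum": "Cardiologie"}  # illustrative hospital map
print(localise_handles(portable_handles, example_map))
```

In this sample the hospital has no department mapped to Neurologie, so that portable mapping is not seeded.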

Medical Knowledge Graphs in Literature

The approach taken in this system can be situated within the broader landscape of medical knowledge graph construction:

| Approach | Examples | Strengths | Limitations |
| --- | --- | --- | --- |
| Ontology-based | SNOMED CT, ICD-10, UMLS | Standardised, comprehensive, internationally maintained | Clinical focus (not navigational), licensing requirements, Dutch coverage gaps |
| NER + relation extraction | scispaCy, MedCAT, BioBERT | Automated, handles novel entities | Requires training data, language-specific models needed for Dutch |
| Manual curation | Hospital-internal databases | High precision, domain-expert validated | Does not scale, maintenance burden, no portability |
| LLM classification (this system) | Tier 3 model + hub page merge | Scalable, auditable, portable, Dutch-native | Requires human review, no novel entity discovery |

The ZOL system's approach is closest to constrained LLM classification with human-in-the-loop validation — a pattern that leverages LLM medical knowledge for bulk classification while maintaining auditability through version-controlled output files and hub page cross-referencing.

Hybrid Architecture Advantages

The three-layer separation offers advantages that no single approach provides:

  1. Scalable like LLM extraction — new entities are classified automatically
  2. Precise like manual curation — hub page data overrides LLM when available
  3. Portable like ontology-based systems — Layer 1 transfers across hospitals
  4. Auditable like all three — provenance chain from LLM output to hub page confirmation to human review

Way Forward: Three-Source Architecture Roadmap

The Three-Source Architecture is being implemented in phases, each with a golden evaluation gate (≥91% overall pass rate required before merging).

Phase A: Web Scraper Campus Inference (COMPLETE)

The scraper now infers department→campus mappings from doctor campus data, eliminating manual YAML campus maintenance. The Database Doctor confirms complete coverage: 0 departments are missing LOCATED_AT relationships.

Phase B: SNOMED Graph Enrichment (APPROVED DESIGN)

Add a Phase 3: SNOMED Enrichment to the graph seeding pipeline, running after all entities and relationships are built:

  1. Concept ID matching: Match entity names against snomed_descriptions to assign SNOMED concept IDs to graph nodes
  2. Synonym enrichment: Add Dutch synonyms from SNOMED descriptions as node properties (snomed_synonyms array)
  3. IS_A hierarchy: Create IS_A relationships between existing condition nodes that have hierarchical SNOMED relationships (depth limit: 3 hops)
  4. FINDING_SITE → HANDLES: Auto-create condition→department links via anatomical routing (estimated +7 SNOMED golden questions)
  5. PROCEDURE_SITE → OFFERS: Auto-create treatment→department links via procedure site mapping

Target: SNOMED golden question pass rate from 4/15 (26.7%) to 11–13/15 (73–87%).
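
Step 3 (IS_A hierarchy with a depth limit) can be sketched as a bounded breadth-first walk over SNOMED parent links. Since Phase B is still a design, the following is only one possible reading: it assumes a link is created to every existing ancestor node within three hops, and it uses abstract concept labels rather than real SNOMED CT identifiers.

```python
from collections import deque


def is_a_links(existing_nodes, parents, max_hops=3):
    """Return (child, ancestor) pairs between existing graph nodes.

    `parents` maps a concept to its direct IS-A parents. Ancestors more
    than `max_hops` steps away are ignored.
    """
    links = set()
    for child in existing_nodes:
        queue = deque([(child, 0)])
        seen = {child}
        while queue:
            concept, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the depth limit
            for parent in parents.get(concept, ()):
                if parent in seen:
                    continue
                seen.add(parent)
                if parent in existing_nodes:
                    links.add((child, parent))
                queue.append((parent, hops + 1))
    return links


# A -> B -> C -> D -> E is a chain of IS-A parents; only A, C, E are graph nodes.
parents = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"]}
print(sorted(is_a_links({"A", "C", "E"}, parents)))
```

With this chain, A is linked to C (two hops) and C to E (two hops), but A is not linked to E because E is four hops away.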

Phase C: Alias Elimination (PLANNED)

Replace hand-maintained alias dictionaries with SNOMED-derived synonyms at seeding time:

| Current Constant | Entries | Replaced By |
| --- | --- | --- |
| CONDITION_ALIASES | ~125 | SNOMED snomed_descriptions synonym lookup |
| TREATMENT_ALIASES | ~72 | SNOMED snomed_descriptions synonym lookup |
| EXAMINATION_ALIASES | ~55 | SNOMED snomed_descriptions synonym lookup |
| EXAMINATION_CASING | ~30 | SNOMED preferred term lookup |
| CONDITION_DOMAIN_MAP | ~40 | SNOMED FINDING_SITE → body structure → domain group |
| EXAM_DOMAIN_MAP | ~17 | SNOMED PROCEDURE_SITE → body structure → domain group |

Impact: Eliminates ~610 hand-maintained entries (~18% of all hardcoded data).

Completed: SNOMED CT Phase 1 — Query-Time Synonym Expansion

SNOMED CT Belgian Edition (356K concepts, 656K Dutch descriptions) is integrated as a query-time synonym expansion layer following the BMQExpander pattern (Mao et al., 2024). The implementation includes:

  • PostgreSQL reference tables (4 tables): snomed_concepts, snomed_descriptions, snomed_relationships, and snomed_transitive_closure (4.7M pre-computed IS-A ancestor/descendant pairs).
  • SnomedTerminologyService: BMQExpander-style synonym expansion that maps between lay and clinical terms (e.g., the lay term "staar" and the clinical term "cataract") so that a patient's wording can match taxonomy entries.
  • FINDING_SITE routing: Maps unknown conditions to departments via SNOMED CT body structure relationships — 51 curated body-structure-to-department mappings with IS-A hierarchy walk (max depth 5).
  • Always-on architecture: SNOMED is not behind a feature flag. All calls are wrapped in try/except for graceful degradation when tables do not exist.
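
The always-on, fail-soft behaviour can be sketched as a wrapper that returns the original term unchanged when the SNOMED lookup fails. The function name and the injected lookup_synonyms callable are hypothetical stand-ins for the SnomedTerminologyService. Catching Exception broadly keeps the sketch short; production code would more likely catch the database driver's specific error for a missing table.

```python
import logging

logger = logging.getLogger(__name__)


def expand_with_synonyms(term, lookup_synonyms):
    """Return the term followed by its synonyms, or just the term on failure."""
    try:
        synonyms = lookup_synonyms(term)
    except Exception:
        # Degrade gracefully, for example when the SNOMED tables are absent.
        logger.warning("SNOMED lookup failed for %r", term, exc_info=True)
        synonyms = []
    expanded = [term]
    for synonym in synonyms:
        if synonym not in expanded:
            expanded.append(synonym)
    return expanded


def missing_tables(term):
    raise RuntimeError("relation snomed_descriptions does not exist")


print(expand_with_synonyms("staar", lambda term: ["cataract"]))
print(expand_with_synonyms("staar", missing_tables))
```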

On a targeted 15-question evaluation set, synonym expansion improved entity recall from 40% to 47–60% (depending on LLM response variability), with zero infrastructure additions beyond PostgreSQL.

With SNOMED CT concept identifiers on graph nodes, the system can accept queries in any language that SNOMED CT supports (Dutch, English, French, German) and resolve them to the same underlying concepts. This is particularly relevant for ZOL's multilingual patient population in the Limburg province of Belgium.

Future: Neural Entity Extraction

For discovering novel entities not in the taxonomy, two approaches are under evaluation: GLiNER-BioMed (zero-shot NER) and MedCAT (NER with SNOMED CT concept linking). Both require golden question baseline evaluation (currently 108 questions across 15 categories).

Theoretical Foundations

The Three-Source Knowledge Architecture draws on established principles from knowledge engineering, medical informatics, and information retrieval:

  • Separation of concerns (Dijkstra, 1982): Universal medical knowledge, standards-based ontology data, and hospital-specific organisational data are orthogonal concerns that change for different reasons and at different rates. Separating them by provenance reduces coupling and enables independent verification of each source.
  • Open-world assumption (cf. Reiter, 1978, on the contrasting closed-world assumption): The absence of a relationship in the knowledge graph does not mean the relationship does not exist. The tiered query strategy (graph then vector fallback) operationalises this assumption by treating graph results as high-confidence and vector results as exploratory.
  • Knowledge graph completion (Bordes et al., 2013): The LLM enrichment pipeline can be understood as a form of knowledge graph completion, where a pre-trained model predicts missing links between known entities based on learned representations of medical relationships.
  • Ontology-enhanced RAG (Soman et al., 2024): OntologyRAG demonstrated that grounding retrieval-augmented generation in formal ontologies improves accuracy by 40% on medical QA benchmarks. The SNOMED CT graph enrichment phase operationalises this finding by embedding ontological relationships directly into the knowledge graph.
  • SNOMED CT concept-based information retrieval (Ruch et al., 2006): Using SNOMED concepts and FINDING_SITE relationships for information retrieval improved MAP by 25%. This finding directly motivates the FINDING_SITE → HANDLES auto-creation in Source 2.
  • Biomedical query expansion (Jimeno-Yepes et al., 2012; Mao et al., 2024): The BMQExpander pattern, which uses SNOMED CT and MeSH synonyms for query expansion, improved NDCG@10 by 22% on medical QA benchmarks. This approach underpins both the query-time synonym expansion and the seeding-time synonym-as-property design.
  • Cross-lingual medical embeddings (Yuan et al., 2022): The CODER model demonstrated that cross-lingual embeddings improve medical entity matching F1 by 15%. SNOMED CT's multilingual descriptions provide the foundation for cross-lingual search without requiring language-specific embedding models.
  • Frozen ontology pattern (Uschold & Gruninger, 1996): The taxonomy layer implements a frozen snapshot of the hospital's entity inventory at scrape time. This prevents drift between the graph's entity set and the hospital's current data, at the cost of requiring periodic re-scraping.
  • Multi-tenancy through abstraction (Bezemer & Zaidman, 2010): The source separation follows the principle that multi-tenant systems should isolate tenant-specific configuration from shared business logic. Sources 1 and 2 are portable; Source 3 is tenant-specific. This pattern enables horizontal scaling to additional hospitals without code duplication.
  • Dutch morphological compounding (Booij, 2012): Dutch is a productive compounding language where medical terms are formed by concatenating morphemes without spaces (e.g., hartchirurgie = hart + chirurgie, schildklieraandoening = schildklier + aandoening). This morphological productivity means that a finite alias dictionary can never fully cover the space of valid Dutch medical terms, motivating the integration of SNOMED CT's 656K Dutch descriptions as a scalable synonym source.
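
To make the compounding point concrete, the sketch below splits a compound into known morphemes by trying the longest matching prefix first and backtracking when the remainder cannot be split. It is a simplified illustration with a hypothetical four-word lexicon, not the matcher in dutch_medical_vocabulary.py, and it ignores Dutch linking elements such as -s- and -en-.

```python
def split_compound(word, lexicon):
    """Split `word` into lexicon entries, or return [] if no full split exists."""
    word = word.casefold()
    for end in range(len(word), 0, -1):  # longest prefix first
        head = word[:end]
        if head not in lexicon:
            continue
        if end == len(word):
            return [head]
        rest = split_compound(word[end:], lexicon)
        if rest:  # remainder was fully covered by the lexicon
            return [head] + rest
    return []


lexicon = {"hart", "chirurgie", "schildklier", "aandoening"}
print(split_compound("hartchirurgie", lexicon))
print(split_compound("schildklieraandoening", lexicon))
print(split_compound("hartziekte", lexicon))
```

The third call returns an empty list because "ziekte" is not in the sample lexicon, which is exactly the coverage gap that a finite dictionary leaves open.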

References

  • Bezemer, C. P., & Zaidman, A. (2010). Multi-tenant SaaS applications: Maintenance dream or nightmare? Joint ERCIM Workshop on Software Evolution and International Workshop on Principles of Software Evolution, 88–92. https://doi.org/10.1145/1862372.1862393
  • Booij, G. (2012). The Grammar of Words: An Introduction to Linguistic Morphology (3rd ed.). Oxford University Press.
  • Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, 26, 2787–2795.
  • Dijkstra, E. W. (1982). On the role of scientific thought. In Selected Writings on Computing: A Personal Perspective (pp. 60–66). Springer-Verlag.
  • FPS Public Health. (2024). Belgian eHealth Action Plan: SNOMED CT Implementation Roadmap. Federal Public Service Health, Food Chain Safety and Environment.
  • Hartendorp, R., et al. (2024). Biomedical entity linking for Dutch. In Proceedings of CL4Health Workshop, LREC-COLING 2024.
  • Hogan, A., et al. (2021). Knowledge graphs. ACM Computing Surveys, 54(4), 1–37. https://doi.org/10.1145/3447772
  • Jimeno-Yepes, A., Berlanga, R., & Rebholz-Schuhmann, D. (2012). Ontology-based query expansion for biomedical information retrieval. BMC Bioinformatics, 13(S14). https://doi.org/10.1186/1471-2105-13-S14-S1
  • Mao, Y., et al. (2024). BMQExpander: Biomedical query expansion using SNOMED/MeSH synonyms. Proceedings of NAACL 2024.
  • Reiter, R. (1978). On closed world data bases. In H. Gallaire & J. Minker (Eds.), Logic and Data Bases (pp. 55–76). Plenum Press.
  • Ruch, P., et al. (2006). Using SNOMED CT body structure hierarchy for concept-based information retrieval. Proceedings of AMIA Annual Symposium, 674–678.
  • Searle, T., et al. (2024). MedCAT – Medical concept annotation toolkit. Artificial Intelligence in Medicine, 149, 102779. https://doi.org/10.1016/j.artmed.2024.102779
  • SNOMED International. (2024). SNOMED CT Starter Guide. https://www.snomed.org/
  • Soman, K., et al. (2024). OntologyRAG: Ontology-enhanced retrieval-augmented generation. arXiv preprint, arXiv:2412.09050.
  • Uschold, M., & Gruninger, M. (1996). Ontologies: Principles, methods and applications. Knowledge Engineering Review, 11(2), 93–136. https://doi.org/10.1017/S0269888900007797
  • Yuan, Z., et al. (2022). CODER: Knowledge-infused cross-lingual medical term embeddings. Findings of ACL 2022, 3924–3935. https://doi.org/10.18653/v1/2022.findings-acl.312