This ADR was written when the system used Neo4j for entity storage. As of March 2026, Neo4j has been fully removed and replaced by PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The decision rationale documented here remains valid; the storage layer has changed.
ADR-0015: Taxonomy-Driven Entity Normalization
Date: 2026-02-09 | Status: Accepted
Context
After implementing LLM entity validation (ADR-0014), the knowledge graph reached 3,578 nodes and 28,374 relationships — but a quality audit scored it only 69/100 overall, with particularly poor scores for naming consistency (58/100) and search utility (55/100).
Root Cause: Scattered Normalization Data
The normalization logic was distributed across ~20 constants in 2 files (medical_extraction.py and typed_nodes.py), with no single source of truth. This produced three categories of defects:
| Defect Category | Example | Impact |
|---|---|---|
| Contradictory normalization | "Anesthesie" normalized to "Anesthesiologie" in one file but kept as "Anesthesie" in another | Duplicate nodes in Neo4j for the same department |
| Missing campus constraints | A 5th campus ("ZOL") appeared alongside the 4 real campuses (Sint-Jan, André Dumont, Sint-Barbara, Maas en Kempen) | Phantom campus hub node with incorrect relationships |
| Duplicate department pairs | 7+ pairs like "Cardiologie" / "Dienst Cardiologie", "Urologie" / "Dienst Urologie" | Fragmented graph — queries miss half the relationships |
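The first defect category can be detected mechanically. As a rough sketch, assuming each file's normalization data can be viewed as a plain raw-name to canonical-name dictionary, any key that both maps define but resolve differently is a contradiction. The two dictionaries and the function name `find_contradictions` are illustrative stand-ins, not the original constants.

```python
def find_contradictions(
    first: dict[str, str], second: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Return keys that both maps define but normalize to different names."""
    shared = first.keys() & second.keys()
    return {k: (first[k], second[k]) for k in sorted(shared) if first[k] != second[k]}


# Hypothetical stand-ins for the constants in the two files.
map_in_file_a = {"Anesthesie": "Anesthesiologie", "Cardiologie": "Cardiologie"}
map_in_file_b = {"Anesthesie": "Anesthesie", "Cardiologie": "Cardiologie"}

print(find_contradictions(map_in_file_a, map_in_file_b))
# {'Anesthesie': ('Anesthesiologie', 'Anesthesie')}
```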
Doctor Name Pollution
Regex extraction parsed role tokens and body parts as doctor name components:
- "Hoofdverpleegkundige Van Damme" → doctor named "Hoofdverpleegkundige Van Damme" (head nurse is a role, not a name)
- "Borstkaschirurgie" → doctor named "Borstkas" (chest is a body part, not a person)
- "Diensthoofd Dr. Janssen" → doctor named "Diensthoofd Janssen" (department head is a role)
These polluted the graph with dozens of fake doctor nodes that connected to real departments, degrading relationship quality.
Entity Type Confusion
Several entities exist as both a department and a treatment in the ZOL context:
- Radiotherapie: a department (afdeling) that patients visit, and a treatment (behandeling) that patients receive
- Dialyse: a department with staff and rooms, and a treatment procedure
- Revalidatie: a department on campus, and a treatment program
Without an authoritative type override, the extraction pipeline classified each of these as a department on some pages and as a treatment on others, depending on page context, which produced an inconsistent graph structure.
Decision
1. Multi-Layer Taxonomy Architecture
The taxonomy data is organized in a layered architecture, separating hospital-specific configuration from general Dutch medical vocabulary:
Layer 1: Hospital Configuration (hospital_config/zol.yaml)
All ZOL-specific data lives in a YAML configuration file, validated at load time by Pydantic models defined in hospital_config/schema.py. This includes:
- Campus definitions (4 campuses with addresses)
- Department definitions (~57 departments with aliases, campus assignments, domain groups)
- Domain knowledge maps (dept→conditions, dept→treatments, condition→examinations)
- Hub page definitions for curated taxonomy seeding
- Search aliases mapping patient language to clinical terms
Layer 2: Runtime API (zol_taxonomy.py)
The zol_taxonomy.py module (960+ lines) loads the YAML configuration and exposes it as frozen dataclasses and pre-computed lookup structures:
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CampusDefinition:
    id: str
    canonical_name: str
    aliases: list[str] = field(default_factory=list)
    address: str = ""
    city: str = ""
    postal_code: str = ""
    phone: str = ""


@dataclass(frozen=True)
class DepartmentDefinition:
    canonical_name: str
    aliases: list[str] = field(default_factory=list)
    campuses: list[str] = field(default_factory=list)
    domain_group: str = "general"
    is_diagnostic: bool = False
Key derived structures include:
- CAMPUS_CANONICAL_NAMES: set of valid campus names
- DEPARTMENT_CAMPUS_MAP: department → campus list mapping
- DEPT_CONDITION_MAP, DEPT_TREATMENT_MAP: domain knowledge loaded from YAML
- ALL_DEPARTMENT_NAMES, ALL_CONDITION_NAMES, ALL_SPECIALTY_NAMES: pre-compiled sets for O(1) lookup
- SEARCH_ALIASES: combined universal + hospital-specific patient-facing aliases
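As a minimal sketch of how such lookup structures might be derived once at import time, the example below builds a name set, a campus map, and an alias index from a couple of department definitions. The `Dept` class is a simplified stand-in for `DepartmentDefinition`, the campus assignments are made-up placeholders, and `normalize_department` and `_ALIAS_TO_CANONICAL` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dept:
    """Simplified stand-in for DepartmentDefinition (tuples keep it immutable)."""
    canonical_name: str
    aliases: tuple[str, ...] = ()
    campuses: tuple[str, ...] = ()


# Placeholder entries; the real data would come from the YAML config.
DEPARTMENTS = (
    Dept("Cardiologie", aliases=("Dienst Cardiologie",), campuses=("Sint-Jan",)),
    Dept("Urologie", aliases=("Dienst Urologie",), campuses=("Sint-Jan", "Maas en Kempen")),
)

# Computed once so that every later lookup is a single hash-table hit.
ALL_DEPARTMENT_NAMES = frozenset(d.canonical_name for d in DEPARTMENTS)
DEPARTMENT_CAMPUS_MAP = {d.canonical_name: list(d.campuses) for d in DEPARTMENTS}
_ALIAS_TO_CANONICAL = {
    name.casefold(): d.canonical_name
    for d in DEPARTMENTS
    for name in (d.canonical_name, *d.aliases)
}


def normalize_department(raw: str) -> Optional[str]:
    """Map any known spelling to its canonical name; None if unknown."""
    return _ALIAS_TO_CANONICAL.get(raw.strip().casefold())


print(normalize_department("Dienst Cardiologie"))  # Cardiologie
print(normalize_department("Onbekend"))            # None
```

Indexing the canonical name alongside its aliases makes normalization idempotent: a name that is already canonical maps to itself.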
Layer 3: General Medical Vocabulary (dutch_medical_vocabulary.py)
Language-level medical knowledge that is not hospital-specific:
# Entity type overrides — prevents misclassification
ENTITY_TYPE_OVERRIDES: dict[str, str] = {
    "radiotherapie": "department",
    "dialyse": "department",
    "revalidatie": "department",
    "patiëntenfiches": "service",
    "palliatief support team": "service",
    "valkliniek": "facility",
    # ...
}

# Dual-entity map — departments that also represent treatments
DUAL_ENTITY_MAP: dict[str, list[str]] = {
    "radiotherapie": ["Bestralingstherapie", "Uitwendige bestraling", "Brachytherapie"],
    "revalidatie": ["Fysiotherapie", "Ergotherapie", "Logopedie"],
    # ...
}

# Alias maps for condition/treatment normalization
CONDITION_ALIASES: dict[str, str] = {
    "hoge bloeddruk": "Hypertensie",
    "suikerziekte": "Diabetes Mellitus",
    # 55+ entries
}

TREATMENT_ALIASES: dict[str, str] = {
    "bestraling": "Radiotherapie",
    "chemo": "Chemotherapie",
    # 20+ entries
}
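One way ENTITY_TYPE_OVERRIDES and DUAL_ENTITY_MAP might be consulted together is sketched below. It assumes, beyond what the excerpt states, that the override decides the node type and that the dual-entity list names treatments to link to that department. The function `classify_entity` and its return shape are hypothetical.

```python
# Excerpts of the two maps; only entries shown in this ADR are included.
ENTITY_TYPE_OVERRIDES: dict[str, str] = {
    "radiotherapie": "department",
    "dialyse": "department",
    "revalidatie": "department",
}
DUAL_ENTITY_MAP: dict[str, list[str]] = {
    "radiotherapie": ["Bestralingstherapie", "Uitwendige bestraling", "Brachytherapie"],
    "revalidatie": ["Fysiotherapie", "Ergotherapie", "Logopedie"],
}


def classify_entity(name: str, extracted_type: str) -> tuple[str, list[str]]:
    """Return (entity type, treatments to link) for an extracted entity."""
    key = name.strip().casefold()
    # The override wins over whatever the extractor inferred from page context.
    entity_type = ENTITY_TYPE_OVERRIDES.get(key, extracted_type)
    # Copy the list so callers cannot mutate the shared taxonomy data.
    linked_treatments = list(DUAL_ENTITY_MAP.get(key, []))
    return entity_type, linked_treatments


print(classify_entity("Radiotherapie", "treatment"))
# ('department', ['Bestralingstherapie', 'Uitwendige bestraling', 'Brachytherapie'])
print(classify_entity("Chemotherapie", "treatment"))
# ('treatment', [])
```

Because the result depends only on the entity name, the same entity receives the same type regardless of which page it was extracted from.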
The module also includes:
- DOCTOR_NAME_STRIP_TOKENS: role tokens to strip from names
- DOCTOR_NAME_BLOCKLIST: body parts and generic terms that are not person names
- TREATMENT_SPECIALTY_CONSTRAINTS: prevents cross-specialty leakage (e.g., cochleair implantaat → KNO only)
- ENTITY_BLOCKLIST: terms that should never become graph entities
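The doctor-name cleanup could work roughly as follows. This is a sketch under assumptions: the two sets contain only a few illustrative tokens (including "dr" and "dr.", which are guesses), and the function `clean_doctor_name` is a hypothetical name rather than the module's actual API.

```python
from typing import Optional

# Illustrative subsets; the real sets live in dutch_medical_vocabulary.py.
DOCTOR_NAME_STRIP_TOKENS = frozenset({"hoofdverpleegkundige", "diensthoofd", "dr", "dr."})
DOCTOR_NAME_BLOCKLIST = frozenset({"borstkas"})


def clean_doctor_name(raw: str) -> Optional[str]:
    """Strip role tokens; return None when nothing name-like remains."""
    tokens = [t for t in raw.split() if t.casefold() not in DOCTOR_NAME_STRIP_TOKENS]
    if not tokens:
        return None
    # Reject candidates containing a blocklisted term (body parts, generic words).
    if any(t.casefold() in DOCTOR_NAME_BLOCKLIST for t in tokens):
        return None
    return " ".join(tokens)


print(clean_doctor_name("Diensthoofd Dr. Janssen"))  # Janssen
print(clean_doctor_name("Borstkas"))                 # None
```

Returning `None` rather than an empty string lets the caller skip node creation entirely when no plausible name survives.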
Layer 4: Curated Medical Knowledge (medical_knowledge/)
LLM-generated and human-reviewed relationship maps:
| Module | Content |
|---|---|
| department_treatments.py | Which department offers which treatments |
| department_conditions.py | Which department handles which conditions |
| condition_examinations.py | Which examination diagnoses which condition |
| belgian_hospital_departments.py | Valid Belgian hospital department names |
2. Consumption by Extraction and Storage
The downstream modules and the LLM validation prompt all draw on the taxonomy layers:
- medical_extraction.py imports campus lists, department aliases, doctor cleanup rules, entity type overrides, and domain knowledge maps
- typed_nodes.py imports normalization maps, entity type overrides, and canonical names
- golden_page_seeder.py imports all taxonomy data for curated taxonomy seeding, plus curated knowledge maps from medical_knowledge/
- LLM validation prompt (ADR-0014) includes taxonomy context: the list of valid campuses, departments, and known dual-entities is appended to the system prompt
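Appending taxonomy context to the validation prompt might look something like the sketch below. The prompt wording, the function name `build_validation_prompt`, and its parameters are assumptions for illustration; the actual prompt text in llm_entity_validation.py is not shown in this ADR.

```python
def build_validation_prompt(
    base_prompt: str,
    campuses: list[str],
    departments: list[str],
    dual_entities: list[str],
) -> str:
    """Append taxonomy reference data to a base system prompt."""
    sections = [
        base_prompt.rstrip(),
        # Sorting keeps the prompt byte-identical across runs for the same data.
        "Valid campuses: " + ", ".join(sorted(campuses)),
        "Valid departments: " + ", ".join(sorted(departments)),
        "Entities that are both a department and a treatment: "
        + ", ".join(sorted(dual_entities)),
    ]
    return "\n\n".join(sections)


prompt = build_validation_prompt(
    "You validate extracted hospital entities.",
    campuses=["Sint-Jan", "André Dumont", "Sint-Barbara", "Maas en Kempen"],
    departments=["Urologie", "Cardiologie"],
    dual_entities=["Radiotherapie", "Dialyse", "Revalidatie"],
)
print(prompt)
```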
3. Quality Metrics Targets
| Metric | Before (v1) | Target (v2) | Measurement Method |
|---|---|---|---|
| Overall quality | 69/100 | 80+/100 | Weighted composite of all sub-scores |
| Naming consistency | 58/100 | 75+/100 | Duplicate detection + canonical name coverage |
| Search utility | 55/100 | 70+/100 | Query hit rate for common patient search terms |
| Relationship accuracy | 74/100 | 85+/100 | Manual sampling of 50 random relationships |
| Campus assignment | Unknown | 95+/100 | Departments assigned to correct campus(es) |
Consequences
Positive
- Layered architecture: Hospital data (YAML) separated from general medical vocabulary (Python) — clear ownership boundaries
- Zero duplicate nodes: Canonical names enforced at both extraction and storage layers
- Clean doctor names: Role tokens and body parts systematically stripped before storage
- Entity type clarity: ENTITY_TYPE_OVERRIDES + DUAL_ENTITY_MAP handle ambiguous entity types
- Improved search: Patient-facing aliases map colloquial Dutch terms to clinical entities (55+ condition aliases, 20+ treatment aliases)
- Better LLM validation: Taxonomy context in the validation prompt gives the LLM authoritative reference data
- Testable: Taxonomy module is pure data — easy to unit test for completeness and consistency
- Extensible: New hospitals can add their own YAML config without modifying Python code
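Because the taxonomy is plain data, consistency rules can be written as ordinary functions over it. The sketch below shows two such checks: every campus a department references must be a defined campus, and no alias may point at two different departments. The function name `check_taxonomy`, its parameters, and the sample data are hypothetical and do not reproduce test_zol_taxonomy.py.

```python
def check_taxonomy(
    campuses: set[str],
    department_campuses: dict[str, list[str]],
    department_aliases: dict[str, list[str]],
) -> list[str]:
    """Return human-readable problems; an empty list means the data is consistent."""
    problems: list[str] = []
    # Every campus referenced by a department must be a defined campus.
    for dept, dept_campuses in sorted(department_campuses.items()):
        for campus in dept_campuses:
            if campus not in campuses:
                problems.append(f"{dept}: unknown campus {campus!r}")
    # No alias may point at two different canonical departments.
    owner: dict[str, str] = {}
    for dept, aliases in sorted(department_aliases.items()):
        for alias in aliases:
            key = alias.casefold()
            if key in owner and owner[key] != dept:
                problems.append(f"alias {alias!r} maps to both {owner[key]} and {dept}")
            owner.setdefault(key, dept)
    return problems


valid_campuses = {"Sint-Jan", "André Dumont", "Sint-Barbara", "Maas en Kempen"}
print(check_taxonomy(valid_campuses, {"Cardiologie": ["Sint-Jan", "ZOL"]}, {}))
# ["Cardiologie: unknown campus 'ZOL'"]
```

The alias-collision check also gives an early warning for the over-normalization risk, since an alias claimed by two departments would silently merge them.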
Negative
- Taxonomy maintenance burden: New departments, conditions, or treatments require updating YAML config and/or vocabulary module
- Large refactor: Original consolidation required careful migration from ~20 scattered constants
- Domain expertise required: Taxonomy curation needs clinical/hospital knowledge — not purely a development task
- Risk of over-normalization: Aggressive alias maps could incorrectly merge entities that are actually distinct
Neutral
- Pipeline architecture unchanged — regex extraction, LLM validation, and taxonomy storage still operate in the same sequence
- No database migration required — only the data flowing into the taxonomy changes
- No impact on query-time latency — normalization happens at ingestion time
- LLM validation (ADR-0014) continues to operate as an independent quality gate
Alternatives Considered
Alternative 1: Fix Constants In-Place
Continue maintaining normalization data as separate constants in medical_extraction.py and typed_nodes.py, just fixing the contradictions. Rejected: this does not solve the root cause. With two files independently defining canonical names, contradictions will inevitably re-emerge as new entities are added. The scattered approach also makes auditing impossible — you cannot answer "what is the canonical name for X?" without searching multiple files.
Alternative 2: Database-Driven Taxonomy
Store the taxonomy in PostgreSQL or Neo4j itself, with an admin UI for editing. Rejected: the ZOL domain is stable — departments and campuses rarely change. A database-driven approach adds migration complexity, admin UI development cost, and cache invalidation logic for a problem that a simple YAML + Python module solves cleanly. If the domain becomes dynamic (e.g., frequent department reorganizations), this alternative should be revisited.
Alternative 3: LLM-Only Normalization
Remove all hardcoded normalization and let the LLM (ADR-0014) handle all canonicalization. Rejected: LLM responses are non-deterministic — the same entity might be normalized differently across runs. This makes the graph unreproducible and hard to debug. LLMs also hallucinate canonical names that don't exist. The taxonomy provides the deterministic foundation; the LLM validates what the taxonomy cannot express (semantic plausibility).
Implementation
| File | Purpose |
|---|---|
| backend/app/services/graph/hospital_config/zol.yaml | Hospital-specific taxonomy data (YAML) |
| backend/app/services/graph/hospital_config/schema.py | Pydantic validation models for YAML config |
| backend/app/services/graph/zol_taxonomy.py | Runtime API: loads YAML, exposes frozen dataclasses and lookup maps |
| backend/app/services/graph/medical_knowledge/dutch_medical_vocabulary.py | General Dutch medical vocabulary (entity overrides, aliases, blocklists) |
| backend/app/services/graph/medical_knowledge/department_treatments.py | Curated dept→treatment knowledge map |
| backend/app/services/graph/medical_knowledge/department_conditions.py | Curated dept→condition knowledge map |
| backend/app/services/graph/medical_knowledge/condition_examinations.py | Curated condition→examination knowledge map |
| backend/app/services/graph/medical_extraction.py | Updated to import from taxonomy layers |
| backend/app/services/graph/typed_nodes.py | Updated to import from taxonomy layers |
| backend/app/services/graph/llm_entity_validation.py | Taxonomy context appended to validation prompt |
| backend/app/services/graph/taxonomy/golden_page_seeder.py | Curated taxonomy seeding from hub pages + knowledge maps |
| backend/tests/unit/services/test_zol_taxonomy.py | Taxonomy consistency and completeness tests |
References
- Hogan, A., et al. (2021). Knowledge graphs. ACM Computing Surveys, 54(4), 1–37. https://doi.org/10.1145/3447772
- Paulheim, H. (2017). Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8(3), 489–508. https://doi.org/10.3233/SW-160218