Architectural Update (March 2026)

This ADR was written when the system used Neo4j for entity storage. As of March 2026, Neo4j has been fully removed and replaced by PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The decision rationale documented here remains valid; the storage layer has changed.

ADR-0015: Taxonomy-Driven Entity Normalization

Date: 2026-02-09 | Status: Accepted

Context

After implementing LLM entity validation (ADR-0014), the knowledge graph reached 3,578 nodes and 28,374 relationships — but a quality audit scored it only 69/100 overall, with particularly poor scores for naming consistency (58/100) and search utility (55/100).

Root Cause: Scattered Normalization Data

The normalization logic was distributed across ~20 constants in 2 files (medical_extraction.py and typed_nodes.py), with no single source of truth. This produced three categories of defects:

| Defect Category | Example | Impact |
| --- | --- | --- |
| Contradictory normalization | "Anesthesie" normalized to "Anesthesiologie" in one file but kept as "Anesthesie" in another | Duplicate nodes in Neo4j for the same department |
| Missing campus constraints | A 5th campus ("ZOL") appeared alongside the 4 real campuses (Sint-Jan, André Dumont, Sint-Barbara, Maas en Kempen) | Phantom campus hub node with incorrect relationships |
| Duplicate department pairs | 7+ pairs like "Cardiologie" / "Dienst Cardiologie", "Urologie" / "Dienst Urologie" | Fragmented graph; queries miss half the relationships |

Doctor Name Pollution

Regex extraction parsed role tokens and body parts as doctor name components:

  • "Hoofdverpleegkundige Van Damme" → doctor named "Hoofdverpleegkundige Van Damme" (head nurse is a role, not a name)
  • "Borstkaschirurgie" → doctor named "Borstkas" (chest is a body part, not a person)
  • "Diensthoofd Dr. Janssen" → doctor named "Diensthoofd Janssen" (department head is a role)

These polluted the graph with dozens of fake doctor nodes that connected to real departments, degrading relationship quality.

Entity Type Confusion

Several entities exist as both a department and a treatment in the ZOL context:

  • Radiotherapie: a department (afdeling) that patients visit, and a treatment (behandeling) that patients receive
  • Dialyse: a department with staff and rooms, and a treatment procedure
  • Revalidatie: a department on campus, and a treatment program

Without an authoritative type override, the extraction pipeline classified each of these as a department on some pages and as a treatment on others, depending on page context, which produced an inconsistent graph structure.

Decision

1. Multi-Layer Taxonomy Architecture

The taxonomy data is organized in a layered architecture, separating hospital-specific configuration from general Dutch medical vocabulary:

Layer 1: Hospital Configuration (hospital_config/zol.yaml)

All ZOL-specific data lives in a YAML configuration file, validated at load time by Pydantic models defined in hospital_config/schema.py. This includes:

  • Campus definitions (4 campuses with addresses)
  • Department definitions (~57 departments with aliases, campus assignments, domain groups)
  • Domain knowledge maps (dept→conditions, dept→treatments, condition→examinations)
  • Hub page definitions for curated taxonomy seeding
  • Search aliases mapping patient language to clinical terms
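
The kind of check that load-time validation enables can be sketched in plain Python. This is a stand-in for illustration, not the Pydantic schema in hospital_config/schema.py: the function name `validate_config` and the dictionary keys (`campuses`, `departments`, `id`, `name`) are assumptions, and the real zol.yaml layout may differ. The sketch rejects any department that references an undeclared campus, which is the class of error behind the phantom "ZOL" campus.

```python
from typing import Any


def validate_config(config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: list[str] = []

    # Collect the campus ids declared in the config (hypothetical key names).
    campus_ids = {campus["id"] for campus in config.get("campuses", [])}
    if not campus_ids:
        problems.append("no campuses defined")

    seen_names: set[str] = set()
    for dept in config.get("departments", []):
        name = dept["name"]
        key = name.casefold()
        # Canonical department names must be unique, ignoring case.
        if key in seen_names:
            problems.append(f"duplicate department: {name}")
        seen_names.add(key)
        # Every campus reference must resolve to a declared campus.
        for campus in dept.get("campuses", []):
            if campus not in campus_ids:
                problems.append(f"{name}: unknown campus {campus!r}")
    return problems


# Illustrative data only; ids and assignments are made up for the example.
example = {
    "campuses": [{"id": "sint-jan"}, {"id": "andre-dumont"}],
    "departments": [
        {"name": "Cardiologie", "campuses": ["sint-jan"]},
        {"name": "Urologie", "campuses": ["zol"]},
    ],
}
print(validate_config(example))
```

Failing at load time, rather than at extraction time, means a bad campus reference stops the pipeline before it can create a node.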

Layer 2: Runtime API (zol_taxonomy.py)

The zol_taxonomy.py module (960+ lines) loads the YAML configuration and exposes it as frozen dataclasses and pre-computed lookup structures:

from dataclasses import dataclass, field


@dataclass(frozen=True)
class CampusDefinition:
    id: str
    canonical_name: str
    aliases: list[str] = field(default_factory=list)
    address: str = ""
    city: str = ""
    postal_code: str = ""
    phone: str = ""


@dataclass(frozen=True)
class DepartmentDefinition:
    canonical_name: str
    aliases: list[str] = field(default_factory=list)
    campuses: list[str] = field(default_factory=list)
    domain_group: str = "general"
    is_diagnostic: bool = False

Key derived structures include:

  • CAMPUS_CANONICAL_NAMES: set of valid campus names
  • DEPARTMENT_CAMPUS_MAP: department → campus list mapping
  • DEPT_CONDITION_MAP, DEPT_TREATMENT_MAP: domain knowledge loaded from YAML
  • ALL_DEPARTMENT_NAMES, ALL_CONDITION_NAMES, ALL_SPECIALTY_NAMES: precomputed sets for O(1) average-case membership tests
  • SEARCH_ALIASES: combined universal + hospital-specific patient-facing aliases
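
A minimal sketch of how such lookup structures could be derived from the definitions follows. It uses a simplified `DepartmentDefinition` with tuple fields, and the names `build_alias_index`, `DEPARTMENT_ALIAS_INDEX`, and `normalize_department` are hypothetical. The campus assignments are made up for the example; the actual derivation in zol_taxonomy.py may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DepartmentDefinition:
    # Simplified stand-in for the real dataclass; tuples keep it fully immutable.
    canonical_name: str
    aliases: tuple[str, ...] = ()
    campuses: tuple[str, ...] = ()


# Illustrative entries only.
DEPARTMENTS = (
    DepartmentDefinition("Cardiologie", ("Dienst Cardiologie",), ("Sint-Jan",)),
    DepartmentDefinition("Anesthesiologie", ("Anesthesie",), ("Sint-Jan", "Sint-Barbara")),
)

# Built once at import time so lookups during extraction are plain dict/set hits.
ALL_DEPARTMENT_NAMES = frozenset(d.canonical_name for d in DEPARTMENTS)
DEPARTMENT_CAMPUS_MAP = {d.canonical_name: list(d.campuses) for d in DEPARTMENTS}


def build_alias_index(departments) -> dict[str, str]:
    """Map every case-folded canonical name and alias to its canonical name."""
    index: dict[str, str] = {}
    for dept in departments:
        for name in (dept.canonical_name, *dept.aliases):
            index[name.casefold()] = dept.canonical_name
    return index


DEPARTMENT_ALIAS_INDEX = build_alias_index(DEPARTMENTS)


def normalize_department(raw: str) -> Optional[str]:
    """Return the canonical department name, or None if the name is unknown."""
    return DEPARTMENT_ALIAS_INDEX.get(raw.strip().casefold())


print(normalize_department("dienst cardiologie"))
print(normalize_department("Anesthesie"))
```

Because every consumer resolves names through the same index, "Dienst Cardiologie" and "Cardiologie" cannot end up as two separate nodes.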

Layer 3: General Medical Vocabulary (dutch_medical_vocabulary.py)

Language-level medical knowledge that is not hospital-specific:

# Entity type overrides: prevent misclassification
ENTITY_TYPE_OVERRIDES: dict[str, str] = {
    "radiotherapie": "department",
    "dialyse": "department",
    "revalidatie": "department",
    "patiëntenfiches": "service",
    "palliatief support team": "service",
    "valkliniek": "facility",
    # ...
}

# Dual-entity map: departments that also represent treatments
DUAL_ENTITY_MAP: dict[str, list[str]] = {
    "radiotherapie": ["Bestralingstherapie", "Uitwendige bestraling", "Brachytherapie"],
    "revalidatie": ["Fysiotherapie", "Ergotherapie", "Logopedie"],
    # ...
}

# Alias maps for condition/treatment normalization
CONDITION_ALIASES: dict[str, str] = {
    "hoge bloeddruk": "Hypertensie",
    "suikerziekte": "Diabetes Mellitus",
    # 55+ entries
}

TREATMENT_ALIASES: dict[str, str] = {
    "bestraling": "Radiotherapie",
    "chemo": "Chemotherapie",
    # 20+ entries
}

Also includes:

  • DOCTOR_NAME_STRIP_TOKENS: role tokens to strip from names
  • DOCTOR_NAME_BLOCKLIST: body parts and generic terms that are not person names
  • TREATMENT_SPECIALTY_CONSTRAINTS: prevents cross-specialty leakage (e.g., cochleair implantaat → KNO only)
  • ENTITY_BLOCKLIST: terms that should never become graph entities
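
To illustrate how the strip tokens and the blocklist might work together, here is a hedged sketch. The function `clean_doctor_name` is hypothetical, and the two sets contain only a few illustrative entries (the inclusion of "dr" and "prof" is an assumption); the real lists and the real cleanup logic live in the project and may differ.

```python
import re
from typing import Optional

# Illustrative subsets; not the actual contents of dutch_medical_vocabulary.py.
DOCTOR_NAME_STRIP_TOKENS = {"hoofdverpleegkundige", "diensthoofd", "dr", "prof"}
DOCTOR_NAME_BLOCKLIST = {"borstkas"}


def clean_doctor_name(raw: str) -> Optional[str]:
    """Strip role tokens; return None when no plausible person name remains."""
    # Split on whitespace, dots, and commas so that "Dr." becomes "Dr".
    tokens = re.findall(r"[^\s.,]+", raw)
    kept = [t for t in tokens if t.casefold() not in DOCTOR_NAME_STRIP_TOKENS]
    if not kept:
        return None
    # Reject the candidate if any remaining token is a blocklisted term.
    if any(t.casefold() in DOCTOR_NAME_BLOCKLIST for t in kept):
        return None
    return " ".join(kept)


print(clean_doctor_name("Diensthoofd Dr. Janssen"))
print(clean_doctor_name("Borstkas"))
```

Returning None rather than an empty string lets the caller skip node creation explicitly. Whether a name that survives stripping really belongs to a doctor (for example, one that followed a nursing role token) is a separate decision that this sketch does not make.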

Layer 4: Curated Medical Knowledge (medical_knowledge/)

LLM-generated and human-reviewed relationship maps:

| Module | Content |
| --- | --- |
| department_treatments.py | Which department offers which treatments |
| department_conditions.py | Which department handles which conditions |
| condition_examinations.py | Which examination diagnoses which condition |
| belgian_hospital_departments.py | Valid Belgian hospital department names |

2. Consumption by Extraction and Storage

The extraction, storage, and seeding modules import from the taxonomy layers, and the LLM validation prompt is built from the same data:

  • medical_extraction.py imports campus lists, department aliases, doctor cleanup rules, entity type overrides, and domain knowledge maps
  • typed_nodes.py imports normalization maps, entity type overrides, and canonical names
  • golden_page_seeder.py imports all taxonomy data for curated taxonomy seeding, plus curated knowledge maps from medical_knowledge/
  • LLM validation prompt (ADR-0014) includes taxonomy context: the list of valid campuses, departments, and known dual-entities is appended to the system prompt
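
As a rough illustration of what a consumer could do with the imported maps, the sketch below resolves one extracted entity to a canonical name and type. The function `resolve_entity` and the helper `_ALIASES_BY_TYPE` are hypothetical, the dictionaries are small excerpts of the maps shown earlier, and the order of operations (aliases first, then type override) is an assumption rather than the documented behavior of typed_nodes.py.

```python
# Excerpts of the vocabulary maps; the full versions are imported in practice.
ENTITY_TYPE_OVERRIDES = {"radiotherapie": "department", "dialyse": "department"}
CONDITION_ALIASES = {"hoge bloeddruk": "Hypertensie", "suikerziekte": "Diabetes Mellitus"}
TREATMENT_ALIASES = {"bestraling": "Radiotherapie", "chemo": "Chemotherapie"}

# Hypothetical helper: which alias map applies to which extracted type.
_ALIASES_BY_TYPE = {"condition": CONDITION_ALIASES, "treatment": TREATMENT_ALIASES}


def resolve_entity(name: str, extracted_type: str) -> tuple[str, str]:
    """Return (canonical_name, entity_type) for one extracted entity."""
    cleaned = name.strip()
    key = cleaned.casefold()
    # 1. Alias maps turn colloquial terms into the canonical clinical name.
    canonical = _ALIASES_BY_TYPE.get(extracted_type, {}).get(key, cleaned)
    # 2. The override table wins over whatever type the extractor guessed.
    entity_type = ENTITY_TYPE_OVERRIDES.get(canonical.casefold(), extracted_type)
    return canonical, entity_type


print(resolve_entity("suikerziekte", "condition"))
print(resolve_entity("Radiotherapie", "treatment"))
```

Because the result depends only on the static maps, the same input always resolves to the same node, regardless of the page it was found on.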

3. Quality Metrics Targets

| Metric | Before (v1) | Target (v2) | Measurement Method |
| --- | --- | --- | --- |
| Overall quality | 69/100 | 80+/100 | Weighted composite of all sub-scores |
| Naming consistency | 58/100 | 75+/100 | Duplicate detection + canonical name coverage |
| Search utility | 55/100 | 70+/100 | Query hit rate for common patient search terms |
| Relationship accuracy | 74/100 | 85+/100 | Manual sampling of 50 random relationships |
| Campus assignment | Unknown | 95+/100 | Departments assigned to correct campus(es) |
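
This ADR does not specify how the audit computes each score. As one possible reading of "query hit rate", the sketch below counts the share of patient queries that resolve, via the alias maps, to an entity that exists in the graph. The function `search_hit_rate`, the query list, and the entity set are all hypothetical.

```python
def search_hit_rate(queries, aliases, entity_names) -> float:
    """Percentage (0-100) of queries that resolve to a known entity name."""
    if not queries:
        return 0.0
    known = {name.casefold() for name in entity_names}
    hits = 0
    for query in queries:
        key = query.strip().casefold()
        # Fall back to the query itself when no alias is defined for it.
        target = aliases.get(key, key)
        if target.casefold() in known:
            hits += 1
    return 100.0 * hits / len(queries)


# Illustrative inputs: three of the four queries resolve to a known entity.
aliases = {
    "suikerziekte": "Diabetes Mellitus",
    "hoge bloeddruk": "Hypertensie",
    "chemo": "Chemotherapie",
}
entities = {"Diabetes Mellitus", "Hypertensie", "Chemotherapie"}
queries = ["suikerziekte", "hoge bloeddruk", "chemo", "rugpijn"]
print(search_hit_rate(queries, aliases, entities))  # 75.0
```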

Consequences

Positive

  • Layered architecture: Hospital data (YAML) separated from general medical vocabulary (Python) — clear ownership boundaries
  • Zero duplicate nodes: Canonical names enforced at both extraction and storage layers
  • Clean doctor names: Role tokens and body parts systematically stripped before storage
  • Entity type clarity: ENTITY_TYPE_OVERRIDES + DUAL_ENTITY_MAP handle ambiguous entity types
  • Improved search: Patient-facing aliases map colloquial Dutch terms to clinical entities (55+ condition aliases, 20+ treatment aliases)
  • Better LLM validation: Taxonomy context in the validation prompt gives the LLM authoritative reference data
  • Testable: Taxonomy module is pure data — easy to unit test for completeness and consistency
  • Extensible: New hospitals can add their own YAML config without modifying Python code
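
Because the taxonomy is pure data, a consistency test can be very small. The sketch below is an assumption about what such a test could look like, not an excerpt from test_zol_taxonomy.py; `find_alias_collisions` and both test functions are hypothetical. It flags any name claimed by more than one canonical entry, which also gives early warning of the over-normalization risk.

```python
from collections import defaultdict


def find_alias_collisions(alias_sets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return case-folded names that are claimed by more than one canonical name."""
    owners: dict[str, set[str]] = defaultdict(set)
    for canonical, aliases in alias_sets.items():
        # A canonical name counts as a name for itself.
        for name in (canonical, *aliases):
            owners[name.casefold()].add(canonical)
    return {name: sorted(claimed) for name, claimed in owners.items() if len(claimed) > 1}


def test_no_alias_collisions():
    departments = {
        "Anesthesiologie": ["Anesthesie"],
        "Cardiologie": ["Dienst Cardiologie"],
    }
    assert find_alias_collisions(departments) == {}


def test_collision_is_reported():
    # "Anesthesie" is both an alias and a canonical name: the kind of
    # contradiction that previously produced duplicate department nodes.
    departments = {"Anesthesiologie": ["Anesthesie"], "Anesthesie": []}
    assert find_alias_collisions(departments) == {
        "anesthesie": ["Anesthesie", "Anesthesiologie"]
    }
```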

Negative

  • Taxonomy maintenance burden: New departments, conditions, or treatments require updating YAML config and/or vocabulary module
  • Large refactor: Original consolidation required careful migration from ~20 scattered constants
  • Domain expertise required: Taxonomy curation needs clinical/hospital knowledge — not purely a development task
  • Risk of over-normalization: Aggressive alias maps could incorrectly merge entities that are actually distinct

Neutral

  • Pipeline architecture unchanged — regex extraction, LLM validation, and taxonomy storage still operate in the same sequence
  • No database migration required — only the data flowing into the taxonomy changes
  • No impact on query-time latency — normalization happens at ingestion time
  • LLM validation (ADR-0014) continues to operate as an independent quality gate

Alternatives Considered

Alternative 1: Fix Constants In-Place

Continue maintaining normalization data as separate constants in medical_extraction.py and typed_nodes.py, just fixing the contradictions. Rejected: this does not solve the root cause. With two files independently defining canonical names, contradictions will inevitably re-emerge as new entities are added. The scattered approach also makes auditing impossible — you cannot answer "what is the canonical name for X?" without searching multiple files.

Alternative 2: Database-Driven Taxonomy

Store the taxonomy in PostgreSQL or Neo4j itself, with an admin UI for editing. Rejected: the ZOL domain is stable — departments and campuses rarely change. A database-driven approach adds migration complexity, admin UI development cost, and cache invalidation logic for a problem that a simple YAML + Python module solves cleanly. If the domain becomes dynamic (e.g., frequent department reorganizations), this alternative should be revisited.

Alternative 3: LLM-Only Normalization

Remove all hardcoded normalization and let the LLM (ADR-0014) handle all canonicalization. Rejected: LLM responses are non-deterministic — the same entity might be normalized differently across runs. This makes the graph unreproducible and hard to debug. LLMs also hallucinate canonical names that don't exist. The taxonomy provides the deterministic foundation; the LLM validates what the taxonomy cannot express (semantic plausibility).

Implementation

| File | Purpose |
| --- | --- |
| backend/app/services/graph/hospital_config/zol.yaml | Hospital-specific taxonomy data (YAML) |
| backend/app/services/graph/hospital_config/schema.py | Pydantic validation models for YAML config |
| backend/app/services/graph/zol_taxonomy.py | Runtime API: loads YAML, exposes frozen dataclasses and lookup maps |
| backend/app/services/graph/medical_knowledge/dutch_medical_vocabulary.py | General Dutch medical vocabulary (entity overrides, aliases, blocklists) |
| backend/app/services/graph/medical_knowledge/department_treatments.py | Curated dept→treatment knowledge map |
| backend/app/services/graph/medical_knowledge/department_conditions.py | Curated dept→condition knowledge map |
| backend/app/services/graph/medical_knowledge/condition_examinations.py | Curated condition→examination knowledge map |
| backend/app/services/graph/medical_extraction.py | Updated to import from taxonomy layers |
| backend/app/services/graph/typed_nodes.py | Updated to import from taxonomy layers |
| backend/app/services/graph/llm_entity_validation.py | Taxonomy context appended to validation prompt |
| backend/app/services/graph/taxonomy/golden_page_seeder.py | Curated taxonomy seeding from hub pages + knowledge maps |
| backend/tests/unit/services/test_zol_taxonomy.py | Taxonomy consistency and completeness tests |

References