Architectural Update (March 2026)

This ADR was written when the system used Neo4j for entity storage. As of March 2026, Neo4j has been fully removed and replaced by PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The decision rationale documented here remains valid; the storage layer has changed.

ADR-0015: Taxonomy-Driven Entity Normalization

Date: 2026-02-09 | Status: Accepted

Context

After implementing LLM entity validation (ADR-0014), the knowledge graph reached 3,578 nodes and 28,374 relationships — but a quality audit scored it only 69/100 overall, with particularly poor scores for naming consistency (58/100) and search utility (55/100).

Root Cause: Scattered Normalization Data

The normalization logic was distributed across ~20 constants in 2 files (medical_extraction.py and typed_nodes.py), with no single source of truth. This produced three categories of defects:

| Defect Category | Example | Impact |
| --- | --- | --- |
| Contradictory normalization | "Anesthesie" normalized to "Anesthesiologie" in one file but kept as "Anesthesie" in another | Duplicate nodes in Neo4j for the same department |
| Missing campus constraints | A 5th campus ("ZOL") appeared alongside the 4 real campuses (Sint-Jan, André Dumont, Sint-Barbara, Maas en Kempen) | Phantom campus hub node with incorrect relationships |
| Duplicate department pairs | 7+ pairs like "Cardiologie" / "Dienst Cardiologie", "Urologie" / "Dienst Urologie" | Fragmented graph: queries miss half the relationships |

Doctor Name Pollution

Regex extraction parsed role tokens and body parts as doctor name components:

"Hoofdverpleegkundige Van Damme" → doctor named "Hoofdverpleegkundige Van Damme" (head nurse is a role, not a name)
"Borstkaschirurgie" → doctor named "Borstkas" (chest is a body part, not a person)
"Diensthoofd Dr. Janssen" → doctor named "Diensthoofd Janssen" (department head is a role)

These polluted the graph with dozens of fake doctor nodes that connected to real departments, degrading relationship quality.

Entity Type Confusion

Several entities exist as both a department and a treatment in the ZOL context:

Radiotherapie: a department (afdeling) that patients visit, and a treatment (behandeling) that patients receive
Dialyse: a department with staff and rooms, and a treatment procedure
Revalidatie: a department on campus, and a treatment program

Without an authoritative type override, the extraction pipeline classified these entities as a department on some pages and as a treatment on others, depending on page context, which produced an inconsistent graph structure.

Decision

1. Multi-Layer Taxonomy Architecture

The taxonomy data is organized in a layered architecture, separating hospital-specific configuration from general Dutch medical vocabulary:

Layer 1: Hospital Configuration (`hospital_config/zol.yaml`)

All ZOL-specific data lives in a YAML configuration file, validated at load time by Pydantic models defined in hospital_config/schema.py. This includes:

Campus definitions (4 campuses with addresses)
Department definitions (~57 departments with aliases, campus assignments, domain groups)
Domain knowledge maps (dept→conditions, dept→treatments, condition→examinations)
Hub page definitions for curated taxonomy seeding
Search aliases mapping patient language to clinical terms

Layer 2: Runtime API (`zol_taxonomy.py`)

The zol_taxonomy.py module (960+ lines) loads the YAML configuration and exposes it as frozen dataclasses and pre-computed lookup structures:

from dataclasses import dataclass, field

@dataclass(frozen=True)
class CampusDefinition:
    id: str
    canonical_name: str
    aliases: list[str] = field(default_factory=list)
    address: str = ""
    city: str = ""
    postal_code: str = ""
    phone: str = ""

@dataclass(frozen=True)
class DepartmentDefinition:
    canonical_name: str
    aliases: list[str] = field(default_factory=list)
    campuses: list[str] = field(default_factory=list)
    domain_group: str = "general"
    is_diagnostic: bool = False

Key derived structures include:

CAMPUS_CANONICAL_NAMES: set of valid campus names
DEPARTMENT_CAMPUS_MAP: department → campus list mapping
DEPT_CONDITION_MAP, DEPT_TREATMENT_MAP: domain knowledge loaded from YAML
ALL_DEPARTMENT_NAMES, ALL_CONDITION_NAMES, ALL_SPECIALTY_NAMES: pre-compiled sets for O(1) lookup
SEARCH_ALIASES: combined universal + hospital-specific patient-facing aliases
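
A minimal sketch of how derived structures like these could be pre-computed once at import time. The sample departments, their campus assignments, and the `build_alias_index` and `canonical_department` helpers are illustrative assumptions, not the contents of zol_taxonomy.py.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DepartmentDefinition:
    canonical_name: str
    aliases: list[str] = field(default_factory=list)
    campuses: list[str] = field(default_factory=list)


# Illustrative sample data; the real definitions are loaded from zol.yaml
# and the campus assignments below are made up for the example.
CAMPUS_CANONICAL_NAMES = frozenset(
    {"Sint-Jan", "André Dumont", "Sint-Barbara", "Maas en Kempen"}
)
DEPARTMENTS = [
    DepartmentDefinition("Cardiologie", ["Dienst Cardiologie"], ["Sint-Jan"]),
    DepartmentDefinition("Anesthesiologie", ["Anesthesie"], ["Sint-Jan", "Sint-Barbara"]),
]

# Derived once, then used for constant-time lookups during extraction.
DEPARTMENT_CAMPUS_MAP = {d.canonical_name: list(d.campuses) for d in DEPARTMENTS}
ALL_DEPARTMENT_NAMES = frozenset(d.canonical_name for d in DEPARTMENTS)


def build_alias_index(departments):
    """Map every case-folded canonical name and alias to its canonical name."""
    index = {}
    for dept in departments:
        for name in (dept.canonical_name, *dept.aliases):
            index[name.casefold()] = dept.canonical_name
    return index


DEPARTMENT_ALIAS_INDEX = build_alias_index(DEPARTMENTS)


def canonical_department(raw):
    """Return the canonical department name, or None if the name is unknown."""
    return DEPARTMENT_ALIAS_INDEX.get(raw.strip().casefold())
```

With an index like this, both "Dienst Cardiologie" and "Cardiologie" resolve to the same canonical name, and an unknown value such as "ZOL" resolves to None instead of creating a new node.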

Layer 3: General Medical Vocabulary (`dutch_medical_vocabulary.py`)

Language-level medical knowledge that is not hospital-specific:

# Entity type overrides — prevents misclassification
ENTITY_TYPE_OVERRIDES: dict[str, str] = {
    "radiotherapie": "department",
    "dialyse": "department",
    "revalidatie": "department",
    "patiëntenfiches": "service",
    "palliatief support team": "service",
    "valkliniek": "facility",
    # ...
}

# Dual-entity map — departments that also represent treatments
DUAL_ENTITY_MAP: dict[str, list[str]] = {
    "radiotherapie": ["Bestralingstherapie", "Uitwendige bestraling", "Brachytherapie"],
    "revalidatie": ["Fysiotherapie", "Ergotherapie", "Logopedie"],
    # ...
}

# Alias maps for condition/treatment normalization
CONDITION_ALIASES: dict[str, str] = {
    "hoge bloeddruk": "Hypertensie",
    "suikerziekte": "Diabetes Mellitus",
    # 55+ entries
}

TREATMENT_ALIASES: dict[str, str] = {
    "bestraling": "Radiotherapie",
    "chemo": "Chemotherapie",
    # 20+ entries
}
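
The maps above could be applied roughly as follows. This is a sketch under assumptions: the function names `resolve_entity_type`, `linked_treatments`, and `normalize_condition` are hypothetical, and the dictionaries contain only a few entries copied from the excerpts above.

```python
# A few entries copied from the excerpts above; the real maps are larger.
ENTITY_TYPE_OVERRIDES = {"radiotherapie": "department", "dialyse": "department"}
DUAL_ENTITY_MAP = {
    "radiotherapie": ["Bestralingstherapie", "Uitwendige bestraling", "Brachytherapie"],
}
CONDITION_ALIASES = {"hoge bloeddruk": "Hypertensie", "suikerziekte": "Diabetes Mellitus"}


def resolve_entity_type(name, extracted_type):
    """Prefer the authoritative override over the type guessed from page context."""
    return ENTITY_TYPE_OVERRIDES.get(name.strip().casefold(), extracted_type)


def linked_treatments(name):
    """Treatments represented by a dual-entity department (empty list if none)."""
    return list(DUAL_ENTITY_MAP.get(name.strip().casefold(), []))


def normalize_condition(name):
    """Map a colloquial condition name to its clinical form, else keep it as-is."""
    cleaned = name.strip()
    return CONDITION_ALIASES.get(cleaned.casefold(), cleaned)
```

Because the override is a plain dictionary lookup, the same input always yields the same type, regardless of which page the entity was extracted from.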

Also includes:

DOCTOR_NAME_STRIP_TOKENS: role tokens to strip from names
DOCTOR_NAME_BLOCKLIST: body parts and generic terms that are not person names
TREATMENT_SPECIALTY_CONSTRAINTS: prevents cross-specialty leakage (e.g., cochleair implantaat → KNO only)
ENTITY_BLOCKLIST: terms that should never become graph entities
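
As an illustration of how the doctor-name rules might be applied before storage, the sketch below strips role tokens and rejects names that contain blocklisted terms. The token values and the `clean_doctor_name` function are assumptions for the example; the actual lists live in dutch_medical_vocabulary.py.

```python
import re

# Hypothetical sample values, chosen to match the examples in the Context section.
DOCTOR_NAME_STRIP_TOKENS = {"dr", "prof", "diensthoofd", "hoofdverpleegkundige"}
DOCTOR_NAME_BLOCKLIST = {"borstkas"}


def clean_doctor_name(raw):
    """Strip role tokens; return None if nothing name-like is left."""
    tokens = [t for t in re.split(r"\s+", raw.strip()) if t]
    # Compare without a trailing period so that "Dr." matches the token "dr".
    kept = [
        t for t in tokens
        if t.rstrip(".").casefold() not in DOCTOR_NAME_STRIP_TOKENS
    ]
    if not kept or any(t.casefold() in DOCTOR_NAME_BLOCKLIST for t in kept):
        return None
    return " ".join(kept)
```

Returning None rather than a partial string lets the caller drop the candidate instead of storing a fake doctor node.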

Layer 4: Curated Medical Knowledge (`medical_knowledge/`)

LLM-generated and human-reviewed relationship maps:

| Module | Content |
| --- | --- |
| `department_treatments.py` | Which department offers which treatments |
| `department_conditions.py` | Which department handles which conditions |
| `condition_examinations.py` | Which examination diagnoses which condition |
| `belgian_hospital_departments.py` | Valid Belgian hospital department names |

2. Consumption by Extraction and Storage

Four downstream consumers draw on the taxonomy layers:

medical_extraction.py imports campus lists, department aliases, doctor cleanup rules, entity type overrides, and domain knowledge maps
typed_nodes.py imports normalization maps, entity type overrides, and canonical names
golden_page_seeder.py imports all taxonomy data for curated taxonomy seeding, plus curated knowledge maps from medical_knowledge/
LLM validation prompt (ADR-0014) includes taxonomy context: the list of valid campuses, departments, and known dual-entities is appended to the system prompt
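
The prompt integration in the last item can be sketched as plain string assembly. The wording of the context block and the function names `build_taxonomy_context` and `with_taxonomy_context` are assumptions; the ADR states only that the lists are appended to the system prompt.

```python
def build_taxonomy_context(campuses, departments, dual_entities):
    """Render the reference lists as a plain-text block for the system prompt."""
    lines = [
        "Reference taxonomy (authoritative):",
        # Sorting keeps the prompt identical across runs for the same taxonomy.
        "Valid campuses: " + ", ".join(sorted(campuses)),
        "Valid departments: " + ", ".join(sorted(departments)),
        "Departments that are also treatments: " + ", ".join(sorted(dual_entities)),
    ]
    return "\n".join(lines)


def with_taxonomy_context(system_prompt, context):
    """Append the taxonomy block to an existing system prompt."""
    return f"{system_prompt.rstrip()}\n\n{context}"
```

Sorting the names is a deliberate choice in this sketch: sets have no stable iteration order, and a prompt that changes between runs would make validation results harder to compare.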

3. Quality Metrics Targets

| Metric | Before (v1) | Target (v2) | Measurement Method |
| --- | --- | --- | --- |
| Overall quality | 69/100 | 80+/100 | Weighted composite of all sub-scores |
| Naming consistency | 58/100 | 75+/100 | Duplicate detection + canonical name coverage |
| Search utility | 55/100 | 70+/100 | Query hit rate for common patient search terms |
| Relationship accuracy | 74/100 | 85+/100 | Manual sampling of 50 random relationships |
| Campus assignment | Unknown | 95+/100 | Departments assigned to correct campus(es) |
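
The two inputs to the naming-consistency score could be computed along these lines. This is a sketch only: the helper names are hypothetical, the "Dienst " prefix check covers just the duplicate pattern named in the Context section, and the ADR does not specify how these values are converted into a 0-100 score.

```python
def find_prefixed_duplicates(names, prefix="Dienst "):
    """Return (bare, prefixed) pairs such as ("Cardiologie", "Dienst Cardiologie")."""
    present = set(names)
    return sorted(
        (n[len(prefix):], n)
        for n in present
        if n.startswith(prefix) and n[len(prefix):] in present
    )


def canonical_coverage(names, canonical_names):
    """Fraction of node names that are already canonical (0.0 for an empty graph)."""
    names = list(names)
    if not names:
        return 0.0
    return sum(n in canonical_names for n in names) / len(names)
```

For example, a graph containing "Cardiologie", "Dienst Cardiologie", "Urologie", "Dienst Urologie", and "Anesthesiologie" yields two duplicate pairs, and three of its five names are canonical.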

Consequences

Positive

Layered architecture: Hospital data (YAML) separated from general medical vocabulary (Python) — clear ownership boundaries
Zero duplicate nodes: Canonical names enforced at both extraction and storage layers
Clean doctor names: Role tokens and body parts systematically stripped before storage
Entity type clarity: ENTITY_TYPE_OVERRIDES + DUAL_ENTITY_MAP handle ambiguous entity types
Improved search: Patient-facing aliases map colloquial Dutch terms to clinical entities (55+ condition aliases, 20+ treatment aliases)
Better LLM validation: Taxonomy context in the validation prompt gives the LLM authoritative reference data
Testable: Taxonomy module is pure data — easy to unit test for completeness and consistency
Extensible: New hospitals can add their own YAML config without modifying Python code
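
Because the taxonomy is pure data, consistency checks can run without a database or an LLM. The sketch below shows the kind of check a unit test might make; the `find_taxonomy_problems` function and the dictionary shape it expects are assumptions for illustration, not the contents of test_zol_taxonomy.py.

```python
def find_taxonomy_problems(campuses, departments):
    """Return human-readable problems; an empty list means the data is consistent.

    `departments` maps canonical name -> {"aliases": [...], "campuses": [...]}.
    """
    problems = []
    owner = {}  # case-folded name or alias -> canonical name that claimed it
    for canonical, spec in departments.items():
        # Every campus reference must point at a declared campus.
        for campus in spec.get("campuses", []):
            if campus not in campuses:
                problems.append(f"{canonical}: unknown campus {campus!r}")
        # No name or alias may resolve to two different canonical names.
        for name in (canonical, *spec.get("aliases", [])):
            key = name.casefold()
            if owner.setdefault(key, canonical) != canonical:
                problems.append(f"{name!r} maps to both {owner[key]} and {canonical}")
    return problems
```

A check like this would flag both defects described in the Context section: a phantom campus such as "ZOL", and a name like "Anesthesie" that is an alias in one place and a canonical name in another.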

Negative

Taxonomy maintenance burden: New departments, conditions, or treatments require updating YAML config and/or vocabulary module
Large refactor: Original consolidation required careful migration from ~20 scattered constants
Domain expertise required: Taxonomy curation needs clinical/hospital knowledge — not purely a development task
Risk of over-normalization: Aggressive alias maps could incorrectly merge entities that are actually distinct

Neutral

Pipeline architecture unchanged — regex extraction, LLM validation, and taxonomy storage still operate in the same sequence
No database migration required — only the data flowing into the taxonomy changes
No impact on query-time latency — normalization happens at ingestion time
LLM validation (ADR-0014) continues to operate as an independent quality gate

Alternatives Considered

Alternative 1: Fix Constants In-Place

Continue maintaining normalization data as separate constants in medical_extraction.py and typed_nodes.py, just fixing the contradictions. Rejected: this does not solve the root cause. With two files independently defining canonical names, contradictions will inevitably re-emerge as new entities are added. The scattered approach also makes auditing impossible — you cannot answer "what is the canonical name for X?" without searching multiple files.

Alternative 2: Database-Driven Taxonomy

Store the taxonomy in PostgreSQL or Neo4j itself, with an admin UI for editing. Rejected: the ZOL domain is stable — departments and campuses rarely change. A database-driven approach adds migration complexity, admin UI development cost, and cache invalidation logic for a problem that a simple YAML + Python module solves cleanly. If the domain becomes dynamic (e.g., frequent department reorganizations), this alternative should be revisited.

Alternative 3: LLM-Only Normalization

Remove all hardcoded normalization and let the LLM (ADR-0014) handle all canonicalization. Rejected: LLM responses are non-deterministic — the same entity might be normalized differently across runs. This makes the graph unreproducible and hard to debug. LLMs also hallucinate canonical names that don't exist. The taxonomy provides the deterministic foundation; the LLM validates what the taxonomy cannot express (semantic plausibility).

Implementation

| File | Purpose |
| --- | --- |
| `backend/app/services/graph/hospital_config/zol.yaml` | Hospital-specific taxonomy data (YAML) |
| `backend/app/services/graph/hospital_config/schema.py` | Pydantic validation models for YAML config |
| `backend/app/services/graph/zol_taxonomy.py` | Runtime API: loads YAML, exposes frozen dataclasses and lookup maps |
| `backend/app/services/graph/medical_knowledge/dutch_medical_vocabulary.py` | General Dutch medical vocabulary (entity overrides, aliases, blocklists) |
| `backend/app/services/graph/medical_knowledge/department_treatments.py` | Curated dept→treatment knowledge map |
| `backend/app/services/graph/medical_knowledge/department_conditions.py` | Curated dept→condition knowledge map |
| `backend/app/services/graph/medical_knowledge/condition_examinations.py` | Curated condition→examination knowledge map |
| `backend/app/services/graph/medical_extraction.py` | Updated to import from taxonomy layers |
| `backend/app/services/graph/typed_nodes.py` | Updated to import from taxonomy layers |
| `backend/app/services/graph/llm_entity_validation.py` | Taxonomy context appended to validation prompt |
| `backend/app/services/graph/taxonomy/golden_page_seeder.py` | Curated taxonomy seeding from hub pages + knowledge maps |
| `backend/tests/unit/services/test_zol_taxonomy.py` | Taxonomy consistency and completeness tests |

