Knowledge Graph Architecture

While vector search excels at finding semantically similar text, it fundamentally operates on unstructured content. The ZOL hospital domain, however, contains rich structured relationships -- doctors work in departments, departments are located at campuses, departments treat conditions. Capturing and querying these relationships requires a knowledge graph.

Part of the Knowledge & Retrieval Steering triad

The taxonomy (ZOL's structured store of hospital entities and their typed relationships — see the Glossary) is the structured-knowledge member of a three-subsystem set. It is anchored and rescued by SNOMED CT (the ontology that stamps entities with concept IDs and expands query vocabulary) and its query-time results are arbitrated by the Value Framework (the intent × category reranker). See the Core Concepts flow for how all three compose end-to-end. This page covers the data model; the population lifecycle is below, and query-time use is in Taxonomy Query Enrichment.

Two layers of "taxonomy"

The word "taxonomy" names two distinct layers in this system, and separating them is the key to understanding everything below. Both are hospital-agnostic; both carry SNOMED anchors; but they differ in provenance and trust.

Layer	What it is	Source	Where it lives	Trust
A. Config taxonomy (`HospitalTaxonomy`)	Curated alias maps, department/campus definitions, condition→department routing, plausibility guards	`HospitalConfig` (YAML or DB) + universal Dutch medical knowledge	In-memory object, built per-tenant	Hand-curated — trusted directly
B. Scraped taxonomy (`taxonomy_entities` → `published_entities`)	Entities + typed relationships harvested from the hospital website	Crawl → LLM extraction → dedup → human review → publish	PostgreSQL `app.` tables	Machine-extracted — gated behind human review + versioned publish

The mental model: Layer A is the blueprint (the known org structure and routing rules), while Layer B is the harvested content (the actual doctors, departments, and relationships scraped from the live site). Layer A changes rarely, so it is safe to hold in memory and trust directly; Layer B is rebuilt as the website changes and therefore passes through a draft→approve→publish gate before any query can see it. The remainder of this page covers Layer B's data model; the population lifecycle section shows how it is filled.

Layer A supplies the rules (how to normalize "Intensieve Zorgen", which department handles "hartfalen"); Layer B supplies the facts (which actual doctors exist and where they work). At query time both are consulted together — Layer A's alias maps resolve the user's words, Layer B's published snapshot supplies the ground-truth entities.
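As a rough illustration of how the two layers combine at query time, the sketch below resolves a user's wording through a Layer A alias map and then reads matching facts from a Layer B snapshot. The data shapes, the `doctors_for` helper, and the sample names are assumptions made for this example, not the actual `HospitalTaxonomy` API.

```python
from dataclasses import dataclass
from typing import List

# Layer A (assumed shape): curated alias map, lower-cased alias -> canonical department.
DEPARTMENT_ALIASES = {
    "intensieve zorgen": "Intensieve Zorgen",
    "intensive care": "Intensieve Zorgen",
}


@dataclass(frozen=True)
class PublishedEntity:
    canonical_name: str
    entity_type: str
    department: str


# Layer B (assumed shape): rows from the latest published snapshot.
PUBLISHED = [
    PublishedEntity("Dr. Example A", "doctor", "Intensieve Zorgen"),
    PublishedEntity("Dr. Example B", "doctor", "Cardiologie"),
]


def doctors_for(user_term: str) -> List[str]:
    """Resolve the user's wording with Layer A, then read facts from Layer B."""
    canonical = DEPARTMENT_ALIASES.get(user_term.strip().lower())
    if canonical is None:
        return []  # unknown wording: no rule to apply
    return [
        e.canonical_name
        for e in PUBLISHED
        if e.entity_type == "doctor" and e.department == canonical
    ]
```

Here "Intensive Care" and "Intensieve Zorgen" both lead to the same doctors, because the alias map collapses them before the snapshot is consulted.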

What Is a Knowledge Graph?

A knowledge graph, as used here, is a property graph model in which nodes represent entities (doctors, departments, campuses), edges represent relationships (WORKS_IN, LOCATED_AT, HANDLES), and properties store attributes (names, schedules). This model is well-suited to the hospital domain, where relationships between entities are as important as the entities themselves.

Why a Knowledge Graph for Hospital Data?

Hospital information is inherently relational. Consider the question: "Welke dokter doet knieoperaties op campus Sint-Jan?" (Which doctor performs knee operations at campus Sint-Jan?). Answering this requires traversing multiple relationships:

Find Treatment node "Knee Operation"
Find Department that OFFERS this treatment
Find Doctors who WORKS_IN this department
Filter by Department LOCATED_AT campus Sint-Jan

No single text passage in the hospital's content explicitly states "Dr. X performs knee operations at Sint-Jan." This information is implicit in the relationships between separate pieces of content. A knowledge graph makes these implicit relationships explicit and queryable.
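The four traversal steps above can be sketched over a small in-memory edge list. This is only an illustration of the multi-hop idea: the edge data, the doctor names, and the `doctors_for_treatment` helper are hypothetical, and the production system queries PostgreSQL rather than Python sets.

```python
from typing import List, Set

# Hypothetical edge list: (source, relationship, target).
EDGES = [
    ("Orthopedie", "OFFERS", "Knieoperatie"),
    ("Orthopedie", "LOCATED_AT", "Sint-Jan"),
    ("Cardiologie", "LOCATED_AT", "Sint-Jan"),
    ("Dr. Example A", "WORKS_IN", "Orthopedie"),
    ("Dr. Example B", "WORKS_IN", "Cardiologie"),
]


def sources(relationship: str, target: str) -> Set[str]:
    """All nodes that point at `target` via `relationship`."""
    return {s for s, r, t in EDGES if r == relationship and t == target}


def doctors_for_treatment(treatment: str, campus: str) -> List[str]:
    # Steps 1-2: departments that OFFER the treatment.
    departments = sources("OFFERS", treatment)
    # Step 4 (applied early): keep only departments LOCATED_AT the campus.
    departments &= sources("LOCATED_AT", campus)
    # Step 3: doctors who WORKS_IN any remaining department.
    doctors: Set[str] = set()
    for department in departments:
        doctors |= sources("WORKS_IN", department)
    return sorted(doctors)
```

No single edge states that a doctor performs knee operations at Sint-Jan; the answer only appears once the three relationship types are combined.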

The ZOL Domain Model

Entity Types

Entity	Description	Example Count
Hospital	Top-level organization (ZOL)	1
Doctor	Medical professionals at ZOL (specialty stored as node property)	~300+
Department	Hospital departments and services	~50
Campus	Physical hospital locations	4
Treatment	Medical procedures and therapies	~200
Condition	Diseases, disorders, symptoms	~700
Examination	Diagnostic procedures	~100
Center	Multi-disciplinary centers (e.g., Borstcentrum, Slaapcentrum)	~15
Facility	Physical amenities (parking, cafeteria, charging stations)	~20
Service	Hospital services (visiting hours, appointments, general services)	~30
SourcePage	Provenance tracking — the web page from which entities were extracted. Note: `SourcePage` is a provenance concept used in relationship metadata (`CURATED_FROM`, `MENTIONED_IN`, `SOURCED_FROM`); it is not stored as a first-class entity type in `taxonomy_entities`.	~600

Relationship Types

Relationship	From	To	Properties	Example
HAS_CAMPUS	Hospital	Campus	confidence	ZOL has campus Sint-Jan
BELONGS_TO	Department	Hospital	confidence	Cardiology belongs to ZOL
WORKS_IN	Doctor	Department	role, schedule, status, contacts, confidence	Dr. Peeters works in Cardiology (Mon/Wed)
WORKS_AT_CAMPUS	Doctor	Campus	confidence	Dr. Peeters works at campus Sint-Jan (auto-derived from WORKS_IN + LOCATED_AT)
LOCATED_AT	Department/Center	Campus	building, floor, confidence	Cardiology is at Sint-Jan, Building A, Floor 2
OFFERS	Department	Treatment	confidence	Cardiology offers Cardiac Catheterization
HANDLES	Department	Condition	confidence	Cardiology handles Heart Failure
PERFORMS	Department	Examination	confidence	Radiology performs MRI Scan
TREATS	Treatment	Condition	confidence	Chemotherapy treats Breast Cancer
DIAGNOSES	Examination	Condition	confidence	Echocardiography diagnoses Heart Failure
RELATED_TO	Treatment	Condition	confidence	Radiotherapy is related to Cancer
USED_FOR	Examination	Condition	confidence	CT Scan is used for Pulmonary Embolism
HAS_FACILITY	Campus	Facility	confidence	Sint-Jan has Parking Garage
PROVIDES	Campus	Service	confidence	Sint-Jan provides Visiting Hours Service
AVAILABLE_AT	Service	Campus	confidence	Appointments service available at Sint-Jan
IS_A	Condition/Treatment	Condition/Treatment	source, confidence	Diabetes Type 1 IS_A Diabetes Mellitus (SNOMED CT hierarchy, Phase B)
CURATED_FROM	Entity	SourcePage	--	Entity extracted from hub page source
MENTIONED_IN	Entity	SourcePage	--	Entity mentioned in crawled page
SOURCED_FROM	Entity	SourcePage	--	General provenance link to the source page

WORKS_IN relationships can carry optional schedule properties (consultation days, status, contacts) that link a doctor's consultation schedule to their department rather than directly to a campus.
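The table notes that WORKS_AT_CAMPUS is auto-derived from WORKS_IN + LOCATED_AT. A minimal sketch of such a derivation follows, assuming plain tuples as input; the function name and data shapes are illustrative and not taken from the codebase.

```python
from typing import Dict, Iterable, List, Set, Tuple


def derive_works_at_campus(
    works_in: Iterable[Tuple[str, str]],    # (doctor, department)
    located_at: Iterable[Tuple[str, str]],  # (department, campus)
) -> List[Tuple[str, str, str]]:
    """Join WORKS_IN with LOCATED_AT on the shared department."""
    campuses_by_department: Dict[str, Set[str]] = {}
    for department, campus in located_at:
        campuses_by_department.setdefault(department, set()).add(campus)

    derived: Set[Tuple[str, str, str]] = set()
    for doctor, department in works_in:
        # A department with no known campus yields no derived edge.
        for campus in campuses_by_department.get(department, ()):
            derived.add((doctor, "WORKS_AT_CAMPUS", campus))
    return sorted(derived)
```

A department located at two campuses yields two derived edges per doctor, which is one reason schedule details stay on WORKS_IN rather than on the campus link.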

PostgreSQL as the Entity Store

Entity relationships are stored in PostgreSQL taxonomy tables (taxonomy_entities and taxonomy_relationships), which replaced Neo4j as the primary entity store. PostgreSQL was selected for:

Unified infrastructure: Entities, relationships, vectors (pgvector), and SNOMED CT data all reside in a single database
SQL query language: Well-understood, performant, and directly integrated with the application's SQLAlchemy ORM
Relational integrity: Foreign keys and constraints ensure data consistency across entity types
Simplified operations: No separate graph database to deploy, monitor, and maintain

Migration from Neo4j

Neo4j was fully removed in March 2026. All entity relationships now live in PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The HospitalTaxonomy class provides the same query API surface, with SQL queries replacing Cypher.

taxonomy_relationships Schema

The taxonomy_relationships table includes entity-type columns that enable efficient type-filtered queries without joining back to taxonomy_entities:

Column	Type	Description
`source_type`	VARCHAR(30)	Entity type of the source (e.g., `DOCTOR`, `DEPARTMENT`)
`target_type`	VARCHAR(30)	Entity type of the target (e.g., `DEPARTMENT`, `CONDITION`)

Unique constraint: (hospital_id, source_name, target_name, relationship_type, draft_version) — prevents duplicate relationships per draft version.
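To make the effect of the unique constraint concrete, the sketch below recreates a simplified version of the table in an in-memory SQLite database. The column types and the `add_relationship` helper are assumptions for illustration; the real table lives in PostgreSQL and likely has more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE taxonomy_relationships (
        hospital_id       TEXT NOT NULL,
        source_name       TEXT NOT NULL,
        target_name       TEXT NOT NULL,
        relationship_type TEXT NOT NULL,
        draft_version     INTEGER NOT NULL,  -- type assumed for this sketch
        source_type       TEXT,
        target_type       TEXT,
        UNIQUE (hospital_id, source_name, target_name,
                relationship_type, draft_version)
    )
    """
)


def add_relationship(row: tuple) -> bool:
    """Insert one relationship; return False if it duplicates an existing one."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO taxonomy_relationships VALUES (?, ?, ?, ?, ?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Inserting the same relationship twice within one draft version is rejected, while the same relationship in a later draft version is accepted, since `draft_version` is part of the key.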

Draft/Publish System

The draft_version column is managed by the Draft/Publish System (SP-5). Relationships belong to a versioned snapshot, enabling rollback and impact preview before operators promote a new taxonomy version to production.

Query Examples

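A query like "Find all doctors in the Orthopedics department" uses the taxonomy tables (restricting the result to a campus such as Sint-Jan would add a second join on the department's LOCATED_AT relationship):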

SELECT te.canonical_name, te.metadata->>'specialty'
FROM app.taxonomy_entities te
JOIN app.taxonomy_relationships tr ON te.canonical_name = tr.source_name
WHERE tr.target_name = 'Orthopedie'
  AND tr.relationship_type = 'WORKS_IN'
  AND te.entity_type = 'doctor';

These queries typically execute in milliseconds and return precise structured results that vector search alone cannot retrieve reliably.
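For a runnable variant that also filters by campus, the sketch below uses an in-memory SQLite table with a reduced set of columns. It relies on the `source_type` column to select doctors without joining back to `taxonomy_entities`. The sample rows, the simplified schema, and the `doctors_in` helper are assumptions for this example.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE taxonomy_relationships (
        source_name TEXT, source_type TEXT,
        target_name TEXT, target_type TEXT,
        relationship_type TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO taxonomy_relationships VALUES (?, ?, ?, ?, ?)",
    [
        ("Dr. Example A", "DOCTOR", "Orthopedie", "DEPARTMENT", "WORKS_IN"),
        ("Dr. Example B", "DOCTOR", "Cardiologie", "DEPARTMENT", "WORKS_IN"),
        ("Orthopedie", "DEPARTMENT", "Sint-Jan", "CAMPUS", "LOCATED_AT"),
        ("Cardiologie", "DEPARTMENT", "Sint-Jan", "CAMPUS", "LOCATED_AT"),
    ],
)

# Self-join: w is the doctor -> department edge, l is department -> campus.
QUERY = """
    SELECT w.source_name
    FROM taxonomy_relationships AS w
    JOIN taxonomy_relationships AS l
      ON l.source_name = w.target_name
    WHERE w.relationship_type = 'WORKS_IN'
      AND w.source_type = 'DOCTOR'
      AND l.relationship_type = 'LOCATED_AT'
      AND w.target_name = ?
      AND l.target_name = ?
    ORDER BY w.source_name
"""


def doctors_in(department: str, campus: str) -> List[str]:
    return [name for (name,) in conn.execute(QUERY, (department, campus))]
```

A production query would presumably also filter on the hospital and on the published version; those filters are left out here to keep the join visible.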

Tiered Query Strategy

Not all entity queries are straightforward taxonomy lookups. The system employs a two-tier strategy to maximize both precision and recall:

Tier 1: Taxonomy Queries

Direct SQL queries against taxonomy entities (Doctor, Department, Campus, Condition, Treatment, Examination). Used when the intent classifier identifies entity-specific queries. The LLM extracts structured entities from the user query (see ADR-0030), and these are routed to the appropriate query handlers. Produces the most precise results with the highest confidence.

When it works: "Dr. Van den Berg" (direct Doctor entity lookup via WORKS_IN), "Cardiologie afdeling" (Department lookup), "Wie kan me helpen met hartproblemen?" (Condition -> Department lookup via HANDLES relationship), "Biedt ZOL chemotherapie aan?" (Treatment -> Department lookup via OFFERS relationship)

Tier 2: Vector Fallback

When Tier 1 taxonomy queries produce no results, the system falls back to pgvector semantic search. This lets the system still return relevant content for queries that have no graph representation.

When it works: "Wat zijn de bezoekuren?" (What are visiting hours?) -- Visiting hours are not modeled as graph entities but exist in textual content.
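The two-tier control flow can be sketched as a small dispatcher. The backends are passed in as plain callables and stubbed below; `tiered_search`, `fake_taxonomy`, and `fake_vector` are hypothetical names, and the real system routes through the intent classifier and query handlers described above.

```python
from typing import Callable, List, Tuple

Lookup = Callable[[str], List[str]]


def tiered_search(
    query: str, taxonomy_lookup: Lookup, vector_search: Lookup
) -> Tuple[str, List[str]]:
    """Return (tier, results): Tier 1 if the taxonomy has hits, else Tier 2."""
    hits = taxonomy_lookup(query)
    if hits:
        return "taxonomy", hits
    return "vector", vector_search(query)


# Stub backends, for illustration only.
def fake_taxonomy(query: str) -> List[str]:
    known = {"cardiologie afdeling": ["Cardiologie"]}
    return known.get(query.lower(), [])


def fake_vector(query: str) -> List[str]:
    return ["passage about: " + query]
```

Returning the tier alongside the results lets a caller treat taxonomy hits with higher confidence than fallback passages.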

Population Lifecycle: the six-stage pipeline

Layer B (taxonomy_entities) is filled by a six-stage pipeline that is separate from document ingestion, though both begin from the same crawl. The pipeline is a trust gradient: raw LLM extraction is never allowed to reach a user answer without passing through deduplication, human review, and an atomic versioned publish.

#	Stage	What happens	Key columns written	Detail page
1	Scrape golden pages	A `TaxonomyScraperBase` subclass fetches the hospital's authoritative listing pages (doctor directories, department indexes) — the AI-discovered, human-confirmed hub pages. The scraper is the one deeply site-specific component.	`source_url`, `source_site`, `scraped_at`	Golden Pages · Entity Extraction
2	LLM entity extraction	Scraped HTML is run through an LLM to pull typed entities (`MedicalExtractionResult`) and relationships, each scored with `ai_confidence`.	`entity_type`, `canonical_name`, `aliases`, `ai_confidence`	Taxonomy Extraction Pipeline
3	SNOMED matching	Every clinical entity is matched against the SNOMED concept table via the 5-tier matcher (exact → normalized → fuzzy → word-overlap → LLM). This is the anchoring step — it gives each entity a language-neutral concept ID.	`snomed_concept_id`, `snomed_preferred_term`, `snomed_match_confidence`	SNOMED CT Terminology
4	Fuzzy dedup	The same doctor/department scraped from multiple pages collapses into one cluster; losers point at the winner. Winner selection is most-content-wins, never oldest-wins (a 2026-05-12 incident proved oldest-wins discards fully-crawled successors).	`dedup_cluster_id`, `dedup_is_primary`, `merged_into`	Dedup & Gap Fill · Entity Resolution
5	Human review	Entities land as `status='draft'`/`'approved'` with a `draft_version`. The Pipeline Wizard UI lets an operator approve/reject/merge. Nothing here is visible to queries.	`status`, `draft_version`	Pipeline Wizard · Draft/Publish System
6	Versioned publish	An advisory-locked transaction copies approved, primary, latest-draft entities into `published_entities` with a new monotonic `version`, remaps relationships, auto-links orphans, writes a `taxonomy_versions` manifest, and invalidates the registry cache. `rollback(version)` reverses it.	`published_entities.version`, `taxonomy_versions`	Draft/Publish System
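The most-content-wins rule from Stage 4 can be illustrated with a short sketch. How "content" is measured is not specified on this page, so the field-count metric below is an assumption, as are the dictionary shape and the `resolve_cluster` helper; only the column names `dedup_is_primary` and `merged_into` come from the table above.

```python
from typing import Dict, List


def content_score(entity: Dict) -> int:
    """Count populated fields (assumed stand-in for 'most content')."""
    return sum(1 for key, value in entity.items() if key != "id" and value)


def resolve_cluster(cluster: List[Dict]) -> List[Dict]:
    """Mark the richest entity as primary and point the others at it."""
    winner = max(cluster, key=content_score)  # ties: first entity wins
    resolved = []
    for entity in cluster:
        is_primary = entity is winner
        resolved.append(
            {
                **entity,
                "dedup_is_primary": is_primary,
                "merged_into": None if is_primary else winner["id"],
            }
        )
    return resolved
```

Under this rule an older, sparsely filled record loses to a newer, fully crawled one, which is the behaviour the oldest-wins strategy failed to provide.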

The draft→published split is an immutable-snapshot pattern

Queries only ever read published_entities filtered by version = (SELECT MAX(version) …). An in-progress re-extraction or dedup on the draft side can never half-leak into live answers — the published snapshot is atomic and versioned, so a user always sees a coherent generation of the taxonomy. The snomed_concept_id anchor (Stage 3) is carried all the way through publish, which is what lets a second-language tenant reuse the structural graph: see SNOMED language coupling.
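A minimal sketch of the snapshot pattern, using in-memory SQLite in place of PostgreSQL. The two-column table, the unscoped `MAX(version)` subquery, and the `publish` and `live_entities` helpers are simplifying assumptions; the real publish step also takes an advisory lock, remaps relationships, and writes a manifest.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE published_entities (version INTEGER, canonical_name TEXT)")
conn.executemany(
    "INSERT INTO published_entities VALUES (?, ?)",
    [(1, "Cardiologie"), (2, "Cardiologie"), (2, "Orthopedie")],
)
conn.commit()


def live_entities() -> List[str]:
    """Read only the newest published version."""
    rows = conn.execute(
        """
        SELECT canonical_name
        FROM published_entities
        WHERE version = (SELECT MAX(version) FROM published_entities)
        ORDER BY canonical_name
        """
    )
    return [name for (name,) in rows]


def publish(names: List[str]) -> int:
    """Write a complete new version; all rows commit together or not at all."""
    with conn:
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM published_entities"
        ).fetchone()
        conn.executemany(
            "INSERT INTO published_entities VALUES (?, ?)",
            [(latest + 1, name) for name in names],
        )
    return latest + 1
```

Older versions stay in the table untouched, which is what makes a rollback to an earlier version possible in principle.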

Graph Population

The knowledge graph is populated through a Three-Source Knowledge Architecture that cleanly separates automated knowledge derivation from human curation:

Source 1 (web scraper): scrapes authoritative hub pages from ZOL's website
Source 2 (SNOMED CT): enriches entities with concept IDs, Dutch synonyms, and ontology-derived relationships
Source 3 (curated configuration): provides irreducible human judgment, namely plausibility guards, negative maps, and organisational mappings

The GoldenPageSeeder merges all three sources into PostgreSQL taxonomy tables, with curated data taking priority over scraped data, which takes priority over SNOMED-derived data. During regular document ingestion, pages go through regex extraction and LLM validation to generate page summaries for contextual retrieval, but entity data is written to doc_metadata only, not to the taxonomy.

See Medical Knowledge Architecture for the Three-Source Architecture design, Seeding Pipeline for the seeding methodology, and Graph-Enhanced RAG for how graph data enhances the RAG pipeline at query time.
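The priority order (curated over scraped over SNOMED-derived) can be sketched as a field-level merge. Whether the GoldenPageSeeder merges per field or per record is not stated here, so the per-field behaviour, the `merge_sources` helper, and the sample values (including the placeholder concept ID) are assumptions.

```python
from typing import Dict

PRIORITY = ("curated", "scraped", "snomed")  # highest priority first


def merge_sources(records_by_source: Dict[str, Dict[str, Dict]]) -> Dict[str, Dict]:
    """Merge per-entity fields so higher-priority sources win conflicts."""
    merged: Dict[str, Dict] = {}
    # Apply the lowest priority first; later updates overwrite earlier ones.
    for source in reversed(PRIORITY):
        for name, fields in records_by_source.get(source, {}).items():
            present = {k: v for k, v in fields.items() if v is not None}
            merged.setdefault(name, {}).update(present)
    return merged
```

Fields that only a lower-priority source knows about, such as a concept ID from SNOMED CT, survive the merge; only conflicting fields are overridden.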

Taxonomy-Driven Normalization

Between regex extraction and LLM validation, all entities pass through a taxonomy normalization layer powered by a centralized domain taxonomy module (zol_taxonomy.py). This module serves as the single source of truth for all hospital domain knowledge, ensuring consistent entity naming and classification across the entire pipeline.

The taxonomy enforces a strict hospital hierarchy: a single Hospital node (ZOL) connects via HAS_CAMPUS relationships to exactly four Campus nodes (Sint-Jan, André Dumont, Sint-Barbara, Maas en Kempen), which in turn connect to Department nodes via LOCATED_AT relationships. This hierarchy prevents phantom campuses and ensures every department is anchored to a real physical location.

Key normalization capabilities:

Department alias resolution: Over 55 department definitions with alias maps resolve duplicates (e.g., "Intensieve Zorgen" and "Intensive Care" map to a single canonical node)
Campus normalization: Exactly 4 campuses with alias maps; prevents the creation of spurious 5th campus nodes from variant spellings
Doctor name cleanup: Role tokens like "Hoofdverpleegkundige" and "Diensthoofd" are stripped to prevent job titles from becoming doctor names
Entity type overrides: Ambiguous entities are classified correctly (e.g., "Radiotherapie" is recognized as a department, "dialyse" as a treatment)
Dual-entity model: Entities like "Radiotherapie" that function as both a department and a treatment generate two separate typed nodes with appropriate relationships
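Two of the capabilities above, doctor name cleanup and campus normalization, can be sketched as follows. The role tokens come from the text, but the alias entries, the regular expression, and both helper names are assumptions and do not reflect the actual contents of zol_taxonomy.py.

```python
import re
from typing import Optional

# Role tokens named in the text; the real list is presumably longer.
ROLE_TOKENS = ("Hoofdverpleegkundige", "Diensthoofd")

# Hypothetical alias map: lower-cased variant -> canonical campus name.
CAMPUS_ALIASES = {
    "sint-jan": "Sint-Jan",
    "st-jan": "Sint-Jan",
    "campus sint-jan": "Sint-Jan",
}

_ROLE_RE = re.compile(
    r"\b(?:%s)\b" % "|".join(re.escape(token) for token in ROLE_TOKENS),
    re.IGNORECASE,
)


def clean_doctor_name(raw: str) -> str:
    """Strip role tokens, then tidy whitespace and stray separators."""
    without_roles = _ROLE_RE.sub(" ", raw)
    return " ".join(without_roles.split()).strip(" ,-")


def normalize_campus(raw: str) -> Optional[str]:
    """Map a variant spelling to a known campus, or None if unknown."""
    return CAMPUS_ALIASES.get(" ".join(raw.lower().split()))
```

Returning `None` for an unrecognized campus, rather than creating a new node, is what keeps variant spellings from turning into a spurious fifth campus.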

See Entity Extraction - Taxonomy-Driven Normalization for implementation details and Medical Knowledge Architecture for the three-layer design separating universal medical knowledge from hospital-specific data.

Theoretical Foundation

The ZOL knowledge graph draws on established concepts from graph theory, medical informatics, and ontology engineering:

Property Graph Model: The labeled property graph formalism (Rodriguez & Neubauer, 2010) provides the theoretical foundation for the entity storage model, where both entities and relationships carry typed properties. The system implements this model using PostgreSQL taxonomy tables.
SNOMED CT Integration: SNOMED CT Belgian Edition (356K concepts, 656K Dutch descriptions) provides standards-based medical terminology at both query time (synonym expansion) and seeding time (concept IDs, IS_A hierarchy, FINDING_SITE routing). Published studies report that ontology-enhanced knowledge graphs improve retrieval accuracy by 22–40% (Jimeno-Yepes et al., 2012; Soman et al., 2024).
Three-Source Architecture: The system separates knowledge inputs by provenance — automated scraping (Source 1), medical ontology (Source 2: SNOMED CT), and curated configuration (Source 3) — enabling independent verification, auditability, and portability across hospital deployments. See Medical Knowledge Architecture for the full design.
Knowledge Graph Completion: The tiered query strategy (taxonomy queries → vector fallback) addresses the open-world assumption in knowledge graphs — the absence of a relationship does not mean it does not exist.

References

Ernst, P., Siu, A., & Weikum, G. (2015). KnowLife: A versatile approach for constructing a large knowledge graph for biomedical sciences. BMC Bioinformatics, 16, 157. https://doi.org/10.1186/s12859-015-0549-5
Hogan, A., et al. (2021). Knowledge graphs. ACM Computing Surveys, 54(4), 1--37. https://doi.org/10.1145/3447772
Jimeno-Yepes, A., Berlanga, R., & Rebholz-Schuhmann, D. (2012). Ontology-based query expansion for biomedical information retrieval. BMC Bioinformatics, 13(S14). https://doi.org/10.1186/1471-2105-13-S14-S1
Robinson, I., Webber, J., & Eifrem, E. (2015). Graph Databases: New Opportunities for Connected Data (2nd ed.). O'Reilly Media.
Rodriguez, M. A., & Neubauer, P. (2010). Constructions from dots and lines. Bulletin of the American Society for Information Science and Technology, 36(6), 35--41. https://doi.org/10.1002/bult.2010.1720360610
SNOMED International. (2024). SNOMED CT Starter Guide. https://www.snomed.org/
Soman, K., et al. (2024). OntologyRAG: Ontology-enhanced retrieval-augmented generation. arXiv preprint, arXiv:2412.09050.
