Knowledge Graph Architecture
While vector search excels at finding semantically similar text, it fundamentally operates on unstructured content. The ZOL hospital domain, however, contains rich structured relationships: doctors work in departments, departments are located at campuses, and departments treat conditions. Capturing and querying these relationships requires a knowledge graph.
The taxonomy (ZOL's structured store of hospital entities and their typed relationships — see the Glossary) is the structured-knowledge member of a three-subsystem set. It is anchored and rescued by SNOMED CT (the ontology that stamps entities with concept IDs and expands query vocabulary) and its query-time results are arbitrated by the Value Framework (the intent × category reranker). See the Core Concepts flow for how all three compose end-to-end. This page covers the data model; the population lifecycle is below, and query-time use is in Taxonomy Query Enrichment.
Two layers of "taxonomy"
The word "taxonomy" names two distinct layers in this system, and separating them is the key to understanding everything below. Both are hospital-agnostic; both carry SNOMED anchors; but they differ in provenance and trust.
| Layer | What it is | Source | Where it lives | Trust |
|---|---|---|---|---|
| A. Config taxonomy (HospitalTaxonomy) | Curated alias maps, department/campus definitions, condition→department routing, plausibility guards | HospitalConfig (YAML or DB) + universal Dutch medical knowledge | In-memory object, built per-tenant | Hand-curated; trusted directly |
| B. Scraped taxonomy (taxonomy_entities → published_entities) | Entities + typed relationships harvested from the hospital website | Crawl → LLM extraction → dedup → human review → publish | PostgreSQL tables in the app schema | Machine-extracted; gated behind human review + versioned publish |
The mental model: Layer A is the blueprint (the known org structure and routing rules), while Layer B is the harvested content (the actual doctors, departments, and relationships scraped from the live site). Layer A changes rarely, so it is safe to hold in memory and trust directly; Layer B is rebuilt as the website changes and therefore passes through a draft→approve→publish gate before any query can see it. The remainder of this page is about Layer B's data model; the population lifecycle shows how it is filled.
Layer A supplies the rules (how to normalize "Intensieve Zorgen", which department handles "hartfalen"); Layer B supplies the facts (which actual doctors exist and where they work). At query time both are consulted together — Layer A's alias maps resolve the user's words, Layer B's published snapshot supplies the ground-truth entities.
What Is a Knowledge Graph?
A knowledge graph, as used here, is a property graph: nodes represent entities (doctors, departments, campuses), edges represent relationships (WORKS_IN, LOCATED_AT, HANDLES), and properties store attributes (names, schedules). This model suits the hospital domain, where relationships between entities are as important as the entities themselves.
Why a Knowledge Graph for Hospital Data?
Hospital information is inherently relational. Consider the question: "Welke dokter doet knieoperaties op campus Sint-Jan?" (Which doctor performs knee operations at campus Sint-Jan?). Answering this requires traversing multiple relationships:
- Find the Treatment node "Knee Operation"
- Find the Department that OFFERS this treatment
- Find the Doctors with a WORKS_IN relationship to this department
- Keep only results whose Department is LOCATED_AT campus Sint-Jan
No single text passage in the hospital's content explicitly states "Dr. X performs knee operations at Sint-Jan." This information is implicit in the relationships between separate pieces of content. A knowledge graph makes these implicit relationships explicit and queryable.
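The traversal above can be sketched over a small in-memory edge list. This is a minimal illustration, not the production query path: the sample edges, the doctor and treatment names, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample edges as (source, relationship, target) triples.
# The names are illustrative and not taken from the real ZOL taxonomy.
EDGES = [
    ("Orthopedie", "OFFERS", "Knieoperatie"),
    ("Orthopedie", "LOCATED_AT", "Sint-Jan"),
    ("Cardiologie", "LOCATED_AT", "Sint-Jan"),
    ("Dr. Janssens", "WORKS_IN", "Orthopedie"),
    ("Dr. Peeters", "WORKS_IN", "Cardiologie"),
]


def build_index(edges):
    """Index sources by (relationship, target) so each hop is a dict lookup."""
    by_target = defaultdict(set)
    for source, rel, target in edges:
        by_target[(rel, target)].add(source)
    return by_target


def doctors_for_treatment_at_campus(edges, treatment, campus):
    idx = build_index(edges)
    # Departments that OFFER the treatment.
    offering = idx[("OFFERS", treatment)]
    # Keep only the departments LOCATED_AT the requested campus.
    at_campus = offering & idx[("LOCATED_AT", campus)]
    # Collect doctors with a WORKS_IN edge to any remaining department.
    doctors = set()
    for dept in at_campus:
        doctors |= idx[("WORKS_IN", dept)]
    return sorted(doctors)
```

No single edge states which doctor performs the treatment at the campus; the answer only appears once the three relationship types are combined.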
The ZOL Domain Model
Entity Types
| Entity | Description | Example Count |
|---|---|---|
| Hospital | Top-level organization (ZOL) | 1 |
| Doctor | Medical professionals at ZOL (specialty stored as node property) | ~300+ |
| Department | Hospital departments and services | ~50 |
| Campus | Physical hospital locations | 4 |
| Treatment | Medical procedures and therapies | ~200 |
| Condition | Diseases, disorders, symptoms | ~700 |
| Examination | Diagnostic procedures | ~100 |
| Center | Multi-disciplinary centers (e.g., Borstcentrum, Slaapcentrum) | ~15 |
| Facility | Physical amenities (parking, cafeteria, charging stations) | ~20 |
| Service | Hospital services (visiting hours, appointments, general services) | ~30 |
| SourcePage | Provenance tracking — the web page from which entities were extracted. Note: SourcePage is a provenance concept used in relationship metadata (CURATED_FROM, MENTIONED_IN, SOURCED_FROM); it is not stored as a first-class entity type in taxonomy_entities. | ~600 |
Relationship Types
| Relationship | From | To | Properties | Example |
|---|---|---|---|---|
| HAS_CAMPUS | Hospital | Campus | confidence | ZOL has campus Sint-Jan |
| BELONGS_TO | Department | Hospital | confidence | Cardiology belongs to ZOL |
| WORKS_IN | Doctor | Department | role, schedule, status, contacts, confidence | Dr. Peeters works in Cardiology (Mon/Wed) |
| WORKS_AT_CAMPUS | Doctor | Campus | confidence | Dr. Peeters works at campus Sint-Jan (auto-derived from WORKS_IN + LOCATED_AT) |
| LOCATED_AT | Department/Center | Campus | building, floor, confidence | Cardiology is at Sint-Jan, Building A, Floor 2 |
| OFFERS | Department | Treatment | confidence | Cardiology offers Cardiac Catheterization |
| HANDLES | Department | Condition | confidence | Cardiology handles Heart Failure |
| PERFORMS | Department | Examination | confidence | Radiology performs MRI Scan |
| TREATS | Treatment | Condition | confidence | Chemotherapy treats Breast Cancer |
| DIAGNOSES | Examination | Condition | confidence | Echocardiography diagnoses Heart Failure |
| RELATED_TO | Treatment | Condition | confidence | Radiotherapy is related to Cancer |
| USED_FOR | Examination | Condition | confidence | CT Scan is used for Pulmonary Embolism |
| HAS_FACILITY | Campus | Facility | confidence | Sint-Jan has Parking Garage |
| PROVIDES | Campus | Service | confidence | Sint-Jan provides Visiting Hours Service |
| AVAILABLE_AT | Service | Campus | confidence | Appointments service available at Sint-Jan |
| IS_A | Condition/Treatment | Condition/Treatment | source, confidence | Diabetes Type 1 IS_A Diabetes Mellitus (SNOMED CT hierarchy, Phase B) |
| CURATED_FROM | Entity | SourcePage | -- | Entity extracted from hub page source |
| MENTIONED_IN | Entity | SourcePage | -- | Entity mentioned in crawled page |
| SOURCED_FROM | Entity | SourcePage | -- | General provenance link to the source page |
WORKS_IN relationships can carry optional schedule properties (consultation days, status, contacts) that link a doctor's consultation schedule to their department rather than directly to a campus.
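The table notes that WORKS_AT_CAMPUS is auto-derived from WORKS_IN + LOCATED_AT. One way such a derivation could work is sketched below. The `Relationship` dataclass, the function name, and the rule of taking the lower of the two input confidences are assumptions made for illustration; the page does not say how the real derivation scores its output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """Hypothetical in-memory stand-in for one relationship row."""
    source: str
    rel_type: str
    target: str
    confidence: float = 1.0


def derive_works_at_campus(relationships):
    """Join WORKS_IN (doctor -> department) with LOCATED_AT (department -> campus)."""
    locations_by_dept = {}
    for r in relationships:
        if r.rel_type == "LOCATED_AT":
            locations_by_dept.setdefault(r.source, []).append(r)

    derived = {}
    for r in relationships:
        if r.rel_type != "WORKS_IN":
            continue
        for loc in locations_by_dept.get(r.target, []):
            # Assumption: a derived edge is only as reliable as its weakest input.
            conf = min(r.confidence, loc.confidence)
            key = (r.source, loc.target)
            if key not in derived or conf > derived[key].confidence:
                derived[key] = Relationship(r.source, "WORKS_AT_CAMPUS", loc.target, conf)
    return sorted(derived.values(), key=lambda rel: (rel.source, rel.target))
```

A doctor who works in a department located at two campuses would receive two derived edges under this sketch, one per campus.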
PostgreSQL as the Entity Store
Entity relationships are stored in PostgreSQL taxonomy tables (taxonomy_entities and taxonomy_relationships), which replaced Neo4j as the primary entity store. PostgreSQL was selected for:
- Unified infrastructure: Entities, relationships, vectors (pgvector), and SNOMED CT data all reside in a single database
- SQL query language: Well-understood, performant, and directly integrated with the application's SQLAlchemy ORM
- Relational integrity: Foreign keys and constraints ensure data consistency across entity types
- Simplified operations: No separate graph database to deploy, monitor, and maintain
Neo4j was fully removed in March 2026. The HospitalTaxonomy class provides the same query API surface as before, with SQL queries replacing Cypher.
taxonomy_relationships Schema
The taxonomy_relationships table includes entity-type columns that enable efficient type-filtered queries without joining back to taxonomy_entities:
| Column | Type | Description |
|---|---|---|
| source_type | VARCHAR(30) | Entity type of the source (e.g., DOCTOR, DEPARTMENT) |
| target_type | VARCHAR(30) | Entity type of the target (e.g., DEPARTMENT, CONDITION) |
Unique constraint: (hospital_id, source_name, target_name, relationship_type, draft_version) — prevents duplicate relationships per draft version.
The draft_version column is managed by the Draft/Publish System (SP-5). Relationships belong to a versioned snapshot, enabling rollback and impact preview before operators promote a new taxonomy version to production.
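The effect of the unique constraint and the entity-type columns can be tried out with a trimmed-down table in SQLite. This is a sketch under stated assumptions: only the two VARCHAR(30) columns and the unique key come from this page, while the other column types, the `zol` hospital ID, and the helper names are guesses made for the example.

```python
import sqlite3

# Assumption: a reduced table. Only source_type/target_type and the unique
# key are documented above; every other column type here is a guess.
DDL = """
CREATE TABLE taxonomy_relationships (
    hospital_id       TEXT NOT NULL,
    source_name       TEXT NOT NULL,
    source_type       VARCHAR(30),
    target_name       TEXT NOT NULL,
    target_type       VARCHAR(30),
    relationship_type TEXT NOT NULL,
    draft_version     INTEGER NOT NULL,
    UNIQUE (hospital_id, source_name, target_name, relationship_type, draft_version)
)
"""


def insert_relationship(conn, row):
    """Insert one row; return False if the unique key rejects a duplicate."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO taxonomy_relationships VALUES (?, ?, ?, ?, ?, ?, ?)", row
            )
        return True
    except sqlite3.IntegrityError:
        return False


def sources_by_type(conn, source_type, target_type):
    """Type-filtered lookup that never joins back to taxonomy_entities."""
    rows = conn.execute(
        "SELECT DISTINCT source_name FROM taxonomy_relationships "
        "WHERE source_type = ? AND target_type = ? ORDER BY source_name",
        (source_type, target_type),
    ).fetchall()
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
row_v1 = ("zol", "Dr. Peeters", "DOCTOR", "Cardiologie", "DEPARTMENT", "WORKS_IN", 1)
row_v2 = row_v1[:-1] + (2,)  # same relationship, next draft version
```

Inserting `row_v1` twice is rejected, while `row_v2` is accepted because `draft_version` is part of the key. That is what allows the same relationship to exist in several draft snapshots at once.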
Query Examples
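A query like "Find all doctors in the Orthopedics department at Sint-Jan" joins the taxonomy tables on WORKS_IN for the department and on LOCATED_AT for the campus:
```sql
-- Doctors who work in Orthopedie, restricted to the department located at Sint-Jan.
SELECT DISTINCT te.canonical_name, te.metadata->>'specialty' AS specialty
FROM app.taxonomy_entities te
JOIN app.taxonomy_relationships dept
  ON dept.source_name = te.canonical_name
 AND dept.relationship_type = 'WORKS_IN'
JOIN app.taxonomy_relationships loc
  ON loc.source_name = dept.target_name
 AND loc.relationship_type = 'LOCATED_AT'
WHERE te.entity_type = 'doctor'
  AND dept.target_name = 'Orthopedie'
  AND loc.target_name = 'Sint-Jan';
```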
Queries like this typically execute in milliseconds and return precise structured results that vector search alone cannot retrieve reliably.
Tiered Query Strategy
Not all entity queries are straightforward taxonomy lookups. The system employs a two-tier strategy to maximize both precision and recall:
Tier 1: Taxonomy Queries
Direct SQL queries against taxonomy entities (Doctor, Department, Campus, Condition, Treatment, Examination). Used when the intent classifier identifies entity-specific queries. The LLM extracts structured entities from the user query (see ADR-0030), and these are routed to the appropriate query handlers. Produces the most precise results with the highest confidence.
When it works: "Dr. Van den Berg" (direct Doctor entity lookup via WORKS_IN), "Cardiologie afdeling" (Department lookup), "Wie kan me helpen met hartproblemen?" (Condition -> Department lookup via HANDLES relationship), "Biedt ZOL chemotherapie aan?" (Treatment -> Department lookup via OFFERS relationship)
Tier 2: Vector Fallback
When taxonomy queries produce no results, the system falls back to pgvector semantic search, so a query with no graph representation can still be answered from textual content.
When it works: "Wat zijn de bezoekuren?" (What are visiting hours?). Visiting hours are not modeled as graph entities but exist in textual content.
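The control flow of the two tiers can be sketched as a function that takes both lookups as parameters. The `Answer` type, `tiered_lookup`, and the two stand-in lookups are hypothetical; the real system routes LLM-extracted entities to query handlers and searches pgvector, neither of which is modeled here.

```python
from typing import Callable, NamedTuple


class Answer(NamedTuple):
    tier: str      # "taxonomy" or "vector"
    results: list


def tiered_lookup(
    query: str,
    taxonomy_lookup: Callable[[str], list],
    vector_search: Callable[[str], list],
) -> Answer:
    """Try the precise taxonomy tier first; fall back to vector search when it is empty."""
    hits = taxonomy_lookup(query)
    if hits:
        return Answer("taxonomy", hits)
    return Answer("vector", vector_search(query))


# Hypothetical stand-ins so the sketch runs without a database.
def fake_taxonomy(query):
    known = {"cardiologie afdeling": ["Cardiologie"]}
    return known.get(query.lower(), [])


def fake_vector(query):
    return [f"text chunk about: {query}"]
```

Passing the lookups in as callables keeps the fallback rule separate from the storage details, which makes the rule easy to test on its own.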
Population Lifecycle: the six-stage pipeline
Layer B (taxonomy_entities) is filled by a six-stage pipeline that is separate from document ingestion, though both begin from the same crawl. The pipeline is a trust gradient: raw LLM extraction is never allowed to reach a user answer without passing through deduplication, human review, and an atomic versioned publish.
| # | Stage | What happens | Key columns written | Detail page |
|---|---|---|---|---|
| 1 | Scrape golden pages | A TaxonomyScraperBase subclass fetches the hospital's authoritative listing pages (doctor directories, department indexes) — the AI-discovered, human-confirmed hub pages. The scraper is the one deeply site-specific component. | source_url, source_site, scraped_at | Golden Pages · Entity Extraction |
| 2 | LLM entity extraction | Scraped HTML is run through an LLM to pull typed entities (MedicalExtractionResult) and relationships, each scored with ai_confidence. | entity_type, canonical_name, aliases, ai_confidence | Taxonomy Extraction Pipeline |
| 3 | SNOMED matching | Every clinical entity is matched against the SNOMED concept table via the 5-tier matcher (exact → normalized → fuzzy → word-overlap → LLM). This is the anchoring step — it gives each entity a language-neutral concept ID. | snomed_concept_id, snomed_preferred_term, snomed_match_confidence | SNOMED CT Terminology |
| 4 | Fuzzy dedup | The same doctor/department scraped from multiple pages collapses into one cluster; losers point at the winner. Winner selection is most-content-wins, never oldest-wins (a 2026-05-12 incident proved oldest-wins discards fully-crawled successors). | dedup_cluster_id, dedup_is_primary, merged_into | Dedup & Gap Fill · Entity Resolution |
| 5 | Human review | Entities land as status='draft'/'approved' with a draft_version. The Pipeline Wizard UI lets an operator approve/reject/merge. Nothing here is visible to queries. | status, draft_version | Pipeline Wizard · Draft/Publish System |
| 6 | Versioned publish | An advisory-locked transaction copies approved, primary, latest-draft entities into published_entities with a new monotonic version, remaps relationships, auto-links orphans, writes a taxonomy_versions manifest, and invalidates the registry cache. rollback(version) reverses it. | published_entities.version, taxonomy_versions | Draft/Publish System |
Queries only ever read published_entities filtered by version = (SELECT MAX(version) …). An in-progress re-extraction or dedup on the draft side can never half-leak into live answers — the published snapshot is atomic and versioned, so a user always sees a coherent generation of the taxonomy. The snomed_concept_id anchor (Stage 3) is carried all the way through publish, which is what lets a second-language tenant reuse the structural graph: see SNOMED language coupling.
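The read rule above, where live queries see only the rows at the highest published version, can be illustrated with SQLite. This is a simplified sketch: the three-column table, the `publish` and `read_live` helpers, and the sample entities are assumptions, and the advisory lock, relationship remapping, orphan linking, and version manifest from Stage 6 are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE published_entities "
    "(version INTEGER, canonical_name TEXT, entity_type TEXT)"
)


def publish(conn, entities):
    """Write a batch of (name, type) pairs under the next monotonic version."""
    (current,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM published_entities"
    ).fetchone()
    new_version = current + 1
    with conn:  # the inserts commit together or not at all
        conn.executemany(
            "INSERT INTO published_entities VALUES (?, ?, ?)",
            [(new_version, name, etype) for name, etype in entities],
        )
    return new_version


def read_live(conn):
    """Return entity names from the latest published version only."""
    rows = conn.execute(
        "SELECT canonical_name FROM published_entities "
        "WHERE version = (SELECT MAX(version) FROM published_entities) "
        "ORDER BY canonical_name"
    ).fetchall()
    return [name for (name,) in rows]
```

Older versions stay in the table under this sketch, so a rollback could be modeled as choosing an earlier version number to read from.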
Graph Population
The knowledge graph is populated through a Three-Source Knowledge Architecture that cleanly separates automated knowledge derivation from human curation:
- Source 1 (web scraper) scrapes authoritative hub pages from ZOL's website.
- Source 2 (SNOMED CT) enriches entities with concept IDs, Dutch synonyms, and ontology-derived relationships.
- Source 3 (curated configuration) provides irreducible human judgment: plausibility guards, negative maps, and organisational mappings.
The GoldenPageSeeder merges all three sources into PostgreSQL taxonomy tables, with curated data taking priority over scraped data, which takes priority over SNOMED-derived data. During regular document ingestion, pages go through regex extraction and LLM validation to generate page summaries for contextual retrieval, but entity data is written to doc_metadata only, not to the taxonomy.
See Medical Knowledge Architecture for the Three-Source Architecture design, Seeding Pipeline for the seeding methodology, and Graph-Enhanced RAG for how graph data enhances the RAG pipeline at query time.
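The stated priority order (curated over scraped over SNOMED-derived) can be sketched as an attribute-level merge. This is an illustration only: the record shape, the source labels, and merging per attribute are assumptions, and the actual GoldenPageSeeder may resolve conflicts at a different granularity.

```python
# Lower number = higher priority, following the order stated in the text.
PRIORITY = {"curated": 0, "scraped": 1, "snomed": 2}


def merge_sources(records):
    """Merge (source, entity_name, attributes) records.

    For each attribute, the value from the highest-priority source wins.
    """
    merged = {}
    # Apply the lowest-priority source first so later updates overwrite it.
    ordered = sorted(records, key=lambda rec: PRIORITY[rec[0]], reverse=True)
    for _source, name, attrs in ordered:
        merged.setdefault(name, {}).update(attrs)
    return merged
```

Attributes that only one source provides, such as a SNOMED concept ID, survive the merge untouched; only conflicting attributes are decided by priority.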
Taxonomy-Driven Normalization
Between regex extraction and LLM validation, all entities pass through a taxonomy normalization layer powered by a centralized domain taxonomy module (zol_taxonomy.py). This module serves as the single source of truth for all hospital domain knowledge, ensuring consistent entity naming and classification across the entire pipeline.
The taxonomy enforces a strict hospital hierarchy: a single Hospital node (ZOL) connects via HAS_CAMPUS relationships to exactly four Campus nodes (Sint-Jan, André Dumont, Sint-Barbara, Maas en Kempen), and Department nodes attach to those campuses via LOCATED_AT relationships. This hierarchy prevents phantom campuses and ensures every department is anchored to a real physical location.
Key normalization capabilities:
- Department alias resolution: Over 55 department definitions with alias maps resolve duplicates (e.g., "Intensieve Zorgen" and "Intensive Care" map to a single canonical node)
- Campus normalization: Exactly 4 campuses with alias maps; prevents the creation of spurious 5th campus nodes from variant spellings
- Doctor name cleanup: Role tokens like "Hoofdverpleegkundige" and "Diensthoofd" are stripped to prevent job titles from becoming doctor names
- Entity type overrides: Ambiguous entities are classified correctly (e.g., "Radiotherapie" is recognized as a department, "dialyse" as a treatment)
- Dual-entity model: Entities like "Radiotherapie" that function as both a department and a treatment generate two separate typed nodes with appropriate relationships
See Entity Extraction - Taxonomy-Driven Normalization for implementation details and Medical Knowledge Architecture for the three-layer design separating universal medical knowledge from hospital-specific data.
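Two of the normalization capabilities listed above, department alias resolution and doctor name cleanup, can be sketched as follows. The alias map, the choice of "Intensieve Zorgen" as the canonical name, and the function names are assumptions for illustration; the real definitions are described as living in zol_taxonomy.py and are not reproduced here.

```python
import re

# Hypothetical alias map; which spelling is canonical is an assumption.
DEPARTMENT_ALIASES = {
    "intensieve zorgen": "Intensieve Zorgen",
    "intensive care": "Intensieve Zorgen",
}

# Role tokens named in the text above.
ROLE_TOKENS = ("Hoofdverpleegkundige", "Diensthoofd")
ROLE_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, ROLE_TOKENS)) + r")\b",
    flags=re.IGNORECASE,
)


def normalize_department(raw):
    """Return the canonical department name, or None when no alias matches."""
    key = " ".join(raw.split()).lower()  # collapse whitespace, ignore case
    return DEPARTMENT_ALIASES.get(key)


def clean_doctor_name(raw):
    """Strip role tokens so a job title does not end up inside a doctor name."""
    stripped = ROLE_PATTERN.sub("", raw)
    # Collapse leftover whitespace and trim dangling separators.
    return " ".join(stripped.split()).strip(" ,;-")
```

Returning None for an unknown department leaves the caller to decide whether to reject the entity or send it on for review, instead of silently creating a new node.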
Theoretical Foundation
The ZOL knowledge graph draws on established concepts from graph theory, medical informatics, and ontology engineering:
- Property Graph Model: The labeled property graph formalism (Rodriguez & Neubauer, 2010) provides the theoretical foundation for the entity storage model, where both entities and relationships carry typed properties. The system implements this model using PostgreSQL taxonomy tables.
- SNOMED CT Integration: SNOMED CT Belgian Edition (356K concepts, 656K Dutch descriptions) provides standards-based medical terminology at both query time (synonym expansion) and seeding time (concept IDs, IS_A hierarchy, FINDING_SITE routing). The cited research reports that ontology-enhanced knowledge graphs improve retrieval accuracy by 22–40% (Jimeno-Yepes et al., 2012; Soman et al., 2024).
- Three-Source Architecture: The system separates knowledge inputs by provenance — automated scraping (Source 1), medical ontology (Source 2: SNOMED CT), and curated configuration (Source 3) — enabling independent verification, auditability, and portability across hospital deployments. See Medical Knowledge Architecture for the full design.
- Knowledge Graph Completion: The tiered query strategy (taxonomy queries → vector fallback) addresses the open-world assumption in knowledge graphs — the absence of a relationship does not mean it does not exist.
References
- Ernst, P., Siu, A., & Weikum, G. (2015). KnowLife: A versatile approach for constructing a large knowledge graph for biomedical sciences. BMC Bioinformatics, 16, 157. https://doi.org/10.1186/s12859-015-0549-5
- Hogan, A., et al. (2021). Knowledge graphs. ACM Computing Surveys, 54(4), 1–37. https://doi.org/10.1145/3447772
- Jimeno-Yepes, A., Berlanga, R., & Rebholz-Schuhmann, D. (2012). Ontology-based query expansion for biomedical information retrieval. BMC Bioinformatics, 13(S14). https://doi.org/10.1186/1471-2105-13-S14-S1
- Robinson, I., Webber, J., & Eifrem, E. (2015). Graph Databases: New Opportunities for Connected Data (2nd ed.). O'Reilly Media.
- Rodriguez, M. A., & Neubauer, P. (2010). Constructions from dots and lines. Bulletin of the American Society for Information Science and Technology, 36(6), 35–41. https://doi.org/10.1002/bult.2010.1720360610
- SNOMED International. (2024). SNOMED CT Starter Guide. https://www.snomed.org/
- Soman, K., et al. (2024). OntologyRAG: Ontology-enhanced retrieval-augmented generation. arXiv preprint, arXiv:2412.09050.