
Graph Seeding Pipeline

The seeding pipeline populates the PostgreSQL taxonomy tables from three knowledge sources: automated web scraping (Source 1), SNOMED CT medical ontology (Source 2), and curated configuration (Source 3). It is a multi-phase process: scrape (HTML to frozen dataclasses with campus inference), enrich (LLM-generated universal medical knowledge), seed (merge all sources into PostgreSQL taxonomy tables), and SNOMED enrichment (ontology-derived concept IDs, synonyms, and relationships).

Two Paths to Entity Storage

New to the term "golden page"? Read Golden Pages first — it defines what a golden (hub) page is, how the app.golden_pages table and the AI-discover/human-confirm lifecycle work, and why top-down seeding exists. This page assumes that concept and focuses on the pipeline mechanics.

The system has two distinct paths for writing entities to the taxonomy:

| Path | When | What | Authority |
| --- | --- | --- | --- |
| Hub page seeding | Manual CLI or API trigger | Full taxonomy from hub pages + domain knowledge + LLM enrichment | Highest: defines the authoritative entity set |
| Content ingestion | Per-page during crawl | Entity extraction from individual pages | Lower: gated by graph_golden_only flag |

When graph_golden_only=True (the default), content ingestion does not write to the taxonomy. Only the hub page seeding path populates the entity store. This prevents unvalidated relationships from leaking into the taxonomy from crawled pages.

CLI Walkthrough

The pipeline runs as three sequential scripts:

Phase 1: Scrape Taxonomy

python -m scripts.scrape_taxonomy [--dry-run] [--tenant-slug zol]

Fetches HTML from hub pages auto-discovered by the LLM classifier (previously defined in hospital_config/zol.yaml) and parses doctors, departments, conditions, treatments, examinations, and consultation schedules. The results are saved to PostgreSQL taxonomy tables using a full-replacement strategy (DELETE existing + INSERT new).

Note: Hub page URLs are no longer manually configured in YAML. The system auto-detects hub pages from crawled content using a binary hub/detail LLM classifier.
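The full-replacement strategy mentioned above can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table layout (tenant_slug, entity_type, canonical_name) and the function name replace_taxonomy are assumptions made for illustration, not the scraper's actual schema or code. The point it shows is that the DELETE and the INSERT run in one transaction, so a failed scrape cannot leave the table half-empty.

```python
import sqlite3


def replace_taxonomy(conn, tenant_slug, entities):
    """Replace all rows for one tenant atomically (DELETE existing + INSERT new)."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "DELETE FROM taxonomy_entities WHERE tenant_slug = ?",
            (tenant_slug,),
        )
        conn.executemany(
            "INSERT INTO taxonomy_entities (tenant_slug, entity_type, canonical_name) "
            "VALUES (?, ?, ?)",
            [(tenant_slug, entity_type, name) for entity_type, name in entities],
        )


# Hypothetical schema, used only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxonomy_entities "
    "(tenant_slug TEXT, entity_type TEXT, canonical_name TEXT)"
)

# First scrape stores two entities; the second scrape replaces them entirely.
replace_taxonomy(conn, "zol", [("department", "Cardiologie"), ("condition", "diabetes mellitus")])
replace_taxonomy(conn, "zol", [("department", "Cardiologie")])

rows = conn.execute(
    "SELECT entity_type, canonical_name FROM taxonomy_entities"
).fetchall()
```

After the second call only the rows from the latest scrape remain, which is the behaviour a full replacement is meant to guarantee.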

Phase 1.5: LLM Enrichment (One-Time)

python -m scripts.enrich_taxonomy_llm [--dry-run] [--output-dir OUTPUT_DIR]

Loads the taxonomy from PostgreSQL and calls a Tier 3 reasoning model to classify relationships between entities. Produces 5 Python modules in medical_knowledge/:

| Module | Relationship | Description |
| --- | --- | --- |
| department_conditions.py | HANDLES | Which departments handle which conditions |
| department_treatments.py | OFFERS | Which departments offer which treatments |
| department_examinations.py | PERFORMS | Which departments perform which examinations |
| condition_examinations.py | DIAGNOSES | Which examinations diagnose which conditions |
| treatment_conditions.py | TREATS | Which treatments treat which conditions |

The LLM prompt uses standard Belgian hospital department names (from belgian_hospital_departments.py) rather than hospital-specific names, ensuring the output is portable across hospitals. Resolution to hospital-specific names happens at seeding time.

This step runs once and the output is committed to version control for human review.
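To make the "portable names, resolved at seeding time" idea concrete, here is a minimal sketch. The shape of the generated modules is not shown on this page, so the dictionary layout, the name mapping, and the function resolve_department_names are all hypothetical; the sketch only illustrates how knowledge keyed by standard department names could be re-keyed to one hospital's own names.

```python
def resolve_department_names(universal_map, standard_to_local):
    """Re-key a {standard department: [values]} map to hospital-specific names.

    Departments the hospital does not have (no entry in standard_to_local)
    are skipped rather than guessed.
    """
    resolved = {}
    for standard_name, values in universal_map.items():
        local_name = standard_to_local.get(standard_name.lower())
        if local_name is None:
            continue  # no equivalent department at this hospital
        resolved.setdefault(local_name, []).extend(values)
    return resolved


# Hypothetical content of a generated module such as department_conditions.py.
DEPARTMENT_CONDITIONS = {
    "Cardiologie": ["hartfalen", "hypertensie"],
    "Dermatologie": ["eczeem"],
}

# Hypothetical mapping from standard names to one hospital's department names.
standard_to_local = {"cardiologie": "Hartcentrum"}

resolved = resolve_department_names(DEPARTMENT_CONDITIONS, standard_to_local)
```

Keeping the committed modules free of hospital-specific names means the same reviewed output can be reused for another tenant by supplying a different mapping.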

Phase 2: Seed Taxonomy

python -m scripts.seed_golden_graph [--wipe] [--no-wipe] [--scrape-first] [--tenant-slug zol]

Loads the ScrapeResult from PostgreSQL, initialises the FrozenTaxonomyRegistry, builds a MedicalExtractionResult by merging all three knowledge layers, and stores everything in PostgreSQL taxonomy tables via the GoldenPageSeeder.

The --wipe flag (default) performs a full replacement of existing taxonomy data before re-seeding. Use --scrape-first to run the scraper before seeding in a single command.
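The flag behaviour described above (a --wipe default with a --no-wipe opt-out) can be sketched with the standard library's argparse. This is an illustration of one way to declare such flags, not the actual parser in scripts.seed_golden_graph, which is not shown here; the help texts and the absence of a default tenant slug are assumptions.

```python
import argparse


def build_parser():
    """Declare flags matching the documented seed_golden_graph command line."""
    parser = argparse.ArgumentParser(prog="seed_golden_graph")
    # BooleanOptionalAction (Python 3.9+) generates both --wipe and --no-wipe.
    parser.add_argument(
        "--wipe",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Replace existing taxonomy data before re-seeding (default).",
    )
    parser.add_argument(
        "--scrape-first",
        action="store_true",
        help="Run the scraper before seeding.",
    )
    parser.add_argument("--tenant-slug", help="Tenant to seed, e.g. zol.")
    return parser


default_args = build_parser().parse_args([])
custom_args = build_parser().parse_args(["--no-wipe", "--scrape-first", "--tenant-slug", "zol"])
```

With no arguments the wipe stays enabled; passing --no-wipe is the only way to keep existing rows in this sketch.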

Three-Source Merge

The GoldenPageSeeder._build_extraction_result() method combines three knowledge sources following the Three-Source Knowledge Architecture:

Merge Priority

Hospital-specific data always takes precedence over universal knowledge:

def _merge_knowledge_maps(universal, hospital):
    """Hospital-specific values first, case-insensitive dedup."""
    # Excerpt: setup of hospital_vals, universal_vals, seen, and combined is omitted.
    for v in hospital_vals + universal_vals:
        if v.lower() not in seen:
            combined.append(v)
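The excerpt above leaves out its setup, so the following is a runnable sketch of the same rule under one assumption: that both inputs are dictionaries mapping a key (for example a department name) to a list of string values. The function name without the leading underscore marks it as an illustration rather than the seeder's actual method.

```python
def merge_knowledge_maps(universal, hospital):
    """Merge two {key: [values]} maps, hospital values first, deduplicated
    case-insensitively while keeping the first spelling encountered."""
    merged = {}
    # Hospital keys come first; universal-only keys are appended afterwards.
    keys = list(hospital) + [k for k in universal if k not in hospital]
    for key in keys:
        seen = set()
        combined = []
        for value in hospital.get(key, []) + universal.get(key, []):
            folded = value.lower()
            if folded not in seen:
                seen.add(folded)
                combined.append(value)
        merged[key] = combined
    return merged


universal = {"Cardiologie": ["Hartfalen", "hypertensie"], "Neurologie": ["migraine"]}
hospital = {"Cardiologie": ["hartfalen", "Aritmie"]}

merged = merge_knowledge_maps(universal, hospital)
```

Because hospital values are visited first, the hospital's spelling "hartfalen" wins over the universal "Hartfalen", and universal values only fill in what the hospital data lacks.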

The merged result feeds every enrichment pass, and each pass follows these rules:

  1. Entity priority: Scraped entities are created first. Synthetic entities (high-traffic search terms not scraped) are only added if they do not already exist.
  2. Relationship priority: Scraped relationships from hub pages get confidence 1.0. Enrichment passes only add relationships that do not already exist (checked via dedup sets).
  3. Type overrides: ENTITY_TYPE_OVERRIDES reclassifies mistyped entities (e.g., items scraped under /behandeling/ that are actually conditions).
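The relationship-priority rule in the list above can be sketched as follows. The Relationship dataclass, the function name, and the 0.7 confidence on the enrichment candidates are illustrative assumptions; the only facts taken from the text are that scraped relationships carry confidence 1.0 and that enrichment never overwrites a relationship that already exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    source: str
    rel_type: str
    target: str
    confidence: float


def _dedup_key(rel):
    """Case-insensitive identity of a relationship, ignoring confidence."""
    return (rel.source.lower(), rel.rel_type, rel.target.lower())


def add_enrichment(scraped, candidates):
    """Keep every scraped relationship; add a candidate only if it is new."""
    seen = {_dedup_key(rel) for rel in scraped}
    result = list(scraped)
    for rel in candidates:
        key = _dedup_key(rel)
        if key not in seen:
            seen.add(key)
            result.append(rel)
    return result


scraped = [Relationship("Cardiologie", "HANDLES", "hartfalen", 1.0)]
candidates = [
    Relationship("cardiologie", "HANDLES", "Hartfalen", 0.7),  # duplicate, dropped
    Relationship("Cardiologie", "HANDLES", "hypertensie", 0.7),  # new, kept
]

relationships = add_enrichment(scraped, candidates)
```

The duplicate candidate is discarded, so the scraped relationship keeps its confidence of 1.0 instead of being downgraded by a later pass.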

What the Merge Produces

The _build_extraction_result() method produces entities and relationships in this order:

Entities:

  1. Hospital entity (ZOL)
  2. Campus entities (4)
  3. Departments -- normalized, deduplicated, reclassified (department/facility/center)
  4. Doctors -- with provider_type inference, multi-fallback department resolution
  5. Conditions, treatments, examinations -- filtered against blocklist and type overrides
  6. Synthetic entities -- curated conditions/treatments for high-traffic search terms
  7. Curated navigational services (Bezoekuren, Route & Parkeren, Afspraken)

Relationships:

  1. Hospital → HAS_CAMPUS → Campus (structural)
  2. Department → BELONGS_TO → Hospital (structural)
  3. Department → LOCATED_AT → Campus (from scraper campus inference + YAML fallback)
  4. Doctor → WORKS_IN → Department (from scrape)
  5. Doctor → WORKS_AT_CAMPUS → Campus (from scrape)
  6. Department → HANDLES → Condition (merged: scraped + hospital config + universal)
  7. Department → OFFERS → Treatment (merged)
  8. Department → PERFORMS → Examination (merged)
  9. Treatment → TREATS → Condition (merged)
  10. Examination → DIAGNOSES → Condition (merged)

Backfill Steps

After the main seeding pass, the GoldenPageSeeder runs two backfill steps:

Doctor Campus Inference

backfill_doctor_campus() creates WORKS_AT_CAMPUS relationships by inference:

Pass 1: Doctors with WORKS_IN relationships where ALL their departments resolve to a single campus get a WORKS_AT_CAMPUS link to that campus.

Pass 2: Doctors with exactly one department that spans multiple campuses (2 or 3, not all 4) get a WORKS_AT_CAMPUS link to each of those campuses. The guard size(campuses) < 4 prevents doctors from being linked to every campus.
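The two passes can be expressed in plain Python as a sketch. The real backfill_doctor_campus() works against the stored taxonomy; here the inputs are simple dictionaries, the function name infer_doctor_campuses is hypothetical, and the campus names are placeholders. "Resolve to a single campus" is read as: every department has campus data and together they point at exactly one campus.

```python
def infer_doctor_campuses(doctor_departments, department_campuses, total_campuses=4):
    """Return {doctor: set of campuses} following the two inference passes."""
    links = {}
    for doctor, departments in doctor_departments.items():
        campus_sets = [department_campuses.get(dept, set()) for dept in departments]

        # Pass 1: all departments resolve, and they agree on one campus.
        union = set().union(*campus_sets)
        if campus_sets and all(campus_sets) and len(union) == 1:
            links[doctor] = union
            continue

        # Pass 2: exactly one department spread over several, but not all, campuses.
        if len(departments) == 1 and 1 < len(campus_sets[0]) < total_campuses:
            links[doctor] = set(campus_sets[0])
    return links


# Placeholder campus names; the text only states that there are four campuses.
department_campuses = {
    "Cardiologie": {"campus_1"},
    "Hartchirurgie": {"campus_1"},
    "Radiologie": {"campus_1", "campus_2"},
    "Spoed": {"campus_1", "campus_2", "campus_3", "campus_4"},
}
doctor_departments = {
    "dr_a": ["Cardiologie", "Hartchirurgie"],  # pass 1: single shared campus
    "dr_b": ["Radiologie"],                    # pass 2: one department, two campuses
    "dr_c": ["Spoed"],                         # blocked by the all-campuses guard
    "dr_d": ["Cardiologie", "Radiologie"],     # ambiguous, no link inferred
}

links = infer_doctor_campuses(doctor_departments, department_campuses)
```

Doctors whose data is ambiguous (dr_c and dr_d here) get no inferred link, which is the conservative outcome the guard is meant to produce.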

BELONGS_TO Completion

backfill_belongs_to() finds all Department or Center nodes without a BELONGS_TO relationship to the Hospital node and creates one. This catches any departments that were created during enrichment but missed the structural pass.
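As a sketch of that completion step, the function below computes which BELONGS_TO relationships are missing from an in-memory list of nodes and relationships. The tuple layouts and the name find_missing_belongs_to are assumptions for illustration; the actual backfill operates on the stored taxonomy.

```python
def find_missing_belongs_to(nodes, relationships, hospital="ZOL"):
    """Return BELONGS_TO relationships to create for unlinked departments.

    nodes: iterable of (name, label) tuples.
    relationships: iterable of (source, rel_type, target) tuples.
    """
    already_linked = {
        source
        for source, rel_type, target in relationships
        if rel_type == "BELONGS_TO" and target == hospital
    }
    return [
        (name, "BELONGS_TO", hospital)
        for name, label in nodes
        if label in {"Department", "Center"} and name not in already_linked
    ]


nodes = [
    ("Cardiologie", "Department"),
    ("Borstcentrum", "Center"),
    ("dr_a", "Doctor"),
]
relationships = [("Cardiologie", "BELONGS_TO", "ZOL")]

missing = find_missing_belongs_to(nodes, relationships)
```

Running the step twice is harmless: once the missing relationships are stored, the second run finds nothing left to add.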

Phase 3: SNOMED CT Enrichment (Approved Design)

After the main seeding pass and backfills, the pipeline runs a SNOMED CT enrichment phase that adds ontology-derived data to existing graph nodes. This phase implements Source 2 of the Three-Source Knowledge Architecture.

What It Does

| Step | Input | Output | Scope |
| --- | --- | --- | --- |
| 1. Concept matching | Entity names → snomed_descriptions | snomed_concept_id property on nodes | ~260 existing entities |
| 2. Synonym enrichment | Concept ID → Dutch descriptions | snomed_synonyms array property | Matched entities only |
| 3. IS_A hierarchy | snomed_transitive_closure | IS_A relationships between existing nodes | Depth limit: 3 hops |
| 4. FINDING_SITE routing | Condition concept → body structure → department | New HANDLES relationships (confidence 0.7) | Conditions with matched concepts |
| 5. PROCEDURE_SITE routing | Treatment concept → body structure → department | New OFFERS relationships (confidence 0.7) | Treatments with matched concepts |
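Step 4 chains two lookups: from a condition's concept to its body structures, then from each body structure to the departments that cover it. The sketch below shows that chain with plain dictionaries. The identifiers (C-100, B-10 and so on) are placeholders rather than real SNOMED CT codes, and the function name and data layout are assumptions; only the HANDLES relationship type and the 0.7 confidence come from the table above.

```python
SNOMED_CONFIDENCE = 0.7  # confidence assigned to SNOMED-derived relationships


def route_by_finding_site(condition_concepts, finding_sites, site_departments, existing_pairs):
    """Derive new Department -HANDLES-> Condition relationships.

    condition_concepts: {condition name: concept id}
    finding_sites:      {concept id: [body structure ids]}
    site_departments:   {body structure id: [department names]}
    existing_pairs:     set of (department, condition) pairs already stored
    """
    seen = set(existing_pairs)
    new_relationships = []
    for condition, concept_id in condition_concepts.items():
        for site_id in finding_sites.get(concept_id, []):
            for department in site_departments.get(site_id, []):
                pair = (department, condition)
                if pair in seen:
                    continue  # never duplicate an existing relationship
                seen.add(pair)
                new_relationships.append(
                    (department, "HANDLES", condition, SNOMED_CONFIDENCE)
                )
    return new_relationships


# Placeholder identifiers, not real SNOMED CT concept ids.
condition_concepts = {"hartfalen": "C-100", "eczeem": "C-200"}
finding_sites = {"C-100": ["B-10"], "C-200": ["B-20"]}
site_departments = {"B-10": ["Cardiologie"], "B-20": ["Dermatologie"]}
existing_pairs = {("Cardiologie", "hartfalen")}

routed = route_by_finding_site(
    condition_concepts, finding_sites, site_departments, existing_pairs
)
```

PROCEDURE_SITE routing in step 5 would follow the same pattern with treatments and OFFERS in place of conditions and HANDLES.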

Node Property Additions

After SNOMED enrichment, condition entities carry ontology-derived properties:

-- Before enrichment:
--   canonical_name: 'diabetes mellitus', metadata: {}

-- After enrichment (PostgreSQL taxonomy_entities):
SELECT canonical_name,
       metadata->>'snomed_concept_id' AS concept_id,
       metadata->>'snomed_preferred_term' AS preferred,
       metadata->'snomed_synonyms' AS synonyms
FROM app.taxonomy_entities
WHERE canonical_name = 'diabetes mellitus';
-- concept_id: 73211009
-- preferred: diabetes mellitus
-- synonyms: ["suikerziekte", "DM", "diabetes mellitus type niet gespecificeerd"]

Scoping Guard

The enrichment phase operates on a closed entity set: only the ~260 entities already created by Sources 1 and 3 are enriched. The 356K SNOMED concepts are NOT bulk-imported. Estimated growth: ~50–100 IS_A relationships and ~100–200 FINDING_SITE/PROCEDURE_SITE-derived relationships.
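The scoping guard and the 3-hop depth limit can be combined in one filter, sketched below. It assumes that rows from snomed_transitive_closure can be read as (descendant id, ancestor id, depth) tuples, which this page does not confirm, and it uses placeholder concept ids. An IS_A relationship is emitted only when both ends are already matched to existing entities.

```python
def is_a_edges(closure_rows, entity_by_concept, max_depth=3):
    """Build IS_A relationships between existing entities only.

    closure_rows:      iterable of (descendant_id, ancestor_id, depth)
    entity_by_concept: {concept id: entity name} for already matched entities
    """
    edges = []
    for descendant_id, ancestor_id, depth in closure_rows:
        if not 1 <= depth <= max_depth:
            continue  # skip self-links and ancestors beyond the depth limit
        if descendant_id in entity_by_concept and ancestor_id in entity_by_concept:
            edges.append(
                (entity_by_concept[descendant_id], "IS_A", entity_by_concept[ancestor_id])
            )
    return edges


# Placeholder concept ids, not real SNOMED CT codes.
entity_by_concept = {"C-1": "diabetes mellitus type 2", "C-2": "diabetes mellitus"}
closure_rows = [
    ("C-1", "C-2", 1),  # both ends are existing entities: kept
    ("C-1", "C-9", 2),  # ancestor is not in the entity set: skipped
    ("C-1", "C-2", 5),  # beyond the depth limit: skipped
]

edges = is_a_edges(closure_rows, entity_by_concept)
```

Filtering on both ends is what keeps the graph from growing beyond the closed entity set, no matter how large the ontology is.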

Plausibility Filtering

SNOMED-derived relationships pass through the same plausibility guards as all other sources: _is_plausible_handles(), DEPT_CONDITION_NEGATIVE_MAP, and hub condition blocklists. The lower confidence (0.7 vs 1.0) also reduces their weight in search ranking.
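A plausibility guard of this kind can be sketched as a lookup against a negative map and a blocklist. The signature of the real _is_plausible_handles() and the contents of DEPT_CONDITION_NEGATIVE_MAP are not shown on this page, so the function below is a stand-in with hypothetical data and a hypothetical blocklist name.

```python
# Hypothetical contents; the real maps live in the seeding code base.
DEPT_CONDITION_NEGATIVE_MAP = {"dermatologie": {"hartfalen"}}
HUB_CONDITION_BLOCKLIST = {"contact"}


def is_plausible_handles(department, condition):
    """Stand-in guard: reject blocklisted conditions and known mismatches."""
    dept = department.lower()
    cond = condition.lower()
    if cond in HUB_CONDITION_BLOCKLIST:
        return False
    return cond not in DEPT_CONDITION_NEGATIVE_MAP.get(dept, set())


candidates = [
    ("Dermatologie", "eczeem"),
    ("Dermatologie", "hartfalen"),  # rejected by the negative map
    ("Cardiologie", "Contact"),     # rejected by the blocklist
]

kept = [pair for pair in candidates if is_plausible_handles(*pair)]
```

Applying the same guard to every source keeps a single place where implausible department and condition pairs are ruled out.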

The graph_golden_only Gate

The graph_golden_only setting in config.py (default True) controls whether regular content ingestion can write entities to the taxonomy:

graph_golden_only: bool = Field(
    default=True,
    description="Only hub pages write to the taxonomy."
)

When True, the system checks each page's classification to determine write permissions:

| Page Type | Can Write to Taxonomy |
| --- | --- |
| hub | Always: defines the authoritative entity set |
| detail | Only when graph_golden_only=False |

The GoldenPageSeeder always has full write authority regardless of this setting.
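The gate reduces to a small decision function, sketched here. The function name can_write_to_taxonomy is hypothetical and the ingestion code may structure the check differently; the behaviour follows the rules stated in this section.

```python
def can_write_to_taxonomy(page_type, graph_golden_only=True):
    """Decide whether content ingestion may write entities for a page."""
    if page_type == "hub":
        return True  # hub pages always define the authoritative entity set
    if page_type == "detail":
        return not graph_golden_only  # detail pages are gated by the flag
    raise ValueError(f"unknown page type: {page_type!r}")


decisions = {
    ("hub", True): can_write_to_taxonomy("hub", True),
    ("hub", False): can_write_to_taxonomy("hub", False),
    ("detail", True): can_write_to_taxonomy("detail", True),
    ("detail", False): can_write_to_taxonomy("detail", False),
}
```

Raising on an unknown page type, rather than silently allowing the write, is a design choice in this sketch that errs on the side of protecting the taxonomy.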

Source Page Classification

Hub/Detail Reclassification (2026-03-09)

The original 8-type classification (golden_seed, golden_listing, department_page, zorgaanbod, doctor_page, brochure, general, unknown) was replaced with binary hub/detail classification. Hub page URLs are now auto-discovered by an LLM classifier rather than manually configured in YAML.

Pages are classified into two types:

| Type | Description | Authority Level |
| --- | --- | --- |
| hub | Navigational listing pages (~20-40 per hospital), auto-detected by LLM classifier | High: defines the authoritative entity set |
| detail | Single-entity pages (individual doctors, conditions, etc.) | Standard: stored but not used for taxonomy seeding |

The graph_golden_only gate (see above) uses hub classification to control write permissions. Hub pages have full write authority; detail pages are gated.

Pipeline Execution Summary

  1. Scrape (scrape_taxonomy): HTML → frozen dataclasses + campus inference from doctor data → PostgreSQL
  2. Enrich (enrich_taxonomy_llm): Taxonomy entities → LLM classification → Python modules (one-time, human-reviewed)
  3. Seed (seed_golden_graph): Load taxonomy + merge 3 sources → PostgreSQL taxonomy tables + backfills
  4. SNOMED Enrich (Phase 3): Existing taxonomy entities → SNOMED concept matching → synonym/hierarchy/relationship enrichment

The same pipeline is available via API endpoints (POST /api/v1/graph/refresh-taxonomy and POST /api/v1/graph/seed-golden-pages) for triggering from the admin UI.
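For scripting against those endpoints, the requests could be prepared with the standard library as sketched below. The base URL is hypothetical, the empty request body is an assumption, and authentication is not covered because this page does not describe it. The sketch only builds the requests; sending them would be done with urllib.request.urlopen.

```python
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # hypothetical host, adjust to the deployment

PIPELINE_PATHS = [
    "/api/v1/graph/refresh-taxonomy",
    "/api/v1/graph/seed-golden-pages",
]


def build_pipeline_requests(base_url=BASE_URL):
    """Prepare one POST request per pipeline endpoint, in the listed order."""
    return [
        Request(base_url + path, data=b"", method="POST")
        for path in PIPELINE_PATHS
    ]


pipeline_requests = build_pipeline_requests()
```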

Production Pipeline (SP-4/SP-5/SP-6)

The CLI-based seeding pipeline described above is complemented by the production pipeline wizard:

  1. SP-4 Entity Resolution — LLM extraction from hub pages, deduplication, SNOMED matching
  2. SP-5 Draft/Publish — Versioned snapshots, impact preview, rollback
  3. SP-6 Management UI — 5-stage pipeline wizard for operators

See Pipeline Wizard for the operator workflow.
