Graph Seeding Pipeline

The seeding pipeline populates the PostgreSQL taxonomy tables from three knowledge sources: automated web scraping (Source 1), SNOMED CT medical ontology (Source 2), and curated configuration (Source 3). It is a multi-phase process: scrape (HTML to frozen dataclasses with campus inference), enrich (LLM-generated universal medical knowledge), seed (merge all sources into PostgreSQL taxonomy tables), and SNOMED enrichment (ontology-derived concept IDs, synonyms, and relationships).

Two Paths to Entity Storage

New to the term "golden page"? Read Golden Pages first — it defines what a golden (hub) page is, how the app.golden_pages table and the AI-discover/human-confirm lifecycle work, and why top-down seeding exists. This page assumes that concept and focuses on the pipeline mechanics.

The system has two distinct paths for writing entities to the taxonomy:

Path	When	What	Authority
Hub page seeding	Manual CLI or API trigger	Full taxonomy from hub pages + domain knowledge + LLM enrichment	Highest -- defines the authoritative entity set
Content ingestion	Per-page during crawl	Entity extraction from individual pages	Lower -- gated by `graph_golden_only` flag

When graph_golden_only=True (the default), content ingestion does not write to the taxonomy. Only the hub page seeding path populates the entity store. This prevents unvalidated relationships from leaking into the taxonomy from crawled pages.

CLI Walkthrough

The pipeline runs as three sequential scripts:

Phase 1: Scrape Taxonomy

python -m scripts.scrape_taxonomy [--dry-run] [--tenant-slug zol]

Fetches HTML from the hub pages auto-discovered by the LLM classifier (these URLs were previously defined in hospital_config/zol.yaml) and parses doctors, departments, conditions, treatments, examinations, and consultation schedules. The results are saved to the PostgreSQL taxonomy tables using a full-replacement strategy (DELETE existing rows, then INSERT new ones).
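The full-replacement strategy can be sketched as a single transaction, so that a failed insert does not leave the table empty. This is a minimal illustration, not the scraper's actual code: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the column names (tenant_slug, entity_type, canonical_name) and the replace_taxonomy helper are assumptions made for the example.

```python
import sqlite3


def replace_taxonomy(conn, tenant_slug, entities):
    """Delete a tenant's rows and insert the fresh scrape in one transaction."""
    # The connection context manager commits on success and rolls back
    # if any statement raises, so readers never see a half-replaced table.
    with conn:
        conn.execute(
            "DELETE FROM taxonomy_entities WHERE tenant_slug = ?",
            (tenant_slug,),
        )
        conn.executemany(
            "INSERT INTO taxonomy_entities (tenant_slug, entity_type, canonical_name) "
            "VALUES (?, ?, ?)",
            [(tenant_slug, entity_type, name) for entity_type, name in entities],
        )


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxonomy_entities "
    "(tenant_slug TEXT, entity_type TEXT, canonical_name TEXT)"
)

# First scrape finds two entities; the second scrape finds only one.
replace_taxonomy(conn, "zol", [("department", "Cardiologie"), ("condition", "diabetes mellitus")])
replace_taxonomy(conn, "zol", [("department", "Cardiologie")])

rows = conn.execute(
    "SELECT entity_type, canonical_name FROM taxonomy_entities"
).fetchall()
```

After the second call only the rows from the latest scrape remain, which is the behaviour a full replacement is meant to guarantee.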

Note: Hub page URLs are no longer manually configured in YAML. The system auto-detects hub pages from crawled content using a binary hub/detail LLM classifier.

Phase 1.5: LLM Enrichment (One-Time)

python -m scripts.enrich_taxonomy_llm [--dry-run] [--output-dir OUTPUT_DIR]

Loads the taxonomy from PostgreSQL and calls a Tier 3 reasoning model to classify relationships between entities. Produces 5 Python modules in medical_knowledge/:

Module	Relationship	Description
`department_conditions.py`	HANDLES	Which departments handle which conditions
`department_treatments.py`	OFFERS	Which departments offer which treatments
`department_examinations.py`	PERFORMS	Which departments perform which examinations
`condition_examinations.py`	DIAGNOSES	Which examinations diagnose which conditions
`treatment_conditions.py`	TREATS	Which treatments treat which conditions

The LLM prompt uses standard Belgian hospital department names (from belgian_hospital_departments.py) rather than hospital-specific names, ensuring the output is portable across hospitals. Resolution to hospital-specific names happens at seeding time.
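One way to picture the seeding-time resolution is a lookup from standard department names to the names a given hospital uses. The sketch below is an assumption about the shape of that step, not the project's implementation: resolve_departments, the alias table, and the sample names are all hypothetical.

```python
def resolve_departments(standard_map, standard_to_hospital):
    """Re-key a portable knowledge map with hospital-specific department names.

    standard_map: {standard department name: [related entity names]}
    standard_to_hospital: {lower-cased standard name: hospital-specific name}
    Departments the hospital does not have are dropped.
    """
    resolved = {}
    for standard_name, values in standard_map.items():
        hospital_name = standard_to_hospital.get(standard_name.lower())
        if hospital_name is None:
            continue  # no matching department at this hospital
        resolved.setdefault(hospital_name, []).extend(values)
    return resolved


# Hypothetical data: one standard name resolves, the other does not.
portable = {"Cardiologie": ["hartfalen"], "Dermatologie": ["eczeem"]}
aliases = {"cardiologie": "Hartcentrum"}
resolved = resolve_departments(portable, aliases)
```

Keeping the committed modules keyed by standard names means the same LLM output can be reused for another hospital by swapping only the alias table.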

This step runs once and the output is committed to version control for human review.

Phase 2: Seed Taxonomy

python -m scripts.seed_golden_graph [--wipe] [--no-wipe] [--scrape-first] [--tenant-slug zol]

Loads the ScrapeResult from PostgreSQL, initialises the FrozenTaxonomyRegistry, builds a MedicalExtractionResult by merging all three knowledge layers, and stores everything in PostgreSQL taxonomy tables via the GoldenPageSeeder.

The --wipe flag (default) performs a full replacement of existing taxonomy data before re-seeding. Use --scrape-first to run the scraper before seeding in a single command.

Three-Source Merge

The GoldenPageSeeder._build_extraction_result() method combines the three knowledge sources, following the Three-Source Knowledge Architecture.

Merge Priority

Hospital-specific data always takes precedence over universal knowledge:

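def _merge_knowledge_maps(universal, hospital):
    """Hospital-specific values first, case-insensitive dedup."""
    merged = {}  # assumes both maps are dict[str, list[str]]
    for key in {**hospital, **universal}:  # hospital keys first
        seen, combined = set(), merged.setdefault(key, [])
        for v in hospital.get(key, []) + universal.get(key, []):
            if v.lower() not in seen:
                seen.add(v.lower())
                combined.append(v)
    return merged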

The merged result is used for all enrichment passes, which follow these priority rules:

Entity priority: Scraped entities are created first. Synthetic entities (high-traffic search terms not scraped) are only added if they do not already exist.
Relationship priority: Scraped relationships from hub pages get confidence 1.0. Enrichment passes only add relationships that do not already exist (checked via dedup sets).
Type overrides: ENTITY_TYPE_OVERRIDES reclassifies mistyped entities (e.g., items scraped under /behandeling/ that are actually conditions).
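The relationship priority rule above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the tuple layout, and the case-insensitive dedup key are hypothetical, and the enrichment confidence is passed in as a parameter because the source only fixes the scraped value (1.0).

```python
def add_enrichment_relationships(scraped, enrichment, enrichment_confidence):
    """Keep every scraped relationship at confidence 1.0, then append only
    those enrichment relationships that are not already present."""
    # Dedup set keyed on (source, relation, target); lower-casing is an assumption.
    seen = {(s.lower(), rel, t.lower()) for s, rel, t in scraped}
    merged = [(s, rel, t, 1.0) for s, rel, t in scraped]
    for s, rel, t in enrichment:
        key = (s.lower(), rel, t.lower())
        if key in seen:
            continue  # scraped (or earlier enrichment) data wins
        seen.add(key)
        merged.append((s, rel, t, enrichment_confidence))
    return merged


scraped = [("Cardiologie", "HANDLES", "Hartfalen")]
enrichment = [
    ("cardiologie", "HANDLES", "hartfalen"),        # duplicate, skipped
    ("Cardiologie", "HANDLES", "Angina pectoris"),  # new, added
]
merged = add_enrichment_relationships(scraped, enrichment, 0.7)
```

Because the scraped relationships seed the dedup set, an enrichment pass can never overwrite or downgrade a relationship that came from a hub page.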

What the Merge Produces

The _build_extraction_result() method produces entities and relationships in this order:

Entities:

Hospital entity (ZOL)
Campus entities (4)
Departments -- normalized, deduplicated, reclassified (department/facility/center)
Doctors -- with provider_type inference, multi-fallback department resolution
Conditions, treatments, examinations -- filtered against blocklist and type overrides
Synthetic entities -- curated conditions/treatments for high-traffic search terms
Curated navigational services (Bezoekuren, Route & Parkeren, Afspraken)

Relationships:

Hospital → HAS_CAMPUS → Campus (structural)
Department → BELONGS_TO → Hospital (structural)
Department → LOCATED_AT → Campus (from scraper campus inference + YAML fallback)
Doctor → WORKS_IN → Department (from scrape)
Doctor → WORKS_AT_CAMPUS → Campus (from scrape)
Department → HANDLES → Condition (merged: scraped + hospital config + universal)
Department → OFFERS → Treatment (merged)
Department → PERFORMS → Examination (merged)
Treatment → TREATS → Condition (merged)
Examination → DIAGNOSES → Condition (merged)

Backfill Steps

After the main seeding pass, the GoldenPageSeeder runs two backfill steps:

Doctor Campus Inference

backfill_doctor_campus() creates WORKS_AT_CAMPUS relationships by inference:

Pass 1: Doctors with WORKS_IN relationships where ALL their departments resolve to a single campus get a WORKS_AT_CAMPUS link to that campus.

Pass 2: Doctors with exactly 1 department that has multiple campuses (2--3, not all 4) get WORKS_AT_CAMPUS links for each campus. The guard size(campuses) < 4 prevents doctors from being linked to all campuses.
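The two passes can be restated in plain Python. The real backfill most likely runs as database queries; the sketch below only mirrors the rules as described, and the function name, input shapes, and campus labels are assumptions.

```python
def infer_doctor_campuses(doctor_departments, department_campuses, total_campuses=4):
    """Return {doctor: set of campuses} for doctors whose campus can be inferred.

    doctor_departments: {doctor: [department, ...]} from WORKS_IN
    department_campuses: {department: [campus, ...]} from LOCATED_AT
    """
    inferred = {}
    for doctor, departments in doctor_departments.items():
        campus_sets = [set(department_campuses.get(d, ())) for d in departments]
        all_campuses = set().union(*campus_sets)
        # Pass 1: every department has a campus and they all agree on one.
        if campus_sets and all(campus_sets) and len(all_campuses) == 1:
            inferred[doctor] = all_campuses
        # Pass 2: exactly one department, spanning several but not all campuses.
        elif len(departments) == 1 and 1 < len(all_campuses) < total_campuses:
            inferred[doctor] = all_campuses
    return inferred


# Hypothetical data with placeholder campus labels.
department_campuses = {
    "Cardiologie": ["Campus A"],
    "Hartrevalidatie": ["Campus A"],
    "Radiologie": ["Campus A", "Campus B"],
    "Spoed": ["Campus A", "Campus B", "Campus C", "Campus D"],
}
doctor_departments = {
    "Dr. A": ["Cardiologie", "Hartrevalidatie"],  # pass 1: single campus
    "Dr. B": ["Radiologie"],                      # pass 2: two campuses
    "Dr. C": ["Spoed"],                           # all four campuses: skipped
    "Dr. D": ["Cardiologie", "Radiologie"],       # ambiguous: skipped
}
inferred = infer_doctor_campuses(doctor_departments, department_campuses)
```

Dr. C shows why the guard exists: a department present on every campus says nothing about where an individual doctor works.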

BELONGS_TO Completion

backfill_belongs_to() finds all Department or Center nodes without a BELONGS_TO relationship to the Hospital node and creates one. This catches any departments that were created during enrichment but missed the structural pass.
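In essence this backfill is a set difference between the nodes that should be linked and the nodes that already are. A minimal sketch, assuming nodes arrive as (name, label) pairs and relationships as (source, relation, target) triples; find_missing_belongs_to is a hypothetical helper, not the project's function.

```python
def find_missing_belongs_to(nodes, relationships, hospital="ZOL"):
    """Return the BELONGS_TO relationships that still need to be created."""
    already_linked = {
        src for src, rel, dst in relationships
        if rel == "BELONGS_TO" and dst == hospital
    }
    return [
        (name, "BELONGS_TO", hospital)
        for name, label in nodes
        if label in {"Department", "Center"} and name not in already_linked
    ]


# Hypothetical data: one department is linked, one center is not.
nodes = [("Cardiologie", "Department"), ("Slaapcentrum", "Center"), ("Dr. A", "Doctor")]
relationships = [("Cardiologie", "BELONGS_TO", "ZOL")]
missing = find_missing_belongs_to(nodes, relationships)
```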

Phase 3: SNOMED CT Enrichment (Approved Design)

After the main seeding pass and backfills, the pipeline runs a SNOMED CT enrichment phase that adds ontology-derived data to existing graph nodes. This phase implements Source 2 of the Three-Source Knowledge Architecture.

What It Does

Step	Input	Output	Scope
1. Concept matching	Entity names → `snomed_descriptions`	`snomed_concept_id` property on nodes	~260 existing entities
2. Synonym enrichment	Concept ID → Dutch descriptions	`snomed_synonyms` array property	Matched entities only
3. IS_A hierarchy	`snomed_transitive_closure`	IS_A relationships between existing nodes	Depth limit: 3 hops
4. FINDING_SITE routing	Condition concept → body structure → department	New HANDLES relationships (confidence 0.7)	Conditions with matched concepts
5. PROCEDURE_SITE routing	Treatment concept → body structure → department	New OFFERS relationships (confidence 0.7)	Treatments with matched concepts
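Step 4 of the table can be sketched as a two-hop lookup: condition concept to body structure, then body structure to department. The lookup tables, the site label, and the function name below are assumptions for illustration; only the concept ID 73211009 and the confidence 0.7 come from this page.

```python
SNOMED_CONFIDENCE = 0.7  # confidence for SNOMED-derived relationships


def route_conditions(condition_concepts, finding_sites, site_departments, existing):
    """Derive new Department -> HANDLES -> Condition relationships.

    condition_concepts: {condition name: SNOMED concept ID}
    finding_sites: {concept ID: [body structure, ...]}
    site_departments: {body structure: [department, ...]}
    existing: set of (department, "HANDLES", condition) already in the taxonomy
    """
    seen = set(existing)  # copy so the caller's set is not mutated
    new_relationships = []
    for condition, concept_id in condition_concepts.items():
        for site in finding_sites.get(concept_id, []):
            for department in site_departments.get(site, []):
                key = (department, "HANDLES", condition)
                if key in seen:
                    continue  # higher-priority sources already cover this
                seen.add(key)
                new_relationships.append((*key, SNOMED_CONFIDENCE))
    return new_relationships


# Hypothetical lookup tables; the site label is a placeholder, not a SNOMED ID.
condition_concepts = {"diabetes mellitus": "73211009"}
finding_sites = {"73211009": ["site:endocrine-system"]}
site_departments = {"site:endocrine-system": ["Endocrinologie"]}

routed = route_conditions(condition_concepts, finding_sites, site_departments, set())
already_known = route_conditions(
    condition_concepts, finding_sites, site_departments,
    {("Endocrinologie", "HANDLES", "diabetes mellitus")},
)
```

PROCEDURE_SITE routing in step 5 would follow the same shape with treatments and OFFERS in place of conditions and HANDLES.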

Node Property Additions

After SNOMED enrichment, condition entities carry ontology-derived properties:

-- Before enrichment:
-- canonical_name: 'diabetes mellitus', metadata: {}

-- After enrichment (PostgreSQL taxonomy_entities):
SELECT canonical_name, metadata->>'snomed_concept_id' AS concept_id,
       metadata->>'snomed_preferred_term' AS preferred,
       metadata->'snomed_synonyms' AS synonyms
FROM app.taxonomy_entities
WHERE canonical_name = 'diabetes mellitus';
-- concept_id: 73211009
-- preferred: diabetes mellitus
-- synonyms: ["suikerziekte", "DM", "diabetes mellitus type niet gespecificeerd"]

Scoping Guard

The enrichment phase operates on a closed entity set: only the ~260 entities already created by Sources 1 and 3 are enriched. The 356K SNOMED concepts are NOT bulk-imported. Estimated growth: ~50–100 IS_A relationships and ~100–200 FINDING_SITE/PROCEDURE_SITE-derived relationships.
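The closed-set rule combines naturally with the 3-hop depth limit on IS_A: a hierarchy edge is kept only when both endpoints are entities that already exist. The sketch below assumes the transitive closure is available as (subtype ID, supertype ID, depth) rows, which is a guess at the table layout; the placeholder IDs other than 73211009 are hypothetical.

```python
def is_a_edges(closure_rows, concept_to_entity, max_depth=3):
    """Build IS_A relationships between entities that already exist.

    closure_rows: iterable of (subtype_id, supertype_id, depth)
    concept_to_entity: {concept ID: canonical name} for matched entities only
    """
    edges = []
    for subtype_id, supertype_id, depth in closure_rows:
        if not 1 <= depth <= max_depth:
            continue  # too distant to be a useful hierarchy hint
        child = concept_to_entity.get(subtype_id)
        parent = concept_to_entity.get(supertype_id)
        # Scoping guard: never create a node for an unmatched concept.
        if child is None or parent is None or child == parent:
            continue
        edges.append((child, "IS_A", parent))
    return edges


# Hypothetical rows; only 73211009 is an ID taken from this page.
concept_to_entity = {
    "C-TYPE2": "diabetes type 2",
    "73211009": "diabetes mellitus",
    "C-ROOT": "ziekte",
}
closure_rows = [
    ("C-TYPE2", "73211009", 1),      # kept
    ("C-TYPE2", "C-ROOT", 5),        # beyond the depth limit
    ("C-UNMATCHED", "73211009", 1),  # subtype is not an existing entity
]
edges = is_a_edges(closure_rows, concept_to_entity)
```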

Plausibility Filtering

SNOMED-derived relationships pass through the same plausibility guards as all other sources: _is_plausible_handles(), DEPT_CONDITION_NEGATIVE_MAP, and hub condition blocklists. The lower confidence (0.7 vs 1.0) also reduces their weight in search ranking.

The `graph_golden_only` Gate

The graph_golden_only setting in config.py (default True) controls whether regular content ingestion can write entities to the taxonomy:

graph_golden_only: bool = Field(
    default=True,
    description="Only hub pages write to the taxonomy."
)

When True, the system checks each page's classification to determine write permissions:

Page Type	Can Write to Taxonomy
`hub`	Always -- defines the authoritative entity set
`detail`	Only when `graph_golden_only=False`

The GoldenPageSeeder always has full write authority regardless of this setting.
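The table above reduces to a small decision function. This is a sketch of the rule as documented, with a hypothetical function name; the real check may live elsewhere in the ingestion code.

```python
def can_write_to_taxonomy(page_type, graph_golden_only=True):
    """Decide whether content ingestion may write entities for a page."""
    if page_type == "hub":
        return True  # hub pages always define the authoritative entity set
    if page_type == "detail":
        return not graph_golden_only  # gated by the setting
    raise ValueError(f"unknown page type: {page_type!r}")


hub_default = can_write_to_taxonomy("hub")
detail_default = can_write_to_taxonomy("detail")
detail_ungated = can_write_to_taxonomy("detail", graph_golden_only=False)
```

With the default setting, only hub pages pass the gate, which matches the behaviour described for content ingestion at the top of this page.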

Source Page Classification

Hub/Detail Reclassification (2026-03-09)

The original 8-type classification (golden_seed, golden_listing, department_page, zorgaanbod, doctor_page, brochure, general, unknown) was replaced with binary hub/detail classification. Hub page URLs are now auto-discovered by an LLM classifier rather than manually configured in YAML.

Pages are classified into two types:

Type	Description	Authority Level
`hub`	Navigational listing pages (~20-40 per hospital) -- auto-detected by LLM classifier	High -- defines the authoritative entity set
`detail`	Single-entity pages (individual doctors, conditions, etc.)	Standard -- stored but not used for taxonomy seeding

The graph_golden_only gate (see above) uses hub classification to control write permissions. Hub pages have full write authority; detail pages are gated.

Pipeline Execution Summary

Scrape (scrape_taxonomy): HTML → frozen dataclasses + campus inference from doctor data → PostgreSQL
Enrich (enrich_taxonomy_llm): Taxonomy entities → LLM classification → Python modules (one-time, human-reviewed)
Seed (seed_golden_graph): Load taxonomy + merge 3 sources → PostgreSQL taxonomy tables + backfills
SNOMED Enrich (Phase 3): Existing taxonomy entities → SNOMED concept matching → synonym/hierarchy/relationship enrichment

The same pipeline is available via API endpoints (POST /api/v1/graph/refresh-taxonomy and POST /api/v1/graph/seed-golden-pages) for triggering from the admin UI.

Production Pipeline (SP-4/SP-5/SP-6)

The CLI-based seeding pipeline described above is complemented by the production pipeline wizard:

SP-4 Entity Resolution — LLM extraction from hub pages, deduplication, SNOMED matching
SP-5 Draft/Publish — Versioned snapshots, impact preview, rollback
SP-6 Management UI — 5-stage pipeline wizard for operators

See Pipeline Wizard for the operator workflow.

