Graph Seeding Pipeline
The seeding pipeline populates the PostgreSQL taxonomy tables from three knowledge sources: automated web scraping (Source 1), SNOMED CT medical ontology (Source 2), and curated configuration (Source 3). It is a multi-phase process: scrape (HTML to frozen dataclasses with campus inference), enrich (LLM-generated universal medical knowledge), seed (merge all sources into PostgreSQL taxonomy tables), and SNOMED enrichment (ontology-derived concept IDs, synonyms, and relationships).
Two Paths to Entity Storage
New to the term "golden page"? Read Golden Pages first: it defines what a golden (hub) page is, how the app.golden_pages table and the AI-discover/human-confirm lifecycle work, and why top-down seeding exists. This page assumes that concept and focuses on the pipeline mechanics.
The system has two distinct paths for writing entities to the taxonomy:
| Path | When | What | Authority |
|---|---|---|---|
| Hub page seeding | Manual CLI or API trigger | Full taxonomy from hub pages + domain knowledge + LLM enrichment | Highest -- defines the authoritative entity set |
| Content ingestion | Per-page during crawl | Entity extraction from individual pages | Lower -- gated by graph_golden_only flag |
When graph_golden_only=True (the default), content ingestion does not write to the taxonomy. Only the hub page seeding path populates the entity store. This prevents unvalidated relationships from leaking into the taxonomy from crawled pages.
CLI Walkthrough
The pipeline runs as three sequential scripts:
Phase 1: Scrape Taxonomy
python -m scripts.scrape_taxonomy [--dry-run] [--tenant-slug zol]
Fetches HTML from hub pages auto-discovered by the LLM classifier (previously defined in hospital_config/zol.yaml), parses doctors, departments, conditions, treatments, examinations, and consultation schedules. Saves to PostgreSQL taxonomy tables using a full-replacement strategy (DELETE existing + INSERT new).
Hub page URLs are no longer manually configured in YAML. The system auto-detects hub pages from crawled content using a binary hub/detail LLM classifier.
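The full-replacement strategy (DELETE existing + INSERT new) can be sketched as a single transaction, so a failed scrape never leaves the tables half-empty. This is a minimal sketch, not the scraper's actual code: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the `replace_taxonomy` helper, the `tenant_slug` column, and the simplified table layout are assumptions made for illustration.

```python
import sqlite3

def replace_taxonomy(conn, tenant_slug, entities):
    """Replace all taxonomy rows for one tenant inside a single transaction.

    `entities` is an iterable of (entity_type, canonical_name) tuples.
    """
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            "DELETE FROM taxonomy_entities WHERE tenant_slug = ?", (tenant_slug,)
        )
        conn.executemany(
            "INSERT INTO taxonomy_entities (tenant_slug, entity_type, canonical_name)"
            " VALUES (?, ?, ?)",
            [(tenant_slug, etype, name) for etype, name in entities],
        )

# Hypothetical, simplified schema for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxonomy_entities"
    " (tenant_slug TEXT, entity_type TEXT, canonical_name TEXT)"
)
replace_taxonomy(conn, "zol", [("department", "Cardiologie"),
                               ("condition", "diabetes mellitus")])
# A second run replaces the first result instead of appending to it.
replace_taxonomy(conn, "zol", [("department", "Cardiologie")])
rows = conn.execute(
    "SELECT entity_type, canonical_name FROM taxonomy_entities"
).fetchall()
```

After the second call only the rows from the latest scrape remain, which is the behaviour the full-replacement strategy is meant to guarantee.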
Phase 1.5: LLM Enrichment (One-Time)
python -m scripts.enrich_taxonomy_llm [--dry-run] [--output-dir OUTPUT_DIR]
Loads the taxonomy from PostgreSQL and calls a Tier 3 reasoning model to classify relationships between entities. Produces 5 Python modules in medical_knowledge/:
| Module | Relationship | Description |
|---|---|---|
| department_conditions.py | HANDLES | Which departments handle which conditions |
| department_treatments.py | OFFERS | Which departments offer which treatments |
| department_examinations.py | PERFORMS | Which departments perform which examinations |
| condition_examinations.py | DIAGNOSES | Which examinations diagnose which conditions |
| treatment_conditions.py | TREATS | Which treatments treat which conditions |
The LLM prompt uses standard Belgian hospital department names (from belgian_hospital_departments.py) rather than hospital-specific names, ensuring the output is portable across hospitals. Resolution to hospital-specific names happens at seeding time.
This step runs once and the output is committed to version control for human review.
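Resolution from standard department names to hospital-specific names at seeding time could look roughly like the sketch below. The `resolve_department_names` helper, the shape of the maps, and the example names are all hypothetical, and dropping standard departments that have no hospital equivalent is an assumption rather than documented behaviour.

```python
def resolve_department_names(universal_map, standard_to_hospital):
    """Re-key a universal knowledge map using hospital-specific department names.

    Assumption: standard names with no hospital equivalent are dropped,
    because there is no department to attach the relationships to.
    """
    resolved = {}
    for standard_name, values in universal_map.items():
        hospital_name = standard_to_hospital.get(standard_name.lower())
        if hospital_name is None:
            continue
        resolved.setdefault(hospital_name, []).extend(values)
    return resolved

# Illustrative data only; these names are not taken from the real modules.
universal = {
    "Cardiologie": ["hartfalen"],
    "Pneumologie": ["astma"],
    "Geriatrie": ["dementie"],
}
standard_to_hospital = {"cardiologie": "Hartcentrum", "pneumologie": "Longziekten"}
resolved = resolve_department_names(universal, standard_to_hospital)
```

Keeping the LLM output keyed on standard names means the same five modules can be reused for another hospital by swapping only the name mapping.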
Phase 2: Seed Taxonomy
python -m scripts.seed_golden_graph [--wipe] [--no-wipe] [--scrape-first] [--tenant-slug zol]
Loads the ScrapeResult from PostgreSQL, initialises the FrozenTaxonomyRegistry, builds a MedicalExtractionResult by merging all three knowledge layers, and stores everything in PostgreSQL taxonomy tables via the GoldenPageSeeder.
The --wipe flag (default) performs a full replacement of existing taxonomy data before re-seeding. Use --scrape-first to run the scraper before seeding in a single command.
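For readers unfamiliar with paired on/off flags, the sketch below shows one way the options listed above could be declared with argparse. The script's real parser is not shown on this page, so `build_parser` and the `None` default for `--tenant-slug` are assumptions.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the flags in the command line above."""
    parser = argparse.ArgumentParser(prog="seed_golden_graph")
    # BooleanOptionalAction generates both --wipe and --no-wipe (Python 3.9+).
    parser.add_argument("--wipe", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--scrape-first", action="store_true")
    parser.add_argument("--tenant-slug", default=None)
    return parser

default_args = build_parser().parse_args([])
incremental_args = build_parser().parse_args(
    ["--no-wipe", "--scrape-first", "--tenant-slug", "zol"]
)
```

With no arguments `wipe` is true, matching the documented default; passing `--no-wipe` turns the full replacement off.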
Three-Source Merge
The GoldenPageSeeder._build_extraction_result() method combines three knowledge sources following the Three-Source Knowledge Architecture:
Merge Priority
Hospital-specific data always takes precedence over universal knowledge:
def _merge_knowledge_maps(universal, hospital):
    """Hospital-specific values first, case-insensitive dedup."""
    for v in hospital_vals + universal_vals:
        if v.lower() not in seen:
            combined.append(v)
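The excerpt above omits the per-key loop and the bookkeeping around `seen`. A self-contained sketch of the same idea follows; it assumes both maps have the shape `{key: [values]}`, and the function name `merge_knowledge_maps` and the example data are illustrative rather than the project's own.

```python
def merge_knowledge_maps(universal, hospital):
    """Merge two {key: [values]} maps: hospital values first, case-insensitive dedup."""
    merged = {}
    # Visit hospital keys first, then keys that only exist in the universal map.
    for key in list(hospital) + [k for k in universal if k not in hospital]:
        seen = set()
        combined = []
        for value in hospital.get(key, []) + universal.get(key, []):
            if value.lower() not in seen:
                seen.add(value.lower())
                combined.append(value)
        merged[key] = combined
    return merged

universal = {"Cardiologie": ["hartfalen", "Hypertensie"], "Neurologie": ["migraine"]}
hospital = {"Cardiologie": ["hypertensie", "aritmie"]}
merged = merge_knowledge_maps(universal, hospital)
```

Because hospital values are visited first, the hospital's spelling ("hypertensie") wins over the universal variant ("Hypertensie") when the two differ only in case.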
The result is used for all enrichment passes:
- Entity priority: Scraped entities are created first. Synthetic entities (high-traffic search terms not scraped) are only added if they do not already exist.
- Relationship priority: Scraped relationships from hub pages get confidence 1.0. Enrichment passes only add relationships that do not already exist (checked via dedup sets).
- Type overrides: ENTITY_TYPE_OVERRIDES reclassifies mistyped entities (e.g., items scraped under /behandeling/ that are actually conditions).
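The relationship-priority rule can be illustrated with a dedup set keyed on the relationship triple. This is a sketch under stated assumptions: the helper name, the tuple layout, and the case-insensitive key are hypothetical, and the 0.7 used in the example is the confidence this page gives for SNOMED-derived relationships; other enrichment passes may use different values.

```python
def add_enrichment_relationships(scraped, enrichment, enrichment_confidence):
    """Return scraped relationships plus enrichment ones that are not already present.

    Each input relationship is a (source, rel_type, target) triple; each output
    item adds a confidence value as a fourth element.
    """
    relationships = []
    seen = set()  # dedup set keyed on the lower-cased triple (assumption)
    for source, rel_type, target in scraped:
        key = (source.lower(), rel_type, target.lower())
        if key not in seen:
            seen.add(key)
            relationships.append((source, rel_type, target, 1.0))
    for source, rel_type, target in enrichment:
        key = (source.lower(), rel_type, target.lower())
        if key not in seen:
            seen.add(key)
            relationships.append((source, rel_type, target, enrichment_confidence))
    return relationships

scraped = [("Cardiologie", "HANDLES", "hartfalen")]
enrichment = [
    ("cardiologie", "HANDLES", "Hartfalen"),     # duplicate of the scraped triple
    ("Cardiologie", "HANDLES", "hypertensie"),   # genuinely new
]
result = add_enrichment_relationships(scraped, enrichment, 0.7)
```

The scraped triple keeps confidence 1.0, and the enrichment pass contributes only the relationship that was not already known.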
What the Merge Produces
The _build_extraction_result() method produces entities and relationships in this order:
Entities:
- Hospital entity (ZOL)
- Campus entities (4)
- Departments -- normalized, deduplicated, reclassified (department/facility/center)
- Doctors -- with provider_type inference, multi-fallback department resolution
- Conditions, treatments, examinations -- filtered against blocklist and type overrides
- Synthetic entities -- curated conditions/treatments for high-traffic search terms
- Curated navigational services (Bezoekuren, Route & Parkeren, Afspraken)
Relationships:
- Hospital → HAS_CAMPUS → Campus (structural)
- Department → BELONGS_TO → Hospital (structural)
- Department → LOCATED_AT → Campus (from scraper campus inference + YAML fallback)
- Doctor → WORKS_IN → Department (from scrape)
- Doctor → WORKS_AT_CAMPUS → Campus (from scrape)
- Department → HANDLES → Condition (merged: scraped + hospital config + universal)
- Department → OFFERS → Treatment (merged)
- Department → PERFORMS → Examination (merged)
- Treatment → TREATS → Condition (merged)
- Examination → DIAGNOSES → Condition (merged)
Backfill Steps
After the main seeding pass, the GoldenPageSeeder runs two backfill steps:
Doctor Campus Inference
backfill_doctor_campus() creates WORKS_AT_CAMPUS relationships by inference:
Pass 1: Doctors with WORKS_IN relationships where ALL their departments resolve to a single campus get a WORKS_AT_CAMPUS link to that campus.
Pass 2: Doctors with exactly 1 department that has multiple campuses (2--3, not all 4) get WORKS_AT_CAMPUS links for each campus. The guard size(campuses) < 4 prevents doctors from being linked to all campuses.
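The two passes can be expressed over plain dictionaries as follows. This is an in-memory sketch, not the seeder's query: `infer_doctor_campuses`, the input shapes, and the placeholder doctor and campus names are assumptions, and a department with no known campus is treated as failing Pass 1.

```python
def infer_doctor_campuses(doctor_departments, department_campuses, total_campuses=4):
    """Return {doctor: sorted campus list} following the two passes described above."""
    links = {}
    for doctor, departments in doctor_departments.items():
        campus_sets = [set(department_campuses.get(d, ())) for d in departments]
        all_campuses = set().union(*campus_sets)
        # Pass 1: every department resolves to the same single campus.
        if (campus_sets and all(len(c) == 1 for c in campus_sets)
                and len(all_campuses) == 1):
            links[doctor] = sorted(all_campuses)
        # Pass 2: exactly one department, on several campuses but not all of them.
        elif len(departments) == 1 and 1 < len(all_campuses) < total_campuses:
            links[doctor] = sorted(all_campuses)
    return links

doctor_departments = {
    "doctor_a": ["Cardiologie", "Hartchirurgie"],
    "doctor_b": ["Radiologie"],
    "doctor_c": ["Spoed"],
    "doctor_d": ["Cardiologie", "Radiologie"],
}
department_campuses = {
    "Cardiologie": ["campus_1"],
    "Hartchirurgie": ["campus_1"],
    "Radiologie": ["campus_1", "campus_2"],
    "Spoed": ["campus_1", "campus_2", "campus_3", "campus_4"],
}
links = infer_doctor_campuses(doctor_departments, department_campuses)
```

Here `doctor_c` gets no link because the only department spans all four campuses, and `doctor_d` gets none because neither pass applies to a doctor with several departments on differing campuses.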
BELONGS_TO Completion
backfill_belongs_to() finds all Department or Center nodes without a BELONGS_TO relationship to the Hospital node and creates one. This catches any departments that were created during enrichment but missed by the structural pass.
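As a rough illustration of the completion step, the sketch below computes the missing BELONGS_TO triples from in-memory lists. The function name `missing_belongs_to` and the data shapes are hypothetical; the real method operates on the stored taxonomy.

```python
def missing_belongs_to(nodes, relationships, hospital):
    """Return BELONGS_TO triples for Department/Center nodes that lack one.

    `nodes` is a list of (name, label) pairs; `relationships` is a list of
    (source, rel_type, target) triples.
    """
    linked = {
        source
        for source, rel_type, target in relationships
        if rel_type == "BELONGS_TO" and target == hospital
    }
    return [
        (name, "BELONGS_TO", hospital)
        for name, label in nodes
        if label in {"Department", "Center"} and name not in linked
    ]

nodes = [
    ("Cardiologie", "Department"),
    ("Slaapcentrum", "Center"),
    ("hartfalen", "Condition"),
]
relationships = [("Cardiologie", "BELONGS_TO", "ZOL")]
to_create = missing_belongs_to(nodes, relationships, "ZOL")
```

Only the center that lacks the structural link is returned; conditions and already-linked departments are left untouched.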
Phase 3: SNOMED CT Enrichment (Approved Design)
After the main seeding pass and backfills, the pipeline runs a SNOMED CT enrichment phase that adds ontology-derived data to existing graph nodes. This phase implements Source 2 of the Three-Source Knowledge Architecture.
What It Does
| Step | Input | Output | Scope |
|---|---|---|---|
| 1. Concept matching | Entity names → snomed_descriptions | snomed_concept_id property on nodes | ~260 existing entities |
| 2. Synonym enrichment | Concept ID → Dutch descriptions | snomed_synonyms array property | Matched entities only |
| 3. IS_A hierarchy | snomed_transitive_closure | IS_A relationships between existing nodes | Depth limit: 3 hops |
| 4. FINDING_SITE routing | Condition concept → body structure → department | New HANDLES relationships (confidence 0.7) | Conditions with matched concepts |
| 5. PROCEDURE_SITE routing | Treatment concept → body structure → department | New OFFERS relationships (confidence 0.7) | Treatments with matched concepts |
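Step 4 in the table chains two lookups: condition to body structure, then body structure to department. A minimal sketch of that chain follows, assuming both lookups are available as plain dictionaries; the helper name, the dictionary shapes, and the example body structures are hypothetical.

```python
SNOMED_CONFIDENCE = 0.7  # confidence for SNOMED-derived relationships, as in the table

def route_conditions_by_finding_site(condition_sites, site_departments, existing_handles):
    """Derive (department, "HANDLES", condition, confidence) tuples via body structure."""
    derived = []
    seen = set(existing_handles)  # never re-create a relationship that already exists
    for condition, sites in condition_sites.items():
        for site in sites:
            for department in site_departments.get(site, []):
                triple = (department, "HANDLES", condition)
                if triple not in seen:
                    seen.add(triple)
                    derived.append(triple + (SNOMED_CONFIDENCE,))
    return derived

condition_sites = {"hartfalen": ["hartstructuur"], "astma": ["long"]}
site_departments = {"hartstructuur": ["Cardiologie"], "long": ["Pneumologie"]}
existing_handles = {("Pneumologie", "HANDLES", "astma")}
derived = route_conditions_by_finding_site(
    condition_sites, site_departments, existing_handles
)
```

PROCEDURE_SITE routing in step 5 would follow the same pattern with treatments and OFFERS in place of conditions and HANDLES.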
Node Property Additions
After SNOMED enrichment, condition entities carry ontology-derived properties:
-- Before enrichment:
-- canonical_name: 'diabetes mellitus', metadata: {}
-- After enrichment (PostgreSQL taxonomy_entities):
SELECT canonical_name, metadata->>'snomed_concept_id' AS concept_id,
metadata->>'snomed_preferred_term' AS preferred,
metadata->'snomed_synonyms' AS synonyms
FROM app.taxonomy_entities
WHERE canonical_name = 'diabetes mellitus';
-- concept_id: 73211009
-- preferred: diabetes mellitus
-- synonyms: ["suikerziekte", "DM", "diabetes mellitus type niet gespecificeerd"]
Scoping Guard
The enrichment phase operates on a closed entity set: only the ~260 entities already created by Sources 1 and 3 are enriched. The 356K SNOMED concepts are NOT bulk-imported. Estimated growth: ~50–100 IS_A relationships and ~100–200 FINDING_SITE/PROCEDURE_SITE-derived relationships.
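The closed-set rule, combined with the 3-hop depth limit on IS_A, can be sketched as a bounded breadth-first walk that only emits a relationship when the ancestor is itself an existing entity. Note the substitution: the pipeline reads a precomputed snomed_transitive_closure table, whereas this sketch walks a direct-parent map for clarity. The function name, the map shapes, the entity names, and the placeholder concept IDs (`c1` to `c5`) are all hypothetical.

```python
from collections import deque

def is_a_within_closed_set(entity_concepts, direct_parents, max_depth=3):
    """Return (child, "IS_A", ancestor) triples between already-known entities only."""
    concept_to_entity = {cid: name for name, cid in entity_concepts.items()}
    triples = []
    for entity, concept in entity_concepts.items():
        queue = deque([(concept, 0)])
        visited = {concept}
        while queue:
            current, depth = queue.popleft()
            if depth == max_depth:
                continue  # do not walk past the hop limit
            for parent in direct_parents.get(current, []):
                if parent in visited:
                    continue
                visited.add(parent)
                if parent in concept_to_entity:
                    triples.append((entity, "IS_A", concept_to_entity[parent]))
                queue.append((parent, depth + 1))
    return triples

entity_concepts = {
    "diabetes type 2": "c1",
    "diabetes mellitus": "c2",
    "endocriene aandoening": "c5",
}
direct_parents = {"c1": ["c2"], "c2": ["c3"], "c3": ["c4"], "c4": ["c5"]}
triples = is_a_within_closed_set(entity_concepts, direct_parents)
```

Intermediate concepts `c3` and `c4` are traversed but never stored, and the link from "diabetes type 2" to "endocriene aandoening" is skipped because it lies four hops away.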
Plausibility Filtering
SNOMED-derived relationships pass through the same plausibility guards as all other sources: _is_plausible_handles(), DEPT_CONDITION_NEGATIVE_MAP, and hub condition blocklists. The lower confidence (0.7 vs 1.0) also reduces their weight in search ranking.
The graph_golden_only Gate
The graph_golden_only setting in config.py (default True) controls whether regular content ingestion can write entities to the taxonomy:
graph_golden_only: bool = Field(
    default=True,
    description="Only hub pages write to the taxonomy."
)
When True, the system checks each page's classification to determine write permissions:
| Page Type | Can Write to Taxonomy |
|---|---|
| hub | Always -- defines the authoritative entity set |
| detail | Only when graph_golden_only=False |
The GoldenPageSeeder always has full write authority regardless of this setting.
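The permission table reduces to a small predicate. The sketch below is illustrative only: `can_write_to_taxonomy` is a hypothetical name, and raising on an unknown page type is an assumption about how unexpected input might be handled.

```python
def can_write_to_taxonomy(page_type, graph_golden_only=True):
    """Hub pages always write; detail pages write only when the gate is off."""
    if page_type == "hub":
        return True
    if page_type == "detail":
        return not graph_golden_only
    raise ValueError(f"unknown page type: {page_type!r}")

hub_default = can_write_to_taxonomy("hub")
detail_default = can_write_to_taxonomy("detail")
detail_gate_off = can_write_to_taxonomy("detail", graph_golden_only=False)
```

With the default setting, only hub pages pass the check, which matches the behaviour described for content ingestion.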
Source Page Classification
The original 8-type classification (golden_seed, golden_listing, department_page, zorgaanbod, doctor_page, brochure, general, unknown) was replaced with binary hub/detail classification. Hub page URLs are now auto-discovered by an LLM classifier rather than manually configured in YAML.
Pages are classified into two types:
| Type | Description | Authority Level |
|---|---|---|
| hub | Navigational listing pages (~20-40 per hospital) -- auto-detected by LLM classifier | High -- defines the authoritative entity set |
| detail | Single-entity pages (individual doctors, conditions, etc.) | Standard -- stored but not used for taxonomy seeding |
The graph_golden_only gate (see above) uses hub classification to control write permissions. Hub pages have full write authority; detail pages are gated.
Pipeline Execution Summary
- Scrape (scrape_taxonomy): HTML → frozen dataclasses + campus inference from doctor data → PostgreSQL
- Enrich (enrich_taxonomy_llm): Taxonomy entities → LLM classification → Python modules (one-time, human-reviewed)
- Seed (seed_golden_graph): Load taxonomy + merge 3 sources → PostgreSQL taxonomy tables + backfills
- SNOMED Enrich (Phase 3): Existing taxonomy entities → SNOMED concept matching → synonym/hierarchy/relationship enrichment
The same pipeline is available via API endpoints (POST /api/v1/graph/refresh-taxonomy and POST /api/v1/graph/seed-golden-pages) for triggering from the admin UI.
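A client could prepare the two POST requests as sketched below, using only the standard library. The base URL is a placeholder, the empty request body is an assumption, and authentication is omitted because this page does not describe it; the order (refresh, then seed) mirrors the CLI's scrape-then-seed sequence and is likewise assumed.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical; replace with the real API host

def build_pipeline_requests(base_url=BASE_URL):
    """Build (but do not send) the POST requests for the two documented endpoints."""
    paths = [
        "/api/v1/graph/refresh-taxonomy",
        "/api/v1/graph/seed-golden-pages",
    ]
    return [
        urllib.request.Request(base_url + path, data=b"", method="POST")
        for path in paths
    ]

pipeline_requests = build_pipeline_requests()
# To trigger the pipeline, send each request in order, for example:
#     for req in pipeline_requests:
#         urllib.request.urlopen(req)
```

Building the requests separately from sending them keeps the sketch safe to run without a live server.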
Production Pipeline (SP-4/SP-5/SP-6)
The CLI-based seeding pipeline described above is complemented by the production pipeline wizard:
- SP-4 Entity Resolution — LLM extraction from hub pages, deduplication, SNOMED matching
- SP-5 Draft/Publish — Versioned snapshots, impact preview, rollback
- SP-6 Management UI — 5-stage pipeline wizard for operators
See Pipeline Wizard for the operator workflow.