Architectural Update (March 2026)

This ADR was written when the system used Neo4j for entity storage. As of March 2026, Neo4j has been fully removed and replaced by PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The decision rationale documented here remains valid; the storage layer has changed.

ADR-0028: Golden-Page-Seeded Taxonomy-Gated Ingestion

Superseded (2026-03-09): The 8-type classification system described below has been replaced with binary hub/detail classification. See the hub/detail reclassification plan for details.

Date: 2026-02-13 | Status: Accepted

Context

The ZOL RAG knowledge graph (Hogan et al., 2021) had critical quality issues traced to a single root cause: entities were extracted bottom-up from unstructured content (brochures, news pages) instead of being seeded top-down from authoritative golden pages. This caused:

  • Phantom relationships: "Dementie handled by Urologie" from brochure co-occurrence
  • Hub node inflation: Spoedgevallen linked to 244 relationships via co-occurrence
  • Orphan doctors: 53 doctors with no WORKS_IN relationships
  • Meaning drift: passing mentions in brochures were being turned into structural relationships

Decision

Three-Phase Pipeline

  1. Scrape golden pages → PostgreSQL taxonomy (scrape_taxonomy)
  2. Seed Neo4j from taxonomy (seed_golden_graph) — creates all authoritative entities
  3. Gate content ingestion — source page type determines what can be created/linked

Source Page Type Classification

Every page gets classified by classify_page_type() into one of:

  • GOLDEN_SEED — taxonomy seeder (highest authority)
  • GOLDEN_LISTING — /zol-artsen listing pages
  • DEPARTMENT_PAGE — /medische-diensten, /raadplegingen/{dept}
  • ZORGAANBOD — /ziekte-en-zorg, /ziektebeeld/, /behandeling/
  • DOCTOR_PAGE — /zol-arts/{slug} (skipped: lorem ipsum on dev site)
  • BROCHURE, NEWS, GENERAL, UNKNOWN
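A minimal sketch of how the URL-based part of this classification could look, assuming the path prefixes listed above are matched against the start of the URL path. The real signature of classify_page_type() is not shown in this ADR, so the `url` parameter, the `PageType` enum, and the `_PREFIX_RULES` table are illustrative assumptions. Detection rules for BROCHURE and NEWS are not documented here and are left out; GOLDEN_SEED is assumed to be assigned by the seeder itself rather than derived from a URL.

```python
from enum import Enum
from urllib.parse import urlparse


class PageType(Enum):
    GOLDEN_SEED = "golden_seed"
    GOLDEN_LISTING = "golden_listing"
    DEPARTMENT_PAGE = "department_page"
    ZORGAANBOD = "zorgaanbod"
    DOCTOR_PAGE = "doctor_page"
    BROCHURE = "brochure"
    NEWS = "news"
    GENERAL = "general"
    UNKNOWN = "unknown"


# (path prefix, page type) pairs, checked in order.
# Prefixes are taken from the list above; the table itself is hypothetical.
_PREFIX_RULES = [
    ("/zol-artsen", PageType.GOLDEN_LISTING),
    ("/zol-arts/", PageType.DOCTOR_PAGE),
    ("/medische-diensten", PageType.DEPARTMENT_PAGE),
    ("/raadplegingen/", PageType.DEPARTMENT_PAGE),
    ("/ziekte-en-zorg", PageType.ZORGAANBOD),
    ("/ziektebeeld/", PageType.ZORGAANBOD),
    ("/behandeling/", PageType.ZORGAANBOD),
]


def classify_page_type(url: str) -> PageType:
    """Classify a page by URL path prefix (assumed behaviour, not the real code)."""
    path = urlparse(url).path.lower()
    if not path:
        # No path to inspect, so no classification is possible.
        return PageType.UNKNOWN
    for prefix, page_type in _PREFIX_RULES:
        if path.startswith(prefix):
            return page_type
    # BROCHURE and NEWS rules are not documented in this ADR, so every
    # unmatched path falls back to GENERAL in this sketch.
    return PageType.GENERAL
```

Because "/zol-arts/" includes the trailing slash, it cannot accidentally match the "/zol-artsen" listing pages, whichever rule is checked first.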

Creation Gating

| Entity Type | Allowed Sources |
| --- | --- |
| Doctor (new) | golden_seed, golden_listing only |
| Department (new) | golden_seed, golden_listing, department_page |
| Treatment/Condition/etc. | Any source (relaxed) |
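The creation rules above can be read as a small lookup table. The following is a sketch only: `CREATION_ALLOWED` and `may_create_entity` are hypothetical names, and page types are assumed to be passed as the lowercase strings used in the table.

```python
# Entity types with restricted creation, mapped to the source page types
# that may create them. Types not listed here are "relaxed" (any source).
CREATION_ALLOWED = {
    "Doctor": frozenset({"golden_seed", "golden_listing"}),
    "Department": frozenset({"golden_seed", "golden_listing", "department_page"}),
}


def may_create_entity(entity_type: str, source_page_type: str) -> bool:
    """Return True if a page of this type may create a new entity of this type."""
    allowed = CREATION_ALLOWED.get(entity_type)
    if allowed is None:
        # Treatment, Condition, etc.: creation is allowed from any source.
        return True
    return source_page_type in allowed
```

For example, `may_create_entity("Doctor", "brochure")` is False, while `may_create_entity("Treatment", "brochure")` is True.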

Relationship Gating

| Relationship | Allowed Sources |
| --- | --- |
| WORKS_IN, WORKS_AT_CAMPUS | golden_seed, golden_listing |
| HANDLES, OFFERS, PERFORMS | golden_seed, golden_listing, plus department_page, zorgaanbod, raadpleging |
| TREATS, RELATED_TO, etc. | Any source |
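As an illustration of how this gating might be applied to extracted candidates, the sketch below filters (subject, relationship, object) triples by the source page type. The function names, the triple shape, and the example entities are assumptions made for illustration, not the project's actual code.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relationship type, object)

STRUCTURAL_SOURCES = frozenset({"golden_seed", "golden_listing"})
SERVICE_SOURCES = STRUCTURAL_SOURCES | {"department_page", "zorgaanbod", "raadpleging"}

# Relationship types with restricted sources; anything else is unrestricted.
RELATIONSHIP_ALLOWED = {
    "WORKS_IN": STRUCTURAL_SOURCES,
    "WORKS_AT_CAMPUS": STRUCTURAL_SOURCES,
    "HANDLES": SERVICE_SOURCES,
    "OFFERS": SERVICE_SOURCES,
    "PERFORMS": SERVICE_SOURCES,
}


def is_relationship_allowed(rel_type: str, source_page_type: str) -> bool:
    """Return True if this page type may assert this relationship type."""
    allowed = RELATIONSHIP_ALLOWED.get(rel_type)
    return allowed is None or source_page_type in allowed


def gate_relationships(
    candidates: Iterable[Triple], source_page_type: str
) -> Tuple[List[Triple], List[Triple]]:
    """Split candidate triples into (kept, dropped) for the given source."""
    kept: List[Triple] = []
    dropped: List[Triple] = []
    for triple in candidates:
        if is_relationship_allowed(triple[1], source_page_type):
            kept.append(triple)
        else:
            dropped.append(triple)
    return kept, dropped
```

Under these rules a brochure that mentions both Urologie and Dementie can no longer produce a HANDLES edge between them: `gate_relationships([("Urologie", "HANDLES", "Dementie")], "brochure")` keeps nothing and drops the triple.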

Provenance Split

SOURCED_FROM is split into:

  • CURATED_FROM — from golden/department pages (strong provenance)
  • MENTIONED_IN — from brochures/news/general pages (weak provenance)
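A minimal sketch of the split, assuming "golden/department pages" means the golden_seed, golden_listing, and department_page types. This ADR does not say which provenance applies to the remaining page types, so the sketch returns None for them rather than guessing; the function name is hypothetical.

```python
from typing import Optional

# Assumed reading of "golden/department pages".
CURATED_SOURCES = frozenset({"golden_seed", "golden_listing", "department_page"})
# Page types the ADR names explicitly as weak provenance.
MENTIONED_SOURCES = frozenset({"brochure", "news", "general"})


def provenance_relationship(source_page_type: str) -> Optional[str]:
    """Pick the provenance relationship type for a source page type."""
    if source_page_type in CURATED_SOURCES:
        return "CURATED_FROM"
    if source_page_type in MENTIONED_SOURCES:
        return "MENTIONED_IN"
    # Not specified in this ADR (e.g. zorgaanbod, doctor_page, unknown).
    return None
```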

Consequences

Positive

  • Zero phantom HANDLES/OFFERS relationships from brochures
  • All doctors pre-populated with correct WORKS_IN from golden pages
  • Clear provenance trail (CURATED_FROM vs MENTIONED_IN)
  • Content pages (brochures, news, general) can only link to existing doctors and departments, not create new ones

Negative

  • Requires running seeder before ingestion (additional pipeline step)
  • New entity types not in taxonomy need manual golden page addition
  • Consultation schedule scraping adds ~65 HTTP requests to scrape phase

Update (2026-03-09): Hub/Detail Reclassification

The original 8-type classification described above was replaced with binary hub/detail classification. See: docs/plans/2026-03-09-golden-page-reclassification-design.md

  • Hub: Navigational listing pages (~20-40 per hospital), auto-detected by LLM classifier
  • Detail: Single-entity pages, stored but not treated as golden
  • YAML no longer contains departments, doctors, or golden page URLs
  • Migration 045 maps existing 8-type values to hub/detail
  • The creation gating and relationship gating tables above are superseded by the simpler hub/detail model
  • classify_page_type() now returns hub or detail instead of the 8-type enum