This ADR was written when the system used Neo4j for entity storage. As of March 2026, Neo4j has been fully removed and replaced by PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The decision rationale documented here remains valid; the storage layer has changed.
ADR-0028: Golden-Page-Seeded Taxonomy-Gated Ingestion
Superseded (2026-03-09): The 8-type classification system described below has been replaced with binary hub/detail classification. See the hub/detail reclassification plan for details.
Date: 2026-02-13 | Status: Accepted
Context
The ZOL RAG knowledge graph (Hogan et al., 2021) had critical quality issues traced to a single root cause: entities were extracted bottom-up from unstructured content (brochures, news pages) instead of being seeded top-down from authoritative golden pages. This caused:
- Phantom relationships: "Dementie handled by Urologie" from brochure co-occurrence
- Hub node inflation: Spoedgevallen linked to 244 relationships via co-occurrence
- Orphan doctors: 53 doctors with no WORKS_IN relationships
- Meaning drift: Brochure mentions were creating structural relationships
Decision
Three-Phase Pipeline
- Scrape golden pages → PostgreSQL taxonomy (scrape_taxonomy)
- Seed Neo4j from taxonomy (seed_golden_graph): creates all authoritative entities
- Gate content ingestion: source page type determines what can be created/linked
Source Page Type Classification
Every page gets classified by classify_page_type() into one of:
- GOLDEN_SEED: taxonomy seeder (highest authority)
- GOLDEN_LISTING: /zol-artsen listing pages
- DEPARTMENT_PAGE: /medische-diensten, /raadplegingen/{dept}
- ZORGAANBOD: /ziekte-en-zorg, /ziektebeeld/, /behandeling/
- DOCTOR_PAGE: /zol-arts/{slug} (skipped: lorem ipsum on dev site)
- BROCHURE, NEWS, GENERAL, UNKNOWN
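The URL patterns listed above suggest a prefix-based classifier. The following is a minimal sketch under that assumption, not the project's implementation: the function name classify_by_url_prefix, the PageType enum, and the rule ordering are all hypothetical, and the real signature of classify_page_type() is not documented here. GOLDEN_SEED, BROCHURE, and NEWS are never returned by this sketch because the ADR gives no URL patterns for them.

```python
from enum import Enum
from urllib.parse import urlparse


class PageType(Enum):
    """Hypothetical enum mirroring the page types named in this ADR."""
    GOLDEN_SEED = "golden_seed"
    GOLDEN_LISTING = "golden_listing"
    DEPARTMENT_PAGE = "department_page"
    ZORGAANBOD = "zorgaanbod"
    DOCTOR_PAGE = "doctor_page"
    BROCHURE = "brochure"
    NEWS = "news"
    GENERAL = "general"
    UNKNOWN = "unknown"


# Ordered (path prefix, page type) rules; the first match wins.
# Prefixes are taken from the list above; the ordering is an assumption.
_PREFIX_RULES = [
    ("/zol-artsen", PageType.GOLDEN_LISTING),
    ("/zol-arts/", PageType.DOCTOR_PAGE),
    ("/medische-diensten", PageType.DEPARTMENT_PAGE),
    ("/raadplegingen/", PageType.DEPARTMENT_PAGE),
    ("/ziekte-en-zorg", PageType.ZORGAANBOD),
    ("/ziektebeeld/", PageType.ZORGAANBOD),
    ("/behandeling/", PageType.ZORGAANBOD),
]


def classify_by_url_prefix(url: str) -> PageType:
    """Classify a page from its URL path alone (illustrative only)."""
    path = urlparse(url).path
    if not path:
        # No path to inspect, so the type cannot be determined.
        return PageType.UNKNOWN
    for prefix, page_type in _PREFIX_RULES:
        if path.startswith(prefix):
            return page_type
    # Assumption: anything unmatched is treated as a general content page.
    return PageType.GENERAL
```

A table of ordered rules keeps the classifier easy to audit against the list above: adding a page type means adding one row rather than another branch.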
Creation Gating
| Entity Type | Allowed Sources |
|---|---|
| Doctor (new) | golden_seed, golden_listing only |
| Department (new) | golden_seed, golden_listing, department_page |
| Treatment/Condition/etc. | Any source (relaxed) |
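The creation rules in the table above can be expressed as a lookup from entity type to permitted source types. This is a sketch only: the names CREATION_ALLOWED and may_create are hypothetical, and the use of lowercase strings for source types is assumed from the table's spelling.

```python
# Entity types that may only be created from specific source page types.
# Contents are copied from the creation gating table; names are hypothetical.
CREATION_ALLOWED = {
    "Doctor": frozenset({"golden_seed", "golden_listing"}),
    "Department": frozenset({"golden_seed", "golden_listing", "department_page"}),
}


def may_create(entity_type: str, source_type: str) -> bool:
    """Return True if a new entity of this type may be created from this source."""
    allowed = CREATION_ALLOWED.get(entity_type)
    # Types without an entry (Treatment, Condition, etc.) are relaxed:
    # any source may create them.
    return allowed is None or source_type in allowed


print(may_create("Doctor", "brochure"))           # False
print(may_create("Department", "department_page"))  # True
print(may_create("Treatment", "brochure"))        # True
```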
Relationship Gating
| Relationship | Allowed Sources |
|---|---|
| WORKS_IN, WORKS_AT_CAMPUS | golden_seed, golden_listing |
| HANDLES, OFFERS, PERFORMS | golden_seed, golden_listing, department_page, zorgaanbod, raadpleging |
| TREATS, RELATED_TO, etc. | Any source |
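Relationship gating follows the same pattern as creation gating. The sketch below assumes that the HANDLES/OFFERS/PERFORMS row extends the golden sources of the row above it; RELATIONSHIP_ALLOWED and may_link are hypothetical names, not identifiers from the codebase.

```python
# Source sets taken from the relationship gating table.
GOLDEN_SOURCES = frozenset({"golden_seed", "golden_listing"})
# Assumption: the second row adds to the golden sources rather than replacing them.
CURATED_SOURCES = GOLDEN_SOURCES | {"department_page", "zorgaanbod", "raadpleging"}

RELATIONSHIP_ALLOWED = {
    "WORKS_IN": GOLDEN_SOURCES,
    "WORKS_AT_CAMPUS": GOLDEN_SOURCES,
    "HANDLES": CURATED_SOURCES,
    "OFFERS": CURATED_SOURCES,
    "PERFORMS": CURATED_SOURCES,
}


def may_link(relationship: str, source_type: str) -> bool:
    """Return True if this relationship may be created from this source type."""
    allowed = RELATIONSHIP_ALLOWED.get(relationship)
    # Relationships without an entry (TREATS, RELATED_TO, etc.) accept any source.
    return allowed is None or source_type in allowed


# A brochure can no longer produce a structural HANDLES relationship,
# which is the mechanism behind the phantom relationships described in Context.
print(may_link("HANDLES", "brochure"))         # False
print(may_link("HANDLES", "department_page"))  # True
print(may_link("TREATS", "brochure"))          # True
```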
Provenance Split
SOURCED_FROM split into:
- CURATED_FROM: from golden/department pages (strong provenance)
- MENTIONED_IN: from brochures/news/general pages (weak provenance)
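As an illustration of the split, the provenance relationship could be chosen from the source page type. This sketch makes two assumptions that the ADR does not confirm: that "golden/department pages" means exactly golden_seed, golden_listing, and department_page, and that every other source type falls back to the weak relationship. The names STRONG_SOURCES and provenance_relationship are hypothetical.

```python
# Assumed set of source types that count as strong provenance.
STRONG_SOURCES = frozenset({"golden_seed", "golden_listing", "department_page"})


def provenance_relationship(source_type: str) -> str:
    """Pick the provenance relationship name for a given source page type."""
    # Assumption: anything not explicitly strong is recorded as a weak mention.
    return "CURATED_FROM" if source_type in STRONG_SOURCES else "MENTIONED_IN"


print(provenance_relationship("department_page"))  # CURATED_FROM
print(provenance_relationship("brochure"))         # MENTIONED_IN
```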
Consequences
Positive
- Zero phantom HANDLES/OFFERS relationships from brochures
- All doctors pre-populated with correct WORKS_IN from golden pages
- Clear provenance trail (CURATED_FROM vs MENTIONED_IN)
- Content pages can only link to existing entities, not create new ones
Negative
- Requires running seeder before ingestion (additional pipeline step)
- New entity types not in taxonomy need manual golden page addition
- Consultation schedule scraping adds ~65 HTTP requests to scrape phase
Update (2026-03-09): Hub/Detail Reclassification
The original 8-type classification described above was replaced with binary hub/detail classification.
See: docs/plans/2026-03-09-golden-page-reclassification-design.md
- Hub: Navigational listing pages (~20-40 per hospital) -- auto-detected by LLM classifier
- Detail: Single-entity pages -- stored but not treated as golden
- YAML no longer contains departments, doctors, or golden page URLs
- Migration 045 maps existing 8-type values to hub/detail
- The creation gating and relationship gating tables above are superseded by the simpler hub/detail model
- classify_page_type() now returns hub or detail instead of the 8-type enum