Golden Pages
The single most confusing thing about this part of the system is that one idea has three names that accreted over time. This page reconciles them up front so the rest makes sense:
| Name you'll see | Where it lives | What it actually means |
|---|---|---|
| Golden page | prose, this page | The concept: a page the hospital itself publishes that authoritatively lists its own entities (all cardiologists, all departments). The "ground truth" we trust. |
| `hub` / `detail` | the `page_type` column of `app.golden_pages` | The classification. A golden page is either a `hub` (a navigational listing, the trust anchor) or a `detail` (a single-entity page). |
| `GOLDEN_SEED` | `source_page_type` on seeded entities | The provenance stamp. Marks an entity as created top-down by the seeder from a golden page, not extracted bottom-up from prose. |
Throughout this page, "golden page" = a hub page unless stated otherwise. The legacy 8-type vocabulary (GOLDEN_LISTING, DEPARTMENT_PAGE, …) was collapsed into binary hub/detail by migration 045 — see ADR-0028.
What problem do golden pages solve?
The ZOL taxonomy is a graph of entities (doctors, departments, conditions, treatments) connected by relationships (Dr. Houben WORKS_IN Neurologie; Cardiologie HANDLES hartfalen). The quality of every downstream answer — entity resolution, department routing, graph-RAG — depends on those relationships being true.
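There are two fundamentally different ways to build that graph: bottom-up, by extracting entities from unstructured prose, or top-down, by seeding them from pages the hospital publishes as lists. The difference between the two is the whole story.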
When entities were extracted bottom-up from unstructured content, mere co-occurrence in a brochure forged structural relationships. The symptoms (documented in ADR-0028) were concrete and severe:
- Phantom relationships: "Dementie `HANDLED_BY` Urologie" because the two words appeared in the same leaflet.
- Hub-node inflation: Spoedgevallen (Emergency) accrued 244 relationships purely from co-occurrence.
- Orphan doctors: 53 doctors with no `WORKS_IN` because their department was never named in the same paragraph.
- Meaning drift: a passing brochure mention became a load-bearing structural edge.
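To make the phantom-relationship failure concrete, here is a toy sketch of co-occurrence linking. It is not the project's extractor: the function name `cooccurrence_edges` and the sample leaflet text are hypothetical, chosen only to show how sharing a document is enough to produce a structural edge.

```python
from itertools import product


def cooccurrence_edges(text, conditions, departments):
    """Naive bottom-up linking (hypothetical illustration).

    Pairs every known condition with every known department that is
    mentioned anywhere in the same document.
    """
    lowered = text.lower()
    found_conditions = [c for c in conditions if c.lower() in lowered]
    found_departments = [d for d in departments if d.lower() in lowered]
    return [(c, "HANDLED_BY", d) for c, d in product(found_conditions, found_departments)]


# Invented leaflet text: the two terms are unrelated but share a document.
leaflet = "Visiting hours for Urologie. Volunteers support patients with dementie."
edges = cooccurrence_edges(leaflet, ["Dementie"], ["Urologie", "Cardiologie"])
print(edges)
```

Nothing in the leaflet claims that Urologie treats dementia, yet the edge is created anyway. Golden-page seeding exists to rule out exactly this kind of inference.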
Golden-page seeding is the cure: define the authoritative entity set first, top-down, from pages the hospital publishes specifically to enumerate its own doctors and departments. Everything else is then gated against that ground truth.
What makes a page "golden"?
A golden page is a navigational listing page — a page whose job is to enumerate entities, not describe one. Concretely, for ZOL:
| Example page | Enumerates | Classified |
|---|---|---|
| `/zol-artsen` (doctor directory) | every doctor | `hub` |
| `/medische-diensten` (department index) | every department | `hub` |
| `/raadplegingen/{dept}` (consultation listing) | doctors + schedule for one department | `hub` |
| `/zol-arts/{slug}` (one doctor's page) | a single doctor | `detail` |
| `/ziektebeeld/{condition}` (one condition) | a single condition | `detail` |
The distinction is structural authority, not topic: a hub page is the hospital's own canonical list, so the entities on it — and the relationships implied by their grouping (this doctor appears under this department ⇒ WORKS_IN) — are trustworthy by construction. A detail page is about one entity; it is stored and retrievable, but it is not allowed to mint new entities or structural edges into the taxonomy.
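The "grouping implies relationship" rule can be sketched as follows. This is a minimal illustration, assuming a hub page has already been parsed into a mapping from department to the doctors listed under it; the function name `works_in_edges` and that input shape are hypothetical, not the seeder's actual interface.

```python
def works_in_edges(hub_listing):
    """Derive WORKS_IN edges from a hub page's grouping.

    hub_listing maps a department name to the list of doctors that the
    hub page shows under that department (assumed input shape).
    """
    edges = []
    for department, doctors in hub_listing.items():
        for doctor in doctors:
            edges.append((doctor, "WORKS_IN", department))
    return edges


# A department with no listed doctors simply yields no edges.
listing = {"Neurologie": ["Dr. Houben"], "Cardiologie": []}
print(works_in_edges(listing))
```

The edge comes from the page's layout rather than from any sentence on it, which is why a hub can be trusted where prose cannot.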
Where golden pages live: app.golden_pages
Golden pages are a first-class, DB-backed concept — not a config constant. They are rows in app.golden_pages, each carrying a page_type constrained to 'hub' or 'detail' (migration 045), plus a confirmation status. Hub URLs are no longer hand-listed in zol.yaml — they are discovered automatically and then confirmed by an operator.
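As a rough mental model, a row in `app.golden_pages` might be represented like this in Python. Only `page_type` and its two allowed values come from this page; the class name `GoldenPageRow`, the `url` and `status` field names, and the three status values are assumptions made for illustration.

```python
from dataclasses import dataclass

ALLOWED_PAGE_TYPES = {"hub", "detail"}  # mirrors the migration 045 constraint
ALLOWED_STATUSES = {"candidate", "confirmed", "rejected"}  # assumed values


@dataclass(frozen=True)
class GoldenPageRow:
    url: str
    page_type: str
    status: str = "candidate"

    def __post_init__(self):
        # Reject legacy or misspelled types the same way a DB constraint would.
        if self.page_type not in ALLOWED_PAGE_TYPES:
            raise ValueError(f"invalid page_type: {self.page_type!r}")
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"invalid status: {self.status!r}")

    @property
    def is_trust_anchor(self):
        """True only for a hub that an operator has confirmed."""
        return self.page_type == "hub" and self.status == "confirmed"


row = GoldenPageRow("/zol-artsen", "hub", "confirmed")
print(row.is_trust_anchor)
```

A legacy value such as `GOLDEN_LISTING` would raise `ValueError` here, much as the database constraint would reject it on insert.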
The hub-detection lifecycle
- Pre-filter: a content-quality gate demotes pages with insufficient content before classification, so thin pages can't masquerade as hubs (this replaced the old URL-pattern heuristic, which was a false-positive source).
- AI classify: an LLM (`hub_detection/ai_classifier.py`) labels each surviving page `hub` or `detail` and proposes the entity types it lists.
- Store candidates: results land in `app.golden_pages` as candidates.
- Human confirm/reject: an operator confirms, rejects, or manually designates URLs via `HubPageService` (surfaced in the Pipeline Wizard). Confirmation is the trust gate: a page is only golden once a human says so.
This AI-proposes / human-confirms loop is the deliberate design choice — it keeps the authoritative entity set auditable without forcing operators to hand-maintain URL lists.
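The four lifecycle stages can be sketched end to end as below. This is a simplified model under stated assumptions, not the real `hub_detection` code: the length threshold, the `Candidate` class, and the function names are hypothetical, the classifier is passed in as a plain callable so a stub can stand in for the LLM, and "demotion" is modelled here as simply skipping the page.

```python
from dataclasses import dataclass

MIN_CONTENT_CHARS = 200  # hypothetical content-quality threshold


@dataclass
class Candidate:
    url: str
    page_type: str       # "hub" or "detail", as proposed by the classifier
    entity_types: list   # entity types the classifier thinks the page lists
    status: str = "candidate"


def passes_prefilter(text, min_chars=MIN_CONTENT_CHARS):
    """Stage 1: thin pages never reach the classifier."""
    return len(text.strip()) >= min_chars


def detect_candidates(pages, classify):
    """Stages 2 and 3: classify surviving pages and store them as candidates.

    pages maps url -> page text; classify(url, text) returns
    (page_type, entity_types) and stands in for the LLM call.
    """
    candidates = {}
    for url, text in pages.items():
        if not passes_prefilter(text):
            continue
        page_type, entity_types = classify(url, text)
        candidates[url] = Candidate(url, page_type, list(entity_types))
    return candidates


def confirm(candidates, url):
    """Stage 4: an operator's decision is what makes a page golden."""
    candidates[url].status = "confirmed"


def confirmed_hubs(candidates):
    return [c.url for c in candidates.values()
            if c.page_type == "hub" and c.status == "confirmed"]


def stub_classifier(url, text):
    return ("hub", ["doctor"])


pages = {"/zol-artsen": "x" * 300, "/thin-page": "too short"}
candidates = detect_candidates(pages, stub_classifier)
before = confirmed_hubs(candidates)
confirm(candidates, "/zol-artsen")
after = confirmed_hubs(candidates)
print(before, after)
```

Note that a page classified `hub` contributes nothing until `confirm` is called; the classifier's label alone never grants trust.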
How golden pages seed the taxonomy
Confirmed hubs are consumed top-down by the GoldenPageSeeder (graph/taxonomy/golden_page_seeder.py) during Phase 2 of the seeding pipeline. Entities it creates are stamped with source_page_type = GOLDEN_SEED and a synthetic source ID (GOLDEN_SEED_PAGE_ID), so their golden provenance is permanent and queryable.
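The provenance stamp can be pictured as two fields set on every seeded entity. The sketch below is illustrative only: the `SeededEntity` class, the `source_page_id` field name, and in particular the placeholder value given to `GOLDEN_SEED_PAGE_ID` are assumptions, since the real constant's value is not stated on this page.

```python
from dataclasses import dataclass

GOLDEN_SEED = "GOLDEN_SEED"
# Placeholder only: the real value of GOLDEN_SEED_PAGE_ID is not given here.
GOLDEN_SEED_PAGE_ID = "golden-seed-placeholder"


@dataclass(frozen=True)
class SeededEntity:
    name: str
    entity_type: str
    source_page_type: str = GOLDEN_SEED
    source_page_id: str = GOLDEN_SEED_PAGE_ID  # hypothetical field name


def seed_from_hub(scraped_names, entity_type):
    """Create one stamped entity per name actually scraped from a hub."""
    return [SeededEntity(name=n, entity_type=entity_type) for n in scraped_names]


def golden_seeded(entities):
    """Query by provenance: keep only entities created by the seeder."""
    return [e for e in entities if e.source_page_type == GOLDEN_SEED]


entities = seed_from_hub(["Dr. Houben"], "doctor")
print(golden_seeded(entities))
```

Because the stamp is stored on the entity itself, "which entities came from golden pages?" becomes a simple filter rather than a reconstruction from logs.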
Crucially, the seeder only seeds entities that were actually scraped from golden pages — it does not invent. Relationships derived from a hub's structure (a doctor listed under a department ⇒ WORKS_IN) are written with confidence = 1.0, the highest tier. By contrast, SNOMED-derived relationships enter at confidence = 0.7, and curated overrides win over both (see the Three-Source merge priority: Curated > Scraped > SNOMED).
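The priority order Curated > Scraped > SNOMED can be illustrated with a small merge function. This is a sketch under assumptions, not the pipeline's merge code: the claim dictionaries, the `merge_claims` name, and the choice to merge per (subject, predicate, object) triple are hypothetical. The confidence values 1.0 and 0.7 are the ones quoted above.

```python
# Higher number wins when two sources assert the same relationship.
SOURCE_PRIORITY = {"snomed": 0, "scraped": 1, "curated": 2}


def merge_claims(claims):
    """Keep one claim per triple, preferring the higher-priority source."""
    best = {}
    for claim in claims:
        key = (claim["subject"], claim["predicate"], claim["object"])
        current = best.get(key)
        if current is None or (
            SOURCE_PRIORITY[claim["source"]] > SOURCE_PRIORITY[current["source"]]
        ):
            best[key] = claim
    return list(best.values())


claims = [
    {"subject": "Cardiologie", "predicate": "HANDLES", "object": "hartfalen",
     "source": "snomed", "confidence": 0.7},
    {"subject": "Cardiologie", "predicate": "HANDLES", "object": "hartfalen",
     "source": "scraped", "confidence": 1.0},
]
merged = merge_claims(claims)
print(merged)
```

When the same relationship arrives from both a golden page and SNOMED, the scraped claim and its confidence of 1.0 are what survive.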
Provenance split: CURATED_FROM vs MENTIONED_IN
Golden-page provenance is encoded directly on the edges, so every relationship carries how much we trust it:
| Provenance | Source | Strength |
|---|---|---|
| `CURATED_FROM` | a golden/hub page | Strong: structural, from the hospital's own list |
| `MENTIONED_IN` | a brochure / news / general page | Weak: a mention, not a structural claim |
This is what makes the graph auditable: you can always ask "is this edge curated or merely mentioned?" and weight it accordingly.
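One way to act on that question is sketched below. The numeric weights are hypothetical (this page only says "strong" and "weak"), as are the edge dictionaries and function names.

```python
# Hypothetical weights: only the strong/weak ordering comes from the table.
PROVENANCE_WEIGHT = {"CURATED_FROM": 1.0, "MENTIONED_IN": 0.3}


def edge_weight(edge):
    return PROVENANCE_WEIGHT[edge["provenance"]]


def curated_only(edges):
    """Audit view: keep only edges backed by a golden/hub page."""
    return [e for e in edges if e["provenance"] == "CURATED_FROM"]


def rank_by_trust(edges):
    """Order edges so structurally curated ones come first."""
    return sorted(edges, key=edge_weight, reverse=True)


edges = [
    {"subject": "Dementie", "predicate": "HANDLED_BY", "object": "Urologie",
     "provenance": "MENTIONED_IN"},
    {"subject": "Dr. Houben", "predicate": "WORKS_IN", "object": "Neurologie",
     "provenance": "CURATED_FROM"},
]
ranked = rank_by_trust(edges)
print(ranked[0]["subject"])
```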
The gate that enforces it all: graph_golden_only
The discipline is enforced by one setting — graph_golden_only in config.py, default True:
| Page type | Can write entities/structural edges to the taxonomy? |
|---|---|
| `hub` (golden) | Always: defines the authoritative entity set |
| `detail` | Only when `graph_golden_only = False` |
With the default in force, ordinary content ingestion cannot pollute the taxonomy. Crawled brochures still produce page summaries for contextual retrieval and remain fully searchable, but their entity data is written to doc_metadata only — never to taxonomy_entities/taxonomy_relationships. (The GoldenPageSeeder itself always has full write authority regardless of the flag.)
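The gate's decision table is small enough to express directly. The following is a sketch of the rule as described above, not the code in `config.py` or the ingestion path; the function names and the `is_seeder` parameter are hypothetical.

```python
def can_write_taxonomy(page_type, graph_golden_only=True, is_seeder=False):
    """Return True if this writer may create taxonomy entities and edges."""
    if is_seeder:
        return True  # the seeder has full write authority regardless of the flag
    if page_type == "hub":
        return True  # hubs define the authoritative entity set
    return not graph_golden_only  # detail pages only when the gate is off


def entity_target(page_type, graph_golden_only=True):
    """Where ordinary ingestion stores entity data for a page."""
    if can_write_taxonomy(page_type, graph_golden_only):
        return "taxonomy"
    return "doc_metadata"


print(entity_target("detail"))                           # gate on (default)
print(entity_target("detail", graph_golden_only=False))  # gate off
```

With the default setting, entity data from a detail page is routed to `doc_metadata`, so the page stays searchable without touching the graph.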
How golden pages relate to retrieval
A common misconception worth heading off for an examiner: golden pages are not themselves retrieved at query time. They operate entirely at build time. Their payoff is indirect but decisive:
Because the taxonomy is seeded from ground truth, the query-time subsystems that read it — the taxonomy query-enrichment resolver chain, SNOMED routing, and graph-RAG — resolve "ik zoek een hartdokter" to the real cardiologists in the real Cardiologie department, instead of to a phantom edge a brochure happened to imply. Golden pages buy retrieval correctness, paid for once at seeding time.
Summary
- A golden page is an authoritative listing page the hospital publishes; in the DB it is a row in `app.golden_pages` typed `hub` (trust anchor) or `detail` (single entity).
- They are AI-discovered and human-confirmed, not hand-maintained URL lists, via the hub-detection lifecycle.
- The `GoldenPageSeeder` seeds the taxonomy top-down from confirmed hubs at `confidence = 1.0`, stamping `GOLDEN_SEED` provenance.
- The `graph_golden_only` gate (default `True`) stops ordinary crawled content from writing to the taxonomy, killing the phantom-relationship / hub-inflation / orphan-doctor / meaning-drift failure modes that motivated the design.
- Edges carry `CURATED_FROM` (strong) vs `MENTIONED_IN` (weak) provenance for auditability.
- Golden pages act at build time; their dividend is query-time correctness.
See also
- Seeding Pipeline — the three-phase CLI that scrapes, merges, and seeds from golden pages.
- Medical Knowledge Architecture — the Three-Source design (Scraper / SNOMED / Curated) and its priority order.
- Pipeline Wizard — the operator UI for confirming hub pages.
- ADR-0028: Golden-Page Taxonomy Gating — the original decision and the 8-type → hub/detail history.
- SNOMED CT Terminology — Source 2, which enriches the golden-seeded entities.