Golden Pages

One concept, three names

The single most confusing thing about this part of the system is that one idea has three names that accreted over time. This page reconciles them up front so the rest makes sense:

| Name you'll see | Where it lives | What it actually means |
| --- | --- | --- |
| Golden page | prose, this page | The concept: a page the hospital itself publishes that authoritatively lists its own entities (all cardiologists, all departments). The "ground truth" we trust. |
| hub / detail | the page_type column of app.golden_pages | The classification. A golden page is either a hub (a navigational listing, the trust anchor) or a detail (a single-entity page). |
| GOLDEN_SEED | source_page_type on seeded entities | The provenance stamp. Marks an entity as created top-down by the seeder from a golden page, not extracted bottom-up from prose. |

Throughout this page, "golden page" = a hub page unless stated otherwise. The legacy 8-type vocabulary (GOLDEN_LISTING, DEPARTMENT_PAGE, …) was collapsed into binary hub/detail by migration 045 — see ADR-0028.

What problem do golden pages solve?

The ZOL taxonomy is a graph of entities (doctors, departments, conditions, treatments) connected by relationships (Dr. Houben WORKS_IN Neurologie; Cardiologie HANDLES hartfalen). The quality of every downstream answer — entity resolution, department routing, graph-RAG — depends on those relationships being true.

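There are two fundamentally different ways to build that graph: bottom-up, by extracting entities and relationships from unstructured content, or top-down, by seeding them from pages the hospital publishes to enumerate its own entities. The difference is the whole story.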

When entities were extracted bottom-up from unstructured content, mere co-occurrence in a brochure forged structural relationships. The symptoms (documented in ADR-0028) were concrete and severe:

  • Phantom relationships: "Dementie HANDLED_BY Urologie" because the two words appeared in the same leaflet.
  • Hub-node inflation: Spoedgevallen (Emergency) accrued 244 relationships purely from co-occurrence.
  • Orphan doctors: 53 doctors with no WORKS_IN because their department was never named in the same paragraph.
  • Meaning drift: a passing brochure mention became a load-bearing structural edge.

Golden-page seeding is the cure: define the authoritative entity set first, top-down, from pages the hospital publishes specifically to enumerate its own doctors and departments. Everything else is then gated against that ground truth.

What makes a page "golden"?

A golden page is a navigational listing page — a page whose job is to enumerate entities, not describe one. Concretely, for ZOL:

| Example page | Enumerates | Classified |
| --- | --- | --- |
| /zol-artsen (doctor directory) | every doctor | hub |
| /medische-diensten (department index) | every department | hub |
| /raadplegingen/{dept} (consultation listing) | doctors + schedule for one department | hub |
| /zol-arts/{slug} (one doctor's page) | a single doctor | detail |
| /ziektebeeld/{condition} (one condition) | a single condition | detail |

The distinction is structural authority, not topic: a hub page is the hospital's own canonical list, so the entities on it — and the relationships implied by their grouping (this doctor appears under this department ⇒ WORKS_IN) — are trustworthy by construction. A detail page is about one entity; it is stored and retrievable, but it is not allowed to mint new entities or structural edges into the taxonomy.

Where golden pages live: app.golden_pages

Golden pages are a first-class, DB-backed concept — not a config constant. They are rows in app.golden_pages, each carrying a page_type constrained to 'hub' or 'detail' (migration 045), plus a confirmation status. Hub URLs are no longer hand-listed in zol.yaml — they are discovered automatically and then confirmed by an operator.
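As a rough illustration, one row could be mirrored in application code as below. Only the table name, the page_type column, and its two permitted values come from the description above; the url and status field names, the status vocabulary, and the is_trust_anchor helper are assumptions made for this sketch, not the actual schema.

```python
from dataclasses import dataclass

# The two values that migration 045 permits in page_type (from the text above).
ALLOWED_PAGE_TYPES = {"hub", "detail"}
# Hypothetical status vocabulary; the real column values are not documented here.
ALLOWED_STATUSES = {"candidate", "confirmed", "rejected"}


@dataclass(frozen=True)
class GoldenPageRow:
    """Illustrative in-memory mirror of one app.golden_pages row."""

    url: str
    page_type: str
    status: str = "candidate"

    def __post_init__(self) -> None:
        # Mirror the database constraint so bad values fail early.
        if self.page_type not in ALLOWED_PAGE_TYPES:
            raise ValueError(
                f"page_type must be 'hub' or 'detail', got {self.page_type!r}"
            )
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status {self.status!r}")

    @property
    def is_trust_anchor(self) -> bool:
        # Assumption: only a hub that an operator has confirmed is trusted.
        return self.page_type == "hub" and self.status == "confirmed"
```

A legacy label such as GOLDEN_LISTING would be rejected by this model, which matches the intent of collapsing the old vocabulary into hub/detail.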

The hub-detection lifecycle

  1. Pre-filter — a content-quality gate demotes pages with insufficient content before classification, so thin pages can't masquerade as hubs (this replaced the old URL-pattern heuristic, which was a false-positive source).
  2. AI classify — an LLM (hub_detection/ai_classifier.py) labels each surviving page hub or detail and proposes the entity types it lists.
  3. Store candidates — results land in app.golden_pages as candidates.
  4. Human confirm/reject — an operator confirms, rejects, or manually designates URLs via HubPageService (surfaced in the Pipeline Wizard). Confirmation is the trust gate: a page is only golden once a human says so.

This AI-proposes / human-confirms loop is the deliberate design choice — it keeps the authoritative entity set auditable without forcing operators to hand-maintain URL lists.
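The four steps can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the code in hub_detection/ai_classifier.py or HubPageService: the Candidate class, the MIN_CONTENT_CHARS threshold, the function names, and the choice to model "demotion" as labelling a thin page detail without calling the classifier are all hypothetical. The LLM is represented by a plain callable so the sketch runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

MIN_CONTENT_CHARS = 200  # hypothetical content-quality threshold


@dataclass
class Candidate:
    url: str
    page_type: str  # 'hub' or 'detail'
    status: str = "candidate"  # hypothetical status values


def detect_candidates(
    pages: Dict[str, str], classify: Callable[[str], str]
) -> List[Candidate]:
    """Steps 1-3: pre-filter, classify, and collect candidates."""
    candidates: List[Candidate] = []
    for url, text in pages.items():
        if len(text.strip()) < MIN_CONTENT_CHARS:
            # Step 1: thin pages are demoted and never reach the classifier.
            candidates.append(Candidate(url, "detail"))
            continue
        # Step 2: the classifier (an LLM in the real system) proposes a label.
        label = classify(text)
        # Step 3: the result is stored as a candidate, not yet trusted.
        candidates.append(Candidate(url, label))
    return candidates


def confirm(candidate: Candidate) -> Candidate:
    """Step 4: an operator's explicit confirmation is the trust gate."""
    candidate.status = "confirmed"
    return candidate


def confirmed_hubs(candidates: List[Candidate]) -> List[str]:
    """Only confirmed hubs are handed on to seeding."""
    return [
        c.url
        for c in candidates
        if c.page_type == "hub" and c.status == "confirmed"
    ]
```

Note that confirmed_hubs returns nothing until confirm has been called, however confident the classifier was: a proposal alone never makes a page golden.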

How golden pages seed the taxonomy

Confirmed hubs are consumed top-down by the GoldenPageSeeder (graph/taxonomy/golden_page_seeder.py) during Phase 2 of the seeding pipeline. Entities it creates are stamped with source_page_type = GOLDEN_SEED and a synthetic source ID (GOLDEN_SEED_PAGE_ID), so their golden provenance is permanent and queryable.
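A minimal sketch of the stamping step follows. The GOLDEN_SEED label and the GOLDEN_SEED_PAGE_ID name come from the description above; the numeric value given to GOLDEN_SEED_PAGE_ID, the SeededEntity class, and the seed_entities function are placeholders invented for illustration and do not describe the real GoldenPageSeeder.

```python
from dataclasses import dataclass
from typing import Dict, List

GOLDEN_SEED = "GOLDEN_SEED"
GOLDEN_SEED_PAGE_ID = -1  # placeholder; the real synthetic ID is not given here


@dataclass(frozen=True)
class SeededEntity:
    name: str
    entity_type: str
    source_page_type: str
    source_page_id: int


def seed_entities(scraped: Dict[str, List[str]]) -> List[SeededEntity]:
    """Stamp every scraped entity with golden provenance.

    `scraped` maps an entity type to the names actually found on
    confirmed hub pages. Nothing is created that is not in this input.
    """
    seeded: List[SeededEntity] = []
    for entity_type, names in scraped.items():
        # dict.fromkeys drops duplicate names while keeping first-seen order.
        for name in dict.fromkeys(names):
            seeded.append(
                SeededEntity(name, entity_type, GOLDEN_SEED, GOLDEN_SEED_PAGE_ID)
            )
    return seeded
```

Because the stamp is stored on the entity itself, a later audit can select golden entities with a simple filter on source_page_type.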

Crucially, the seeder only seeds entities that were actually scraped from golden pages — it does not invent. Relationships derived from a hub's structure (a doctor listed under a department ⇒ WORKS_IN) are written with confidence = 1.0, the highest tier. By contrast, SNOMED-derived relationships enter at confidence = 0.7, and curated overrides win over both (see the Three-Source merge priority: Curated > Scraped > SNOMED).
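The sketch below shows one way the hub-derived WORKS_IN edges and the merge priority could fit together. The confidence values 1.0 and 0.7 and the ordering Curated > Scraped > SNOMED come from the paragraph above; the Edge type, the source labels, and both function names are assumptions, and the real merge logic may resolve conflicts differently.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

# Higher number wins: Curated > Scraped > SNOMED (from the text above).
SOURCE_PRIORITY = {"curated": 3, "scraped": 2, "snomed": 1}


class Edge(NamedTuple):
    subject: str
    relation: str
    obj: str
    source: str  # hypothetical labels: 'curated', 'scraped', 'snomed'
    confidence: float


def works_in_edges(hub_listing: Dict[str, List[str]]) -> List[Edge]:
    """A doctor listed under a department on a hub implies WORKS_IN at 1.0."""
    return [
        Edge(doctor, "WORKS_IN", department, "scraped", 1.0)
        for department, doctors in hub_listing.items()
        for doctor in doctors
    ]


def merge_edges(edges: Iterable[Edge]) -> Dict[Tuple[str, str, str], Edge]:
    """Keep, per (subject, relation, object), the edge from the best source."""
    best: Dict[Tuple[str, str, str], Edge] = {}
    for edge in edges:
        key = (edge.subject, edge.relation, edge.obj)
        current = best.get(key)
        if current is None or (
            SOURCE_PRIORITY[edge.source] > SOURCE_PRIORITY[current.source]
        ):
            best[key] = edge
    return best
```

For example, if SNOMED proposes the same edge at 0.7 that a hub listing already implies at 1.0, merge_edges keeps the scraped version; a curated edge for the same triple would replace both.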

Provenance split: CURATED_FROM vs MENTIONED_IN

Golden-page provenance is encoded directly on the edges, so every relationship carries how much we trust it:

| Provenance | Source | Strength |
| --- | --- | --- |
| CURATED_FROM | a golden/hub page | Strong: structural, from the hospital's own list |
| MENTIONED_IN | a brochure / news / general page | Weak: a mention, not a structural claim |

This is what makes the graph auditable: you can always ask "is this edge curated or merely mentioned?" and weight it accordingly.
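To make "weight it accordingly" concrete, here is a small sketch of provenance-aware ranking. The two provenance labels are from the table above; the numeric weights, the ProvenancedEdge type, and the helper names are purely illustrative assumptions, since the actual weighting scheme is not specified on this page.

```python
from typing import List, NamedTuple

# Hypothetical weights; only the strong/weak ordering comes from the text.
PROVENANCE_WEIGHT = {"CURATED_FROM": 1.0, "MENTIONED_IN": 0.2}


class ProvenancedEdge(NamedTuple):
    subject: str
    relation: str
    obj: str
    provenance: str  # 'CURATED_FROM' or 'MENTIONED_IN'


def is_structural(edge: ProvenancedEdge) -> bool:
    """Answer the audit question: curated, or merely mentioned?"""
    return edge.provenance == "CURATED_FROM"


def rank_by_trust(edges: List[ProvenancedEdge]) -> List[ProvenancedEdge]:
    """Order edges so curated ones come before mentions (stable sort)."""
    return sorted(
        edges, key=lambda e: PROVENANCE_WEIGHT[e.provenance], reverse=True
    )
```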

The gate that enforces it all: graph_golden_only

The discipline is enforced by one setting — graph_golden_only in config.py, default True:

| Page type | Can write entities/structural edges to the taxonomy? |
| --- | --- |
| hub (golden) | Always: defines the authoritative entity set |
| detail | Only when graph_golden_only = False |

With the default in force, ordinary content ingestion cannot pollute the taxonomy. Crawled brochures still produce page summaries for contextual retrieval and remain fully searchable, but their entity data is written to doc_metadata only — never to taxonomy_entities/taxonomy_relationships. (The GoldenPageSeeder itself always has full write authority regardless of the flag.)
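The gate reduces to a small decision function. The sketch below encodes the table and the parenthetical about the seeder; the function names and the is_seeder parameter are hypothetical, and the real check in the ingestion code may be structured differently.

```python
def can_write_taxonomy(
    page_type: str, graph_golden_only: bool = True, is_seeder: bool = False
) -> bool:
    """Decide whether a writer may touch the taxonomy tables."""
    if is_seeder:
        # The GoldenPageSeeder has full write authority regardless of the flag.
        return True
    if page_type == "hub":
        # Hubs define the authoritative entity set, so they always pass.
        return True
    # Everything else is allowed only when the gate is switched off.
    return not graph_golden_only


def entity_data_destination(page_type: str, graph_golden_only: bool = True) -> str:
    """Where ordinary ingestion sends entity data for a page."""
    if can_write_taxonomy(page_type, graph_golden_only):
        return "taxonomy_entities/taxonomy_relationships"
    return "doc_metadata"
```

With the default setting, entity_data_destination("detail") yields doc_metadata, so a crawled brochure stays searchable without adding anything to the graph.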

How golden pages relate to retrieval

A common misconception worth heading off for an examiner: golden pages are not themselves retrieved at query time. They operate entirely at build time. Their payoff is indirect but decisive:

Because the taxonomy is seeded from ground truth, the query-time subsystems that read it — the taxonomy query-enrichment resolver chain, SNOMED routing, and graph-RAG — resolve "ik zoek een hartdokter" to the real cardiologists in the real Cardiologie department, instead of to a phantom edge a brochure happened to imply. Golden pages buy retrieval correctness, paid for once at seeding time.

Summary

  • A golden page is an authoritative listing page the hospital publishes; in the DB it is a row in app.golden_pages typed hub (trust anchor) or detail (single entity).
  • They are AI-discovered, human-confirmed — not hand-maintained URL lists — via the hub-detection lifecycle.
  • The GoldenPageSeeder seeds the taxonomy top-down from confirmed hubs at confidence = 1.0, stamping GOLDEN_SEED provenance.
  • The graph_golden_only gate (default True) stops ordinary crawled content from writing to the taxonomy, killing the phantom-relationship / hub-inflation / orphan-doctor / meaning-drift failure modes that motivated the design.
  • Edges carry CURATED_FROM (strong) vs MENTIONED_IN (weak) provenance for auditability.
  • Golden pages act at build time; their dividend is query-time correctness.

See also