Frozen Taxonomy Registry

The FrozenTaxonomyRegistry is a read-only, in-memory entity index loaded from PostgreSQL taxonomy tables at application startup. It provides O(1) lookups for entity resolution, relationship validation, and allowlist filtering. The registry is "frozen" because it represents a point-in-time snapshot of the hospital's entity inventory: an instance is never mutated during system operation, and new data only becomes visible when the registry is rebuilt from a fresh snapshot (after re-running taxonomy extraction or, on the SP-5 path, after a publish or rollback).

Hub/Detail Reclassification (2026-03-09)

The original 8-type page classification (golden_seed, golden_listing, department_page, etc.) was replaced with binary hub/detail classification. Hub pages (~20-40 per hospital) are auto-detected by an LLM classifier. YAML no longer contains departments, doctors, or golden_pages sections -- the DB is the sole source of truth. See docs/plans/2026-03-09-golden-page-reclassification-design.md.

What Problem Does It Solve?

Without the registry, the extraction pipeline would accept any string that matches a doctor name regex as a valid doctor, any department-like pattern as a real department, and any relationship between them as plausible. This produces a noisy graph full of phantom entities -- job titles parsed as doctor names, section headers stored as departments, body parts classified as conditions.

The registry solves this by providing an authoritative inventory of known entities scraped from the hospital's own website. During extraction, every entity is checked against this inventory:

Known entities are accepted and normalised to their canonical names
Unknown entities are either rejected (doctors, departments) or flagged for downstream LLM validation (conditions, treatments, examinations)

Package Architecture

The taxonomy package (app/services/graph/taxonomy/) contains 9 modules:

Module	Purpose
`__init__.py`	Public API: exports frozen models, `ScrapeResult`, registry functions
`models.py`	Frozen dataclasses: `FrozenDoctor`, `FrozenDepartment`, `FrozenCondition`, `FrozenTreatment`, `FrozenExamination`, `ConsultationSchedule`, `ScrapeResult`
`registry.py`	`FrozenTaxonomyRegistry` class + singleton lifecycle (`get_registry`, `initialize_registry`, `clear_registry`)
`allowlist_filter.py`	`AllowlistFilter` -- filters extraction results against the registry
`persistence.py`	`save_scrape_result()` and `load_scrape_result()` -- PostgreSQL persistence
`scraper_base.py`	`TaxonomyScraperBase` -- abstract async HTTP fetcher
`zol_scraper.py`	`ZOLTaxonomyScraper` -- ZOL-specific HTML parser (BeautifulSoup)
`golden_page_config.py`	`GoldenPageSet` dataclass + `ALL_GOLDEN_PAGE_SETS` (historically from YAML; now hub pages are auto-discovered by LLM classifier)
`golden_page_seeder.py`	`GoldenPageSeeder` -- merges three knowledge sources and seeds PostgreSQL taxonomy tables

Data Models

All models use @dataclass(frozen=True) to guarantee immutability:

Model	Key Fields
`FrozenDoctor`	`name`, `departments`, `specialties`, `campuses`, `source_url`, `source_site`
`FrozenDepartment`	`canonical_name`, `aliases`, `campuses`, `domain_group`, `conditions`, `treatments`, `examinations`
`FrozenCondition`	`canonical_name`, `aliases`, `category`
`FrozenTreatment`	`canonical_name`, `aliases`, `category`
`FrozenExamination`	`canonical_name`, `aliases`, `category`
`ConsultationSchedule`	`doctor_name`, `department`, `slots` (day/period/type), `status`, `campus_contacts`
`ScrapeResult`	Lists of all entity types + `schedules`, `scraped_at`, `source_sites`
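
As a small illustration of what `frozen=True` provides, the sketch below defines a simplified stand-in for `FrozenCondition` and shows that rebinding a field raises `FrozenInstanceError`. The class name `FrozenConditionSketch`, the tuple-typed `aliases`, and the alias value are assumptions made for the example, not the project's actual definitions.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FrozenConditionSketch:
    """Hypothetical stand-in for FrozenCondition (field types are assumed)."""

    canonical_name: str
    aliases: tuple[str, ...] = ()
    category: str | None = None


def try_rename(condition: FrozenConditionSketch, new_name: str) -> bool:
    """Return True if the rename succeeded, False if the dataclass refused it."""
    try:
        condition.canonical_name = new_name  # frozen dataclasses reject this
    except FrozenInstanceError:
        return False
    return True


# "hartzwakte" is an illustrative alias, not taken from the real taxonomy.
hartfalen = FrozenConditionSketch("Hartfalen", aliases=("hartzwakte",))
renamed = try_rename(hartfalen, "Heart failure")
```

Note that `frozen=True` only blocks attribute rebinding. A field holding a mutable list could still be changed in place, which is why the sketch stores aliases in a tuple.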

O(1) Lookup Indexes

The registry builds 10 lookup indexes at construction time:

Index	Type	Key	Value	Used For
`_doctor_names`	`set[str]`	original-case	--	`doctor_names` property
`_doctor_names_lower`	`set[str]`	lowered	--	`is_known_doctor()`
`_department_alias_map`	`dict[str, str]`	alias (lower)	canonical name	department resolution
`_condition_alias_map`	`dict[str, str]`	alias (lower)	canonical name	condition resolution
`_treatment_alias_map`	`dict[str, str]`	alias (lower)	canonical name	treatment resolution
`_examination_alias_map`	`dict[str, str]`	alias (lower)	canonical name	examination resolution
`_dept_condition_map`	`dict[str, list]`	dept (lower)	condition names	HANDLES validation
`_dept_treatment_map`	`dict[str, list]`	dept (lower)	treatment names	OFFERS validation
`_dept_examination_map`	`dict[str, list]`	dept (lower)	exam names	PERFORMS validation
`_departments_by_key`	`dict[str, FrozenDepartment]`	dept (lower)	full object	`get_department()`

All lookups are case-insensitive via lowered keys. The _build_indexes() method runs once at construction.
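
The alias-map rows in the table above can be pictured with a minimal sketch like the following. It assumes a two-field stand-in for `FrozenDepartment`; the names `DepartmentSketch`, `build_alias_map`, and `resolve` are hypothetical, and the real `_build_indexes()` may resolve alias collisions differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DepartmentSketch:
    """Hypothetical stand-in for FrozenDepartment with only two fields."""

    canonical_name: str
    aliases: tuple[str, ...] = ()


def build_alias_map(departments: list[DepartmentSketch]) -> dict[str, str]:
    """Map every lowered alias and canonical name to the canonical name."""
    alias_map: dict[str, str] = {}
    for dept in departments:
        for alias in dept.aliases:
            # Assumption: the first department to claim an alias keeps it.
            alias_map.setdefault(alias.lower(), dept.canonical_name)
    # Canonical names are written last so an alias can never shadow them.
    for dept in departments:
        alias_map[dept.canonical_name.lower()] = dept.canonical_name
    return alias_map


def resolve(alias_map: dict[str, str], name: str) -> str | None:
    """Single dict lookup on the lowered key; None when the name is unknown."""
    return alias_map.get(name.strip().lower())


alias_map = build_alias_map(
    [DepartmentSketch("Cardiologie", aliases=("hartdokter",))]
)
```

Because the map is built once and every query is a single dictionary lookup, resolution cost does not grow with the number of entities (average case for a hash table).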

Core API

class FrozenTaxonomyRegistry:
    def __init__(self, scrape_result: ScrapeResult, knowledge_maps: dict | None = None):
        """Build registry with optional injected knowledge maps."""

    def is_known_entity(self, name: str, entity_type: str) -> bool:
        """O(1) check if an entity exists in the taxonomy."""

    def resolve_entity(self, name: str, entity_type: str) -> str | None:
        """Resolve an alias to its canonical name. Returns None if unknown."""

    def is_known_doctor(self, name: str) -> bool:
        """Dedicated doctor check (case-insensitive, Dr. prefix handling)."""

    def get_department(self, name: str) -> FrozenDepartment | None:
        """Full department object lookup by name or alias."""

    def is_valid_relationship(self, source: str, target: str, rel_type: str) -> bool:
        """Validate a relationship against taxonomy data + knowledge maps."""

    def summary(self) -> dict[str, int]:
        """Entity counts by type."""

Dependency Injection

The knowledge_maps parameter was introduced to make the hidden dependency on medical_knowledge explicit and testable. When None (default), the registry loads knowledge maps from the medical_knowledge package at construction time. When provided, the injected maps are used instead -- enabling unit tests without importing the full knowledge package:

# Production: loads from medical_knowledge
registry = FrozenTaxonomyRegistry(scrape_result)

# Testing: inject custom maps
registry = FrozenTaxonomyRegistry(scrape_result, knowledge_maps={
    "dept_conditions": {"hartfalen": ["Cardiologie"]},
    "dept_treatments": {},
    "dept_exams": {},
})
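
The default-versus-injected behaviour can be sketched as below. This is not the registry's constructor: `RegistrySketch` and `load_default_knowledge_maps` are hypothetical names, and the default maps are empty placeholders. The point it illustrates is the `is None` check, which lets a test inject an empty dict without triggering the default load.

```python
from __future__ import annotations


def load_default_knowledge_maps() -> dict[str, dict]:
    """Hypothetical stand-in for loading maps from medical_knowledge."""
    return {"dept_conditions": {}, "dept_treatments": {}, "dept_exams": {}}


class RegistrySketch:
    """Illustrates the injection pattern only; not the real registry."""

    def __init__(self, knowledge_maps: dict | None = None) -> None:
        # Compare against None rather than truthiness so that an explicitly
        # injected empty dict is respected instead of falling back to defaults.
        if knowledge_maps is None:
            knowledge_maps = load_default_knowledge_maps()
        self.knowledge_maps = knowledge_maps


default_registry = RegistrySketch()
empty_registry = RegistrySketch(knowledge_maps={})
```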

Singleton Lifecycle

The registry follows a singleton pattern because it represents a global, read-only resource:

# Async initialisation (from database)
registry = await initialize_registry(session, tenant_id)

# Sync initialisation (from ScrapeResult, for testing/scripts)
registry = initialize_registry_from_result(scrape_result)

# Access anywhere
registry = get_registry()  # Returns None if not initialised

# Reset (testing only)
clear_registry()

The registry is initialised once at application startup. It is refreshed when the taxonomy is re-scraped via the /api/v1/graph/refresh-taxonomy endpoint and, on the SP-5 path, when a new version is published (see Version-Check Cache Invalidation).
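
A module-level singleton of this kind is commonly written as in the sketch below. It reuses the function names from the snippet above for readability, but the bodies and the `RegistrySketch` placeholder are assumptions; the real functions may add locking, logging, or tenant handling.

```python
from __future__ import annotations


class RegistrySketch:
    """Placeholder for FrozenTaxonomyRegistry."""

    def __init__(self, scrape_result: object) -> None:
        self.scrape_result = scrape_result


# Module-level slot holding the single shared instance.
_registry: RegistrySketch | None = None


def initialize_registry_from_result(scrape_result: object) -> RegistrySketch:
    """Build a registry and publish it as the process-wide instance."""
    global _registry
    _registry = RegistrySketch(scrape_result)
    return _registry


def get_registry() -> RegistrySketch | None:
    """Return the shared instance, or None before initialisation."""
    return _registry


def clear_registry() -> None:
    """Drop the shared instance (intended for test teardown)."""
    global _registry
    _registry = None


before = get_registry()
created = initialize_registry_from_result({"doctors": []})
during = get_registry()
clear_registry()
after = get_registry()
```

Since the instance is replaced by rebinding a single name rather than mutated, readers that already hold a reference keep a consistent snapshot.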

Database Schema

Three PostgreSQL tables in the app schema (migration 026):

`taxonomy_entities`

Stores the frozen entity allowlist.

Column	Type	Notes
`id`	`UUID` (PK)	`gen_random_uuid()`
`tenant_id`	`UUID` (FK)	CASCADE delete
`entity_type`	`VARCHAR(30)`	`doctor`, `department`, `condition`, `treatment`, `examination`
`canonical_name`	`VARCHAR(300)`	NOT NULL
`aliases`	`JSONB`	Default `[]`
`metadata`	`JSONB`	Default `{}`; stores departments/specialties (doctors), domain_group (depts)
`source_url`	`VARCHAR(1000)`	Nullable
`source_site`	`VARCHAR(200)`	e.g., `"www.zol.be"` or `"zol.novation.website"`
`is_active`	`BOOLEAN`	Default `true`

Unique constraint: (tenant_id, entity_type, canonical_name) -- enables UPSERT for non-doctor entities.
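
To show how a unique constraint on these three columns supports UPSERT, the sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL. The table is reduced to four columns, and the choice to overwrite only `aliases` on conflict is an assumption for the example. PostgreSQL accepts the same `ON CONFLICT (...) DO UPDATE SET ... = EXCLUDED....` form, though placeholder syntax depends on the driver.

```python
import json
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE taxonomy_entities (
        tenant_id      TEXT NOT NULL,
        entity_type    TEXT NOT NULL,
        canonical_name TEXT NOT NULL,
        aliases        TEXT NOT NULL DEFAULT '[]',
        UNIQUE (tenant_id, entity_type, canonical_name)
    )
    """
)

UPSERT_SQL = """
    INSERT INTO taxonomy_entities (tenant_id, entity_type, canonical_name, aliases)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (tenant_id, entity_type, canonical_name)
    DO UPDATE SET aliases = excluded.aliases
"""


def upsert_entity(tenant_id: str, entity_type: str, name: str, aliases: list) -> None:
    """Insert the entity, or update its aliases if the key already exists."""
    conn.execute(UPSERT_SQL, (tenant_id, entity_type, name, json.dumps(aliases)))


# The second call hits the unique constraint and updates instead of failing.
upsert_entity("tenant-1", "department", "Cardiologie", [])
upsert_entity("tenant-1", "department", "Cardiologie", ["hartdokter"])

rows = conn.execute(
    "SELECT canonical_name, aliases FROM taxonomy_entities"
).fetchall()
```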

`taxonomy_relationships`

Stores scraped department-to-entity maps.

Column	Type	Notes
`tenant_id`	`UUID` (FK)	CASCADE delete
`source_name` / `source_type`	`VARCHAR`	e.g., `"Cardiologie"` / `"department"`
`target_name` / `target_type`	`VARCHAR`	e.g., `"Hartfalen"` / `"condition"`
`relationship_type`	`VARCHAR(50)`	`HANDLES`, `OFFERS`, `PERFORMS`, `WORKS_IN_SCHEDULE`
`metadata`	`JSONB`	Schedule slots and campus contacts for `WORKS_IN_SCHEDULE`

`unknown_entities`

Human review queue for entities detected during ingestion but not found in the taxonomy.

Column	Type	Notes
`name`	`VARCHAR(300)`	Entity name
`detected_type`	`VARCHAR(30)`	Entity type
`status`	`VARCHAR(20)`	`pending`, `approved`, `rejected`
`resolved_to`	`VARCHAR(300)`	Canonical name if approved
`content_snippet`	`TEXT`	Context where entity was found

AllowlistFilter

The AllowlistFilter validates extraction results against the registry using a dual-mode filtering strategy:

Entity Filtering Modes

Mode	Entity Types	Behaviour
Passthrough	campus, hospital, facility	Always kept, no taxonomy check
Strict	doctor, department	Must exist in taxonomy; unknown entities are rejected and recorded
Relaxed	condition, treatment, examination, service	Kept even without taxonomy match; canonical name applied if found
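
The three modes in the table can be summarised in a short decision function. This is a sketch under stated assumptions, not the `AllowlistFilter` implementation: `Decision`, `decide`, and the `known` lookup keyed by `(entity_type, lowered name)` are hypothetical, and how the real filter treats entity types outside the table is not specified here.

```python
from __future__ import annotations

from dataclasses import dataclass

PASSTHROUGH_TYPES = {"campus", "hospital", "facility"}
STRICT_TYPES = {"doctor", "department"}
RELAXED_TYPES = {"condition", "treatment", "examination", "service"}


@dataclass(frozen=True)
class Decision:
    kept: bool
    name: str  # canonical name when resolved, otherwise the input name
    mode: str


def decide(name: str, entity_type: str, known: dict[tuple[str, str], str]) -> Decision:
    """Apply the three modes; `known` maps (type, lowered name) to canonical."""
    if entity_type in PASSTHROUGH_TYPES:
        return Decision(True, name, "passthrough")
    canonical = known.get((entity_type, name.lower()))
    if entity_type in STRICT_TYPES:
        # Strict: unknown names are rejected.
        return Decision(canonical is not None, canonical or name, "strict")
    if entity_type in RELAXED_TYPES:
        # Relaxed: always kept, canonicalised only when a match exists.
        return Decision(True, canonical or name, "relaxed")
    # Assumption: types outside the table are dropped in this sketch.
    return Decision(False, name, "unknown-type")


known = {("department", "hartdokter"): "Cardiologie"}
dept = decide("Hartdokter", "department", known)
phantom_doctor = decide("Afdelingshoofd", "doctor", known)
condition = decide("rugpijn", "condition", known)
campus = decide("Campus X", "campus", known)
```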

Relationship Filtering (Three Gates)

1. Source-type gating: WORKS_IN/WORKS_AT_CAMPUS require the source page to be a GOLDEN_SOURCE; HANDLES/OFFERS/PERFORMS require an AUTHORITATIVE_SOURCE
2. Endpoint survival: Both the source and target entity names must be in the set of kept entities
3. Taxonomy validation: registry.is_valid_relationship() checks the relationship against frozen taxonomy data and fallback domain knowledge maps
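
Applied in order, the gates above could look like the following sketch. The function name `first_failed_gate` and the exact-match comparison of page types are assumptions; in particular, whether a GOLDEN_SOURCE page also satisfies the AUTHORITATIVE_SOURCE requirement is not stated in this document, and relationship types without a listed requirement simply skip the first gate here.

```python
from __future__ import annotations

from typing import Callable

# Required source page type per relationship type, as listed in the gates.
REQUIRED_PAGE_TYPE = {
    "WORKS_IN": "GOLDEN_SOURCE",
    "WORKS_AT_CAMPUS": "GOLDEN_SOURCE",
    "HANDLES": "AUTHORITATIVE_SOURCE",
    "OFFERS": "AUTHORITATIVE_SOURCE",
    "PERFORMS": "AUTHORITATIVE_SOURCE",
}


def first_failed_gate(
    source: str,
    target: str,
    rel_type: str,
    source_page_type: str,
    kept_names: set[str],
    is_valid_relationship: Callable[[str, str, str], bool],
) -> str | None:
    """Return the name of the first failing gate, or None if all pass."""
    required = REQUIRED_PAGE_TYPE.get(rel_type)
    if required is not None and source_page_type != required:
        return "source-type"
    if source not in kept_names or target not in kept_names:
        return "endpoint-survival"
    if not is_valid_relationship(source, target, rel_type):
        return "taxonomy-validation"
    return None


allowed = {("Cardiologie", "Hartfalen", "HANDLES")}
kept = {"Cardiologie", "Hartfalen"}


def is_valid(source: str, target: str, rel_type: str) -> bool:
    """Stand-in for registry.is_valid_relationship()."""
    return (source, target, rel_type) in allowed


ok = first_failed_gate(
    "Cardiologie", "Hartfalen", "HANDLES", "AUTHORITATIVE_SOURCE", kept, is_valid
)
wrong_page = first_failed_gate(
    "Cardiologie", "Hartfalen", "HANDLES", "OTHER", kept, is_valid
)
dropped_endpoint = first_failed_gate(
    "Cardiologie", "Rugpijn", "HANDLES", "AUTHORITATIVE_SOURCE", kept, is_valid
)
reversed_pair = first_failed_gate(
    "Hartfalen", "Cardiologie", "HANDLES", "AUTHORITATIVE_SOURCE", kept, is_valid
)
```

Ordering the cheap checks first means the taxonomy lookup only runs for relationships whose page type and endpoints already qualify.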

Scraping Flow

The scraper fetches HTML from hub pages (auto-detected by the LLM classifier from crawled content) and parses doctors, departments, conditions, treatments, examinations, and consultation schedules. It then deduplicates entities across the source sites (the source_site values, e.g. www.zol.be and zol.novation.website). The persistence layer uses a full-replacement strategy: DELETE existing entities for the tenant, then INSERT new ones. This ensures the taxonomy always reflects the latest extraction. The DB is the sole source of truth -- YAML no longer contains hub page URLs.
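
Cross-site deduplication might be approached as in the sketch below, which merges doctor records that share a name after whitespace and case normalisation. The record shape, the merge key, and the function name `dedupe_doctors` are assumptions for illustration, and the doctor shown is fictional; the scraper's actual matching rules are not described here.

```python
from __future__ import annotations


def dedupe_doctors(records: list[dict]) -> list[dict]:
    """Merge records that refer to the same doctor on different sites."""
    merged: dict[str, dict] = {}
    for record in records:
        # Assumed merge key: name with collapsed whitespace, lowercased.
        key = " ".join(record["name"].split()).lower()
        entry = merged.setdefault(
            key, {"name": record["name"], "departments": [], "source_sites": []}
        )
        for dept in record["departments"]:
            if dept not in entry["departments"]:
                entry["departments"].append(dept)
        if record["source_site"] not in entry["source_sites"]:
            entry["source_sites"].append(record["source_site"])
    return list(merged.values())


scraped = [
    {"name": "Jan Peeters", "departments": ["Cardiologie"],
     "source_site": "www.zol.be"},
    {"name": "jan  peeters", "departments": ["Cardiologie", "Radiologie"],
     "source_site": "zol.novation.website"},
]
doctors = dedupe_doctors(scraped)
```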

Integration Points

The registry integrates at three pipeline stages:

1. Extraction Time

Applied at the end of extract_medical_entities() in medical_extraction.py:

registry = get_registry()
if registry:
    filter_result = AllowlistFilter(registry).filter(
        result, source_url, source_page_type=source_page_type.value,
    )
    result = filter_result.filtered

2. Graph Seeding

The GoldenPageSeeder reads the registry to build a MedicalExtractionResult that combines scraped entities with domain knowledge maps. See Seeding Pipeline for details.

3. Query-Time Resolution

The resolve_search_query() function in zol_taxonomy.py uses alias maps built from the hospital config YAML. These maps are enriched by registry data when available, enabling patient-friendly search terms ("hartdokter") to resolve to canonical department names ("Cardiologie").

Called from:

query_service.py -- entity extraction from user queries
rag_service.py -- query enrichment with taxonomy terms
intent_classification_service.py -- intent routing based on entity type
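
As a rough picture of query-time resolution, the sketch below scans a query for single-word aliases and returns the canonical names it finds. It is not `resolve_search_query()`: the function name `resolve_query_terms`, the tokenisation, and the example sentence are assumptions, and multi-word aliases would need phrase matching that is omitted here.

```python
from __future__ import annotations

import re


def resolve_query_terms(query: str, alias_map: dict[str, str]) -> list[str]:
    """Return canonical names for known aliases, in order of first appearance."""
    canonical_hits: list[str] = []
    for token in re.findall(r"\w+", query.lower()):
        canonical = alias_map.get(token)
        if canonical is not None and canonical not in canonical_hits:
            canonical_hits.append(canonical)
    return canonical_hits


aliases = {"hartdokter": "Cardiologie", "cardiologie": "Cardiologie"}
hits = resolve_query_terms("Ik zoek een hartdokter of cardiologie", aliases)
no_hits = resolve_query_terms("Waar kan ik parkeren?", aliases)
```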

Loading from Published Tables (SP-5)

The from_published() classmethod (added in SP-5) is the production initialisation path. Instead of loading from the legacy ScrapeResult format, it reads directly from the versioned published_entities and published_relationships tables created by the Draft/Publish System.

@classmethod
async def from_published(
    cls,
    session: AsyncSession,
    hospital_id: UUID,
    version: int | None = None,  # None = latest
) -> "FrozenTaxonomyRegistry":
    """Load registry from published tables (SP-5 production path)."""
    if version is None:
        version = await _get_latest_version(session, hospital_id)
    entities = await _load_published_entities(session, hospital_id, version)
    relationships = await _load_published_relationships(session, hospital_id, version)
    return cls._build_registry_from_rows(entities, relationships)

`_build_registry_from_rows()`

This method transforms flat PostgreSQL rows into the frozen dataclasses the registry's O(1) indexes are built from:

Entities — rows are grouped by entity_type and converted to FrozenDoctor, FrozenDepartment, FrozenCondition, FrozenTreatment, and FrozenExamination instances
Aliases — the aliases JSONB column is unpacked to populate alias maps
Relationships — published_relationships rows with HANDLES, OFFERS, PERFORMS types are converted to the _dept_condition_map, _dept_treatment_map, and _dept_examination_map indexes
Schedules — WORKS_IN_SCHEDULE relationship rows with metadata payloads are parsed into ConsultationSchedule instances
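
The grouping and relationship steps listed above can be sketched as follows. Rows are modelled as plain dicts whose keys mirror the `taxonomy_entities` and `taxonomy_relationships` columns; that the published tables use the same column names is an assumption, as are the helper names `group_by_entity_type` and `build_dept_maps`.

```python
from __future__ import annotations

from collections import defaultdict

# Relationship types that feed the three department maps.
DEPT_REL_TYPES = ("HANDLES", "OFFERS", "PERFORMS")


def group_by_entity_type(entity_rows: list[dict]) -> dict[str, list[dict]]:
    """Bucket flat entity rows by their entity_type column."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for row in entity_rows:
        grouped[row["entity_type"]].append(row)
    return dict(grouped)


def build_dept_maps(relationship_rows: list[dict]) -> dict[str, dict[str, list[str]]]:
    """Build one dept(lower) -> target-names map per relationship type."""
    maps: dict[str, dict[str, list[str]]] = {
        rel: defaultdict(list) for rel in DEPT_REL_TYPES
    }
    for row in relationship_rows:
        rel_type = row["relationship_type"]
        if rel_type in maps:  # e.g. WORKS_IN_SCHEDULE rows are handled elsewhere
            maps[rel_type][row["source_name"].lower()].append(row["target_name"])
    return {rel: dict(index) for rel, index in maps.items()}


entities = [
    {"entity_type": "department", "canonical_name": "Cardiologie", "aliases": []},
    {"entity_type": "condition", "canonical_name": "Hartfalen", "aliases": []},
]
relationships = [
    {"source_name": "Cardiologie", "target_name": "Hartfalen",
     "relationship_type": "HANDLES"},
    {"source_name": "Jan Peeters", "target_name": "Cardiologie",
     "relationship_type": "WORKS_IN_SCHEDULE"},
]
grouped = group_by_entity_type(entities)
dept_maps = build_dept_maps(relationships)
```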

Version-Check Cache Invalidation

The registry singleton polls taxonomy_versions every 60 seconds to detect new publishes:

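import asyncio

async def _version_check_loop(hospital_id: UUID) -> None:
    """Background task: reload registry when a new version is published."""
    # `session` and `_registry_version` are not defined in this excerpt; they are
    # assumed to be module-level state, with `_registry_version` updated by
    # `_reload_registry()`.
    while True:
        await asyncio.sleep(60)  # poll interval in seconds
        latest = await _get_latest_version(session, hospital_id)
        if latest > _registry_version:
            await _reload_registry(hospital_id, latest)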
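
For experimentation outside the application, the polling pattern can be reproduced in a self-contained form. The sketch below is an assumption-laden variant, not the project's loop: `poll_for_new_version`, its injected `get_latest` and `reload` callables, and the `max_checks` bound exist only so the example can run and terminate without a database.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable


async def poll_for_new_version(
    get_latest: Callable[[], Awaitable[int]],
    reload: Callable[[int], Awaitable[None]],
    current_version: int,
    interval_seconds: float = 60.0,
    max_checks: int | None = None,
) -> int:
    """Poll for a newer version and reload when one appears."""
    checks = 0
    while max_checks is None or checks < max_checks:
        await asyncio.sleep(interval_seconds)
        checks += 1
        try:
            latest = await get_latest()
        except Exception:
            # Keep polling after a transient failure instead of ending the task.
            continue
        if latest > current_version:
            await reload(latest)
            current_version = latest
    return current_version


# Demo with canned version numbers and a zero-second interval.
versions = iter([1, 2, 2, 3])
reloaded: list[int] = []


async def fake_get_latest() -> int:
    return next(versions)


async def fake_reload(version: int) -> None:
    reloaded.append(version)


final_version = asyncio.run(
    poll_for_new_version(fake_get_latest, fake_reload, 1, 0.0, max_checks=4)
)
```

Passing the version fetcher and reloader in as callables keeps the loop testable; tracking the current version inside the loop avoids relying on shared module state.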

`force_registry_rebuild()`

force_registry_rebuild(hospital_id) bypasses the 60-second poll interval and triggers an immediate reload. It is called automatically by PublishService after every publish and rollback operation, so the change reaches search as soon as the reload completes instead of waiting up to 60 seconds for the next poll.

# Called by PublishService after successful publish or rollback
await force_registry_rebuild(hospital_id=hospital.id)

Backward Compatibility

The ScrapeResult-based initialisation path (initialize_registry_from_result()) remains fully functional. It is used in:

Unit tests — inject a ScrapeResult directly without a database
Legacy scripts — seed_golden_graph.py and taxonomy scraper scripts that predate SP-5
Development — quick local initialisation without a published version

When a hospital has no published version yet (e.g., a new tenant mid-setup), the registry falls back to loading from taxonomy_entities directly (draft data) until the first publish completes.
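
The fallback described above amounts to a small branch on whether a published version exists. The sketch below shows that decision in isolation; `choose_rows` and its loader callables are hypothetical, and it assumes the latest-version lookup yields `None` for a hospital that has never published.

```python
from __future__ import annotations

from typing import Callable


def choose_rows(
    latest_version: int | None,
    load_published: Callable[[int], list[dict]],
    load_draft: Callable[[], list[dict]],
) -> tuple[str, list[dict]]:
    """Pick published rows when a version exists, otherwise draft rows."""
    if latest_version is None:
        return "draft", load_draft()
    return "published", load_published(latest_version)


def fake_published(version: int) -> list[dict]:
    return [{"canonical_name": "Cardiologie", "version": version}]


def fake_draft() -> list[dict]:
    return [{"canonical_name": "Cardiologie (draft)"}]


new_tenant = choose_rows(None, fake_published, fake_draft)
live_tenant = choose_rows(3, fake_published, fake_draft)
```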

Migration Path

Hospitals that were initialised via scrape_taxonomy + seed_golden_graph before SP-5 will continue to work without modification. The first explicit publish operation creates the published_entities snapshot and switches the registry to the from_published() path.

