
Frozen Taxonomy Registry

The FrozenTaxonomyRegistry is a read-only, in-memory entity index loaded from PostgreSQL taxonomy tables at application startup. It provides O(1) lookups for entity resolution, relationship validation, and allowlist filtering. The registry is "frozen" because it represents a point-in-time snapshot of the hospital's entity inventory -- a loaded snapshot is never mutated during system operation. It can only be replaced wholesale, either by re-running taxonomy extraction or by reloading after a new version is published (see Loading from Published Tables).

Hub/Detail Reclassification (2026-03-09)

The original 8-type page classification (golden_seed, golden_listing, department_page, etc.) was replaced with binary hub/detail classification. Hub pages (~20-40 per hospital) are auto-detected by an LLM classifier. YAML no longer contains departments, doctors, or golden_pages sections -- the DB is the sole source of truth. See docs/plans/2026-03-09-golden-page-reclassification-design.md.

What Problem Does It Solve?

Without the registry, the extraction pipeline would accept any string that matches a doctor name regex as a valid doctor, any department-like pattern as a real department, and any relationship between them as plausible. This produces a noisy graph full of phantom entities -- job titles parsed as doctor names, section headers stored as departments, body parts classified as conditions.

The registry solves this by providing an authoritative inventory of known entities scraped from the hospital's own website. During extraction, every entity is checked against this inventory:

  • Known entities are accepted and normalised to their canonical names
  • Unknown entities are either rejected (doctors, departments) or flagged for downstream LLM validation (conditions, treatments, examinations)

Package Architecture

The taxonomy package (app/services/graph/taxonomy/) contains 9 modules:

| Module | Purpose |
| --- | --- |
| __init__.py | Public API: exports frozen models, ScrapeResult, registry functions |
| models.py | Frozen dataclasses: FrozenDoctor, FrozenDepartment, FrozenCondition, FrozenTreatment, FrozenExamination, ConsultationSchedule, ScrapeResult |
| registry.py | FrozenTaxonomyRegistry class + singleton lifecycle (get_registry, initialize_registry, clear_registry) |
| allowlist_filter.py | AllowlistFilter -- filters extraction results against the registry |
| persistence.py | save_scrape_result() and load_scrape_result() -- PostgreSQL persistence |
| scraper_base.py | TaxonomyScraperBase -- abstract async HTTP fetcher |
| zol_scraper.py | ZOLTaxonomyScraper -- ZOL-specific HTML parser (BeautifulSoup) |
| golden_page_config.py | GoldenPageSet dataclass + ALL_GOLDEN_PAGE_SETS (historically from YAML; now hub pages are auto-discovered by LLM classifier) |
| golden_page_seeder.py | GoldenPageSeeder -- merges three knowledge sources and seeds PostgreSQL taxonomy tables |

Data Models

All models use @dataclass(frozen=True) to guarantee immutability:

| Model | Key Fields |
| --- | --- |
| FrozenDoctor | name, departments, specialties, campuses, source_url, source_site |
| FrozenDepartment | canonical_name, aliases, campuses, domain_group, conditions, treatments, examinations |
| FrozenCondition | canonical_name, aliases, category |
| FrozenTreatment | canonical_name, aliases, category |
| FrozenExamination | canonical_name, aliases, category |
| ConsultationSchedule | doctor_name, department, slots (day/period/type), status, campus_contacts |
| ScrapeResult | Lists of all entity types + schedules, scraped_at, source_sites |
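
As a minimal illustration of what frozen=True guarantees, the sketch below defines a stand-in for FrozenCondition. Only the field names come from the table above; the field types and defaults are assumptions, and the example values are made up. Note that frozen=True blocks attribute reassignment but does not deep-freeze contained objects, which is why this sketch uses a tuple for aliases.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional, Tuple


# Stand-in for the real model: field names are from the table above,
# the types and defaults are assumptions made for this sketch.
@dataclass(frozen=True)
class FrozenCondition:
    canonical_name: str
    aliases: Tuple[str, ...] = ()
    category: Optional[str] = None


condition = FrozenCondition("Hartfalen", aliases=("hartzwakte",), category="cardio")

# Reassigning any field on a frozen dataclass raises FrozenInstanceError.
try:
    condition.canonical_name = "Iets anders"
    mutated = True
except FrozenInstanceError:
    mutated = False
```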

O(1) Lookup Indexes

The registry builds 10 lookup indexes at construction time:

| Index | Type | Key | Value | Used For |
| --- | --- | --- | --- | --- |
| _doctor_names | set[str] | original-case | -- | doctor_names property |
| _doctor_names_lower | set[str] | lowered | -- | is_known_doctor() |
| _department_alias_map | dict[str, str] | alias (lower) | canonical name | department resolution |
| _condition_alias_map | dict[str, str] | alias (lower) | canonical name | condition resolution |
| _treatment_alias_map | dict[str, str] | alias (lower) | canonical name | treatment resolution |
| _examination_alias_map | dict[str, str] | alias (lower) | canonical name | examination resolution |
| _dept_condition_map | dict[str, list] | dept (lower) | condition names | HANDLES validation |
| _dept_treatment_map | dict[str, list] | dept (lower) | treatment names | OFFERS validation |
| _dept_examination_map | dict[str, list] | dept (lower) | exam names | PERFORMS validation |
| _departments_by_key | dict[str, FrozenDepartment] | dept (lower) | full object | get_department() |

All lookups are case-insensitive via lowered keys. The _build_indexes() method runs once at construction.
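
The alias maps can be pictured with a small sketch. This is not the registry's actual _build_indexes() code: build_alias_map and resolve are hypothetical helpers, and registering the canonical name as its own alias is an assumption. It only shows how lowered dictionary keys give case-insensitive O(1) resolution.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def build_alias_map(entities: Iterable[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Map every lowered alias (and lowered canonical name) to the canonical name."""
    alias_map: Dict[str, str] = {}
    for canonical, aliases in entities:
        # Assumption: the canonical name resolves to itself.
        alias_map[canonical.lower()] = canonical
        for alias in aliases:
            alias_map[alias.lower()] = canonical
    return alias_map


def resolve(alias_map: Dict[str, str], name: str) -> Optional[str]:
    """Single dict lookup on the lowered name; None means unknown."""
    return alias_map.get(name.strip().lower())


department_aliases = build_alias_map([("Cardiologie", ["Cardio", "hartziekten"])])
```

Because the map is built once and never mutated afterwards, each later lookup is a single hash-table access regardless of how many entities the taxonomy contains.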

Core API

class FrozenTaxonomyRegistry:
    def __init__(self, scrape_result: ScrapeResult, knowledge_maps: dict | None = None):
        """Build registry with optional injected knowledge maps."""

    def is_known_entity(self, name: str, entity_type: str) -> bool:
        """O(1) check if an entity exists in the taxonomy."""

    def resolve_entity(self, name: str, entity_type: str) -> str | None:
        """Resolve an alias to its canonical name. Returns None if unknown."""

    def is_known_doctor(self, name: str) -> bool:
        """Dedicated doctor check (case-insensitive, Dr. prefix handling)."""

    def get_department(self, name: str) -> FrozenDepartment | None:
        """Full department object lookup by name or alias."""

    def is_valid_relationship(self, source: str, target: str, rel_type: str) -> bool:
        """Validate a relationship against taxonomy data + knowledge maps."""

    def summary(self) -> dict[str, int]:
        """Entity counts by type."""

Dependency Injection

The knowledge_maps parameter was introduced to make the hidden dependency on medical_knowledge explicit and testable. When None (default), the registry loads knowledge maps from the medical_knowledge package at construction time. When provided, the injected maps are used instead -- enabling unit tests without importing the full knowledge package:

# Production: loads from medical_knowledge
registry = FrozenTaxonomyRegistry(scrape_result)

# Testing: inject custom maps
registry = FrozenTaxonomyRegistry(scrape_result, knowledge_maps={
    "dept_conditions": {"hartfalen": ["Cardiologie"]},
    "dept_treatments": {},
    "dept_exams": {},
})

Singleton Lifecycle

The registry follows a singleton pattern because it represents a global, read-only resource:

# Async initialisation (from database)
registry = await initialize_registry(session, tenant_id)

# Sync initialisation (from ScrapeResult, for testing/scripts)
registry = initialize_registry_from_result(scrape_result)

# Access anywhere
registry = get_registry() # Returns None if not initialised

# Reset (testing only)
clear_registry()

The registry is initialised once at application startup and refreshed only when the taxonomy is re-scraped via the /api/v1/graph/refresh-taxonomy endpoint.
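
For readers unfamiliar with the pattern, a module-level singleton of this kind can be sketched as follows. This is an illustrative stand-in, not the contents of registry.py: _StubRegistry is a hypothetical placeholder for FrozenTaxonomyRegistry, and only the three function names are taken from the text above.

```python
from typing import Any, Optional


class _StubRegistry:
    """Hypothetical placeholder for FrozenTaxonomyRegistry."""

    def __init__(self, scrape_result: Any) -> None:
        self.scrape_result = scrape_result


# Module-level slot holding the single shared instance.
_registry: Optional[_StubRegistry] = None


def initialize_registry_from_result(scrape_result: Any) -> _StubRegistry:
    """Build a registry and install it as the shared instance."""
    global _registry
    _registry = _StubRegistry(scrape_result)
    return _registry


def get_registry() -> Optional[_StubRegistry]:
    """Return the shared instance, or None if not initialised."""
    return _registry


def clear_registry() -> None:
    """Drop the shared instance (intended for tests)."""
    global _registry
    _registry = None
```

Callers must handle the None case from get_registry(), which is why the extraction-time integration guards the filter with an if check.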

Database Schema

Three PostgreSQL tables in the app schema (migration 026):

taxonomy_entities

Stores the frozen entity allowlist.

| Column | Type | Notes |
| --- | --- | --- |
| id | UUID (PK) | gen_random_uuid() |
| tenant_id | UUID (FK) | CASCADE delete |
| entity_type | VARCHAR(30) | doctor, department, condition, treatment, examination |
| canonical_name | VARCHAR(300) | NOT NULL |
| aliases | JSONB | Default [] |
| metadata | JSONB | Default {}; stores departments/specialties (doctors), domain_group (depts) |
| source_url | VARCHAR(1000) | Nullable |
| source_site | VARCHAR(200) | e.g., "www.zol.be" or "zol.novation.website" |
| is_active | BOOLEAN | Default true |

Unique constraint: (tenant_id, entity_type, canonical_name) -- enables UPSERT for non-doctor entities.
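
The following sketch shows how a unique constraint on (tenant_id, entity_type, canonical_name) makes an UPSERT possible. It is an assumption-laden illustration: an in-memory SQLite database stands in for PostgreSQL, the table is reduced to four columns, and the statement is not taken from persistence.py. The ON CONFLICT ... DO UPDATE form is shared by both databases.

```python
import json
import sqlite3

# SQLite stands in for PostgreSQL here; the column set is reduced for brevity.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE taxonomy_entities (
        tenant_id      TEXT NOT NULL,
        entity_type    TEXT NOT NULL,
        canonical_name TEXT NOT NULL,
        aliases        TEXT NOT NULL DEFAULT '[]',
        UNIQUE (tenant_id, entity_type, canonical_name)
    )
    """
)

# Insert a new row, or update the aliases of the row that already holds the key.
UPSERT_SQL = """
    INSERT INTO taxonomy_entities (tenant_id, entity_type, canonical_name, aliases)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (tenant_id, entity_type, canonical_name)
    DO UPDATE SET aliases = excluded.aliases
"""

conn.execute(UPSERT_SQL, ("t1", "department", "Cardiologie", json.dumps(["cardio"])))
conn.execute(
    UPSERT_SQL,
    ("t1", "department", "Cardiologie", json.dumps(["cardio", "hartziekten"])),
)
conn.commit()

rows = conn.execute("SELECT canonical_name, aliases FROM taxonomy_entities").fetchall()
```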

taxonomy_relationships

Stores scraped department-to-entity maps.

| Column | Type | Notes |
| --- | --- | --- |
| tenant_id | UUID (FK) | CASCADE delete |
| source_name / source_type | VARCHAR | e.g., "Cardiologie" / "department" |
| target_name / target_type | VARCHAR | e.g., "Hartfalen" / "condition" |
| relationship_type | VARCHAR(50) | HANDLES, OFFERS, PERFORMS, WORKS_IN_SCHEDULE |
| metadata | JSONB | Schedule slots and campus contacts for WORKS_IN_SCHEDULE |

unknown_entities

Human review queue for entities detected during ingestion but not found in the taxonomy.

| Column | Type | Notes |
| --- | --- | --- |
| name | VARCHAR(300) | Entity name |
| detected_type | VARCHAR(30) | Entity type |
| status | VARCHAR(20) | pending, approved, rejected |
| resolved_to | VARCHAR(300) | Canonical name if approved |
| content_snippet | TEXT | Context where entity was found |

AllowlistFilter

The AllowlistFilter validates extraction results against the registry using a dual-mode filtering strategy:

Entity Filtering Modes

| Mode | Entity Types | Behaviour |
| --- | --- | --- |
| Passthrough | campus, hospital, facility | Always kept, no taxonomy check |
| Strict | doctor, department | Must exist in taxonomy; unknown entities are rejected and recorded |
| Relaxed | condition, treatment, examination, service | Kept even without taxonomy match; canonical name applied if found |
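
The three modes in the table can be sketched as a single decision function. This is an illustration rather than the AllowlistFilter implementation: filter_entity and its resolve callback are hypothetical names, and dropping entity types that appear in none of the three groups is an assumption.

```python
from typing import Callable, Optional, Tuple

PASSTHROUGH = {"campus", "hospital", "facility"}
STRICT = {"doctor", "department"}
RELAXED = {"condition", "treatment", "examination", "service"}

# resolve(name, entity_type) returns the canonical name, or None if unknown.
Resolver = Callable[[str, str], Optional[str]]


def filter_entity(name: str, entity_type: str, resolve: Resolver) -> Tuple[bool, Optional[str]]:
    """Return (kept, name_to_store) for one extracted entity."""
    if entity_type in PASSTHROUGH:
        return True, name  # no taxonomy check at all
    canonical = resolve(name, entity_type)
    if entity_type in STRICT:
        return canonical is not None, canonical  # unknown means rejected
    if entity_type in RELAXED:
        return True, canonical or name  # keep, normalise when possible
    return False, None  # assumption: unlisted types are dropped


# Tiny hand-made taxonomy for demonstration only.
_known = {("department", "cardio"): "Cardiologie", ("condition", "hartfalen"): "Hartfalen"}


def demo_resolve(name: str, entity_type: str) -> Optional[str]:
    return _known.get((entity_type, name.lower()))
```

With this split, a job title misparsed as a doctor is rejected outright, while an unfamiliar condition still reaches the downstream LLM validation step.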

Relationship Filtering (Three Gates)

  1. Source-type gating: WORKS_IN/WORKS_AT_CAMPUS require the source page to be a GOLDEN_SOURCE; HANDLES/OFFERS/PERFORMS require an AUTHORITATIVE_SOURCE
  2. Endpoint survival: Both the source and target entity names must be in the set of kept entities
  3. Taxonomy validation: registry.is_valid_relationship() checks the relationship against frozen taxonomy data and fallback domain knowledge maps
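
The three gates above can be read as a short-circuiting chain. The sketch below is a hypothetical rendering: keep_relationship is not a function from the package, page types are modelled as plain strings, and matching them by exact equality is an assumption. The taxonomy check is passed in as a callable so that the example stays self-contained.

```python
from typing import Callable, Set, Tuple

GOLDEN_ONLY = {"WORKS_IN", "WORKS_AT_CAMPUS"}
AUTHORITATIVE_ONLY = {"HANDLES", "OFFERS", "PERFORMS"}

Relationship = Tuple[str, str, str]  # (source, target, rel_type)


def keep_relationship(
    rel: Relationship,
    source_page_type: str,
    kept_names: Set[str],
    is_valid_relationship: Callable[[str, str, str], bool],
) -> bool:
    source, target, rel_type = rel
    # Gate 1: source-type gating.
    if rel_type in GOLDEN_ONLY and source_page_type != "GOLDEN_SOURCE":
        return False
    if rel_type in AUTHORITATIVE_ONLY and source_page_type != "AUTHORITATIVE_SOURCE":
        return False
    # Gate 2: both endpoints must have survived entity filtering.
    if source not in kept_names or target not in kept_names:
        return False
    # Gate 3: taxonomy validation (registry.is_valid_relationship in the real pipeline).
    return is_valid_relationship(source, target, rel_type)


kept = {"Cardiologie", "Hartfalen"}


def demo_valid(source: str, target: str, rel_type: str) -> bool:
    return (source, target, rel_type) == ("Cardiologie", "Hartfalen", "HANDLES")
```

Ordering the cheap checks first means the taxonomy lookup only runs for relationships that could be kept at all.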

Scraping Flow

The scraper fetches HTML from hub pages (auto-detected by the LLM classifier from crawled content), parses doctors, departments, conditions, treatments, examinations, and consultation schedules, then deduplicates entities across the scraped source sites (for ZOL, www.zol.be and zol.novation.website). The persistence layer uses a full-replacement strategy: DELETE existing entities for the tenant, then INSERT new ones. This ensures the taxonomy always reflects the latest extraction. The DB is the sole source of truth -- YAML no longer contains hub page URLs.
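
A minimal sketch of the full-replacement strategy follows, under clearly stated assumptions: SQLite stands in for PostgreSQL, the table is reduced to three columns, and replace_entities is a hypothetical helper rather than save_scrape_result(). Running the DELETE and the INSERTs in one transaction is a design choice of this sketch, so that readers never observe an empty taxonomy.

```python
import sqlite3
from typing import Iterable, Tuple


def replace_entities(
    conn: sqlite3.Connection, tenant_id: str, entities: Iterable[Tuple[str, str]]
) -> None:
    """Replace all entities of one tenant with (entity_type, canonical_name) pairs."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM taxonomy_entities WHERE tenant_id = ?", (tenant_id,))
        conn.executemany(
            "INSERT INTO taxonomy_entities (tenant_id, entity_type, canonical_name) "
            "VALUES (?, ?, ?)",
            [(tenant_id, etype, name) for etype, name in entities],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxonomy_entities ("
    "tenant_id TEXT NOT NULL, entity_type TEXT NOT NULL, canonical_name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO taxonomy_entities VALUES (?, ?, ?)",
    [("t1", "department", "Oude afdeling"), ("t2", "department", "Pediatrie")],
)
conn.commit()

# Re-scrape for tenant t1: stale rows disappear, other tenants are untouched.
replace_entities(conn, "t1", [("department", "Cardiologie"), ("department", "Neurologie")])


def names_for(tenant_id: str) -> list:
    cursor = conn.execute(
        "SELECT canonical_name FROM taxonomy_entities WHERE tenant_id = ? "
        "ORDER BY canonical_name",
        (tenant_id,),
    )
    return [row[0] for row in cursor]
```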

Integration Points

The registry integrates at three pipeline stages:

1. Extraction Time

Applied at the end of extract_medical_entities() in medical_extraction.py:

registry = get_registry()
if registry:
    filter_result = AllowlistFilter(registry).filter(
        result, source_url, source_page_type=source_page_type.value,
    )
    result = filter_result.filtered

2. Graph Seeding

The GoldenPageSeeder reads the registry to build a MedicalExtractionResult that combines scraped entities with domain knowledge maps. See Seeding Pipeline for details.

3. Query-Time Resolution

The resolve_search_query() function in zol_taxonomy.py uses alias maps built from the hospital config YAML. These maps are enriched by registry data when available, enabling patient-friendly search terms ("hartdokter") to resolve to canonical department names ("Cardiologie").

Called from:

  • query_service.py -- entity extraction from user queries
  • rag_service.py -- query enrichment with taxonomy terms
  • intent_classification_service.py -- intent routing based on entity type
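
The query-time enrichment described above can be sketched as merging two alias maps and scanning the query tokens. This is not the code in zol_taxonomy.py: build_query_alias_map and resolve_terms_sketch are hypothetical names, whitespace tokenisation is a simplification, and letting config aliases take precedence over registry aliases is an assumption.

```python
from typing import Dict, List, Optional


def build_query_alias_map(
    config_aliases: Dict[str, str],
    registry_aliases: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Merge YAML-config aliases with registry aliases (config wins on conflict)."""
    merged = {alias.lower(): canonical for alias, canonical in config_aliases.items()}
    for alias, canonical in (registry_aliases or {}).items():
        merged.setdefault(alias.lower(), canonical)
    return merged


def resolve_terms_sketch(query: str, alias_map: Dict[str, str]) -> List[str]:
    """Return canonical names for every query token that is a known alias."""
    return [alias_map[token] for token in query.lower().split() if token in alias_map]


alias_map = build_query_alias_map(
    {"hartdokter": "Cardiologie"},                  # from hospital config YAML
    {"cardio": "Cardiologie", "neuro": "Neurologie"},  # from the registry, when available
)
```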

Loading from Published Tables (SP-5)

The from_published() classmethod (added in SP-5) is the production initialization path. Instead of loading from the legacy ScrapeResult format, it reads directly from the versioned published_entities and published_relationships tables created by the Draft/Publish System.

@classmethod
async def from_published(
    cls,
    session: AsyncSession,
    hospital_id: UUID,
    version: int | None = None,  # None = latest
) -> "FrozenTaxonomyRegistry":
    """Load registry from published tables (SP-5 production path)."""
    if version is None:
        version = await _get_latest_version(session, hospital_id)
    entities = await _load_published_entities(session, hospital_id, version)
    relationships = await _load_published_relationships(session, hospital_id, version)
    return cls._build_registry_from_rows(entities, relationships)

_build_registry_from_rows()

This method transforms flat PostgreSQL rows into the frozen dataclasses the registry's O(1) indexes are built from:

  1. Entities: rows are grouped by entity_type and converted to FrozenDoctor, FrozenDepartment, FrozenCondition, FrozenTreatment, and FrozenExamination instances
  2. Aliases: the aliases JSONB column is unpacked to populate alias maps
  3. Relationships: published_relationships rows with HANDLES, OFFERS, PERFORMS types are converted to the _dept_condition_map, _dept_treatment_map, and _dept_examination_map indexes
  4. Schedules: WORKS_IN_SCHEDULE relationship rows with metadata payloads are parsed into ConsultationSchedule instances
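
The first three steps above can be sketched with plain dictionaries standing in for database rows. This is an illustration, not _build_registry_from_rows() itself: group_rows is a hypothetical helper, and the row keys assume that the published tables use the same column names as taxonomy_entities and taxonomy_relationships. Schedule parsing is omitted.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable

# Relationship type -> which department index it feeds.
REL_TO_INDEX = {"HANDLES": "conditions", "OFFERS": "treatments", "PERFORMS": "examinations"}

Row = Dict[str, Any]


def group_rows(entity_rows: Iterable[Row], relationship_rows: Iterable[Row]):
    by_type = defaultdict(list)     # entity_type -> canonical names
    alias_maps = defaultdict(dict)  # entity_type -> {lowered alias: canonical name}
    for row in entity_rows:
        canonical = row["canonical_name"]
        by_type[row["entity_type"]].append(canonical)
        amap = alias_maps[row["entity_type"]]
        amap[canonical.lower()] = canonical
        for alias in row.get("aliases") or []:  # JSONB array, possibly NULL
            amap[alias.lower()] = canonical

    dept_maps = {index: defaultdict(list) for index in REL_TO_INDEX.values()}
    for rel in relationship_rows:
        index = REL_TO_INDEX.get(rel["relationship_type"])
        if index is not None:  # other types (e.g. WORKS_IN_SCHEDULE) are skipped here
            dept_maps[index][rel["source_name"].lower()].append(rel["target_name"])
    return by_type, alias_maps, dept_maps


by_type, alias_maps, dept_maps = group_rows(
    [
        {"entity_type": "department", "canonical_name": "Cardiologie", "aliases": ["cardio"]},
        {"entity_type": "condition", "canonical_name": "Hartfalen", "aliases": None},
    ],
    [
        {"source_name": "Cardiologie", "target_name": "Hartfalen", "relationship_type": "HANDLES"},
        {"source_name": "Dr. Voorbeeld", "target_name": "Cardiologie", "relationship_type": "WORKS_IN_SCHEDULE"},
    ],
)
```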

Version-Check Cache Invalidation

The registry singleton polls taxonomy_versions every 60 seconds to detect new publishes:

async def _version_check_loop(hospital_id: UUID) -> None:
    """Background task: reload registry when a new version is published."""
    # Note: `session` and `_registry_version` are not defined in this excerpt.
    while True:
        await asyncio.sleep(60)
        latest = await _get_latest_version(session, hospital_id)
        if latest > _registry_version:
            await _reload_registry(hospital_id, latest)

force_registry_rebuild()

force_registry_rebuild(hospital_id) bypasses the 60-second poll interval and triggers an immediate reload. It is called automatically by PublishService after every publish and rollback operation, ensuring the change is live in search within milliseconds rather than up to 60 seconds.

# Called by PublishService after successful publish or rollback
await force_registry_rebuild(hospital_id=hospital.id)

Backward Compatibility

The ScrapeResult-based initialization path (initialize_registry_from_result()) remains fully functional. It is used in:

  • Unit tests: inject a ScrapeResult directly without a database
  • Legacy scripts: seed_golden_graph.py and taxonomy scraper scripts that predate SP-5
  • Development: quick local initialization without a published version

When a hospital has no published version yet (e.g., a new tenant mid-setup), the registry falls back to loading from taxonomy_entities directly (draft data) until the first publish completes.

Migration Path

Hospitals that were initialized via scrape_taxonomy + seed_golden_graph before SP-5 will continue to work without modification. The first explicit publish operation creates the published_entities snapshot and switches the registry to the from_published() path.

References