Frozen Taxonomy Registry
The FrozenTaxonomyRegistry is a read-only, in-memory entity index loaded from PostgreSQL taxonomy tables at application startup. It provides O(1) lookups for entity resolution, relationship validation, and allowlist filtering. The registry is "frozen" because it represents a point-in-time snapshot of the hospital's entity inventory -- it does not change during system operation and can only be refreshed by re-running taxonomy extraction.
The original 8-type page classification (golden_seed, golden_listing, department_page, etc.) was replaced with binary hub/detail classification. Hub pages (~20-40 per hospital) are auto-detected by an LLM classifier. YAML no longer contains departments, doctors, or golden_pages sections -- the DB is the sole source of truth. See docs/plans/2026-03-09-golden-page-reclassification-design.md.
What Problem Does It Solve?
Without the registry, the extraction pipeline would accept any string that matches a doctor name regex as a valid doctor, any department-like pattern as a real department, and any relationship between them as plausible. This produces a noisy graph full of phantom entities -- job titles parsed as doctor names, section headers stored as departments, body parts classified as conditions.
The registry solves this by providing an authoritative inventory of known entities scraped from the hospital's own website. During extraction, every entity is checked against this inventory:
- Known entities are accepted and normalised to their canonical names
- Unknown entities are either rejected (doctors, departments) or flagged for downstream LLM validation (conditions, treatments, examinations)
Package Architecture
The taxonomy package (app/services/graph/taxonomy/) contains 9 modules:
| Module | Purpose |
|---|---|
| __init__.py | Public API: exports frozen models, ScrapeResult, registry functions |
| models.py | Frozen dataclasses: FrozenDoctor, FrozenDepartment, FrozenCondition, FrozenTreatment, FrozenExamination, ConsultationSchedule, ScrapeResult |
| registry.py | FrozenTaxonomyRegistry class + singleton lifecycle (get_registry, initialize_registry, clear_registry) |
| allowlist_filter.py | AllowlistFilter -- filters extraction results against the registry |
| persistence.py | save_scrape_result() and load_scrape_result() -- PostgreSQL persistence |
| scraper_base.py | TaxonomyScraperBase -- abstract async HTTP fetcher |
| zol_scraper.py | ZOLTaxonomyScraper -- ZOL-specific HTML parser (BeautifulSoup) |
| golden_page_config.py | GoldenPageSet dataclass + ALL_GOLDEN_PAGE_SETS (historically from YAML; now hub pages are auto-discovered by LLM classifier) |
| golden_page_seeder.py | GoldenPageSeeder -- merges three knowledge sources and seeds PostgreSQL taxonomy tables |
Data Models
All models use @dataclass(frozen=True) to guarantee immutability:
| Model | Key Fields |
|---|---|
| FrozenDoctor | name, departments, specialties, campuses, source_url, source_site |
| FrozenDepartment | canonical_name, aliases, campuses, domain_group, conditions, treatments, examinations |
| FrozenCondition | canonical_name, aliases, category |
| FrozenTreatment | canonical_name, aliases, category |
| FrozenExamination | canonical_name, aliases, category |
| ConsultationSchedule | doctor_name, department, slots (day/period/type), status, campus_contacts |
| ScrapeResult | Lists of all entity types + schedules, scraped_at, source_sites |
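As a minimal sketch of what frozen=True buys, the example below defines a simplified FrozenCondition and shows that attribute assignment is rejected after construction. The field types, the use of a tuple for aliases, and the sample alias are assumptions made for illustration; only the field names come from the table above.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FrozenCondition:
    canonical_name: str
    # Assumption: a tuple rather than a list, so the container itself cannot be mutated.
    aliases: tuple[str, ...] = ()
    category: str | None = None


condition = FrozenCondition("Hartfalen", aliases=("hartzwakte",))

try:
    # Assigning to a field of a frozen dataclass raises FrozenInstanceError.
    condition.canonical_name = "Something else"
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```

Because frozen dataclasses with hashable fields are themselves hashable, instances like this can also be stored in sets or used as dictionary keys.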
O(1) Lookup Indexes
The registry builds 10 lookup indexes at construction time:
| Index | Type | Key | Value | Used For |
|---|---|---|---|---|
| _doctor_names | set[str] | original-case | -- | doctor_names property |
| _doctor_names_lower | set[str] | lowered | -- | is_known_doctor() |
| _department_alias_map | dict[str, str] | alias (lower) | canonical name | department resolution |
| _condition_alias_map | dict[str, str] | alias (lower) | canonical name | condition resolution |
| _treatment_alias_map | dict[str, str] | alias (lower) | canonical name | treatment resolution |
| _examination_alias_map | dict[str, str] | alias (lower) | canonical name | examination resolution |
| _dept_condition_map | dict[str, list] | dept (lower) | condition names | HANDLES validation |
| _dept_treatment_map | dict[str, list] | dept (lower) | treatment names | OFFERS validation |
| _dept_examination_map | dict[str, list] | dept (lower) | exam names | PERFORMS validation |
| _departments_by_key | dict[str, FrozenDepartment] | dept (lower) | full object | get_department() |
All lookups are case-insensitive via lowered keys. The _build_indexes() method runs once at construction.
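The alias maps can be pictured as plain dictionaries keyed by lowered strings. The sketch below is not the actual _build_indexes() implementation; the Department stand-in, the helper names build_alias_map and resolve, and the choice to index the canonical name alongside its aliases are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Department:
    """Simplified stand-in for FrozenDepartment."""
    canonical_name: str
    aliases: tuple[str, ...] = ()


def build_alias_map(departments: list[Department]) -> dict[str, str]:
    """Map every lowered alias (and lowered canonical name) to the canonical name."""
    alias_map: dict[str, str] = {}
    for dept in departments:
        alias_map[dept.canonical_name.lower()] = dept.canonical_name
        for alias in dept.aliases:
            alias_map[alias.lower()] = dept.canonical_name
    return alias_map


def resolve(alias_map: dict[str, str], name: str) -> str | None:
    """Single dictionary lookup; returns None for unknown names."""
    return alias_map.get(name.strip().lower())


alias_map = build_alias_map([Department("Cardiologie", ("Cardio", "hartziekten"))])
```

Lowering once at build time and once per query keeps each lookup a single hash access, which is where the O(1) behaviour comes from.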
Core API
```python
class FrozenTaxonomyRegistry:
    def __init__(self, scrape_result: ScrapeResult, knowledge_maps: dict | None = None):
        """Build registry with optional injected knowledge maps."""

    def is_known_entity(self, name: str, entity_type: str) -> bool:
        """O(1) check if an entity exists in the taxonomy."""

    def resolve_entity(self, name: str, entity_type: str) -> str | None:
        """Resolve an alias to its canonical name. Returns None if unknown."""

    def is_known_doctor(self, name: str) -> bool:
        """Dedicated doctor check (case-insensitive, Dr. prefix handling)."""

    def get_department(self, name: str) -> FrozenDepartment | None:
        """Full department object lookup by name or alias."""

    def is_valid_relationship(self, source: str, target: str, rel_type: str) -> bool:
        """Validate a relationship against taxonomy data + knowledge maps."""

    def summary(self) -> dict[str, int]:
        """Entity counts by type."""
```
Dependency Injection
The knowledge_maps parameter was introduced to make the hidden dependency on medical_knowledge explicit and testable. When None (default), the registry loads knowledge maps from the medical_knowledge package at construction time. When provided, the injected maps are used instead -- enabling unit tests without importing the full knowledge package:
```python
# Production: loads from medical_knowledge
registry = FrozenTaxonomyRegistry(scrape_result)

# Testing: inject custom maps
registry = FrozenTaxonomyRegistry(scrape_result, knowledge_maps={
    "dept_conditions": {"hartfalen": ["Cardiologie"]},
    "dept_treatments": {},
    "dept_exams": {},
})
```
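One way the constructor could implement this default-or-injected behaviour is sketched below. RegistrySketch and load_default_knowledge_maps are hypothetical names; the real loading code in medical_knowledge is not shown in this document.

```python
from __future__ import annotations


def load_default_knowledge_maps() -> dict[str, dict[str, list[str]]]:
    """Hypothetical stand-in for loading maps from the medical_knowledge package."""
    return {"dept_conditions": {}, "dept_treatments": {}, "dept_exams": {}}


class RegistrySketch:
    """Illustrates only the knowledge_maps handling, not the full registry."""

    def __init__(self, knowledge_maps: dict | None = None):
        # Compare against None instead of relying on truthiness, so that an
        # explicitly injected empty dict is respected and not replaced.
        if knowledge_maps is None:
            knowledge_maps = load_default_knowledge_maps()
        self.knowledge_maps = knowledge_maps


injected = {"dept_conditions": {"hartfalen": ["Cardiologie"]}}
```

Checking `is None` matters in tests: an empty injected map is a legitimate way to assert that validation does not silently fall back to production knowledge.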
Singleton Lifecycle
The registry follows a singleton pattern because it represents a global, read-only resource:
```python
# Async initialisation (from database)
registry = await initialize_registry(session, tenant_id)

# Sync initialisation (from ScrapeResult, for testing/scripts)
registry = initialize_registry_from_result(scrape_result)

# Access anywhere
registry = get_registry()  # Returns None if not initialised

# Reset (testing only)
clear_registry()
```
The registry is initialised once at application startup and refreshed only when the taxonomy is re-scraped via the /api/v1/graph/refresh-taxonomy endpoint.
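A module-level singleton of this kind is commonly built around a single private variable. The sketch below reuses the function names from this section, but the bodies are assumptions, and a plain dict stands in for the FrozenTaxonomyRegistry instance.

```python
from __future__ import annotations

# Module-level holder; None means "not initialised yet".
_registry: dict | None = None


def initialize_registry_from_result(scrape_result: object) -> dict:
    """Build the registry once and remember it for later get_registry() calls."""
    global _registry
    # Placeholder for FrozenTaxonomyRegistry(scrape_result).
    _registry = {"source": scrape_result}
    return _registry


def get_registry() -> dict | None:
    """Return the current registry, or None if it has not been initialised."""
    return _registry


def clear_registry() -> None:
    """Drop the current registry (intended for test teardown)."""
    global _registry
    _registry = None
```

Callers must handle the None case, which is why the extraction code guards its filtering step with `if registry:`.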
Database Schema
Three PostgreSQL tables in the app schema (migration 026):
taxonomy_entities
Stores the frozen entity allowlist.
| Column | Type | Notes |
|---|---|---|
| id | UUID (PK) | gen_random_uuid() |
| tenant_id | UUID (FK) | CASCADE delete |
| entity_type | VARCHAR(30) | doctor, department, condition, treatment, examination |
| canonical_name | VARCHAR(300) | NOT NULL |
| aliases | JSONB | Default [] |
| metadata | JSONB | Default {}; stores departments/specialties (doctors), domain_group (depts) |
| source_url | VARCHAR(1000) | Nullable |
| source_site | VARCHAR(200) | e.g., "www.zol.be" or "zol.novation.website" |
| is_active | BOOLEAN | Default true |
Unique constraint: (tenant_id, entity_type, canonical_name) -- enables UPSERT for non-doctor entities.
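To illustrate how a unique constraint of this shape enables an upsert, the sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL (both accept the ON CONFLICT ... DO UPDATE form used here). The table is reduced to four columns, and upsert_entity is a hypothetical helper, not the project's persistence code.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE taxonomy_entities (
        tenant_id TEXT NOT NULL,
        entity_type TEXT NOT NULL,
        canonical_name TEXT NOT NULL,
        aliases TEXT NOT NULL DEFAULT '[]',
        UNIQUE (tenant_id, entity_type, canonical_name)
    )
    """
)

UPSERT_SQL = """
    INSERT INTO taxonomy_entities (tenant_id, entity_type, canonical_name, aliases)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (tenant_id, entity_type, canonical_name)
    DO UPDATE SET aliases = excluded.aliases
"""


def upsert_entity(tenant_id: str, entity_type: str, name: str, aliases: list[str]) -> None:
    """Insert the entity, or overwrite its aliases if the unique key already exists."""
    conn.execute(UPSERT_SQL, (tenant_id, entity_type, name, json.dumps(aliases)))


upsert_entity("tenant-1", "department", "Cardiologie", ["cardio"])
upsert_entity("tenant-1", "department", "Cardiologie", ["cardio", "hartziekten"])
rows = conn.execute("SELECT canonical_name, aliases FROM taxonomy_entities").fetchall()
```

The second call hits the unique key and updates the existing row, so the table still holds a single Cardiologie entry.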
taxonomy_relationships
Stores scraped department-to-entity maps.
| Column | Type | Notes |
|---|---|---|
| tenant_id | UUID (FK) | CASCADE delete |
| source_name / source_type | VARCHAR | e.g., "Cardiologie" / "department" |
| target_name / target_type | VARCHAR | e.g., "Hartfalen" / "condition" |
| relationship_type | VARCHAR(50) | HANDLES, OFFERS, PERFORMS, WORKS_IN_SCHEDULE |
| metadata | JSONB | Schedule slots and campus contacts for WORKS_IN_SCHEDULE |
unknown_entities
Human review queue for entities detected during ingestion but not found in the taxonomy.
| Column | Type | Notes |
|---|---|---|
| name | VARCHAR(300) | Entity name |
| detected_type | VARCHAR(30) | Entity type |
| status | VARCHAR(20) | pending, approved, rejected |
| resolved_to | VARCHAR(300) | Canonical name if approved |
| content_snippet | TEXT | Context where entity was found |
AllowlistFilter
The AllowlistFilter validates extraction results against the registry. Each entity type is assigned one of three filtering modes, and relationships must additionally pass three gates.
Entity Filtering Modes
| Mode | Entity Types | Behaviour |
|---|---|---|
| Passthrough | campus, hospital, facility | Always kept, no taxonomy check |
| Strict | doctor, department | Must exist in taxonomy; unknown entities are rejected and recorded |
| Relaxed | condition, treatment, examination, service | Kept even without taxonomy match; canonical name applied if found |
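The three modes can be sketched as a single dispatch on entity type. This is an illustration, not the AllowlistFilter implementation: filter_entity is a hypothetical helper, its resolve argument mirrors the resolve_entity(name, entity_type) signature from the Core API, and raising on unlisted types is an assumption.

```python
from __future__ import annotations

from typing import Callable

PASSTHROUGH_TYPES = {"campus", "hospital", "facility"}
STRICT_TYPES = {"doctor", "department"}
RELAXED_TYPES = {"condition", "treatment", "examination", "service"}


def filter_entity(
    name: str,
    entity_type: str,
    resolve: Callable[[str, str], "str | None"],
) -> tuple[str | None, bool]:
    """Return (name_to_keep, rejected) for one extracted entity."""
    if entity_type in PASSTHROUGH_TYPES:
        # Always kept; the taxonomy is not consulted.
        return name, False
    canonical = resolve(name, entity_type)
    if entity_type in STRICT_TYPES:
        # Kept only when the taxonomy knows the entity.
        return (canonical, False) if canonical is not None else (None, True)
    if entity_type in RELAXED_TYPES:
        # Kept either way; canonical name applied when available.
        return (canonical if canonical is not None else name), False
    # Assumption: types outside the table are treated as a programming error.
    raise ValueError(f"unhandled entity type: {entity_type}")


known = {("department", "cardiologie"): "Cardiologie"}


def resolve(name: str, entity_type: str) -> str | None:
    return known.get((entity_type, name.lower()))
```

In this sketch a rejected strict entity returns the flag `True`, which is the point where a caller would record it for the unknown_entities review queue.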
Relationship Filtering (Three Gates)
- Source-type gating: WORKS_IN/WORKS_AT_CAMPUS require the source page to be a GOLDEN_SOURCE; HANDLES/OFFERS/PERFORMS require an AUTHORITATIVE_SOURCE
- Endpoint survival: Both the source and target entity names must be in the set of kept entities
- Taxonomy validation: registry.is_valid_relationship() checks the relationship against frozen taxonomy data and fallback domain knowledge maps
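The three gates above can be read as a short-circuiting chain of checks. In the sketch below, Relationship, REQUIRED_SOURCE, and passes_gates are hypothetical names; source types are compared as plain strings, and relationship types not named in the first gate skip it. Whether a GOLDEN_SOURCE page also satisfies the AUTHORITATIVE_SOURCE requirement is not stated here, so the sketch uses strict equality.

```python
from __future__ import annotations

from typing import Callable, NamedTuple


class Relationship(NamedTuple):
    source: str
    target: str
    rel_type: str


# Gate 1 lookup: which source type each relationship type requires.
REQUIRED_SOURCE = {
    "WORKS_IN": "GOLDEN_SOURCE",
    "WORKS_AT_CAMPUS": "GOLDEN_SOURCE",
    "HANDLES": "AUTHORITATIVE_SOURCE",
    "OFFERS": "AUTHORITATIVE_SOURCE",
    "PERFORMS": "AUTHORITATIVE_SOURCE",
}


def passes_gates(
    rel: Relationship,
    page_source_type: str,
    kept_entities: set[str],
    is_valid_relationship: Callable[[str, str, str], bool],
) -> bool:
    """Apply source-type gating, endpoint survival, then taxonomy validation."""
    required = REQUIRED_SOURCE.get(rel.rel_type)
    if required is not None and page_source_type != required:
        return False
    if rel.source not in kept_entities or rel.target not in kept_entities:
        return False
    return is_valid_relationship(rel.source, rel.target, rel.rel_type)


def always_valid(source: str, target: str, rel_type: str) -> bool:
    """Stand-in validator that accepts everything."""
    return True


rel = Relationship("Cardiologie", "Hartfalen", "HANDLES")
kept = {"Cardiologie", "Hartfalen"}
```

Ordering the cheap checks first means the taxonomy validator only runs for relationships whose endpoints both survived entity filtering.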
Scraping Flow
The scraper fetches HTML from hub pages (auto-detected by the LLM classifier from crawled content), parses doctors, departments, conditions, treatments, examinations, and consultation schedules, then deduplicates across both sites. The persistence layer uses a full-replacement strategy: DELETE existing entities for the tenant, then INSERT new ones. This ensures the taxonomy always reflects the latest extraction. The DB is the sole source of truth -- YAML no longer contains hub page URLs.
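A full-replacement write can be sketched as a delete followed by inserts scoped to one tenant. The example uses SQLite in place of PostgreSQL and a reduced three-column table; replace_tenant_entities is a hypothetical helper, and wrapping both statements in one transaction is an assumption about how the persistence layer behaves.

```python
import sqlite3


def replace_tenant_entities(
    conn: sqlite3.Connection,
    tenant_id: str,
    entities: list[tuple[str, str]],
) -> None:
    """Replace all entities of one tenant with the given (entity_type, name) pairs."""
    # The connection context manager commits on success and rolls back on error,
    # so readers never observe the tenant with its rows deleted but not re-inserted.
    with conn:
        conn.execute("DELETE FROM taxonomy_entities WHERE tenant_id = ?", (tenant_id,))
        conn.executemany(
            "INSERT INTO taxonomy_entities (tenant_id, entity_type, canonical_name) "
            "VALUES (?, ?, ?)",
            [(tenant_id, entity_type, name) for entity_type, name in entities],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxonomy_entities (tenant_id TEXT, entity_type TEXT, canonical_name TEXT)"
)
replace_tenant_entities(conn, "tenant-1", [("department", "Cardiologie"), ("department", "Oncologie")])
replace_tenant_entities(conn, "tenant-2", [("department", "Neurologie")])
# Re-running for tenant-1 drops Oncologie and leaves tenant-2 untouched.
replace_tenant_entities(conn, "tenant-1", [("department", "Cardiologie")])
remaining = sorted(
    conn.execute("SELECT tenant_id, canonical_name FROM taxonomy_entities").fetchall()
)
```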
Integration Points
The registry integrates at three pipeline stages:
1. Extraction Time
Applied at the end of extract_medical_entities() in medical_extraction.py:
```python
registry = get_registry()
if registry:
    filter_result = AllowlistFilter(registry).filter(
        result, source_url, source_page_type=source_page_type.value,
    )
    result = filter_result.filtered
```
2. Graph Seeding
The GoldenPageSeeder reads the registry to build a MedicalExtractionResult that combines scraped entities with domain knowledge maps. See Seeding Pipeline for details.
3. Query-Time Resolution
The resolve_search_query() function in zol_taxonomy.py uses alias maps built from the hospital config YAML. These maps are enriched by registry data when available, enabling patient-friendly search terms ("hartdokter") to resolve to canonical department names ("Cardiologie").
Called from:
- query_service.py -- entity extraction from user queries
- rag_service.py -- query enrichment with taxonomy terms
- intent_classification_service.py -- intent routing based on entity type
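The alias enrichment described in this section can be sketched as a dictionary merge followed by a per-token lookup. The helper names enrich_alias_map and resolve_query_terms are hypothetical and do not reflect the signature of resolve_search_query(); whitespace tokenisation and letting YAML entries win on conflict are assumptions.

```python
from __future__ import annotations


def enrich_alias_map(
    yaml_aliases: dict[str, str],
    registry_aliases: dict[str, str],
) -> dict[str, str]:
    """Start from the YAML aliases and add registry aliases that are not yet present."""
    merged = {alias.lower(): canonical for alias, canonical in yaml_aliases.items()}
    for alias, canonical in registry_aliases.items():
        # Assumption: the YAML-defined mapping takes precedence on conflict.
        merged.setdefault(alias.lower(), canonical)
    return merged


def resolve_query_terms(query: str, alias_map: dict[str, str]) -> list[str]:
    """Return canonical names for every whitespace-separated token found in the map."""
    return [alias_map[token] for token in query.lower().split() if token in alias_map]


alias_map = enrich_alias_map(
    {"hartdokter": "Cardiologie"},
    {"Cardio": "Cardiologie", "hartdokter": "Other"},
)
matches = resolve_query_terms("Ik zoek een hartdokter", alias_map)
```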
Loading from Published Tables (SP-5)
The from_published() classmethod (added in SP-5) is the production initialization path. Instead of loading from the legacy ScrapeResult format, it reads directly from the versioned published_entities and published_relationships tables created by the Draft/Publish System.
```python
@classmethod
async def from_published(
    cls,
    session: AsyncSession,
    hospital_id: UUID,
    version: int | None = None,  # None = latest
) -> "FrozenTaxonomyRegistry":
    """Load registry from published tables (SP-5 production path)."""
    if version is None:
        version = await _get_latest_version(session, hospital_id)
    entities = await _load_published_entities(session, hospital_id, version)
    relationships = await _load_published_relationships(session, hospital_id, version)
    return cls._build_registry_from_rows(entities, relationships)
```
_build_registry_from_rows()
This method transforms flat PostgreSQL rows into the frozen dataclasses the registry's O(1) indexes are built from:
- Entities -- rows are grouped by entity_type and converted to FrozenDoctor, FrozenDepartment, FrozenCondition, FrozenTreatment, and FrozenExamination instances
- Aliases -- the aliases JSONB column is unpacked to populate alias maps
- Relationships -- published_relationships rows with HANDLES, OFFERS, PERFORMS types are converted to the _dept_condition_map, _dept_treatment_map, and _dept_examination_map indexes
- Schedules -- WORKS_IN_SCHEDULE relationship rows with metadata payloads are parsed into ConsultationSchedule instances
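The row-to-index transformation can be sketched with plain dictionaries. The helper group_rows is hypothetical, rows are modelled as dicts, and the column names are assumed to mirror taxonomy_entities and taxonomy_relationships; the real method builds frozen dataclasses and also parses schedules, which this sketch omits.

```python
from __future__ import annotations

from collections import defaultdict

# Relationship type -> which department index it feeds.
REL_TO_INDEX = {"HANDLES": "conditions", "OFFERS": "treatments", "PERFORMS": "examinations"}


def group_rows(entity_rows: list[dict], relationship_rows: list[dict]):
    """Group flat rows into per-type name lists, alias maps, and department maps."""
    names_by_type: dict[str, list[str]] = defaultdict(list)
    alias_maps: dict[str, dict[str, str]] = defaultdict(dict)
    for row in entity_rows:
        entity_type, canonical = row["entity_type"], row["canonical_name"]
        names_by_type[entity_type].append(canonical)
        alias_maps[entity_type][canonical.lower()] = canonical
        for alias in row.get("aliases") or []:
            alias_maps[entity_type][alias.lower()] = canonical

    dept_maps: dict[str, dict[str, list[str]]] = {
        index: defaultdict(list) for index in REL_TO_INDEX.values()
    }
    for row in relationship_rows:
        index = REL_TO_INDEX.get(row["relationship_type"])
        if index is not None:
            dept_maps[index][row["source_name"].lower()].append(row["target_name"])
    return names_by_type, alias_maps, dept_maps


names, aliases, depts = group_rows(
    [
        {"entity_type": "department", "canonical_name": "Cardiologie", "aliases": ["Cardio"]},
        {"entity_type": "condition", "canonical_name": "Hartfalen", "aliases": None},
    ],
    [
        {"relationship_type": "HANDLES", "source_name": "Cardiologie", "target_name": "Hartfalen"},
        {"relationship_type": "WORKS_IN_SCHEDULE", "source_name": "X", "target_name": "Y"},
    ],
)
```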
Version-Check Cache Invalidation
The registry singleton polls taxonomy_versions every 60 seconds to detect new publishes:
```python
async def _version_check_loop(hospital_id: UUID) -> None:
    """Background task: reload registry when a new version is published."""
    while True:
        await asyncio.sleep(60)
        latest = await _get_latest_version(session, hospital_id)
        if latest > _registry_version:
            await _reload_registry(hospital_id, latest)
```
force_registry_rebuild()
force_registry_rebuild(hospital_id) bypasses the 60-second poll interval and triggers an immediate reload. It is called automatically by PublishService after every publish and rollback operation, ensuring the change is live in search within milliseconds rather than up to 60 seconds.
```python
# Called by PublishService after successful publish or rollback
await force_registry_rebuild(hospital_id=hospital.id)
```
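The interplay between the periodic poll and the forced rebuild can be sketched as one shared refresh routine with two callers. VersionedCache and its method names are hypothetical; the lock is an assumption added so that a poll and a forced rebuild cannot reload concurrently.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable


class VersionedCache:
    """Holds a value tagged with a version and reloads it when a newer one exists."""

    def __init__(
        self,
        fetch_latest_version: Callable[[], Awaitable[int]],
        load_version: Callable[[int], Awaitable[object]],
    ) -> None:
        self._fetch = fetch_latest_version
        self._load = load_version
        self.version = 0
        self.value: object | None = None
        self._lock = asyncio.Lock()

    async def refresh_if_stale(self) -> bool:
        """Reload if a newer version is available. Callable directly for a forced rebuild."""
        async with self._lock:
            latest = await self._fetch()
            if latest > self.version:
                self.value = await self._load(latest)
                self.version = latest
                return True
            return False

    async def poll(self, interval: float = 60.0) -> None:
        """Background loop: the same check, run once per interval."""
        while True:
            await asyncio.sleep(interval)
            await self.refresh_if_stale()


async def _demo() -> tuple[bool, bool, bool, object]:
    published = {"version": 1}

    async def fetch() -> int:
        return published["version"]

    async def load(version: int) -> str:
        return f"registry-v{version}"

    cache = VersionedCache(fetch, load)
    first = await cache.refresh_if_stale()   # initial load
    second = await cache.refresh_if_stale()  # nothing new
    published["version"] = 2                 # simulate a publish
    third = await cache.refresh_if_stale()   # forced rebuild picks it up
    return first, second, third, cache.value


result = asyncio.run(_demo())
```

Routing both paths through one method keeps the version comparison in a single place, so a forced rebuild that finds nothing new is a cheap no-op.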
Backward Compatibility
The ScrapeResult-based initialization path (initialize_registry_from_result()) remains fully functional. It is used in:
- Unit tests -- inject a ScrapeResult directly without a database
- Legacy scripts -- seed_golden_graph.py and taxonomy scraper scripts that predate SP-5
- Development -- quick local initialization without a published version
When a hospital has no published version yet (e.g., a new tenant mid-setup), the registry falls back to loading from taxonomy_entities directly (draft data) until the first publish completes.
Hospitals that were initialized via scrape_taxonomy + seed_golden_graph before SP-5 will continue to work without modification. The first explicit publish operation creates the published_entities snapshot and switches the registry to the from_published() path.