Taxonomy Extraction Pipeline

Overview

The taxonomy extraction pipeline automatically discovers and extracts medical entities (doctors, departments, conditions, treatments, examinations) from a hospital website. It replaces manual entity creation with an AI-driven 5-stage pipeline.

Pipeline Stages

Stage 1: Data Management

Purpose: Configure the hospital tenant — campuses, websites, pipeline settings.

Components:

Hospital registration (name, domain, campuses)
Website URL configuration
Pipeline budget settings (max_extraction_cost_usd)
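
The budget setting can be pictured as a simple guard around each paid LLM call. The sketch below is only an illustration: the setting name max_extraction_cost_usd comes from this page, while the class, method, and exception names (ExtractionBudget, charge, BudgetExceeded) are hypothetical, and it assumes the pipeline knows the cost of a call before making it.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spending past the configured cap."""


@dataclass
class ExtractionBudget:
    # Setting name taken from this page; everything else here is hypothetical.
    max_extraction_cost_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record a cost, refusing it if the cap would be exceeded."""
        if self.spent_usd + cost_usd > self.max_extraction_cost_usd:
            raise BudgetExceeded(
                f"charge of ${cost_usd:.2f} exceeds the remaining "
                f"${self.remaining_usd:.2f}"
            )
        self.spent_usd += cost_usd

    @property
    def remaining_usd(self) -> float:
        return self.max_extraction_cost_usd - self.spent_usd
```

Checking before spending, rather than after, keeps the recorded total at or below the cap even when the final call is expensive.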

Stage 2: Crawl & Ingest

Purpose: Discover all URLs on the hospital website and ingest their content.

Components:

Sitemap-based URL discovery (crawl_service.py)
Content extraction and chunking (processing_service.py)
Contextual embeddings + canonical questions (ADR-0019)
Dashboard: processed, discovered, empty, dead URL counts + content type breakdown

Stage 3: Hub Detection

Purpose: Find listing pages (hubs) that contain entity links — e.g., a "Onze artsen" page listing all doctors.

AI Classification Pipeline:

Pre-filter (candidate_filter.py): Score URLs by keyword matching (30 Dutch medical keywords). Threshold: score >= 2.0
Content retrieval: Fetch up to 3 chunks per candidate from document store
AI classifier (ai_classifier.py): LLM evaluates each candidate — is it a hub? What entity types? What's the child URL pattern?
URL inference fallback: For JS-rendered pages with no content, infer entity types from URL patterns

Output: Confirmed hub pages with entity types and child URL patterns.
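
The pre-filter step can be sketched as a keyword score compared against the 2.0 threshold. This is not the contents of candidate_filter.py: the keywords and weights below are a hypothetical subset (the real list has 30 Dutch medical keywords), and scoring on the URL path plus page title is an assumption.

```python
import re
from urllib.parse import urlparse

# Hypothetical keywords and weights, for illustration only.
KEYWORD_WEIGHTS = {
    "artsen": 2.0,
    "specialisme": 1.0,
    "behandeling": 1.0,
    "onderzoek": 1.0,
    "ziektebeeld": 1.0,
}
THRESHOLD = 2.0  # from this page: score >= 2.0


def score_url(url: str, title: str = "") -> float:
    """Sum the weights of keywords that appear as whole tokens."""
    text = f"{urlparse(url).path} {title}".lower()
    # Split on anything that is not a letter or digit; each keyword counts once.
    tokens = set(re.split(r"[^a-z0-9]+", text))
    return sum(w for kw, w in KEYWORD_WEIGHTS.items() if kw in tokens)


def is_candidate(url: str, title: str = "") -> bool:
    """Only candidates at or above the threshold go on to the AI classifier."""
    return score_url(url, title) >= THRESHOLD
```

A cheap filter like this keeps the number of LLM classification calls proportional to the number of plausible hubs rather than to the size of the site.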

Stage 4: Entity Extraction

Purpose: Extract entities from each hub's child pages using LLM.

Extraction Flow:

For each confirmed hub, resolve child URLs (pattern matching or prefix fallback)
LLM extracts entities from page content (name, type, confidence, relationships)
Resolution pipeline: dedup by name normalization + SNOMED CT matching
Auto-approval: entities with confidence >= 0.90 AND SNOMED match are auto-approved
Fuzzy dedup scan (SP-7): Levenshtein distance to find near-duplicates

Entity Types:

Type	Source	Example
DOCTOR	/zol-arts/* pages	Dr. Wilfried Mullens
DEPARTMENT	Implicit from relationships	Cardiologie
CONDITION	/ziektebeeld/* pages	Epilepsie
TREATMENT	/behandeling/* pages	Hartrevalidatie
EXAMINATION	/onderzoek/* pages	Echografie
CAMPUS	Authoritative campuses table	Campus Sint-Jan

Relationship Types:

Type	Source	Target
WORKS_IN	DOCTOR	DEPARTMENT
LOCATED_AT	DOCTOR	CAMPUS
TREATS	DEPARTMENT	CONDITION
OFFERS	DEPARTMENT	TREATMENT
PERFORMS	DEPARTMENT	EXAMINATION
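
The two tables above amount to a small schema. One way to encode it, shown here as a sketch, is a lookup that rejects relationships whose source or target type does not match; the names EntityType, RELATIONSHIP_SCHEMA, and is_valid_relationship are hypothetical and not taken from the codebase.

```python
from enum import Enum


class EntityType(str, Enum):
    DOCTOR = "DOCTOR"
    DEPARTMENT = "DEPARTMENT"
    CONDITION = "CONDITION"
    TREATMENT = "TREATMENT"
    EXAMINATION = "EXAMINATION"
    CAMPUS = "CAMPUS"


# Relationship type -> (source type, target type), copied from the table above.
RELATIONSHIP_SCHEMA = {
    "WORKS_IN": (EntityType.DOCTOR, EntityType.DEPARTMENT),
    "LOCATED_AT": (EntityType.DOCTOR, EntityType.CAMPUS),
    "TREATS": (EntityType.DEPARTMENT, EntityType.CONDITION),
    "OFFERS": (EntityType.DEPARTMENT, EntityType.TREATMENT),
    "PERFORMS": (EntityType.DEPARTMENT, EntityType.EXAMINATION),
}


def is_valid_relationship(
    rel_type: str, source: EntityType, target: EntityType
) -> bool:
    """Return True only if rel_type allows this source/target pair."""
    return RELATIONSHIP_SCHEMA.get(rel_type) == (source, target)
```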

Stage 5: Review & Publish

Purpose: Operator reviews extracted entities, approves/rejects/merges, then publishes.

Actions:

Approve (auto or manual)
Reject (with reason)
Merge duplicates (fuzzy dedup candidates)
Edit (canonical name, aliases)
Publish as new taxonomy version

Entity Resolution and Deduplication

After raw entity extraction (Stage 4), entities pass through a multi-step resolution pipeline before they appear in the review table. This pipeline eliminates duplicates, links to medical terminology standards, and determines which entities can be auto-approved.

Step 1: Name Normalization

Every extracted entity name is normalized to produce a dedup key:

"Dr. Wilfried Mullens" → "dr_wilfried_mullens"
"Cardiologie" → "cardiologie"
"Hartrevalidatie (cardiale revalidatie)" → "hartrevalidatie_cardiale_revalidatie"

Normalization rules:

Lowercase
Replace non-alphanumeric characters with underscores
Strip leading/trailing whitespace
Collapse multiple underscores

Entities with identical dedup keys within the same type are automatically merged (the highest-confidence version wins).
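
The normalization rules can be sketched in a few lines. The function name dedup_key is hypothetical. Two details are assumptions: the third example above implies that leading and trailing underscores are stripped as well as whitespace, and "alphanumeric" is read here as ASCII letters and digits, so accented characters would also become underscores.

```python
import re


def dedup_key(name: str) -> str:
    """Normalize an entity name into a dedup key."""
    key = name.strip().lower()
    # Assumption: only ASCII letters and digits count as alphanumeric.
    key = re.sub(r"[^a-z0-9]", "_", key)
    # Collapse runs of underscores into one.
    key = re.sub(r"_+", "_", key)
    # Assumption: edge underscores are dropped, as the third example implies.
    return key.strip("_")
```

Applied to the three examples above, this produces the same keys as shown.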

Step 2: Cluster-Based Deduplication

Within each entity type, entities are grouped into clusters by dedup key. Statistics from the latest extraction (March 2026):

Entity Type	Raw Extracted	After Dedup	Duplicates Merged
CONDITION	871	712	159
DOCTOR	774	387	387
EXAMINATION	883	268	615
TREATMENT	786	692	94
Total	3,314	2,059	1,255

The high doctor duplicate count (387 merged from 774) occurs because the same doctors appear on multiple hub pages (e.g., /zol-artsen-0 and /zol-artsen-specialisme).
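
Clustering by dedup key within a type can be sketched as a group-by followed by picking the highest-confidence member of each group. The RawEntity record and merge_clusters function are hypothetical names, and the sketch assumes each entity already carries its dedup key.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RawEntity:
    name: str
    entity_type: str
    dedup_key: str
    confidence: float


def merge_clusters(entities: list) -> list:
    """Keep one entity per (type, dedup key); the highest confidence wins."""
    clusters = defaultdict(list)
    for entity in entities:
        # The type is part of the key, so equal names of different
        # types are never merged with each other.
        clusters[(entity.entity_type, entity.dedup_key)].append(entity)
    return [max(group, key=lambda e: e.confidence) for group in clusters.values()]
```

A doctor extracted from two hub pages ends up in one cluster and is kept once, which is the effect behind the 387 merged doctor records.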

Step 3: SNOMED CT Matching

Each deduplicated entity is matched against SNOMED CT (529K Dutch medical terms) using a tiered strategy:

Tier	Method	Example
Tier 1: Exact match	Entity name matches SNOMED description exactly	"Epilepsie" matches SNOMED concept 84757009
Tier 2: Normalized match	Lowercased, stripped name matches	"epilepsie" matches "Epilepsie"
Tier 3: Fuzzy match	Best Levenshtein similarity above threshold (0.85)	"cardiale revalidatie" matches "Cardiale revalidatie" at 0.95

When a SNOMED match is found:

snomed_concept_id is stored on the entity
snomed_preferred_term is stored for display
snomed_match_confidence (0.0-1.0) records match quality
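
The tiered strategy can be sketched as three lookups tried in order. Several things here are assumptions rather than the actual implementation: the in-memory descriptions mapping (description text to concept ID), a similarity defined as 1 minus Levenshtein distance divided by the longer length, an inclusive threshold, a confidence of 1.0 for tiers 1 and 2, and the use of the matched description as the display term. Only the three stored field names come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SnomedMatch:
    snomed_concept_id: str
    snomed_preferred_term: str
    snomed_match_confidence: float


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_snomed(
    name: str, descriptions: dict, threshold: float = 0.85
) -> Optional[SnomedMatch]:
    # Tier 1: exact match on the description text.
    if name in descriptions:
        return SnomedMatch(descriptions[name], name, 1.0)
    # Tier 2: match after lowercasing and stripping both sides.
    norm = name.strip().lower()
    for term, concept_id in descriptions.items():
        if term.strip().lower() == norm:
            return SnomedMatch(concept_id, term, 1.0)
    # Tier 3: best fuzzy match at or above the threshold (inclusive bound
    # is an assumption).
    best = None
    for term, concept_id in descriptions.items():
        score = similarity(norm, term.strip().lower())
        if score >= threshold and (
            best is None or score > best.snomed_match_confidence
        ):
            best = SnomedMatch(concept_id, term, score)
    return best
```

A linear scan like this would be too slow against 529K terms; it is meant only to show the order of the tiers, and a real system would presumably use an index for the first two tiers and a candidate shortlist for the third.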

Step 4: Auto-Approval

Entities meeting both criteria are automatically approved without operator review:

AI confidence >= 0.90 — the extraction LLM was confident
SNOMED match found — the entity maps to a recognized medical concept

Entities that don't meet both criteria are set to proposed status and require manual operator approval.
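
The rule reduces to a single conjunction. In the sketch below, the function name initial_status is hypothetical, and a missing SNOMED match is assumed to be represented as None; the status strings and the 0.90 threshold come from this page.

```python
from typing import Optional

AUTO_APPROVE_MIN_CONFIDENCE = 0.90  # threshold from this page


def initial_status(ai_confidence: float, snomed_concept_id: Optional[str]) -> str:
    """Return 'approved' only when both auto-approval criteria hold."""
    has_snomed_match = snomed_concept_id is not None
    if ai_confidence >= AUTO_APPROVE_MIN_CONFIDENCE and has_snomed_match:
        return "approved"
    return "proposed"
```

Under this rule, entity types that rarely map to a SNOMED concept, such as individual doctors, would always land in proposed status regardless of extraction confidence.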

Step 5: Fuzzy Dedup Scan (SP-7)

After initial dedup and SNOMED matching, a second pass uses LLM-assisted fuzzy deduplication to find near-duplicates that name normalization missed.

Examples of fuzzy matches found:

"Lumbale discushernia" and "lumbale discushernia's" (plural variation)
"Schildklierkanker" and "schildklierlijden" (related but distinct -- correctly dismissed)
"ZOL Genk, campus Sint-Barbara" and "Sint-Barbara Lanaken" (same campus, different names)

The fuzzy scan produces merge candidates that operators review in the UI. Each candidate shows:

The proposed canonical name
All variant names
Confidence score
Option to merge or dismiss

Step 6: Operator Review

The taxonomy review UI (Stage 5) presents all entities for operator action:

Action	Effect
Approve	Entity moves to `approved` status, eligible for publishing
Reject	Entity moves to `rejected` status, excluded from taxonomy
Merge	Multiple entities combined into one, aliases preserved
Edit	Canonical name, aliases, or entity type modified

Step 7: Publish

When the operator is satisfied, they publish the taxonomy as a new version. Publishing:

Creates a snapshot in published_entities and published_relationships
Records the version number and timestamp
Runs the auto-linker (Step 8 below) to fill relationship gaps
Makes the taxonomy available to the RAG query pipeline
The taxonomy is immediately used for query enrichment and post-answer doctor listing

Step 8: Auto-Linker (Post-Publish)

RelationshipAutoLinker runs automatically at the end of every publish operation. It uses an LLM to assign orphaned entities (conditions, treatments, and examinations with no department relationship) to the most appropriate department, then inserts the missing TREATS or PERFORMS relationships directly into published_relationships.

This step is non-fatal: if the LLM call fails, the publish succeeds and the gap remains (resolvable via the on-demand admin endpoint). When links are created, the count appears in the publish response as AUTO_LINKED.
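
The non-fatal behavior can be sketched as a try/except around the linker call. The function publish_taxonomy, its two callable parameters, and the shape of the returned dictionary are hypothetical; only the AUTO_LINKED key and the rule that a linker failure must not fail the publish come from this page.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def publish_taxonomy(
    create_snapshot: Callable[[], int],
    run_auto_linker: Callable[[int], int],
) -> dict:
    """Publish a new version, then try to fill relationship gaps."""
    # The snapshot is created first, so it exists whatever the linker does.
    version = create_snapshot()
    auto_linked = 0
    try:
        auto_linked = run_auto_linker(version)
    except Exception:
        # Non-fatal: log the failure and leave the gap for the admin endpoint.
        logger.exception("auto-linker failed for version %s", version)
    return {"version": version, "AUTO_LINKED": auto_linked}
```

Catching broadly here is deliberate in the sketch, since any linker failure (timeout, malformed LLM output) should leave the published version intact.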

See Multi-Tenancy & Hospital-Agnostic Architecture for full details on the auto-linker design.

Resolution Pipeline Architecture

Key Design Decisions

CAMPUS entities come from the authoritative campuses table, not AI extraction (avoids duplicates from external hospitals mentioned in content)
SNOMED CT integration provides Dutch synonym resolution at both extraction time (entity matching) and query time (search expansion)
Child URL patterns use SQL LIKE with % wildcards, converted from AI-detected {slug} format
Cost budget prevents runaway extraction costs (configurable per hospital)
Composite quality gate in evaluation handles the known faithfulness scoring limitation for taxonomy-enriched answers
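
The child URL pattern conversion can be sketched as a string rewrite followed by a LIKE query. The function names, the documents table, and its url column are hypothetical, and SQLite stands in for whatever database the pipeline actually uses. Escaping literal % and _ characters is an added precaution, not something this page describes.

```python
import sqlite3


def slug_pattern_to_like(pattern: str, escape: str = "\\") -> str:
    """Convert an AI-detected pattern such as '/zol-arts/{slug}' to LIKE syntax."""
    # Escape characters that LIKE would otherwise treat as wildcards.
    out = pattern.replace(escape, escape + escape)
    out = out.replace("%", escape + "%").replace("_", escape + "_")
    # The {slug} placeholder becomes the multi-character wildcard.
    return out.replace("{slug}", "%")


def find_child_urls(conn: sqlite3.Connection, pattern: str) -> list:
    """Return stored URLs matching the hub's child pattern, sorted."""
    like = slug_pattern_to_like(pattern)
    rows = conn.execute(
        "SELECT url FROM documents WHERE url LIKE ? ESCAPE '\\' ORDER BY url",
        (like,),
    )
    return [row[0] for row in rows]
```

Note that % also matches slashes, so '/zol-arts/%' would match deeper paths such as '/zol-arts/a/b' as well as direct children.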

Related Pages

Entity Resolution: Dedup and SNOMED matching details
Query Enrichment: How taxonomy data improves search at query time
Medical Knowledge Architecture: Entity type definitions and relationships
