Taxonomy Extraction Pipeline
Overview
The taxonomy extraction pipeline automatically discovers and extracts medical entities (doctors, departments, conditions, treatments, examinations) from a hospital website. It replaces the manual entity creation process with an AI-driven 5-stage pipeline.
Pipeline Stages
Stage 1: Data Management
Purpose: Configure the hospital tenant — campuses, websites, pipeline settings.
Components:
- Hospital registration (name, domain, campuses)
- Website URL configuration
- Pipeline budget settings (`max_extraction_cost_usd`)
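As an illustration of how a budget setting such as max_extraction_cost_usd might be enforced, the sketch below tracks spending and refuses a charge that would exceed the limit. The class and method names are hypothetical; only the setting name comes from this page.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when the next charge would push spending past the budget."""


@dataclass
class ExtractionBudget:
    # Hypothetical wrapper; only the setting name is taken from this page.
    max_extraction_cost_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record the cost of one LLM call, refusing it if over budget."""
        if self.spent_usd + cost_usd > self.max_extraction_cost_usd:
            raise BudgetExceeded(
                f"budget of {self.max_extraction_cost_usd:.2f} USD would be exceeded"
            )
        self.spent_usd += cost_usd
```

Checking before adding the cost means a rejected call leaves the running total unchanged, so the pipeline can stop cleanly and report what was actually spent.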
Stage 2: Crawl & Ingest
Purpose: Discover all URLs on the hospital website and ingest their content.
Components:
- Sitemap-based URL discovery (`crawl_service.py`)
- Content extraction and chunking (`processing_service.py`)
- Contextual embeddings + canonical questions (ADR-0019)
- Dashboard: processed, discovered, empty, dead URL counts + content type breakdown
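Sitemap-based discovery can be sketched with the standard library alone. This is a minimal illustration, not the code in crawl_service.py; the function name is hypothetical and fetching the sitemap over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]
```

Because the search is for any `<loc>` element, the same function also returns the child sitemap URLs of a sitemap index file, which a crawler would then fetch and parse in turn.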
Stage 3: Hub Detection
Purpose: Find listing pages (hubs) that contain entity links, e.g., an "Onze artsen" page listing all doctors.
AI Classification Pipeline:
- Pre-filter (`candidate_filter.py`): Score URLs by keyword matching (30 Dutch medical keywords). Threshold: score >= 2.0
- Content retrieval: Fetch up to 3 chunks per candidate from document store
- AI classifier (`ai_classifier.py`): LLM evaluates each candidate: is it a hub? What entity types? What's the child URL pattern?
- URL inference fallback: For JS-rendered pages with no content, infer entity types from URL patterns
Output: Confirmed hub pages with entity types and child URL patterns.
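The pre-filter step can be pictured as a weighted keyword score compared against the 2.0 threshold. The sketch below is an assumption about its shape: the keywords and weights are invented for illustration, since the actual list of 30 keywords is not given on this page.

```python
# Hypothetical keyword weights; the real filter uses 30 Dutch medical
# keywords whose values are not listed on this page.
KEYWORD_WEIGHTS = {
    "artsen": 2.0,
    "specialisme": 1.0,
    "behandeling": 1.0,
    "onderzoek": 1.0,
    "ziektebeeld": 1.0,
}
HUB_SCORE_THRESHOLD = 2.0  # threshold stated in the pre-filter description


def score_url(url: str) -> float:
    """Sum the weights of all keywords that occur in the lowercased URL."""
    lowered = url.lower()
    return sum(weight for kw, weight in KEYWORD_WEIGHTS.items() if kw in lowered)


def is_hub_candidate(url: str) -> bool:
    """A URL goes on to the AI classifier only if it reaches the threshold."""
    return score_url(url) >= HUB_SCORE_THRESHOLD
```

A cheap filter of this kind keeps the number of LLM classification calls proportional to the number of plausible hubs rather than to the size of the site.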
Stage 4: Entity Extraction
Purpose: Extract entities from each hub's child pages using an LLM.
Extraction Flow:
- For each confirmed hub, resolve child URLs (pattern matching or prefix fallback)
- LLM extracts entities from page content (name, type, confidence, relationships)
- Resolution pipeline: dedup by name normalization + SNOMED CT matching
- Auto-approval: entities with confidence >= 0.90 AND SNOMED match are auto-approved
- Fuzzy dedup scan (SP-7): Levenshtein distance to find near-duplicates
Entity Types:
| Type | Source | Example |
|---|---|---|
| DOCTOR | /zol-arts/* pages | Dr. Wilfried Mullens |
| DEPARTMENT | Implicit from relationships | Cardiologie |
| CONDITION | /ziektebeeld/* pages | Epilepsie |
| TREATMENT | /behandeling/* pages | Hartrevalidatie |
| EXAMINATION | /onderzoek/* pages | Echografie |
| CAMPUS | Authoritative campuses table | Campus Sint-Jan |
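The URL-based sources in the table above suggest how an entity type could be inferred from a path alone, as the Stage 3 URL inference fallback does for pages without content. This is a sketch under that assumption; the function name is hypothetical and the real fallback may use richer rules.

```python
from typing import Optional
from urllib.parse import urlparse

# Path prefixes taken from the entity type table above.
PREFIX_TO_TYPE = {
    "/zol-arts/": "DOCTOR",
    "/ziektebeeld/": "CONDITION",
    "/behandeling/": "TREATMENT",
    "/onderzoek/": "EXAMINATION",
}


def infer_entity_type(url: str) -> Optional[str]:
    """Return the entity type implied by the URL path, or None if unknown."""
    path = urlparse(url).path
    for prefix, entity_type in PREFIX_TO_TYPE.items():
        if path.startswith(prefix):
            return entity_type
    return None
```

DEPARTMENT and CAMPUS are deliberately absent from the mapping, since the table lists them as coming from relationships and the campuses table rather than from URL patterns.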
Relationship Types:
| Type | Source | Target |
|---|---|---|
| WORKS_IN | DOCTOR | DEPARTMENT |
| LOCATED_AT | DOCTOR | CAMPUS |
| TREATS | DEPARTMENT | CONDITION |
| OFFERS | DEPARTMENT | TREATMENT |
| PERFORMS | DEPARTMENT | EXAMINATION |
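The relationship table can be encoded as a small schema and used to reject relationships that connect the wrong entity types. The names below are hypothetical and the check is only an illustration of the constraint the table expresses.

```python
# Allowed (source type, target type) pair per relationship, from the table above.
RELATIONSHIP_SCHEMA = {
    "WORKS_IN": ("DOCTOR", "DEPARTMENT"),
    "LOCATED_AT": ("DOCTOR", "CAMPUS"),
    "TREATS": ("DEPARTMENT", "CONDITION"),
    "OFFERS": ("DEPARTMENT", "TREATMENT"),
    "PERFORMS": ("DEPARTMENT", "EXAMINATION"),
}


def is_valid_relationship(rel_type: str, source_type: str, target_type: str) -> bool:
    """Check that a relationship connects the entity types the schema allows."""
    return RELATIONSHIP_SCHEMA.get(rel_type) == (source_type, target_type)
```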
Stage 5: Review & Publish
Purpose: Operator reviews extracted entities, approves/rejects/merges, then publishes.
Actions:
- Approve (auto or manual)
- Reject (with reason)
- Merge duplicates (fuzzy dedup candidates)
- Edit (canonical name, aliases)
- Publish as new taxonomy version
Entity Resolution and Deduplication
After raw entity extraction (Stage 4), entities pass through a multi-step resolution pipeline before they appear in the review table. This pipeline eliminates duplicates, links to medical terminology standards, and determines which entities can be auto-approved.
Step 1: Name Normalization
Every extracted entity name is normalized to produce a dedup key:
"Dr. Wilfried Mullens" → "dr_wilfried_mullens"
"Cardiologie" → "cardiologie"
"Hartrevalidatie (cardiale revalidatie)" → "hartrevalidatie_cardiale_revalidatie"
Normalization rules:
- Lowercase
- Replace non-alphanumeric characters with underscores
- Strip leading/trailing whitespace
- Collapse multiple underscores
Entities with identical dedup keys within the same type are automatically merged (the highest-confidence version wins).
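A minimal sketch of the normalization rules follows. It assumes that leading and trailing underscores are also removed, which the listed rules do not state but which the third example above requires; the function name is hypothetical.

```python
import re


def dedup_key(name: str) -> str:
    """Normalize an entity name into a dedup key."""
    key = name.strip().lower()
    # Replace each run of non-alphanumeric characters with one underscore,
    # which covers both the "replace" and the "collapse" rule.
    key = re.sub(r"[\W_]+", "_", key)
    # Assumption: edge underscores are dropped so that a trailing ")"
    # does not leave a dangling "_".
    return key.strip("_")
```

The pattern `[\W_]+` is Unicode-aware in Python 3, so accented letters in Dutch names are kept rather than turned into underscores.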
Step 2: Cluster-Based Deduplication
Within each entity type, entities are grouped into clusters by dedup key. Statistics from the latest extraction (March 2026):
| Entity Type | Raw Extracted | After Dedup | Duplicates Merged |
|---|---|---|---|
| CONDITION | 871 | 712 | 159 |
| DOCTOR | 774 | 387 | 387 |
| EXAMINATION | 883 | 268 | 615 |
| TREATMENT | 786 | 692 | 94 |
| Total | 3,314 | 2,059 | 1,255 |
The high doctor duplicate count (387 merged from 774) occurs because the same doctors appear on multiple hub pages (e.g., /zol-artsen-0 and /zol-artsen-specialisme).
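The clustering step amounts to grouping by (entity type, dedup key) and keeping the highest-confidence member of each group. The sketch below uses a hypothetical, stripped-down entity record; alias preservation and other merge details are omitted.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    # Hypothetical minimal record; the real entity model has more fields.
    name: str
    entity_type: str
    dedup_key: str
    confidence: float


def merge_duplicates(entities: list[Entity]) -> list[Entity]:
    """Keep the highest-confidence entity per (type, dedup key) cluster."""
    best: dict[tuple[str, str], Entity] = {}
    for entity in entities:
        cluster = (entity.entity_type, entity.dedup_key)
        current = best.get(cluster)
        if current is None or entity.confidence > current.confidence:
            best[cluster] = entity
    return list(best.values())
```

Keying on the type as well as the dedup key means a doctor and a department that happen to normalize to the same string are never merged with each other.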
Step 3: SNOMED CT Matching
Each deduplicated entity is matched against SNOMED CT (529K Dutch medical terms) using a tiered strategy:
| Tier | Method | Example |
|---|---|---|
| Tier 1: Exact match | Entity name matches SNOMED description exactly | "Epilepsie" matches SNOMED concept 84757009 |
| Tier 2: Normalized match | Lowercased, stripped name matches | "epilepsie" matches "Epilepsie" |
| Tier 3: Fuzzy match | Best Levenshtein match above threshold (0.85) | "cardiale revalidatie" matches "Cardiale revalidatie" at 0.95 |
When a SNOMED match is found:
- `snomed_concept_id` is stored on the entity
- `snomed_preferred_term` is stored for display
- `snomed_match_confidence` (0.0-1.0) records match quality
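The tiered strategy can be sketched as three checks tried in order. Several details here are assumptions rather than documented behavior: the edit distance is rescaled to 0.0-1.0 by dividing by the longer string's length, exact and normalized matches are given a confidence of 1.0, and the term dictionary is scanned linearly, which would be too slow for 529K terms without an index.

```python
from typing import Optional

FUZZY_THRESHOLD = 0.85  # Tier 3 threshold from the table above


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to 0.0-1.0 (an assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_snomed(name: str, terms: dict[str, str]) -> Optional[tuple[str, str, float]]:
    """Return (concept_id, tier, confidence) for the first tier that matches.

    `terms` maps a SNOMED description to its concept id.
    """
    if name in terms:  # Tier 1: exact match
        return terms[name], "exact", 1.0
    normalized = {t.strip().lower(): cid for t, cid in terms.items()}
    key = name.strip().lower()
    if key in normalized:  # Tier 2: normalized match
        return normalized[key], "normalized", 1.0
    if not normalized:
        return None
    # Tier 3: best fuzzy match, accepted only above the threshold
    best = max(normalized, key=lambda t: similarity(key, t))
    score = similarity(key, best)
    if score >= FUZZY_THRESHOLD:
        return normalized[best], "fuzzy", score
    return None
```

Trying the cheap tiers first means the expensive fuzzy comparison only runs for names that have no exact or case-insensitive match.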
Step 4: Auto-Approval
Entities meeting both criteria are automatically approved without operator review:
- AI confidence >= 0.90 — the extraction LLM was confident
- SNOMED match found — the entity maps to a recognized medical concept
Entities that don't meet both criteria are set to proposed status and require manual operator approval.
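The rule reduces to a conjunction of the two criteria. In this sketch the function name is hypothetical, and a missing SNOMED match is represented by None, which is an assumption about the data model.

```python
from typing import Optional

AUTO_APPROVE_CONFIDENCE = 0.90  # AI confidence threshold stated above


def initial_status(ai_confidence: float, snomed_concept_id: Optional[str]) -> str:
    """Return 'approved' only when both auto-approval criteria hold."""
    if ai_confidence >= AUTO_APPROVE_CONFIDENCE and snomed_concept_id is not None:
        return "approved"
    return "proposed"
```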
Step 5: Fuzzy Dedup Scan (SP-7)
After initial dedup and SNOMED matching, a second pass uses LLM-assisted fuzzy deduplication to find near-duplicates that name normalization missed:
Examples of fuzzy matches found:
- "Lumbale discushernia" and "lumbale discushernia's" (plural variation)
- "Schildklierkanker" and "schildklierlijden" (related but distinct -- correctly dismissed)
- "ZOL Genk, campus Sint-Barbara" and "Sint-Barbara Lanaken" (same campus, different names)
The fuzzy scan produces merge candidates that operators review in the UI. Each candidate shows:
- The proposed canonical name
- All variant names
- Confidence score
- Option to merge or dismiss
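To illustrate how merge candidates might be generated, the sketch below compares every pair of names and keeps those above a similarity threshold. It uses the standard library's difflib.SequenceMatcher as a stand-in similarity measure, not the scan's actual scoring; the function name and the 0.9 threshold are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def merge_candidates(
    names: list[str], threshold: float = 0.9
) -> list[tuple[str, str, float]]:
    """Return (name_a, name_b, score) for every pair at or above threshold."""
    candidates = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            candidates.append((a, b, round(score, 2)))
    return candidates
```

Pure string similarity catches variants like the plural example above, but a pair such as "ZOL Genk, campus Sint-Barbara" and "Sint-Barbara Lanaken" shares too little surface text to score highly, which is presumably where the LLM-assisted part of the scan is needed. Pairwise comparison is also quadratic in the number of names, so a real scan would likely compare only within an entity type.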
Step 6: Operator Review
The taxonomy review UI (Stage 5) presents all entities for operator action:
| Action | Effect |
|---|---|
| Approve | Entity moves to approved status, eligible for publishing |
| Reject | Entity moves to rejected status, excluded from taxonomy |
| Merge | Multiple entities combined into one, aliases preserved |
| Edit | Canonical name, aliases, or entity type modified |
Step 7: Publish
When the operator is satisfied, they publish the taxonomy as a new version. Publishing:
- Creates a snapshot in `published_entities` and `published_relationships`
- Records the version number and timestamp
- Runs the auto-linker (Step 8 below) to fill relationship gaps
- Makes the taxonomy available to the RAG query pipeline, which uses it immediately for query enrichment and post-answer doctor listing
Step 8: Auto-Linker (Post-Publish)
RelationshipAutoLinker runs automatically at the end of every publish operation. It finds orphaned entities (conditions, treatments, and examinations with no department relationship), uses an LLM to assign each one to the most appropriate department, and then inserts the missing TREATS, OFFERS, or PERFORMS relationships directly into published_relationships.
This step is non-fatal: if the LLM call fails, the publish succeeds and the gap remains (resolvable via the on-demand admin endpoint). When links are created, the count appears in the publish response as AUTO_LINKED.
See Multi-Tenancy & Hospital-Agnostic Architecture for full details on the auto-linker design.
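The non-fatal behavior can be sketched as a wrapper that catches any failure from the classification call and returns no links instead of raising. This is not the RelationshipAutoLinker implementation; the function name and the `classify` callable standing in for the LLM call are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def run_auto_linker(
    orphans: list[str],
    classify: Callable[[list[str]], dict[str, str]],
) -> dict[str, str]:
    """Map orphaned entities to departments without ever failing the publish.

    `classify` stands in for the LLM call and returns {entity: department}.
    """
    if not orphans:
        return {}
    try:
        return classify(orphans)
    except Exception:
        # Non-fatal by design: log and leave the gap for the admin endpoint.
        logger.exception("auto-linker failed; publish continues without new links")
        return {}
```

The caller can report the size of the returned mapping as the number of links created, with zero covering both "nothing to link" and "linking failed".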
Resolution Pipeline Architecture
Key Design Decisions
- CAMPUS entities come from the authoritative `campuses` table, not AI extraction (avoids duplicates from external hospitals mentioned in content)
- SNOMED CT integration provides Dutch synonym resolution at both extraction time (entity matching) and query time (search expansion)
- Child URL patterns use SQL LIKE with `%` wildcards, converted from AI-detected `{slug}` format
- Cost budget prevents runaway extraction costs (configurable per hospital)
- Composite quality gate in evaluation handles the known faithfulness scoring limitation for taxonomy-enriched answers
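The conversion from the `{slug}` format to a LIKE pattern can be sketched as a placeholder substitution. The function name is hypothetical, and the escaping of literal `%` and `_` characters is an added precaution that this page does not describe; it only takes effect if the query declares the backslash with an ESCAPE clause.

```python
import re


def slug_pattern_to_like(pattern: str) -> str:
    """Convert '/zol-arts/{slug}' style patterns into SQL LIKE patterns."""
    # Escape LIKE metacharacters already present in the literal parts
    # (assumption: the query uses ESCAPE '\').
    escaped = pattern.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    # Replace every {placeholder} with the LIKE wildcard.
    return re.sub(r"\{[^{}]*\}", "%", escaped)
```

For example, `/zol-arts/{slug}` becomes `/zol-arts/%`, which can be passed as a bound parameter to a `WHERE url LIKE ?` clause.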
Related Pages
- Entity Resolution — Dedup and SNOMED matching details
- Query Enrichment — How taxonomy data improves search at query time
- Medical Knowledge Architecture — Entity type definitions and relationships