
Taxonomy Extraction Pipeline

Overview

The taxonomy extraction pipeline automatically discovers and extracts medical entities (doctors, departments, conditions, treatments, examinations) from a hospital website. It replaces the manual entity creation process with an AI-driven 5-stage pipeline.

Pipeline Stages

Stage 1: Data Management

Purpose: Configure the hospital tenant — campuses, websites, pipeline settings.

Components:

  • Hospital registration (name, domain, campuses)
  • Website URL configuration
  • Pipeline budget settings (max_extraction_cost_usd)
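A minimal sketch of how a per-hospital cost budget could be enforced during extraction. Only the setting name max_extraction_cost_usd comes from this page; BudgetGuard, BudgetExceeded, record, and spent_usd are hypothetical names used for illustration.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a call would push spending past the configured budget."""


@dataclass
class BudgetGuard:
    max_extraction_cost_usd: float  # setting named in the pipeline configuration
    spent_usd: float = 0.0          # hypothetical running total for one run

    def record(self, cost_usd: float) -> None:
        """Add the cost of one LLM call, refusing it if it would exceed the budget."""
        if self.spent_usd + cost_usd > self.max_extraction_cost_usd:
            raise BudgetExceeded(
                f"budget of {self.max_extraction_cost_usd} USD would be exceeded"
            )
        self.spent_usd += cost_usd
```

Checking before adding the cost means a rejected call leaves the running total unchanged, so the run can stop cleanly at the last page that fit within the budget.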

Stage 2: Crawl & Ingest

Purpose: Discover all URLs on the hospital website and ingest their content.

Components:

  • Sitemap-based URL discovery (crawl_service.py)
  • Content extraction and chunking (processing_service.py)
  • Contextual embeddings + canonical questions (ADR-0019)
  • Dashboard: processed, discovered, empty, dead URL counts + content type breakdown

Stage 3: Hub Detection

Purpose: Find listing pages (hubs) that contain entity links — e.g., a "Onze artsen" page listing all doctors.

AI Classification Pipeline:

  1. Pre-filter (candidate_filter.py): Score URLs by keyword matching (30 Dutch medical keywords). Threshold: score >= 2.0
  2. Content retrieval: Fetch up to 3 chunks per candidate from document store
  3. AI classifier (ai_classifier.py): LLM evaluates each candidate — is it a hub? What entity types? What's the child URL pattern?
  4. URL inference fallback: For JS-rendered pages with no content, infer entity types from URL patterns

Output: Confirmed hub pages with entity types and child URL patterns.
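The pre-filter step can be pictured roughly as follows. This is a sketch, not the contents of candidate_filter.py: the keywords and weights below are invented placeholders (the real list has 30 Dutch medical keywords), and only the 2.0 threshold comes from the description above.

```python
from urllib.parse import urlparse

# Hypothetical keyword weights, for illustration only.
KEYWORD_WEIGHTS = {
    "artsen": 2.0,
    "behandelingen": 2.0,
    "onderzoeken": 2.0,
    "ziektebeelden": 2.0,
    "specialismen": 1.5,
    "zorg": 0.5,
}
HUB_SCORE_THRESHOLD = 2.0  # threshold stated in the pre-filter step


def score_url(url: str) -> float:
    """Sum the weights of all keywords that occur in the URL path."""
    path = urlparse(url).path.lower()
    return sum(weight for keyword, weight in KEYWORD_WEIGHTS.items() if keyword in path)


def is_hub_candidate(url: str) -> bool:
    """Only candidates at or above the threshold go on to the AI classifier."""
    return score_url(url) >= HUB_SCORE_THRESHOLD
```

A cheap keyword score like this keeps most non-listing URLs away from the LLM, which is where the per-candidate cost is incurred.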

Stage 4: Entity Extraction

Purpose: Extract entities from each hub's child pages using LLM.

Extraction Flow:

  1. For each confirmed hub, resolve child URLs (pattern matching or prefix fallback)
  2. LLM extracts entities from page content (name, type, confidence, relationships)
  3. Resolution pipeline: dedup by name normalization + SNOMED CT matching
  4. Auto-approval: entities with confidence >= 0.90 AND SNOMED match are auto-approved
  5. Fuzzy dedup scan (SP-7): Levenshtein distance to find near-duplicates

Entity Types:

Type | Source | Example
DOCTOR | /zol-arts/* pages | Dr. Wilfried Mullens
DEPARTMENT | Implicit from relationships | Cardiologie
CONDITION | /ziektebeeld/* pages | Epilepsie
TREATMENT | /behandeling/* pages | Hartrevalidatie
EXAMINATION | /onderzoek/* pages | Echografie
CAMPUS | Authoritative campuses table | Campus Sint-Jan

Relationship Types:

Type | Source | Target
WORKS_IN | DOCTOR | DEPARTMENT
LOCATED_AT | DOCTOR | CAMPUS
TREATS | DEPARTMENT | CONDITION
OFFERS | DEPARTMENT | TREATMENT
PERFORMS | DEPARTMENT | EXAMINATION
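The relationship types form a small schema that can be checked before an extracted relationship is stored. The sketch below encodes the source and target types listed above; the names ALLOWED_RELATIONSHIPS and is_valid_relationship are hypothetical.

```python
# Relationship type -> (source entity type, target entity type), as listed above.
ALLOWED_RELATIONSHIPS = {
    "WORKS_IN": ("DOCTOR", "DEPARTMENT"),
    "LOCATED_AT": ("DOCTOR", "CAMPUS"),
    "TREATS": ("DEPARTMENT", "CONDITION"),
    "OFFERS": ("DEPARTMENT", "TREATMENT"),
    "PERFORMS": ("DEPARTMENT", "EXAMINATION"),
}


def is_valid_relationship(rel_type: str, source_type: str, target_type: str) -> bool:
    """Return True only if the relationship connects the expected entity types."""
    return ALLOWED_RELATIONSHIPS.get(rel_type) == (source_type, target_type)
```

Unknown relationship types are rejected as well, because the lookup returns None, which never equals a tuple.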

Stage 5: Review & Publish

Purpose: Operator reviews extracted entities, approves/rejects/merges, then publishes.

Actions:

  • Approve (auto or manual)
  • Reject (with reason)
  • Merge duplicates (fuzzy dedup candidates)
  • Edit (canonical name, aliases)
  • Publish as new taxonomy version

Entity Resolution and Deduplication

After raw entity extraction (Stage 4), entities pass through a multi-step resolution pipeline before they appear in the review table. This pipeline eliminates duplicates, links to medical terminology standards, and determines which entities can be auto-approved.

Step 1: Name Normalization

Every extracted entity name is normalized to produce a dedup key:

"Dr. Wilfried Mullens" → "dr_wilfried_mullens"
"Cardiologie" → "cardiologie"
"Hartrevalidatie (cardiale revalidatie)" → "hartrevalidatie_cardiale_revalidatie"

Normalization rules:

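  • Strip leading/trailing whitespace
  • Lowercase
  • Replace non-alphanumeric characters with underscores
  • Collapse multiple underscores
  • Strip leading/trailing underscores (implied by the third example above, whose closing parenthesis leaves no trailing underscore)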

Entities with identical dedup keys within the same type are automatically merged (the highest-confidence version wins).
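A minimal sketch of the normalization that reproduces the three examples above. The function name dedup_key is hypothetical. Two assumptions are baked in: "alphanumeric" is taken to mean ASCII letters and digits, and leading/trailing underscores are trimmed, which is inferred from the third example rather than stated.

```python
import re


def dedup_key(name: str) -> str:
    """Normalize an entity name into a dedup key."""
    key = name.strip().lower()
    # Replacing each run of non-alphanumeric characters with one underscore
    # performs the "replace" and "collapse" rules in a single pass.
    key = re.sub(r"[^a-z0-9]+", "_", key)
    # Assumption: trim underscores left behind by leading/trailing punctuation.
    return key.strip("_")
```

With the ASCII-only character class, accented letters would also become underscores; whether the real pipeline keeps them is not specified here.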

Step 2: Cluster-Based Deduplication

Within each entity type, entities are grouped into clusters by dedup key. Statistics from the latest extraction (March 2026):

Entity Type | Raw Extracted | After Dedup | Duplicates Merged
CONDITION | 871 | 712 | 159
DOCTOR | 774 | 387 | 387
EXAMINATION | 883 | 268 | 615
TREATMENT | 786 | 692 | 94
Total | 3,314 | 2,059 | 1,255

The high doctor duplicate count (387 merged from 774) occurs because the same doctors appear on multiple hub pages (e.g., /zol-artsen-0 and /zol-artsen-specialisme).
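The clustering rule (same type, same dedup key, highest confidence wins) can be sketched as below. The Entity dataclass and merge_exact_duplicates are hypothetical names; the real entity model likely carries more fields.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    entity_type: str
    dedup_key: str
    confidence: float


def merge_exact_duplicates(entities: list[Entity]) -> list[Entity]:
    """Group by (entity type, dedup key) and keep the most confident entity per cluster."""
    clusters: dict[tuple[str, str], list[Entity]] = defaultdict(list)
    for entity in entities:
        clusters[(entity.entity_type, entity.dedup_key)].append(entity)
    # Dicts preserve insertion order, so clusters come out in first-seen order.
    return [max(cluster, key=lambda e: e.confidence) for cluster in clusters.values()]
```

Keying on the type as well as the dedup key keeps, for example, a CONDITION and a TREATMENT that share a name from being merged into one entity.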

Step 3: SNOMED CT Matching

Each deduplicated entity is matched against SNOMED CT (529K Dutch medical terms) using a tiered strategy:

Tier | Method | Example
Tier 1: Exact match | Entity name matches SNOMED description exactly | "Epilepsie" matches SNOMED concept 84757009
Tier 2: Normalized match | Lowercased, stripped name matches | "epilepsie" matches "Epilepsie"
Tier 3: Fuzzy match | Best Levenshtein match above threshold (0.85) | "cardiale revalidatie" matches "Cardiale revalidatie" at 0.95

When a SNOMED match is found:

  • snomed_concept_id is stored on the entity
  • snomed_preferred_term is stored for display
  • snomed_match_confidence (0.0-1.0) records match quality
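The tiered strategy might look like the sketch below. It is an illustration under several assumptions: the SNOMED descriptions are a small in-memory dict of term to concept ID, similarity is defined as 1 - distance / longest length, and tiers 1 and 2 report a confidence of 1.0. The function names are hypothetical. A linear scan like this would be far too slow for 529K terms, so a real implementation would need an index.

```python
from __future__ import annotations

FUZZY_THRESHOLD = 0.85  # Tier 3 threshold from the table above


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_snomed(name: str, descriptions: dict[str, str]) -> tuple[str, str, float] | None:
    """Return (concept_id, matched_term, confidence), or None if no tier matches."""
    # Tier 1: exact match on the description text.
    if name in descriptions:
        return descriptions[name], name, 1.0
    # Tier 2: match after lowercasing and stripping both sides.
    wanted = name.strip().lower()
    for term, concept_id in descriptions.items():
        if term.strip().lower() == wanted:
            return concept_id, term, 1.0
    # Tier 3: best fuzzy match, accepted only at or above the threshold.
    best = max(descriptions, key=lambda t: similarity(wanted, t.strip().lower()), default=None)
    if best is not None:
        score = similarity(wanted, best.strip().lower())
        if score >= FUZZY_THRESHOLD:
            return descriptions[best], best, score
    return None
```

Under this similarity definition, a case-sensitive comparison of "cardiale revalidatie" and "Cardiale revalidatie" gives one edit over 20 characters, which is the 0.95 shown in the Tier 3 example.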

Step 4: Auto-Approval

Entities meeting both criteria are automatically approved without operator review:

  1. AI confidence >= 0.90 — the extraction LLM was confident
  2. SNOMED match found — the entity maps to a recognized medical concept

Entities that don't meet both criteria are set to proposed status and require manual operator approval.
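The two criteria reduce to a simple predicate. In this sketch the status strings approved and proposed and the 0.90 threshold come from the text; the function name review_status and the use of None for "no SNOMED match" are assumptions.

```python
from typing import Optional

AUTO_APPROVE_MIN_CONFIDENCE = 0.90  # threshold stated above


def review_status(ai_confidence: float, snomed_concept_id: Optional[str]) -> str:
    """Auto-approve only when both criteria hold; otherwise leave for the operator."""
    if ai_confidence >= AUTO_APPROVE_MIN_CONFIDENCE and snomed_concept_id is not None:
        return "approved"
    return "proposed"
```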

Step 5: Fuzzy Dedup Scan (SP-7)

After initial dedup and SNOMED matching, a second pass uses LLM-assisted fuzzy deduplication to find near-duplicates that name normalization missed.

Examples of fuzzy matches found:

  • "Lumbale discushernia" and "lumbale discushernia's" (plural variation)
  • "Schildklierkanker" and "schildklierlijden" (related but distinct -- correctly dismissed)
  • "ZOL Genk, campus Sint-Barbara" and "Sint-Barbara Lanaken" (same campus, different names)

The fuzzy scan produces merge candidates that operators review in the UI. Each candidate shows:

  • The proposed canonical name
  • All variant names
  • Confidence score
  • Option to merge or dismiss

Step 6: Operator Review

The taxonomy review UI (Stage 5) presents all entities for operator action:

Action | Effect
Approve | Entity moves to approved status, eligible for publishing
Reject | Entity moves to rejected status, excluded from taxonomy
Merge | Multiple entities combined into one, aliases preserved
Edit | Canonical name, aliases, or entity type modified
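The Merge action's "aliases preserved" behavior can be sketched as follows. ReviewEntity and merge_entities are hypothetical names, and the sketch assumes the operator has already chosen which entity supplies the canonical name.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewEntity:
    canonical_name: str
    aliases: list[str] = field(default_factory=list)


def merge_entities(keep: ReviewEntity, duplicates: list[ReviewEntity]) -> ReviewEntity:
    """Fold duplicates into `keep`, turning their names and aliases into aliases."""
    candidates = list(keep.aliases)
    for duplicate in duplicates:
        candidates.append(duplicate.canonical_name)
        candidates.extend(duplicate.aliases)
    # Drop repeats and the canonical name itself while preserving order.
    seen = {keep.canonical_name}
    aliases = []
    for name in candidates:
        if name not in seen:
            seen.add(name)
            aliases.append(name)
    return ReviewEntity(keep.canonical_name, aliases)
```

Keeping the merged names as aliases means a query that uses a variant spelling can still resolve to the surviving entity.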

Step 7: Publish

When the operator is satisfied, they publish the taxonomy as a new version. Publishing:

  1. Creates a snapshot in published_entities and published_relationships
  2. Records the version number and timestamp
  3. Runs the auto-linker (Step 8 below) to fill relationship gaps
  4. Makes the taxonomy available to the RAG query pipeline
  5. Puts the taxonomy into immediate use for query enrichment and post-answer doctor listing

Step 8: Auto-Linker (Post-Publish)

RelationshipAutoLinker runs automatically at the end of every publish operation. It uses an LLM to assign each orphaned entity (a condition, treatment, or examination with no department relationship) to the most appropriate department, then inserts the missing TREATS, OFFERS, or PERFORMS relationships directly into published_relationships.

This step is non-fatal: if the LLM call fails, the publish succeeds and the gap remains (resolvable via the on-demand admin endpoint). When links are created, the count appears in the publish response as AUTO_LINKED.
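The non-fatal behavior amounts to wrapping the auto-linker call so that a failure is logged and the publish result is still returned. The sketch below is illustrative: the publish function, its parameters, and the response fields other than AUTO_LINKED are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def publish(version: int, auto_link: Callable[[], int]) -> dict:
    """Publish a taxonomy version, then run the auto-linker as a best-effort step."""
    response = {"version": version, "status": "published"}  # hypothetical shape
    try:
        linked = auto_link()  # returns the number of relationships inserted
    except Exception:
        # Non-fatal: the publish has already succeeded; the gap stays open.
        logger.exception("auto-linker failed; publish continues")
        linked = 0
    if linked:
        response["AUTO_LINKED"] = linked
    return response
```

Running the linker after the snapshot is written is what makes it safe to swallow the error: nothing in the published version depends on the linker having run.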

See Multi-Tenancy & Hospital-Agnostic Architecture for full details on the auto-linker design.

Resolution Pipeline Architecture

Key Design Decisions

  • CAMPUS entities come from the authoritative campuses table, not AI extraction (avoids duplicates from external hospitals mentioned in content)
  • SNOMED CT integration provides Dutch synonym resolution at both extraction time (entity matching) and query time (search expansion)
  • Child URL patterns use SQL LIKE with % wildcards, converted from AI-detected {slug} format
  • Cost budget prevents runaway extraction costs (configurable per hospital)
  • Composite quality gate in evaluation handles the known faithfulness scoring limitation for taxonomy-enriched answers
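The conversion from the AI-detected {slug} format to a SQL LIKE pattern can be sketched as below. The function name to_like_pattern is hypothetical, and escaping literal % and _ characters is an assumption about good practice rather than documented behavior; if escaping is used, the query needs a matching ESCAPE clause.

```python
def to_like_pattern(child_url_pattern: str, escape: str = "\\") -> str:
    """Turn a pattern such as '/zol-arts/{slug}' into a LIKE pattern '/zol-arts/%'."""
    literal_parts = child_url_pattern.split("{slug}")
    escaped_parts = [
        part.replace(escape, escape * 2)      # escape the escape character first
            .replace("%", escape + "%")       # literal percent signs
            .replace("_", escape + "_")       # literal underscores
        for part in literal_parts
    ]
    # Each {slug} placeholder becomes the multi-character wildcard.
    return "%".join(escaped_parts)
```

Without the escaping step, an underscore in a URL path would act as a single-character wildcard and could match unintended child pages.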