Entity Resolution & Deduplication

The entity resolution pipeline (SP-4) transforms raw HTML content from confirmed hub pages into clean, deduplicated, SNOMED-linked draft taxonomy entities. It is the first stage of the Draft/Publish System — entities produced here land as draft rows that operators review before publishing.

SP-4 in the Pipeline

SP-4 sits between hub page confirmation (SP-2/SP-3) and the publish step (SP-5). Hub pages provide the URLs; entity resolution extracts and normalizes the entities; the publish system promotes approved entities to the live RAG index.

The Problem

The same real-world entity appears under dozens of name variations across a hospital's web properties:

  • Doctors: "dr. Peeters", "Dr. E. Peeters", "Erik Peeters", "Peeters Erik" — four strings, one person
  • Departments: "Cardiologie", "Hartafdeling", "Afdeling Cardiologie" — three strings, one department
  • Conditions: "hartfalen", "Hartfalen", "Congestief hartfalen" — same condition, varying capitalization and specificity

Without resolution, these duplicates produce a fragmented taxonomy: query enrichment fails ("hartfalen" doesn't match "Hartfalen"), relationship inference fires multiple times for the same pair, and operators see hundreds of redundant entities to review.

The pipeline resolves these variants in four stages: name normalization → LLM extraction → deduplication clustering → SNOMED matching.

Pipeline Overview

Step 1: Name Normalization

name_normalizer.py converts raw extracted strings into a canonical form and generates a dedup key used downstream for clustering.

strip_titles()

Removes Dutch honorific and role prefixes before normalization:

| Input | After strip_titles() |
| --- | --- |
| dr. Erik Peeters | Erik Peeters |
| Prof. dr. Van den Berg | Van den Berg |
| Hoofdverpleegkundige Janssen | Janssen |
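
The behavior in the table can be sketched as follows. This is a minimal illustration, not the contents of name_normalizer.py: the prefix list TITLE_PREFIXES is an assumption built only from the three examples above, and the real module likely covers more titles.

```python
# Hypothetical prefix list, taken from the examples in the table above.
TITLE_PREFIXES = ("prof.", "dr.", "hoofdverpleegkundige")


def strip_titles(raw: str) -> str:
    """Drop leading honorific or role tokens and keep the rest of the name."""
    tokens = raw.split()
    # Titles can stack ("Prof. dr."), so keep removing while the first token is a title.
    while tokens and tokens[0].lower() in TITLE_PREFIXES:
        tokens.pop(0)
    return " ".join(tokens)
```

Matching is done on whole tokens and only at the start of the string, so a surname that happens to contain "dr" is left untouched.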

generate_dedup_key()

Produces a stable, case-insensitive clustering key per entity type:

| Entity Type | Key Strategy | Example |
| --- | --- | --- |
| doctor | {last_name}_{first_initial} | peeters_e |
| department | lowercase_underscored(canonical_name) | cardiologie |
| condition | lowercase_underscored(canonical_name) | hartfalen |
| treatment | lowercase_underscored(canonical_name) | knieprothese |
| examination | lowercase_underscored(canonical_name) | echografie_hart |

Doctor keys use last name + first initial rather than full name so that "Erik Peeters" and "E. Peeters" both produce peeters_e and cluster together.
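
A rough sketch of the two key strategies is shown below. It assumes the title-stripped name arrives in "first name, then last name" order and that the first token is the given name; the helper _slug and the fallback for single-token doctor names are illustrative choices, not taken from the actual module.

```python
import re


def _slug(text: str) -> str:
    """Lowercase the text and join its word runs with underscores."""
    return "_".join(re.findall(r"\w+", text.lower()))


def generate_dedup_key(entity_type: str, canonical_name: str) -> str:
    if entity_type != "doctor":
        return _slug(canonical_name)
    # Treat "E." and "Erik" alike by ignoring the dot after an initial.
    parts = canonical_name.replace(".", " ").split()
    if len(parts) < 2:
        # Assumption: with no first name available, fall back to the last name only.
        return _slug(canonical_name)
    first, last = parts[0], " ".join(parts[1:])
    return f"{_slug(last)}_{first[0].lower()}"
```

With this sketch, "Erik Peeters" and "E. Peeters" both map to peeters_e. A reversed form such as "Peeters Erik" would not, which is the kind of case the LLM-assisted dedup step is meant to catch.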

Why Not Fuzzy Matching for Dedup Keys?

Deterministic string keys are simpler, faster, and fully testable. Fuzzy matching (Levenshtein, Jaro-Winkler) is reserved for cases where the key itself doesn't suffice — at that point, the LLM dedup step handles it with semantic understanding.

Step 2: LLM Extraction

entity_extractor.py sends hub page HTML to GPT-4.1 mini with type-specific extraction prompts. Six prompts exist, one per entity type, each tuned for Dutch hospital content:

| Prompt | Extracts | Key Instructions |
| --- | --- | --- |
| doctors | Names, specialties, departments, campuses | Ignore job titles; extract only medical professionals |
| departments | Canonical department name, aliases, domain group | Normalize to Belgian standard department names |
| conditions | Dutch condition names, severity hints | Exclude body parts and generic terms |
| treatments | Treatment/procedure names | Exclude examinations and generic verbs |
| examinations | Diagnostic test names | Distinguish from treatments |
| services | Navigational services (Bezoekuren, Route) | Only concrete navigational entities |

Each LLM call returns structured JSON with:

  • name — raw extracted string
  • confidence — float 0.0–1.0 (the model's self-reported confidence)
  • source_snippet — the text fragment that triggered the extraction

The confidence score feeds directly into ai_confidence on the taxonomy_entities row and drives dedup cluster selection in Step 3.
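
As an illustration of consuming that structured output, the sketch below parses a response into typed records. It assumes the response body is a JSON array of objects with exactly the three fields listed above; the envelope shape, the ExtractedEntity class, and the clamping of out-of-range confidences are assumptions, not documented behavior of entity_extractor.py.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedEntity:
    name: str
    confidence: float
    source_snippet: str


def parse_extraction(payload: str) -> list[ExtractedEntity]:
    """Parse a JSON array of extraction items, clamping confidence to [0.0, 1.0]."""
    entities = []
    for item in json.loads(payload):
        # Self-reported scores are not guaranteed to stay in range, so clamp them.
        confidence = min(1.0, max(0.0, float(item["confidence"])))
        entities.append(
            ExtractedEntity(item["name"], confidence, item["source_snippet"])
        )
    return entities
```

A missing field raises KeyError here, which surfaces malformed model output early instead of writing a partial row.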

Dutch Language Handling

All prompts explicitly instruct the model to work in Dutch. Entity names are preserved in their Dutch form. SNOMED matching (Step 4) handles the mapping to international concept IDs without translating the names.

Content Assembly

Page content is assembled from the database — no live HTTP fetches during extraction. The pipeline joins crawled_urls → document_chunks for each confirmed hub page child URL and concatenates all chunks. This ensures deterministic, reproducible extraction from the same corpus snapshot.
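
The concatenation step could look roughly like this once the joined rows are in memory. The row shape (url, chunk_index, text) is hypothetical; in particular, the ordering column chunk_index is an assumption, since the source does not name the column that orders chunks.

```python
from collections import defaultdict


def assemble_pages(chunks: list[dict]) -> dict[str, str]:
    """Group chunk rows by URL and concatenate their text in chunk order."""
    by_url = defaultdict(list)
    for chunk in chunks:
        by_url[chunk["url"]].append(chunk)
    return {
        url: "\n".join(
            c["text"] for c in sorted(rows, key=lambda c: c["chunk_index"])
        )
        for url, rows in by_url.items()
    }
```

Sorting explicitly keeps the output independent of the order in which the database returns rows, which is what makes repeated runs over the same snapshot reproducible.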

Step 3: Deduplication Clustering

dedup_service.py groups extracted entities by their dedup_key and selects one primary entity per cluster.

Cluster rules:

  1. All entities sharing a dedup_key receive the same dedup_cluster_id (a shared UUID)
  2. The entity with the highest ai_confidence is marked dedup_is_primary = true
  3. All others are marked dedup_is_primary = false
  4. Non-primary entities are not deleted — they remain available for operator review and can be merged or reinstated
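
The cluster rules above can be sketched in a few lines. This is an in-memory illustration over plain dicts, not the code in dedup_service.py; the tie-breaking rule (first entity wins when confidences are equal) is an assumption.

```python
import uuid
from collections import defaultdict


def assign_clusters(entities: list[dict]) -> list[dict]:
    """Return copies of the entities with dedup_cluster_id and dedup_is_primary set."""
    groups = defaultdict(list)
    for index, entity in enumerate(entities):
        groups[entity["dedup_key"]].append(index)

    result = [dict(entity) for entity in entities]
    for indices in groups.values():
        cluster_id = str(uuid.uuid4())  # rule 1: one shared UUID per dedup_key
        # Rule 2: highest ai_confidence wins; max() keeps the first entity on ties.
        primary = max(indices, key=lambda i: entities[i]["ai_confidence"])
        for i in indices:
            result[i]["dedup_cluster_id"] = cluster_id
            result[i]["dedup_is_primary"] = i == primary  # rule 3
    return result  # rule 4: nothing is dropped
```

An entity whose key is unique still receives a cluster ID and becomes the primary of its own single-member cluster.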

LLM-assisted dedup handles edge cases where the key strategy isn't sufficient. When two different dedup keys refer to the same entity (e.g., a department scraped as "Intensieve Zorgen" from one page and "Intensive Care" from another), the LLM dedup step compares the candidates semantically and merges them, setting merged_into on the subordinate row.

Step 4: SNOMED Matching

resolution_pipeline.py links clinical entities (conditions, treatments, examinations) to SNOMED-CT concept IDs using the existing SnomedMatcher service:

| Column | Set By | Example |
| --- | --- | --- |
| snomed_concept_id | SnomedMatcher.match() | "84114007" (hartfalen) |
| snomed_preferred_term | SNOMED descriptions table | "heart failure" |
| snomed_match_confidence | SnomedMatcher fuzzy score | 0.94 |

Doctors and departments do not receive SNOMED links — SNOMED covers clinical concepts, not organizational entities.

SNOMED as a Quality Signal

A high snomed_match_confidence (>0.85) is a strong indicator that the extracted entity name is a legitimate clinical term, not a spurious extraction artifact. The operator review UI surfaces this as a confidence badge.

SNOMED Matching Strategy

The SnomedMatcher uses a 5-tier matching strategy to maximize coverage of Dutch medical terminology:

| Tier | Method | Example | Threshold |
| --- | --- | --- | --- |
| 1 | Exact match | "ablatie" → "ablatie" | 1.0 |
| 2 | Compound splitting + normalization | "peniskanker" → "kanker van penis" | 0.95 |
| 3 | Fuzzy (pg_trgm) | "hartritmestoornissen" → "hartritmestoornis" | ≥0.4 |
| 4 | Word overlap | All significant words must appear in SNOMED term | 0.7 |
| 5 | LLM fallback | GPT-4.1 mini normalizes Dutch term, then searches | 0.85 |

The compound splitter handles Dutch medical suffixes (-kanker, -stoornis, -operatie) and anatomical prefixes (hart-, nier-, borst-), expanding them into multi-word forms and "van/de" patterns that match SNOMED descriptions.
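
To make the suffix expansion concrete, here is a simplified sketch. It handles only the three suffixes named above and ignores anatomical prefixes and Dutch linking letters; the function name split_compound and the order of the returned variants are assumptions rather than the SnomedMatcher implementation.

```python
# Suffixes named in the text; the real splitter may know many more.
SUFFIXES = ("kanker", "stoornis", "operatie")


def split_compound(term: str) -> list[str]:
    """Expand a Dutch compound into candidate multi-word search forms."""
    term = term.lower().strip()
    for suffix in SUFFIXES:
        # Require a non-empty head so that "kanker" alone is not split.
        if term.endswith(suffix) and len(term) > len(suffix):
            head = term[: -len(suffix)]
            return [f"{suffix} van {head}", f"{head} {suffix}"]
    return []
```

For "peniskanker" the first candidate is "kanker van penis", matching the tier 2 example in the table. Terms without a known suffix return an empty list and fall through to the later tiers.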

Coverage results (ZOL dataset, 1,465 non-doctor entities):

  • Conditions: 97.5% (539/553)
  • Examinations: 90.1% (227/252)
  • Treatments: 86.5% (571/660)

Incremental Extraction

Entity extraction supports incremental mode: when re-running extraction, URLs that have already been successfully extracted are skipped. The pipeline checks the extraction_results table for existing successful results per URL and only processes new or previously failed URLs. This reduces extraction time significantly for subsequent runs after adding new hub pages or retrying after partial failures.
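
The skip logic amounts to a set difference. The sketch below assumes the relevant contents of extraction_results have been loaded into a mapping from URL to status, with "success" marking a completed extraction; both the mapping shape and the status values are hypothetical.

```python
def urls_to_process(candidate_urls: list[str], extraction_results: dict[str, str]) -> list[str]:
    """Keep URLs that are new or whose previous extraction did not succeed."""
    already_done = {
        url for url, status in extraction_results.items() if status == "success"
    }
    return [url for url in candidate_urls if url not in already_done]
```

For example, given results {"a": "success", "b": "failed"} and candidates ["a", "b", "c"], only "b" (retry) and "c" (new) are processed.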

Execution Model

Entity extraction runs as a background task (not SSE streaming) with Redis-stored progress:

  1. POST /extraction/run → returns immediately, starts BackgroundTask
  2. Progress stored in Redis key extraction_progress:{hospital_id}
  3. Frontend polls GET /extraction/progress every 2 seconds
  4. Survives page refresh — progress resumes on navigate-back
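
The polling side of this flow can be sketched as a small loop. The fetch_progress callable stands in for a GET /extraction/progress request, and the payload shape (a status field with "completed" or "failed" as terminal values) is an assumption; the source does not document the progress payload.

```python
import time
from typing import Callable


def wait_for_extraction(
    fetch_progress: Callable[[], dict],
    interval: float = 2.0,
    max_polls: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task reports a terminal status or the poll budget runs out."""
    for _ in range(max_polls):
        progress = fetch_progress()
        if progress.get("status") in ("completed", "failed"):
            return progress
        sleep(interval)  # matches the 2-second polling cadence
    raise TimeoutError("extraction did not finish within the polling budget")
```

Because progress lives in Redis rather than in the client, a caller can start this loop at any time, including after a page refresh, and pick up the current state.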

The extraction also creates implicit entities from relationships: when a WORKS_IN relationship references "Cardiologie" but no DEPARTMENT entity exists, the pipeline auto-creates it as an approved entity.
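
A minimal sketch of that auto-creation step follows. The relationship dict shape (type, source, target) and the case-insensitive name comparison are assumptions made for illustration.

```python
def implicit_departments(relationships: list[dict], known_departments: list[str]) -> list[dict]:
    """Create approved department entities for WORKS_IN targets that do not exist yet."""
    known = {name.lower() for name in known_departments}
    created = []
    for rel in relationships:
        if rel["type"] != "WORKS_IN":
            continue
        target = rel["target"]
        if target.lower() in known:
            continue
        known.add(target.lower())  # avoid creating the same department twice
        created.append(
            {"entity_type": "department", "name": target, "status": "approved"}
        )
    return created
```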

Operator Review

After the four-stage pipeline completes, entities sit in taxonomy_entities as draft rows with status = 'proposed'. Operators manage them via the REST API:

API Endpoints

| Method | Path | Action |
| --- | --- | --- |
| GET | /api/v1/entity-resolution/entities | List entities (filterable by type, status, search) |
| GET | /api/v1/entity-resolution/entities/{id} | Single entity with relationships |
| PATCH | /api/v1/entity-resolution/entities/{id} | Update name, type, status |
| POST | /api/v1/entity-resolution/entities/{id}/approve | Set status = approved |
| POST | /api/v1/entity-resolution/entities/{id}/reject | Set status = rejected |
| POST | /api/v1/entity-resolution/entities/merge | Merge entity B into entity A |
| POST | /api/v1/entity-resolution/entities/bulk-approve | Approve all entities above confidence threshold |
| GET | /api/v1/entity-resolution/entities/counts | Entity counts by type and status |

Tiered Bulk Merge

For large taxonomy datasets with hundreds of potential duplicates, reviewing merge candidates one by one is impractical. The system provides a tiered bulk merge workflow that groups merge candidates by confidence level, allowing operators to approve entire tiers at once:

| Tier | Criteria | Typical Action |
| --- | --- | --- |
| 100% overlap | Exact token overlap between entity names | Auto-merge (highest confidence) |
| 80% overlap | >= 80% token overlap | Bulk merge with quick review |
| SNOMED match | Both entities map to the same SNOMED concept ID | Bulk merge (semantic equivalence) |
| High confidence | AI confidence >= 0.85 for both candidates | Bulk merge after spot check |

Each tier is presented as a batch in the UI. The operator can approve an entire tier with one click, or expand individual candidates for inspection before deciding.
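
One way to assign a candidate pair to a tier is sketched below. Two things are assumptions here: token overlap is computed as Jaccard similarity over lowercased whitespace tokens (the source does not define the measure), and tiers are checked in table order with the first match winning. Pairs that match no tier are labeled NEEDS_REVIEW.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens (assumed definition)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def merge_tier(a: dict, b: dict) -> str:
    overlap = token_overlap(a["name"], b["name"])
    if overlap == 1.0:
        return "100% overlap"
    if overlap >= 0.8:
        return "80% overlap"
    concept_a, concept_b = a.get("snomed_concept_id"), b.get("snomed_concept_id")
    if concept_a is not None and concept_a == concept_b:
        return "SNOMED match"
    if min(a["ai_confidence"], b["ai_confidence"]) >= 0.85:
        return "High confidence"
    return "NEEDS_REVIEW"
```

Under this definition "hartfalen" and "Congestief hartfalen" overlap at only 0.5, so the pair reaches the SNOMED tier only if both rows carry the same concept ID.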

NEEDS_REVIEW Merge Candidates

Merge candidates that do not meet the automatic threshold are flagged as NEEDS_REVIEW. These appear in a dedicated review queue with Merge and Reject buttons directly on each candidate row. This replaces the previous workflow where operators had to navigate to individual entity pages to perform merges. The inline action buttons reduce the number of clicks per decision from 4-5 to a single click, making it feasible to process hundreds of candidates in a single review session.

Audit Trail: taxonomy_overrides

Every operator action (approve, reject, rename, merge) is recorded in the taxonomy_overrides table:

| Column | Type | Description |
| --- | --- | --- |
| id | UUID | |
| hospital_id | UUID FK | Scoped to hospital |
| override_type | VARCHAR(20) | approve, reject, rename, merge, bulk_approve |
| target_entity_id | UUID FK → taxonomy_entities | The affected entity |
| override_data | JSONB | Before/after state snapshot |
| applied_by | UUID FK → users | Operator user ID |
| draft_version | INTEGER | The draft version this override applied to |
| created_at | TIMESTAMPTZ | |

This audit trail is intended to meet the EU AI Act Article 12 record-keeping requirement: AI operations are logged automatically, and every human override decision is recorded.
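
For illustration, a row for this table could be assembled as below before insertion. The column names follow the table above, but the {"before": ..., "after": ...} layout of override_data and the helper name build_override are assumptions; the source only says the column holds a before/after snapshot.

```python
import uuid
from datetime import datetime, timezone

OVERRIDE_TYPES = {"approve", "reject", "rename", "merge", "bulk_approve"}


def build_override(*, hospital_id, target_entity_id, applied_by,
                   override_type, before, after, draft_version) -> dict:
    """Build one taxonomy_overrides row as a plain dict."""
    if override_type not in OVERRIDE_TYPES:
        raise ValueError(f"unknown override_type: {override_type!r}")
    return {
        "id": str(uuid.uuid4()),
        "hospital_id": hospital_id,
        "override_type": override_type,
        "target_entity_id": target_entity_id,
        # Assumed JSONB layout: snapshot of the entity state around the action.
        "override_data": {"before": before, "after": after},
        "applied_by": applied_by,
        "draft_version": draft_version,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```

Rejecting unknown override types at construction time keeps the audit log limited to the five documented action types.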

Database Schema (SP-4 Additions)

SP-4 extends taxonomy_entities with columns added by migration 052:

| Column | Type | Description |
| --- | --- | --- |
| hospital_id | UUID FK | References hospitals.id |
| name | VARCHAR(500) | Raw extracted name (pre-normalization) |
| dedup_key | VARCHAR(500) | Clustering key (peeters_e, cardiologie) |
| source_hub_id | UUID FK | Hub page that produced this entity |
| ai_confidence | FLOAT | LLM self-reported confidence 0.0–1.0 |
| snomed_concept_id | VARCHAR(20) | SNOMED-CT concept ID if matched |
| snomed_preferred_term | VARCHAR(500) | SNOMED preferred term (English) |
| snomed_match_confidence | FLOAT | Fuzzy match score |
| dedup_cluster_id | UUID | Shared ID for all entities in a cluster |
| dedup_is_primary | BOOLEAN | True for the cluster representative |
| merged_into | UUID FK → self | Points to primary if this entity was merged |
| draft_version | INTEGER | Draft version (incremented on each pipeline run) |

New unique constraint: (hospital_id, entity_type, dedup_key, draft_version) — prevents duplicate entries per version.

Indexes added:

  • idx_te_hospital_type on (hospital_id, entity_type) — entity list queries
  • idx_te_hospital_status on (hospital_id, status) — review queue filtering
  • idx_te_dedup_key on (dedup_key) — dedup clustering lookups

Integration with the Publish System

Approved entities (status = 'approved') are the input to SP-5's publish step. The publish pipeline queries:

SELECT * FROM app.taxonomy_entities
WHERE hospital_id = :hospital_id
AND draft_version = :draft_version
AND status = 'approved'
AND dedup_is_primary = true;

Only primary entities are published — non-primary cluster members are excluded unless the operator explicitly promotes them.

See Draft/Publish System for how approved entities transition to the live published_entities table consumed by FrozenTaxonomyRegistry.