Entity Resolution & Deduplication
The entity resolution pipeline (SP-4) transforms raw HTML content from confirmed hub pages into clean, deduplicated, SNOMED-linked draft taxonomy entities. It is the first stage of the Draft/Publish System — entities produced here land as draft rows that operators review before publishing.
SP-4 sits between hub page confirmation (SP-2/SP-3) and the publish step (SP-5). Hub pages provide the URLs; entity resolution extracts and normalizes the entities; the publish system promotes approved entities to the live RAG index.
The Problem
The same real-world entity appears under dozens of name variations across a hospital's web properties:
- Doctors: "dr. Peeters", "Dr. E. Peeters", "Erik Peeters", "Peeters Erik" — four strings, one person
- Departments: "Cardiologie", "Hartafdeling", "Afdeling Cardiologie" — three strings, one department
- Conditions: "hartfalen", "Hartfalen", "Congestief hartfalen" — same condition, varying capitalization and specificity
Without resolution, these duplicates produce a fragmented taxonomy: query enrichment fails ("hartfalen" doesn't match "Hartfalen"), relationship inference fires multiple times for the same pair, and operators see hundreds of redundant entities to review.
SP-4 solves this in four stages: name normalization → LLM extraction → deduplication clustering → SNOMED matching.
Pipeline Overview
Step 1: Name Normalization
name_normalizer.py converts raw extracted strings into a canonical form and generates a dedup key used downstream for clustering.
strip_titles()
Removes Dutch honorific and role prefixes before normalization:
| Input | After strip_titles() |
|---|---|
| dr. Erik Peeters | Erik Peeters |
| Prof. dr. Van den Berg | Van den Berg |
| Hoofdverpleegkundige Janssen | Janssen |
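The prefix stripping shown in the table can be sketched with a single anchored regular expression. This is a minimal illustration, not the actual name_normalizer.py code: the prefix list below contains only the three titles that appear in the table, and the real list is presumably longer.

```python
import re

# Assumption: only the prefixes visible in the table above are listed here.
TITLE_PREFIXES = ("prof", "dr", "hoofdverpleegkundige")

# One or more leading titles, each optionally followed by a dot,
# each followed by whitespace (so "Dries" is not mistaken for "dr").
_TITLE_RE = re.compile(
    r"^(?:(?:" + "|".join(TITLE_PREFIXES) + r")\.?\s+)+",
    flags=re.IGNORECASE,
)


def strip_titles(raw_name: str) -> str:
    """Remove leading honorific and role prefixes, including stacked ones."""
    return _TITLE_RE.sub("", raw_name.strip()).strip()


for raw in ("dr. Erik Peeters", "Prof. dr. Van den Berg", "Hoofdverpleegkundige Janssen"):
    print(f"{raw!r} -> {strip_titles(raw)!r}")
```

Requiring whitespace after each title keeps ordinary first names that merely start with the same letters intact.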
generate_dedup_key()
Produces a stable, case-insensitive clustering key per entity type:
| Entity Type | Key Strategy | Example |
|---|---|---|
| doctor | {last_name}_{first_initial} | peeters_e |
| department | lowercase_underscored(canonical_name) | cardiologie |
| condition | lowercase_underscored(canonical_name) | hartfalen |
| treatment | lowercase_underscored(canonical_name) | knieprothese |
| examination | lowercase_underscored(canonical_name) | echografie_hart |
Doctor keys use last name + first initial rather than full name so that "Erik Peeters" and "E. Peeters" both produce peeters_e and cluster together.
Deterministic string keys are simpler, faster, and fully testable. Fuzzy matching (Levenshtein, Jaro-Winkler) is reserved for cases where the key itself doesn't suffice — at that point, the LLM dedup step handles it with semantic understanding.
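The key strategies in the table above could be implemented roughly as follows. This is a sketch under two assumptions that the source does not state: titles have already been stripped, and doctor names arrive in "first name, last name" order. The helper `_slug` is a hypothetical name.

```python
import re


def _slug(text: str) -> str:
    """Lowercase the text and join its alphanumeric runs with underscores."""
    return "_".join(re.findall(r"[^\W_]+", text.lower()))


def generate_dedup_key(entity_type: str, canonical_name: str) -> str:
    """Build a deterministic, case-insensitive clustering key."""
    if entity_type != "doctor":
        return _slug(canonical_name)
    # Assumption: the first token is the first name (or its initial)
    # and everything after it is the last name.
    tokens = canonical_name.replace(".", " ").split()
    if len(tokens) < 2:
        return _slug(canonical_name)
    first, last = tokens[0], " ".join(tokens[1:])
    return f"{_slug(last)}_{first[0].lower()}"


print(generate_dedup_key("doctor", "Erik Peeters"))
print(generate_dedup_key("doctor", "E. Peeters"))
print(generate_dedup_key("examination", "Echografie hart"))
```

Under these assumptions a reversed name such as "Peeters Erik" would receive a different key, which is the kind of case where two keys refer to one entity and a semantic comparison is needed.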
Step 2: LLM Extraction
entity_extractor.py sends hub page HTML to GPT-4.1 mini with type-specific extraction prompts. Six prompts exist, one per entity type, each tuned for Dutch hospital content:
| Prompt | Extracts | Key Instructions |
|---|---|---|
| doctors | Names, specialties, departments, campuses | Ignore job titles; extract only medical professionals |
| departments | Canonical department name, aliases, domain group | Normalize to Belgian standard department names |
| conditions | Dutch condition names, severity hints | Exclude body parts and generic terms |
| treatments | Treatment/procedure names | Exclude examinations and generic verbs |
| examinations | Diagnostic test names | Distinguish from treatments |
| services | Navigational services (Bezoekuren, Route) | Only concrete navigational entities |
Each LLM call returns structured JSON with:
- name: raw extracted string
- confidence: float 0.0–1.0 (the model's self-reported confidence)
- source_snippet: the text fragment that triggered the extraction
The confidence score feeds directly into ai_confidence on the taxonomy_entities row and drives dedup cluster selection in Step 3.
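A defensive parser for this response might look like the sketch below. It assumes the model returns a JSON array of objects carrying the three fields listed above; the class name `ExtractedEntity` and the function `parse_extraction` are illustrative and not taken from entity_extractor.py.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedEntity:
    name: str
    confidence: float
    source_snippet: str


def parse_extraction(payload: str) -> list[ExtractedEntity]:
    """Parse a JSON array of extraction items, skipping malformed ones."""
    entities = []
    for item in json.loads(payload):
        try:
            confidence = float(item["confidence"])
            name = str(item["name"]).strip()
            snippet = str(item["source_snippet"])
        except (KeyError, TypeError, ValueError):
            continue  # missing field, non-object item, or non-numeric confidence
        # Keep only named items whose confidence lies in the documented range.
        if name and 0.0 <= confidence <= 1.0:
            entities.append(ExtractedEntity(name, confidence, snippet))
    return entities
```

Rejecting out-of-range confidence values at parse time matters because the score is later used to pick the primary entity of each cluster.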
All prompts explicitly instruct the model to work in Dutch. Entity names are preserved in their Dutch form. SNOMED matching (Step 4) handles the mapping to international concept IDs without translating the names.
Content Assembly
Page content is assembled from the database — no live HTTP fetches during extraction. The pipeline joins crawled_urls → document_chunks for each confirmed hub page child URL and concatenates all chunks. This ensures deterministic, reproducible extraction from the same corpus snapshot.
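The join described above can be illustrated with an in-memory SQLite database standing in for the real store. Only the table names crawled_urls and document_chunks come from the text; the column names (`id`, `url`, `crawled_url_id`, `chunk_index`, `content`) are assumptions made for this sketch.

```python
import sqlite3


def assemble_page_content(conn: sqlite3.Connection, child_urls: list[str]) -> dict[str, str]:
    """Return {url: concatenated chunk text} for the given child URLs."""
    if not child_urls:
        return {}
    placeholders = ",".join("?" for _ in child_urls)
    rows = conn.execute(
        f"""
        SELECT u.url, c.content
        FROM crawled_urls AS u
        JOIN document_chunks AS c ON c.crawled_url_id = u.id
        WHERE u.url IN ({placeholders})
        ORDER BY u.url, c.chunk_index
        """,
        list(child_urls),
    ).fetchall()
    pages: dict[str, list[str]] = {}
    for url, content in rows:
        pages.setdefault(url, []).append(content)
    # A fixed chunk order yields the same text for the same corpus snapshot.
    return {url: "\n".join(chunks) for url, chunks in pages.items()}
```

Ordering by a chunk index is what makes the concatenation reproducible; without an explicit order the database may return chunks in any sequence.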
Step 3: Deduplication Clustering
dedup_service.py groups extracted entities by their dedup_key and selects one as the primary entity per cluster.
Cluster rules:
- All entities sharing a dedup_key receive the same dedup_cluster_id (a shared UUID)
- The entity with the highest ai_confidence is marked dedup_is_primary = true
- All others are marked dedup_is_primary = false
- Non-primary entities are not deleted; they remain available for operator review and can be merged or reinstated
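The cluster rules above can be sketched as a single grouping pass. This is an illustration rather than the dedup_service.py implementation: it assumes the input holds entities of one hospital and one entity type, represented as dicts, and that ties in confidence go to the entity seen first.

```python
import uuid
from collections import defaultdict


def assign_clusters(entities: list[dict]) -> list[dict]:
    """Set dedup_cluster_id and dedup_is_primary on each entity dict in place."""
    by_key = defaultdict(list)
    for entity in entities:
        by_key[entity["dedup_key"]].append(entity)

    for members in by_key.values():
        cluster_id = str(uuid.uuid4())  # one shared UUID per dedup_key
        # max() returns the first maximal element, so ties go to the
        # earliest entity in the input (tie-breaking is an assumption).
        primary = max(members, key=lambda e: e["ai_confidence"])
        for entity in members:
            entity["dedup_cluster_id"] = cluster_id
            entity["dedup_is_primary"] = entity is primary
    return entities
```

Nothing is removed from the list: non-primary members keep their rows and only carry a flag, which matches the rule that they stay available for review.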
LLM-assisted dedup handles edge cases where the key strategy isn't sufficient. When two different dedup keys refer to the same entity (e.g., a department scraped as "Intensieve Zorgen" from one page and "Intensive Care" from another), the LLM dedup step compares the candidates semantically and merges them, setting merged_into on the subordinate row.
Step 4: SNOMED Matching
resolution_pipeline.py links clinical entities (conditions, treatments, examinations) to SNOMED-CT concept IDs using the existing SnomedMatcher service:
| Column | Set By | Example |
|---|---|---|
| snomed_concept_id | SnomedMatcher.match() | "84114007" (hartfalen) |
| snomed_preferred_term | SNOMED descriptions table | "heart failure" |
| snomed_match_confidence | SnomedMatcher fuzzy score | 0.94 |
Doctors and departments do not receive SNOMED links — SNOMED covers clinical concepts, not organizational entities.
A high snomed_match_confidence (>0.85) is a strong indicator that the extracted entity name is a legitimate clinical term, not a spurious extraction artifact. The operator review UI surfaces this as a confidence badge.
SNOMED Matching Strategy
The SnomedMatcher uses a 5-tier matching strategy to maximize coverage of Dutch medical terminology:
| Tier | Method | Example | Threshold |
|---|---|---|---|
| 1 | Exact match | "ablatie" → "ablatie" | 1.0 |
| 2 | Compound splitting + normalization | "peniskanker" → "kanker van penis" | 0.95 |
| 3 | Fuzzy (pg_trgm) | "hartritmestoornissen" → "hartritmestoornis" | ≥0.4 |
| 4 | Word overlap | All significant words must appear in SNOMED term | 0.7 |
| 5 | LLM fallback | GPT-4.1 mini normalizes Dutch term, then searches | 0.85 |
The compound splitter handles Dutch medical suffixes (-kanker, -stoornis, -operatie) and anatomical prefixes (hart-, nier-, borst-), expanding them into multi-word forms and "van/de" patterns that match SNOMED descriptions.
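A much-simplified version of the suffix half of the compound splitter is sketched below. It covers only the three suffixes named above, ignores the anatomical prefixes, and the minimum stem length of three characters is an assumption added to avoid splitting very short words.

```python
# Suffixes taken from the text; the real splitter presumably knows more.
SUFFIXES = ("kanker", "stoornis", "operatie")


def split_compound(term: str) -> list[str]:
    """Return candidate multi-word expansions of a Dutch compound term."""
    word = term.strip().lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return [
                f"{suffix} van {stem}",     # "kanker van penis"
                f"{suffix} van de {stem}",  # variant with article
                f"{stem} {suffix}",         # plain two-word form
            ]
    return []  # not a recognised compound


print(split_compound("peniskanker"))
```

Each candidate would then be looked up against the SNOMED descriptions, so generating a few harmless extra variants is cheaper than missing the one that matches.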
Coverage results (ZOL dataset, 1,465 non-doctor entities):
- Conditions: 97.5% (539/553)
- Examinations: 90.1% (227/252)
- Treatments: 86.5% (571/660)
Incremental Extraction
Entity extraction supports incremental mode: when re-running extraction, URLs that have already been successfully extracted are skipped. The pipeline checks the extraction_results table for existing successful results per URL and only processes new or previously failed URLs. This reduces extraction time significantly for subsequent runs after adding new hub pages or retrying after partial failures.
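The skip logic amounts to a set difference over URLs. In the sketch below, the row shape and the status value "success" are assumptions; the text only states that successful results are recorded per URL in extraction_results.

```python
def urls_to_process(candidate_urls: list[str], extraction_results: list[dict]) -> list[str]:
    """Return candidate URLs that lack a successful extraction, in input order."""
    # Assumption: each result row has "url" and "status" fields.
    done = {row["url"] for row in extraction_results if row["status"] == "success"}
    return [url for url in candidate_urls if url not in done]
```

Because failed rows are not added to the `done` set, previously failed URLs are retried automatically on the next run.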
Execution Model
Entity extraction runs as a background task (not SSE streaming) with Redis-stored progress:
- POST /extraction/run returns immediately and starts a BackgroundTask
- Progress is stored in the Redis key extraction_progress:{hospital_id}
- The frontend polls GET /extraction/progress every 2 seconds
- Progress survives a page refresh and resumes on navigate-back
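The polling contract can be sketched independently of any HTTP or Redis client. In this illustration `fetch_progress` is a caller-supplied function (for example, one that issues GET /extraction/progress), and the `status` field with the values "completed" and "failed" is an assumed payload shape, not a documented one.

```python
import time


def progress_key(hospital_id: str) -> str:
    """Redis key under which extraction progress is stored."""
    return f"extraction_progress:{hospital_id}"


def poll_until_done(fetch_progress, interval_s=2.0, max_polls=300, sleep=time.sleep):
    """Call fetch_progress() every interval_s seconds until a terminal state."""
    for _ in range(max_polls):
        progress = fetch_progress()
        # Assumption: the payload carries a "status" field.
        if progress.get("status") in ("completed", "failed"):
            return progress
        sleep(interval_s)
    raise TimeoutError("extraction did not finish within the polling budget")
```

Since the state lives in Redis rather than in the client, a poller started after a page refresh simply picks up the current progress.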
The extraction also creates implicit entities from relationships: when a WORKS_IN relationship references "Cardiologie" but no DEPARTMENT entity exists, the pipeline auto-creates it as an approved entity.
Operator Review
After the four-stage pipeline completes, entities sit as status = 'proposed' draft rows in taxonomy_entities. Operators manage them via the REST API:
API Endpoints
| Method | Path | Action |
|---|---|---|
| GET | /api/v1/entity-resolution/entities | List entities (filterable by type, status, search) |
| GET | /api/v1/entity-resolution/entities/{id} | Single entity with relationships |
| PATCH | /api/v1/entity-resolution/entities/{id} | Update name, type, status |
| POST | /api/v1/entity-resolution/entities/{id}/approve | Set status = approved |
| POST | /api/v1/entity-resolution/entities/{id}/reject | Set status = rejected |
| POST | /api/v1/entity-resolution/entities/merge | Merge entity B into entity A |
| POST | /api/v1/entity-resolution/entities/bulk-approve | Approve all entities above confidence threshold |
| GET | /api/v1/entity-resolution/entities/counts | Entity counts by type and status |
Tiered Bulk Merge
For large taxonomy datasets with hundreds of potential duplicates, reviewing merge candidates one by one is impractical. The system provides a tiered bulk merge workflow that groups merge candidates by confidence level, allowing operators to approve entire tiers at once:
| Tier | Criteria | Typical Action |
|---|---|---|
| 100% overlap | Exact token overlap between entity names | Auto-merge (highest confidence) |
| 80% overlap | >= 80% token overlap | Bulk merge with quick review |
| SNOMED match | Both entities map to the same SNOMED concept ID | Bulk merge (semantic equivalence) |
| High confidence | AI confidence >= 0.85 for both candidates | Bulk merge after spot check |
Each tier is presented as a batch in the UI. The operator can approve an entire tier with one click, or expand individual candidates for inspection before deciding.
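One way to assign a merge candidate pair to a tier is sketched below. Several details are assumptions: token overlap is computed as Jaccard similarity over lowercase whitespace tokens, the tiers are checked in table order, and pairs that match no tier fall through to NEEDS_REVIEW.

```python
def token_overlap(name_a: str, name_b: str) -> float:
    """Jaccard similarity of the lowercase token sets of two names."""
    a, b = set(name_a.lower().split()), set(name_b.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def merge_tier(a: dict, b: dict) -> str:
    """Classify a candidate pair into the first tier whose criterion it meets."""
    overlap = token_overlap(a["name"], b["name"])
    if overlap == 1.0:
        return "100% overlap"
    if overlap >= 0.8:
        return "80% overlap"
    snomed_a, snomed_b = a.get("snomed_concept_id"), b.get("snomed_concept_id")
    if snomed_a and snomed_a == snomed_b:
        return "SNOMED match"
    if min(a["ai_confidence"], b["ai_confidence"]) >= 0.85:
        return "High confidence"
    return "NEEDS_REVIEW"
```

For example, "hartfalen" and "Congestief hartfalen" share only half of their tokens, so they would reach a bulk tier only through a shared SNOMED concept ID or high confidence on both sides.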
NEEDS_REVIEW Merge Candidates
Merge candidates that do not meet the automatic threshold are flagged as NEEDS_REVIEW. These appear in a dedicated review queue with Merge and Reject buttons directly on each candidate row. This replaces the previous workflow where operators had to navigate to individual entity pages to perform merges. The inline action buttons reduce the number of clicks per decision from 4-5 to a single click, making it feasible to process hundreds of candidates in a single review session.
Audit Trail: taxonomy_overrides
Every operator action (approve, reject, rename, merge) is recorded in the taxonomy_overrides table:
| Column | Type | Description |
|---|---|---|
| id | UUID | |
| hospital_id | UUID FK | Scoped to hospital |
| override_type | VARCHAR(20) | approve, reject, rename, merge, bulk_approve |
| target_entity_id | UUID FK → taxonomy_entities | The affected entity |
| override_data | JSONB | Before/after state snapshot |
| applied_by | UUID FK → users | Operator user ID |
| draft_version | INTEGER | The draft version this override applied to |
| created_at | TIMESTAMPTZ | |
This audit trail is intended to satisfy the record-keeping requirement of EU AI Act Article 12: AI operations are logged automatically, and every human override decision is recorded alongside them.
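A row for this table could be assembled as in the sketch below. The column names follow the table above, but the `{"before": ..., "after": ...}` layout of override_data and the function name `build_override` are assumptions made for illustration.

```python
import uuid
from datetime import datetime, timezone

OVERRIDE_TYPES = {"approve", "reject", "rename", "merge", "bulk_approve"}


def build_override(hospital_id, entity_before, entity_after,
                   override_type, applied_by, draft_version):
    """Build a taxonomy_overrides row as a plain dict."""
    if override_type not in OVERRIDE_TYPES:
        raise ValueError(f"unknown override_type: {override_type!r}")
    return {
        "id": str(uuid.uuid4()),
        "hospital_id": hospital_id,
        "override_type": override_type,
        "target_entity_id": entity_before["id"],
        # Assumed JSONB layout: full snapshots of both states.
        "override_data": {"before": entity_before, "after": entity_after},
        "applied_by": applied_by,
        "draft_version": draft_version,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```

Storing full before and after snapshots, rather than a diff, lets a reviewer reconstruct any decision without replaying earlier overrides.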
Database Schema (SP-4 Additions)
SP-4 extends taxonomy_entities with columns added by migration 052:
| Column | Type | Description |
|---|---|---|
| hospital_id | UUID FK | References hospitals.id |
| name | VARCHAR(500) | Raw extracted name (pre-normalization) |
| dedup_key | VARCHAR(500) | Clustering key (peeters_e, cardiologie) |
| source_hub_id | UUID FK | Hub page that produced this entity |
| ai_confidence | FLOAT | LLM self-reported confidence 0.0–1.0 |
| snomed_concept_id | VARCHAR(20) | SNOMED-CT concept ID if matched |
| snomed_preferred_term | VARCHAR(500) | SNOMED preferred term (English) |
| snomed_match_confidence | FLOAT | Fuzzy match score |
| dedup_cluster_id | UUID | Shared ID for all entities in a cluster |
| dedup_is_primary | BOOLEAN | True for the cluster representative |
| merged_into | UUID FK → self | Points to primary if this entity was merged |
| draft_version | INTEGER | Draft version (incremented on each pipeline run) |
New unique constraint: (hospital_id, entity_type, dedup_key, draft_version) — prevents duplicate entries per version.
Indexes added:
- idx_te_hospital_type on (hospital_id, entity_type): entity list queries
- idx_te_hospital_status on (hospital_id, status): review queue filtering
- idx_te_dedup_key on (dedup_key): dedup clustering lookups
Integration with the Publish System
Approved entities (status = 'approved') are the input to SP-5's publish step. The publish pipeline queries:
```sql
SELECT *
FROM app.taxonomy_entities
WHERE hospital_id = :hospital_id
  AND draft_version = :draft_version
  AND status = 'approved'
  AND dedup_is_primary = true;
```
Only primary entities are published — non-primary cluster members are excluded unless the operator explicitly promotes them.
See Draft/Publish System for how approved entities transition to the live published_entities table consumed by FrozenTaxonomyRegistry.