Entity Resolution & Deduplication

The entity resolution pipeline (SP-4) transforms raw HTML content from confirmed hub pages into clean, deduplicated, SNOMED-linked draft taxonomy entities. It is the first stage of the Draft/Publish System — entities produced here land as draft rows that operators review before publishing.

SP-4 in the Pipeline

SP-4 sits between hub page confirmation (SP-2/SP-3) and the publish step (SP-5). Hub pages provide the URLs; entity resolution extracts and normalizes the entities; the publish system promotes approved entities to the live RAG index.

The Problem

The same real-world entity appears under dozens of name variations across a hospital's web properties:

Doctors: "dr. Peeters", "Dr. E. Peeters", "Erik Peeters", "Peeters Erik" — four strings, one person
Departments: "Cardiologie", "Hartafdeling", "Afdeling Cardiologie" — three strings, one department
Conditions: "hartfalen", "Hartfalen", "Congestief hartfalen" — same condition, varying capitalization and specificity

Without resolution, these duplicates produce a fragmented taxonomy: query enrichment fails ("hartfalen" doesn't match "Hartfalen"), relationship inference fires multiple times for the same pair, and operators see hundreds of redundant entities to review.

SP-4 solves this in four stages: name normalization → LLM extraction → deduplication clustering → SNOMED matching.

Pipeline Overview

Step 1: Name Normalization

name_normalizer.py converts raw extracted strings into a canonical form and generates a dedup key used downstream for clustering.

`strip_titles()`

Removes Dutch honorific and role prefixes before normalization:

Input	After `strip_titles()`
`dr. Erik Peeters`	`Erik Peeters`
`Prof. dr. Van den Berg`	`Van den Berg`
`Hoofdverpleegkundige Janssen`	`Janssen`
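The table above can be reproduced with a small regex loop. This is a minimal sketch, not the actual contents of name_normalizer.py: the prefix list `_TITLE_PREFIXES` is a hypothetical subset chosen to cover the three examples, and the real module may recognize more titles.

```python
import re

# Hypothetical prefix list (assumption); only covers the documented examples.
_TITLE_PREFIXES = ("prof", "dr", "drs", "hoofdverpleegkundige")
_TITLE_RE = re.compile(
    r"^(?:" + "|".join(_TITLE_PREFIXES) + r")\.?\s+",
    flags=re.IGNORECASE,
)


def strip_titles(raw_name: str) -> str:
    """Remove leading honorific and role prefixes, one at a time."""
    name = raw_name.strip()
    while True:
        stripped = _TITLE_RE.sub("", name, count=1)
        if stripped == name:  # nothing left to strip
            return name
        name = stripped
```

Stripping in a loop lets stacked titles such as `Prof. dr.` be removed without enumerating every combination. Requiring whitespace after the prefix keeps first names like "Dries" intact.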

`generate_dedup_key()`

Produces a stable, case-insensitive clustering key per entity type:

Entity Type	Key Strategy	Example
`doctor`	`{last_name}_{first_initial}`	`peeters_e`
`department`	`lowercase_underscored(canonical_name)`	`cardiologie`
`condition`	`lowercase_underscored(canonical_name)`	`hartfalen`
`treatment`	`lowercase_underscored(canonical_name)`	`knieprothese`
`examination`	`lowercase_underscored(canonical_name)`	`echografie_hart`

Doctor keys use last name + first initial rather than full name so that "Erik Peeters" and "E. Peeters" both produce peeters_e and cluster together.
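The key strategies in the table might be implemented roughly as follows. This sketch assumes titles have already been stripped and that doctor names arrive in "first last" order; the helper `_slug` is hypothetical, and the real function may handle reversed names ("Peeters Erik") or multi-word surnames differently.

```python
import re


def _slug(text: str) -> str:
    """Lowercase the text and join its word runs with underscores."""
    return "_".join(re.findall(r"\w+", text.lower()))


def generate_dedup_key(entity_type: str, canonical_name: str) -> str:
    """Build a case-insensitive clustering key for one entity."""
    if entity_type != "doctor":
        return _slug(canonical_name)
    # Assumption: first token is the first name (or initial), last token is the surname.
    tokens = canonical_name.replace(".", " ").split()
    if len(tokens) < 2:
        return _slug(canonical_name)
    first, last = tokens[0], tokens[-1]
    return f"{last.lower()}_{first[0].lower()}"
```

With these assumptions, `generate_dedup_key("doctor", "Erik Peeters")` and `generate_dedup_key("doctor", "E. Peeters")` both yield `peeters_e`, which is what lets the two strings cluster together.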

Why Not Fuzzy Matching for Dedup Keys?

Deterministic string keys are simpler, faster, and fully testable. Fuzzy string matching (Levenshtein, Jaro-Winkler) is deliberately not used to build keys. When a deterministic key does not suffice, the LLM dedup step in Step 3 resolves the case with semantic understanding.

Step 2: LLM Extraction

entity_extractor.py sends hub page HTML to GPT-4.1 mini with type-specific extraction prompts. Six prompts exist, one per entity type, each tuned for Dutch hospital content:

Prompt	Extracts	Key Instructions
`doctors`	Names, specialties, departments, campuses	Ignore job titles; extract only medical professionals
`departments`	Canonical department name, aliases, domain group	Normalize to Belgian standard department names
`conditions`	Dutch condition names, severity hints	Exclude body parts and generic terms
`treatments`	Treatment/procedure names	Exclude examinations and generic verbs
`examinations`	Diagnostic test names	Distinguish from treatments
`services`	Navigational services (Bezoekuren, Route)	Only concrete navigational entities

Each LLM call returns structured JSON with:

name — raw extracted string
confidence — float 0.0–1.0 (the model's self-reported confidence)
source_snippet — the text fragment that triggered the extraction

The confidence score is stored as ai_confidence on the taxonomy_entities row and determines which entity becomes the primary of its dedup cluster in Step 3.
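A defensive parser for this response shape could look like the sketch below. It assumes the model returns a top-level JSON array of objects with the three fields listed above; the class `ExtractedEntity` and the function `parse_extraction` are illustrative names, not part of entity_extractor.py.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedEntity:
    name: str
    confidence: float
    source_snippet: str


def parse_extraction(payload: str) -> list[ExtractedEntity]:
    """Parse a JSON array of extraction objects, dropping malformed items."""
    entities = []
    for item in json.loads(payload):
        try:
            name = str(item["name"]).strip()
            confidence = float(item["confidence"])
            snippet = str(item["source_snippet"])
        except (KeyError, TypeError, ValueError):
            continue  # missing field or wrong type: skip this item
        # Keep only non-empty names with a confidence inside the documented range.
        if name and 0.0 <= confidence <= 1.0:
            entities.append(ExtractedEntity(name, confidence, snippet))
    return entities
```

Validating the confidence range at parse time matters because the value later decides which entity becomes the cluster primary.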

Dutch Language Handling

All prompts explicitly instruct the model to work in Dutch. Entity names are preserved in their Dutch form. SNOMED matching (Step 4) handles the mapping to international concept IDs without translating the names.

Content Assembly

Page content is assembled from the database — no live HTTP fetches during extraction. The pipeline joins crawled_urls → document_chunks for each confirmed hub page child URL and concatenates all chunks. This ensures deterministic, reproducible extraction from the same corpus snapshot.
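The concatenation step can be illustrated on rows that have already been fetched from the join. This is a sketch under assumptions: the tuple shape `(url, chunk_index, text)` and the existence of an ordering column such as `chunk_index` are hypothetical, since the source only states that all chunks per URL are concatenated.

```python
from collections import defaultdict


def assemble_pages(chunk_rows):
    """Group (url, chunk_index, text) rows by URL and join chunks in index order."""
    by_url = defaultdict(list)
    for url, chunk_index, text in chunk_rows:
        by_url[url].append((chunk_index, text))
    return {
        url: "\n".join(text for _, text in sorted(chunks))
        for url, chunks in by_url.items()
    }
```

Sorting by an explicit index rather than relying on row order is what keeps the assembled text identical across runs against the same snapshot.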

Step 3: Deduplication Clustering

dedup_service.py groups extracted entities by their dedup_key and selects one as the primary entity per cluster:

Cluster rules:

All entities sharing a dedup_key receive the same dedup_cluster_id (a shared UUID)
The entity with the highest ai_confidence is marked dedup_is_primary = true
All others are marked dedup_is_primary = false
Non-primary entities are not deleted — they remain available for operator review and can be merged or reinstated
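The cluster rules above can be sketched in a few lines. The function below is illustrative and works on plain dicts rather than database rows; tie-breaking between equal confidences is not specified in the source, so this version simply keeps the first entity encountered (an assumption).

```python
import uuid
from collections import defaultdict


def assign_clusters(entities):
    """Set dedup_cluster_id and dedup_is_primary on each entity dict in place."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity["dedup_key"]].append(entity)

    for members in groups.values():
        cluster_id = str(uuid.uuid4())  # one shared UUID per dedup_key
        # max() returns the first maximal element, so ties go to the earliest entity.
        primary = max(members, key=lambda e: e["ai_confidence"])
        for member in members:
            member["dedup_cluster_id"] = cluster_id
            member["dedup_is_primary"] = member is primary
    return entities
```

Note that nothing is removed from the list: non-primary members stay in place with `dedup_is_primary` set to false, matching the rule that they remain available for review.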

LLM-assisted dedup handles edge cases where the key strategy isn't sufficient. When two different dedup keys refer to the same entity (e.g., a department scraped as "Intensieve Zorgen" from one page and "Intensive Care" from another), the LLM dedup step compares the candidates semantically and merges them, setting merged_into on the subordinate row.

Step 4: SNOMED Matching

resolution_pipeline.py links clinical entities (conditions, treatments, examinations) to SNOMED-CT concept IDs using the existing SnomedMatcher service:

Column	Set By	Example
`snomed_concept_id`	`SnomedMatcher.match()`	`"84114007"` (hartfalen)
`snomed_preferred_term`	SNOMED descriptions table	`"heart failure"`
`snomed_match_confidence`	SnomedMatcher fuzzy score	`0.94`

Doctors and departments do not receive SNOMED links — SNOMED covers clinical concepts, not organizational entities.

SNOMED as a Quality Signal

A high snomed_match_confidence (>0.85) is a strong indicator that the extracted entity name is a legitimate clinical term, not a spurious extraction artifact. The operator review UI surfaces this as a confidence badge.

SNOMED Matching Strategy

The SnomedMatcher uses a 5-tier matching strategy to maximize coverage of Dutch medical terminology:

Tier	Method	Example	Threshold
1	Exact match	"ablatie" → "ablatie"	1.0
2	Compound splitting + normalization	"peniskanker" → "kanker van penis"	0.95
3	Fuzzy (pg_trgm)	"hartritmestoornissen" → "hartritmestoornis"	≥0.4
4	Word overlap	All significant words must appear in SNOMED term	0.7
5	LLM fallback	GPT-4.1 mini normalizes Dutch term, then searches	0.85

The compound splitter handles Dutch medical suffixes (-kanker, -stoornis, -operatie) and anatomical prefixes (hart-, nier-, borst-), expanding them into multi-word forms and "van/de" patterns that match SNOMED descriptions.
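As an illustration of the tier 2 expansion, the sketch below generates candidate multi-word forms for a compound. It is a simplified assumption of how the splitter might work: the suffix and prefix tuples contain only the examples named above, the function name `expand_compound` is hypothetical, and Dutch linking letters and plural forms are not handled.

```python
# Only the affixes mentioned in the text; the real lists are presumably longer.
SUFFIXES = ("kanker", "stoornis", "operatie")
PREFIXES = ("hart", "nier", "borst")


def expand_compound(term: str) -> list[str]:
    """Return candidate multi-word forms for a Dutch compound, or [] if none apply."""
    word = term.strip().lower()
    candidates = []
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            candidates += [
                f"{suffix} van {stem}",
                f"{suffix} van de {stem}",
                f"{stem} {suffix}",
            ]
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            candidates.append(f"{prefix} {word[len(prefix):]}")
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(candidates))
```

For "peniskanker" the first candidate is "kanker van penis", the form shown in the tier table. Each candidate would then be looked up against SNOMED descriptions.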

Coverage results (ZOL dataset, 1,465 non-doctor entities):

Conditions: 97.5% (539/553)
Examinations: 90.1% (227/252)
Treatments: 86.5% (571/660)

Incremental Extraction

Entity extraction supports incremental mode: when re-running extraction, URLs that have already been successfully extracted are skipped. The pipeline checks the extraction_results table for existing successful results per URL and only processes new or previously failed URLs. Subsequent runs, for example after adding new hub pages or retrying after a partial failure, therefore spend LLM calls only on URLs that still need extraction.
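The skip logic amounts to a set difference. In the sketch below, the row shape `(url, status)` and the status label `"success"` are assumptions about the extraction_results table, which the source does not describe in detail.

```python
def urls_to_process(candidate_urls, extraction_results):
    """Return candidate URLs that have no successful extraction result yet."""
    # Assumption: each result row is (url, status) and success is labeled "success".
    done = {url for url, status in extraction_results if status == "success"}
    return [url for url in candidate_urls if url not in done]
```

A URL with only failed results is not in `done`, so it is retried on the next run, as is any URL that has never been extracted.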

Execution Model

Entity extraction runs as a background task (not SSE streaming) with Redis-stored progress:

POST /extraction/run → returns immediately, starts BackgroundTask
Progress stored in Redis key extraction_progress:{hospital_id}
Frontend polls GET /extraction/progress every 2 seconds
Survives page refresh — progress resumes on navigate-back
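The polling side of this model can be sketched independently of the HTTP client. In the example below, `fetch_progress` stands in for a call to GET /extraction/progress, and the payload field `status` with values `"completed"` and `"failed"` is an assumption; the actual progress payload is not documented here.

```python
import time


def poll_until_done(fetch_progress, interval_seconds=2.0, max_polls=300, sleep=time.sleep):
    """Call fetch_progress() every interval until it reports a terminal status."""
    for _ in range(max_polls):
        progress = fetch_progress()
        # Assumed terminal states; adjust to the real payload.
        if progress.get("status") in ("completed", "failed"):
            return progress
        sleep(interval_seconds)
    raise TimeoutError("extraction did not finish within the polling budget")
```

Because the progress lives in Redis rather than in the request, a client can start this loop at any time, including after a page refresh, and pick up the current state.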

The extraction also creates implicit entities from relationships: when a WORKS_IN relationship references "Cardiologie" but no DEPARTMENT entity exists, the pipeline auto-creates it as an approved entity.
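A rough sketch of that backfill follows. The relationship dict shape (`type`, `target`) and the case-insensitive name comparison are assumptions made for illustration; only the WORKS_IN example and the approved status come from the text above.

```python
def implicit_departments(relationships, existing_department_names):
    """Return department entities to auto-create for dangling WORKS_IN targets."""
    known = {name.casefold() for name in existing_department_names}
    created = []
    for rel in relationships:
        if rel["type"] != "WORKS_IN":
            continue
        target = rel["target"]
        if target.casefold() not in known:
            known.add(target.casefold())  # avoid creating the same department twice
            created.append(
                {"name": target, "entity_type": "department", "status": "approved"}
            )
    return created
```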

Operator Review

After the four-stage pipeline completes, entities sit as status = 'proposed' draft rows in taxonomy_entities. Operators manage them via the REST API:

API Endpoints

Method	Path	Action
`GET`	`/api/v1/entity-resolution/entities`	List entities (filterable by type, status, search)
`GET`	`/api/v1/entity-resolution/entities/{id}`	Single entity with relationships
`PATCH`	`/api/v1/entity-resolution/entities/{id}`	Update name, type, status
`POST`	`/api/v1/entity-resolution/entities/{id}/approve`	Set status = approved
`POST`	`/api/v1/entity-resolution/entities/{id}/reject`	Set status = rejected
`POST`	`/api/v1/entity-resolution/entities/merge`	Merge entity B into entity A
`POST`	`/api/v1/entity-resolution/entities/bulk-approve`	Approve all entities above confidence threshold
`GET`	`/api/v1/entity-resolution/entities/counts`	Entity counts by type and status

Tiered Bulk Merge

For large taxonomy datasets with hundreds of potential duplicates, reviewing merge candidates one by one is impractical. The system provides a tiered bulk merge workflow that groups merge candidates by confidence level, allowing operators to approve entire tiers at once:

Tier	Criteria	Typical Action
100% overlap	Exact token overlap between entity names	Auto-merge (highest confidence)
80% overlap	>= 80% token overlap	Bulk merge with quick review
SNOMED match	Both entities map to the same SNOMED concept ID	Bulk merge (semantic equivalence)
High confidence	AI confidence >= 0.85 for both candidates	Bulk merge after spot check

Each tier is presented as a batch in the UI. The operator can approve an entire tier with one click, or expand individual candidates for inspection before deciding.
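Tier assignment for a candidate pair might be sketched as below. Several choices here are assumptions rather than documented behavior: token overlap is computed as Jaccard similarity over lowercase word tokens, tiers are checked in table order, and pairs that match no tier fall through to `NEEDS_REVIEW`.

```python
import re


def token_overlap(name_a: str, name_b: str) -> float:
    """Jaccard similarity of the two names' lowercase token sets (assumed metric)."""
    a = set(re.findall(r"\w+", name_a.lower()))
    b = set(re.findall(r"\w+", name_b.lower()))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def merge_tier(a: dict, b: dict) -> str:
    """Classify a merge candidate pair into one of the bulk merge tiers."""
    overlap = token_overlap(a["name"], b["name"])
    if overlap == 1.0:
        return "100% overlap"
    if overlap >= 0.8:
        return "80% overlap"
    concept = a.get("snomed_concept_id")
    if concept and concept == b.get("snomed_concept_id"):
        return "SNOMED match"
    if min(a.get("ai_confidence", 0.0), b.get("ai_confidence", 0.0)) >= 0.85:
        return "High confidence"
    return "NEEDS_REVIEW"
```

With a set-based metric, word order does not matter, so "Erik Peeters" and "Peeters Erik" land in the 100% tier, while "Cardiologie" and "Afdeling Cardiologie" score 0.5 and do not.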

NEEDS_REVIEW Merge Candidates

Merge candidates that do not meet the automatic threshold are flagged as NEEDS_REVIEW. These appear in a dedicated review queue with Merge and Reject buttons directly on each candidate row. This replaces the previous workflow where operators had to navigate to individual entity pages to perform merges. The inline action buttons reduce the number of clicks per decision from 4-5 to a single click, making it feasible to process hundreds of candidates in a single review session.

Audit Trail: `taxonomy_overrides`

Every operator action (approve, reject, rename, merge) is recorded in the taxonomy_overrides table:

Column	Type	Description
`id`	UUID	Primary key
`hospital_id`	UUID FK	Scoped to hospital
`override_type`	VARCHAR(20)	`approve`, `reject`, `rename`, `merge`, `bulk_approve`
`target_entity_id`	UUID FK → `taxonomy_entities`	The affected entity
`override_data`	JSONB	Before/after state snapshot
`applied_by`	UUID FK → `users`	Operator user ID
`draft_version`	INTEGER	The draft version this override applied to
`created_at`	TIMESTAMPTZ	When the override was recorded

This audit trail is designed to support the record-keeping requirement of Article 12 of the EU AI Act: every human override of an AI-proposed entity is logged automatically, together with a before/after snapshot of the affected row.

Database Schema (SP-4 Additions)

SP-4 extends taxonomy_entities with columns added by migration 052:

Column	Type	Description
`hospital_id`	UUID FK	References `hospitals.id`
`name`	VARCHAR(500)	Raw extracted name (pre-normalization)
`dedup_key`	VARCHAR(500)	Clustering key (`peeters_e`, `cardiologie`)
`source_hub_id`	UUID FK	Hub page that produced this entity
`ai_confidence`	FLOAT	LLM self-reported confidence 0.0–1.0
`snomed_concept_id`	VARCHAR(20)	SNOMED-CT concept ID if matched
`snomed_preferred_term`	VARCHAR(500)	SNOMED preferred term (English)
`snomed_match_confidence`	FLOAT	Fuzzy match score
`dedup_cluster_id`	UUID	Shared ID for all entities in a cluster
`dedup_is_primary`	BOOLEAN	True for the cluster representative
`merged_into`	UUID FK → self	Points to primary if this entity was merged
`draft_version`	INTEGER	Draft version (incremented on each pipeline run)

New unique constraint: (hospital_id, entity_type, dedup_key, draft_version) — prevents duplicate entries per version.

Indexes added:

idx_te_hospital_type on (hospital_id, entity_type) — entity list queries
idx_te_hospital_status on (hospital_id, status) — review queue filtering
idx_te_dedup_key on (dedup_key) — dedup clustering lookups

Integration with the Publish System

Approved entities (status = 'approved') are the input to SP-5's publish step. The publish pipeline queries:

```sql
SELECT * FROM app.taxonomy_entities
WHERE hospital_id = :hospital_id
  AND draft_version = :draft_version
  AND status = 'approved'
  AND dedup_is_primary = true;
```

Only primary entities are published — non-primary cluster members are excluded unless the operator explicitly promotes them.

See Draft/Publish System for how approved entities transition to the live published_entities table consumed by FrozenTaxonomyRegistry.

