Taxonomy Deduplication & Gap Fill

Problem Statement

After scaling the document corpus to 573 PDF brochures and running multiple extraction rounds, the published taxonomy accumulated 12,997 entities — but 79% of those were duplicates. The root cause was cross-version extraction: each new pipeline run produced a new taxonomy version and published it, adding new copies of entities that already existed under an earlier version number.

The result was a bloated published_entities table where, for example, "Cardiologie" appeared as 8 separate rows (one per published version), each with its own UUID. Query enrichment logic that looked up entities by canonical_name would match all 8 rows, causing unpredictable behaviour depending on which row was returned first. Downstream, published_relationships mirrored this inflation: 6,821 of 12,997 entities had no relationships at all (orphans), meaning the RAG pipeline could not route queries about those conditions or treatments to any department.

Two Alembic migrations address this in sequence:

Migration	Scope
056 — Dedup published entities	Collapse duplicates, remap relationships, remove stale rows
057 — SNOMED relationship gap fill	Re-attach orphaned CONDITION/TREATMENT/EXAMINATION entities using the SNOMED transitive closure, with a Dutch manual-seed fallback

Migration 056: Entity Deduplication

Step-by-Step Details

Step 1 — Backup. Before any mutation, three tables are created:

app._dedup_entity_backup — full copy of published_entities
app._dedup_rel_backup — full copy of published_relationships
app._dedup_mapping — maps old_entity_id → surviving_entity_id for every loser

These tables persist after migration so operators can audit changes. They are dropped during downgrade().

Step 2 — Group and score. All entities are loaded with their relationship count (occurrences as source or target in published_relationships). They are grouped in Python via build_dedup_groups() using the composite key (hospital_id, entity_type, canonical_name). Within each group pick_survivor() selects the best representative.

Step 3 — Remap relationships. Any relationship that referenced a losing entity ID (either as source or target) is updated to point to the survivor. This preserves all navigational and medical relationships; only the UUID changes.

Step 4 — Deduplicate relationships. After remapping, multiple relationships may now share identical (hospital_id, source_entity_id, target_entity_id, relationship_type) tuples. A self-join DELETE removes the duplicates, keeping the row with the lower UUID.

Step 5 — Delete losers. All entities whose IDs appear in _dedup_mapping.old_entity_id are deleted. Only the survivors remain.

Step 6 — Remove orphaned DOCTORs. Doctors without any relationship (no WORKS_IN, no LOCATED_AT) are deleted — a doctor not linked to a department or campus has no value in the knowledge graph. CONDITION, TREATMENT, and EXAMINATION orphans are intentionally preserved here; Migration 057 will attempt to repair them.
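Steps 3 to 5 above can be sketched against an in-memory SQLite database. This is a minimal illustration, not the migration itself: the table names `entities`, `rels`, and `mapping` are simplified stand-ins for published_entities, published_relationships, and _dedup_mapping, and the SQL is written in a portable form rather than the dialect the real migration uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE rels (
    id TEXT PRIMARY KEY, hospital_id TEXT,
    source_entity_id TEXT, target_entity_id TEXT, relationship_type TEXT
);
CREATE TABLE mapping (old_entity_id TEXT PRIMARY KEY, surviving_entity_id TEXT);

-- Two copies of the same department (d1 survives, d2 loses) and one condition.
INSERT INTO entities VALUES ('d1', 'cardiologie'), ('d2', 'cardiologie'), ('c1', 'hartfalen');
INSERT INTO rels VALUES
    ('r1', 'h1', 'd1', 'c1', 'TREATS'),
    ('r2', 'h1', 'd2', 'c1', 'TREATS');
INSERT INTO mapping VALUES ('d2', 'd1');
""")

# Step 3: point every edge that references a loser at its survivor.
for column in ("source_entity_id", "target_entity_id"):
    conn.execute(f"""
        UPDATE rels
        SET {column} = (SELECT surviving_entity_id FROM mapping
                        WHERE old_entity_id = rels.{column})
        WHERE {column} IN (SELECT old_entity_id FROM mapping)
    """)

# Step 4: remapping can create identical edges; keep the row with the lower id.
conn.execute("""
    DELETE FROM rels WHERE id IN (
        SELECT a.id FROM rels a JOIN rels b
          ON a.hospital_id = b.hospital_id
         AND a.source_entity_id = b.source_entity_id
         AND a.target_entity_id = b.target_entity_id
         AND a.relationship_type = b.relationship_type
         AND a.id > b.id
    )
""")

# Step 5: delete the losers now that nothing references them.
conn.execute("DELETE FROM entities WHERE id IN (SELECT old_entity_id FROM mapping)")

remaining_rels = conn.execute(
    "SELECT id, source_entity_id, target_entity_id FROM rels ORDER BY id"
).fetchall()
remaining_entities = [r[0] for r in conn.execute("SELECT id FROM entities ORDER BY id")]
```

The order matters: deleting losers before remapping would leave edges pointing at entity IDs that no longer exist.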

Survivor Selection Algorithm

pick_survivor() in dedup_published.py ranks candidates by a scoring tuple evaluated in descending order:

score(entity) = (
    rel_count,            # number of existing relationships — highest wins
    has_snomed,           # 1 if snomed_concept_id is set, 0 otherwise
    -len(name),           # shorter display name preferred (negated for descending sort)
)

The entity with the highest tuple value becomes the survivor. The intuition: a well-connected entity with a SNOMED code and a concise display name is the canonical representation. Losers' relationship edges are all remapped to the survivor before the losers are deleted.
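Grouping and survivor selection can be illustrated with a short Python sketch. The `Entity` row shape, the function signatures, and the SNOMED placeholder value are assumptions made for illustration; only the grouping key and the scoring tuple come from the description above. The source does not say how a full tie is broken, so this sketch simply keeps the first candidate that `max` encounters.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    # Hypothetical row shape; the real column names may differ.
    entity_id: str
    hospital_id: str
    entity_type: str
    canonical_name: str
    name: str
    rel_count: int
    snomed_concept_id: Optional[str]


def build_dedup_groups(entities):
    """Group by (hospital_id, entity_type, canonical_name); keep only real duplicates."""
    groups = defaultdict(list)
    for e in entities:
        groups[(e.hospital_id, e.entity_type, e.canonical_name)].append(e)
    return {key: members for key, members in groups.items() if len(members) > 1}


def score(e):
    # Compared left to right: relationships, then SNOMED presence, then shorter name.
    return (e.rel_count, 1 if e.snomed_concept_id else 0, -len(e.name))


def pick_survivor(members):
    survivor = max(members, key=score)
    losers = [m for m in members if m.entity_id != survivor.entity_id]
    return survivor, losers


rows = [
    Entity("u1", "h1", "DEPARTMENT", "cardiologie", "Cardiologie", 3, None),
    Entity("u2", "h1", "DEPARTMENT", "cardiologie", "Cardiologie", 3, "snomed-placeholder"),
    Entity("u3", "h1", "DEPARTMENT", "cardiologie", "Dienst Cardiologie", 0, "snomed-placeholder"),
    Entity("u4", "h1", "CONDITION", "hartfalen", "Hartfalen", 1, None),
]

# old_entity_id -> surviving_entity_id, the same shape as _dedup_mapping.
mapping = {}
for members in build_dedup_groups(rows).values():
    survivor, losers = pick_survivor(members)
    for loser in losers:
        mapping[loser.entity_id] = survivor.entity_id
```

Here `u2` wins: it ties with `u1` on relationship count but carries a SNOMED code, and it beats `u3` on relationship count alone.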

Rollback

downgrade() restores the pre-migration state exactly:

Truncates published_entities and published_relationships
Re-inserts all rows from the backup tables
Drops _dedup_mapping, _dedup_entity_backup, and _dedup_rel_backup

Migration 057: SNOMED Relationship Gap Fill

After deduplication, orphaned CONDITION, TREATMENT, and EXAMINATION entities still exist with no TREATS/PERFORMS relationship to any department. Migration 057 restores those connections using two strategies: walking the SNOMED transitive closure, then falling back to a curated Dutch manual seed dictionary.

SNOMED Transitive Closure Walk

The app.snomed_transitive_closure table stores precomputed ancestor/descendant pairs for all SNOMED CT concepts in the Dutch release. The gap filler exploits this to bridge the gap between a clinical concept and a hospital department:

Direct map: For each published department entity that carries a snomed_concept_id, record {snomed_id → department_entity_id}.
Expand via closure: Query snomed_transitive_closure for all descendants of those department SNOMED IDs. Each descendant SNOMED concept is now also mapped to that department.
Orphan lookup: For each orphan with a snomed_concept_id, check if its concept ID (or any of its ancestors) appears in the expanded map. The first match at the shallowest depth wins.

The result is that a condition like "hartinfarct" (SNOMED 22298006), which is a descendant of the cardiology SNOMED branch, is automatically linked to the Cardiologie department entity — even if no explicit relationship was ever extracted.
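The three steps of the closure walk can be sketched as follows. This is an illustration under stated assumptions: the closure rows are modelled as `(ancestor, descendant, depth)` tuples, `find_department_via_snomed` is a hypothetical name, and the department concept IDs (`S_CARDIO`, `S_INTERNAL`) are placeholders rather than real SNOMED codes. Only 22298006 is taken from the text above.

```python
def build_expanded_map(dept_by_snomed, closure_rows):
    """Map every SNOMED concept at or below a department concept to that department.

    Values are (depth, department_id) so that the shallowest match can win.
    """
    # Direct map: the department's own concept sits at depth 0.
    expanded = {sid: (0, dept) for sid, dept in dept_by_snomed.items()}
    # Expand via closure: each descendant inherits the department of its ancestor.
    for ancestor, descendant, depth in closure_rows:
        dept = dept_by_snomed.get(ancestor)
        if dept is None:
            continue
        best = expanded.get(descendant)
        if best is None or depth < best[0]:
            expanded[descendant] = (depth, dept)
    return expanded


def find_department_via_snomed(orphan_snomed_id, expanded):
    """Orphan lookup: return the department id, or None if no branch covers it."""
    hit = expanded.get(orphan_snomed_id)
    return hit[1] if hit else None


dept_by_snomed = {"S_CARDIO": "dept-cardiologie", "S_INTERNAL": "dept-interne"}
closure_rows = [
    ("S_INTERNAL", "S_CARDIO", 1),
    ("S_CARDIO", "22298006", 2),    # hartinfarct under the cardiology branch
    ("S_INTERNAL", "22298006", 3),  # also reachable from a broader branch
]
expanded = build_expanded_map(dept_by_snomed, closure_rows)
```

Because the expanded map already contains every descendant, a single dictionary lookup on the orphan's concept ID covers the "any of its ancestors" case, and storing the depth is what lets the narrower department win over a broader one.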

Manual Seed Fallback

For orphans with no snomed_concept_id, or whose SNOMED ancestors don't intersect any mapped department, the filler falls back to a curated dictionary keyed on the entity's lowercased canonical_name:

Condition (canonical_name)	Department (canonical_name)
borstkanker	oncologie
beroerte	neurologie
hartfalen	cardiologie
diabetes	endocrinologie
nierfalen	nefrologie
longontsteking	pneumologie
darmkanker	gastro-enterologie
prostaatkanker	urologie
leukemie	hematologie
reuma	reumatologie
epilepsie	neurologie
astma	pneumologie
copd	pneumologie
nierstenen	urologie
schildklieraandoeningen	endocrinologie
maagkanker	gastro-enterologie
leveraandoeningen	gastro-enterologie
huidkanker	dermatologie
psoriasis	dermatologie
depressie	psychiatrie
angststoornis	psychiatrie
glaucoom	oogheelkunde
cataract	oogheelkunde
gehoorverlies	neus_keel_en_oorheelkunde
heupfractuur	orthopedie
rughernia	orthopedie
blaaskanker	urologie

The seed lookup (find_department_via_manual_seed) is case-insensitive and exact on canonical_name. If a department with the returned canonical name exists in published_entities for that hospital, a new relationship is created.
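A minimal sketch of the seed fallback, assuming a signature that takes the orphan's canonical_name and a per-hospital dictionary of department entities keyed by canonical name. Both the signature and the `departments_by_name` argument are assumptions; the seed entries shown are a small excerpt of the table above.

```python
# Excerpt of the seed dictionary; keys and values are lowercased canonical names.
MANUAL_SEED = {
    "hartfalen": "cardiologie",
    "astma": "pneumologie",
    "gehoorverlies": "neus_keel_en_oorheelkunde",
}


def find_department_via_manual_seed(canonical_name, departments_by_name):
    """Return the department entity id for a seeded term, or None.

    The match is case-insensitive but exact: no substring or fuzzy matching.
    """
    dept_name = MANUAL_SEED.get(canonical_name.lower())
    if dept_name is None:
        return None
    # The seed only helps if this hospital actually publishes that department.
    return departments_by_name.get(dept_name)


hospital_departments = {"cardiologie": "dept-1"}
```

With this data, "Hartfalen" resolves to `dept-1`, while "astma" returns `None` because the hospital has no pneumologie entity to attach it to.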

Relationship Type Assignment

determine_relationship_type() maps entity type to relationship type:

Entity Type	Relationship Created
CONDITION	TREATS (department → condition)
TREATMENT	PERFORMS (department → treatment)
EXAMINATION	PERFORMS (department → examination)
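The mapping above amounts to a small lookup. In the sketch below, the function signature, the `build_relationship` helper, and the dictionary row shape are hypothetical; the type-to-relationship pairs and the department-to-entity direction come from the table.

```python
RELATIONSHIP_BY_ENTITY_TYPE = {
    "CONDITION": "TREATS",
    "TREATMENT": "PERFORMS",
    "EXAMINATION": "PERFORMS",
}


def determine_relationship_type(entity_type):
    """Return the relationship type for an orphan, or None for unsupported types."""
    return RELATIONSHIP_BY_ENTITY_TYPE.get(entity_type)


def build_relationship(department_id, orphan_id, entity_type):
    """Hypothetical helper: the department is always the source of the edge."""
    rel_type = determine_relationship_type(entity_type)
    if rel_type is None:
        return None
    return {
        "source_entity_id": department_id,
        "target_entity_id": orphan_id,
        "relationship_type": rel_type,
    }
```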

Tracking Table

All relationship IDs inserted by Migration 057 are recorded in app._gap_fill_relationships. This table is the sole basis for rollback — downgrade() deletes only those rows.

Rollback

DELETE FROM app.published_relationships
WHERE id IN (SELECT relationship_id FROM app._gap_fill_relationships);

DROP TABLE IF EXISTS app._gap_fill_relationships;

Unlike Migration 056, this rollback does not restore backups — it only removes the relationships that were added. The deduplicated entity state from Migration 056 is preserved.

Before / After Results

Metric	Before (056+057)	After (056+057)
Total published entities	12,997	~2,700
Duplicate entities	~10,297 (79%)	0
Orphaned entities (no relationships)	6,821 (all entity types)	~200 (residual CONDITION/TREATMENT/EXAMINATION)
Orphaned DOCTOR entities	included in the 6,821 above	0 (deleted)
CAMPUS entities	25 (incl. 13 duplicates)	12
Relationships remapped	—	all preserved via _dedup_mapping
New SNOMED-derived relationships	0	varies by SNOMED coverage
New seed-derived relationships	0	up to 27 seed terms matched

Residual orphans after Migration 057 are CONDITION/TREATMENT/EXAMINATION entities that carry no snomed_concept_id and whose canonical_name does not match any manual seed. These remain in the taxonomy as isolated nodes and are candidates for future manual linking or an expanded seed list.

Relationship to the Draft-Publish Lifecycle

Migrations 056 and 057 are one-time cleanup operations targeting the bloated state that accumulated before per-version isolation was enforced. In the current draft-publish pipeline (SP-5), each publish cycle creates a clean snapshot.

Going forward, the dedup logic in dedup_published.py is available as a library function that the publish endpoint can invoke incrementally — preventing the same accumulation from recurring. The SNOMED gap filler likewise can be run as a scheduled maintenance pass after each publish cycle if new orphans are detected.

CAMPUS entities are always excluded from deduplication: they are populated from the authoritative campuses table (ADR-0026) and have stable UUIDs across publish versions.

Department Co-Reference Clustering (shipped 2026-06-09)

Migrations 056/057 collapse cross-version duplicates within published_entities. A separate, complementary problem exists in taxonomy_entities: the same real-world department can be recorded as multiple distinct rows. For example, MKA existed as 5 separate DEPARTMENT entities, but only one of them carried the 7 WORKS_IN doctor edges. Before this fix, a roster query that resolved to any of the zero-doctor variants returned an empty list and fell back to the help desk.

What Shipped

The solution is non-destructive co-reference clustering using two new columns on taxonomy_entities:

Column	Purpose
`dedup_cluster_id` (UUID)	Shared cluster key; all co-referent entities in one cluster carry the same value (set to the primary's `entity_id`)
`dedup_is_primary` (boolean)	Marks the canonical representative within a cluster — the entity with the highest `doctor_count`, shortest name as tiebreaker

No entities are deleted and no merged_into pointer is written. The clustering is fully reversible.
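How the two columns are filled for one reviewed cluster can be sketched as below. The `assign_cluster` function and the dictionary row shape are hypothetical; the selection rule (highest doctor_count, shortest name as tiebreaker) and the use of the primary's entity_id as the cluster key follow the table above.

```python
def assign_cluster(members):
    """Return the column values to write for every member of one cluster."""
    # Highest doctor_count first; among equals, the shortest name.
    primary = min(members, key=lambda m: (-m["doctor_count"], len(m["name"])))
    return [
        {
            "entity_id": m["entity_id"],
            "dedup_cluster_id": primary["entity_id"],
            "dedup_is_primary": m["entity_id"] == primary["entity_id"],
        }
        for m in members
    ]


cluster = [
    {"entity_id": "e1", "name": "MKA", "doctor_count": 7},
    {"entity_id": "e2", "name": "Dienst MKA", "doctor_count": 0},
    {"entity_id": "e3", "name": "Mond Kaak", "doctor_count": 0},
]
assigned = assign_cluster(cluster)
```

Every row keeps its own entity_id, so clearing the two columns undoes the clustering without touching any relationship.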

Three Components

scripts/cluster_department_entities.py — two-phase offline job

Runs in two phases separated by a human review step:

--propose — loads all active DEPARTMENT entities for a hospital, runs pairwise token-subset co-reference detection (see below), writes a dept_cluster_proposals.json file. No DB writes at this phase.
--apply --in <file> — reads the human-reviewed JSON, validates every entity id exists as an active DEPARTMENT for the given hospital, then writes dedup_cluster_id / dedup_is_primary. Idempotent (re-run = no-op when values are already correct).

The human review gate is the only trust boundary for the write path. No automated path can set dedup_cluster_id without a reviewed proposals file.

Re-run after full re-ingest

A full re-ingest can reset dedup_cluster_id to NULL on rebuilt entities (tracked in issue #175, closed by PR #176). The clustering job must be re-run and re-applied after any full re-ingest.

Co-reference criterion — token-subset only (no trigram, no is_consistent)

Two candidates are co-referent iff, after fold_clinical_term, one's significant-token set is a non-empty subset of the other's. Stopwords (dienst, afdeling, de, het, en, van, voor) are removed before comparison.

This is deliberately conservative:

Punctuation/suffix/parenthetical variants merge (Mond- Kaak- vs Mond-, Kaak- … (MKA)).
True synonyms with different root words do not auto-merge (chirurgie vs heelkunde — must be confirmed by a qualified ZOL taxonomy owner).
Bare abbreviations that share no significant token do not auto-merge.

A guarded-fuzzy tier (trigram-Jaccard + is_consistent domain guard) was considered and dropped: the domain guard cannot distinguish Cardiologie from Cardiochirurgie, removing the only property that justified allowing approximate matches. The job's co-reference logic is purely structural.
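The token-subset criterion is small enough to sketch in full. The `fold` function here is a simplified stand-in for fold_clinical_term (lowercasing plus punctuation folding only; other folding the real function performs is not modelled), and `is_coreferent` is a hypothetical name. The stopword list is the one given above; the sample department names are illustrative.

```python
import re

STOPWORDS = {"dienst", "afdeling", "de", "het", "en", "van", "voor"}


def fold(term):
    """Simplified stand-in for fold_clinical_term: lowercase, punctuation to spaces."""
    return re.sub(r"[-,;:/().']", " ", term.lower())


def significant_tokens(term):
    return {t for t in fold(term).split() if t not in STOPWORDS}


def is_coreferent(a, b):
    """True iff one significant-token set is a non-empty subset of the other."""
    ta, tb = significant_tokens(a), significant_tokens(b)
    if not ta or not tb:
        return False
    return ta <= tb or tb <= ta
```

Punctuation and parenthetical variants such as `"Mond-Kaak"` and `"Mond, Kaak (MKA)"` merge, while `"chirurgie"` / `"heelkunde"`, `"Cardiologie"` / `"Cardiochirurgie"`, and a bare `"MKA"` against `"Mond Kaak"` do not, because none of those pairs shares a whole token.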

app/services/department_resolver.py — alias-only resolver (no DB access)

A pure function resolve_department(term, candidates) that takes a free-text department name and a caller-supplied list of DeptCandidate objects, and returns a ResolverResult via a four-tier deterministic cascade:

Tier	Logic	Confidence
1. Guard	blank term or empty candidates	`no_input`
2. Exact	`fold(name) == fold(term)`	1.0
3. Token	significant-token set subset relation	0.9
4. Alias	`fold(alias) == fold(term)` for any alias	0.85

fold_clinical_term was extended to also fold punctuation (-,;:/().') to spaces, which directly fixes the MKA mond-kaak-en incident (2026-06-08) as well as the prior ligature (ae→e, #168) and longest-name (#165) failures. Every call emits a structured log line: department_resolved term=… matched=… tier=… considered=… doctor_count=… (R1 discipline).
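The cascade can be sketched as a pure function. The field layout of `DeptCandidate` and `ResolverResult`, the `no_match` outcome, the 0.0 confidence on the guard path, and the choice to return the first matching candidate within a tier are all assumptions of this sketch; the tier order and the 1.0 / 0.9 / 0.85 confidences come from the table above. `fold` is again a simplified stand-in for fold_clinical_term, and structured logging is omitted.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

STOPWORDS = {"dienst", "afdeling", "de", "het", "en", "van", "voor"}


def fold(term):
    # Lowercase, fold punctuation to spaces, collapse repeated whitespace.
    return " ".join(re.sub(r"[-,;:/().']", " ", term.lower()).split())


def tokens(term):
    return {t for t in fold(term).split() if t not in STOPWORDS}


@dataclass
class DeptCandidate:  # hypothetical shape
    name: str
    aliases: List[str] = field(default_factory=list)


@dataclass
class ResolverResult:  # hypothetical shape
    matched: Optional[DeptCandidate]
    tier: str
    confidence: float


def resolve_department(term, candidates):
    # Tier 1, guard: nothing to resolve.
    if not term or not term.strip() or not candidates:
        return ResolverResult(None, "no_input", 0.0)
    folded = fold(term)
    # Tier 2, exact match on the folded name.
    for c in candidates:
        if fold(c.name) == folded:
            return ResolverResult(c, "exact", 1.0)
    # Tier 3, significant-token subset in either direction.
    term_tokens = tokens(term)
    for c in candidates:
        cand_tokens = tokens(c.name)
        if term_tokens and cand_tokens and (
            term_tokens <= cand_tokens or cand_tokens <= term_tokens
        ):
            return ResolverResult(c, "token", 0.9)
    # Tier 4, exact match on any folded alias.
    for c in candidates:
        if any(fold(a) == folded for a in c.aliases):
            return ResolverResult(c, "alias", 0.85)
    return ResolverResult(None, "no_match", 0.0)


candidates = [
    DeptCandidate("Mond-, Kaak- en Aangezichtschirurgie", aliases=["MKA"]),
    DeptCandidate("Cardiologie"),
]
```

Because the function receives its candidates from the caller and touches no database, the same cascade can be exercised in unit tests with plain lists like the one above.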

app/services/department_roster.py — cluster-aware roster aggregation

When the matched entity carries a non-null dedup_cluster_id, the roster SQL expands the WORKS_IN predicate to cover all entities in the trusted cluster, not just the matched entity:

AND (te.dedup_cluster_id = CAST(:cid AS UUID) OR te.name = :dept)

This means a query that resolves to any variant — including zero-doctor duplicates — still returns the full doctor roster of the canonical primary. When dedup_cluster_id is NULL the behaviour is identical to the pre-clustering path (single-entity match).
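The effect of the expanded predicate can be demonstrated with an in-memory SQLite database. The schema is a simplified assumption (a `works_in` table stands in for WORKS_IN relationship rows, and the `CAST ... AS UUID` from the fragment above is dropped because SQLite has no UUID type); doctor and department names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxonomy_entities (entity_id TEXT PRIMARY KEY, name TEXT, dedup_cluster_id TEXT);
CREATE TABLE works_in (doctor TEXT, department_id TEXT);

-- e1 is the primary and holds the doctor edges; e2 is a zero-doctor variant.
INSERT INTO taxonomy_entities VALUES
    ('e1', 'MKA', 'e1'),
    ('e2', 'Dienst MKA', 'e1'),
    ('e3', 'Cardiologie', NULL);
INSERT INTO works_in VALUES ('Dr. A', 'e1'), ('Dr. B', 'e1'), ('Dr. C', 'e3');
""")

ROSTER_SQL = """
SELECT w.doctor
FROM works_in w
JOIN taxonomy_entities te ON te.entity_id = w.department_id
WHERE (te.dedup_cluster_id = :cid OR te.name = :dept)
ORDER BY w.doctor
"""


def roster(dept_name, cluster_id):
    """Return doctor names for the matched department, expanded across its cluster."""
    rows = conn.execute(ROSTER_SQL, {"cid": cluster_id, "dept": dept_name})
    return [r[0] for r in rows]
```

Calling `roster("Dienst MKA", "e1")` returns both doctors even though `e2` itself has no edges. With `cluster_id=None` the first comparison evaluates to NULL, so only the name match applies and the same call returns an empty list, which is the pre-clustering behaviour.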

Flag-gated — off by default

The DepartmentResolver cascade and this cluster-aware roster aggregation are gated behind the department_resolver_enabled setting (default False). With the flag off, each surface uses its legacy per-surface matching and the roster query does not expand across the cluster — so cluster co-reference only changes answers once the flag is enabled (and a cluster has been applied via the human-gated job above).

Data Flow

free-text department term
  ─▶ surface fetches all DEPARTMENT candidates (own SQL: published_* | taxonomy_*)
       candidates carry doctor_count + cluster_id + aliases
  ─▶ DepartmentResolver.resolve_department(term, candidates)
  ─▶ matched entity (+ cluster_id)
  ─▶ roster SQL: if cluster_id present → UNION WORKS_IN across cluster
                 else → single entity match (unchanged)
  ─▶ non-empty roster ? answer with names : graceful fallback (dept phone + transfer)

Relationship to Migrations 056/057

Migrations 056/057 were one-time cleanup operations against published_entities (the old, bloated table). Department co-reference clustering operates on taxonomy_entities (the current live taxonomy) and is a runtime, ongoing maintenance task rather than a one-time migration. The two mechanisms are complementary: 056/057 collapsed cross-version duplicates within published taxonomy; clustering reconciles entity fragmentation within the live extraction graph.

Related Pages

Taxonomy Extraction Pipeline — How entities are extracted and published
Medical Knowledge Architecture — Entity type definitions and relationship schema
Entity Resolution — Dedup and SNOMED matching at extraction time
