Taxonomy Deduplication & Gap Fill
Problem Statement
After scaling the document corpus to 573 PDF brochures and running multiple extraction rounds, the published taxonomy accumulated 12,997 entities — but 79% of those were duplicates. The root cause was cross-version extraction: each new pipeline run produced a new taxonomy version and published it, adding new copies of entities that already existed under an earlier version number.
The result was a bloated published_entities table where, for example, "Cardiologie" appeared as 8 separate rows (one per published version), each with its own UUID. Query enrichment logic that looked up entities by canonical_name would match all 8 rows, causing unpredictable behaviour depending on which row was returned first. Downstream, published_relationships mirrored this inflation: 6,821 of 12,997 entities had no relationships at all (orphans), meaning the RAG pipeline could not route queries about those conditions or treatments to any department.
Two Alembic migrations address this in sequence:
| Migration | Scope |
|---|---|
| 056 — Dedup published entities | Collapse duplicates, remap relationships, remove stale rows |
| 057 — SNOMED relationship gap fill | Re-attach orphaned CONDITION/TREATMENT/EXAMINATION entities using the SNOMED transitive closure, with a Dutch manual-seed fallback |
Migration 056: Entity Deduplication
Flow
Step-by-Step Details
Step 1 — Backup. Before any mutation, three tables are created:
- app._dedup_entity_backup: full copy of published_entities
- app._dedup_rel_backup: full copy of published_relationships
- app._dedup_mapping: maps old_entity_id → surviving_entity_id for every loser
These tables persist after migration so operators can audit changes. They are dropped during downgrade().
Step 2 — Group and score. All entities are loaded with their relationship count (occurrences as source or target in published_relationships). They are grouped in Python via build_dedup_groups() using the composite key (hospital_id, entity_type, canonical_name). Within each group pick_survivor() selects the best representative.
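The grouping step can be sketched as follows. This is a minimal illustration, not the actual build_dedup_groups() from dedup_published.py: the EntityRow shape and the choice to drop single-member groups are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class EntityRow(NamedTuple):
    """Hypothetical row shape; the real columns may differ."""
    id: str
    hospital_id: str
    entity_type: str
    canonical_name: str
    name: str
    snomed_concept_id: Optional[str]
    rel_count: int  # occurrences as source or target in published_relationships


def build_dedup_groups(entities):
    """Group rows by the composite key and keep only groups with duplicates."""
    groups = defaultdict(list)
    for entity in entities:
        key = (entity.hospital_id, entity.entity_type, entity.canonical_name)
        groups[key].append(entity)
    # Assumption: groups with a single member need no deduplication.
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


rows = [
    EntityRow("id-1", "h1", "DEPARTMENT", "cardiologie", "Cardiologie", None, 4),
    EntityRow("id-2", "h1", "DEPARTMENT", "cardiologie", "Cardiologie", None, 0),
    EntityRow("id-3", "h1", "DEPARTMENT", "neurologie", "Neurologie", None, 2),
]
groups = build_dedup_groups(rows)  # only the "cardiologie" group has duplicates
```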
Step 3 — Remap relationships. Any relationship that referenced a losing entity ID (either as source or target) is updated to point to the survivor. This preserves all navigational and medical relationships; only the UUID changes.
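As an illustration of the remap, the sketch below applies an old_entity_id → surviving_entity_id mapping to both ends of each relationship. The function name remap_relationships and the dict-based row shape are hypothetical; the migration itself presumably issues UPDATE statements rather than working in memory.

```python
def remap_relationships(relationships, mapping):
    """Point every edge at the survivor; IDs absent from the mapping are unchanged."""
    return [
        {
            **rel,
            "source_entity_id": mapping.get(rel["source_entity_id"], rel["source_entity_id"]),
            "target_entity_id": mapping.get(rel["target_entity_id"], rel["target_entity_id"]),
        }
        for rel in relationships
    ]


# Hypothetical sample data: "dept-old" lost to "dept-new".
mapping = {"dept-old": "dept-new"}
relationships = [
    {"id": "r1", "source_entity_id": "dept-old", "target_entity_id": "cond-1",
     "relationship_type": "TREATS"},
    {"id": "r2", "source_entity_id": "doc-1", "target_entity_id": "dept-old",
     "relationship_type": "WORKS_IN"},
]
remapped = remap_relationships(relationships, mapping)
```

Relationship IDs and types are untouched; only the endpoint that referenced a loser changes.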
Step 4 — Deduplicate relationships. After remapping, multiple relationships may now share identical (hospital_id, source_entity_id, target_entity_id, relationship_type) tuples. A self-join DELETE removes the duplicates, keeping the row with the lower UUID.
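The duplicate removal can be tried out against an in-memory SQLite database. This is a sketch under assumptions: the column list is reduced to the four key columns plus id, and a correlated EXISTS stands in for the self-join DELETE, whose exact form in the migration is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE published_relationships (
           id TEXT PRIMARY KEY,
           hospital_id TEXT,
           source_entity_id TEXT,
           target_entity_id TEXT,
           relationship_type TEXT)"""
)
conn.executemany(
    "INSERT INTO published_relationships VALUES (?, ?, ?, ?, ?)",
    [
        ("r1", "h1", "dept-1", "cond-1", "TREATS"),
        ("r2", "h1", "dept-1", "cond-1", "TREATS"),  # same tuple as r1 after remapping
        ("r3", "h1", "dept-1", "cond-2", "TREATS"),
    ],
)

# Delete a row whenever another row with the same key tuple and a lower id exists.
conn.execute(
    """DELETE FROM published_relationships
       WHERE EXISTS (
           SELECT 1 FROM published_relationships AS keep
           WHERE keep.hospital_id = published_relationships.hospital_id
             AND keep.source_entity_id = published_relationships.source_entity_id
             AND keep.target_entity_id = published_relationships.target_entity_id
             AND keep.relationship_type = published_relationships.relationship_type
             AND keep.id < published_relationships.id)"""
)
remaining = [row[0] for row in conn.execute(
    "SELECT id FROM published_relationships ORDER BY id")]
```

Because the row with the lowest id in each group is never deleted, every other member of the group always finds a match and is removed.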
Step 5 — Delete losers. All entities whose IDs appear in _dedup_mapping.old_entity_id are deleted. Only the survivors remain.
Step 6 — Remove orphaned DOCTORs. Doctors without any relationship (no WORKS_IN, no LOCATED_AT) are deleted — a doctor not linked to a department or campus has no value in the knowledge graph. CONDITION, TREATMENT, and EXAMINATION orphans are intentionally preserved here; Migration 057 will attempt to repair them.
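A small sketch of the orphan check for doctors, assuming entities and relationships are available as dicts. The helper name find_orphaned_doctor_ids is hypothetical and is not taken from the migration.

```python
def find_orphaned_doctor_ids(entities, relationships):
    """Return IDs of DOCTOR entities that appear in no relationship at all."""
    linked = set()
    for rel in relationships:
        linked.add(rel["source_entity_id"])
        linked.add(rel["target_entity_id"])
    return [
        entity["id"]
        for entity in entities
        if entity["entity_type"] == "DOCTOR" and entity["id"] not in linked
    ]


# Hypothetical sample data.
entities = [
    {"id": "doc-1", "entity_type": "DOCTOR"},
    {"id": "doc-2", "entity_type": "DOCTOR"},      # no relationships: deleted
    {"id": "cond-1", "entity_type": "CONDITION"},  # orphan, but left for Migration 057
    {"id": "dept-1", "entity_type": "DEPARTMENT"},
]
relationships = [
    {"source_entity_id": "doc-1", "target_entity_id": "dept-1",
     "relationship_type": "WORKS_IN"},
]
orphaned_doctors = find_orphaned_doctor_ids(entities, relationships)
```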
Survivor Selection Algorithm
pick_survivor() in dedup_published.py ranks candidates by a scoring tuple evaluated in descending order:
score(entity) = (
rel_count, # number of existing relationships — highest wins
has_snomed, # 1 if snomed_concept_id is set, 0 otherwise
-len(name), # shorter display name preferred (negated for descending sort)
)
The entity with the highest tuple value becomes the survivor. The intuition: a well-connected entity with a SNOMED code and a concise display name is the canonical representation. Losers' relationship edges are all remapped to the survivor before the losers are deleted.
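The scoring tuple maps naturally onto Python's tuple comparison. The sketch below assumes dict-shaped candidates and is not the actual pick_survivor() implementation; in particular, how ties on the full tuple are broken is not specified above, and here max() simply keeps the first candidate encountered.

```python
def score(entity):
    """Scoring tuple: compared element by element, left to right."""
    return (
        entity["rel_count"],                        # more relationships wins
        1 if entity["snomed_concept_id"] else 0,    # a SNOMED code wins
        -len(entity["name"]),                       # shorter display name wins
    )


def pick_survivor(group):
    """Return the candidate with the highest scoring tuple."""
    return max(group, key=score)


# Hypothetical candidates; "code-x" is a placeholder, not a real SNOMED ID.
candidates = [
    {"id": "a", "name": "Cardiologie (afdeling)", "snomed_concept_id": None, "rel_count": 3},
    {"id": "b", "name": "Cardiologie", "snomed_concept_id": "code-x", "rel_count": 3},
    {"id": "c", "name": "Cardio", "snomed_concept_id": "code-x", "rel_count": 1},
]
survivor = pick_survivor(candidates)
losers = [c["id"] for c in candidates if c["id"] != survivor["id"]]
```

Candidates "a" and "b" tie on rel_count, so the SNOMED flag decides in favour of "b". Candidate "c" has the shortest name but loses on the first element of the tuple.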
Rollback
downgrade() restores the pre-migration state exactly:
- Truncates published_entities and published_relationships
- Re-inserts all rows from the backup tables
- Drops _dedup_mapping, _dedup_entity_backup, and _dedup_rel_backup
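The backup and restore round trip can be illustrated with SQLite. This is a simplified sketch: only the entity table is shown, the column list is invented for the example, and DELETE replaces TRUNCATE because SQLite has no TRUNCATE statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE published_entities (id TEXT PRIMARY KEY, canonical_name TEXT)")
conn.executemany(
    "INSERT INTO published_entities VALUES (?, ?)",
    [("e1", "cardiologie"), ("e2", "cardiologie")],
)

# upgrade(): snapshot the table, then mutate it (here: delete one loser).
conn.execute("CREATE TABLE _dedup_entity_backup AS SELECT * FROM published_entities")
conn.execute("DELETE FROM published_entities WHERE id = 'e2'")

# downgrade(): empty the live table, re-insert from the backup, drop the backup.
conn.execute("DELETE FROM published_entities")
conn.execute("INSERT INTO published_entities SELECT * FROM _dedup_entity_backup")
conn.execute("DROP TABLE _dedup_entity_backup")

restored = [row[0] for row in conn.execute(
    "SELECT id FROM published_entities ORDER BY id")]
backup_left = conn.execute(
    "SELECT name FROM sqlite_master WHERE name = '_dedup_entity_backup'").fetchone()
```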
Migration 057: SNOMED Relationship Gap Fill
After deduplication, orphaned CONDITION, TREATMENT, and EXAMINATION entities still exist with no TREATS/PERFORMS relationship to any department. Migration 057 restores those connections using two strategies: walking the SNOMED transitive closure, then falling back to a curated Dutch manual seed dictionary.
Flow
SNOMED Transitive Closure Walk
The app.snomed_transitive_closure table stores precomputed ancestor/descendant pairs for all SNOMED CT concepts in the Dutch release. The gap filler exploits this to bridge the gap between a clinical concept and a hospital department:
- Direct map: For each published department entity that carries a snomed_concept_id, record {snomed_id → department_entity_id}.
- Expand via closure: Query snomed_transitive_closure for all descendants of those department SNOMED IDs. Each descendant SNOMED concept is now also mapped to that department.
- Orphan lookup: For each orphan with a snomed_concept_id, check if its concept ID (or any of its ancestors) appears in the expanded map. The first match at the shallowest depth wins.
The result is that a condition like "hartinfarct" (SNOMED 22298006), which is a descendant of the cardiology SNOMED branch, is automatically linked to the Cardiologie department entity — even if no explicit relationship was ever extracted.
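The three steps can be sketched in memory as follows. Several things here are assumptions made for illustration: closure rows are modelled as (ancestor_id, descendant_id, depth) tuples, the function names are hypothetical, and every concept ID except 22298006 is a made-up placeholder rather than a real SNOMED code.

```python
def expand_department_map(direct, closure):
    """Extend {snomed_id: department_entity_id} to all descendants in the closure."""
    expanded = dict(direct)
    # Visit shallow pairs first so the nearest department ancestor claims a concept.
    for ancestor, descendant, _depth in sorted(closure, key=lambda row: row[2]):
        if ancestor in direct:
            expanded.setdefault(descendant, direct[ancestor])
    return expanded


def find_department_via_closure(orphan_snomed_id, expanded, closure):
    """Check the orphan's own concept, then its ancestors from nearest to farthest."""
    candidates = [(0, orphan_snomed_id)]
    candidates += [(depth, anc) for anc, desc, depth in closure if desc == orphan_snomed_id]
    for _depth, concept in sorted(candidates):
        if concept in expanded:
            return expanded[concept]
    return None


# Toy closure: placeholder department concept -> placeholder mid-level concept -> 22298006.
direct = {"dept-cardio-concept": "dept-cardiologie"}
closure = [
    ("dept-cardio-concept", "heart-disease-concept", 1),
    ("dept-cardio-concept", "22298006", 2),
    ("heart-disease-concept", "22298006", 1),
]
expanded = expand_department_map(direct, closure)
match = find_department_via_closure("22298006", expanded, closure)
no_match = find_department_via_closure("unrelated-concept", expanded, closure)
```

In this toy data the orphan's own concept ID is already present in the expanded map, so the ancestor check never runs; it only matters when the expanded map is built from a partial closure.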
Manual Seed Fallback
For orphans with no snomed_concept_id, or whose SNOMED ancestors don't intersect any mapped department, the filler falls back to a curated dictionary keyed on the entity's lowercased canonical_name:
| Condition (canonical_name) | Department (canonical_name) |
|---|---|
| borstkanker | oncologie |
| beroerte | neurologie |
| hartfalen | cardiologie |
| diabetes | endocrinologie |
| nierfalen | nefrologie |
| longontsteking | pneumologie |
| darmkanker | gastro-enterologie |
| prostaatkanker | urologie |
| leukemie | hematologie |
| reuma | reumatologie |
| epilepsie | neurologie |
| astma | pneumologie |
| copd | pneumologie |
| nierstenen | urologie |
| schildklieraandoeningen | endocrinologie |
| maagkanker | gastro-enterologie |
| leveraandoeningen | gastro-enterologie |
| huidkanker | dermatologie |
| psoriasis | dermatologie |
| depressie | psychiatrie |
| angststoornis | psychiatrie |
| glaucoom | oogheelkunde |
| cataract | oogheelkunde |
| gehoorverlies | neus_keel_en_oorheelkunde |
| heupfractuur | orthopedie |
| rughernia | orthopedie |
| blaaskanker | urologie |
The seed lookup (find_department_via_manual_seed) is case-insensitive and exact on canonical_name. If a department with the returned canonical name exists in published_entities for that hospital, a new relationship is created.
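A minimal sketch of the seed lookup, using a three-row excerpt of the table above. The signature is an assumption (the real find_department_via_manual_seed may take different arguments), and departments_by_name is a hypothetical per-hospital index of department entity IDs keyed on canonical_name.

```python
# Excerpt of the seed table; the full dictionary has 27 entries.
MANUAL_SEED = {
    "borstkanker": "oncologie",
    "beroerte": "neurologie",
    "hartfalen": "cardiologie",
}


def find_department_via_manual_seed(canonical_name, departments_by_name):
    """Case-insensitive, exact match on canonical_name; None when nothing applies."""
    department_name = MANUAL_SEED.get(canonical_name.lower())
    if department_name is None:
        return None  # no seed entry for this term
    # The department must actually exist for this hospital.
    return departments_by_name.get(department_name)


# Hypothetical hospital that publishes cardiologie but not oncologie.
departments_by_name = {"cardiologie": "dept-1", "neurologie": "dept-2"}
hit = find_department_via_manual_seed("Hartfalen", departments_by_name)
missing_department = find_department_via_manual_seed("borstkanker", departments_by_name)
no_seed = find_department_via_manual_seed("hartfalen acuut", departments_by_name)
```

The last call returns None because the match is exact: a longer name that merely contains a seed term does not qualify.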
Relationship Type Assignment
determine_relationship_type() maps entity type to relationship type:
| Entity Type | Relationship Created |
|---|---|
| CONDITION | TREATS (department → condition) |
| TREATMENT | PERFORMS (department → treatment) |
| EXAMINATION | PERFORMS (department → examination) |
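The mapping in the table reduces to a dictionary lookup. In this sketch, build_relationship and its field names are hypothetical, and returning None for other entity types is an assumption, since the behaviour for unlisted types is not described here.

```python
import uuid

RELATIONSHIP_BY_ENTITY_TYPE = {
    "CONDITION": "TREATS",
    "TREATMENT": "PERFORMS",
    "EXAMINATION": "PERFORMS",
}


def determine_relationship_type(entity_type):
    """Return the relationship type for an orphan, or None for unlisted types."""
    return RELATIONSHIP_BY_ENTITY_TYPE.get(entity_type)


def build_relationship(hospital_id, department_id, orphan):
    """Build a department -> orphan edge as a plain dict (hypothetical shape)."""
    relationship_type = determine_relationship_type(orphan["entity_type"])
    if relationship_type is None:
        return None
    return {
        "id": str(uuid.uuid4()),
        "hospital_id": hospital_id,
        "source_entity_id": department_id,   # the department is always the source
        "target_entity_id": orphan["id"],
        "relationship_type": relationship_type,
    }


edge = build_relationship("h1", "dept-1", {"id": "cond-1", "entity_type": "CONDITION"})
```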
Tracking Table
All relationship IDs inserted by Migration 057 are recorded in app._gap_fill_relationships. This table is the sole basis for rollback — downgrade() deletes only those rows.
Rollback
DELETE FROM app.published_relationships
WHERE id IN (SELECT relationship_id FROM app._gap_fill_relationships);
DROP TABLE IF EXISTS app._gap_fill_relationships;
Unlike Migration 056, this rollback does not restore backups — it only removes the relationships that were added. The deduplicated entity state from Migration 056 is preserved.
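To see why the tracking table is enough for a targeted rollback, the statements above can be run against an in-memory SQLite database. The table definitions are reduced to the columns the rollback needs, and the app. schema prefix is dropped because SQLite has no such schema attached; both are simplifications for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE published_relationships (id TEXT PRIMARY KEY, relationship_type TEXT)")
conn.execute("CREATE TABLE _gap_fill_relationships (relationship_id TEXT PRIMARY KEY)")

# One relationship that predates Migration 057 ...
conn.execute("INSERT INTO published_relationships VALUES ('r-existing', 'WORKS_IN')")
# ... and one added by the gap filler, recorded in the tracking table.
conn.execute("INSERT INTO published_relationships VALUES ('r-gapfill', 'TREATS')")
conn.execute("INSERT INTO _gap_fill_relationships VALUES ('r-gapfill')")

# downgrade(): remove only the tracked rows, then drop the tracking table.
conn.execute(
    "DELETE FROM published_relationships "
    "WHERE id IN (SELECT relationship_id FROM _gap_fill_relationships)"
)
conn.execute("DROP TABLE IF EXISTS _gap_fill_relationships")

remaining = [row[0] for row in conn.execute("SELECT id FROM published_relationships")]
```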
Before / After Results
| Metric | Before migrations 056 + 057 | After migrations 056 + 057 |
|---|---|---|
| Total published entities | 12,997 | ~2,700 |
| Duplicate entities | ~10,297 (79%) | 0 |
| Orphaned CONDITION/TREATMENT/EXAMINATION | 6,821 | ~200 (residual) |
| Orphaned DOCTOR entities | included in orphan count | 0 (deleted) |
| CAMPUS entities | 25 (incl. 13 duplicates) | 12 |
| Relationships remapped | — | all preserved via _dedup_mapping |
| New SNOMED-derived relationships | 0 | varies by SNOMED coverage |
| New seed-derived relationships | 0 | up to 27 seed terms matched |
Residual orphans after Migration 057 are CONDITION/TREATMENT/EXAMINATION entities that carry no snomed_concept_id and whose canonical_name does not match any manual seed. These remain in the taxonomy as isolated nodes and are candidates for future manual linking or an expanded seed list.
Relationship to the Draft-Publish Lifecycle
Migrations 056 and 057 are one-time cleanup operations targeting the bloated state that accumulated before per-version isolation was enforced. In the current draft-publish pipeline (SP-5), each publish cycle creates a clean snapshot.
Going forward, the dedup logic in dedup_published.py is available as a library function that the publish endpoint can invoke incrementally — preventing the same accumulation from recurring. The SNOMED gap filler likewise can be run as a scheduled maintenance pass after each publish cycle if new orphans are detected.
CAMPUS entities are always excluded from deduplication: they are populated from the authoritative campuses table (ADR-0026) and have stable UUIDs across publish versions.
Related Pages
- Taxonomy Extraction Pipeline — How entities are extracted and published
- Medical Knowledge Architecture — Entity type definitions and relationship schema
- Entity Resolution — Dedup and SNOMED matching at extraction time