Taxonomy Deduplication & Gap Fill

Problem Statement

After scaling the document corpus to 573 PDF brochures and running multiple extraction rounds, the published taxonomy accumulated 12,997 entities — but 79% of those were duplicates. The root cause was cross-version extraction: each new pipeline run produced a new taxonomy version and published it, adding new copies of entities that already existed under an earlier version number.

The result was a bloated published_entities table where, for example, "Cardiologie" appeared as 8 separate rows (one per published version), each with its own UUID. Query enrichment logic that looked up entities by canonical_name would match all 8 rows, causing unpredictable behaviour depending on which row was returned first. Downstream, published_relationships mirrored this inflation: 6,821 of 12,997 entities had no relationships at all (orphans), meaning the RAG pipeline could not route queries about those conditions or treatments to any department.

Two Alembic migrations address this in sequence:

| Migration | Scope |
|---|---|
| 056: Dedup published entities | Collapse duplicates, remap relationships, remove stale rows |
| 057: SNOMED relationship gap fill | Re-attach orphaned CONDITION/TREATMENT/EXAMINATION entities using the SNOMED transitive closure, with a Dutch manual-seed fallback |

Migration 056: Entity Deduplication

Step-by-Step Details

Step 1 — Backup. Before any mutation, three tables are created:

  • app._dedup_entity_backup — full copy of published_entities
  • app._dedup_rel_backup — full copy of published_relationships
  • app._dedup_mapping — maps old_entity_id → surviving_entity_id for every loser

These tables persist after migration so operators can audit changes. They are dropped during downgrade().

Step 2 — Group and score. All entities are loaded with their relationship count (occurrences as source or target in published_relationships). They are grouped in Python via build_dedup_groups() using the composite key (hospital_id, entity_type, canonical_name). Within each group pick_survivor() selects the best representative.
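The grouping step can be sketched as follows. This is a minimal illustration, not the code in dedup_published.py: the dict-based row shape and the choice to return only groups with more than one member are assumptions made for the example.

```python
from collections import defaultdict


def build_dedup_groups(entities):
    """Group entity rows by (hospital_id, entity_type, canonical_name).

    `entities` is an iterable of dicts (hypothetical row shape).
    Assumption: singleton groups are dropped, since they contain
    nothing to deduplicate.
    """
    groups = defaultdict(list)
    for entity in entities:
        key = (
            entity["hospital_id"],
            entity["entity_type"],
            entity["canonical_name"],
        )
        groups[key].append(entity)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


# Toy data: two copies of the same department in hospital h1.
rows = [
    {"id": "u1", "hospital_id": "h1", "entity_type": "DEPARTMENT", "canonical_name": "cardiologie"},
    {"id": "u2", "hospital_id": "h1", "entity_type": "DEPARTMENT", "canonical_name": "cardiologie"},
    {"id": "u3", "hospital_id": "h1", "entity_type": "DEPARTMENT", "canonical_name": "neurologie"},
    {"id": "u4", "hospital_id": "h2", "entity_type": "DEPARTMENT", "canonical_name": "cardiologie"},
]
groups = build_dedup_groups(rows)
```

Because hospital_id is part of the key, the identically named department in hospital h2 stays in its own group and is never merged across hospitals.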

Step 3 — Remap relationships. Any relationship that referenced a losing entity ID (either as source or target) is updated to point to the survivor. This preserves all navigational and medical relationships; only the UUID changes.
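As an in-memory illustration of the remapping (the migration itself presumably does this with SQL UPDATEs against published_relationships), the sketch below rewrites both endpoints through an old-to-surviving ID mapping. The function name and the dict row shape are hypothetical.

```python
def remap_relationships(relationships, mapping):
    """Return copies of the relationships with loser IDs replaced.

    `mapping` plays the role of _dedup_mapping: old_entity_id -> surviving_entity_id.
    IDs that are not in the mapping (survivors, untouched entities) pass through.
    """
    remapped = []
    for rel in relationships:
        remapped.append({
            **rel,
            "source_entity_id": mapping.get(rel["source_entity_id"], rel["source_entity_id"]),
            "target_entity_id": mapping.get(rel["target_entity_id"], rel["target_entity_id"]),
        })
    return remapped


mapping = {"old-1": "keep-1"}
rels = [
    {"id": "r1", "source_entity_id": "old-1", "target_entity_id": "c-1", "relationship_type": "TREATS"},
    {"id": "r2", "source_entity_id": "d-2", "target_entity_id": "old-1", "relationship_type": "WORKS_IN"},
]
remapped = remap_relationships(rels, mapping)
```

Only the endpoint IDs change; the relationship's own ID and type are carried over untouched.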

Step 4 — Deduplicate relationships. After remapping, multiple relationships may now share identical (hospital_id, source_entity_id, target_entity_id, relationship_type) tuples. A self-join DELETE removes the duplicates, keeping the row with the lower UUID.
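The self-join DELETE can be tried out against an in-memory SQLite database, as sketched below. SQLite is only a stand-in here: the migration's actual SQL and target database are not shown in this document, and the table is reduced to the columns named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE published_relationships (
        id TEXT PRIMARY KEY,
        hospital_id TEXT,
        source_entity_id TEXT,
        target_entity_id TEXT,
        relationship_type TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO published_relationships VALUES (?, ?, ?, ?, ?)",
    [
        ("r1", "h1", "dept-1", "cond-1", "TREATS"),
        ("r2", "h1", "dept-1", "cond-1", "TREATS"),  # duplicate of r1 after remapping
        ("r3", "h1", "dept-1", "cond-2", "TREATS"),
    ],
)

# Delete every row for which an identical tuple with a lower id exists,
# so the lowest id in each duplicate set is the one that remains.
conn.execute(
    """
    DELETE FROM published_relationships
    WHERE id IN (
        SELECT a.id
        FROM published_relationships AS a
        JOIN published_relationships AS b
          ON a.hospital_id = b.hospital_id
         AND a.source_entity_id = b.source_entity_id
         AND a.target_entity_id = b.target_entity_id
         AND a.relationship_type = b.relationship_type
         AND a.id > b.id
    )
    """
)
remaining = [row[0] for row in conn.execute(
    "SELECT id FROM published_relationships ORDER BY id"
)]
```

With the toy rows above, r2 is removed and r1 and r3 remain.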

Step 5 — Delete losers. All entities whose IDs appear in _dedup_mapping.old_entity_id are deleted. Only the survivors remain.

Step 6 — Remove orphaned DOCTORs. Doctors without any relationship (no WORKS_IN, no LOCATED_AT) are deleted — a doctor not linked to a department or campus has no value in the knowledge graph. CONDITION, TREATMENT, and EXAMINATION orphans are intentionally preserved here; Migration 057 will attempt to repair them.
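A minimal sketch of the orphan check, assuming dict-shaped rows and a hypothetical helper name: an entity counts as linked if its ID appears as either endpoint of any relationship, and only DOCTOR entities are selected for deletion.

```python
def find_orphaned_doctors(entities, relationships):
    """Return IDs of DOCTOR entities that appear in no relationship."""
    linked = set()
    for rel in relationships:
        linked.add(rel["source_entity_id"])
        linked.add(rel["target_entity_id"])
    return [
        e["id"]
        for e in entities
        if e["entity_type"] == "DOCTOR" and e["id"] not in linked
    ]


entities = [
    {"id": "doc-1", "entity_type": "DOCTOR"},
    {"id": "doc-2", "entity_type": "DOCTOR"},
    {"id": "cond-1", "entity_type": "CONDITION"},  # orphan, but preserved for Migration 057
]
relationships = [
    {"source_entity_id": "doc-1", "target_entity_id": "dept-1", "relationship_type": "WORKS_IN"},
]
orphans = find_orphaned_doctors(entities, relationships)
```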

Survivor Selection Algorithm

pick_survivor() in dedup_published.py ranks candidates by a scoring tuple whose elements are compared left to right, in descending priority:

score(entity) = (
    rel_count,   # number of existing relationships — highest wins
    has_snomed,  # 1 if snomed_concept_id is set, 0 otherwise
    -len(name),  # shorter display name preferred (negated for descending sort)
)

The entity with the highest tuple value becomes the survivor. The intuition: a well-connected entity with a SNOMED code and a concise display name is the canonical representation. Losers' relationship edges are all remapped to the survivor before the losers are deleted.
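The scoring rule translates almost directly into Python, because tuples compare element by element. The sketch below is an illustration under assumptions: rows are plain dicts, the SNOMED value is a placeholder, and ties on the full tuple fall to whichever candidate comes first in the group (the source does not say how the real pick_survivor() breaks ties).

```python
def score(entity):
    """Scoring tuple: relationship count, SNOMED presence, then name brevity."""
    return (
        entity["rel_count"],
        1 if entity.get("snomed_concept_id") else 0,
        -len(entity["name"]),  # negated so that shorter names score higher
    )


def pick_survivor(group):
    """Return (survivor, mapping) where mapping is loser id -> survivor id."""
    survivor = max(group, key=score)  # max() keeps the first of equal candidates
    mapping = {e["id"]: survivor["id"] for e in group if e["id"] != survivor["id"]}
    return survivor, mapping


group = [
    {"id": "a", "name": "Cardiologie", "rel_count": 4, "snomed_concept_id": None},
    {"id": "b", "name": "Cardiologie (afdeling)", "rel_count": 4, "snomed_concept_id": "placeholder-id"},
    {"id": "c", "name": "Cardiologie", "rel_count": 1, "snomed_concept_id": "placeholder-id"},
]
survivor, mapping = pick_survivor(group)
```

Here entity b wins: it ties with a on relationship count, and the SNOMED flag is compared before name length, so its longer display name does not count against it.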

Rollback

downgrade() restores the pre-migration state exactly:

  1. Truncates published_entities and published_relationships
  2. Re-inserts all rows from the backup tables
  3. Drops _dedup_mapping, _dedup_entity_backup, and _dedup_rel_backup

Migration 057: SNOMED Relationship Gap Fill

After deduplication, orphaned CONDITION, TREATMENT, and EXAMINATION entities still exist with no TREATS/PERFORMS relationship to any department. Migration 057 restores those connections using two strategies: walking the SNOMED transitive closure, then falling back to a curated Dutch manual seed dictionary.

SNOMED Transitive Closure Walk

The app.snomed_transitive_closure table stores precomputed ancestor/descendant pairs for all SNOMED CT concepts in the Dutch release. The gap filler exploits this to bridge the gap between a clinical concept and a hospital department:

  1. Direct map: For each published department entity that carries a snomed_concept_id, record {snomed_id → department_entity_id}.
  2. Expand via closure: Query snomed_transitive_closure for all descendants of those department SNOMED IDs. Each descendant SNOMED concept is now also mapped to that department.
  3. Orphan lookup: For each orphan with a snomed_concept_id, check if its concept ID (or any of its ancestors) appears in the expanded map. The first match at the shallowest depth wins.

The result is that a condition like "hartinfarct" (SNOMED 22298006), which is a descendant of the cardiology SNOMED branch, is automatically linked to the Cardiologie department entity — even if no explicit relationship was ever extracted.
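The three steps can be sketched in memory as follows. Several things here are assumptions made for illustration: the closure is modelled as (ancestor_id, descendant_id, depth) tuples (the real column names of app.snomed_transitive_closure are not given), the function names are hypothetical, and the concept IDs other than 22298006 are placeholders in a toy hierarchy.

```python
def build_expanded_map(department_by_snomed, closure):
    """Steps 1 and 2: map each department concept and its descendants to a department.

    When a concept descends from several mapped departments, the one at the
    shallowest depth is kept.
    """
    best = {sid: (0, dept) for sid, dept in department_by_snomed.items()}
    for ancestor, descendant, depth in closure:
        dept = department_by_snomed.get(ancestor)
        if dept is None:
            continue
        if descendant not in best or depth < best[descendant][0]:
            best[descendant] = (depth, dept)
    return {sid: dept for sid, (_, dept) in best.items()}


def find_department_via_snomed(orphan_snomed_id, expanded, closure):
    """Step 3: try the orphan's own concept, then its ancestors, nearest first."""
    if orphan_snomed_id in expanded:
        return expanded[orphan_snomed_id]
    ancestors = sorted(
        (depth, anc) for anc, desc, depth in closure if desc == orphan_snomed_id
    )
    for _, anc in ancestors:
        if anc in expanded:
            return expanded[anc]
    return None


# Toy closure: a placeholder cardiology concept with two levels below it.
closure = [
    ("C-CARDIO", "C-HEART-DISEASE", 1),
    ("C-CARDIO", "22298006", 2),
    ("C-HEART-DISEASE", "22298006", 1),
]
expanded = build_expanded_map({"C-CARDIO": "dept-cardiologie"}, closure)
match = find_department_via_snomed("22298006", expanded, closure)
miss = find_department_via_snomed("C-UNRELATED", expanded, closure)
```

In this toy data the hartinfarct concept resolves to the Cardiologie department entity, while a concept outside every mapped branch yields no match and would go on to the fallback.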

Manual Seed Fallback

For orphans with no snomed_concept_id, or whose SNOMED ancestors don't intersect any mapped department, the filler falls back to a curated dictionary keyed on the entity's lowercased canonical_name:

| Condition (canonical_name) | Department (canonical_name) |
|---|---|
| borstkanker | oncologie |
| beroerte | neurologie |
| hartfalen | cardiologie |
| diabetes | endocrinologie |
| nierfalen | nefrologie |
| longontsteking | pneumologie |
| darmkanker | gastro-enterologie |
| prostaatkanker | urologie |
| leukemie | hematologie |
| reuma | reumatologie |
| epilepsie | neurologie |
| astma | pneumologie |
| copd | pneumologie |
| nierstenen | urologie |
| schildklieraandoeningen | endocrinologie |
| maagkanker | gastro-enterologie |
| leveraandoeningen | gastro-enterologie |
| huidkanker | dermatologie |
| psoriasis | dermatologie |
| depressie | psychiatrie |
| angststoornis | psychiatrie |
| glaucoom | oogheelkunde |
| cataract | oogheelkunde |
| gehoorverlies | neus_keel_en_oorheelkunde |
| heupfractuur | orthopedie |
| rughernia | orthopedie |
| blaaskanker | urologie |

The seed lookup (find_department_via_manual_seed) is case-insensitive and exact on canonical_name. If a department with the returned canonical name exists in published_entities for that hospital, a new relationship is created.
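A minimal sketch of the lookup, using a three-entry excerpt of the seed table. The function name comes from the text above, but its signature and the shape of the department index (department canonical_name to entity ID, for one hospital) are assumptions.

```python
# Excerpt of the manual seed; the full dictionary has 27 entries.
MANUAL_SEED = {
    "borstkanker": "oncologie",
    "hartfalen": "cardiologie",
    "copd": "pneumologie",
}


def find_department_via_manual_seed(canonical_name, hospital_departments, seed=MANUAL_SEED):
    """Return the department entity ID for an orphan, or None.

    The match is exact after lowercasing. `hospital_departments` maps
    department canonical_name -> entity ID for a single hospital.
    """
    department_name = seed.get(canonical_name.lower())
    if department_name is None:
        return None  # term is not in the seed
    return hospital_departments.get(department_name)  # None if the hospital lacks it


departments = {"pneumologie": "dept-9"}
hit = find_department_via_manual_seed("COPD", departments)
no_department = find_department_via_manual_seed("hartfalen", departments)
not_seeded = find_department_via_manual_seed("migraine", departments)
```

The two None cases are distinct: "hartfalen" is seeded but this hospital has no cardiologie entity, while "migraine" is not in the seed at all. Both leave the entity as a residual orphan.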

Relationship Type Assignment

determine_relationship_type() maps entity type to relationship type:

| Entity Type | Relationship Created |
|---|---|
| CONDITION | TREATS (department → condition) |
| TREATMENT | PERFORMS (department → treatment) |
| EXAMINATION | PERFORMS (department → examination) |
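The mapping is small enough to express as a dictionary lookup. In the sketch below, raising ValueError for other entity types and the build_relationship helper are assumptions; the source only specifies the three mappings and the edge direction.

```python
RELATIONSHIP_BY_ENTITY_TYPE = {
    "CONDITION": "TREATS",
    "TREATMENT": "PERFORMS",
    "EXAMINATION": "PERFORMS",
}


def determine_relationship_type(entity_type):
    """Map an orphan's entity type to the relationship type to create."""
    try:
        return RELATIONSHIP_BY_ENTITY_TYPE[entity_type]
    except KeyError:
        # Assumption: types outside the gap filler's scope are rejected.
        raise ValueError(f"no gap-fill relationship for entity type {entity_type!r}")


def build_relationship(department_id, orphan):
    """Hypothetical helper: the department is always the source of the edge."""
    return {
        "source_entity_id": department_id,
        "target_entity_id": orphan["id"],
        "relationship_type": determine_relationship_type(orphan["entity_type"]),
    }


edge = build_relationship("dept-cardiologie", {"id": "cond-1", "entity_type": "CONDITION"})
```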

Tracking Table

All relationship IDs inserted by Migration 057 are recorded in app._gap_fill_relationships. This table is the sole basis for rollback — downgrade() deletes only those rows.

Rollback

DELETE FROM app.published_relationships
WHERE id IN (SELECT relationship_id FROM app._gap_fill_relationships);

DROP TABLE IF EXISTS app._gap_fill_relationships;

Unlike Migration 056, this rollback does not restore backups — it only removes the relationships that were added. The deduplicated entity state from Migration 056 is preserved.
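The effect of the rollback statements can be checked against a throwaway SQLite database, sketched below. SQLite and the reduced two-column schema are stand-ins for illustration; attaching an in-memory database under the name app lets the statements run with their app. prefix unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS app")  # emulate the app schema
conn.executescript(
    """
    CREATE TABLE app.published_relationships (id TEXT PRIMARY KEY, relationship_type TEXT);
    CREATE TABLE app._gap_fill_relationships (relationship_id TEXT PRIMARY KEY);
    INSERT INTO app.published_relationships VALUES
        ('r-old', 'WORKS_IN'),    -- existed before Migration 057
        ('r-new-1', 'TREATS'),    -- added by the gap fill
        ('r-new-2', 'PERFORMS');  -- added by the gap fill
    INSERT INTO app._gap_fill_relationships VALUES ('r-new-1'), ('r-new-2');
    """
)

# The rollback statements from the section above.
conn.executescript(
    """
    DELETE FROM app.published_relationships
    WHERE id IN (SELECT relationship_id FROM app._gap_fill_relationships);

    DROP TABLE IF EXISTS app._gap_fill_relationships;
    """
)

remaining = [row[0] for row in conn.execute("SELECT id FROM app.published_relationships")]
tracking_tables = conn.execute(
    "SELECT COUNT(*) FROM app.sqlite_master WHERE name = '_gap_fill_relationships'"
).fetchone()[0]
```

Only the pre-existing relationship survives, and the tracking table is gone afterwards.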


Before / After Results

| Metric | Before (056+057) | After (056+057) |
|---|---|---|
| Total published entities | 12,997 | ~2,700 |
| Duplicate entities | ~10,297 (79%) | 0 |
| Orphaned CONDITION/TREATMENT/EXAMINATION | 6,821 | ~200 (residual) |
| Orphaned DOCTOR entities | included in orphan count | 0 (deleted) |
| CAMPUS entities | 25 (incl. 13 duplicates) | 12 |
| Relationships remapped | | all preserved via _dedup_mapping |
| New SNOMED-derived relationships | 0 | varies by SNOMED coverage |
| New seed-derived relationships | 0 | up to 27 seed terms matched |

Residual orphans after Migration 057 are CONDITION/TREATMENT/EXAMINATION entities that carry no snomed_concept_id and whose canonical_name does not match any manual seed. These remain in the taxonomy as isolated nodes and are candidates for future manual linking or an expanded seed list.


Relationship to the Draft-Publish Lifecycle

Migrations 056 and 057 are one-time cleanup operations targeting the bloated state that accumulated before per-version isolation was enforced. In the current draft-publish pipeline (SP-5), each publish cycle creates a clean snapshot.

Going forward, the dedup logic in dedup_published.py is available as a library function that the publish endpoint can invoke incrementally — preventing the same accumulation from recurring. The SNOMED gap filler likewise can be run as a scheduled maintenance pass after each publish cycle if new orphans are detected.

CAMPUS entities are always excluded from deduplication: they are populated from the authoritative campuses table (ADR-0026) and have stable UUIDs across publish versions.