Draft/Publish System
The Draft/Publish system (SP-5) provides the separation layer between AI-extracted draft data (taxonomy entities awaiting operator review) and validated live data consumed by the RAG pipeline. It is the final gate before taxonomy changes reach search users.
SP-5 receives approved entities from the Entity Resolution Pipeline (SP-4) and exposes versioned snapshots to the Frozen Taxonomy Registry. The Pipeline Wizard (SP-6) provides the operator UI for triggering publish, previewing deltas, and managing version history.
The Problem
Without a draft/publish boundary, any change to taxonomy data would immediately affect the live RAG pipeline:
- An operator approving a batch of entities mid-review could expose an incomplete taxonomy to search users
- Rolling back a bad extraction run would require surgical DELETE statements against production tables
- Regulatory auditors (EU AI Act, Art. 12) cannot reconstruct which entity inventory was active at any point in time
The draft/publish system solves all three: operators work entirely in draft space, publish is an atomic snapshot operation, rollback is a version range DELETE, and every version is permanently recorded.
Architecture
Published Tables
Three new tables (migration 053) hold the published snapshot:
published_entities
A version-per-row store: each publish operation inserts new rows tagged with the version number. Rows from previous versions are not deleted — they remain for rollback and audit.
| Column | Type | Notes |
|---|---|---|
| id | UUID PK | |
| version | INTEGER | Publish version number |
| hospital_id | UUID FK | |
| entity_type | VARCHAR(30) | doctor, department, condition, etc. |
| canonical_name | VARCHAR(300) | |
| aliases | JSONB | |
| metadata | JSONB | Full entity metadata |
| snomed_concept_id | VARCHAR(20) | Nullable |
| source_entity_id | UUID FK | Reference back to taxonomy_entities |
| published_at | TIMESTAMPTZ | |
| published_by | UUID FK → users | |
published_relationships
Same version-per-row strategy as published_entities.
| Column | Type | Notes |
|---|---|---|
| id | UUID PK | |
| version | INTEGER | Publish version number |
| hospital_id | UUID FK | |
| source_name / source_type | VARCHAR | |
| target_name / target_type | VARCHAR | |
| relationship_type | VARCHAR(50) | HANDLES, OFFERS, PERFORMS, WORKS_IN_SCHEDULE |
| metadata | JSONB | |
| source_relationship_id | UUID FK | Reference back to taxonomy_relationships |
| published_at | TIMESTAMPTZ | |
taxonomy_versions
One row per publish operation — the version registry.
| Column | Type | Notes |
|---|---|---|
| id | UUID PK | |
| hospital_id | UUID FK | |
| version | INTEGER | Monotonically increasing per hospital |
| draft_version | INTEGER | Which draft version was published |
| entity_count | INTEGER | Entities in this snapshot |
| relationship_count | INTEGER | Relationships in this snapshot |
| published_at | TIMESTAMPTZ | |
| published_by | UUID FK → users | |
| notes | TEXT | Operator notes |
The alternative — a JSONB snapshot per version — would require deserializing large JSON blobs to reconstruct the entity inventory. Version-per-row allows the registry to load from published tables with a simple WHERE version = :v filter at full SQL speed.
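The read path this enables can be illustrated with a minimal sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL and a table reduced to three columns; `make_demo_db` and `load_snapshot` are hypothetical helpers for illustration, not part of the codebase.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a reduced published_entities table with two versions (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE published_entities ("
        "version INTEGER, hospital_id TEXT, canonical_name TEXT)"
    )
    rows = [
        (1, "h1", "Cardiology"),
        (2, "h1", "Cardiology"),  # unchanged entities are re-inserted per version
        (2, "h1", "Neurology"),
    ]
    conn.executemany("INSERT INTO published_entities VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def load_snapshot(conn: sqlite3.Connection, hospital_id: str, version: int) -> list[str]:
    """Return one snapshot's entity names using only a version filter."""
    cur = conn.execute(
        "SELECT canonical_name FROM published_entities "
        "WHERE hospital_id = ? AND version = ? ORDER BY canonical_name",
        (hospital_id, version),
    )
    return [name for (name,) in cur.fetchall()]


conn = make_demo_db()
print(load_snapshot(conn, "h1", 2))  # ['Cardiology', 'Neurology']
```

Any historical version can be read the same way by changing the `version` argument, with no JSON deserialization step.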
Publish Flow
PublishService.publish() executes the following steps inside a single database transaction:
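Advisory lock: a transaction-level advisory lock (pg_advisory_xact_lock) ensures that only one publish runs at a time per hospital. If a second operator triggers publish concurrently, the second call waits for the lock rather than producing a corrupt snapshot. PostgreSQL releases the lock automatically when the transaction commits or rolls back.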
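Because pg_advisory_xact_lock takes an integer key (a single bigint in its one-argument form), a per-hospital lock needs a stable mapping from the hospital UUID to a signed 64-bit integer. How PublishService actually derives its key is not documented here; the sketch below shows one possible approach, and `advisory_lock_key`, the `namespace` string, and `LOCK_SQL` are all assumptions.

```python
import hashlib
import struct
from uuid import UUID


def advisory_lock_key(hospital_id: UUID, namespace: str = "taxonomy_publish") -> int:
    """Derive a stable signed 64-bit key from a hospital UUID (hypothetical scheme)."""
    digest = hashlib.sha256(namespace.encode() + hospital_id.bytes).digest()
    # ">q" unpacks the first 8 bytes as a big-endian signed 64-bit integer,
    # which fits PostgreSQL's bigint range.
    (key,) = struct.unpack(">q", digest[:8])
    return key


# Inside the publish transaction the lock could then be taken with a
# statement of this shape (the bound parameter name is an assumption):
LOCK_SQL = "SELECT pg_advisory_xact_lock(:key)"

hospital = UUID("12345678-1234-5678-1234-567812345678")
key = advisory_lock_key(hospital)
print(key)
```

The namespace prefix keeps publish locks from colliding with any other advisory locks that might be keyed on the same hospital ID.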
ID remapping: Relationships in taxonomy_relationships reference taxonomy_entities by UUID. During publish, the pipeline re-maps these to the newly inserted published_entities UUIDs so that published_relationships references only published entity rows.
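The remapping step can be sketched in plain Python. This is an illustration under stated assumptions, not the PublishService code: `DraftRelationship` and `remap_relationships` are hypothetical names, and skipping relationships whose endpoints are not part of the publish is an assumed policy.

```python
from dataclasses import dataclass
from uuid import UUID, uuid4


@dataclass(frozen=True)
class DraftRelationship:
    source_id: UUID  # references an entity ID
    target_id: UUID  # references an entity ID
    relationship_type: str


def remap_relationships(
    draft_entity_ids: list[UUID],
    relationships: list[DraftRelationship],
) -> tuple[dict[UUID, UUID], list[DraftRelationship]]:
    """Assign fresh published IDs and rewrite relationship endpoints to them."""
    # One new UUID per published entity row.
    id_map = {draft_id: uuid4() for draft_id in draft_entity_ids}
    remapped = [
        DraftRelationship(id_map[r.source_id], id_map[r.target_id], r.relationship_type)
        for r in relationships
        # Assumed policy: drop relationships that point at unpublished entities.
        if r.source_id in id_map and r.target_id in id_map
    ]
    return id_map, remapped


doctor, condition, unapproved = uuid4(), uuid4(), uuid4()
rels = [
    DraftRelationship(doctor, condition, "HANDLES"),
    DraftRelationship(doctor, unapproved, "PERFORMS"),
]
id_map, published = remap_relationships([doctor, condition], rels)
print(len(published))  # 1
```

After remapping, no published relationship carries a draft-space UUID, so the snapshot stays self-contained even if draft rows change later.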
Impact Preview
Before committing a publish, operators can call the preview endpoint to see the delta against the current live version:
```json
{
  "current_version": 6,
  "next_version": 7,
  "added": {
    "entities": 14,
    "relationships": 31
  },
  "modified": {
    "entities": 3,
    "relationships": 0
  },
  "removed": {
    "entities": 2,
    "relationships": 7
  },
  "entity_breakdown": {
    "doctor": {"added": 5, "modified": 1, "removed": 0},
    "condition": {"added": 9, "modified": 2, "removed": 2},
    "department": {"added": 0, "modified": 0, "removed": 0}
  }
}
```
The delta is computed by comparing approved entities in the current draft against the rows in published_entities WHERE version = current_version. No data is written during preview.
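The comparison itself reduces to set operations. The sketch below assumes both sides are keyed by the draft entity ID (the value stored in source_entity_id) and that an entity counts as modified when its payload differs; `compute_delta` is a hypothetical helper and the real service may match entities differently.

```python
def compute_delta(draft: dict[str, dict], live: dict[str, dict]) -> dict[str, int]:
    """Count added, modified, and removed entities between two snapshots."""
    draft_keys, live_keys = set(draft), set(live)
    added = draft_keys - live_keys
    removed = live_keys - draft_keys
    # Present on both sides but with a different payload.
    modified = {k for k in draft_keys & live_keys if draft[k] != live[k]}
    return {"added": len(added), "modified": len(modified), "removed": len(removed)}


live = {
    "e1": {"canonical_name": "Cardiology"},
    "e2": {"canonical_name": "Dr. A"},
    "e4": {"canonical_name": "Old Clinic"},
}
draft = {
    "e1": {"canonical_name": "Cardiology"},    # unchanged
    "e2": {"canonical_name": "Dr. A. Smith"},  # modified
    "e3": {"canonical_name": "Neurology"},     # added
}
delta = compute_delta(draft, live)
print(delta)  # {'added': 1, 'modified': 1, 'removed': 1}
```

The function only reads its inputs, which mirrors the guarantee that preview writes no data.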
Rollback
Rolling back to version N deletes all published rows with version > N and removes the corresponding taxonomy_versions records:
```sql
-- Rollback to version 5
DELETE FROM app.published_entities
WHERE hospital_id = :hospital_id AND version > 5;

DELETE FROM app.published_relationships
WHERE hospital_id = :hospital_id AND version > 5;

DELETE FROM app.taxonomy_versions
WHERE hospital_id = :hospital_id AND version > 5;
```
After deletion, force_registry_rebuild() is called to reload the registry from the now-current version.
Rolled-back versions cannot be recovered without re-running the publish pipeline. The source data in taxonomy_entities (draft space) is never touched by rollback — operators can re-approve and re-publish as needed.
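The version-range delete can be exercised end to end with a small sketch. It again uses in-memory SQLite in place of PostgreSQL, with tables reduced to two columns. `rollback_to` is a hypothetical helper, and the check that the target version exists is an added assumption, not documented behavior.

```python
import sqlite3

TABLES = ("published_entities", "published_relationships", "taxonomy_versions")


def make_demo_db() -> sqlite3.Connection:
    """Build reduced tables holding versions 4-6 for h1 and version 6 for h2."""
    conn = sqlite3.connect(":memory:")
    for table in TABLES:
        conn.execute(f"CREATE TABLE {table} (hospital_id TEXT, version INTEGER)")
        conn.executemany(
            f"INSERT INTO {table} VALUES (?, ?)",
            [("h1", 4), ("h1", 5), ("h1", 6), ("h2", 6)],
        )
    conn.commit()
    return conn


def rollback_to(conn: sqlite3.Connection, hospital_id: str, target: int) -> None:
    """Delete all rows newer than `target` for one hospital in one transaction."""
    exists = conn.execute(
        "SELECT 1 FROM taxonomy_versions WHERE hospital_id = ? AND version = ?",
        (hospital_id, target),
    ).fetchone()
    if exists is None:
        raise ValueError(f"version {target} was never published")
    with conn:  # commits on success, rolls back if any DELETE raises
        for table in TABLES:
            conn.execute(
                f"DELETE FROM {table} WHERE hospital_id = ? AND version > ?",
                (hospital_id, target),
            )


def versions(conn: sqlite3.Connection, table: str, hospital_id: str) -> list[int]:
    """List the distinct versions still present for a hospital."""
    cur = conn.execute(
        f"SELECT DISTINCT version FROM {table} WHERE hospital_id = ? ORDER BY version",
        (hospital_id,),
    )
    return [v for (v,) in cur]


conn = make_demo_db()
rollback_to(conn, "h1", 5)
print(versions(conn, "taxonomy_versions", "h1"))  # [4, 5]
```

Running all three deletes in one transaction keeps the version registry and the published rows consistent if any statement fails, and the hospital filter leaves other hospitals' versions untouched.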
Registry Integration
FrozenTaxonomyRegistry gains a from_published() classmethod (SP-5) that loads directly from published_entities and published_relationships instead of the legacy ScrapeResult path:
```python
@classmethod
async def from_published(
    cls,
    session: AsyncSession,
    hospital_id: UUID,
    version: int | None = None,  # None = latest
) -> "FrozenTaxonomyRegistry":
    """Load registry from published tables."""
    if version is None:
        version = await _get_latest_version(session, hospital_id)
    entities = await _load_published_entities(session, hospital_id, version)
    relationships = await _load_published_relationships(session, hospital_id, version)
    return cls._build_registry_from_rows(entities, relationships)
```
_build_registry_from_rows() converts flat DB rows into the FrozenDoctor, FrozenDepartment, and other frozen dataclasses that the registry's O(1) lookup indexes are built from.
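The row-to-dataclass conversion might look roughly like the following. The actual fields of FrozenDoctor and FrozenDepartment are not shown in this document, so the sketch uses a single hypothetical `FrozenEntity` class and a hypothetical `build_name_index` helper to show the general idea of a frozen object plus a dictionary index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenEntity:
    """Hypothetical stand-in for FrozenDoctor, FrozenDepartment, etc."""
    entity_type: str
    canonical_name: str
    aliases: tuple[str, ...]


def build_name_index(rows: list[dict]) -> dict[str, FrozenEntity]:
    """Map every lower-cased canonical name and alias to its frozen entity."""
    index: dict[str, FrozenEntity] = {}
    for row in rows:
        entity = FrozenEntity(
            entity_type=row["entity_type"],
            canonical_name=row["canonical_name"],
            aliases=tuple(row.get("aliases") or []),  # tolerate a NULL JSONB column
        )
        for name in (entity.canonical_name, *entity.aliases):
            index[name.lower()] = entity
    return index


rows = [
    {"entity_type": "department", "canonical_name": "Cardiology",
     "aliases": ["Cardio", "Heart Center"]},
    {"entity_type": "doctor", "canonical_name": "Dr. Jane Roe", "aliases": None},
]
index = build_name_index(rows)
print(index["cardio"].canonical_name)  # Cardiology
```

A dictionary lookup gives average-case constant-time access, and frozen dataclasses prevent accidental mutation of a loaded snapshot.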
Version-Check Cache Invalidation
The registry singleton checks for a new published version every 60 seconds:
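```python
import asyncio
from uuid import UUID

async def _version_check_loop(hospital_id: UUID) -> None:
    """Background task: poll taxonomy_versions every 60s."""
    while True:
        await asyncio.sleep(60)
        # `session` and `_registry_version` are not defined in this excerpt;
        # they are assumed to be provided by the enclosing registry module.
        latest = await _get_latest_version(session, hospital_id)
        # Only a newer version triggers a reload here.
        if latest > _registry_version:
            await _reload_registry(hospital_id, latest)
```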
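force_registry_rebuild(hospital_id) bypasses the poll interval and triggers a reload immediately. PublishService calls it after every publish or rollback so that the change is reflected in search within milliseconds instead of after the next poll. This matters most for rollback: the poll loop only reloads when the latest version is greater than the loaded one, so a rollback, which lowers the latest version, is not picked up by polling alone.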
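One way to combine a poll interval with a force signal is an asyncio.Event that interrupts the wait. This is a sketch under assumptions, not the registry's implementation: `VersionWatcher` and its callables are hypothetical, and it compares with `!=` rather than `>` so that a lowered version is also reloaded.

```python
import asyncio


class VersionWatcher:
    """Poll for a version change, but wake early when force() is called."""

    def __init__(self, get_latest, reload, interval: float = 60.0) -> None:
        self._get_latest = get_latest  # async callable returning the latest version
        self._reload = reload          # async callable taking the version to load
        self._interval = interval
        self._force = asyncio.Event()
        self.current = 0

    def force(self) -> None:
        """Request an immediate check, e.g. right after publish or rollback."""
        self._force.set()

    async def run_once(self) -> None:
        """Wait for the interval or a force signal, then reload on any change."""
        try:
            await asyncio.wait_for(self._force.wait(), timeout=self._interval)
        except asyncio.TimeoutError:
            pass  # normal poll tick
        self._force.clear()
        latest = await self._get_latest()
        if latest != self.current:
            await self._reload(latest)
            self.current = latest


async def demo() -> list[int]:
    reloaded: list[int] = []

    async def get_latest() -> int:
        return 7

    async def reload(version: int) -> None:
        reloaded.append(version)

    watcher = VersionWatcher(get_latest, reload, interval=60.0)
    watcher.force()
    # Completes well inside one second because the force signal skips the wait.
    await asyncio.wait_for(watcher.run_once(), timeout=1.0)
    return reloaded


reloaded_versions = asyncio.run(demo())
print(reloaded_versions)  # [7]
```

A production loop would call `run_once()` repeatedly inside `while True`; the single iteration here keeps the example terminating.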
The ScrapeResult-based initialization path (initialize_registry_from_result()) still works. It is used in unit tests and legacy scripts that do not yet have a published version. The from_published() path is the production path from SP-5 onwards.
API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/taxonomy/publish/preview | Impact delta before publishing |
| POST | /api/v1/taxonomy/publish | Execute publish (returns new version) |
| POST | /api/v1/taxonomy/publish/rollback/{version} | Rollback to version N |
| DELETE | /api/v1/taxonomy/publish/unpublish | Remove all published data for hospital |
| GET | /api/v1/taxonomy/publish/versions | List all versions with metadata |
| GET | /api/v1/taxonomy/publish/versions/{version} | Single version details + entity counts |
All write endpoints (POST and DELETE) require admin authorization, enforced by the require_admin dependency through FastAPI dependency injection. The preview and version-listing endpoints require standard authentication only.
Regulatory Compliance
The versioned publish system directly supports:
- EU AI Act Art. 12: Every publish creates an immutable record of which entity inventory was active. The taxonomy_versions table with published_by and published_at provides the required automatic logging.
- EU AI Act Art. 14: Human oversight is enforced: entities cannot reach the RAG pipeline without explicit operator approval and a deliberate publish action.
- GDPR Art. 22: The audit trail in taxonomy_overrides (SP-4) records every approve/reject decision with the responsible user ID, supporting the right to human review.
References
- Brewer, E. A. (2000). Towards robust distributed systems. PODC 2000 Keynote. (CAP theorem — basis for advisory lock design)
- Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media. (Chapter 7: Transactions)
- European Parliament. (2024). EU AI Act, Regulation (EU) 2024/1689. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689