Draft/Publish System

The Draft/Publish system (SP-5) provides the separation layer between AI-extracted draft data (taxonomy entities awaiting operator review) and validated live data consumed by the RAG pipeline. It is the final gate before taxonomy changes reach search users.

SP-5 in Context

SP-5 receives approved entities from the Entity Resolution Pipeline (SP-4) and exposes versioned snapshots to the Frozen Taxonomy Registry. The Pipeline Wizard (SP-6) provides the operator UI for triggering publish, previewing deltas, and managing version history.

The Problem

Without a draft/publish boundary, any change to taxonomy data would immediately affect the live RAG pipeline:

An operator approving a batch of entities mid-review could expose an incomplete taxonomy to search users
Rolling back a bad extraction run would require surgical DELETE statements against production tables
Regulatory auditors (EU AI Act, Art. 12) could not reconstruct which entity inventory was active at a given point in time

The draft/publish system addresses all three: operators work entirely in draft space and publish is an atomic snapshot operation, rollback is a version-range DELETE, and every published version is recorded in taxonomy_versions.

Architecture

Published Tables

Three new tables (migration 053) hold the published snapshot:

`published_entities`

A version-per-row store: each publish operation inserts new rows tagged with the version number. Rows from previous versions are not deleted — they remain for rollback and audit.

Column	Type	Notes
`id`	UUID PK
`version`	INTEGER	Publish version number
`hospital_id`	UUID FK
`entity_type`	VARCHAR(30)	`doctor`, `department`, `condition`, etc.
`canonical_name`	VARCHAR(300)
`aliases`	JSONB
`metadata`	JSONB	Full entity metadata
`snomed_concept_id`	VARCHAR(20)	Nullable
`source_entity_id`	UUID FK	Reference back to `taxonomy_entities`
`published_at`	TIMESTAMPTZ
`published_by`	UUID FK → `users`

`published_relationships`

Same version-per-row strategy as published_entities.

Column	Type	Notes
`id`	UUID PK
`version`	INTEGER	Publish version number
`hospital_id`	UUID FK
`source_name` / `source_type`	VARCHAR
`target_name` / `target_type`	VARCHAR
`relationship_type`	VARCHAR(50)	`HANDLES`, `OFFERS`, `PERFORMS`, `WORKS_IN_SCHEDULE`
`metadata`	JSONB
`source_relationship_id`	UUID FK	Reference back to `taxonomy_relationships`
`published_at`	TIMESTAMPTZ

`taxonomy_versions`

One row per publish operation — the version registry.

Column	Type	Notes
`id`	UUID PK
`hospital_id`	UUID FK
`version`	INTEGER	Monotonically increasing per hospital
`draft_version`	INTEGER	Which draft version was published
`entity_count`	INTEGER	Entities in this snapshot
`relationship_count`	INTEGER	Relationships in this snapshot
`published_at`	TIMESTAMPTZ
`published_by`	UUID FK → `users`
`notes`	TEXT	Operator notes

Version-per-Row Strategy

The alternative, a JSONB snapshot per version, would require deserializing large JSON blobs to reconstruct the entity inventory. With version-per-row, the registry loads a snapshot from the published tables with a simple WHERE version = :v filter that is evaluated entirely in SQL.
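As an illustration of the version-per-row lookup, the sketch below uses an in-memory SQLite table as a stand-in for published_entities (the real table lives in PostgreSQL). The reduced column set, the sample rows, and the load_snapshot helper are assumptions made for the example, not the project's code.

```python
import sqlite3

# In-memory stand-in for published_entities; only a few columns are modeled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE published_entities ("
    " version INTEGER, hospital_id TEXT, entity_type TEXT, canonical_name TEXT)"
)
conn.executemany(
    "INSERT INTO published_entities VALUES (?, ?, ?, ?)",
    [
        (1, "h1", "doctor", "Dr. A"),           # snapshot for version 1
        (2, "h1", "doctor", "Dr. A"),           # assumed: each version holds a full snapshot
        (2, "h1", "department", "Cardiology"),
    ],
)
conn.commit()


def load_snapshot(conn, hospital_id, version):
    """Return the entity rows of one published version (hypothetical helper)."""
    cur = conn.execute(
        "SELECT entity_type, canonical_name FROM published_entities"
        " WHERE hospital_id = ? AND version = ?"
        " ORDER BY entity_type, canonical_name",
        (hospital_id, version),
    )
    return cur.fetchall()
```

Older versions stay queryable with the same filter, which is what makes rollback and audit lookups cheap.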

Publish Flow

PublishService.publish() executes the following steps inside a single database transaction:

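Advisory lock: pg_advisory_xact_lock ensures that only one publish runs at a time per hospital. If a second operator triggers a publish concurrently, the second call waits for the lock instead of producing a corrupt snapshot. Because the lock is transaction-scoped, PostgreSQL releases it automatically when the transaction commits or rolls back.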
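pg_advisory_xact_lock takes a 64-bit integer key, so a per-hospital lock needs some mapping from the hospital UUID to a bigint. The source does not say how PublishService derives its key; the sketch below shows one possible derivation, and both advisory_lock_key and LOCK_SQL are hypothetical names.

```python
import hashlib
import struct
from uuid import UUID


def advisory_lock_key(hospital_id: UUID) -> int:
    """Map a hospital UUID to a signed 64-bit lock key (assumed derivation)."""
    digest = hashlib.sha256(hospital_id.bytes).digest()
    # pg_advisory_xact_lock(bigint) expects a signed 64-bit integer,
    # so interpret the first 8 digest bytes as a big-endian int64.
    (key,) = struct.unpack(">q", digest[:8])
    return key


# Statement to run as the first step inside the publish transaction.
LOCK_SQL = "SELECT pg_advisory_xact_lock(:key)"
```

The key is deterministic, so every process that publishes for the same hospital contends for the same lock, while different hospitals almost never block each other.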

ID remapping: Relationships in taxonomy_relationships reference taxonomy_entities by UUID. During publish, the pipeline re-maps these to the newly inserted published_entities UUIDs so that published_relationships references only published entity rows.
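The remapping step described above can be sketched as a dictionary from draft entity IDs to freshly generated published IDs. This is a minimal illustration: the source_id and target_id field names and the remap_relationship_ids helper are assumptions, and the actual pipeline may resolve references differently.

```python
from uuid import UUID, uuid4


def remap_relationship_ids(draft_entity_ids, relationships):
    """Assign new published IDs and rewrite relationship endpoints.

    draft_entity_ids: iterable of taxonomy_entities UUIDs being published.
    relationships: list of dicts with hypothetical 'source_id'/'target_id' keys.
    """
    # One new published UUID per draft entity.
    id_map = {draft_id: uuid4() for draft_id in draft_entity_ids}
    remapped = []
    for rel in relationships:
        # A KeyError here means the relationship points at an entity
        # that is not part of this publish.
        remapped.append({
            **rel,
            "source_id": id_map[rel["source_id"]],
            "target_id": id_map[rel["target_id"]],
        })
    return id_map, remapped
```

Failing loudly on an unmapped endpoint is a deliberate choice in this sketch: inside a single transaction, the error aborts the publish instead of writing a dangling reference.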

Impact Preview

Before committing a publish, operators can call the preview endpoint to see the delta against the current live version:

{
  "current_version": 6,
  "next_version": 7,
  "added": {
    "entities": 14,
    "relationships": 31
  },
  "modified": {
    "entities": 3,
    "relationships": 0
  },
  "removed": {
    "entities": 2,
    "relationships": 7
  },
  "entity_breakdown": {
    "doctor": {"added": 5, "modified": 1, "removed": 0},
    "condition": {"added": 9, "modified": 2, "removed": 2},
    "department": {"added": 0, "modified": 0, "removed": 0}
  }
}

The delta is computed by comparing approved entities in the current draft against the rows in published_entities WHERE version = current_version. No data is written during preview.
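A read-only delta of this kind reduces to set arithmetic over entity keys. The sketch below assumes both sides are available as dictionaries keyed by the draft entity ID (source_entity_id on the published side); compute_delta and the payload shape are hypothetical.

```python
def compute_delta(draft, published):
    """Count added, modified, and removed entities.

    draft: dict mapping draft entity id -> comparable payload (approved entities).
    published: dict mapping source_entity_id -> payload from the live version.
    """
    draft_ids, live_ids = set(draft), set(published)
    added = draft_ids - live_ids
    removed = live_ids - draft_ids
    # Present on both sides but with different content.
    modified = {i for i in draft_ids & live_ids if draft[i] != published[i]}
    return {
        "added": len(added),
        "modified": len(modified),
        "removed": len(removed),
    }
```

Running the same function per entity_type would yield a breakdown shaped like entity_breakdown in the response above.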

Rollback

Rolling back to version N deletes all published rows with version > N and removes the corresponding taxonomy_versions records:

-- Rollback to version 5
DELETE FROM app.published_entities
WHERE hospital_id = :hospital_id AND version > 5;

DELETE FROM app.published_relationships
WHERE hospital_id = :hospital_id AND version > 5;

DELETE FROM app.taxonomy_versions
WHERE hospital_id = :hospital_id AND version > 5;

After deletion, force_registry_rebuild() is called to reload the registry from the now-current version.
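The three DELETE statements only leave the tables consistent if they succeed or fail together. The sketch below shows that pattern against in-memory SQLite stand-ins for the published tables; rollback_to, the reduced two-column schema, and the sample data are assumptions, and the production code runs on PostgreSQL in the app schema.

```python
import sqlite3

TABLES = ("published_entities", "published_relationships", "taxonomy_versions")


def rollback_to(conn, hospital_id, target_version):
    """Delete every published row newer than target_version, atomically."""
    with conn:  # commits on success, rolls back if any DELETE raises
        for table in TABLES:  # table names come from the fixed tuple above
            conn.execute(
                f"DELETE FROM {table} WHERE hospital_id = ? AND version > ?",
                (hospital_id, target_version),
            )


# Minimal stand-in schema: only the columns the rollback filter needs.
conn = sqlite3.connect(":memory:")
for table in TABLES:
    conn.execute(f"CREATE TABLE {table} (hospital_id TEXT, version INTEGER)")
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?)",
        [("h1", v) for v in (4, 5, 6, 7)] + [("h2", 9)],
    )
conn.commit()
```

The hospital_id filter matters as much as the version filter: version numbers are per hospital, so a rollback for one hospital must leave other hospitals' rows untouched.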

Rollback Is Irreversible

Rolled-back versions cannot be recovered without re-running the publish pipeline. The source data in taxonomy_entities (draft space) is never touched by rollback — operators can re-approve and re-publish as needed.

Registry Integration

FrozenTaxonomyRegistry gains a from_published() classmethod (SP-5) that loads directly from published_entities and published_relationships instead of the legacy ScrapeResult path:

@classmethod
async def from_published(
    cls,
    session: AsyncSession,
    hospital_id: UUID,
    version: int | None = None,  # None = latest
) -> "FrozenTaxonomyRegistry":
    """Load registry from published tables."""
    if version is None:
        version = await _get_latest_version(session, hospital_id)
    entities = await _load_published_entities(session, hospital_id, version)
    relationships = await _load_published_relationships(session, hospital_id, version)
    return cls._build_registry_from_rows(entities, relationships)

_build_registry_from_rows() converts flat DB rows into the FrozenDoctor, FrozenDepartment, and other frozen dataclasses that the registry's O(1) lookup indexes are built from.
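To illustrate the conversion from flat rows to frozen dataclasses with constant-time lookup, here is a minimal sketch. FrozenEntity is a hypothetical stand-in for FrozenDoctor, FrozenDepartment, and the other real classes, and the (entity_type, lowercased name) index key is an assumption about how lookups might be keyed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenEntity:
    """Hypothetical stand-in for the registry's frozen dataclasses."""
    entity_type: str
    canonical_name: str
    aliases: tuple


def build_index(rows):
    """Build a dict index so lookups by name or alias are O(1) on average."""
    index = {}
    for row in rows:
        entity = FrozenEntity(
            entity_type=row["entity_type"],
            canonical_name=row["canonical_name"],
            aliases=tuple(row.get("aliases") or ()),
        )
        # Index the canonical name and every alias under the same entity.
        for name in (entity.canonical_name, *entity.aliases):
            index[(entity.entity_type, name.lower())] = entity
    return index
```

Freezing the dataclass means a loaded registry cannot be mutated in place; a new published version produces a new registry instead.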

Version-Check Cache Invalidation

The registry singleton checks for a new published version every 60 seconds:

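import asyncio

async def _version_check_loop(hospital_id: UUID) -> None:
    """Background task: poll taxonomy_versions every 60s."""
    while True:
        await asyncio.sleep(60)
        # `session` is assumed to be an AsyncSession available in the enclosing scope.
        latest = await _get_latest_version(session, hospital_id)
        # `_registry_version` is the version of the currently loaded registry.
        if latest > _registry_version:
            await _reload_registry(hospital_id, latest)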

force_registry_rebuild(hospital_id) bypasses the poll interval and triggers a reload immediately. PublishService calls it after every publish or rollback so that the change is reflected in search right away instead of after a wait of up to 60 seconds.
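One way to combine periodic polling with a forced reload is to wait on an asyncio.Event with a timeout, so that setting the event wakes the loop early. The VersionWatcher class below is a hypothetical sketch of that pattern, not the registry's actual implementation; get_latest and reload are injected callables assumed for the example.

```python
import asyncio


class VersionWatcher:
    """Poll for new versions, with a way to force an immediate reload."""

    def __init__(self, get_latest, reload, interval=60.0):
        self._get_latest = get_latest  # async callable returning the latest version
        self._reload = reload          # async callable taking the version to load
        self._interval = interval
        self._force = asyncio.Event()
        self.version = 0

    def force_rebuild(self):
        """Wake the loop now instead of waiting for the next poll."""
        self._force.set()

    async def run(self):
        while True:
            try:
                # Returns early if force_rebuild() is called; otherwise times out.
                await asyncio.wait_for(self._force.wait(), timeout=self._interval)
            except asyncio.TimeoutError:
                pass
            forced = self._force.is_set()
            self._force.clear()
            latest = await self._get_latest()
            # A forced reload also covers rollback, where latest < self.version.
            if forced or latest > self.version:
                await self._reload(latest)
                self.version = latest
```

A plain latest > version comparison never fires after a rollback, because the latest version number goes down; that is one reason a forced path is useful in addition to the poll.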

Backward Compatibility

The ScrapeResult-based initialization path (initialize_registry_from_result()) still works. It is used in unit tests and legacy scripts that do not yet have a published version. The from_published() path is the production path from SP-5 onwards.

API Endpoints

Method	Path	Description
`GET`	`/api/v1/taxonomy/publish/preview`	Impact delta before publishing
`POST`	`/api/v1/taxonomy/publish`	Execute publish (returns new version)
`POST`	`/api/v1/taxonomy/publish/rollback/{version}`	Rollback to version N
`DELETE`	`/api/v1/taxonomy/publish/unpublish`	Remove all published data for hospital
`GET`	`/api/v1/taxonomy/publish/versions`	List all versions with metadata
`GET`	`/api/v1/taxonomy/publish/versions/{version}`	Single version details + entity counts

All write endpoints (POST and DELETE) are guarded by the require_admin authorization dependency, injected through FastAPI's dependency injection. The preview and version-listing endpoints require only standard authentication.

Regulatory Compliance

The versioned publish system directly supports:

EU AI Act Art. 12 — Every publish creates an immutable record of which entity inventory was active. The taxonomy_versions table with published_by and published_at provides the required automatic logging.
EU AI Act Art. 14 — Human oversight is enforced: entities cannot reach the RAG pipeline without explicit operator approval and a deliberate publish action.
GDPR Art. 22 — The audit trail in taxonomy_overrides (SP-4) records every approve/reject decision with the responsible user ID, supporting the right to human review.

References

Brewer, E. A. (2000). Towards robust distributed systems. PODC 2000 Keynote. (CAP theorem — basis for advisory lock design)
Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media. (Chapter 7: Transactions)
European Parliament. (2024). EU AI Act, Regulation (EU) 2024/1689. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689
