Taxonomy Pipeline Wizard
The Taxonomy Pipeline Wizard (SP-6) is the management UI that guides hospital operators through the full taxonomy population workflow: from initial hospital configuration through crawling, hub page confirmation, entity review, and publishing to the live RAG index.
TaxonomyPage.tsx (the original 8-tab taxonomy management interface) is removed and replaced by this 5-stage wizard. The new design reflects the pipeline's natural sequencing rather than exposing raw CRUD tables.
Overview
The wizard presents five stages in pipeline order, with a persistent vertical sidebar for navigation. Each stage maps to one major pipeline step:
| Stage | Name | Core Action |
|---|---|---|
| 1 | Hospital Setup | Verify configuration completeness |
| 2 | Crawl & Discover | Trigger/monitor web crawls |
| 3 | Hub Pages | Confirm or reject hub page candidates |
| 4 | Taxonomy Review | Review, approve, and merge extracted entities |
| 5 | Publish | Preview delta and publish to live index |
Stages are not forced-sequential — operators can jump to any stage via the sidebar. Stage status indicators (not started / in progress / complete) provide at-a-glance pipeline health without locking operators into a linear flow.
Design Principles
AI proposes, human approves. The pipeline runs autonomously (crawl → extract → deduplicate → SNOMED-match), but nothing reaches the live search index without an explicit operator approval and publish action. This enforces EU AI Act Art. 14 human oversight at the architectural level.
No manual data entry. Operators confirm, reject, or merge AI-generated proposals. They do not type entity names or configure relationships manually. This eliminates typos and ensures consistency with the scraped source data.
Full reversibility. Every stage action is undoable: rejected hub pages can be re-confirmed, rejected entities can be re-approved, and published versions can be rolled back. The pipeline never destroys data.
Layout
```text
┌─────────────────────────────────────────────────────────────────┐
│ Taxonomy Pipeline — ZOL                                         │
├──────────────────┬──────────────────────────────────────────────┤
│ ● 1 Setup      ✓ │                                              │
│ ● 2 Crawl      ✓ │  Stage Content Area                          │
│ ● 3 Hub Pages  ● │                                              │
│ ● 4 Review     ○ │  (loaded lazily per stage click)             │
│ ● 5 Publish    ○ │                                              │
└──────────────────┴──────────────────────────────────────────────┘
```
Status icons in the sidebar:
- ✓ Complete — all items processed
- ● In Progress — partially done (pending items remain)
- ○ Not Started — no activity yet
The sidebar is driven by PipelineStatusResponse from the backend aggregation endpoint, which combines counts from all five pipeline stages into a single response.
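As a rough sketch of how the sidebar could map a stage status to its icon, and how a status might be derived from item counts: the names StageStatus, STATUS_ICON, deriveStageStatus, and sidebarLabel are hypothetical, and the derivation rule is an assumption consistent with the icon legend rather than the backend's confirmed logic.

```typescript
// Illustrative sketch; all names here are hypothetical, not taken from the codebase.
type StageStatus = "not_started" | "in_progress" | "complete";

// Sidebar icon for each status, matching the legend above.
const STATUS_ICON: Record<StageStatus, string> = {
  complete: "✓",
  in_progress: "●",
  not_started: "○",
};

// Assumed rule: no items yet means not started, any pending item means
// in progress, otherwise every item has been processed.
function deriveStageStatus(total: number, pending: number): StageStatus {
  if (total === 0) return "not_started";
  return pending > 0 ? "in_progress" : "complete";
}

// Build the text of one sidebar row from its index, name, and status.
function sidebarLabel(index: number, name: string, status: StageStatus): string {
  return `${index} ${name} ${STATUS_ICON[status]}`;
}
```

Under this rule, a stage with 38 items of which 7 are pending would show as in progress, and a stage with no items would show as not started.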
Stage 1 — Hospital Setup
SetupChecklist.tsx displays a config completeness checklist:
| Item | Check |
|---|---|
| Hospital name configured | hospital.name is set |
| At least one website URL | hospital_websites count ≥ 1 |
| At least one campus defined | hospital_campuses count ≥ 1 |
| Crawl settings present | crawl_config not null |
Each incomplete item links directly to the relevant settings page. Stage 1 is considered complete when all four checks pass.
Stage 1 doubles as the onboarding checklist for new hospital configurations. A new hospital tenant will land here first and follow the checklist to readiness before starting any crawl.
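The four checks can be sketched as a pure function over a configuration summary. The HospitalSetupInput shape and the function names below are assumptions made for illustration; only the check labels and conditions come from the table above.

```typescript
// Hypothetical input shape; field names are illustrative, not the real schema.
interface HospitalSetupInput {
  name: string | null;
  websiteCount: number; // number of rows in hospital_websites
  campusCount: number; // number of rows in hospital_campuses
  crawlConfig: Record<string, unknown> | null;
}

interface ChecklistItem {
  label: string;
  passed: boolean;
}

// Evaluate the four completeness checks in the order shown in the table.
function evaluateSetup(input: HospitalSetupInput): ChecklistItem[] {
  return [
    {
      label: "Hospital name configured",
      passed: input.name !== null && input.name.trim() !== "",
    },
    { label: "At least one website URL", passed: input.websiteCount >= 1 },
    { label: "At least one campus defined", passed: input.campusCount >= 1 },
    { label: "Crawl settings present", passed: input.crawlConfig !== null },
  ];
}

// Stage 1 is complete only when every check passes.
function isSetupComplete(items: ChecklistItem[]): boolean {
  return items.every((item) => item.passed);
}
```

Returning the individual items rather than a single boolean lets the checklist render each failing item with its own link to the relevant settings page.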
Stage 2 — Crawl & Discover
CrawlDashboard.tsx presents crawl statistics and a recrawl trigger:
Stat cards:
- Total URLs discovered
- Active URLs (200 OK)
- Dead URLs (404/403/410)
- Last crawl timestamp
Recrawl trigger: POST /api/v1/crawl/start — kicks off a background crawl job. Progress is tracked via SSE. The recrawl button is disabled while a crawl is in progress (status from GET /api/v1/crawl/status).
Stage 2 is complete when at least one successful crawl has run and URLs are present in crawled_urls.
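A minimal sketch of the URL classification behind the stat cards and of the Stage 2 completion rule follows. The CrawlSummary fields mirror the crawl object in the pipeline status response; treating a non-null last_crawl_at as evidence of a successful crawl is an assumption, as are the function names.

```typescript
type UrlHealth = "active" | "dead" | "other";

// Mirrors the stat cards: 200 counts as active; 404, 403, and 410 count as dead.
// Any other status code is left unclassified here.
function classifyUrl(httpStatus: number): UrlHealth {
  if (httpStatus === 200) return "active";
  if (httpStatus === 404 || httpStatus === 403 || httpStatus === 410) return "dead";
  return "other";
}

interface CrawlSummary {
  total_urls: number;
  active_urls: number;
  dead_urls: number;
  last_crawl_at: string | null;
}

// Assumption: a non-null last_crawl_at stands in for "at least one
// successful crawl has run"; the real check may differ.
function isCrawlStageComplete(summary: CrawlSummary): boolean {
  return summary.last_crawl_at !== null && summary.total_urls > 0;
}
```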
Stage 3 — Hub Pages
HubCandidateList.tsx shows a card grid of hub page candidates identified by the LLM classifier (SP-2):
Card Design
Each HubCandidateCard.tsx displays:
- Page title and URL
- AI confidence score (0–100%)
- Page type classification (doctors listing, department listing, conditions, etc.)
- Discovered child URL count
- Confirm / Reject action buttons (inline, no dialog)
Per project conventions, destructive actions use the inline ConfirmBar component rather than native browser dialogs such as window.confirm(). Rejecting a hub page shows a ConfirmBar inline within the card before committing.
Filters
The candidate list is filterable by:
- Status: All / Pending / Confirmed / Rejected
- Page type: All types or a specific classification
- Confidence threshold: slider (default: ≥70%)
Stage 3 is complete when all candidates have been confirmed or rejected (no pending items remain).
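The three filters and the completion rule could be combined roughly as follows. The HubCandidate and HubFilters shapes, the confidencePct field, and the function names are hypothetical; only the filter options and the 70% default come from the description above.

```typescript
type CandidateStatus = "pending" | "confirmed" | "rejected";

// Hypothetical candidate shape for illustration.
interface HubCandidate {
  url: string;
  pageType: string; // e.g. "doctors_listing"; values are illustrative
  confidencePct: number; // AI confidence on a 0-100 scale
  status: CandidateStatus;
}

interface HubFilters {
  status: CandidateStatus | "all";
  pageType: string; // "all" or a specific classification
  minConfidencePct: number;
}

const DEFAULT_FILTERS: HubFilters = { status: "all", pageType: "all", minConfidencePct: 70 };

// Apply all three filters; a candidate must satisfy every active filter.
function filterCandidates(candidates: HubCandidate[], f: HubFilters): HubCandidate[] {
  return candidates.filter(
    (c) =>
      (f.status === "all" || c.status === f.status) &&
      (f.pageType === "all" || c.pageType === f.pageType) &&
      c.confidencePct >= f.minConfidencePct,
  );
}

// Completion looks at every candidate, not just the filtered view.
function isHubStageComplete(candidates: HubCandidate[]): boolean {
  return candidates.length > 0 && candidates.every((c) => c.status !== "pending");
}
```

One consequence of this sketch is that pending candidates below the confidence threshold are hidden by the default filter but still keep the stage in progress, so an operator may need to lower the slider to find them.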
Stage 4 — Taxonomy Review
Stage 4 has three views selectable via a tab strip: Entities, Relationships, and Graph (placeholder).
Entity Table (EntityTable.tsx)
The main review interface for the entity resolution output (SP-4).
Columns:
| Column | Description |
|---|---|
| Entity name | canonical_name |
| Type | doctor / department / condition / treatment / examination |
| Status | proposed / approved / rejected |
| AI Confidence | ai_confidence badge (color-coded: green ≥0.85, yellow ≥0.65, red below 0.65) |
| SNOMED | Concept ID badge if matched |
| Hub source | Which hub page produced this entity |
| Actions | Approve / Reject (inline) |
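The color thresholds for the AI Confidence badge can be written as a small function. This is a sketch: the function name and the BadgeColor type are illustrative, and ai_confidence is assumed to be a fraction between 0 and 1, as the thresholds suggest.

```typescript
type BadgeColor = "green" | "yellow" | "red";

// Thresholds from the column description: green at 0.85 and above,
// yellow at 0.65 and above, red below 0.65.
function confidenceBadgeColor(aiConfidence: number): BadgeColor {
  if (aiConfidence >= 0.85) return "green";
  if (aiConfidence >= 0.65) return "yellow";
  return "red";
}
```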
Filter bar (EntityFilterBar.tsx): filter pills for type and status, free-text search on canonical_name, confidence threshold slider.
Expandable rows (EntityExpandedRow.tsx): clicking a row expands to show:
- Raw extracted name (pre-normalization)
- dedup_key and cluster members (non-primary variants)
- SNOMED preferred term and match confidence
- Source snippet from the hub page
Bulk actions (EntityBulkActions.tsx): when rows are selected via checkbox:
- Bulk approve (all selected)
- Bulk reject (all selected)
- Bulk approve by confidence threshold (e.g., approve all with confidence ≥0.85)
Stage 4 is complete when all proposed entities have been approved or rejected.
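Bulk approval by confidence threshold could look roughly like the sketch below. The ReviewEntity shape is a hypothetical subset of the entity record, and the choice to leave already rejected entities untouched is an assumption, not confirmed behavior.

```typescript
type ReviewStatus = "proposed" | "approved" | "rejected";

// Hypothetical subset of an entity record.
interface ReviewEntity {
  id: string;
  canonical_name: string;
  ai_confidence: number; // fraction between 0 and 1
  status: ReviewStatus;
}

// Approve every proposed entity at or above the threshold.
// Assumption: entities that were already rejected or approved are not changed.
// Returns a new array and leaves the input untouched.
function bulkApproveByConfidence(entities: ReviewEntity[], threshold: number): ReviewEntity[] {
  return entities.map((e): ReviewEntity =>
    e.status === "proposed" && e.ai_confidence >= threshold ? { ...e, status: "approved" } : e,
  );
}

// Stage 4 is complete once no entity is still in the proposed state.
function isReviewComplete(entities: ReviewEntity[]): boolean {
  return entities.length > 0 && entities.every((e) => e.status !== "proposed");
}
```

Returning a new array instead of mutating in place fits an optimistic-update flow, where the previous list is kept so it can be restored if the server call fails.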
Relationship Browser (RelationshipBrowser.tsx)
A tabular cross-reference browser for taxonomy relationships. Tab strip switches between relationship types:
| Tab | Relationships Shown |
|---|---|
| Works In | Doctor → Department |
| Handles | Department → Condition |
| Offers | Department → Treatment |
| Performs | Department → Examination |
| Treats | Treatment → Condition |
| Diagnoses | Examination → Condition |
Each tab shows a sortable table with source name, target name, confidence, and a remove action. The relationship browser is read-heavy — operators rarely modify relationships manually, but the view is essential for spotting extraction artifacts (e.g., a department linked to an implausible condition).
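The tab table above can be captured as data, which also allows a simple type-level sanity check on extracted relationships. This is a sketch with hypothetical names (RelationshipKind, RELATIONSHIP_KINDS, matchesTab); it only verifies that the endpoint entity types fit the tab and cannot detect a semantically implausible pairing.

```typescript
type EntityType = "doctor" | "department" | "condition" | "treatment" | "examination";

interface RelationshipKind {
  tab: string;
  source: EntityType;
  target: EntityType;
}

// One entry per tab, in the order listed in the table.
const RELATIONSHIP_KINDS: RelationshipKind[] = [
  { tab: "Works In", source: "doctor", target: "department" },
  { tab: "Handles", source: "department", target: "condition" },
  { tab: "Offers", source: "department", target: "treatment" },
  { tab: "Performs", source: "department", target: "examination" },
  { tab: "Treats", source: "treatment", target: "condition" },
  { tab: "Diagnoses", source: "examination", target: "condition" },
];

// True when the source and target entity types are the ones the tab expects.
function matchesTab(tab: string, source: EntityType, target: EntityType): boolean {
  const kind = RELATIONSHIP_KINDS.find((k) => k.tab === tab);
  return kind !== undefined && kind.source === source && kind.target === target;
}
```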
Graph View (Placeholder)
GraphPlaceholder.tsx renders a stub with the message:
"Graph visualization coming in a future release. The entity relationship data is complete and ready to visualize."
The component reserves the tab slot without blocking the current release. A D3.js or Cytoscape.js force-directed graph of the entity network is planned for a later sprint.
Stage 5 — Publish
PublishPage.tsx is the final gate before taxonomy changes reach search users.
Layout
```text
┌─────────────────────────────────────────────────────────┐
│ Current version: 6  │  Last published: 2026-03-14       │
├─────────────────────────────────────────────────────────┤
│                                                         │
│ Impact Preview                                          │
│ ─────────────────────────────────────────────────────── │
│  + 14 entities added         ▸ 3 entities modified      │
│  - 2 entities removed        + 31 relationships added   │
│                                                         │
│  [ Preview Details ]         [ Publish Version 7 ]      │
│                                                         │
├─────────────────────────────────────────────────────────┤
│ Version History                                         │
│ v6   2026-03-14   312 entities   891 rels   [Rollback]  │
│ v5   2026-03-10   298 entities   847 rels   [Rollback]  │
│ v4   2026-03-07   291 entities   820 rels   [Rollback]  │
└─────────────────────────────────────────────────────────┘
```
PublishPreview.tsx
Displays the delta from GET /api/v1/taxonomy/publish/preview before committing:
- Summary counts (added/modified/removed for entities and relationships)
- Entity type breakdown (how many doctors vs conditions vs treatments changed)
- Collapsible detail list of specific changes
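To illustrate what the added/modified/removed counts mean, the sketch below diffs a live snapshot against a draft snapshot. It is purely illustrative: the delta is served by the preview endpoint, and the VersionedEntity shape, the fingerprint field (a stand-in for any content hash), and computeDelta are hypothetical.

```typescript
// Hypothetical snapshot entry; fingerprint stands in for any content hash.
interface VersionedEntity {
  id: string;
  fingerprint: string;
}

interface DeltaCounts {
  added: number;
  modified: number;
  removed: number;
}

// Compare the currently published entities with the draft to be published.
function computeDelta(live: VersionedEntity[], draft: VersionedEntity[]): DeltaCounts {
  const liveById = new Map<string, string>();
  for (const e of live) liveById.set(e.id, e.fingerprint);

  const draftIds = new Set<string>();
  let added = 0;
  let modified = 0;
  for (const e of draft) {
    draftIds.add(e.id);
    const previous = liveById.get(e.id);
    if (previous === undefined) added += 1; // not in the live version
    else if (previous !== e.fingerprint) modified += 1; // content changed
  }

  // Anything live that no longer appears in the draft counts as removed.
  const removed = live.filter((e) => !draftIds.has(e.id)).length;
  return { added, modified, removed };
}
```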
VersionHistory.tsx
A table of all published versions with:
- Version number
- Published date
- Entity and relationship counts
- Publisher user name
- Rollback button (inline ConfirmBar confirmation required)
Rollback calls POST /api/v1/taxonomy/publish/rollback/{version} and force-reloads the pipeline status sidebar after completion.
Pipeline Status API
A single backend endpoint aggregates all stage statuses:
GET /api/v1/pipeline-status/{hospital_id}
Response (PipelineStatusResponse):
```json
{
  "hospital_setup": {
    "status": "complete",
    "name": true,
    "websites": 2,
    "campuses": 4
  },
  "crawl": {
    "status": "complete",
    "total_urls": 4821,
    "active_urls": 4604,
    "dead_urls": 217,
    "last_crawl_at": "2026-03-14T09:30:00Z"
  },
  "hub_pages": {
    "status": "in_progress",
    "total": 38,
    "confirmed": 31,
    "pending": 7,
    "rejected": 0
  },
  "taxonomy_review": {
    "status": "not_started",
    "total": 0,
    "approved": 0,
    "proposed": 0,
    "rejected": 0
  },
  "publish": {
    "status": "not_started",
    "current_version": null,
    "last_published_at": null
  }
}
```
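A TypeScript type for this response might look like the sketch below. The field names and nullability are inferred from the sample response only, so the real PipelineStatusResponse may carry additional or differently typed fields. The firstIncompleteStage helper is hypothetical and shows one way the wizard could pick a default stage to open.

```typescript
type StageStatus = "not_started" | "in_progress" | "complete";

// Inferred from the sample response; not the authoritative definition.
interface PipelineStatusResponse {
  hospital_setup: { status: StageStatus; name: boolean; websites: number; campuses: number };
  crawl: {
    status: StageStatus;
    total_urls: number;
    active_urls: number;
    dead_urls: number;
    last_crawl_at: string | null;
  };
  hub_pages: { status: StageStatus; total: number; confirmed: number; pending: number; rejected: number };
  taxonomy_review: { status: StageStatus; total: number; approved: number; proposed: number; rejected: number };
  publish: { status: StageStatus; current_version: number | null; last_published_at: string | null };
}

// Stage keys in pipeline order.
const STAGE_ORDER = ["hospital_setup", "crawl", "hub_pages", "taxonomy_review", "publish"] as const;
type StageKey = (typeof STAGE_ORDER)[number];

// Hypothetical helper: the first stage that still needs attention, or null
// when every stage is complete.
function firstIncompleteStage(response: PipelineStatusResponse): StageKey | null {
  for (const key of STAGE_ORDER) {
    if (response[key].status !== "complete") return key;
  }
  return null;
}
```

For the sample response, this helper would point at hub_pages, since Stages 1 and 2 are complete and seven candidates are still pending.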
Two companion endpoints:
| Endpoint | Purpose |
|---|---|
| GET /api/v1/pipeline-status/{hospital_id}/crawl-summary | Crawl statistics for Stage 2 |
| GET /api/v1/pipeline-status/{hospital_id}/entity-counts | Entity counts by type and status for Stage 4 |
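A frontend service might build these three paths with small helpers like the ones below. The function names are hypothetical, and treating hospital_id as a string is an assumption; the ID is URL-encoded so that unexpected characters cannot alter the path.

```typescript
const PIPELINE_STATUS_BASE = "/api/v1/pipeline-status";

// Path for the aggregated status of all five stages.
function pipelineStatusPath(hospitalId: string): string {
  return `${PIPELINE_STATUS_BASE}/${encodeURIComponent(hospitalId)}`;
}

// Path for the Stage 2 crawl statistics.
function crawlSummaryPath(hospitalId: string): string {
  return `${pipelineStatusPath(hospitalId)}/crawl-summary`;
}

// Path for the Stage 4 entity counts by type and status.
function entityCountsPath(hospitalId: string): string {
  return `${pipelineStatusPath(hospitalId)}/entity-counts`;
}
```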
Component Architecture
The wizard uses a two-level component hierarchy:
Page level:
- TaxonomyPipelinePage.tsx: route entry point, replaces TaxonomyPage

Wizard shell:
- PipelineWizard.tsx: sidebar + content container, manages active stage state
- PipelineSidebar.tsx: 5-stage nav with status icons from PipelineStatusResponse

Stage 1:
- SetupChecklist.tsx: config completeness checks

Stage 2:
- CrawlDashboard.tsx: stat cards + recrawl trigger

Stage 3:
- HubCandidateList.tsx: card grid with filters
- HubCandidateCard.tsx: single candidate card with confirm/reject

Stage 4:
- EntityTable.tsx: main entity review table
- EntityFilterBar.tsx: filter pills + search input
- EntityExpandedRow.tsx: expanded entity detail
- EntityBulkActions.tsx: bulk action toolbar
- RelationshipBrowser.tsx: tabular relationship cross-reference
- GraphPlaceholder.tsx: future graph visualization stub

Stage 5:
- PublishPage.tsx: status summary + publish trigger + version history
- PublishPreview.tsx: delta display
- VersionHistory.tsx: version table with rollback actions

Services (frontend):
- pipelineStatusService.ts: pipeline status + crawl summary API calls
- entityResolutionService.ts: entity CRUD + counts
- taxonomyPublishService.ts: preview / publish / rollback
- hubPageService.ts: hub candidates + confirm/reject
All data fetching uses @tanstack/react-query v5 for caching, background refetch, and optimistic updates. Stage components are lazy-loaded (React.lazy()) so each stage's bundle only loads when the operator navigates to it.
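One common way to organize cached data with TanStack Query is a set of query key helpers that share a prefix per hospital, so a single invalidation by prefix (for example after a rollback) can refresh every dependent query. The sketch below is library-free; the key layout and function names are assumptions, and isKeyPrefix is a simplified illustration of prefix matching on string keys, not the library's own matcher.

```typescript
// Hypothetical key layout: every pipeline query for a hospital shares this root.
function pipelineRootKey(hospitalId: string) {
  return ["pipeline", hospitalId] as const;
}

function pipelineStatusKey(hospitalId: string) {
  return [...pipelineRootKey(hospitalId), "status"] as const;
}

function crawlSummaryKey(hospitalId: string) {
  return [...pipelineRootKey(hospitalId), "crawl-summary"] as const;
}

function entityCountsKey(hospitalId: string) {
  return [...pipelineRootKey(hospitalId), "entity-counts"] as const;
}

// Simplified illustration of prefix matching on keys made of plain strings:
// a key matches when it starts with every element of the prefix.
function isKeyPrefix(prefix: readonly unknown[], key: readonly unknown[]): boolean {
  return prefix.length <= key.length && prefix.every((part, i) => part === key[i]);
}
```

With this layout, invalidating the root key for one hospital would cover its status, crawl summary, and entity count queries while leaving other hospitals' cached data alone.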
References
- Norman, D. A. (2013). The Design of Everyday Things (Revised ed.). Basic Books. (Progressive disclosure principle applied to pipeline staging)
- European Parliament. (2024). EU AI Act, Art. 14 — Human oversight. Regulation (EU) 2024/1689.
- TanStack. (2024). TanStack Query v5 documentation. https://tanstack.com/query/latest