Release Notes: March 28-31, 2026
88 commits | 65 code files | ~4,400 lines of production code | 7 database migrations
This sprint transformed the system from a ZOL-specific prototype into a hospital-agnostic platform ready for pilot testing. The taxonomy was deduplicated from 12,997 to 2,663 entities, a new AI-powered feedback investigation dashboard was built, and PDF ingestion was hardened against crashes. The eval score dipped mid-sprint but recovered to 99.0% (effective 99.7%) by March 31. Seven database migrations (054-060) landed, and every component was verified on the pilot server.
Pilot Readiness Assessment
| Criterion | Status | Evidence |
|---|---|---|
| Query accuracy | Pass | 99.0% (296/299), effective 99.7% with ground truth fixes |
| Safety layer | Pass | Zero medical advice incidents across all eval runs |
| All queries execute | Pass | Query decomposition crash (KeyError) fixed and deployed |
| Hospital-agnostic | Pass | All 259 ZOL-specific references removed; config-driven |
| Data integrity | Pass | 0 orphaned chunks, 0 orphaned embeddings, FK cascades verified |
| Infrastructure health | Pass | All 7 containers healthy, migrations at head (060) |
| PDF handling | Pass | Subprocess isolation prevents OOM worker crashes |
| Feedback tooling | Pass | AI investigation, override, flag, golden question promotion |
Verdict: The system is ready for pilot testing.
Detailed Changes
1. Hospital-Agnostic Architecture (Phases 1-4)
The single largest workstream: converting the entire codebase from hardcoded ZOL references to a database-driven, hospital-agnostic platform.
What changed:
- Audit: Identified 259 ZOL-specific references across 40 files
- Phase 1 (Config Extraction): New site_crawl_configs table (migration 058) and admin API for per-hospital crawl settings
- Phase 2 (Prompt Parameterization): All LLM prompts now receive hospital identity via a PromptContext dataclass loaded from the database at runtime
- Phase 3 (Generic Naming): ZOLCrawler renamed to HospitalCrawler; all ZOL-branded strings removed from API titles, app descriptions, and defaults
- Phase 4 (DB-driven Config Cache): SiteConfigCache loads hospital identity, boilerplate patterns, and crawl settings from the database on startup; no more in-code constants
Impact: A new hospital can be onboarded by inserting configuration rows — zero code changes required.
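As a rough illustration of the config-driven design, the sketch below shows how a prompt could pick up hospital identity from cached database rows. Only the names `PromptContext` and `SiteConfigCache` come from the codebase; the fields (`site_key`, `hospital_name`, `language`), the `load` and `prompt_context` methods, and `render_system_prompt` are assumptions made for this example, and the real classes may look different.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Mapping


@dataclass(frozen=True)
class PromptContext:
    """Hospital identity handed to every prompt (fields are hypothetical)."""
    hospital_name: str
    language: str


class SiteConfigCache:
    """In-memory cache filled once at startup from database rows."""

    def __init__(self) -> None:
        self._contexts: Dict[str, PromptContext] = {}

    def load(self, rows: Iterable[Mapping[str, str]]) -> None:
        # In a real service these rows would come from a database query.
        for row in rows:
            self._contexts[row["site_key"]] = PromptContext(
                hospital_name=row["hospital_name"],
                language=row["language"],
            )

    def prompt_context(self, site_key: str) -> PromptContext:
        return self._contexts[site_key]


def render_system_prompt(ctx: PromptContext) -> str:
    # No hospital name appears in code; it is injected from the context.
    return (
        f"You are the information assistant of {ctx.hospital_name}. "
        f"Answer in language '{ctx.language}'."
    )


cache = SiteConfigCache()
cache.load([{"site_key": "demo", "hospital_name": "Example Hospital", "language": "nl"}])
print(render_system_prompt(cache.prompt_context("demo")))
```

In this sketch, onboarding a second hospital means passing one more row to `load`; nothing in the prompt code changes.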
Key files:
- backend/app/services/site_config.py: DB-backed config cache
- backend/app/crawlers/hospital_crawler.py: generic crawler (was zol_crawler.py)
- backend/app/prompts.py: parameterized prompt templates
- backend/app/api/hospital_config.py: admin CRUD for crawl configs
2. Taxonomy Deduplication & SNOMED Gap Fill
The taxonomy had grown to 12,997 entities with massive duplication from multiple extraction runs. This sprint cleaned and enriched it.
What changed:
- Dedup (migration 056): Survivor selection algorithm kept the richest entity per group; reduced to 2,663 unique entities
- SNOMED Gap Fill (migration 057): 1,674 orphaned entities were linked via SNOMED hierarchy lookups + manual seed fallback
- LLM Auto-linker: New relationship_autolinker.py uses GPT-4.1-mini to classify and link remaining orphans during the publish pipeline
- Result: 3,591 relationships, only 43 orphans remaining (2.1%)
Taxonomy before/after:
| Metric | Before | After |
|---|---|---|
| Entities | 12,997 | 2,663 |
| Relationships | ~2,000 | 3,591 |
| Orphans | ~1,674 | 43 (2.1%) |
| Duplicates | Severe | Eliminated |
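The survivor selection step can be pictured with a minimal sketch like the one below. It is not the code in migration 056: the grouping key (case-insensitive name), the `Entity` fields, and the `richness` score are all assumptions chosen to illustrate "keep the richest entity per group".

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    # Hypothetical shape of a published taxonomy entity.
    id: int
    name: str
    snomed_code: Optional[str] = None
    description: Optional[str] = None
    synonyms: List[str] = field(default_factory=list)


def richness(e: Entity) -> int:
    """Assumed score: one point per populated field, plus one per synonym."""
    return int(e.snomed_code is not None) + int(bool(e.description)) + len(e.synonyms)


def select_survivors(entities: List[Entity]) -> Dict[int, int]:
    """Map every entity id to the id of the survivor of its duplicate group."""
    groups: Dict[str, List[Entity]] = defaultdict(list)
    for e in entities:
        groups[e.name.strip().casefold()].append(e)

    mapping: Dict[int, int] = {}
    for members in groups.values():
        # Highest richness wins; the lowest id breaks ties deterministically.
        survivor = max(members, key=lambda e: (richness(e), -e.id))
        for e in members:
            mapping[e.id] = survivor.id
    return mapping


entities = [
    Entity(1, "Cardiologie"),
    Entity(2, "cardiologie ", snomed_code="placeholder-code", synonyms=["Hartziekten"]),
    Entity(3, "Neurologie"),
]
print(select_survivors(entities))
```

Returning an id-to-survivor mapping, rather than deleting rows directly, leaves room to re-point relationships at the survivor before the duplicates are removed.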
Key files:
- backend/alembic/versions/056_dedup_published_entities.py
- backend/alembic/versions/057_snomed_relationship_gap_fill.py
- backend/app/services/taxonomy/relationship_autolinker.py
- backend/app/services/taxonomy/dedup_published.py
3. Feedback Investigation Dashboard
A new AI-powered system for analysing negative user feedback and improving answer quality.
What changed:
- AI Case Investigation: Click any feedback item to trigger a GPT-4.1 analysis that diagnoses why the answer was wrong, identifies missing chunks, and suggests fixes
- Override Mechanism: Admin can force a response override (correct answer, source citations) that is served to future identical queries
- Add to Golden Questions: Promote investigated questions directly into the evaluation benchmark
- Dashboard Metrics (Spec B): Telemetry stats with P95 latency comparison, Think Harder funnel visualization, trend chart
- Flag & Persist: Flag content for review; investigation results persist across page refreshes via backend storage
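To make the override mechanism concrete, here is a minimal sketch of how an admin-supplied answer could be served to later identical queries. The `Override` and `OverrideStore` names, the in-memory dictionary, and the normalization rule (case and whitespace are ignored) are assumptions for illustration; the notes above do not say how "identical" is defined or where overrides are stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Override:
    # Hypothetical payload: the corrected answer plus its source citations.
    answer: str
    citations: List[str] = field(default_factory=list)


def normalize(query: str) -> str:
    """Assumed matching rule: ignore case and redundant whitespace."""
    return " ".join(query.casefold().split())


class OverrideStore:
    """In-memory stand-in for the override columns added in migration 055."""

    def __init__(self) -> None:
        self._overrides: Dict[str, Override] = {}

    def set(self, query: str, override: Override) -> None:
        self._overrides[normalize(query)] = override

    def lookup(self, query: str) -> Optional[Override]:
        # Returns None when no override exists, so the caller can fall
        # through to the normal RAG pipeline.
        return self._overrides.get(normalize(query))


store = OverrideStore()
store.set("Where can I park?", Override("Use the visitor car park.", ["parking-page"]))
hit = store.lookup("  where can I PARK? ")
print(hit.answer if hit else "no override")
```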
Key files:
- backend/app/services/feedback_investigation_service.py
- backend/app/api/admin_feedback.py: 5 new endpoints
- frontend/src/pages/FeedbackDashboardPage.tsx: complete redesign
4. PDF & Document Pipeline Hardening
Ingesting 573 PDF brochures during this sprint exposed several crash patterns, which are now fixed.
What changed:
- Subprocess Isolation: PDF extraction now runs in a forked subprocess; if it OOMs or hangs, only the subprocess dies — the worker survives
- Image-only PDF Detection: PDFs with no extractable text are gracefully skipped instead of crashing
- Boilerplate Stripping: Hospital header/footer patterns (phone numbers, addresses) are stripped from chunks; patterns are DB-configurable per hospital
- Enrichment Retry: Gaps in contextual embeddings are retried inline before marking a document as completed
- Self-heal Purge: Soft-deleted documents are now cleaned up during the self-heal diagnostic cycle
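The subprocess isolation idea can be sketched as follows. This is an illustration under stated assumptions, not the code in document_service.py: the notes describe a forked subprocess, whereas this sketch launches a fresh interpreter with `subprocess.run` for portability, and `CHILD_SCRIPT` is a stand-in that fakes a crash or a hang based on the file name instead of parsing a real PDF.

```python
import subprocess
import sys
from typing import Optional

# Stand-in for the real extractor: a tiny script run in a separate interpreter.
# File names ending in "crash.pdf" or "hang.pdf" simulate the failure modes.
CHILD_SCRIPT = """
import sys, time
path = sys.argv[1]
if path.endswith("crash.pdf"):
    sys.exit(137)      # mimic the exit status of an OOM-killed process
if path.endswith("hang.pdf"):
    time.sleep(60)     # mimic a parser stuck on a malformed file
sys.stdout.write("text from " + path)
"""


def extract_text_isolated(path: str, timeout_s: float = 30.0) -> Optional[str]:
    """Extract text in a child process; return None if the child dies or hangs."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", CHILD_SCRIPT, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout


print(extract_text_isolated("brochure.pdf"))
print(extract_text_isolated("crash.pdf"))
```

Whatever happens inside the child, the calling worker only ever sees a string or `None`, which is the property that keeps it alive.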
Key files:
- backend/app/services/document_service.py
- backend/app/services/processing_service.py
- backend/app/services/diagnostics/self_heal_service.py
5. RAG Pipeline Improvements
Several retrieval quality improvements targeting navigational and practical queries.
What changed:
- Category-Aware Retrieval Boosting: Navigational queries (navigation_or_practical_info) get a 1.5x authority boost for chunks in relevant categories (Location, Contact, Financial, etc.) and a 0.7x penalty for unrelated categories
- Taxonomy Enrichment for Navigation: Practical queries now trigger taxonomy lookups (campus info, department details) even when no medical entity is detected
- Campus-Aware Doctor Lookup: Doctor queries with campus mentions now filter by campus via published taxonomy relationships
- Speculative Retrieval Merge: When intent classification reformulates a query, results from both the original and reformulated queries are merged and deduplicated
- Query Decomposition Fix: Fixed the KeyError: '"multi_hop"' crash caused by double f-string/format escaping; it was blocking all complex queries on pilot
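The category-aware boost described above could look roughly like this. The 1.5x and 0.7x factors, the intent label, and the three category names come from these notes; multiplying the factor into a single relevance score, and the function and constant names, are assumptions for the sketch (the real category list is longer, per the "etc." above).

```python
NAVIGATION_INTENT = "navigation_or_practical_info"
RELEVANT_CATEGORIES = {"Location", "Contact", "Financial"}  # partial list
BOOST = 1.5
PENALTY = 0.7


def adjust_score(score: float, intent: str, category: str) -> float:
    """Scale a chunk's score by category, but only for navigational queries."""
    if intent != NAVIGATION_INTENT:
        return score  # other intents are left untouched
    factor = BOOST if category in RELEVANT_CATEGORIES else PENALTY
    return score * factor


def rerank(chunks, intent):
    """Sort (chunk_id, score, category) tuples by adjusted score, best first."""
    return sorted(
        chunks,
        key=lambda c: adjust_score(c[1], intent, c[2]),
        reverse=True,
    )


chunks = [
    ("treatment-page", 0.80, "Treatment"),
    ("contact-page", 0.60, "Contact"),
]
print([c[0] for c in rerank(chunks, NAVIGATION_INTENT)])
print([c[0] for c in rerank(chunks, "medical_question")])
```

With these example scores, the contact page overtakes the treatment page only when the query is classified as navigational (0.60 x 1.5 beats 0.80 x 0.7).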
Key files:
- backend/app/services/search_service.py: category boosting
- backend/app/services/taxonomy/query_service.py: campus-aware lookups
- backend/app/services/rag/retrieval_mixin.py: speculative merge
- backend/app/services/query_decomposition_service.py: prompt fix
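For readers unfamiliar with the query decomposition crash, the sketch below reproduces one way a double-escaping bug of this kind can arise. The prompt text is hypothetical (the real prompt is not shown in these notes); what it demonstrates is standard Python behaviour: an f-string collapses `{{ }}` into `{ }`, so a later `.format()` call treats the JSON example as a replacement field.

```python
# Hypothetical prompt text; only the failure mode mirrors the reported bug.

# Buggy: the f-string already turns {{ }} into { }, so the later .format()
# call sees {"multi_hop": true} as a field named '"multi_hop"'.
BROKEN_TEMPLATE = (
    f'Reply with {{"multi_hop": true}} or {{"multi_hop": false}}. '
    f'Question: {{question}}'
)

# Fixed: a plain string, escaped exactly once for the single .format() call.
FIXED_TEMPLATE = (
    'Reply with {{"multi_hop": true}} or {{"multi_hop": false}}. '
    'Question: {question}'
)


def render(template: str, question: str) -> str:
    """Format the template, reporting a KeyError instead of raising it."""
    try:
        return template.format(question=question)
    except KeyError as exc:
        return f"KeyError: {exc}"


print(render(BROKEN_TEMPLATE, "Which campus has an MRI?"))
print(render(FIXED_TEMPLATE, "Which campus has an MRI?"))
```

The first call reports `KeyError: '"multi_hop"'`; the second returns the prompt with the JSON example intact.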
6. Entity Resolution UI
Merge candidate management improvements for the taxonomy pipeline wizard.
What changed:
- Merge/Reject buttons added to NEEDS_REVIEW candidates (previously only visible for AUTO_MERGE)
- Tiered Bulk Merge: One-click approval for high-confidence candidates, in tiers from 100% token overlap down to 80%+
- SNOMED Bulk Merge: Now includes NEEDS_REVIEW candidates, not just AUTO_MERGE
- No-confirmation individual merge: Single-click merge for reviewed candidates
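The tiering by token overlap might be computed along these lines. The notes do not define "token overlap", so this sketch assumes Jaccard overlap on lowercased, whitespace-split tokens; the tier labels and function names are likewise hypothetical, and only the 100% and 80% thresholds come from the text above.

```python
from typing import Optional


def token_overlap(a: str, b: str) -> float:
    """Assumed metric: shared tokens divided by all distinct tokens (Jaccard)."""
    tokens_a = set(a.casefold().split())
    tokens_b = set(b.casefold().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def merge_tier(a: str, b: str) -> Optional[str]:
    """Return the bulk-merge tier for a candidate pair, or None if below 80%."""
    overlap = token_overlap(a, b)
    if overlap == 1.0:
        return "tier-100"
    if overlap >= 0.8:
        return "tier-80-plus"
    return None  # left for individual review


print(merge_tier("Dienst Cardiologie", "cardiologie dienst"))
print(merge_tier("a b c d e", "a b c d"))
print(merge_tier("Cardiologie", "Neurologie"))
```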
7. Data Integrity & Migrations
Seven database migrations ensuring referential integrity and clean data.
| Migration | Purpose |
|---|---|
| 054 | merge_candidates table for fuzzy entity dedup |
| 055 | feedback_investigations table + override columns |
| 056 | Deduplicate published entities (12,997 → 2,663) |
| 057 | SNOMED relationship gap fill + manual seeds |
| 058 | site_crawl_configs for hospital-agnostic crawl config |
| 059 | Seed golden_pages with ZOL navigational pages |
| 060 | Change ingestion_results FK from SET NULL to CASCADE |
Data cleanup performed:
- 5,732 orphaned ingestion results deleted
- 12 zombie documents (completed, 0 chunks) removed
- 15 NULL-document_id ingestion results cleaned
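The effect of migration 060 can be demonstrated with a small, self-contained experiment. It uses an in-memory SQLite database purely so the example runs anywhere; the production database engine, the `documents` table name, and the column layout are assumptions, and only `ingestion_results`, `document_id`, and the SET NULL to CASCADE change come from these notes.

```python
import sqlite3
from typing import Tuple


def rows_left_after_delete(on_delete: str) -> Tuple[int, int]:
    """Delete a document and return (remaining results, results with NULL FK)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE ingestion_results ("
        " id INTEGER PRIMARY KEY,"
        f" document_id INTEGER REFERENCES documents(id) ON DELETE {on_delete})"
    )
    conn.execute("INSERT INTO documents (id) VALUES (1)")
    conn.executemany(
        "INSERT INTO ingestion_results (document_id) VALUES (?)", [(1,), (1,)]
    )
    conn.execute("DELETE FROM documents WHERE id = 1")
    remaining = conn.execute("SELECT COUNT(*) FROM ingestion_results").fetchone()[0]
    null_fk = conn.execute(
        "SELECT COUNT(*) FROM ingestion_results WHERE document_id IS NULL"
    ).fetchone()[0]
    conn.close()
    return remaining, null_fk


print("SET NULL:", rows_left_after_delete("SET NULL"))
print("CASCADE: ", rows_left_after_delete("CASCADE"))
```

Under SET NULL the two result rows survive with a NULL `document_id`, which is the kind of orphan the cleanup above removed; under CASCADE they are deleted together with the document.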
8. Evaluation Results
| Date | Score | Context |
|---|---|---|
| March 27 | 99.7% (298/299) | RAG mixin split, dedup, all fixes |
| March 29 | 98.7% (295/299) | PDF corpus scaling incident (-1.0%) |
| March 30 | 97.7% (293/299) | Post-gap-fill, taxonomy in flux |
| March 31 | 99.0% (296/299) | Taxonomy dedup + gap fill + Graph ON |
The 3 remaining failures:
- GQ-195: Non-deterministic (buikpijn, "abdominal pain", routes to Abdominale Heelkunde (abdominal surgery) or Kindergeneeskunde (pediatrics) depending on context); needs pediatric keyword boosting
- GQ-043, GQ-124: Ground truth corrections applied; effective score with fixes: 99.7%
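The raw and effective scores follow directly from the pass counts. The short check below assumes the effective score simply counts GQ-043 and GQ-124 as passes once their ground truth is corrected; the `pass_rate` helper is introduced here for illustration.

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of golden questions passed, rounded to one decimal."""
    return round(100 * passed / total, 1)


raw = pass_rate(296, 299)            # March 31 run
effective = pass_rate(296 + 2, 299)  # GQ-043 and GQ-124 counted as passes
print(raw, effective)
```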
9. Documentation & Tooling
- Architecture-as-Code: New Docusaurus plugin that generates architecture index from frontmatter metadata
- Multi-tenancy docs: Comprehensive page covering the hospital-agnostic design
- Taxonomy dedup/gap-fill docs: Technical deep-dive into the dedup algorithm and SNOMED gap fill
- Feedback dashboard metrics docs: Dashboard feature documentation
- PDF corpus scaling incident: Academic analysis of the 1% eval regression from PDF ingestion
- Prompt engineering docs: New page covering the prompt architecture
- 45+ updated documentation pages across all sections
Current System State
| Component | Value |
|---|---|
| Documents | 2,522 completed |
| Chunks | 10,437 (all with embeddings) |
| Taxonomy entities | 2,663 (deduplicated) |
| Taxonomy relationships | 3,591 |
| Orphan rate | 2.1% (43 entities) |
| Golden questions | 302 (v3.6) |
| Database migrations | Head at 060 |
| Eval score | 99.0% (effective 99.7%) |
| Containers | 7/7 healthy |
| Medical advice incidents | ZERO |