
Release Notes: March 28-31, 2026

88 commits | 65 code files | ~4,400 lines of production code | 7 database migrations

2-Minute Summary

This sprint transformed the system from a ZOL-specific prototype into a hospital-agnostic platform ready for pilot testing. The taxonomy was deduplicated from 12,997 to 2,663 entities, a new AI-powered feedback investigation dashboard was built, and PDF ingestion was hardened against crashes. The eval score dipped to 97.7% mid-sprint and recovered to 99.0% (effective 99.7%) by March 31. Seven database migrations (054-060) landed, and every component was verified on the pilot server.


Pilot Readiness Assessment

| Criterion | Status | Evidence |
|---|---|---|
| Query accuracy | Pass | 99.0% (296/299), effective 99.7% with ground truth fixes |
| Safety layer | Pass | Zero medical advice incidents across all eval runs |
| All queries execute | Pass | Query decomposition crash (KeyError) fixed and deployed |
| Hospital-agnostic | Pass | All 259 ZOL-specific references removed; config-driven |
| Data integrity | Pass | 0 orphaned chunks, 0 orphaned embeddings, FK cascades verified |
| Infrastructure health | Pass | All 7 containers healthy, migrations at head (060) |
| PDF handling | Pass | Subprocess isolation prevents OOM worker crashes |
| Feedback tooling | Pass | AI investigation, override, flag, golden question promotion |

Verdict: The system is ready for pilot testing.


Detailed Changes

1. Hospital-Agnostic Architecture (Phases 1-4)

The single largest workstream: converting the entire codebase from hardcoded ZOL references to a database-driven, hospital-agnostic platform.

What changed:

  • Audit: Identified 259 ZOL-specific references across 40 files
  • Phase 1 — Config Extraction: New site_crawl_configs table (migration 058) and admin API for per-hospital crawl settings
  • Phase 2 — Prompt Parameterization: All LLM prompts now receive hospital identity via PromptContext dataclass loaded from the database at runtime
  • Phase 3 — Generic Naming: ZOLCrawler renamed to HospitalCrawler; all ZOL-branded strings removed from API titles, app descriptions, and defaults
  • Phase 4 — DB-driven Config Cache: SiteConfigCache loads hospital identity, boilerplate patterns, and crawl settings from the database on startup; no more in-code constants

Impact: A new hospital can be onboarded by inserting configuration rows — zero code changes required.
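
As a rough illustration of the prompt parameterization in Phase 2, the sketch below shows how a hospital-identity dataclass could feed a prompt template. Only the name PromptContext comes from this release; the field names, the template text, and the helper functions are hypothetical, and a plain dict stands in for the database row.

```python
from dataclasses import asdict, dataclass
from string import Template


@dataclass(frozen=True)
class PromptContext:
    """Hospital identity injected into prompts. Field names are assumed."""
    hospital_name: str
    answer_language: str
    contact_phone: str


# Illustrative template text, not the project's real prompt.
SYSTEM_PROMPT = Template(
    "You are the information assistant for $hospital_name. "
    "Answer in $answer_language. "
    "For anything urgent, refer the user to $contact_phone."
)


def context_from_row(row: dict) -> PromptContext:
    """Build a PromptContext from a config row (a dict stands in for the DB)."""
    return PromptContext(
        hospital_name=row["hospital_name"],
        answer_language=row["answer_language"],
        contact_phone=row["contact_phone"],
    )


def render_system_prompt(ctx: PromptContext) -> str:
    """Fill the template; substitute() raises KeyError if a field is missing."""
    return SYSTEM_PROMPT.substitute(asdict(ctx))


# Placeholder values for a hypothetical hospital.
row = {
    "hospital_name": "Example General Hospital",
    "answer_language": "Dutch",
    "contact_phone": "+32 000 00 00",
}
print(render_system_prompt(context_from_row(row)))
```

Under this design, onboarding a second hospital means adding another row; the template and the rendering code stay untouched.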

Key files:

  • backend/app/services/site_config.py — DB-backed config cache
  • backend/app/crawlers/hospital_crawler.py — generic crawler (was zol_crawler.py)
  • backend/app/prompts.py — parameterized prompt templates
  • backend/app/api/hospital_config.py — admin CRUD for crawl configs

2. Taxonomy Deduplication & SNOMED Gap Fill

The taxonomy had grown to 12,997 entities with massive duplication from multiple extraction runs. This sprint cleaned and enriched it.

What changed:

  • Dedup (migration 056): Survivor selection algorithm kept the richest entity per group; reduced to 2,663 unique entities
  • SNOMED Gap Fill (migration 057): 1,674 orphaned entities were linked via SNOMED hierarchy lookups + manual seed fallback
  • LLM Auto-linker: New relationship_autolinker.py uses GPT-4.1-mini to classify and link remaining orphans during the publish pipeline
  • Result: 3,591 relationships, only 43 orphans remaining (2.1%)
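
The survivor selection idea (keep the richest entity per duplicate group) can be sketched as follows. This is a minimal sketch under assumptions: the grouping key (case-insensitive name), the Entity fields, and the richness ordering are all hypothetical and may differ from what migration 056 actually does.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Entity:
    """Simplified taxonomy entity. Fields are assumed for illustration."""
    id: int
    name: str
    description: str = ""
    snomed_code: Optional[str] = None
    relationship_count: int = 0


def richness(entity: Entity) -> Tuple[bool, int, int, int]:
    """Assumed ordering: SNOMED-coded first, then most relationships,
    then longest description, then lowest id as a stable tie-break."""
    return (
        entity.snomed_code is not None,
        entity.relationship_count,
        len(entity.description),
        -entity.id,
    )


def select_survivors(entities: List[Entity]) -> List[Entity]:
    """Group entities by normalized name and keep the richest in each group."""
    groups: Dict[str, List[Entity]] = defaultdict(list)
    for entity in entities:
        groups[entity.name.strip().casefold()].append(entity)
    return [max(group, key=richness) for group in groups.values()]


entities = [
    Entity(1, "Cardiologie"),
    Entity(2, " cardiologie", "Dienst hartziekten", "394579002", 4),
    Entity(3, "Neurologie", relationship_count=1),
]
survivors = select_survivors(entities)
print([e.id for e in survivors])
```

A real migration would also need to repoint relationships from the discarded duplicates to the survivor, which this sketch leaves out.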

Taxonomy before/after:

| Metric | Before | After |
|---|---|---|
| Entities | 12,997 | 2,663 |
| Relationships | ~2,000 | 3,591 |
| Orphans | ~1,674 | 43 (2.1%) |
| Duplicates | Severe | Eliminated |

Key files:

  • backend/alembic/versions/056_dedup_published_entities.py
  • backend/alembic/versions/057_snomed_relationship_gap_fill.py
  • backend/app/services/taxonomy/relationship_autolinker.py
  • backend/app/services/taxonomy/dedup_published.py

3. Feedback Investigation Dashboard

A new AI-powered system for analysing negative user feedback and improving answer quality.

What changed:

  • AI Case Investigation: Click any feedback item to trigger a GPT-4.1 analysis that diagnoses why the answer was wrong, identifies missing chunks, and suggests fixes
  • Override Mechanism: Admin can force a response override (correct answer, source citations) that is served to future identical queries
  • Add to Golden Questions: Promote investigated questions directly into the evaluation benchmark
  • Dashboard Metrics (Spec B): Telemetry stats with P95 latency comparison, Think Harder funnel visualization, trend chart
  • Flag & Persist: Flag content for review; investigation results persist across page refreshes via backend storage
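
One way the override mechanism could match future queries is a lookup keyed on a hash of the query text, sketched below. The OverrideStore class, its methods, and the whitespace and case normalization are assumptions for illustration; the release only states that overrides are served to identical queries.

```python
import hashlib
import re
from typing import Dict, List, Optional


def normalize_query(query: str) -> str:
    """Assumed normalization: trim, collapse whitespace, ignore case."""
    return re.sub(r"\s+", " ", query.strip()).casefold()


def query_key(query: str) -> str:
    """Stable key for a query, derived from its normalized text."""
    return hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()


class OverrideStore:
    """In-memory stand-in for the override columns in the database."""

    def __init__(self) -> None:
        self._overrides: Dict[str, dict] = {}

    def set_override(self, query: str, answer: str, sources: List[str]) -> None:
        self._overrides[query_key(query)] = {"answer": answer, "sources": sources}

    def lookup(self, query: str) -> Optional[dict]:
        """Return the admin override for this query, or None to fall back to RAG."""
        return self._overrides.get(query_key(query))


store = OverrideStore()
store.set_override(
    "Waar kan ik parkeren?",
    "Parking is available next to the main entrance.",
    ["https://example.org/parking"],
)
print(store.lookup("  waar kan ik   parkeren? "))
```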

Key files:

  • backend/app/services/feedback_investigation_service.py
  • backend/app/api/admin_feedback.py — 5 new endpoints
  • frontend/src/pages/FeedbackDashboardPage.tsx — complete redesign

4. PDF & Document Pipeline Hardening

Ingesting 573 PDF brochures during this sprint exposed several crash patterns, which the changes below address.

What changed:

  • Subprocess Isolation: PDF extraction now runs in a forked subprocess; if it OOMs or hangs, only the subprocess dies — the worker survives
  • Image-only PDF Detection: PDFs with no extractable text are gracefully skipped instead of crashing
  • Boilerplate Stripping: Hospital header/footer patterns (phone numbers, addresses) are stripped from chunks; patterns are DB-configurable per hospital
  • Enrichment Retry: Gaps in contextual embeddings are retried inline before marking a document as completed
  • Self-heal Purge: Soft-deleted documents are now cleaned up during the self-heal diagnostic cycle
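
The isolation pattern can be sketched with the standard library alone. Note the swap: this example launches a separate Python interpreter with subprocess.run rather than forking, and the function name, the timeout value, and the child script are hypothetical. The point it illustrates is the same: a crash, OOM kill, or hang in the child is reported to the parent as a failed extraction instead of taking the worker down.

```python
import subprocess
import sys
from typing import Optional


def extract_text_isolated(
    child_code: str, pdf_path: str, timeout_s: float = 60.0
) -> Optional[str]:
    """Run extraction code in a child interpreter.

    Returns the extracted text, or None if the child crashed, timed out,
    or produced no text (for example an image-only PDF).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code, pdf_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return None
    if proc.returncode != 0:
        # Non-zero exit covers crashes and kills by the OOM killer.
        return None
    text = proc.stdout.strip()
    return text or None


# Stand-in child scripts; a real one would call the PDF library.
OK_CHILD = "import sys; print('text from ' + sys.argv[1])"
CRASH_CHILD = "import os; os._exit(137)"
HANG_CHILD = "import time; time.sleep(30)"

print(extract_text_isolated(OK_CHILD, "brochure.pdf"))
print(extract_text_isolated(CRASH_CHILD, "brochure.pdf"))
print(extract_text_isolated(HANG_CHILD, "brochure.pdf", timeout_s=0.5))
```

Treating an empty result the same as a failure keeps the caller simple: it can skip the document in either case and move on to the next one.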

Key files:

  • backend/app/services/document_service.py
  • backend/app/services/processing_service.py
  • backend/app/services/diagnostics/self_heal_service.py

5. RAG Pipeline Improvements

Several retrieval quality improvements targeting navigational and practical queries.

What changed:

  • Category-Aware Retrieval Boosting: Navigational queries (navigation_or_practical_info) get a 1.5x authority boost for chunks in relevant categories (Location, Contact, Financial, etc.) and a 0.7x penalty for unrelated categories
  • Taxonomy Enrichment for Navigation: Practical queries now trigger taxonomy lookups (campus info, department details) even when no medical entity is detected
  • Campus-Aware Doctor Lookup: Doctor queries with campus mentions now filter by campus via published taxonomy relationships
  • Speculative Retrieval Merge: When intent classification reformulates a query, results from both the original and the reformulated query are merged and deduplicated
  • Query Decomposition Fix: Fixed the KeyError: '"multi_hop"' crash caused by double f-string/format escaping, which was blocking all complex queries on the pilot server
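
The category-aware boosting rule can be sketched as a score multiplier. The 1.5x and 0.7x factors, the intent label, and the three category names come from the notes above; applying the factors multiplicatively to a single relevance score, and the function names, are assumptions.

```python
from typing import FrozenSet, List, Tuple

NAVIGATIONAL_INTENT = "navigation_or_practical_info"
# Partial list taken from the notes ("Location, Contact, Financial, etc.").
RELEVANT_CATEGORIES: FrozenSet[str] = frozenset({"Location", "Contact", "Financial"})


def adjust_score(
    score: float,
    category: str,
    intent: str,
    boost: float = 1.5,
    penalty: float = 0.7,
) -> float:
    """Boost relevant categories and penalize the rest, for navigational intent only."""
    if intent != NAVIGATIONAL_INTENT:
        return score
    return score * (boost if category in RELEVANT_CATEGORIES else penalty)


def rerank(chunks: List[Tuple[str, str, float]], intent: str) -> List[Tuple[str, float]]:
    """chunks are (chunk_id, category, score); returns ids sorted by adjusted score."""
    adjusted = [(cid, adjust_score(score, cat, intent)) for cid, cat, score in chunks]
    return sorted(adjusted, key=lambda item: item[1], reverse=True)


chunks = [
    ("treatment-page", "Treatment", 0.80),
    ("parking-page", "Location", 0.60),
]
print(rerank(chunks, NAVIGATIONAL_INTENT))
print(rerank(chunks, "medical_information"))
```

In this toy example the parking page overtakes the treatment page for a navigational query (0.60 x 1.5 against 0.80 x 0.7), while the original order is kept for any other intent.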

Key files:

  • backend/app/services/search_service.py — category boosting
  • backend/app/services/taxonomy/query_service.py — campus-aware lookups
  • backend/app/services/rag/retrieval_mixin.py — speculative merge
  • backend/app/services/query_decomposition_service.py — prompt fix
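
For readers unfamiliar with the double-escaping failure behind the query decomposition fix, here is one plausible reproduction. The prompt text and function names are hypothetical, not the project's real prompt. An f-string collapses `{{` and `}}` into single braces, so a later str.format() call treats the JSON example as a replacement field named "multi_hop" (quotes included).

```python
def build_prompt_broken(hospital: str) -> str:
    """Escapes braces once, but the result is formatted a second time later."""
    # After f-string evaluation the text contains {"multi_hop": true} and {query}.
    return f'You assist {hospital}. Reply as {{"multi_hop": true}}.\nQuery: {{query}}'


# Fixed variant: a plain (non-f) template that is formatted exactly once,
# so a single level of brace escaping is enough.
PROMPT_TEMPLATE = 'You assist {hospital}. Reply as {{"multi_hop": true}}.\nQuery: {query}'


def build_prompt_fixed(hospital: str, query: str) -> str:
    return PROMPT_TEMPLATE.format(hospital=hospital, query=query)


try:
    build_prompt_broken("Example Hospital").format(query="Which campus has an MRI?")
except KeyError as exc:
    print("KeyError:", exc)

print(build_prompt_fixed("Example Hospital", "Which campus has an MRI?"))
```

The first print shows `KeyError: '"multi_hop"'`, the same message quoted in the fix. Templates that pass through both an f-string and format() need four braces per literal brace, which is why formatting once is the less error-prone option.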

6. Entity Resolution UI

Merge candidate management improvements for the taxonomy pipeline wizard.

What changed:

  • Merge/Reject buttons added to NEEDS_REVIEW candidates (previously only visible for AUTO_MERGE)
  • Tiered Bulk Merge: One-click approval for high-confidence candidates, in tiers from 100% token overlap down to 80%+
  • SNOMED Bulk Merge: Now includes NEEDS_REVIEW candidates, not just AUTO_MERGE
  • No-confirmation individual merge: Single-click merge for reviewed candidates
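
To make the tiering concrete, the sketch below scores a merge candidate by token overlap and assigns it to a bulk-merge tier. The notes do not define token overlap, so this assumes Jaccard similarity over lowercased word tokens; the tier names and function names are also hypothetical. Only the 100% and 80% thresholds come from the notes.

```python
import re
from typing import Optional, Set


def tokens(name: str) -> Set[str]:
    """Lowercased word tokens of an entity name."""
    return set(re.findall(r"\w+", name.casefold()))


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def bulk_merge_tier(a: str, b: str) -> Optional[str]:
    """Return the tier a candidate pair falls into, or None if below 80%."""
    overlap = token_overlap(a, b)
    if overlap == 1.0:
        return "exact"
    if overlap >= 0.8:
        return "high"
    return None


print(bulk_merge_tier("Dienst Cardiologie", "cardiologie dienst"))
print(bulk_merge_tier("centrum voor pijn en revalidatie", "centrum voor pijn revalidatie"))
print(bulk_merge_tier("Pijnkliniek campus Genk", "Pijnkliniek Genk"))
```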

7. Data Integrity & Migrations

Seven database migrations ensuring referential integrity and clean data.

| Migration | Purpose |
|---|---|
| 054 | merge_candidates table for fuzzy entity dedup |
| 055 | feedback_investigations table + override columns |
| 056 | Deduplicate published entities (12,997 → 2,663) |
| 057 | SNOMED relationship gap fill + manual seeds |
| 058 | site_crawl_configs for hospital-agnostic crawl config |
| 059 | Seed golden_pages with ZOL navigational pages |
| 060 | Change ingestion_results FK from SET NULL to CASCADE |

Data cleanup performed:

  • 5,732 orphaned ingestion results deleted
  • 12 zombie documents (completed, 0 chunks) removed
  • 15 NULL-document_id ingestion results cleaned
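
The effect of migration 060 can be demonstrated on a toy schema. This sketch uses SQLite purely because it ships with Python; the production database engine, the parent table name documents, and the column layout are assumptions. With SET NULL, deleting a document leaves ingestion results behind with a NULL document_id, which is the kind of leftover the cleanup above removed. With CASCADE, they are deleted along with the document.

```python
import sqlite3
from typing import Tuple


def rows_after_delete(on_delete: str) -> Tuple[int, int]:
    """Delete one document and return (total ingestion rows, rows with NULL FK)."""
    if on_delete not in ("SET NULL", "CASCADE"):
        raise ValueError("unsupported ON DELETE action")
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        f"""
        CREATE TABLE documents (id INTEGER PRIMARY KEY);
        CREATE TABLE ingestion_results (
            id INTEGER PRIMARY KEY,
            document_id INTEGER REFERENCES documents(id) ON DELETE {on_delete}
        );
        INSERT INTO documents (id) VALUES (1), (2);
        INSERT INTO ingestion_results (document_id) VALUES (1), (1), (2);
        """
    )
    conn.execute("DELETE FROM documents WHERE id = 1")
    total = conn.execute("SELECT COUNT(*) FROM ingestion_results").fetchone()[0]
    nulls = conn.execute(
        "SELECT COUNT(*) FROM ingestion_results WHERE document_id IS NULL"
    ).fetchone()[0]
    conn.close()
    return total, nulls


print("SET NULL:", rows_after_delete("SET NULL"))
print("CASCADE:", rows_after_delete("CASCADE"))
```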

8. Evaluation Results

| Date | Score | Context |
|---|---|---|
| March 27 | 99.7% (298/299) | RAG mixin split, dedup, all fixes |
| March 29 | 98.7% (295/299) | PDF corpus scaling incident (-1.0%) |
| March 30 | 97.7% (293/299) | Post-gap-fill, taxonomy in flux |
| March 31 | 99.0% (296/299) | Taxonomy dedup + gap fill + Graph ON |

The 3 remaining failures:

  • GQ-195: Non-deterministic (buikpijn routes to Abdominale Heelkunde vs Kindergeneeskunde depending on context) — needs pediatric keyword boosting
  • GQ-043, GQ-124: Ground truth corrections applied; effective score with fixes: 99.7%
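
The relationship between the reported and effective scores is simple arithmetic, shown here for reference. The sketch assumes that the effective score counts GQ-043 and GQ-124 as passes once their ground truth is corrected; the function name is hypothetical.

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of golden questions passed, rounded to one decimal."""
    return round(100 * passed / total, 1)


TOTAL = 299
reported = pass_rate(296, TOTAL)        # 3 failures: GQ-195, GQ-043, GQ-124
effective = pass_rate(296 + 2, TOTAL)   # GQ-043 and GQ-124 counted as passes

print(reported, effective)  # 99.0 99.7
```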

9. Documentation & Tooling

  • Architecture-as-Code: New Docusaurus plugin that generates architecture index from frontmatter metadata
  • Multi-tenancy docs: Comprehensive page covering the hospital-agnostic design
  • Taxonomy dedup/gap-fill docs: Technical deep-dive into the dedup algorithm and SNOMED gap fill
  • Feedback dashboard metrics docs: Dashboard feature documentation
  • PDF corpus scaling incident: Academic analysis of the 1% eval regression from PDF ingestion
  • Prompt engineering docs: New page covering the prompt architecture
  • 45+ updated documentation pages across all sections

Current System State

| Component | Value |
|---|---|
| Documents | 2,522 completed |
| Chunks | 10,437 (all with embeddings) |
| Taxonomy entities | 2,663 (deduplicated) |
| Taxonomy relationships | 3,591 |
| Orphan rate | 2.1% (43 entities) |
| Golden questions | 302 (v3.6) |
| Database migrations | Head at 060 |
| Eval score | 99.0% (effective 99.7%) |
| Containers | 7/7 healthy |
| Medical advice incidents | ZERO |