Pilot Review Deck — ZOL Intelligent Search (May 2026)

A reviewer-facing slide deck rendered as a markdown page. Each ## is a slide; bullets are the speaker-track. Inline links land on the deeper engineering material so a reviewer can follow any claim back to source. Every quantitative claim on this page traces to a specific source-of-truth file (thesis Chapter 4, the audit ledger, the bibliography, or the Operations dashboard).

For deeper engineering material: see Voice Stack Compendium. For competitive positioning: see SOTA Positioning Matrix (May 2026). For empirical evidence: see thesis Chapter 4 — Results.


1. What we built

A production retrieval-augmented-generation system that replaces the keyword search on the ZOL hospital website with a natural-language interface in chat and on the phone, grounded in the hospital's own published content and bounded by a five-layer safety architecture.

  • One backend, two channels (web chat + voice via Twilio + LiveKit SIP), one retrieval pipeline.
  • Multi-tenant by construction — a second hospital onboards via tenant-overlay rows, not source changes.
  • Engineering depth: see Voice Stack Compendium; empirical depth: see thesis Chapter 4 — Results.

2. The problem

ZOL's website carries comprehensive healthcare information, but visitors cannot find it. Keyword search fails because patients think in colloquial terms; content is filed under medical terminology.

  • ~100,000 monthly website visitors and ~25,000 monthly search queries, i.e. roughly one search query for every four visitors (thesis §1.1).
  • Content corpus: 1,000+ patient brochures, 700+ condition descriptions, plus doctor and department pages (thesis §1.1).
  • The current Elasticsearch keyword search suffers from the vocabulary-mismatch problem (Manning et al. 2008): "hoge bloeddruk" (high blood pressure) does not retrieve content filed as "hypertensie."
  • Helpdesk and call-centre overflow with questions whose answers already live in published brochures (thesis §1.1).
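
The mismatch can be reproduced in a few lines. This is a minimal sketch, assuming a toy two-document corpus and a hypothetical lay-to-clinical synonym map; neither the document ids nor the map reflect ZOL's actual index, and the deployed system addresses the problem with hybrid retrieval rather than a synonym table.

```python
# Toy corpus filed under clinical terminology (illustrative only, not ZOL data).
DOCS = {
    "brochure-001": "hypertensie: oorzaken en behandeling",
    "brochure-002": "diabetes type 2: voeding en beweging",
}

# Hypothetical lay-to-clinical synonym map, purely for illustration.
SYNONYMS = {"hoge bloeddruk": "hypertensie"}


def keyword_search(query: str) -> list[str]:
    """Return ids of documents that contain every query term verbatim."""
    terms = query.lower().split()
    return [
        doc_id
        for doc_id, text in DOCS.items()
        if all(term in text.lower() for term in terms)
    ]


def expanded_search(query: str) -> list[str]:
    """Rewrite known colloquial phrases to clinical terms, then search."""
    normalized = query.lower()
    for lay, clinical in SYNONYMS.items():
        normalized = normalized.replace(lay, clinical)
    return keyword_search(normalized)
```

Here `keyword_search("hoge bloeddruk")` finds nothing, while `expanded_search("hoge bloeddruk")` reaches the brochure filed under "hypertensie".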

3. The system

A layered architecture (Voice Stack Compendium §2): PSTN → SIP → LiveKit media → Deepgram Nova-3 STT → backend cognition → ElevenLabs TTS → Postgres telemetry. The chat channel reuses the same retrieval pipeline.

  • RAG core — pgvector + BM25 hybrid retrieval with cross-encoder reranking, conditional knowledge-graph injection, and structured_call schema-validated outputs (architecture/system-overview).
  • Voice channel — agentic LLM with three tools (search_hospital_kb, transfer_to_helpdesk, end_call); regex pre-filter and post-filter wrap the LLM (voice/architecture, ADR-0049, ADR-0051).
  • Safety architecture — five independent layers (intent classification, GCG anomaly detection, quality gate, LLM-as-judge validation, output guardrails); voice adds a regex pre-filter and a medical-disclaimer prepender (safety/overview, voice/triple-defense).
  • Multi-tenancy — DB-driven for web/RAG, YAML-overlay for voice; tenant identity is bound to the Keycloak JWT claim (architecture/multi-tenancy).
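
The deck does not say how the BM25 and pgvector result lists are merged before reranking. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the project's method; the function name, the chunk ids, and the constant `k = 60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from the two retrievers.
bm25_hits = ["c1", "c2", "c3"]
vector_hits = ["c2", "c4", "c1"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Chunks that both retrievers return (`c1`, `c2`) rise above chunks that only one of them returns, which is the property a hybrid pipeline relies on before the cross-encoder reranks the shortlist.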

4. Headline metrics

The numbers below are the empirical record. Each cell traces to a dated, immutable evaluation report or to the audit register; markers are propagated literally from source (a "not yet measured" marker is never replaced by a placeholder value).

| Metric | Value | Source |
| --- | --- | --- |
| Golden-eval pass rate | 99.0% (296/299) full run; effective 99.7% after ground-truth corrections | thesis §4.1, Table 4.1 |
| Entity recall | 0.932 (95% CI [0.916, 0.965]) | thesis §4.1.2, Table 4.2 |
| Faithfulness (best ablation) | 0.959 (Guardrails-only) | thesis §4.2.1, Table 4.4 |
| Median end-to-end latency | 7,829 ms (P50, 302 queries) | thesis §4.1.3, Table 4.3 |
| P95 / P99 latency | P90 12,182 ms / P99 20,925 ms (chat channel) | thesis §4.1.3, Table 4.3 |
| Voice-channel P95 (pilot) | Not yet measured (Phase 5 backfill commitment) | SOTA matrix §2.1 |
| Safety-refusal accuracy | 100% (14/14 safety + 12/12 GCG) | thesis §4.5, Table 4.9 |
| Medical-advice incidents | 0 across all evaluation runs | thesis §4.5, Table 4.9 |
| Categories at 100% pass | 18 of 21 | thesis §4.1.1, Table 4.1 |
| Estimated monthly cost | ~$8.70/month at 25K queries (40% cache hit rate) | performance/overview cost table |
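
Two of the figures above can be re-derived from their stated inputs. The sketch below is a plausibility check only; the variable names are illustrative, and the per-uncached-query figure assumes that cache hits cost nothing, which the source does not state.

```python
# Pass rate from the raw counts reported for the full run.
passed, total = 296, 299
pass_rate_pct = round(100 * passed / total, 1)

# Cost per query from the monthly estimate.
monthly_cost_usd = 8.70
monthly_queries = 25_000
cache_hit_rate = 0.40

cost_per_query = monthly_cost_usd / monthly_queries

# Assumption: cached answers are free, so only uncached queries incur cost.
uncached_queries = monthly_queries * (1 - cache_hit_rate)
cost_per_uncached_query = monthly_cost_usd / uncached_queries

print(f"pass rate: {pass_rate_pct}%")
print(f"cost per query: ${cost_per_query:.6f}")
print(f"cost per uncached query: ${cost_per_uncached_query:.6f}")
```

Under these inputs the average works out to about $0.00035 per query, or about $0.00058 per query that actually reaches the model.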

The conditional knowledge-graph configuration improves pass rate by 1.7 percentage points over graph-off (97.2% → 99.0%) for navigational and relationship queries, while unconditional injection slightly degrades the average (thesis §4.3, Table 4.7). This conditional-injection insight is documented as one of the project's contributions.
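
A minimal sketch of what conditional injection could look like, assuming the intent classifier emits labels such as `navigational` and `relationship`. The label strings, the function name, and the flat list-of-strings context are hypothetical; the thesis defines the actual gating rule.

```python
# Assumed intent labels for which graph facts are expected to help.
GRAPH_INTENTS = {"navigational", "relationship"}


def build_context(intent: str, chunks: list[str], graph_facts: list[str]) -> list[str]:
    """Append knowledge-graph facts only when the query intent calls for them."""
    if intent in GRAPH_INTENTS:
        return chunks + graph_facts
    # All other intents see retrieved chunks only, avoiding the slight
    # degradation reported for unconditional injection.
    return list(chunks)
```

For example, `build_context("navigational", ["c1"], ["g1"])` yields `["c1", "g1"]`, while any other intent label yields `["c1"]`.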


5. Why us, not Retell, Vapi, or Cognigy

Three differentiators populate cells in our row of the SOTA matrix §2 that competitor cells leave blank or marked "not publicly documented." Each is one sentence; engineering depth is at the link.

  1. Patient-facing voice + citation-grounded retrieval, in the same stack. We are the only entry on the matrix that ships per-turn chunk-id traceability with patient-facing voice (SOTA §2.8 Provenance, voice/citation-pipeline).
  2. Multi-language safety regex packs (nl/en/fr/it) plus a three-stage post-LLM disclaimer. In the domain-depth cells of SOTA §2.3, competitors mostly read "not publicly documented"; ours cite the 100% pass rate on the safety-refusal and adversarial-GCG cohorts.
  3. GDPR Art. 35 DPIA + AI Act Art. 50 limited-risk classification on file, in version-controlled engineering artifacts. SOTA §2.6 Compliance, safety/dpia, safety/ai-act-compliance.
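
To make the second differentiator concrete, a per-language regex pack can be pictured as a mapping from language code to compiled patterns. The sketch below is an assumption throughout: the patterns, the function name, and the single "dose" trigger are hypothetical and far narrower than a production pack would be.

```python
import re

# Hypothetical per-language packs; the real patterns are not shown in this deck.
REGEX_PACKS = {
    "nl": [re.compile(r"\bwelke dosis\b", re.IGNORECASE)],
    "en": [re.compile(r"\bwhat dose\b", re.IGNORECASE)],
    "fr": [re.compile(r"\bquelle dose\b", re.IGNORECASE)],
    "it": [re.compile(r"\bquale dose\b", re.IGNORECASE)],
}


def is_medical_advice_request(utterance: str, lang: str) -> bool:
    """Return True when any pattern in the language's pack matches."""
    return any(p.search(utterance) for p in REGEX_PACKS.get(lang, []))
```

Note that this sketch lets utterances in an unlisted language through unflagged; a deployed filter would more plausibly fall back to a default pack or fail closed.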

Single-sentence framing: we are the only stack on the matrix that combines patient-facing voice with citation-grounded retrieval, multi-tenant overlay onboarding, GDPR-and-AI-Act artifacts on file, and an empirically measured 99.0% pass rate against a 302-question regulated-domain benchmark (SOTA §1, §5.4).


6. Honest gaps

We name three deficits up front, in the same shape they appear in SOTA §3. Each is committed to a specific roadmap line in SOTA §4; none is a marketing-roadmap line.

  • Infrastructure reliability — managed hyperscaler stacks (OpenAI Realtime API, Deepgram Voice Agent) have one less moving part. We have not yet measured pilot uptime against a managed alternative on an apples-to-apples SLO basis. Voice-channel P95 is "not yet measured at p95 on pilot" (SOTA §2.1).
  • Conversational fluidity — Retell ships fine-grained barge-in tuning at the per-call level (SOTA §3.2). Our barge-in uses LiveKit Agents' default semantic turn detection; per-tenant Voice Activity Detection tuning is on the Q3 2026 list.
  • Marketplace integrations — Genesys, NICE, Five9, and Talkdesk ship marketplace ecosystems with hundreds of pre-built connectors (Salesforce, HubSpot). Our codebase has zero such connectors today; Salesforce + HubSpot are committed for 2027 H1 (SOTA §3.3, §4.3).

Two smaller items for completeness (SOTA §3.4): mid-call language switching is a deliberate trade-off (ADR-0052), not a deficit, and we run an Operations dashboard but not a 24×7 NOC at pilot scale.


7. Roadmap (Q3 2026, Q4 2026, 2027 H1)

Each item maps to a specific gap in §6. No padding. The full table is at SOTA §4.

  • Q3 2026 — pilot uptime SLO posting; latency-cell backfill (replace dev-p50 with pilot-p95); barge-in tuning v1; ElevenLabs streaming TTS investigation for sub-200 ms first-audio; like-for-like cost-comparison spreadsheet (SOTA §4.1).
  • Q4 2026 — open-source intent-classifier benchmark (publish the 7-intent × 6-category affinity matrix); zero-shot prompt mode for fast tenant onboarding; second-pilot deployment as empirical proof of zero-source-change onboarding; diagnostic-V2 metric publication (SOTA §4.2).
  • 2027 H1 — Salesforce connector; HubSpot connector; multi-region deploy + DNS failover; AI Act high-risk pathway documentation kept ready against future feature scope-creep (SOTA §4.3).

8. Spinoff potential

The Voice Stack Compendium documents the architecture at a level of abstraction that an engineer in an adjacent regulated domain can rebuild at roughly seventy-percent fidelity without further consultation (Compendium §1). The same chassis lifts cleanly to:

  • Phone-support deflection for any enterprise with a high-volume, content-heavy public-facing knowledge base — the regex-pre-filter / agentic-LLM / RAG / safety-post-filter spine is domain-agnostic (Compendium §6).
  • Appointment booking — the agentic-tool surface accepts new tools without architectural change; a book_appointment tool plugs into the same orchestrator as search_hospital_kb (voice/architecture).
  • Telemedicine triage (informational only, not clinical) — the negative-scope discipline ("we do not provide medical advice") that shaped the ZOL safety architecture transfers directly to triage-style information lookups in adjacent regulated domains (Compendium §1, §5, safety/ai-act-compliance §1.2 scope-limit warning).
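
The claim that new tools plug in without architectural change can be illustrated with a small registry. This is a sketch under stated assumptions: the `tool` decorator, the `dispatch` function, and both tool bodies are hypothetical stand-ins, not the orchestrator described in voice/architecture; only the tool names `search_hospital_kb` and `book_appointment` come from this page.

```python
from typing import Callable

# Registry mapping tool names to callables (hypothetical orchestrator surface).
TOOLS: dict[str, Callable[..., str]] = {}


def tool(name: str) -> Callable[[Callable[..., str]], Callable[..., str]]:
    """Register a function under a tool name the LLM can call."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("search_hospital_kb")
def search_hospital_kb(query: str) -> str:
    return f"kb results for: {query}"  # placeholder body


@tool("book_appointment")
def book_appointment(department: str, slot: str) -> str:
    return f"requested {department} at {slot}"  # placeholder body


def dispatch(name: str, **kwargs: str) -> str:
    """Route a tool call by name; unknown tools are rejected explicitly."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)
```

Adding `book_appointment` touches only the registry; `dispatch` is unchanged, which is the shape of extensibility the bullet describes.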

9. Engineering rigor

Quantifiable signals that this is not a thesis-prototype masquerading as a product.

  • 50 ADRs documenting every significant architectural decision (thesis §1.5.1, Architecture Decisions sidebar).
  • 62 verified bibliography entries with last-verified dates, one-line summaries, and consistent inline-citation format (references).
  • Silent-failure discipline (R1/R2/R3) codified after the 2026-05-07 voice-history regression: R1 logs collection size on every collection-returning function; R2 lands a regression test with every silent-failure fix; R3 enforces contract tests for cross-component shared state (SOTA §5.4, CLAUDE.md).
  • No-mocking test policy (ADR-0002) with a test-to-production code ratio of approximately 1.3:1 (84,467 lines test code vs 65,075 lines application code per thesis Table 1.1).
  • 335+ git commits at thesis snapshot (thesis Table 1.1), 600+ tests restored in the 2026-04-22 test-debt sprint with coverage floor lifted from 40% to 55% (project memory).
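
Rule R1 lends itself to a decorator. The sketch below is one way it could be enforced, assuming Python's standard `logging` module; the decorator name, the logger name, and the `load_history` stub are hypothetical and do not come from the codebase.

```python
import functools
import logging

logger = logging.getLogger("silent_failure")


def log_collection_size(fn):
    """R1 sketch: log the size of whatever collection the function returns."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        # An unexpected "0 items" in the log is the signal R1 is meant to surface.
        logger.info("%s returned %d items", fn.__name__, len(result))
        return result
    return wrapper


@log_collection_size
def load_history(call_id: str) -> list[str]:
    """Hypothetical stub standing in for a conversation-history loader."""
    return []
```

With this in place, a loader that silently returns an empty history still leaves a log line such as `load_history returned 0 items`, which a regression test (R2) can then assert on.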

10. The ask

What we want from the pilot review meeting:

  • Feedback-loop cadence. A bi-weekly pilot-review meeting with the hospital sponsor to walk Operations dashboard charts (Category Mismatch Trend, Diagnostic Accuracy Trend) and the per-week conversation transcripts. Live URLs in the KPI snapshot.
  • KPI review cadence. Monthly review of the headline metrics in §4 against the SLO commitments in §7. Voice-channel P95 lands first, in Q3 2026, per the SOTA §4.1 commitment.
  • Signoff criteria. Three gates: (a) zero medical-advice incidents in production telemetry, sustained for the pilot quarter; (b) end-to-end voice-turn P95 below the 10-second attention bound from Nielsen 1993 for the dominant intent categories; (c) no high-severity incidents in the audit log.
  • Compliance review. A walkthrough of the DPIA, AI Act memo, and Data Retention Policy with the hospital DPO; full index at Compliance Package Index.
  • Demo signoff. A live walkthrough of the demo script — seven scripted scenarios drawn from the golden-question set and the smoke-test script. We expect the safety-refusal scenario (Turn 9 in voice/smoke-test-script) to be the most-watched moment of the meeting.
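
The three signoff gates are mechanical enough to express as a check. This is a minimal sketch, assuming nearest-rank percentiles and a flat list of voice-turn latencies in milliseconds; the function names and the input shapes are illustrative, and the real gates would read from production telemetry and the audit log.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def signoff_gates(
    medical_advice_incidents: int,
    voice_turn_latencies_ms: list[float],
    high_severity_incidents: int,
    p95_bound_ms: float = 10_000,  # the 10-second attention bound cited above
) -> dict[str, bool]:
    """Evaluate gates (a), (b), and (c) and report each one separately."""
    return {
        "zero_medical_advice": medical_advice_incidents == 0,
        "p95_within_bound": percentile(voice_turn_latencies_ms, 95) < p95_bound_ms,
        "no_high_severity": high_severity_incidents == 0,
    }
```

Reporting the gates separately rather than as one boolean keeps the monthly review focused on whichever gate is failing.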