
Architecture One-Pager

A five-minute read for a technical buyer. The full engineering description is in the Voice Stack Compendium; this page is the one-page distillation.

What it does

The ZOL Intelligent Search system replaces the hospital website's keyword search with a natural-language interface across two channels, web chat and telephony, that share one retrieval backend. A caller dialling the Belgian PSTN number reaches a self-hosted SIP gateway that bridges into a LiveKit room. There, a voice_agent worker streams audio through Deepgram Nova-3 for speech-to-text and hands transcripts to the FastAPI backend over a WebSocket. The backend runs each query through the cognitive core (regex pre-filter → agentic LLM → safety post-filter → answer-shaper → disclaimer), and the answer is synthesised back to audio through ElevenLabs Multilingual v2. The chat channel skips the SIP and TTS stages but reuses the same backend cognition end-to-end.
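
The five-stage ordering of the cognitive core can be sketched as a chain of small functions. This is a minimal illustration only: the stage names come from the description above, but the regex patterns, the canned replies, the two-sentence limit, the disclaimer text, and every function name are assumptions made for the example, not the production rules.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical patterns and texts; the real rules are not given on this page.
EMERGENCY = re.compile(r"\b(chest pain|overdose)\b", re.IGNORECASE)
DOSAGE = re.compile(r"\b\d+\s?mg\b", re.IGNORECASE)
DISCLAIMER = "This is general information, not medical advice."


@dataclass
class Turn:
    transcript: str
    answer: Optional[str] = None
    short_circuited: bool = False


def regex_pre_filter(turn: Turn) -> Turn:
    # Answer matching queries directly so they never reach the LLM.
    if EMERGENCY.search(turn.transcript):
        turn.answer = "Please call 112 immediately."
        turn.short_circuited = True
    return turn


def agentic_llm(turn: Turn, llm: Callable[[str], str]) -> Turn:
    # Only call the model when the pre-filter did not already answer.
    if not turn.short_circuited:
        turn.answer = llm(turn.transcript)
    return turn


def safety_post_filter(turn: Turn) -> Turn:
    # Replace answers that look like dosage advice (placeholder check).
    if turn.answer and DOSAGE.search(turn.answer):
        turn.answer = "I cannot advise on medication."
    return turn


def answer_shaper(turn: Turn, max_sentences: int = 2) -> Turn:
    # Keep only the first few sentences of the answer.
    sentences = re.split(r"(?<=[.!?])\s+", (turn.answer or "").strip())
    turn.answer = " ".join(sentences[:max_sentences])
    return turn


def add_disclaimer(turn: Turn) -> Turn:
    turn.answer = f"{turn.answer} {DISCLAIMER}"
    return turn


def run_cognitive_core(transcript: str, llm: Callable[[str], str]) -> str:
    # Stages run in the fixed order named in the text; none is skipped.
    turn = Turn(transcript)
    turn = regex_pre_filter(turn)
    turn = agentic_llm(turn, llm)
    turn = safety_post_filter(turn)
    turn = answer_shaper(turn)
    turn = add_disclaimer(turn)
    return turn.answer
```

In this sketch a pre-filtered answer still passes through the later stages, so every reply carries the disclaimer; whether the real system does the same is not stated here.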

How it is structured

The system composes seven layers, each with a single responsibility (Compendium §2). The composition is intentionally rigid: layers do not bypass each other, and each was chosen against named alternatives recorded in an ADR. The cognitive core (ADR-0049, ADR-0051) is agentic-only — a single GPT-4.1 agent with three tools (search_hospital_kb, transfer_to_helpdesk, end_call) wraps the existing RAGService.query_stream. There is no dialogue manager and no state machine; the legacy 8-stage VoiceOrchestrator and the dialogue manager were retired together in commit 158d793, removing approximately 7,000 LOC. Embeddings are OpenAI text-embedding-3-large over pgvector (ADR-0048); the knowledge representation is PostgreSQL taxonomy tables (ADR-0053, which retired Neo4j after the GDS algorithms proved non-load-bearing in retrieval).
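
To make the "one agent, three tools, no state machine" shape concrete, the sketch below registers the three tool names from the text and routes a model-issued tool call to a plain function. Only the tool names come from this page. The parameter names (query, reason), the return strings, and the dispatch helper are hypothetical, and the schema layout follows the general shape of OpenAI function-calling tool definitions rather than the project's actual definitions.

```python
import json
from typing import Callable, Dict


def search_hospital_kb(query: str) -> str:
    # Stand-in for the call into RAGService.query_stream (not shown here).
    return f"[kb results for: {query}]"


def transfer_to_helpdesk(reason: str) -> str:
    # Stand-in for handing the caller to a human operator.
    return f"transferring: {reason}"


def end_call() -> str:
    # Stand-in for hanging up the call.
    return "call ended"


# Registry mapping tool names to their implementations.
TOOLS: Dict[str, Callable[..., str]] = {
    "search_hospital_kb": search_hospital_kb,
    "transfer_to_helpdesk": transfer_to_helpdesk,
    "end_call": end_call,
}


def _schema(name: str, description: str, params: Dict[str, str]) -> dict:
    # Build one tool definition; every parameter is a required string.
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": "string", "description": d}
                               for p, d in params.items()},
                "required": list(params),
            },
        },
    }


# Descriptions and parameters are illustrative assumptions.
TOOL_SCHEMAS = [
    _schema("search_hospital_kb", "Search the hospital knowledge base.",
            {"query": "The caller's question."}),
    _schema("transfer_to_helpdesk", "Hand the caller to a human.",
            {"reason": "Why the transfer is needed."}),
    _schema("end_call", "End the call.", {}),
]


def dispatch(name: str, arguments_json: str) -> str:
    # Route a tool call to its function; unknown tool names are rejected.
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    args = json.loads(arguments_json or "{}")
    return TOOLS[name](**args)
```

With no dialogue manager, conversation flow lives in the model's choice of tool on each turn; the surrounding code only has to execute that choice and return the result.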

What is distinctive

Three engineering decisions differentiate the stack from comparable systems. Conditional knowledge-graph injection (graph context is added only when the query contains recognised medical entities) improves pass rate by 1.7 percentage points over graph-off, while unconditional injection slightly degrades the average (thesis §4.3, Table 4.7). Language locking at first utterance (ADR-0052) trades mid-call language switching for Flemish acoustic accuracy, after two empirical pilot regressions where multi-language Deepgram emitted gibberish or zero transcripts. The Value Framework is a 7-intent × 6-category affinity matrix (voice/value-framework) that prevents cross-category contamination — a wheelchair-accessibility query gets a parking answer, not an orthopaedic-reimbursement answer.
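
The first two decisions reduce to small guards, sketched below under stated assumptions. The entity list, the token-matching recogniser, the graph_lookup callable, and the LanguageLock class are all hypothetical stand-ins; the page does not describe how entities are recognised or how the lock is stored.

```python
from typing import Callable, Iterable, List, Optional, Set

# Hypothetical entity vocabulary; the real recogniser is not described here.
MEDICAL_ENTITIES: Set[str] = {"cardiology", "mri", "orthopaedics", "dialysis"}


def recognised_entities(query: str) -> Set[str]:
    # Naive token match, used only to illustrate the gating condition.
    tokens = {t.strip(".,?!").lower() for t in query.split()}
    return tokens & MEDICAL_ENTITIES


def build_context(
    query: str,
    passages: Iterable[str],
    graph_lookup: Callable[[str], List[str]],
) -> List[str]:
    # Conditional injection: graph facts are appended only when the query
    # mentions a recognised medical entity; otherwise passages pass through.
    context = list(passages)
    for entity in sorted(recognised_entities(query)):
        context.extend(graph_lookup(entity))
    return context


class LanguageLock:
    """Fix the call language at the first utterance and never change it."""

    def __init__(self) -> None:
        self._language: Optional[str] = None

    def observe(self, detected: str) -> str:
        # The first detected language wins; later detections are ignored.
        if self._language is None:
            self._language = detected
        return self._language
```

For example, build_context("Where can I park?", ...) returns the retrieved passages unchanged, while a query mentioning "MRI" also receives whatever graph_lookup("mri") yields. A LanguageLock that first observes "nl" keeps returning "nl" even if a later utterance is detected as "fr", which is the trade-off the paragraph describes.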

The seven-layer stack

Figure 1 — Layered stack. Adapted from Voice Stack Compendium §2, Figure 2.1.

The per-turn request path

Figure 2 — Per-turn path through the seven layers. Reproduced from Voice Stack Compendium §2, Figure 2.1.

Deployment

The pilot deploys on a single Hetzner server via Docker Compose with seven containers: backend (FastAPI uvicorn), voice_agent (LiveKit Agents worker), livekit-sip (SIP gateway), LiveKit Server (WebRTC media relay), Postgres 16 with pgvector, Redis, and MinIO for object storage. A Twilio Elastic SIP Trunk forwards inbound calls over TLS on port 5061 with IP allowlisting; OpenAI, Deepgram, and ElevenLabs are subprocessors of record under GDPR Art. 28 (safety/dpia §5.2). Self-hosting livekit-sip instead of using the managed LiveKit Cloud SIP gateway saves an estimated $375–625/month at the projected scale of 25K queries/month (Compendium §2 Layer 1, SOTA §2.1 trade-offs row). Operational runbook: ADR-0050.
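
As a quick sanity check on the self-hosting figure, the arithmetic below converts the quoted monthly range into per-query and annual terms. It assumes the saving is spread evenly over all 25K projected queries; the page does not say what share of those are telephony calls, so the per-query number is indicative only.

```python
# Figures quoted in the Deployment paragraph.
MONTHLY_SAVING_USD = (375, 625)
QUERIES_PER_MONTH = 25_000


def per_query_saving(monthly_usd: float, queries: int) -> float:
    # Saving attributed to each query if spread evenly (an assumption).
    return monthly_usd / queries


low, high = (per_query_saving(m, QUERIES_PER_MONTH) for m in MONTHLY_SAVING_USD)
annual_low, annual_high = (m * 12 for m in MONTHLY_SAVING_USD)

print(f"${low:.3f} to ${high:.3f} per query")
print(f"${annual_low:,} to ${annual_high:,} per year")
```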

Where to read more