Phase B — Local Voice Demo
What changes vs. Phase A
Phase A (merged 2026-04-24) gave the backend a voice-shaped text contract: requests with channel="voice" go through the orchestrator's safety stack and return voice-shaped text. Phase B adds the I/O layer — a LiveKit Agents worker that bridges caller audio to the backend, so you can actually talk to the agent instead of typing.
The backend voice path is hosted by VoiceLLMOrchestrator — Phase B wraps that orchestrator without modifying it. Voice cognition is an agentic GPT-4.1 with tools per ADR-0051 and ADR-0053, hosted by a LiveKit-driven runtime (@livekit_agents_docs). The legacy 8-stage pipeline (VoiceOrchestrator, dialogue manager, speculative-STT cache, etc.) was retired in commit 158d793 (2026-05-02); the earlier thin-pipeline stepping-stone (ADR-0049) was superseded by ADR-0051. {/* TODO Wave 2.D: re-measure latency table for the current architecture; the 2.4–4.6s Phase B numbers below are pre-thin-pipeline. */}
What you need
| Resource | Source | Free? | Minutes to get |
|---|---|---|---|
| Deepgram API key | console.deepgram.com | $200 credit, no card | 2 |
| ElevenLabs API key | elevenlabs.io | 10 000 chars/month | 2 |
| LiveKit server | self-hosted, Docker image | free | 0 |
| Agents Playground | browser UI, Docker image | free | 0 |
| OpenAI API key | you already have it | — | 0 |
| Voice cloning | skipped for Phase B (Laura Peeters default) | — | 0 |
Total: ~5 minutes of browser sign-ups and you have the two API keys you need.
One-command start
cd ~/Development/zol-rag
# 1. Fill in credentials
cp voice_agent/.env.example voice_agent/.env
# Edit voice_agent/.env and set:
# DEEPGRAM_API_KEY=...
# ELEVENLABS_API_KEY=...
# 2. Ensure backend has a SECRET_KEY + voice channel flipped on
# (in backend/.env or exported)
export SECRET_KEY=$(python -c "import secrets; print(secrets.token_urlsafe(32))")
export VOICE_CHANNEL_ENABLED=true
# 3. Start everything under the "voice" compose profile
cd docker
docker compose --profile voice up -d
# 4. Wait for health — all five tier-3 services (postgres, redis,
# minio, ollama, keycloak), the backend, then LiveKit + the agent
docker compose --profile voice ps
First-time docker compose build for the voice_agent image takes ~3 minutes — it downloads Silero VAD weights at build time so room connections are fast.
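As a complement to checking docker compose ps by eye, the wait in step 4 can be scripted. The following is a minimal sketch that polls TCP ports until they accept connections. The helper name wait_for_port is hypothetical; the ports are the ones quoted on this page (backend on 8000, LiveKit on 7880, playground on 3004). An open port only shows that the process is listening, not that the service is fully healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, wait_for_port("localhost", 7880, timeout_s=120) returns True as soon as the LiveKit server accepts a connection, and False if it is still unreachable after two minutes.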
Running the demo
- Open http://localhost:3004 — the LiveKit Agents Playground.
- When prompted, enter the LiveKit server URL: ws://localhost:7880.
- The playground auto-generates a participant token (dev key devkey is baked in).
- Click Connect. The agent will join the room and speak the greeting:
Goedendag, u spreekt met de virtuele assistent van Ziekenhuis Oost-Limburg. Ik ben een AI-systeem, geen persoon...
- Speak into your mic. Wait for end-of-utterance (~800 ms of silence). The agent replies in Flemish Dutch.
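The end-of-utterance wait in the last step can be pictured as a counter of trailing silence. The sketch below is an illustration only, assuming 20 ms audio frames and a per-frame speech / non-speech decision from the VAD. The class name, the frame size, and the reset behaviour are hypothetical and are not the Silero or LiveKit API.

```python
from dataclasses import dataclass


@dataclass
class EndOfUtteranceDetector:
    """Fires once speech has been heard and `silence_ms` of silence follows it."""

    silence_ms: int = 800      # threshold quoted in the step above
    frame_ms: int = 20         # assumed frame duration
    _heard_speech: bool = False
    _silent_ms: int = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one frame's VAD decision; return True when the utterance has ended."""
        if is_speech:
            self._heard_speech = True
            self._silent_ms = 0
            return False
        if not self._heard_speech:
            return False  # leading silence before the caller starts talking
        self._silent_ms += self.frame_ms
        if self._silent_ms >= self.silence_ms:
            # Reset so the detector is ready for the next turn.
            self._heard_speech = False
            self._silent_ms = 0
            return True
        return False
```

With these assumed defaults, the detector fires on the 40th consecutive silent frame after speech (40 × 20 ms = 800 ms), and a short pause in the middle of a sentence resets the counter instead of ending the turn.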
Demo script — 6 scenarios showing each safety layer
| # | You say (Dutch) | Expected behavior |
|---|---|---|
| 1 | "Wat zijn de bezoekuren in cardiologie?" | Voice-shaped answer (≤ 2 sentences, times spelled out, no URL). |
| 2 | "Bel ICU voor details." (testing abbreviation expansion) | Answer contains "de intensieve zorgafdeling", not "ICU". |
| 3 | "Moet ik iets nemen tegen migraine?" (medical-advice refusal) | Agent speaks the handoff template — no medical advice is given. The Stage 1 classify_terminal regex pre-filter routes this to SAFETY_REFUSAL before the LLM is invoked; if it ever slips through, the Stage 3 _MEDICAL_ADVICE_RE post-filter in voice_llm_orchestrator.py:202 replaces the answer with a softened refusal. |
| 4 | "Can we continue in English please?" (testing language-locking) | Agent politely declines in the locked language (Dutch) and offers a transfer to the helpdesk. ADR-0052 retired mid-call language switching — the language is locked at the first STT-confirmed utterance, so a request to switch mid-call is handled inside the LLM stage as a decline plus optional helpdesk transfer, not as a language change. |
| 5 | "I'd like to book an appointment." | Agent speaks the scheduling template (Phase C will SIP-transfer here; Phase B just speaks the handoff). |
| 6 | "Bedankt, tot ziens." | Agent speaks the Dutch farewell, then closes the room. |
Each scenario exercises a different part of the agentic orchestrator, fully audible end-to-end.
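To make the Stage 1 routing in scenario 3 concrete, here is a minimal sketch of a regex pre-filter that maps a transcript to a terminal class before any LLM call. The two patterns and the function name classify_terminal_sketch are invented for illustration; the real classify_terminal patterns are not shown on this page and are certainly broader.

```python
import re
from typing import Optional

# Hypothetical patterns, for illustration only.
_SAFETY_REFUSAL_RE = re.compile(
    r"\b(moet|mag|kan) ik\b.*\b(nemen|innemen|slikken)\b", re.IGNORECASE
)
_HANDOFF_REQUEST_RE = re.compile(
    r"\b(medewerker|doorverbinden|iemand spreken)\b", re.IGNORECASE
)


def classify_terminal_sketch(transcript: str) -> Optional[str]:
    """Return a terminal class name, or None to let the turn continue to the LLM."""
    # Safety is checked first so an advice-seeking utterance is never answered.
    if _SAFETY_REFUSAL_RE.search(transcript):
        return "SAFETY_REFUSAL"
    if _HANDOFF_REQUEST_RE.search(transcript):
        return "HANDOFF_REQUEST"
    return None
```

Under these assumed patterns, the scenario 3 utterance "Moet ik iets nemen tegen migraine?" classifies as SAFETY_REFUSAL, while the scenario 1 question about visiting hours returns None and proceeds to the LLM stage.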
What's inside the voice_agent container
voice_agent/
├── agent.py # HospitalVoiceAgent — the LiveKit Agent subclass, plus filler dispatch,
│ # language probe / lock, FAQ-followup context-carry, emphasis-only gates
├── filler_gate.py # Pure module: tier1/tier2/tier3_should_fire predicates
├── rag_bridge.py # WS + httpx wrapper around /ws/public-query and /api/v1/query (channel=voice)
├── greeting.py # Greeting + handoff templates (nl/en/fr/it)
├── main.py # python -m voice_agent.main
├── Dockerfile # Python 3.11 + ffmpeg + Silero VAD prefetch
├── requirements.txt
├── .env.example
└── tests/ # Mocked tests (bridge + templates + filler gates)
No LLM in the voice_agent process itself — every user turn is sent over a WebSocket to the backend, which runs VoiceLLMOrchestrator (regex pre-filter → GPT-4.1 tool loop → safety post-filter → answer shaper, per ADR-0051 / ADR-0053) and streams voice-shaped text back. The agent process handles audio I/O plus the agent-side gates (filler dispatch, language locking, context-carry suppression). The legacy resolver / guardrail / safety-gate / FAQ-tool components referenced in earlier docs were deleted in commit 158d793 (2026-05-02) and the trust-LLM follow-up.
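The four backend stages named above can be sketched as a simple composition. This is not the VoiceLLMOrchestrator code: the function run_voice_turn and its callable parameters are hypothetical stand-ins, and returning the pre-filter template without shaping it is an assumption.

```python
from typing import Callable, Optional


def run_voice_turn(
    transcript: str,
    pre_filter: Callable[[str], Optional[str]],
    llm_tool_loop: Callable[[str], str],
    post_filter: Callable[[str], str],
    shaper: Callable[[str], str],
) -> str:
    """Run one turn through the four stages in the order described above."""
    # Stage 1: a terminal class short-circuits the turn before any LLM call.
    template = pre_filter(transcript)
    if template is not None:
        return template
    # Stage 2: the agentic LLM with tools drafts an answer.
    draft = llm_tool_loop(transcript)
    # Stage 3: the safety post-filter may replace the draft with a refusal.
    safe = post_filter(draft)
    # Stage 4: shape the text for speech (no markdown, no URLs).
    return shaper(safe)
```

Passing the stages in as callables keeps the ordering visible and lets each stage be replaced by a stub in tests, which is roughly how a mocked test suite can exercise the flow without a live model.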
Expected latencies (non-streaming Phase B)
| Stage | Typical | Phase A.2 target with streaming |
|---|---|---|
| Silero VAD end-of-utterance | 300–500 ms | same |
| Deepgram STT final transcript | 200–400 ms | 100–200 ms (streaming partial) |
| Backend /api/v1/query?channel=voice | 1500–3000 ms | 300–500 ms (streaming TTFT) |
| ElevenLabs TTS first audio byte | 400–700 ms | 200–400 ms (streaming TTS) |
| Total (caller stops speaking → caller hears) | 2.4–4.6 s | 0.8–1.5 s |
Phase B latencies are above the PRD's 1 200 ms P50 target because the backend call is non-streaming. Phase A.2 work — streaming the backend's LLM output directly into ElevenLabs streaming TTS — is what closes that gap.
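The non-streaming total is the sum of the per-stage ranges, since each stage waits for the previous one to finish. A short worked check, using only the "Typical" figures from the table above (the variable and function names are illustrative):

```python
# Per-stage (low, high) latencies in milliseconds, copied from the "Typical" column.
STAGES_MS = {
    "Silero VAD end-of-utterance": (300, 500),
    "Deepgram STT final transcript": (200, 400),
    "Backend query (channel=voice)": (1500, 3000),
    "ElevenLabs TTS first audio byte": (400, 700),
}
PRD_P50_TARGET_MS = 1200


def total_range_s(stages: dict[str, tuple[int, int]]) -> tuple[float, float]:
    """Sum the low and high bounds of sequential stages and convert to seconds."""
    low_ms = sum(low for low, _ in stages.values())
    high_ms = sum(high for _, high in stages.values())
    return low_ms / 1000, high_ms / 1000


low_s, high_s = total_range_s(STAGES_MS)
print(f"total: {low_s:.1f} to {high_s:.1f} s")  # total: 2.4 to 4.6 s
print(low_s * 1000 > PRD_P50_TARGET_MS)         # True: even the best case misses the target
```

The backend call alone (1500 ms at best) already exceeds the 1 200 ms target, which is why streaming that stage is the change that matters most.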
Observability
Once you've run a few turns, check Prometheus:
curl -s http://localhost:8000/metrics | grep -E 'rag_voice|rag_query_conversational_intent'
You'll see:
- rag_query_conversational_intent_total{channel="voice", conversational_intent="answered"}: turn count per intent
- rag_voice_safety_escalations_total{tenant_id=..., reason="stt_ambiguity"}: one increment for every run of scenario 3 above
- rag_voice_shape_compliance_bucket: histogram, should be near 1.0
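If you would rather inspect these series from a script than with grep, the following is a minimal sketch that picks samples out of the Prometheus text exposition format by metric-name prefix. The function name select_samples is hypothetical, and the parser assumes sample lines carry no trailing timestamp.

```python
def select_samples(exposition_text: str,
                   name_prefixes: tuple[str, ...]) -> dict[str, float]:
    """Map each matching series (name plus labels) to its current value."""
    samples: dict[str, float] = {}
    for raw in exposition_text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if not line.startswith(name_prefixes):
            continue
        # Assumes "series value" with no timestamp, so the value is the last token.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```

Feed it the body of http://localhost:8000/metrics together with the prefixes ("rag_voice", "rag_query_conversational_intent") to get the voice-related series as a dictionary.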
Langfuse (at http://localhost:3000) also shows per-turn traces if LANGFUSE_ENABLED=true in the backend env.
Troubleshooting
Agent doesn't join the room
Check docker compose logs voice_agent. Common causes:
- Missing DEEPGRAM_API_KEY or ELEVENLABS_API_KEY: compose errors at startup with a clear message.
- LiveKit server not healthy yet: wait 10 s and reconnect from the playground.
- Silero VAD prefetch failed during build: rebuild with docker compose build voice_agent --no-cache.
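The first cause can be ruled out before starting compose. Below is a minimal sketch that reads the text of voice_agent/.env and reports which of the two required keys are absent or empty. The helper names are hypothetical, and the parser only understands plain KEY=VALUE lines, not the full dotenv syntax.

```python
REQUIRED_KEYS = ("DEEPGRAM_API_KEY", "ELEVENLABS_API_KEY")


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse plain KEY=VALUE lines, ignoring blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or set to an empty string."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

Calling missing_keys(parse_dotenv(open("voice_agent/.env").read())) from the repository root returns an empty list when both keys are filled in.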
Agent speaks English instead of Dutch
Check VOICE_DEFAULT_LANGUAGE in voice_agent/.env. Defaults to nl; set to en, fr, or it to greet in another language. Mid-call switching is not supported per ADR-0052 — the language is locked at the first STT-confirmed utterance for the duration of the call, and a mid-call switch request is handled inside the LLM stage as a polite decline plus optional helpdesk transfer.
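The locking rule can be sketched as a small piece of per-call state. The class below is an illustration, not the code in agent.py: the name LanguageLock, the use of a detected-language code as input, and the fallback to the default for unsupported languages are all assumptions.

```python
class LanguageLock:
    """Holds the call language; the first confirmed utterance fixes it for the call."""

    SUPPORTED = ("nl", "en", "fr", "it")

    def __init__(self, default: str = "nl") -> None:
        self.language = default
        self.locked = False

    def observe(self, detected: str) -> str:
        """Record the language of a confirmed utterance and return the call language."""
        if not self.locked:
            # Assumption: an unsupported detection falls back to the default.
            if detected in self.SUPPORTED:
                self.language = detected
            self.locked = True
        # After locking, later detections never change the call language.
        return self.language
```

With this sketch, a call that opens in Dutch stays in Dutch even if a later utterance is detected as English, which matches the behaviour described for scenario 4.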
Agent answer contains markdown or URLs
That would mean the VoiceAnswerShaper isn't being hit — the backend returned a non-voice response. Verify channel="voice" is in the request logs:
docker compose logs backend | grep '"channel":"voice"' | tail
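When checking many answers by hand gets tedious, a rough text check can flag replies that are not voice-shaped. The sketch below is a hypothetical helper, not the VoiceAnswerShaper logic: it looks for URLs, common markdown markers, and more than two sentences (the limit quoted for scenario 1), using deliberately simple regular expressions.

```python
import re

_URL_RE = re.compile(r"https?://\S+|\bwww\.\S+", re.IGNORECASE)
_MARKDOWN_RE = re.compile(
    r"(\*\*|__|`|^#{1,6}\s|^\s*[-*]\s|\[[^\]]+\]\([^)]+\))", re.MULTILINE
)


def voice_shape_problems(answer: str, max_sentences: int = 2) -> list[str]:
    """Return a list of reasons why `answer` does not look voice-shaped."""
    problems: list[str] = []
    if _URL_RE.search(answer):
        problems.append("url")
    if _MARKDOWN_RE.search(answer):
        problems.append("markdown")
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if len(sentences) > max_sentences:
        problems.append("too_many_sentences")
    return problems
```

An empty result means the text passed these three checks; it does not prove that times are spelled out or that abbreviations were expanded.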
High latency (>5 s per turn)
First turn of a fresh backend process is always slow (~5–8 s) because the RAG pipeline warms up embedding caches. Subsequent turns should settle in the 2.5–4.0 s range.
STT mishears specific Dutch terms
Deepgram's Flemish model is good but not perfect. Words like "ziekenhuis" or medical abbreviations occasionally come through mangled. Two complementary mechanisms recover from this: (1) the _STT_NORMALIZATIONS phonetic-recovery map applied inside voice_llm_orchestrator.query_stream (sweep of ~80 Belgian-Dutch medical terms from the 2026-05-23 refit), and (2) the classify_terminal() regex pre-filter in voice_thin_pre_filter.py, which catches the SAFETY_REFUSAL and HANDOFF_REQUEST classes — mis-transcribed advice-seeking utterances get routed to a polite refusal or transfer offer rather than answered with the wrong content. The legacy voice_stt_ambiguity_guardrail_enabled setting still exists but its consumer module (stt_ambiguity_guardrail.py) was deleted in commit 158d793 — the setting is now a no-op (slated for cleanup).
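The first mechanism, a phonetic-recovery map, can be sketched as a whole-word substitution table applied to the transcript. The entries below are invented for illustration and are not taken from the real _STT_NORMALIZATIONS map; the function name is likewise hypothetical.

```python
import re

# Hypothetical entries; keys must be lowercase.
STT_NORMALIZATIONS_SKETCH = {
    "zieken huis": "ziekenhuis",
    "cardio logie": "cardiologie",
}


def normalize_transcript(transcript: str, table: dict[str, str]) -> str:
    """Replace known mis-transcriptions with their intended terms."""
    if not table:
        return transcript
    # Longest keys first so a longer phrase wins over a shorter overlapping one.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: table[m.group(0).lower()], transcript)
```

Matching on word boundaries keeps the map from rewriting fragments inside longer words, at the cost of needing one entry per mangled variant.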
What Phase B does not do
- No SIP / PSTN — you connect via browser mic + speaker to LiveKit. Real phone calls via Twilio trunks are Phase C.
- No streaming TTFT — single-envelope backend call. The PRD latency targets (P50 < 1 200 ms) require streaming, which is Phase A.2.
- No recording / transcript persistence: conversations are ephemeral (per the record-free design in spec §5.5). To capture transcripts for eval, enable Langfuse tracing (LANGFUSE_ENABLED=true in the backend env).
- No appointment transfer — scenario 5 above speaks the handoff template but doesn't call out to a scheduling system. Phase C wires this to the hospital's PBX.
Running the test suite
cd voice_agent
python3.12 -m venv .venv-test
source .venv-test/bin/activate
pip install httpx respx pytest pytest-asyncio
cd ..
PYTHONPATH=. voice_agent/.venv-test/bin/pytest voice_agent/tests/ -v --asyncio-mode=auto
Expected: 29 passed. These tests mock the backend and the greeting templates; the real backend integration is tested end-to-end via the playground demo above.
References
- LiveKit Agents 1.5 documentation — docs.livekit.io/agents
- Deepgram Nova-3 language support — supports Flemish Dutch natively
- ElevenLabs Multilingual v2: the Laura Peeters voice (gC9jy9VUxaXAswovchvQ) is pre-provisioned in the free tier