Phase B — Local Voice Demo

What changes vs. Phase A

Phase A (merged 2026-04-24) gave the backend a voice-shaped text contract: requests with channel="voice" go through the orchestrator's safety stack and return voice-shaped text. Phase B adds the I/O layer — a LiveKit Agents worker that bridges caller audio to the backend, so you can actually talk to the agent instead of typing.

The backend voice path is hosted by VoiceLLMOrchestrator — Phase B wraps that orchestrator without modifying it. Voice cognition is an agentic GPT-4.1 with tools per ADR-0051 and ADR-0053, hosted by a LiveKit-driven runtime (@livekit_agents_docs). The legacy 8-stage pipeline (VoiceOrchestrator, dialogue manager, speculative-STT cache, etc.) was retired in commit 158d793 (2026-05-02); the earlier thin-pipeline stepping-stone (ADR-0049) was superseded by ADR-0051. {/* TODO Wave 2.D: re-measure latency table for the current architecture; the 2.4–4.6s Phase B numbers below are pre-thin-pipeline. */}

What you need

Resource	Source	Free?	Minutes to get
Deepgram API key	console.deepgram.com	$200 credit, no card	2
ElevenLabs API key	elevenlabs.io	10 000 chars/month	2
LiveKit server	self-hosted, Docker image	free	0
Agents Playground	browser UI, Docker image	free	0
OpenAI API key	you already have it	—	0
Voice cloning	skipped for Phase B (Laura Peeters default)	—	0

Total: ~5 minutes of browser sign-ups and you have the two API keys you need.

One-command start

cd ~/Development/zol-rag

# 1. Fill in credentials
cp voice_agent/.env.example voice_agent/.env
# Edit voice_agent/.env and set:
#   DEEPGRAM_API_KEY=...
#   ELEVENLABS_API_KEY=...

# 2. Ensure backend has a SECRET_KEY + voice channel flipped on
# (in backend/.env or exported)
export SECRET_KEY=$(python -c "import secrets; print(secrets.token_urlsafe(32))")
export VOICE_CHANNEL_ENABLED=true

# 3. Start everything under the "voice" compose profile
cd docker
docker compose --profile voice up -d

# 4. Wait for health — all five tier-3 services (postgres, redis,
#    minio, ollama, keycloak), the backend, then LiveKit + the agent
docker compose --profile voice ps

First-time docker compose build for the voice_agent image takes ~3 minutes — it downloads Silero VAD weights at build time so room connections are fast.

Running the demo

Open http://localhost:3004 — the LiveKit Agents Playground.
When prompted, enter the LiveKit server URL: ws://localhost:7880.
The playground auto-generates a participant token (dev key devkey is baked in).
Click Connect. The agent will join the room and speak the greeting:

Goedendag, u spreekt met de virtuele assistent van Ziekenhuis Oost-Limburg. Ik ben een AI-systeem, geen persoon...
Speak into your mic. Wait for end-of-utterance (~800 ms of silence). The agent replies in Flemish Dutch.

Demo script — 6 scenarios showing each safety layer

#	You say (Dutch)	Expected behavior
1	"Wat zijn de bezoekuren in cardiologie?"	Voice-shaped answer (≤ 2 sentences, times spelled out, no URL).
2	"Bel ICU voor details." (testing abbreviation expansion)	Answer contains "de intensieve zorgafdeling", not "ICU".
3	"Moet ik iets nemen tegen migraine?" (medical-advice refusal)	Agent speaks the handoff template — no medical advice is given. The Stage 1 `classify_terminal` regex pre-filter routes this to `SAFETY_REFUSAL` before the LLM is invoked; if it ever slips through, the Stage 3 `_MEDICAL_ADVICE_RE` post-filter in `voice_llm_orchestrator.py:202` replaces the answer with a softened refusal.
4	"Can we continue in English please?" (testing language-locking)	Agent politely declines in the locked language (Dutch) and offers a transfer to the helpdesk. ADR-0052 retired mid-call language switching — the language is locked at the first STT-confirmed utterance, so a request to switch mid-call is handled inside the LLM stage as a decline plus optional helpdesk transfer, not as a language change.
5	"I'd like to book an appointment."	Agent speaks the scheduling template (Phase C will SIP-transfer here; Phase B just speaks the handoff).
6	"Bedankt, tot ziens."	Agent speaks the Dutch farewell, then closes the room.

Each scenario exercises a different part of the agentic orchestrator, fully audible end-to-end.

What's inside the `voice_agent` container

voice_agent/
├── agent.py        # HospitalVoiceAgent — the LiveKit Agent subclass, plus filler dispatch,
│                   # language probe / lock, FAQ-followup context-carry, emphasis-only gates
├── filler_gate.py  # Pure module: tier1/tier2/tier3_should_fire predicates
├── rag_bridge.py   # WS + httpx wrapper around /ws/public-query and /api/v1/query (channel=voice)
├── greeting.py     # Greeting + handoff templates (nl/en/fr/it)
├── main.py         # python -m voice_agent.main
├── Dockerfile      # Python 3.11 + ffmpeg + Silero VAD prefetch
├── requirements.txt
├── .env.example
└── tests/          # Mocked tests (bridge + templates + filler gates)

No LLM in the voice_agent process itself — every user turn is sent over a WebSocket to the backend, which runs VoiceLLMOrchestrator (regex pre-filter → GPT-4.1 tool loop → safety post-filter → answer shaper, per ADR-0051 / ADR-0053) and streams voice-shaped text back. The agent process handles audio I/O plus the agent-side gates (filler dispatch, language locking, context-carry suppression). The legacy resolver / guardrail / safety-gate / FAQ-tool components referenced in earlier docs were deleted in commit 158d793 (2026-05-02) and the trust-LLM follow-up.

Expected latencies (non-streaming Phase B)

Stage	Typical	Phase A.2 target with streaming
Silero VAD end-of-utterance	300–500 ms	same
Deepgram STT final transcript	200–400 ms	100–200 ms (streaming partial)
Backend `/api/v1/query?channel=voice`	1500–3000 ms	300–500 ms (streaming TTFT)
ElevenLabs TTS first audio byte	400–700 ms	200–400 ms (streaming TTS)
Total (caller stops speaking → caller hears)	2.4–4.6 s	0.8–1.5 s

Phase B latencies are above the PRD's 1 200 ms P50 target because the backend call is non-streaming. Phase A.2 work — streaming the backend's LLM output directly into ElevenLabs streaming TTS — is what closes that gap.

Observability

Once you've run a few turns, check Prometheus:

curl -s http://localhost:8000/metrics | grep rag_voice

You'll see:

rag_query_conversational_intent_total{channel="voice", conversational_intent="answered"} — turn count per intent
rag_voice_safety_escalations_total{tenant_id=..., reason="stt_ambiguity"} — one increment for every scenario 3 above
rag_voice_shape_compliance_bucket — histogram, should be near 1.0

Langfuse (at http://localhost:3000) also shows per-turn traces if LANGFUSE_ENABLED=true in the backend env.

Troubleshooting

Agent doesn't join the room

Check docker compose logs voice_agent. Common causes:

Missing DEEPGRAM_API_KEY or ELEVENLABS_API_KEY — compose errors at startup with a clear message.
LiveKit server not healthy yet — wait 10 s and reconnect from the playground.
Silero VAD prefetch failed during build — rebuild with docker compose build voice_agent --no-cache.

Agent speaks English instead of Dutch

Check VOICE_DEFAULT_LANGUAGE in voice_agent/.env. Defaults to nl; set to en, fr, or it to greet in another language. Mid-call switching is not supported per ADR-0052 — the language is locked at the first STT-confirmed utterance for the duration of the call, and a mid-call switch request is handled inside the LLM stage as a polite decline plus optional helpdesk transfer.

Agent answer contains markdown or URLs

That would mean the VoiceAnswerShaper isn't being hit — the backend returned a non-voice response. Verify channel="voice" is in the request logs:

docker compose logs backend | grep '"channel":"voice"' | tail

High latency (>5 s per turn)

First turn of a fresh backend process is always slow (~5–8 s) because the RAG pipeline warms up embedding caches. Subsequent turns should settle in the 2.5–4.0 s range.

STT mishears specific Dutch terms

Deepgram's Flemish model is good but not perfect. Words like "ziekenhuis" or medical abbreviations occasionally come through mangled. Two complementary mechanisms recover from this: (1) the _STT_NORMALIZATIONS phonetic-recovery map applied inside voice_llm_orchestrator.query_stream (sweep of ~80 Belgian-Dutch medical terms from the 2026-05-23 refit), and (2) the classify_terminal() regex pre-filter in voice_thin_pre_filter.py, which catches the SAFETY_REFUSAL and HANDOFF_REQUEST classes — mis-transcribed advice-seeking utterances get routed to a polite refusal or transfer offer rather than answered with the wrong content. The legacy voice_stt_ambiguity_guardrail_enabled setting still exists but its consumer module (stt_ambiguity_guardrail.py) was deleted in commit 158d793 — the setting is now a no-op (slated for cleanup).

What Phase B does not do

No SIP / PSTN — you connect via browser mic + speaker to LiveKit. Real phone calls via Twilio trunks are Phase C.
No streaming TTFT — single-envelope backend call. The PRD latency targets (P50 < 1 200 ms) require streaming, which is Phase A.2.
No recording / transcript persistence — conversations are ephemeral (per the record-free design in spec §5.5). If you want to capture transcripts for eval, add Langfuse's built-in tracing.
No appointment transfer — scenario 5 above speaks the handoff template but doesn't call out to a scheduling system. Phase C wires this to the hospital's PBX.

Running the test suite

cd voice_agent
python3.12 -m venv .venv-test
source .venv-test/bin/activate
pip install httpx respx pytest pytest-asyncio
cd ..
PYTHONPATH=. voice_agent/.venv-test/bin/pytest voice_agent/tests/ -v --asyncio-mode=auto

Expected: 29 passed. These tests mock the backend and the greeting templates; the real backend integration is tested end-to-end via the playground demo above.

References

LiveKit Agents 1.5 documentation — docs.livekit.io/agents
Deepgram Nova-3 language support — supports Flemish Dutch natively
ElevenLabs Multilingual v2 — the Laura Peeters voice (gC9jy9VUxaXAswovchvQ) is pre-provisioned in the free tier

What changes vs. Phase A​

What you need​

One-command start​

Running the demo​

Demo script — 6 scenarios showing each safety layer​

What's inside the voice_agent container​

Expected latencies (non-streaming Phase B)​

Observability​

Troubleshooting​

Agent doesn't join the room​

Agent speaks English instead of Dutch​

Agent answer contains markdown or URLs​

High latency (>5 s per turn)​

STT mishears specific Dutch terms​

What Phase B does not do​

Running the test suite​

References​