Release Notes: May 22 – 23, 2026

LLM-First Agentic Voice · 4 Hotfixes · Filler Race Calibration

This window completed the trust-LLM simplification sprint that began with the 11-task plan brainstormed on 2026-05-22 and landed as the LLM-first agentic voice pipeline (ADR-0053). The shape of the work was unusual: the design was approved cleanly, but the production reality surfaced four distinct failure modes during pilot SIP smoke-testing, each requiring a focused hotfix. The release note below documents both halves — the architectural change and the empirical calibration that followed — because the second half is the more transferable lesson.

The headline themes:

LLM-first agentic dispatch (ADR-0053). Replaced the spec's two-call pattern (non-streaming tool decision → separate streaming final) with native OpenAI streaming-with-tools — a single chat.completions.create(stream=True, tools=_TOOLS, tool_choice="auto") call per tool-decision iteration. For direct-response queries this halves OpenAI round-trips.
Hotfix cascade (4 fixes, 2026-05-22 evening through 2026-05-23 morning). Citation serialization, blocked-event sentence-buffer gap, tool_choice="none" OpenAI API rejection, two-call latency. Each fix was a single-commit landing; the cascade itself is the lesson, not any individual diff.
Tier-1 filler grace tuned twice (800 → 1500 ms → language-switch skip). Pilot SIP testing surfaced a "third clock" problem — transport latency on the Twilio/LiveKit SIP path — that the original threshold hadn't accounted for. A second retune added a 5-second skip window for the post-language-switch STT/TTS reload.
Pilot warmup config flip. VOICE_WARMUP_CACHE_ENABLED=true was set during an earlier streaming-TTFT validation; the agent.py docstring explicitly warned this would cause "3 fillers and an offer to transfer before the agent speaks the actual answer." Pilot SIP audit reproduced exactly that. Flag returned to false.
Pilot disk cleanup. 350 stale Docker image tags + 33 GB of build cache removed; 32 GB of disk reclaimed without touching active images.

1 · ADR-0053 — LLM-First Agentic Voice Pipeline

The spec at docs/superpowers/specs/2026-05-22-llm-first-agentic-design.md ratified what the prior architecture had been drifting toward: the LLM, not a regex pre-filter, decides whether to engage in dialogue, ask clarification, or call into the RAG. The system prompt's AVAILABLE TOOLS section documents three tools (search_hospital_kb, transfer_to_helpdesk, end_call) and 11 dialogue examples spanning capability questions, greetings, ambiguous follow-ups, meta-questions, ZOL info (tool call), educational (tool call), prescriptive refusal, handoff, farewell, off-topic redirect, and phone-not-in-corpus.

The landing was clean: a044e813 (prompt rewrite) → 15c596b5 (ADR) → 62ba676d (9-pattern integration test).

The original design vs the production reality

The spec's Section 2 specified a two-call pattern:

Non-streaming chat.completions.create() for the tool-decision iteration loop
Separate streaming chat.completions.create(stream=True, tool_choice="none") for the final answer

This shape passed the 9-pattern integration test cleanly because AsyncMock OpenAI clients don't enforce SDK-level argument validation. Production traffic hit the actual API and immediately got HTTP 400: "tool_choice is only valid when tools is provided." The second-call pattern needed either tools=_TOOLS (with explicit "you have tools but I'm forbidding them this turn") OR no tool_choice parameter at all. Hotfix #2 (0afe7160) removed the invalid tool_choice kwarg.

But the deeper problem was latency. Direct-response queries (such as "Do you speak English?", which the LLM can answer without RAG) made two OpenAI round-trips under the two-call pattern. The 800 ms tier-1 filler fired on every direct-response turn because the LLM didn't start streaming until after the tool-decision call returned.

Hotfix #4 (296031ee) replaced the spec's option (b) with option (a): one streaming-with-tools call per iteration. The LLM emits content deltas if no tool is needed, or tool-call deltas if one is, and the orchestrator parses both shapes from a single stream. A direct-response query now costs one call; a tool-using query costs N+1 calls (one per tool iteration plus one final synthesis), all streamed end-to-end.

spec option (b): direct-response = 2 OpenAI calls, first sentence ~1500-2500 ms
shipped (a):     direct-response = 1 OpenAI call,  first sentence ~600-900 ms
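To make "parses both shapes from a single stream" concrete, here is a rough sketch of the folding step. It does not call the OpenAI SDK: the `Delta` and `ToolCallDelta` dataclasses are simplified stand-ins for the streamed delta objects, and `consume` is a hypothetical name, not the orchestrator's actual function.

```python
import json
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass
class ToolCallDelta:
    index: int                    # which tool call this fragment belongs to
    name: Optional[str] = None    # present only on the first fragment
    arguments: str = ""           # JSON text, streamed in pieces


@dataclass
class Delta:
    content: Optional[str] = None
    tool_calls: list[ToolCallDelta] = field(default_factory=list)


def consume(deltas: Iterable[Delta]) -> tuple[str, list[dict]]:
    """Fold one stream into (spoken_text, tool_calls)."""
    text_parts: list[str] = []
    calls: dict[int, dict] = {}
    for delta in deltas:
        if delta.content:
            text_parts.append(delta.content)
        for tc in delta.tool_calls:
            slot = calls.setdefault(tc.index, {"name": None, "arguments": ""})
            if tc.name:
                slot["name"] = tc.name
            slot["arguments"] += tc.arguments
    # Argument fragments only form valid JSON once the stream has ended.
    parsed = [
        {"name": c["name"], "arguments": json.loads(c["arguments"] or "{}")}
        for _, c in sorted(calls.items())
    ]
    return "".join(text_parts), parsed


# Direct response: content deltas only, no tool call.
direct = consume([Delta(content="Yes, "), Delta(content="I speak English.")])

# Tool decision: the argument JSON arrives split across two deltas.
tool = consume([
    Delta(tool_calls=[ToolCallDelta(0, name="search_hospital_kb", arguments='{"que')]),
    Delta(tool_calls=[ToolCallDelta(0, arguments='ry": "visiting hours"}')]),
])
```

In a real pipeline the content deltas would be forwarded to TTS as they arrive rather than joined at the end; the sketch joins them only to keep the example short.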

2 · The four-hotfix cascade

Hotfixes landed in this order on 2026-05-22 evening through 2026-05-23 morning. Each was triggered by a specific failure mode in pilot SIP smoke-testing; each is its own commit with its own test pin.

#	Commit	Failure mode	Fix shape
1	`8f9769df`	`TypeError: Object of type Citation is not JSON serializable` at `public_websocket.py:779`. Cascade: WS connection died → asyncpg pool slot poisoned → "Can't reconnect until invalid transaction is rolled back" on next request	Wrap event in `jsonable_encoder()` before `send_json()`
1b	`a47a195a`	T7 query_stream only buffered `chunk` events; `blocked` / `clarification` / `repeat_previous` were forwarded unchanged. voice_agent's consumer only handles `sentence` / `final` / `error` → never spoke off-topic refusals	Extended sentence buffer to transform all 4 event types
2	`0afe7160`	`voice_llm_streaming_final_failed` on every turn. Root cause: `tool_choice="none"` without `tools=` parameter → OpenAI HTTP 400	Removed invalid `tool_choice="none"` from streaming call
3	(logging fix folded into #2)	Diagnostics gap — `logger.error("event", extra={"err":...})` silently dropped the `extra=` keys in the project's structlog formatter	Switched to `logger.exception("event err=%s", exc)`
4	`296031ee`	HTTP `/api/v1/query` returned empty answer (Gate 1 persona eval showed 18/88 turns, all empty). Sync `query()` wrapper consumed chunk events, but the T3 sentence-buffer wrapper only emitted sentence events. Plus: filler always fired on direct-response queries because the two-call pattern took >800 ms	Native streaming-with-tools (option a); sync wrapper now accumulates from `sentence` events
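The sentence-buffer gap in hotfix 1b is easier to see in code. The sketch below is an illustration only: the event dictionaries, their `text` key, and the function name `to_sentences` are assumptions, not the project's actual types. It shows a buffer that turns `chunk`, `blocked`, `clarification`, and `repeat_previous` events into `sentence` events, so a consumer that only understands `sentence` / `final` / `error` still speaks refusals.

```python
import re
from typing import Iterable, Iterator

# Split after sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Event types that carry one complete message (assumed payload shape).
_WHOLE_MESSAGE_TYPES = {"blocked", "clarification", "repeat_previous"}


def to_sentences(events: Iterable[dict]) -> Iterator[dict]:
    buffer = ""
    for event in events:
        kind = event["type"]
        if kind == "chunk":
            buffer += event["text"]
            # Everything except the last piece is a complete sentence.
            *complete, buffer = _SENTENCE_END.split(buffer)
            for sentence in complete:
                yield {"type": "sentence", "text": sentence}
            continue
        # Any other event: flush buffered text first to preserve order.
        if buffer.strip():
            yield {"type": "sentence", "text": buffer.strip()}
        buffer = ""
        if kind in _WHOLE_MESSAGE_TYPES:
            yield {"type": "sentence", "text": event["text"]}
        else:
            yield event  # final / error pass through unchanged
    if buffer.strip():
        yield {"type": "sentence", "text": buffer.strip()}


answer = list(to_sentences([
    {"type": "chunk", "text": "Visiting hours are 2 to 8 pm. Please "},
    {"type": "chunk", "text": "check in at reception."},
    {"type": "final"},
]))
refusal = list(to_sentences([
    {"type": "blocked", "text": "I can only help with hospital questions."},
]))
```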

What the cascade taught the codebase

Three memory entries were written from this sequence, each codifying a rule for future maintainers:

feedback-openai-tool-choice-none-requires-tools-param: OpenAI's chat.completions.create(...) rejects tool_choice="none" with HTTP 400 when no tools= parameter is provided. Valid combinations are (tools, tool_choice="auto"), (tools, tool_choice="none"), (tools, force-call dict), or NEITHER. The invalid form is tool_choice="none" alone.
feedback-use-logger-exception-not-extra-err: Use logger.exception("event err=%s", exc) for exception logging — NOT logger.error("event", extra={"err": ...}). The project's structlog formatter silently drops extra= keys; format-string interpolation IS captured.
feedback-mocked-llm-tests-miss-sdk-validation: AsyncMock OpenAI clients verify orchestrator BEHAVIOR given a known LLM script, but don't enforce SDK-level argument validation. For LLM-SDK changes, add a real-API smoke test before declaring the deploy done. T5's 9-pattern integration test passed with invalid arguments; production failed on the first SIP turn.
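The logging rule can be reproduced with the standard library alone. The project's structlog formatter is not shown in these notes, so the sketch below uses a plain `logging.Formatter` as a stand-in; it behaves the same way in the one respect that matters here, rendering only the fields named in its format string. The logger name and event name are illustrative.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
# Stand-in formatter: renders level and message, nothing from `extra`.
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("voice_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(handler)

try:
    raise RuntimeError("tool_choice rejected")
except RuntimeError as exc:
    # Pattern to avoid: the error text only lives in `extra`.
    logger.error("streaming_final_failed", extra={"err": str(exc)})
    dropped_output = stream.getvalue()

    stream.seek(0)
    stream.truncate(0)

    # Preferred pattern: interpolate into the message; traceback is appended.
    logger.exception("streaming_final_failed err=%s", exc)
    kept_output = stream.getvalue()
```

With this setup `dropped_output` contains the event name but not the error text, while `kept_output` contains both the error text and the traceback.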

3 · Tier-1 filler grace — calibration in production

Once the architecture was clean, two pilot SIP smoke calls in succession surfaced a tuning problem.

First retune — `38d7f6be` (800 → 1500 ms)

A pilot SIP call asking "Do you speak English?" fired a "Let me search that for you" filler before the actual "Yes, I speak English" response arrived. Investigation:

Backend turn: 622 ms total (LLM-direct, no tool call)
LiveKit + Twilio SIP transport: ~50-200 ms each way
_streamed_answer_spoken=True flipped at ~700-1100 ms
Original tier-1 grace: 800 ms

The threshold landed inside the transport-tail window. Bumped the default to 1500 ms — comfortable margin for direct-LLM answers, RAG-needed turns (2-3.5 s) still trigger the filler.
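The arithmetic behind that conclusion, as a small sketch. It assumes the two transport legs simply add to the backend turn time, which is a simplification; the function names are illustrative and the numbers are the ones listed above.

```python
BACKEND_TURN_MS = 622
TRANSPORT_ONE_WAY_MS = (50, 200)  # (best case, worst case) per direction


def first_audio_window_ms(backend_ms: int, one_way: tuple[int, int]) -> tuple[int, int]:
    """Earliest and latest time at which the caller hears the first sentence."""
    best, worst = one_way
    return backend_ms + 2 * best, backend_ms + 2 * worst


def filler_can_misfire(grace_ms: int, window: tuple[int, int]) -> bool:
    """True if the grace period can expire before the slowest first sentence."""
    return grace_ms < window[1]


window = first_audio_window_ms(BACKEND_TURN_MS, TRANSPORT_ONE_WAY_MS)
old_threshold_misfires = filler_can_misfire(800, window)
new_threshold_misfires = filler_can_misfire(1500, window)
```

The computed window is 722 to 1022 ms, which matches the observed ~700-1100 ms: the old 800 ms threshold sits inside it, and 1500 ms sits above it.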

The deeper insight: the codebase comment at voice_agent/agent.py had calibrated the original 800 ms threshold against two clocks — the 600 ms human-noticed-silence floor and the ~1000 ms first-sentence mark. The third clock — SIP transport latency — was missing from the model.

Second retune — `a54ce8de` (+ language-switch skip)

The next pilot call (NL→EN switch) still fired a filler on turn 1. Investigation showed a 3.3-second gap between voice_agent's voice_turn_start and the backend WS receipt — STT/TTS plugin reload after the language switch. Subsequent turns had 40-50 ms gaps; only the post-switch turn was affected.

The fix added a fourth gate to filler_gate.tier1_should_fire: language_switch_recent, computed from the _last_language_switch_at timestamp that _switch_session_language records. Tier-1 skips entirely if a language switch happened within LANGUAGE_SWITCH_GRACE_WINDOW_S (default 5 s, env-overridable). Tier-2 at 4 s still fires for genuine stalls.

3 unit tests added to test_filler_gate.py: blocks-when-recent, fires-when-not-recent, default-False (backward-compat).

What the calibration taught

The four gate conditions on tier-1 are independent: (1) RAG still in flight, (2) first chunk not yet spoken, (3) NOT a terminal phrase, (4) NO recent language switch. Adding the 4th gate required only:

One new instance attribute (_last_language_switch_at)
One new module constant (LANGUAGE_SWITCH_GRACE_WINDOW_S)
One new kwarg on tier1_should_fire with default-False (preserves all existing callers)
One timestamp assignment in _switch_session_language
Three new unit tests

The pure-module pattern in voice_agent/filler_gate.py paid off: the dispatch loop in agent.py only needed to compute the boolean and pass it. The gate predicate itself is unit-testable in 0.01 s without LiveKit fixtures.
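A sketch of what such a pure predicate might look like follows. Only `tier1_should_fire`, the `language_switch_recent` kwarg, and `LANGUAGE_SWITCH_GRACE_WINDOW_S` come from these notes; the other parameter names and the `switched_recently` helper are assumptions made for illustration, and the real signature may differ.

```python
from typing import Optional

LANGUAGE_SWITCH_GRACE_WINDOW_S = 5.0  # default from the notes


def switched_recently(
    last_switch_at: Optional[float],
    now: float,
    window_s: float = LANGUAGE_SWITCH_GRACE_WINDOW_S,
) -> bool:
    """Hypothetical helper: True if a language switch happened within the window."""
    if last_switch_at is None:
        return False
    return (now - last_switch_at) < window_s


def tier1_should_fire(
    *,
    rag_in_flight: bool,
    first_chunk_spoken: bool,
    is_terminal_phrase: bool,
    language_switch_recent: bool = False,  # default keeps old callers working
) -> bool:
    """All four gates must pass for the tier-1 filler to play."""
    return (
        rag_in_flight
        and not first_chunk_spoken
        and not is_terminal_phrase
        and not language_switch_recent
    )


# The dispatch loop computes the boolean and passes it in.
fires_normally = tier1_should_fire(
    rag_in_flight=True, first_chunk_spoken=False, is_terminal_phrase=False
)
blocked_after_switch = tier1_should_fire(
    rag_in_flight=True,
    first_chunk_spoken=False,
    is_terminal_phrase=False,
    language_switch_recent=switched_recently(last_switch_at=100.0, now=103.3),
)
```

Passing `now` explicitly instead of reading a clock inside the predicate is what keeps the function testable without fixtures.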

4 · Pilot warmup config flip (2026-05-23)

A third pilot SIP call surfaced a separate failure mode: 19 seconds of dead air before the user's actual query reached the backend, with the tier-2 ("Almost there, just another second") and tier-3 ("Almost ready, I have nearly all the information") fillers playing in sequence.

Investigation found VOICE_WARMUP_CACHE_ENABLED=true set in <ENV_FILE>. The agent's warmup_cache() fires 4 FAQ pre-warm queries at the backend at session start — and the docstring at voice_agent/agent.py:warmup_cache explicitly described the failure mode that the user was hearing:

"Disabled by default since the dialogue-manager rollout (2026-04-29): each warmup query now traverses the full dialogue-manager LLM round-trip (1.5-2.5 s per query × 4 queries = 6-10 s). Those calls run concurrently with the caller's first real turn, competing for OpenAI rate-limit budget and CPU, and consistently push the first answer past the 12 s LONG_FILLER_THRESHOLD_S — meaning the caller hears 3 fillers and an offer to transfer before the agent speaks the actual answer."

The flag was re-enabled briefly during a streaming-TTFT validation pass, apparently on the theory that streaming would mitigate the serialization cost. Pilot logs showed it did not — the 4 warmup queries serialize on the backend's OpenAI client / WS handler chain (each 4-7 s back-to-back), and the user's actual query waits 15-20 s behind the warmup batch.
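The queueing effect can be illustrated with a toy model. The actual serialization point has not been investigated, so the single `asyncio.Lock` below is purely a stand-in for whatever shared resource the requests contend on; the handler, task names, and the 10 ms cost are all made up for the example.

```python
import asyncio


async def handle(name: str, cost_s: float, lock: asyncio.Lock, order: list[str]) -> None:
    """Toy request handler: one request at a time holds the shared resource."""
    async with lock:
        await asyncio.sleep(cost_s)
        order.append(name)


async def simulate() -> list[str]:
    lock = asyncio.Lock()
    order: list[str] = []
    # Four warmup queries are issued at session start.
    warmups = [
        asyncio.create_task(handle(f"warmup-{i}", 0.01, lock, order))
        for i in range(4)
    ]
    await asyncio.sleep(0)  # let the warmups reach the lock first
    # The caller's real query arrives just behind them.
    caller = asyncio.create_task(handle("caller", 0.01, lock, order))
    await asyncio.gather(*warmups, caller)
    return order


completion_order = asyncio.run(simulate())
```

In this model the caller's query completes last regardless of whether the responses stream, because it cannot start until all four warmups have released the shared resource.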

Fix: flag returned to false. No code change.

The lesson is preserved in the agent.py docstring: when re-enabling a flag whose docstring describes failure modes, verify those failure modes are actually addressed before flipping. The underlying WS / OpenAI-client serialization has not been investigated; that work is deferred until needed.

5 · Pilot disk cleanup

Routine housekeeping. docker system df showed 68.93 GB in images + 33.23 GB in build cache; 367 image tags accumulated over months of deploys.

Metric	Before	After
Total Docker images	367	17
Image size	68.93 GB	34.11 GB
Build cache	33.23 GB	0 B
Disk used (/)	77 GB (36%)	45 GB (21%)

~32 GB freed. Kept: zol-rag-app:296031ee (running), zol-rag-app:0afe7160 (rollback), zol-voice-agent:latest (running), plus 14 system images (postgres, redis, keycloak, minio, etc.). All older zol-rag-app:* tags and zol-voice-agent:* tags removed.

Rollback

Each of the four changes below is independently reversible without a redeploy: edit <ENV_FILE> and restart the relevant container.

Concern	Override
Tier-1 grace bump (38d7f6be)	`VOICE_STREAMING_FIRST_SENTENCE_GRACE_MS=800` + restart `zol-voice-agent`
Language-switch tier-1 skip (a54ce8de)	`VOICE_LANGUAGE_SWITCH_GRACE_S=0` + restart `zol-voice-agent`
LLM-first agentic streaming (ADR-0053)	`VOICE_STREAMING_ENABLED=false` (falls back to sync `query()` path) + restart `zol-app` AND `zol-voice-agent`
Warmup cache	`VOICE_WARMUP_CACHE_ENABLED=true` (re-enables the regression; not recommended) + restart `zol-voice-agent`
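For orientation, a sketch of how overrides like these are typically read at process start. The variable names come from the table above; the helper functions, the accepted truthy strings, and the assumption that `VOICE_STREAMING_ENABLED` defaults to true are illustrative, not taken from the codebase.

```python
import os
from typing import Mapping


def env_float(name: str, default: float, env: Mapping[str, str] = os.environ) -> float:
    """Read a numeric override, falling back to the default when unset or empty."""
    raw = env.get(name, "").strip()
    return float(raw) if raw else default


def env_flag(name: str, default: bool, env: Mapping[str, str] = os.environ) -> bool:
    """Read a boolean override; the accepted spellings are an assumption."""
    raw = env.get(name, "").strip().lower()
    if not raw:
        return default
    return raw in {"1", "true", "yes", "on"}


def load_voice_settings(env: Mapping[str, str] = os.environ) -> dict:
    return {
        "first_sentence_grace_ms": env_float("VOICE_STREAMING_FIRST_SENTENCE_GRACE_MS", 1500, env),
        "language_switch_grace_s": env_float("VOICE_LANGUAGE_SWITCH_GRACE_S", 5, env),
        "streaming_enabled": env_flag("VOICE_STREAMING_ENABLED", True, env),  # assumed default
        "warmup_cache_enabled": env_flag("VOICE_WARMUP_CACHE_ENABLED", False, env),
    }


defaults = load_voice_settings({})
rolled_back = load_voice_settings({
    "VOICE_STREAMING_FIRST_SENTENCE_GRACE_MS": "800",
    "VOICE_LANGUAGE_SWITCH_GRACE_S": "0",
})
```

Because settings are read once at start, each override only takes effect after the container restart noted in the table.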

What's next

Backend WS / OpenAI-client serialization investigation. The warmup-disabled config is a workaround; the underlying choke point on the backend's voice WS handler is the real fix. Deferred until needed.
Re-run Gate 1 persona eval on a54ce8de. The prior eval on 0afe7160 returned 18/88 (broken HTTP path; fixed in 296031ee). The 2026-05-23 voice eval result will be captured in the next release window.
VoiceCallMetadata literal_error. app/api/voice_calls.py:168 constructs VoiceCallMetadata with channel='voice_sip' but the Pydantic model still expects the older Literal['web', 'voice']. SIP calls' /internal/voice/calls/{id}/end returns 500 → hangup_reason / duration_seconds / outcome columns stay NULL. Cosmetic (call still works, messages persist) but breaks SIP analytics. Tracked for the next maintenance window.

Memory entries shipped

Three new feedback memories codify rules surfaced by the hotfix cascade. See ~/.claude/projects/-Users-soft4u-Development-zol-rag/memory/:

feedback-openai-tool-choice-none-requires-tools-param.md
feedback-use-logger-exception-not-extra-err.md
feedback-mocked-llm-tests-miss-sdk-validation.md
