Release Notes: May 22 – 23, 2026

LLM-First Agentic Voice · 4 Hotfixes · Filler Race Calibration

~10 commits | 2 days | ADR-0053 accepted | 1 user-found regression chain (4 hotfixes) | tier-1 grace tuned twice in 12 hours | 32 GB pilot disk reclaimed

This window completed the trust-LLM simplification sprint that began with the 11-task plan brainstormed on 2026-05-22 and landed as the LLM-first agentic voice pipeline (ADR-0053). The shape of the work was unusual: the design was approved cleanly, but the production reality surfaced four distinct failure modes during pilot SIP smoke-testing, each requiring a focused hotfix. The release note below documents both halves — the architectural change and the empirical calibration that followed — because the second half is the more transferable lesson.

The headline themes:

  1. LLM-first agentic dispatch (ADR-0053). Replaced the spec's two-call pattern (non-streaming tool decision → separate streaming final) with native OpenAI streaming-with-tools — a single chat.completions.create(stream=True, tools=_TOOLS, tool_choice="auto") call per tool-decision iteration. For direct-response queries this halves OpenAI round-trips.
  2. Hotfix cascade (4 fixes, landed between the evening of 2026-05-22 and the morning of 2026-05-23). Citation serialization, blocked-event sentence-buffer gap, tool_choice="none" OpenAI API rejection, two-call latency. Each fix was a single-commit landing; the cascade itself is the lesson, not any individual diff.
  3. Tier-1 filler grace tuned twice (800 → 1500 ms → language-switch skip). Pilot SIP testing surfaced a "third clock" problem — transport latency on the Twilio/LiveKit SIP path — that the original threshold hadn't accounted for. A second retune added a 5-second skip window for the post-language-switch STT/TTS reload.
  4. Pilot warmup config flip. VOICE_WARMUP_CACHE_ENABLED=true was set during an earlier streaming-TTFT validation; the agent.py docstring explicitly warned this would cause "3 fillers and an offer to transfer before the agent speaks the actual answer." Pilot SIP audit reproduced exactly that. Flag returned to false.
  5. Pilot disk cleanup. 350 stale Docker image tags + 33 GB of build cache removed; 32 GB of disk reclaimed without touching active images.

1 · ADR-0053 — LLM-First Agentic Voice Pipeline

The spec at docs/superpowers/specs/2026-05-22-llm-first-agentic-design.md ratified what the prior architecture had been drifting toward: the LLM, not a regex pre-filter, decides whether to engage in dialogue, ask clarification, or call into the RAG. The system prompt's AVAILABLE TOOLS section documents three tools (search_hospital_kb, transfer_to_helpdesk, end_call) and 11 dialogue examples spanning capability questions, greetings, ambiguous follow-ups, meta-questions, ZOL info (tool call), educational (tool call), prescriptive refusal, handoff, farewell, off-topic redirect, and phone-not-in-corpus.

The landing was clean: a044e813 (prompt rewrite) → 15c596b5 (ADR) → 62ba676d (9-pattern integration test).

The original design vs the production reality

The spec's Section 2 specified a two-call pattern:

  1. Non-streaming chat.completions.create() for the tool-decision iteration loop
  2. Separate streaming chat.completions.create(stream=True, tool_choice="none") for the final answer

This shape passed the 9-pattern integration test cleanly because AsyncMock OpenAI clients don't enforce SDK-level argument validation. Production traffic hit the actual API and immediately got HTTP 400: "tool_choice is only valid when tools is provided." The second-call pattern needed either tools=_TOOLS (with explicit "you have tools but I'm forbidding them this turn") OR no tool_choice parameter at all. Hotfix #2 (0afe7160) removed the invalid tool_choice kwarg.
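
One way to keep a mocked client from hiding this class of error is to give the mock a side effect that checks the arguments first. The sketch below is illustrative only: check_tool_args, make_strict_client, and call are hypothetical helpers, the model name is a placeholder, and the check covers only the one rule described above (tool_choice without tools), not the SDK's full validation.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


def check_tool_args(**kwargs):
    """Reject the combination the pilot hit: tool_choice without tools."""
    if "tool_choice" in kwargs and not kwargs.get("tools"):
        raise ValueError("tool_choice supplied without tools")
    # Placeholder reply; a real test would return a scripted completion here.
    return {"ok": True}


def make_strict_client():
    """Mock client whose create() validates its kwargs before answering."""
    client = MagicMock()
    client.chat.completions.create = AsyncMock(side_effect=check_tool_args)
    return client


async def call(client, **kwargs):
    return await client.chat.completions.create(model="placeholder-model", **kwargs)


def demo():
    client = make_strict_client()
    # Valid: tools present, tool use forbidden for this turn.
    ok = asyncio.run(call(client, tools=[{"type": "function"}], tool_choice="none"))
    try:
        # Invalid: the shape the two-call pattern sent in production.
        asyncio.run(call(client, tool_choice="none"))
    except ValueError as exc:
        return ok, str(exc)
    return ok, None
```

A guard like this narrows the gap but does not replace a real-API smoke test, since it only encodes rules someone already knows about.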

But the deeper problem was latency. Direct-response queries (such as "Do you speak English?", which the LLM can answer without RAG) made two OpenAI round-trips under the two-call pattern. Because the LLM did not start streaming until the tool-decision call had returned, the 800 ms tier-1 filler fired on every direct-response turn.

Hotfix #4 (296031ee) replaced the spec's option (b) with option (a): one streaming-with-tools call per iteration. The LLM emits content deltas if no tool is needed, or tool-call deltas if one is. The orchestrator parses both shapes from a single stream. ONE call per direct-response, N+1 calls (one per tool iteration + one final synthesis) for tool-using queries — all streamed end-to-end.

spec option (b): direct-response = 2 OpenAI calls, first sentence ~1500-2500 ms
shipped (a): direct-response = 1 OpenAI call, first sentence ~600-900 ms
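
Parsing both shapes from one stream might look roughly like the sketch below. It assumes chunks shaped like the OpenAI SDK's streaming deltas (choices[0].delta.content and delta.tool_calls with index, id, and function.name / function.arguments fragments); consume_stream, StreamResult, and the fake_* helpers are hypothetical names, and the "query" argument key is invented for the example. The real orchestrator presumably emits sentences as deltas arrive rather than accumulating to the end.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class StreamResult:
    text: str = ""
    # tool-call index -> {"id", "name", "arguments"} assembled from fragments
    tool_calls: dict = field(default_factory=dict)


def consume_stream(chunks):
    """Fold one stream into either answer text or assembled tool calls."""
    result = StreamResult()
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        if getattr(delta, "content", None):
            result.text += delta.content
        for tc in getattr(delta, "tool_calls", None) or []:
            slot = result.tool_calls.setdefault(
                tc.index, {"id": None, "name": "", "arguments": ""}
            )
            if getattr(tc, "id", None):
                slot["id"] = tc.id
            fn = getattr(tc, "function", None)
            if fn is not None:
                slot["name"] += getattr(fn, "name", None) or ""
                # Arguments arrive as JSON fragments; concatenate, parse later.
                slot["arguments"] += getattr(fn, "arguments", None) or ""
    return result


def fake_chunk(content=None, tool_calls=None):
    """Test stand-in for a streamed chunk (not an SDK type)."""
    delta = SimpleNamespace(content=content, tool_calls=tool_calls)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def fake_tool_delta(index, id=None, name=None, arguments=None):
    fn = SimpleNamespace(name=name, arguments=arguments)
    return SimpleNamespace(index=index, id=id, function=fn)
```

If tool_calls comes back empty, the text is the direct answer and no further call is needed; otherwise the orchestrator runs the tools and opens the next iteration.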

2 · The four-hotfix cascade

Hotfixes landed in this order on 2026-05-22 evening through 2026-05-23 morning. Each was triggered by a specific failure mode in pilot SIP smoke-testing; each is its own commit with its own test pin.

| # | Commit | Failure mode | Fix shape |
|---|--------|--------------|-----------|
| 1 | 8f9769df | TypeError: Object of type Citation is not JSON serializable at public_websocket.py:779. Cascade: WS connection died → asyncpg pool slot poisoned → "Can't reconnect until invalid transaction is rolled back" on next request | Wrap event in jsonable_encoder() before send_json() |
| 1b | a47a195a | T7 query_stream only buffered chunk events; blocked / clarification / repeat_previous were forwarded unchanged. voice_agent's consumer only handles sentence / final / error → never spoke off-topic refusals | Extended sentence buffer to transform all 4 event types |
| 2 | 0afe7160 | voice_llm_streaming_final_failed on every turn. Root cause: tool_choice="none" without tools= parameter → OpenAI HTTP 400 | Removed invalid tool_choice="none" from streaming call |
| 3 | (logging fix folded into #2) | Diagnostics gap: logger.error("event", extra={"err":...}) silently dropped the extra= keys in the project's structlog formatter | Switched to logger.exception("event err=%s", exc) |
| 4 | 296031ee | HTTP /api/v1/query returned empty answer (Gate 1 persona eval showed 18/88 turns, all empty). Sync query() wrapper consumed chunk events, but the T3 sentence-buffer wrapper only emitted sentence events. Plus: filler always fired on direct-response queries because the two-call pattern took >800 ms | Native streaming-with-tools (option a); sync wrapper now accumulates from sentence events |
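
The fixes in rows 1b and 4 share one idea: everything speakable is normalized to sentence events, and every consumer reads sentence events. A minimal sketch of that idea follows, assuming events are dicts with "type" and "text" keys. The event shapes, to_sentence_events, and sync_answer are assumptions made for illustration, not the project's actual code.

```python
import re

# Upstream event types that carry text the caller should hear.
SPEAKABLE_TYPES = {"chunk", "blocked", "clarification", "repeat_previous"}
# Split after sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def to_sentence_events(events):
    """Re-emit speakable events as sentence events; pass final/error through."""
    buffer = ""
    for event in events:
        kind = event["type"]
        if kind in SPEAKABLE_TYPES:
            buffer += event.get("text", "")
            # Everything except the last piece is a complete sentence.
            *complete, buffer = _SENTENCE_END.split(buffer)
            for sentence in complete:
                yield {"type": "sentence", "text": sentence}
        elif kind in ("final", "error"):
            if buffer.strip():
                yield {"type": "sentence", "text": buffer.strip()}
                buffer = ""
            yield event
    if buffer.strip():
        # Stream ended without a final event: flush what is left.
        yield {"type": "sentence", "text": buffer.strip()}


def sync_answer(events):
    """Sync wrapper: build the full answer from sentence events only."""
    return " ".join(
        e["text"] for e in to_sentence_events(events) if e["type"] == "sentence"
    )
```

With this shape, a blocked or clarification event can no longer slip past a consumer that only understands sentence, final, and error.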

What the cascade taught the codebase

Three memory entries were written from this sequence, each codifying a rule for future maintainers:

  • feedback-openai-tool-choice-none-requires-tools-param: OpenAI's chat.completions.create(...) rejects tool_choice="none" with HTTP 400 when no tools= parameter is provided. Valid combinations are (tools, tool_choice="auto"), (tools, tool_choice="none"), (tools, force-call dict), or NEITHER. The invalid form is tool_choice="none" alone.
  • feedback-use-logger-exception-not-extra-err: Use logger.exception("event err=%s", exc) for exception logging — NOT logger.error("event", extra={"err": ...}). The project's structlog formatter silently drops extra= keys; format-string interpolation IS captured.
  • feedback-mocked-llm-tests-miss-sdk-validation: AsyncMock OpenAI clients verify orchestrator BEHAVIOR given a known LLM script, but don't enforce SDK-level argument validation. For LLM-SDK changes, add a real-API smoke test before declaring the deploy done. T5's 9-pattern integration test passed with invalid arguments; production failed on the first SIP turn.
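
The logging rule can be reproduced with the standard library alone. The sketch below does not use the project's structlog formatter; it assumes a plain logging.Formatter whose format string never references the extra keys, which leads to the same symptom. build_logger, log_both_ways, and the event name are hypothetical.

```python
import io
import logging


def build_logger(stream):
    """Logger whose format string knows nothing about ad-hoc extra keys."""
    logger = logging.getLogger("voice_demo")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger


def log_both_ways(logger):
    try:
        raise RuntimeError("tool_choice rejected")
    except RuntimeError as exc:
        # Discouraged: the detail rides in extra=, which this formatter
        # never renders, so the log line carries no cause.
        logger.error("streaming_final_failed", extra={"err": str(exc)})
        # Recommended: the detail is interpolated into the message, and
        # logger.exception also appends the traceback.
        logger.exception("streaming_final_failed err=%s", exc)


if __name__ == "__main__":
    out = io.StringIO()
    log_both_ways(build_logger(out))
    print(out.getvalue())
```

The first line of output names the event and nothing else; the second carries the error text and is followed by the traceback.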

3 · Tier-1 filler grace — calibration in production

Once the architecture was clean, two pilot SIP smoke calls in succession surfaced a tuning problem.

First retune — 38d7f6be (800 → 1500 ms)

A pilot SIP call asking "Do you speak English?" fired a "Let me search that for you" filler before the actual "Yes, I speak English" response arrived. Investigation:

  • Backend turn: 622 ms total (LLM-direct, no tool call)
  • LiveKit + Twilio SIP transport: ~50-200 ms each way
  • _streamed_answer_spoken=True flipped at ~700-1100 ms
  • Original tier-1 grace: 800 ms

The threshold landed inside the transport-tail window. The default was raised to 1500 ms, which leaves a comfortable margin for direct-LLM answers while RAG-needed turns (2-3.5 s) still trigger the filler.

The deeper insight: the codebase comment at voice_agent/agent.py had calibrated the original 800 ms threshold against two clocks — the 600 ms human-noticed-silence floor and the ~1000 ms first-sentence mark. The third clock — SIP transport latency — was missing from the model.
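
The arithmetic behind the third clock can be written down as a rough model. This sketch assumes the "spoken" flag flips after the backend turn plus one transport hop in each direction, which is a simplification; spoken_window_ms and filler_can_misfire are hypothetical helpers, and the inputs are the figures from the pilot call above.

```python
def spoken_window_ms(backend_ms, transport_one_way_ms):
    """(earliest, latest) time at which the first answer is flagged as spoken."""
    low, high = transport_one_way_ms
    # One hop carries the request in, one carries the audio out.
    return backend_ms + 2 * low, backend_ms + 2 * high


def filler_can_misfire(grace_ms, window):
    """True if the grace timer can expire before the answer is marked spoken."""
    _, latest = window
    return grace_ms < latest


# Backend turn of 622 ms, transport of 50-200 ms each way.
window = spoken_window_ms(622, (50, 200))
old_threshold_at_risk = filler_can_misfire(800, window)
new_threshold_at_risk = filler_can_misfire(1500, window)
```

Under these assumptions the window is 722-1022 ms, close to the observed ~700-1100 ms: the 800 ms threshold sits inside it and 1500 ms sits clear of it.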

Second retune — a54ce8de (+ language-switch skip)

The next pilot call (NL→EN switch) still fired a filler on turn 1. Investigation showed a 3.3-second gap between voice_agent's voice_turn_start and the backend WS receipt — STT/TTS plugin reload after the language switch. Subsequent turns had 40-50 ms gaps; only the post-switch turn was affected.

The fix added a fourth gate to filler_gate.tier1_should_fire: language_switch_recent, a boolean derived from the _last_language_switch_at timestamp that _switch_session_language records on each switch. Tier-1 skips entirely if a language switch happened within LANGUAGE_SWITCH_GRACE_WINDOW_S (default 5 s, env-overridable). Tier-2 at 4 s still fires for genuine stalls.

3 unit tests added to test_filler_gate.py: blocks-when-recent, fires-when-not-recent, default-False (backward-compat).

What the calibration taught

The four gates on tier-1 are independent. The filler fires only when (1) RAG is still in flight, (2) the first chunk has not yet been spoken, (3) the phrase is not a terminal phrase, and (4) no language switch happened recently. Adding the 4th gate required only:

  • One new instance attribute (_last_language_switch_at)
  • One new module constant (LANGUAGE_SWITCH_GRACE_WINDOW_S)
  • One new kwarg on tier1_should_fire with default-False (preserves all existing callers)
  • One timestamp assignment in _switch_session_language
  • Three new unit tests

The pure-module pattern in voice_agent/filler_gate.py paid off: the dispatch loop in agent.py only needed to compute the boolean and pass it. The gate predicate itself is unit-testable in 0.01 s without LiveKit fixtures.
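
A pure predicate of this kind could look like the sketch below. The function name tier1_should_fire, the language_switch_recent kwarg, and the constant come from the text above; the other parameter names, the switched_recently helper, and the use of a monotonic clock are assumptions, and the real signature may differ.

```python
import time

LANGUAGE_SWITCH_GRACE_WINDOW_S = 5.0  # default stated in the release note


def switched_recently(last_switch_at, now=None, window_s=LANGUAGE_SWITCH_GRACE_WINDOW_S):
    """Derive the boolean from the timestamp stored at language-switch time."""
    if last_switch_at is None or window_s <= 0:
        # No switch yet, or the skip window has been disabled via override.
        return False
    now = time.monotonic() if now is None else now
    return (now - last_switch_at) < window_s


def tier1_should_fire(rag_in_flight, first_chunk_spoken, is_terminal_phrase,
                      language_switch_recent=False):
    """All four gates must pass; the new kwarg defaults to the old behavior."""
    return (
        rag_in_flight
        and not first_chunk_spoken
        and not is_terminal_phrase
        and not language_switch_recent
    )
```

The dispatch loop would compute switched_recently(...) once per turn and pass the result in, so the predicate stays free of clocks and LiveKit state.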


4 · Pilot warmup config flip (2026-05-23)

A third pilot SIP call surfaced a separate failure mode: 19 seconds of dead air before the user's actual query reached the backend, with tier-2 ("Almost there, just another second") and tier-3 ("Almost ready, I have nearly all the information") fillers playing in sequence.

Investigation found VOICE_WARMUP_CACHE_ENABLED=true set in /opt/zol-rag/.env.prod. The agent's warmup_cache() fires 4 FAQ pre-warm queries at the backend at session start — and the docstring at voice_agent/agent.py:warmup_cache explicitly described the failure mode that the user was hearing:

"Disabled by default since the dialogue-manager rollout (2026-04-29): each warmup query now traverses the full dialogue-manager LLM round-trip (1.5-2.5 s per query × 4 queries = 6-10 s). Those calls run concurrently with the caller's first real turn, competing for OpenAI rate-limit budget and CPU, and consistently push the first answer past the 12 s LONG_FILLER_THRESHOLD_S — meaning the caller hears 3 fillers and an offer to transfer before the agent speaks the actual answer."

The flag was re-enabled briefly during a streaming-TTFT validation pass, apparently on the theory that streaming would mitigate the serialization cost. Pilot logs showed it did not — the 4 warmup queries serialize on the backend's OpenAI client / WS handler chain (each 4-7 s back-to-back), and the user's actual query waits 15-20 s behind the warmup batch.

Fix: flag returned to false. No code change.

The lesson is preserved in the agent.py docstring: when re-enabling a flag whose docstring describes failure modes, verify that those failure modes have actually been addressed before flipping it. The underlying WS / OpenAI-client serialization has not been investigated; that work is deferred until needed.


5 · Pilot disk cleanup

Routine housekeeping. docker system df showed 68.93 GB in images + 33.23 GB in build cache; 367 image tags accumulated over months of deploys.

| Metric | Before | After |
|--------|--------|-------|
| Total Docker images | 367 | 17 |
| Image size | 68.93 GB | 34.11 GB |
| Build cache | 33.23 GB | 0 B |
| Disk used (/) | 77 GB (36%) | 45 GB (21%) |

~32 GB freed. Kept: zol-rag-app:296031ee (running), zol-rag-app:0afe7160 (rollback), zol-voice-agent:latest (running), plus 14 system images (postgres, redis, keycloak, minio, etc.). All older zol-rag-app:* tags and zol-voice-agent:* tags removed.


Rollback

Each of the four changes below is independently reversible without a redeploy. Edit /opt/zol-rag/.env.prod and restart the relevant container.

| Concern | Override |
|---------|----------|
| Tier-1 grace bump (38d7f6be) | VOICE_STREAMING_FIRST_SENTENCE_GRACE_MS=800 + restart zol-voice-agent |
| Language-switch tier-1 skip (a54ce8de) | VOICE_LANGUAGE_SWITCH_GRACE_S=0 + restart zol-voice-agent |
| LLM-first agentic streaming (ADR-0053) | VOICE_STREAMING_ENABLED=false (falls back to sync query() path) + restart zol-app AND zol-voice-agent |
| Warmup cache | VOICE_WARMUP_CACHE_ENABLED=true (re-enables the regression; not recommended) + restart zol-voice-agent |
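
For orientation, the four overrides could be read at startup along the lines of the sketch below. The variable names are the ones listed above, and the 1500 ms, 5 s, and warmup-off defaults come from these notes. The VoiceSettings class, the loader, the accepted truthy spellings, and the streaming-on default are assumptions, not the project's actual configuration code.

```python
import os
from dataclasses import dataclass


def _as_bool(raw, default):
    """Parse a flag value; unset means 'use the default'."""
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class VoiceSettings:
    first_sentence_grace_ms: int
    language_switch_grace_s: float
    streaming_enabled: bool
    warmup_cache_enabled: bool


def load_voice_settings(env=None):
    """Build settings from a mapping such as os.environ."""
    env = os.environ if env is None else env
    return VoiceSettings(
        first_sentence_grace_ms=int(env.get("VOICE_STREAMING_FIRST_SENTENCE_GRACE_MS", "1500")),
        language_switch_grace_s=float(env.get("VOICE_LANGUAGE_SWITCH_GRACE_S", "5")),
        # Default of True is an assumption; the notes only document the override.
        streaming_enabled=_as_bool(env.get("VOICE_STREAMING_ENABLED"), True),
        warmup_cache_enabled=_as_bool(env.get("VOICE_WARMUP_CACHE_ENABLED"), False),
    )
```

Because values are read once at process start in this sketch, an edit to .env.prod would only take effect after the container restart listed in the table.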

What's next

  • Backend WS / OpenAI-client serialization investigation. The warmup-disabled config is a workaround; the underlying choke point on the backend's voice WS handler is the real fix. Deferred until needed.
  • Re-run Gate 1 persona eval on a54ce8de. The prior eval on 0afe7160 returned 18/88 (broken HTTP path; fixed in 296031ee). The 2026-05-23 voice eval result will be captured in the next release window.
  • VoiceCallMetadata literal_error. app/api/voice_calls.py:168 constructs VoiceCallMetadata with channel='voice_sip' but the Pydantic model still expects the older Literal['web', 'voice']. SIP calls' /internal/voice/calls/{id}/end returns 500 → hangup_reason / duration_seconds / outcome columns stay NULL. Cosmetic (call still works, messages persist) but breaks SIP analytics. Tracked for the next maintenance window.
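
The literal_error item comes down to a value outside a closed set. The sketch below reproduces the check with the standard library rather than Pydantic, which performs this validation itself. OldChannel mirrors the Literal quoted above, while WidenedChannel and validate_channel are hypothetical; widening the Literal is one possible fix, not necessarily the one planned.

```python
from typing import Literal, get_args

# The set the model currently accepts, per the item above.
OldChannel = Literal["web", "voice"]
# Hypothetical widened set that would admit SIP calls.
WidenedChannel = Literal["web", "voice", "voice_sip"]


def validate_channel(value, allowed):
    """Stand-in for the model's Literal check on the channel field."""
    options = get_args(allowed)
    if value not in options:
        raise ValueError(f"channel must be one of {options}, got {value!r}")
    return value


def is_accepted(value, allowed):
    try:
        validate_channel(value, allowed)
    except ValueError:
        return False
    return True
```

Here is_accepted("voice_sip", OldChannel) is False, which corresponds to the 500 on the end-of-call endpoint, and the same value passes against the widened set.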

Memory entries shipped

Three new feedback memories codify rules surfaced by the hotfix cascade. See ~/.claude/projects/-Users-soft4u-Development-zol-rag/memory/:

  • feedback-openai-tool-choice-none-requires-tools-param.md
  • feedback-use-logger-exception-not-extra-err.md
  • feedback-mocked-llm-tests-miss-sdk-validation.md