Long-utterance acknowledgment (A.1-lite)
Designed and implemented 2026-04-25. Final item of the Q2 naturalness sprint; follows A.2 (context-aware filler), A.4 (prosody injection), and A.3 (adaptive TTS speed).
Default ON in the container; set VOICE_LISTENING_ACK_ENABLED=false to disable.
Note: this is the lightweight variant of A.1 (full acoustic backchannels). The during-speech variant was scoped out after risk analysis — see "Why lite, not full" below.
Why mid-utterance acknowledgments
A long caller utterance with a mid-thought pause needs a different signal than the post-end-of-turn filler ladder. Tier-1/2/3 fillers ("Een ogenblikje" / "Ik ben nog aan het zoeken" / "Het duurt wat langer", roughly "One moment" / "I'm still searching" / "It's taking a bit longer") fire AFTER the caller stops speaking; the listening-ack fires DURING the caller's pause, BEFORE the orchestrator runs. The two mechanisms are complementary:
| Mechanism | When | Purpose | Cost |
|---|---|---|---|
| Listening ack ("ik luister hoor") | During mid-thought caller pause (≥ 1.2 s stable interim) | "I'm tracking your long utterance, please continue" | One TTS line per qualifying turn |
| Tier-1/2/3 fillers | After STT-final, while RAG runs | "I heard your question, working on it" | TTS line at 300 ms / 4 s / 10 s thresholds |
The two mechanisms must not stack: the ack fires before the caller has technically finished speaking; the filler fires after. Distress-detection overrides both — distress always wins.
What it is
When a caller gives a long utterance and pauses mid-thought, the agent emits a single low-key acknowledgment so the caller knows they're being tracked:
| Caller utterance | Agent ack (after the mid-thought pause) |
|---|---|
| "Vorig jaar lag ik een week in het ziekenhuis voor een operatie en ik weet niet meer welke arts dat was…" [1.3s pause] | "Ik luister hoor." |
| "I had surgery last year and I'm trying to remember which doctor performed it…" [1.3s pause] | "Please go on." |
The ack lands during the caller's pause — before they've technically finished speaking — so it functions as listening reassurance, not as a turn-taking move.
Why lite, not full
Full A.1 (acoustic backchannels DURING caller speech) was originally on the roadmap. Risk analysis identified five problems specific to the hospital context:
- Catastrophic safety failure: backchannels during distress utterances ("I want to die…" → "mm-hmm") are not just inappropriate; they read as encouragement to continue describing self-harm.
- STT echo interference: agent's mm-hmm during caller speech leaks back into Deepgram on phone calls with imperfect echo cancellation.
- Mistimed-ack patronizing risk: 90% well-timed + 10% mistimed is worse than no backchannels because catastrophic failures are remembered (loss aversion).
- The existing tier-1/2/3 filler ladder already addresses 70% of the perceived gap for the post-end-of-turn phase.
- No measurable ROI — full A.1's benefit is purely subjective UX, while predictive caching (B.4) has hard latency wins.
A.1-lite captures the long-utterance reassurance piece (the production-meaningful subset) at ~10% of the engineering cost and ~5% of the deployment risk by firing AFTER the caller's mid-thought pause instead of DURING their speech. Because agent and caller audio are sequential rather than overlapping, there is no need for precise VAD timing, no echo problem, and no audio mixing.
Trigger
Five gates, all must clear:
| Gate | Threshold | Reason |
|---|---|---|
| Word count in interim | ≥12 words | Rejects normal short questions ("Wat zijn de bezoekuren?" = 4 words) |
| Stable interim duration | ≥1.2 s | Rejects normal cadence pauses (500-800 ms); fires before VAD end-of-turn (~1.5 s) |
| Throttle since last ack | ≥10 s | Matches human phone-call backchannel cadence (Yngve 1970, Switchboard corpus) |
| Per-turn fire cap | <3 fires | Prevents nagging on >60 s monologues |
| Distress override | not active | Safety path always wins — no acks during distress utterances |
A 90 s monologue gets at most 3 acks (for example at ~10 s, ~20 s, and ~30 s, if the caller pauses long enough at those points); the remaining 60 s carries momentum without further interruption.
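The five gates can be sketched as one pure function. This is a minimal illustration under stated assumptions, not the contents of voice_agent/listening_ack.py: the names AckState, should_fire_ack, and record_fire are hypothetical, and only the thresholds are taken from the gate table.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds from the gate table.
MIN_WORDS = 12            # word-count floor for the interim transcript
MIN_STABLE_S = 1.2        # interim must be unchanged for this long
THROTTLE_S = 10.0         # minimum gap since the previous ack
MAX_FIRES_PER_TURN = 3    # per-turn fire cap


@dataclass
class AckState:
    """Hypothetical per-turn state; field names are illustrative."""
    last_ack_at: Optional[float] = None  # timestamp (seconds) of the last ack
    fires_in_turn: int = 0


def should_fire_ack(interim: str, stable_s: float, now: float,
                    state: AckState, distress_active: bool) -> bool:
    """Return True only when all five gates clear."""
    if distress_active:                              # safety path always wins
        return False
    if len(interim.split()) < MIN_WORDS:             # short question
        return False
    if stable_s < MIN_STABLE_S:                      # normal cadence pause
        return False
    if state.fires_in_turn >= MAX_FIRES_PER_TURN:    # per-turn cap reached
        return False
    if state.last_ack_at is not None and now - state.last_ack_at < THROTTLE_S:
        return False                                 # throttled
    return True


def record_fire(state: AckState, now: float) -> None:
    """Update the state after an ack has been emitted."""
    state.last_ack_at = now
    state.fires_in_turn += 1
```

Keeping the decision free of I/O means each gate can be unit-tested with plain values. In this sketch the caller supplies `now` (for example from `time.monotonic()`) and calls `record_fire` after the ack is actually emitted.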
Templates (4 per language × 4 languages = 16 variants)
nl: "Ik luister hoor." / "Ga gerust verder." / "Ja, ik ben er." / "Ik volg het."
en: "I'm listening." / "Please go on." / "I'm with you." / "I'm following."
fr: "Je vous écoute." / "Continuez, je vous en prie." / "Je suis là." / "Je vous suis."
it: "La ascolto." / "Prego, continui." / "Sono qui." / "La seguo."
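random.choice picks a template on each fire, which varies the wording in multi-ack turns. On its own it does not rule out the same template being chosen twice in a row.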
fr/it use formal vous / Lei register to match hospital-context
formality.
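Template selection with the NL fallback might look like the sketch below. The table contents come from the list above; the function name pick_ack, the rng parameter, and the avoid parameter are assumptions added for illustration. In particular, excluding the previously used template is not described in the design and is shown only as one way to guarantee no back-to-back repeat.

```python
import random
from typing import Optional

ACK_TEMPLATES = {
    "nl": ["Ik luister hoor.", "Ga gerust verder.", "Ja, ik ben er.", "Ik volg het."],
    "en": ["I'm listening.", "Please go on.", "I'm with you.", "I'm following."],
    "fr": ["Je vous écoute.", "Continuez, je vous en prie.", "Je suis là.", "Je vous suis."],
    "it": ["La ascolto.", "Prego, continui.", "Sono qui.", "La seguo."],
}
FALLBACK_LANGUAGE = "nl"  # unsupported languages fall back to NL


def pick_ack(language: str,
             rng: Optional[random.Random] = None,
             avoid: Optional[str] = None) -> str:
    """Pick an ack template for `language`, falling back to NL.

    `avoid` (hypothetical) excludes the template used for the previous ack.
    """
    templates = ACK_TEMPLATES.get(language, ACK_TEMPLATES[FALLBACK_LANGUAGE])
    # If filtering would leave nothing, fall back to the full list.
    candidates = [t for t in templates if t != avoid] or templates
    chooser = rng if rng is not None else random
    return chooser.choice(candidates)
```

Passing a seeded `random.Random` makes the choice reproducible in tests without touching the global random state.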
Coexistence
| Existing feature | Coexistence |
|---|---|
| Tier-1 filler (300 ms post-end-of-turn) | No conflict — different temporal phase |
| Tier-2/3 filler | Same as tier-1 |
| Speculation (Phase 4.3+4.4) | Both fire on same turn; speculation hits backend, ack emits TTS — additive |
| Context-aware filler (A.2) | A.1-lite doesn't read or override _partial_topic |
| Adaptive TTS speed (A.3) | Ack inherits whatever speed is currently active |
| Distress detection | A.1-lite skips entirely — distress overrides everything |
Failure modes
| Scenario | Behavior |
|---|---|
| Caller speaks German (no template) | Falls back to NL templates — established get_fillers fallback policy |
| TTS raises during ack delivery | Caught + logged; ack drops silently. No turn impact |
| Ack fires; caller resumes mid-ack | allow_interruptions=True — caller barges in cleanly |
| Distress fires DURING ack | Distress handoff barges in; ack truncates (correct prioritization) |
| Feature flag flipped at runtime | Read once at module import; needs container restart |
No path crashes or degrades safety.
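The "TTS raises during ack delivery" row can be illustrated with a small wrapper. This is a sketch, not the integration code in agent.py: `say` is a hypothetical stand-in for whatever coroutine actually speaks the line, and the logger name and the voice_listening_ack_failed message are assumptions.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("voice_agent.listening_ack")  # assumed logger name


async def deliver_ack(say: Callable[[str], Awaitable[None]], text: str) -> bool:
    """Speak one ack; on any TTS error, log it and drop the ack.

    Returns True if the ack was spoken, False if it was dropped.
    The turn itself is never affected by a failed ack.
    """
    try:
        await say(text)
        return True
    except Exception:
        # Hypothetical event name; the traceback is logged for operators.
        logger.exception("voice_listening_ack_failed")
        return False
```

Returning a boolean instead of re-raising keeps the ack strictly best-effort: the caller of `deliver_ack` can decide whether a dropped ack should still count toward the throttle.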
Observability
{"event": "voice_listening_ack_fired",
"turn": 4,
"fire_count_in_turn": 1,
"language": "nl",
"interim_word_count": 14,
"stable_ms": 1280}
Greppable via docker logs zol-voice-agent | grep voice_listening_ack_fired.
Operators can audit:
- False positive rate: acks firing on what should have been short questions (rare given the 12-word floor).
- Per-language fire distribution: language coverage of the trigger.
- Turn distribution: turn 1 vs turn 5 — does the trigger fire mostly on opening narratives or mid-call clarifications?
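For the per-language and per-turn audits, a small aggregation helper may be enough. The sketch below assumes each event is logged as a single line containing one JSON object (possibly after a timestamp prefix); the function name summarize_acks is hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Iterable

EVENT_NAME = "voice_listening_ack_fired"


def summarize_acks(lines: Iterable[str]) -> Dict[str, dict]:
    """Count ack events per language and per turn from raw log lines."""
    by_language: Counter = Counter()
    by_turn: Counter = Counter()
    for line in lines:
        start = line.find("{")          # skip any log prefix before the JSON
        if start == -1:
            continue
        try:
            event = json.loads(line[start:])
        except json.JSONDecodeError:
            continue                    # not a JSON log line
        if not isinstance(event, dict) or event.get("event") != EVENT_NAME:
            continue
        by_language[event.get("language")] += 1
        by_turn[event.get("turn")] += 1
    return {"by_language": dict(by_language), "by_turn": dict(by_turn)}
```

One way to use it is to pipe the docker logs output into a script that calls `summarize_acks(sys.stdin)` and prints the result.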
Settings
VOICE_LISTENING_ACK_ENABLED=true # Default ON
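Reading the flag once at import time could be done as follows. This is a sketch: the helper env_flag and the set of accepted "off" spellings are assumptions; the source only confirms that false disables the feature and that the default is ON.

```python
import os
from typing import Mapping, Optional

# Assumed spellings that disable the flag; only "false" is confirmed.
_FALSEY = {"0", "false", "no", "off"}


def env_flag(name: str, default: bool = True,
             environ: Optional[Mapping[str, str]] = None) -> bool:
    """Parse a boolean environment flag, returning `default` when unset or empty."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() not in _FALSEY


# Evaluated once at module import, so flipping the variable at runtime
# has no effect until the container restarts.
LISTENING_ACK_ENABLED = env_flag("VOICE_LISTENING_ACK_ENABLED", default=True)
```

The optional `environ` argument exists only so the parser can be tested without mutating the real process environment.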
Test it
Speak a deliberately long utterance with a pause:
"Ik zoek informatie over mijn vader die vorige week is opgenomen op de afdeling cardiologie en ik wil graag weten…" [wait 1.5 s]
After the pause, the agent should emit one of the four NL ack templates ("Ik luister hoor." / "Ga gerust verder." / "Ja, ik ben er." / "Ik volg het."). Then continue your question to verify the ack TTS truncates cleanly (allow_interruptions=True).
For the throttle check: ask a long question, get the ack, then ask ANOTHER long question within 10 seconds → no second ack. After 10 s since the last ack, the next qualifying utterance fires again.
Files
- voice_agent/listening_ack.py: pure trigger logic (no LiveKit deps)
- voice_agent/agent.py: integration (~50 lines added across 3 hooks)
- voice_agent/tests/unit/test_listening_ack.py: 17 design-locked tests
- docs/plans/2026-04-25-listening-ack-design.md: full design spec
References
- LiveKit Agents Documentation: interim transcript event hooks the trigger reads
- ElevenLabs Multilingual v2: TTS model that speaks the ack templates; allow_interruptions=True lets the caller barge in cleanly
- Yngve, V. (1970). "On getting a word in edgewise." Papers from the Sixth Regional Meeting, Chicago Linguistic Society. Backchannel-cadence reference (quoted as the 10 s throttle baseline).
- Sacks, Schegloff & Jefferson (1974)
- Lin et al. (2026)