Long-utterance acknowledgment (A.1-lite)

Status

Designed and implemented 2026-04-25. This is the final item of the Q2 naturalness sprint, following A.2 (context-aware filler), A.4 (prosody injection) and A.3 (adaptive TTS speed). Default ON in the container; set `VOICE_LISTENING_ACK_ENABLED=false` to disable.

Note: this is the lightweight variant of A.1 (full acoustic backchannels). The during-speech variant was scoped out after risk analysis — see "Why lite, not full" below.

Why mid-utterance acknowledgments

A long caller utterance with a mid-thought pause needs a different signal than the post-end-of-turn filler ladder. Tier-1/2/3 fillers ("Een ogenblikje" / "Ik ben nog aan het zoeken" / "Het duurt wat langer") fire AFTER the caller stops speaking; the listening-ack fires DURING the caller's pause, BEFORE the orchestrator runs. The two mechanisms are complementary:

Mechanism	When	Purpose	Cost
Listening ack ("ik luister hoor")	During mid-thought caller pause (≥ 1.2 s stable interim)	"I'm tracking your long utterance, please continue"	One TTS line per qualifying turn
Tier-1/2/3 fillers	After STT-final, while RAG runs	"I heard your question, working on it"	TTS line at 300 ms / 4 s / 10 s thresholds

The two mechanisms must not stack: the ack fires before the caller has technically finished speaking; the filler fires after. Distress-detection overrides both — distress always wins.

What it is

When a caller gives a long utterance and pauses mid-thought, the agent emits a single low-key acknowledgment so the caller knows they're being tracked:

Caller utterance	Agent ack (after the mid-thought pause)
"Vorig jaar lag ik een week in het ziekenhuis voor een operatie en ik weet niet meer welke arts dat was…" [1.3s pause]	"Ik luister hoor."
"I had surgery last year and I'm trying to remember which doctor performed it…" [1.3s pause]	"Please go on."

The ack lands during the caller's pause — before they've technically finished speaking — so it functions as listening reassurance, not as a turn-taking move.
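The "stable interim" notion behind this timing can be sketched as a small tracker that restarts a clock whenever the interim transcript changes. This is a minimal sketch, not the code in `voice_agent/listening_ack.py`: the class name `StableInterimTracker`, its fields, and the whitespace normalization are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StableInterimTracker:
    """Hypothetical helper: tracks how long the interim transcript has been unchanged."""

    last_text: str = ""
    last_change_s: Optional[float] = None

    def update(self, text: str, now_s: float) -> float:
        """Record an interim transcript; return how long it has been stable, in seconds."""
        # Collapse whitespace so cosmetic STT re-emissions do not reset the clock (assumption).
        normalized = " ".join(text.split())
        if self.last_change_s is None or normalized != self.last_text:
            # New or changed text means the caller is still talking: restart the clock.
            self.last_text = normalized
            self.last_change_s = now_s
        return now_s - self.last_change_s
```

The caller of `update` supplies the timestamp (for example from `time.monotonic()`), which keeps the tracker deterministic under test.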

Why lite, not full

Full A.1 (acoustic backchannels DURING caller speech) was originally on the roadmap. Risk analysis identified five problems specific to hospital context:

Catastrophic safety failure: backchannels during distress utterances ("I want to die…" → "mm-hmm") are not just inappropriate; they read as encouragement to continue describing self-harm.
STT echo interference: agent's mm-hmm during caller speech leaks back into Deepgram on phone calls with imperfect echo cancellation.
Mistimed-ack patronizing risk: 90% well-timed + 10% mistimed is worse than no backchannels because catastrophic failures are remembered (loss aversion).
The existing tier-1/2/3 filler ladder already addresses 70% of the perceived gap for the post-end-of-turn phase.
No measurable ROI — full A.1's benefit is purely subjective UX, while predictive caching (B.4) has hard latency wins.

A.1-lite captures the long-utterance reassurance piece (the production-meaningful subset) at ~10% of the engineering cost and ~5% of the deployment risk by firing AFTER the caller's mid-thought pause instead of DURING their speech. Because agent and caller audio are sequential, there is no need for precise VAD timing, no echo problem and no audio mixing.

Trigger

Five gates, all must clear:

Gate	Threshold	Reason
Word count in interim	`≥12 words`	Rejects normal short questions ("Wat zijn de bezoekuren?" = 4 words)
Stable interim duration	`≥1.2 s`	Rejects normal cadence pauses (500-800 ms); fires before VAD end-of-turn (~1.5 s)
Throttle since last ack	`≥10 s`	Matches human phone-call backchannel cadence (Yngve 1970, Switchboard corpus)
Per-turn fire cap	`<3 fires`	Prevents nagging on >60 s monologues
Distress override	`not active`	Safety path always wins — no acks during distress utterances

A 90 s monologue gets at most 3 acks (for example at ~10 s, ~20 s and ~30 s, provided a qualifying pause occurs at each point); the remaining 60 s carries momentum without further interruption.
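The five gates can be sketched as one pure predicate plus a small state record. This is a minimal sketch under stated assumptions, not the shipped logic. The names `AckState`, `should_fire_ack`, `record_fire` and `start_new_turn` are hypothetical. It also assumes that the first ack may fire as soon as the other gates clear, and that the throttle carries across turns while the fire cap resets per turn.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the gate table; constant names are illustrative.
MIN_WORDS = 12
MIN_STABLE_S = 1.2
MIN_GAP_S = 10.0
MAX_FIRES_PER_TURN = 3


@dataclass
class AckState:
    fires_in_turn: int = 0
    last_ack_s: Optional[float] = None


def should_fire_ack(interim: str, stable_s: float, now_s: float,
                    state: AckState, distress_active: bool) -> bool:
    """Return True only when all five gates clear."""
    if distress_active:                              # safety path always wins
        return False
    if len(interim.split()) < MIN_WORDS:             # rejects short questions
        return False
    if stable_s < MIN_STABLE_S:                      # rejects normal cadence pauses
        return False
    if state.last_ack_s is not None and now_s - state.last_ack_s < MIN_GAP_S:
        return False                                 # throttle since last ack
    if state.fires_in_turn >= MAX_FIRES_PER_TURN:    # per-turn fire cap
        return False
    return True


def record_fire(state: AckState, now_s: float) -> None:
    """Update the state after an ack has been emitted."""
    state.fires_in_turn += 1
    state.last_ack_s = now_s


def start_new_turn(state: AckState) -> None:
    """Reset the per-turn cap; the throttle timestamp is kept (assumption)."""
    state.fires_in_turn = 0
```

Keeping the predicate free of I/O and clocks matches the "pure trigger logic" role described for `listening_ack.py` and makes each gate testable in isolation.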

Templates (4 per language × 4 languages = 16 variants)

nl: "Ik luister hoor." / "Ga gerust verder." / "Ja, ik ben er." / "Ik volg het."
en: "I'm listening." / "Please go on." / "I'm with you." / "I'm following."
fr: "Je vous écoute." / "Continuez, je vous en prie." / "Je suis là." / "Je vous suis."
it: "La ascolto." / "Prego, continui." / "Sono qui." / "La seguo."

`random.choice` per fire varies the wording across multi-ack turns (it reduces repetition but does not guarantee that consecutive acks differ). fr/it use the formal vous / Lei register to match hospital-context formality.
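Template selection with the NL fallback can be sketched as follows. This is an illustrative sketch: `ACK_TEMPLATES`, `FALLBACK_LANGUAGE` and `pick_ack` are hypothetical names, and the optional `previous` argument is an added variation that avoids an immediate repeat. The design above specifies a plain `random.choice` per fire.

```python
import random
from typing import Optional

# Template strings copied from the list above; the dict layout is an assumption.
ACK_TEMPLATES = {
    "nl": ["Ik luister hoor.", "Ga gerust verder.", "Ja, ik ben er.", "Ik volg het."],
    "en": ["I'm listening.", "Please go on.", "I'm with you.", "I'm following."],
    "fr": ["Je vous écoute.", "Continuez, je vous en prie.", "Je suis là.", "Je vous suis."],
    "it": ["La ascolto.", "Prego, continui.", "Sono qui.", "La seguo."],
}
FALLBACK_LANGUAGE = "nl"


def pick_ack(language: str, previous: Optional[str] = None, rng=random) -> str:
    """Pick an ack for the language, falling back to NL when no templates exist."""
    options = ACK_TEMPLATES.get(language, ACK_TEMPLATES[FALLBACK_LANGUAGE])
    # Variation on plain random.choice: skip the template used for the previous fire.
    candidates = [t for t in options if t != previous] or options
    return rng.choice(candidates)
```

Passing `rng=random.Random(0)` gives reproducible picks in tests; calling `pick_ack(language)` without `previous` behaves like the plain `random.choice` described above.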

Coexistence

Existing feature	Coexistence
Tier-1 filler (300 ms post-end-of-turn)	No conflict — different temporal phase
Tier-2/3 filler	Same as tier-1
Speculation (Phase 4.3+4.4)	Both fire on same turn; speculation hits backend, ack emits TTS — additive
Context-aware filler (A.2)	A.1-lite doesn't read or override `_partial_topic`
Adaptive TTS speed (A.3)	Ack inherits whatever speed is currently active
Distress detection	A.1-lite skips entirely — distress overrides everything

Failure modes

Scenario	Behavior
Caller speaks German (no template)	Falls back to NL templates — established `get_fillers` fallback policy
TTS raises during ack delivery	Caught + logged; ack drops silently. No turn impact
Ack fires; caller resumes mid-ack	`allow_interruptions=True` — caller barges in cleanly
Distress fires DURING ack	Distress handoff barges in; ack truncates (correct prioritization)
Feature flag flipped at runtime	Read once at module import; needs container restart

No path crashes or degrades safety.
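The "caught + logged, ack drops silently" row can be illustrated with a small wrapper. This is a sketch under assumptions: `emit_ack_safely` and the logger name are hypothetical, and `speak` stands in for whatever callable the agent uses to hand text to TTS. The real integration in `agent.py` is likely asynchronous, so the shape may differ.

```python
import logging
from typing import Callable

logger = logging.getLogger("voice_agent.listening_ack")  # logger name is an assumption


def emit_ack_safely(speak: Callable[[str], None], text: str) -> bool:
    """Try to speak the ack; on any TTS error, log it and drop the ack.

    Returns True if the ack was handed to TTS, False if it was dropped.
    """
    try:
        speak(text)
    except Exception:
        # The ack is optional reassurance, so a TTS failure must never affect the turn.
        logger.exception("listening ack TTS failed; dropping ack")
        return False
    return True
```

Returning a boolean lets the caller decide whether to count the fire against the throttle and the per-turn cap; which of the two the production code does is not stated here.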

Observability

{"event": "voice_listening_ack_fired",
 "turn": 4,
 "fire_count_in_turn": 1,
 "language": "nl",
 "interim_word_count": 14,
 "stable_ms": 1280}

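Greppable via `docker logs zol-voice-agent | grep voice_listening_ack_fired` (insert `2>&1` before the pipe if the agent logs to stderr, since `docker logs` replays the container's stderr on stderr). Operators can audit: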

False positive rate: acks firing on what should have been short questions (rare given the 12-word floor).
Per-language fire distribution: language coverage of the trigger.
Turn distribution: turn 1 vs turn 5 — does the trigger fire mostly on opening narratives or mid-call clarifications?
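The per-language and per-turn audits can be sketched as a small log summarizer. It assumes each event is emitted as a single-line JSON object, possibly preceded by a log prefix; `summarize_ack_events` is a hypothetical helper, not part of the repository.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def summarize_ack_events(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count voice_listening_ack_fired events by language and by turn number."""
    by_language: Counter = Counter()
    by_turn: Counter = Counter()
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # no JSON payload on this line
        try:
            event = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a JSON log line; ignore it
        if not isinstance(event, dict) or event.get("event") != "voice_listening_ack_fired":
            continue
        by_language[event.get("language", "unknown")] += 1
        by_turn[event.get("turn", -1)] += 1
    return by_language, by_turn
```

Feeding it `sys.stdin` from a piped `docker logs` command yields the two counters; sorting `by_turn` by key shows whether acks cluster on opening narratives or later turns.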

Settings

VOICE_LISTENING_ACK_ENABLED=true   # Default ON
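Reading the flag once at module import, as described under failure modes, might look like the sketch below. The helper `env_flag` and the set of accepted "off" spellings are assumptions; only `VOICE_LISTENING_ACK_ENABLED=false` is confirmed by this document.

```python
import os


def env_flag(name: str, default: bool) -> bool:
    """Parse a boolean environment variable; unset or empty means the default."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default
    # Accepted "off" spellings besides "false" are an assumption.
    return raw.strip().lower() not in {"false", "0", "no", "off"}


# Evaluated once at import time, so flipping the variable needs a container restart.
LISTENING_ACK_ENABLED = env_flag("VOICE_LISTENING_ACK_ENABLED", default=True)
```

Because the module-level constant is bound at import, later changes to the environment do not affect a running process, which is consistent with the restart requirement noted above.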

Test it

Speak a deliberately long utterance with a pause:

"Ik zoek informatie over mijn vader die vorige week is opgenomen op de afdeling cardiologie en ik wil graag weten…" [wait 1.5 s]

After the pause, the agent should emit one of the four NL ack templates ("Ik luister hoor." / "Ga gerust verder." / "Ja, ik ben er." / "Ik volg het."). Then continue your question to verify the ack TTS truncates cleanly (allow_interruptions=True).

For the throttle check: ask a long question, get the ack, then ask ANOTHER long question within 10 seconds → no second ack. After 10 s since the last ack, the next qualifying utterance fires again.

Files

voice_agent/listening_ack.py — pure trigger logic (no LiveKit deps)
voice_agent/agent.py — integration (~50 lines added across 3 hooks)
voice_agent/tests/unit/test_listening_ack.py — 17 design-locked tests
docs/plans/2026-04-25-listening-ack-design.md — full design spec

References

LiveKit Agents Documentation — interim transcript event hooks the trigger reads
ElevenLabs Multilingual v2 — TTS model that speaks the ack templates; allow_interruptions=True lets caller barge in cleanly
Yngve, V. (1970). "On getting a word in edgewise." Papers from the Sixth Regional Meeting, Chicago Linguistic Society. Backchannel-cadence reference (cited as the 10 s throttle baseline).
Sacks, Schegloff & Jefferson (1974)
Lin et al. 2026
