
Long-utterance acknowledgment (A.1-lite)

Status

Designed and implemented 2026-04-25. Final item of the Q2 naturalness sprint; it follows A.2 (context-aware filler), A.4 (prosody injection), and A.3 (adaptive TTS speed). Default ON in the container; set VOICE_LISTENING_ACK_ENABLED=false to disable.

Note: this is the lightweight variant of A.1 (full acoustic backchannels). The during-speech variant was scoped out after risk analysis — see "Why lite, not full" below.

Why mid-utterance acknowledgments

A long caller utterance with a mid-thought pause needs a different signal than the post-end-of-turn filler ladder. Tier-1/2/3 fillers ("Een ogenblikje" / "Ik ben nog aan het zoeken" / "Het duurt wat langer", i.e. "One moment" / "I'm still searching" / "It's taking a bit longer") fire AFTER the caller stops speaking; the listening ack fires DURING the caller's pause, BEFORE the orchestrator runs. The two mechanisms are complementary:

| Mechanism | When | Purpose | Cost |
| --- | --- | --- | --- |
| Listening ack ("ik luister hoor") | During mid-thought caller pause (≥ 1.2 s stable interim) | "I'm tracking your long utterance, please continue" | One TTS line per fire (at most 3 per turn) |
| Tier-1/2/3 fillers | After STT-final, while RAG runs | "I heard your question, working on it" | TTS line at 300 ms / 4 s / 10 s thresholds |

The two mechanisms must not stack: the ack fires before the caller has technically finished speaking; the filler fires after. Distress-detection overrides both — distress always wins.

What it is

When a caller gives a long utterance and pauses mid-thought, the agent emits a single low-key acknowledgment so the caller knows they're being tracked:

| Caller utterance | Agent ack (after the mid-thought pause) |
| --- | --- |
| "Vorig jaar lag ik een week in het ziekenhuis voor een operatie en ik weet niet meer welke arts dat was…" [1.3 s pause] | "Ik luister hoor." |
| "I had surgery last year and I'm trying to remember which doctor performed it…" [1.3 s pause] | "Please go on." |

The ack lands during the caller's pause — before they've technically finished speaking — so it functions as listening reassurance, not as a turn-taking move.

Why lite, not full

Full A.1 (acoustic backchannels DURING caller speech) was originally on the roadmap. Risk analysis identified five problems specific to hospital context:

  1. Catastrophic safety failure: backchannels during distress utterances ("I want to die…" → "mm-hmm") are not just inappropriate; they read as encouragement to continue describing self-harm.
  2. STT echo interference: agent's mm-hmm during caller speech leaks back into Deepgram on phone calls with imperfect echo cancellation.
  3. Mistimed-ack patronizing risk: 90% well-timed + 10% mistimed is worse than no backchannels because catastrophic failures are remembered (loss aversion).
  4. The existing tier-1/2/3 filler ladder already addresses 70% of the perceived gap for the post-end-of-turn phase.
  5. No measurable ROI — full A.1's benefit is purely subjective UX, while predictive caching (B.4) has hard latency wins.

A.1-lite captures the long-utterance reassurance piece (the production-meaningful subset) at ~10% of the engineering cost and ~5% of the deployment risk by firing AFTER the caller's mid-thought pause instead of DURING their speech. Because the audio is sequential, there is no need for precise VAD timing, no echo problem, and no audio mixing.

Trigger

Five gates, all must clear:

| Gate | Threshold | Reason |
| --- | --- | --- |
| Word count in interim | ≥ 12 words | Rejects normal short questions ("Wat zijn de bezoekuren?" = 4 words) |
| Stable interim duration | ≥ 1.2 s | Rejects normal cadence pauses (500–800 ms); fires before VAD end-of-turn (~1.5 s) |
| Throttle since last ack | ≥ 10 s | Matches human phone-call backchannel cadence (Yngve 1970, Switchboard corpus) |
| Per-turn fire cap | < 3 fires | Prevents nagging on > 60 s monologues |
| Distress override | not active | Safety path always wins; no acks during distress utterances |

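A 90 s monologue gets at most 3 acks. Each ack still needs a qualifying pause, so if the caller pauses often enough the acks land at least 10 s apart (for example at ~10 s, ~20 s, ~30 s); the rest of the monologue carries momentum without further interruption.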
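The five gates can be sketched as a single pure function. This is a minimal illustration under stated assumptions, not the contents of listening_ack.py: the names AckState, should_fire_ack, record_fire and simulate are hypothetical, words are counted by whitespace split, and timestamps are plain seconds supplied by the caller.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the gate table; the constant names are illustrative.
MIN_WORDS = 12          # word-count floor
MIN_STABLE_S = 1.2      # seconds the interim transcript must stay unchanged
THROTTLE_S = 10.0       # minimum seconds between two acks
MAX_FIRES_PER_TURN = 3  # per-turn cap


@dataclass
class AckState:
    """Hypothetical state holder; the field names are assumptions."""
    last_ack_at: Optional[float] = None  # timestamp of the previous ack, if any
    fires_in_turn: int = 0               # reset when a new caller turn starts


def should_fire_ack(interim_text, stable_s, now, state, distress_active):
    """Return True only when all five gates clear."""
    if distress_active:                                # distress always wins
        return False
    if len(interim_text.split()) < MIN_WORDS:          # short question
        return False
    if stable_s < MIN_STABLE_S:                        # normal cadence pause
        return False
    if state.fires_in_turn >= MAX_FIRES_PER_TURN:      # per-turn cap reached
        return False
    if state.last_ack_at is not None and now - state.last_ack_at < THROTTLE_S:
        return False                                   # throttled
    return True


def record_fire(state, now):
    """Update the state after an ack has been emitted."""
    state.last_ack_at = now
    state.fires_in_turn += 1


def simulate(pause_times, text, stable_s=1.3):
    """Replay qualifying pauses within one turn; return the times that ack."""
    state = AckState()
    fired = []
    for t in pause_times:
        if should_fire_ack(text, stable_s, t, state, distress_active=False):
            record_fire(state, t)
            fired.append(t)
    return fired
```

With a 14-word interim and pauses at 10, 14, 20, 30, 40 and 50 seconds, simulate returns [10, 20, 30]: the pause at 14 s is throttled, and the pauses at 40 s and 50 s hit the per-turn cap.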

Templates (4 per language × 4 languages = 16 variants)

nl: "Ik luister hoor." / "Ga gerust verder." / "Ja, ik ben er." / "Ik volg het."
en: "I'm listening." / "Please go on." / "I'm with you." / "I'm following."
fr: "Je vous écoute." / "Continuez, je vous en prie." / "Je suis là." / "Je vous suis."
it: "La ascolto." / "Prego, continui." / "Sono qui." / "La seguo."

random.choice per fire varies the wording in multi-ack turns (it makes an immediate repeat less likely but does not rule it out). The fr/it templates use the formal vous / Lei register to match hospital-context formality.
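A possible shape for template selection is sketched below. The template strings are the ones listed above and the NL fallback follows the failure-modes table; the names ACK_TEMPLATES and pick_ack are hypothetical, and the optional `previous` argument, which excludes the last-used template, is an illustrative extension rather than something the design describes.

```python
import random

# The 16 templates from the list above, keyed by language code.
ACK_TEMPLATES = {
    "nl": ["Ik luister hoor.", "Ga gerust verder.", "Ja, ik ben er.", "Ik volg het."],
    "en": ["I'm listening.", "Please go on.", "I'm with you.", "I'm following."],
    "fr": ["Je vous écoute.", "Continuez, je vous en prie.", "Je suis là.", "Je vous suis."],
    "it": ["La ascolto.", "Prego, continui.", "Sono qui.", "La seguo."],
}
FALLBACK_LANGUAGE = "nl"  # unsupported languages fall back to NL


def pick_ack(language, previous=None, rng=random):
    """Pick one ack template for `language`.

    `previous` (assumption, not part of the documented design) removes the
    last-used template from the candidates so two consecutive acks differ.
    """
    templates = ACK_TEMPLATES.get(language, ACK_TEMPLATES[FALLBACK_LANGUAGE])
    candidates = [t for t in templates if t != previous] or templates
    return rng.choice(candidates)
```

Calling pick_ack("de") returns one of the four NL templates, since German has no entry. Passing a seeded random.Random instance as `rng` makes the choice reproducible in tests.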

Coexistence

| Existing feature | Coexistence |
| --- | --- |
| Tier-1 filler (300 ms post-end-of-turn) | No conflict; different temporal phase |
| Tier-2/3 filler | Same as tier-1 |
| Speculation (Phase 4.3+4.4) | Both fire on same turn; speculation hits backend, ack emits TTS (additive) |
| Context-aware filler (A.2) | A.1-lite doesn't read or override _partial_topic |
| Adaptive TTS speed (A.3) | Ack inherits whatever speed is currently active |
| Distress detection | A.1-lite skips entirely; distress overrides everything |

Failure modes

| Scenario | Behavior |
| --- | --- |
| Caller speaks German (no template) | Falls back to NL templates (established get_fillers fallback policy) |
| TTS raises during ack delivery | Caught + logged; ack drops silently. No turn impact |
| Ack fires; caller resumes mid-ack | allow_interruptions=True; caller barges in cleanly |
| Distress fires DURING ack | Distress handoff barges in; ack truncates (correct prioritization) |
| Feature flag flipped at runtime | Read once at module import; needs container restart |

No path crashes or degrades safety.
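The "TTS raises during ack delivery" row can be illustrated with a small wrapper. This is a sketch only: `speak` stands in for whatever coroutine the agent uses to emit TTS, and the function name, logger name and the "voice_listening_ack_failed" message are hypothetical.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("voice_agent.listening_ack")  # assumed logger name


async def deliver_ack(speak: Callable[[str], Awaitable[None]], text: str) -> bool:
    """Try to speak the ack; never let a TTS failure reach the turn logic.

    Returns True if the ack was spoken, False if it was dropped.
    asyncio.CancelledError derives from BaseException (Python 3.8+), so a
    cancellation such as a barge-in is not swallowed by the handler below.
    """
    try:
        await speak(text)
        return True
    except Exception:
        # Log with traceback and drop the ack silently.
        logger.exception("voice_listening_ack_failed")
        return False
```

Catching Exception rather than BaseException is the design choice that keeps task cancellation working while still absorbing ordinary TTS errors.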

Observability

{"event": "voice_listening_ack_fired",
"turn": 4,
"fire_count_in_turn": 1,
"language": "nl",
"interim_word_count": 14,
"stable_ms": 1280}

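Greppable via docker logs zol-voice-agent | grep voice_listening_ack_fired (add 2>&1 before the pipe if the agent logs to stderr, since docker logs keeps the two streams separate). Operators can audit: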

  • False positive rate: acks firing on what should have been short questions (rare given the 12-word floor).
  • Per-language fire distribution: language coverage of the trigger.
  • Turn distribution: turn 1 vs turn 5 — does the trigger fire mostly on opening narratives or mid-call clarifications?
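The per-language and per-turn audits can be approximated with a short script over the log output. The sketch assumes each event is written as one JSON object per log line, possibly preceded by a logger prefix; summarize_acks is a hypothetical helper, not part of the codebase.

```python
import json
from collections import Counter

EVENT = "voice_listening_ack_fired"


def summarize_acks(lines):
    """Count ack events per language and per turn from raw log lines."""
    by_language = Counter()
    by_turn = Counter()
    for line in lines:
        start = line.find("{")          # skip any timestamp/logger prefix
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue                    # not a JSON log line
        if not isinstance(record, dict) or record.get("event") != EVENT:
            continue
        by_language[record.get("language", "unknown")] += 1
        by_turn[record.get("turn")] += 1
    return by_language, by_turn
```

Feeding it the lines of the docker logs output (for example by iterating over sys.stdin) yields two Counter objects: one keyed by language code and one keyed by turn number.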

Settings

VOICE_LISTENING_ACK_ENABLED=true # Default ON

Test it

Speak a deliberately long utterance with a pause:

"Ik zoek informatie over mijn vader die vorige week is opgenomen op de afdeling cardiologie en ik wil graag weten…" [wait 1.5 s]

After the pause, the agent should emit one of the four NL ack templates ("Ik luister hoor." / "Ga gerust verder." / "Ja, ik ben er." / "Ik volg het."). Then continue your question to verify the ack TTS truncates cleanly (allow_interruptions=True).

For the throttle check: ask a long question, get the ack, then ask ANOTHER long question within 10 seconds → no second ack. After 10 s since the last ack, the next qualifying utterance fires again.

Files

  • voice_agent/listening_ack.py — pure trigger logic (no LiveKit deps)
  • voice_agent/agent.py — integration (~50 lines added across 3 hooks)
  • voice_agent/tests/unit/test_listening_ack.py — 17 design-locked tests
  • docs/plans/2026-04-25-listening-ack-design.md — full design spec

References

  • LiveKit Agents Documentation — interim transcript event hooks the trigger reads
  • ElevenLabs Multilingual v2 — TTS model that speaks the ack templates; allow_interruptions=True lets caller barge in cleanly
  • Yngve, V. (1970). "On getting a word in edgewise." Papers from the Sixth Regional Meeting, Chicago Linguistic Society. Backchannel-cadence reference (cited as the baseline for the 10 s throttle).
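  • Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). "A simplest systematics for the organization of turn-taking for conversation." Language, 50(4), 696–735.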
  • Lin et al. 2026