Adaptive TTS speed (A.3)
Designed and implemented 2026-04-25. A.3 is the third item of the Q2 naturalness
sprint, following A.2 (context-aware filler) and A.4 (prosody injection).
It is on by default in the container; set `VOICE_ADAPTIVE_SPEED_ENABLED=false`
to disable it.
Why adapt TTS speed
A fixed TTS speed of 1.0 sounds calm-and-professional in the abstract — and wrong in three observable cases:
- The caller asks the agent to slow down. Pre-A.3, the agent acknowledged the request ("Of course, I'll speak more slowly") but did not actually change the TTS speed; the elevenlabs plugin didn't expose `voice_settings` cleanly. As of `livekit-plugins-elevenlabs==1.5.6` it does, so A.3 makes the acknowledgement truthful.
- The caller is in distress. A distress signal on the inbound transcript should trigger calmer, slower delivery on the outbound side, before the handoff template starts speaking. Standard pace mid-distress reads as obtuse at best, callous at worst.
- The caller speaks slowly. Elderly callers (the dominant hospital-helpdesk demographic) often speak at < 110 WPM. Matching their pace with a small −0.05 offset reads as attuned without crossing into mockery.
ElevenLabs Multilingual v2 accepts a speed parameter in its VoiceSettings; A.3 composes the three signals into one knob and updates the TTS settings idempotently.
Trade-offs
| Decision | Chosen | Alternative | Rejected because |
|---|---|---|---|
| Composition function | clamp(discrete + offset, 0.70, 1.00) with priority distress > explicit > baseline | Multiplicative composition (distress × explicit_factor × wpm_factor) | Multiplicative composition can drive the speed below the audibly-degraded threshold (~0.65) under stacked discounts; additive-with-clamp gives a hard floor. |
| Speed-up signal | None — the agent never speeds up beyond 1.0 | Symmetric speed control | Hospital callers perceive "sped up" as "rushed" or "uncaring." The only legitimate request is to slow down. |
| Persistence | Sticky for the call (slow-down) / 2-turn decay (distress) / per-turn (WPM) | Reset every turn | Sticky-for-call matches user mental model: "I asked you to slow down" is a one-time request, not a per-turn parameter. The user shouldn't have to repeat it. |
What it is
The agent's TTS speed adapts to three signals simultaneously:
| Signal | Effect |
|---|---|
| Caller asks "spreek wat trager" / "speak more slowly" | Speed → 0.85, sticky for the rest of the call |
| Distress signal detected on interim or final transcript | Speed → 0.75 (more urgent), sticky for 2 turns then decays |
| Caller speaks slowly (< 110 WPM) | Adds −0.05 offset to whichever speed is active |
The three compose into one knob:
final_speed = clamp(discrete_speed + wpm_offset, 0.70, 1.00)
When both explicit and distress are active at once, distress wins
(min(0.85, 0.75) = 0.75). The clamp is a defensive floor: even at the
extreme combination of distress + slow caller (0.75 + (−0.05) = 0.70),
the agent never goes below 0.70, which stays clear of the audibly-degraded
range (roughly 0.65 and below, per the trade-offs table).
Why it matters
Two of the three triggers fix concrete caller-experience gaps:
- Truthfulness. Before A.3, the Phase 3.5 stub acknowledged the slow-down request ("Of course, I'll speak more slowly") but did not actually slow the TTS; the comment in `agent.py` said the elevenlabs plugin didn't expose `voice_settings` cleanly. As of `livekit-plugins-elevenlabs==1.5.6` it does, so A.3 makes the stub truthful: the request now changes the TTS speed for the rest of the call.
- Patient-care alignment. For a caller in distress, the handoff template at normal pace is the wrong tempo for the moment. A.3 slows to 0.75 before the distress-handoff TTS speaks, so the calm delivery is in effect from the first word.
The third trigger (slow-caller WPM offset) is subtler — a −0.05 nudge
for slow-speaking callers (often elderly). Bounded so it never produces
audible mockery effects.
Triggers and persistence
Each trigger has its own natural lifetime:
| Trigger | Lifetime | Reset condition |
|---|---|---|
| Explicit slow-down | sticky | call end (no mid-call speed-up detector) |
| Distress | sticky for 2 turns | non-distressed turns decrement; counter at 0 = decayed |
| Caller WPM bucket | per-turn | re-evaluated from each turn's transcript |
The "no speed-up signal" is deliberate. Elderly callers (the most common beneficiaries) rarely ask the agent to speed up — they asked for slow because they need slow. Younger callers who triggered slow by mistake can hang up and call back.
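The three lifetimes can be modelled with one small state object. The sketch below is one reading of the table, under stated assumptions: the names `SpeedState` and `on_user_turn` are hypothetical, and the distress counter is assumed to be decremented once per non-distressed caller turn, so that the two turns after a distress signal are still paced slowly and the third is not.

```python
from dataclasses import dataclass

DISTRESS_STICKY_TURNS = 2  # from the persistence table


@dataclass
class SpeedState:
    explicit_slow: bool = False    # sticky until the call ends
    distress_turns_left: int = 0   # 0 means the distress pacing has decayed
    wpm_offset: float = 0.0        # overwritten on every turn


def on_user_turn(state: SpeedState, *, asked_to_slow: bool,
                 distressed: bool, wpm_offset: float) -> bool:
    """Update state for one caller turn; return True while distress pacing applies."""
    if asked_to_slow:
        state.explicit_slow = True      # there is no mid-call reset path
    state.wpm_offset = wpm_offset       # per-turn value, never carried over
    if distressed:
        state.distress_turns_left = DISTRESS_STICKY_TURNS
        return True
    if state.distress_turns_left > 0:
        state.distress_turns_left -= 1  # non-distressed turns count down
        return True
    return False
```

Under this reading, a fresh distress signal during the countdown simply resets the counter to 2.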
WPM measurement
Per-turn rough estimate:
wpm = words_in_final_transcript / (final_received_at - first_interim_at) × 60
The first_interim_at is set by the FIRST interim transcript of each
turn (in on_user_input_transcribed). The final_received_at is the
turn-completion timestamp.
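As a rough sketch of the estimate, combined with the bucket cutoffs listed in this section: the function names below are hypothetical, timestamps are assumed to be seconds (for example from `time.monotonic()`), and words are assumed to be whitespace-separated tokens. A zero-length window or empty transcript returns `None`, so the caller can keep the last known bucket.

```python
from typing import Optional, Tuple

SLOW_WPM_CUTOFF = 110   # below this: slow bucket
FAST_WPM_CUTOFF = 180   # above this: fast bucket


def estimate_wpm(final_text: str, first_interim_at: float,
                 final_received_at: float) -> Optional[float]:
    """Words per minute for one turn, or None when it cannot be measured."""
    words = len(final_text.split())
    window_s = final_received_at - first_interim_at
    if words == 0 or window_s <= 0:
        return None  # skip the update; the bucket stays at its last value
    return words / window_s * 60


def bucket_and_offset(wpm: float) -> Tuple[str, float]:
    """Map a WPM estimate to (bucket name, speed offset)."""
    if wpm < SLOW_WPM_CUTOFF:
        return "slow", -0.05
    if wpm > FAST_WPM_CUTOFF:
        return "fast", 0.0   # never a positive offset
    return "normal", 0.0
```

Because the window starts at the first interim transcript rather than at speech onset, the estimate is likely to run slightly high for short turns; the bucket boundaries absorb most of that error.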
Bucket cutoffs:
| WPM | Bucket | Offset |
|---|---|---|
| < 110 | slow | −0.05 |
| 110 ≤ wpm ≤ 180 | normal | 0 |
| > 180 | fast | 0 |
The fast bucket gets 0 (not a positive offset) because A.3 never
speeds the agent up beyond baseline. Hospital callers may perceive
"sped up" as "rushed" or "uncaring."
Plumbing — when update_options is called
The elevenlabs plugin marks its WebSocket connection as non_current
whenever update_options(voice_settings=…) is called, forcing a
connection rebuild that costs ~200-400 ms on the next say(). A.3
guards against this with an idempotency check:
```python
target = voice_speed.compute_target_speed(self._speed_state)
if target == self._current_tts_speed:
    return  # no change → no API call
self._tts.update_options(voice_settings=VoiceSettings(speed=target, …))
self._current_tts_speed = target
```
Every actual speed change emits a structured-log event (see Observability below).
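The effect of the guard is easiest to see with a stand-in TTS object that counts calls. In this sketch `RecordingTTS` and `SpeedApplier` are hypothetical names, and a plain dict replaces the plugin's `VoiceSettings` so that the example runs without LiveKit installed.

```python
class RecordingTTS:
    """Stand-in for the real TTS object; it only records update_options calls."""

    def __init__(self):
        self.calls = []

    def update_options(self, **options):
        self.calls.append(options)


class SpeedApplier:
    def __init__(self, tts, initial_speed=1.0):
        self._tts = tts
        self._current_tts_speed = initial_speed

    def apply(self, target):
        """Push the target speed to the TTS only when it actually changed."""
        target = round(target, 2)  # guard against float drift in the comparison
        if target == self._current_tts_speed:
            return False           # no change, so no connection rebuild
        self._tts.update_options(voice_settings={"speed": target})
        self._current_tts_speed = target
        return True


if __name__ == "__main__":
    tts = RecordingTTS()
    applier = SpeedApplier(tts)
    for speed in (1.0, 0.85, 0.85, 0.85):
        applier.apply(speed)
    print(len(tts.calls))  # 1
```

Four turns at an unchanged target cost a single `update_options` call, which is the point of the check: the rebuild penalty is paid once per real change rather than once per turn.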
Failure modes
| Scenario | Behavior |
|---|---|
| update_options raises (plugin error, network blip) | Log warning, retain current speed (silent degradation, never crash) |
| WPM measurement window is 0 or empty text | Skip WPM update; bucket stays at last known value |
| Feature flag off | All speed-change paths bypassed; agent runs at baseline 1.0 forever |
| ElevenLabs rejects an out-of-range speed value | Plugin's own validation surfaces error; we log and continue at last-good speed |
| Language switch during call | session.tts accessor reflects the new TTS instance; A.3 reads it dynamically (no stale reference) |
No path crashes the session, kills audio, or degrades the safety paths.
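A minimal sketch of the "log and keep the last-good speed" behavior from the table, assuming a hypothetical helper `safe_apply_speed` and a deliberately failing stand-in `BrokenTTS`; the real integration in `agent.py` may be structured differently.

```python
import logging

logger = logging.getLogger("voice_speed_sketch")


class BrokenTTS:
    """Stand-in TTS whose update always fails, to exercise the error path."""

    def update_options(self, **options):
        raise RuntimeError("simulated plugin error")


def safe_apply_speed(tts, current_speed, target_speed, enabled=True):
    """Return the speed in effect after the attempt; never raises."""
    if not enabled or target_speed == current_speed:
        return current_speed  # flag off or nothing to do
    try:
        tts.update_options(voice_settings={"speed": target_speed})
    except Exception:
        # Silent degradation: keep speaking at the last-good speed.
        logger.warning("TTS speed update failed; staying at %.2f",
                       current_speed, exc_info=True)
        return current_speed
    return target_speed


if __name__ == "__main__":
    print(safe_apply_speed(BrokenTTS(), 1.0, 0.85))  # 1.0
```

Returning the effective speed, rather than mutating state before the call, keeps the recorded speed in sync with what the TTS is actually using after a failure.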
Observability
One structured log event, `voice_speed_changed`, shown here with two example payloads:

```json
{"event": "voice_speed_changed",
 "from_speed": 1.0, "to_speed": 0.85,
 "reason": "explicit_slow_request",
 "language": "nl"}
{"event": "voice_speed_changed",
 "from_speed": 0.75, "to_speed": 0.85,
 "reason": "distress_decayed",
 "language": "nl"}
```
reason field values:
| Value | When it fires |
|---|---|
| explicit_slow_request | Caller said "spreek trager" or equivalent |
| distress_detected | Distress signal on interim or final transcript |
| distress_decayed | Distress counter reached 0 |
| wpm_offset_change | Caller's WPM bucket flipped (normal → slow or vice versa) |
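Greppable via `docker logs zol-voice-agent | grep voice_speed_changed`. If the agent logs to stderr, add `2>&1` before the pipe, because `docker logs` replays the container's stderr on its own stderr stream.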
A pilot call where you say "spreek wat trager" should show one event
with reason explicit_slow_request and to_speed=0.85.
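For checking a pilot call programmatically rather than by eye, a small filter over captured log lines may help. The sketch assumes each event is written as a single-line JSON object (which the grep usage suggests but this page does not state outright); `speed_events` is a hypothetical helper name.

```python
import json


def speed_events(log_lines):
    """Yield parsed voice_speed_changed events, skipping everything else."""
    for line in log_lines:
        line = line.strip()
        if "voice_speed_changed" not in line:
            continue  # cheap pre-filter, same idea as grep
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a non-JSON line that merely mentions the event name
        if isinstance(record, dict) and record.get("event") == "voice_speed_changed":
            yield record


if __name__ == "__main__":
    sample = [
        "INFO agent started",
        '{"event": "voice_speed_changed", "from_speed": 1.0, "to_speed": 0.85, '
        '"reason": "explicit_slow_request", "language": "nl"}',
    ]
    for event in speed_events(sample):
        print(event["reason"], event["to_speed"])
```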
Settings
```bash
# Default ON. Flag off bypasses all adaptive speed logic; the agent runs
# at baseline 1.0 for every turn. Read once at module import; a runtime
# change requires a container restart.
VOICE_ADAPTIVE_SPEED_ENABLED=true
```
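Reading the flag once at import time could look like the sketch below. Only the value `false` is documented on this page; treating `0`, `no`, and `off` as false too is an assumption of this example, as are the helper name `read_flag` and the module-level constant.

```python
import os


def read_flag(name, default=True, environ=None):
    """Parse a boolean environment flag; unset or blank falls back to default."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    # Assumed spellings of "off"; only "false" is documented.
    return raw.strip().lower() not in {"false", "0", "no", "off"}


# Evaluated once when the module is imported, so a changed environment
# variable has no effect until the process (container) restarts.
ADAPTIVE_SPEED_ENABLED = read_flag("VOICE_ADAPTIVE_SPEED_ENABLED")
```

Passing `environ` explicitly keeps the parser testable without touching the real process environment.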
Test it
Speak a slow-down request and listen for the speed change in the agent's next utterance:
- nl: "Spreek wat trager alsjeblieft."
- en: "Can you speak more slowly?"
- fr: "Parlez plus lentement, s'il vous plaît."
- it: "Parla più lentamente."
Then check the container logs for the voice_speed_changed event with
reason explicit_slow_request. The change is sticky for the rest of
the call — every subsequent answer will be at speed 0.85.
For distress: speak a distress signal ("ik wil niet meer leven", "I want
to die" — patterns are documented in the safety doc). Listen for the
distress handoff template at speed 0.75. The next two turns stay
slow; turn 3 decays back to 1.0 (or 0.85 if explicit was also set).
Files
- `voice_agent/voice_speed.py`: pure composition module (testable without LiveKit)
- `voice_agent/agent.py`: integration (~30 lines added across 4 hooks)
- `voice_agent/tests/unit/test_voice_speed.py`: 24 design-locked tests
- `docs/plans/2026-04-25-adaptive-tts-speed-design.md`: full design spec
References
- ElevenLabs Multilingual v2: production TTS model; the `VoiceSettings.speed` parameter is the lever this feature controls
- LiveKit Agents Documentation: the `update_options` API used to apply per-turn settings
- Wang et al. 2017
- Nielsen 1993