Adaptive TTS speed (A.3)

Status

Designed and implemented 2026-04-25. Third item of the Q2 naturalness sprint, following A.2 (context-aware filler) and A.4 (prosody injection). Default ON in the container; set VOICE_ADAPTIVE_SPEED_ENABLED=false to disable.

Why adapt TTS speed

A fixed TTS speed of 1.0 sounds calm-and-professional in the abstract — and wrong in three observable cases:

The caller asks the agent to slow down. Pre-A.3, the agent acknowledged the request ("Of course, I'll speak more slowly") but did not actually change the TTS speed; the elevenlabs plugin didn't expose voice_settings cleanly. As of livekit-plugins-elevenlabs==1.5.6 it does, so A.3 makes the acknowledgement truthful.
The caller is in distress. A distress signal on the inbound transcript should trigger calmer, slower delivery on the outbound — before the handoff template starts speaking. Standard pace mid-distress reads as obtuse at best, callous at worst.
The caller speaks slowly. Elderly callers (the dominant hospital-helpdesk demographic) often speak at < 110 WPM. Matching their pace with a small −0.05 offset reads as attuned without crossing into mockery.

ElevenLabs Multilingual v2 accepts a speed parameter in its VoiceSettings; A.3 composes the three signals into one knob and updates the TTS settings idempotently.

Trade-offs

Decision	Chosen	Alternative	Rejected because
Composition function	`clamp(discrete + offset, 0.70, 1.00)` with priority `distress > explicit > baseline`	Multiplicative composition (`distress × explicit_factor × wpm_factor`)	Multiplicative composition can drive the speed below the audibly-degraded threshold (~0.65) under stacked discounts; additive-with-clamp gives a hard floor.
Speed-up signal	None — the agent never speeds up beyond `1.0`	Symmetric speed control	Hospital callers perceive "sped up" as "rushed" or "uncaring." The only legitimate request is to slow down.
Persistence	Sticky for the call (slow-down) / 2-turn decay (distress) / per-turn (WPM)	Reset every turn	Sticky-for-call matches user mental model: "I asked you to slow down" is a one-time request, not a per-turn parameter. The user shouldn't have to repeat it.
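To make the composition trade-off concrete, the sketch below compares the two approaches for the worst case, where distress, an explicit slow-down, and a slow caller are all active. The multiplicative factors (0.75, 0.85, 0.95) are illustrative assumptions derived from the speeds in this document; they are not values taken from the codebase.

```python
# Illustrative comparison only. The multiplicative factors are assumptions.
FLOOR, CEILING = 0.70, 1.00


def multiplicative(distress: float, explicit: float, wpm: float) -> float:
    """Rejected alternative: stacked discounts multiply, with no floor."""
    return round(distress * explicit * wpm, 4)


def additive_with_clamp(discrete: float, offset: float) -> float:
    """Chosen approach: one discrete speed plus an offset, then clamped."""
    return round(max(FLOOR, min(CEILING, discrete + offset)), 2)


stacked = multiplicative(0.75, 0.85, 0.95)   # all three discounts stacked
clamped = additive_with_clamp(0.75, -0.05)   # distress + slow caller
print(stacked, clamped)
```

With these assumed factors the multiplicative result lands near 0.61, below the ~0.65 threshold quoted in the table, while the additive form stops at 0.70.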

What it is

The agent's TTS speed adapts to three signals simultaneously:

Signal	Effect
Caller asks "spreek wat trager" / "speak more slowly"	Speed → `0.85`, sticky for the rest of the call
Distress signal detected on interim or final transcript	Speed → `0.75` (more urgent), sticky for 2 turns then decays
Caller speaks slowly (< 110 WPM)	Adds `−0.05` offset to whichever speed is active

The three compose into one knob:

final_speed = clamp(discrete_speed + wpm_offset, 0.70, 1.00)

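When both explicit and distress are concurrently active, distress wins (min(0.85, 0.75) = 0.75). The clamp is a defensive floor: the most extreme combination the current signals can produce, distress plus a slow caller (0.75 + (−0.05) = 0.70), lands exactly on the floor, so the clamp only bites if a future signal or a larger offset would push the sum lower. The agent never goes below 0.70, because speech below that is treated as audibly degraded.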
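The composition rule can be sketched as a pure function. This is a minimal sketch, not the contents of voice_speed.py: the `SpeedState` fields and the constant names are hypothetical, and only the numeric values and the function name `compute_target_speed` come from this document.

```python
from dataclasses import dataclass

BASELINE, EXPLICIT_SLOW, DISTRESS = 1.00, 0.85, 0.75
SLOW_CALLER_OFFSET = -0.05
FLOOR, CEILING = 0.70, 1.00


@dataclass
class SpeedState:
    """Hypothetical state shape; the real module may differ."""
    explicit_slow: bool = False
    distress_active: bool = False
    slow_caller: bool = False


def compute_target_speed(state: SpeedState) -> float:
    # Pick the slowest active discrete speed (distress beats explicit).
    discrete = BASELINE
    if state.explicit_slow:
        discrete = min(discrete, EXPLICIT_SLOW)
    if state.distress_active:
        discrete = min(discrete, DISTRESS)
    # Add the per-turn WPM offset, then clamp to the allowed range.
    offset = SLOW_CALLER_OFFSET if state.slow_caller else 0.0
    # Rounding avoids float noise such as 0.85 - 0.05 == 0.7999999999999999.
    return round(max(FLOOR, min(CEILING, discrete + offset)), 2)
```

Rounding to two decimals is a choice made for this sketch so that later equality checks against the current speed behave predictably.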

Why it matters

Two of the three triggers fix concrete caller-experience gaps:

Truthfulness. Before A.3, the Phase 3.5 stub ack'd the slow-down request ("Of course, I'll speak more slowly") but did not actually slow the TTS — the comment in agent.py said the elevenlabs plugin didn't expose voice_settings cleanly. As of livekit-plugins-elevenlabs==1.5.6 it does, so A.3 makes the stub truthful: the request now changes the TTS speed for the rest of the call.
Patient-care alignment. Delivering the handoff template at normal pace to a caller in distress is the wrong tempo for the moment. A.3 slows to 0.75 before the distress-handoff TTS speaks, so the calm delivery is in effect from the first word.

The third trigger (slow-caller WPM offset) is subtler: a −0.05 nudge for slow-speaking callers (often elderly), bounded so that it never reads as mockery.

Triggers and persistence

Each trigger has its own natural lifetime:

Trigger	Lifetime	Reset condition
Explicit slow-down	sticky	call end (no mid-call speed-up detector)
Distress	sticky for 2 turns	non-distressed turns decrement; counter at 0 = decayed
Caller WPM bucket	per-turn	re-evaluated from each turn's transcript

The "no speed-up signal" is deliberate. Elderly callers (the most common beneficiaries) rarely ask the agent to speed up — they asked for slow because they need slow. Younger callers who triggered slow by mistake can hang up and call back.
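The sticky and decaying lifetimes can be sketched as a small state holder. This is an assumption-laden sketch: the class, its field names, and the exact point at which the counter decrements are hypothetical. It is written so that the two turns after a distress turn stay slow and the third returns to normal, as described in this document.

```python
from dataclasses import dataclass

DISTRESS_STICKY_TURNS = 2


@dataclass
class Lifetimes:
    """Hypothetical per-call state for the two longer-lived triggers."""
    explicit_slow: bool = False    # sticky until the call ends
    distress_turns_left: int = 0   # decays over non-distressed turns

    def begin_turn(self, *, slow_request: bool, distress: bool) -> bool:
        """Update state for a caller turn.

        Returns True if distress pacing applies to this turn's reply.
        """
        if slow_request:
            self.explicit_slow = True          # never reset mid-call
        if distress:
            self.distress_turns_left = DISTRESS_STICKY_TURNS
            return True
        if self.distress_turns_left > 0:
            self.distress_turns_left -= 1      # non-distressed turn decrements
            return True
        return False                           # counter at 0: decayed
```

The per-turn WPM bucket is deliberately absent here because it carries no state between turns.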

WPM measurement

Per-turn rough estimate:

wpm = words_in_final_transcript / (final_received_at - first_interim_at) × 60

first_interim_at is set by the first interim transcript of each turn (in on_user_input_transcribed); final_received_at is the turn-completion timestamp. Both are in seconds, hence the factor of 60.

Bucket cutoffs:

WPM	Bucket	Offset
`< 110`	slow	`−0.05`
`110 ≤ wpm ≤ 180`	normal	`0`
`> 180`	fast	`0`

The fast bucket gets 0 (not a positive offset) because A.3 never speeds the agent up beyond baseline. Hospital callers may perceive "sped up" as "rushed" or "uncaring."
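The estimate and the bucket cutoffs can be sketched as follows. The function names are hypothetical, whitespace splitting is an assumed word-counting rule, and timestamps are assumed to be in seconds; the cutoffs and offsets are the ones in the table above.

```python
from typing import Optional

SLOW_WPM_CUTOFF = 110
FAST_WPM_CUTOFF = 180
OFFSETS = {"slow": -0.05, "normal": 0.0, "fast": 0.0}  # never positive


def estimate_wpm(text: str, first_interim_at: float,
                 final_received_at: float) -> Optional[float]:
    """Rough per-turn words-per-minute; None if it cannot be measured."""
    words = len(text.split())                    # assumed word-count rule
    window = final_received_at - first_interim_at
    if words == 0 or window <= 0:
        return None                              # caller keeps last bucket
    return words / window * 60.0


def wpm_bucket(wpm: float) -> str:
    if wpm < SLOW_WPM_CUTOFF:
        return "slow"
    if wpm > FAST_WPM_CUTOFF:
        return "fast"
    return "normal"


wpm = estimate_wpm("one two three four five six", 0.0, 4.0)
print(wpm, wpm_bucket(wpm), OFFSETS[wpm_bucket(wpm)])
```

Six words over four seconds works out to 90 WPM, which falls in the slow bucket and therefore gets the −0.05 offset.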

Plumbing — when `update_options` is called

The elevenlabs plugin marks its WebSocket connection as non_current whenever update_options(voice_settings=…) is called, forcing a connection rebuild that costs ~200-400 ms on the next say(). A.3 guards against this with an idempotency check:

target = voice_speed.compute_target_speed(self._speed_state)
if target == self._current_tts_speed:
    return                                # no change → no API call
self._tts.update_options(voice_settings=VoiceSettings(speed=target, …))
self._current_tts_speed = target

Every actual speed change emits a structured-log event (see Observability below).
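The effect of the guard can be checked in isolation with a stand-in TTS object. In this sketch `FakeTTS` and `SpeedApplier` are hypothetical names, and a plain dict stands in for the plugin's VoiceSettings; nothing here calls the real plugin.

```python
class FakeTTS:
    """Stand-in for the real TTS object; records each update_options call."""

    def __init__(self) -> None:
        self.calls: list = []

    def update_options(self, *, voice_settings: dict) -> None:
        self.calls.append(voice_settings["speed"])


class SpeedApplier:
    """Hypothetical wrapper holding the last speed sent to the TTS."""

    def __init__(self, tts, initial_speed: float = 1.0) -> None:
        self._tts = tts
        self._current_tts_speed = initial_speed

    def apply(self, target: float) -> bool:
        """Send the target speed; return True only if an update was sent."""
        if target == self._current_tts_speed:
            return False                      # no change, no rebuild cost
        self._tts.update_options(voice_settings={"speed": target})
        self._current_tts_speed = target
        return True


tts = FakeTTS()
applier = SpeedApplier(tts)
results = [applier.apply(s) for s in (1.0, 0.85, 0.85, 0.75)]
print(results, tts.calls)
```

Exact float equality is only reliable if the target is always computed the same way; rounding the target to two decimals before comparing is one way to make the check robust.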

Failure modes

Scenario	Behavior
`update_options` raises (plugin error, network blip)	Log warning, retain current speed (silent degradation, never crash)
WPM measurement window is 0 or empty text	Skip WPM update; bucket stays at last known value
Feature flag off	All speed-change paths bypassed; agent runs at baseline `1.0` forever
ElevenLabs rejects an out-of-range speed value	Plugin's own validation surfaces error; we log and continue at last-good speed
Language switch during call	`session.tts` accessor reflects the new TTS instance; A.3 reads it dynamically (no stale reference)

No path crashes the session, kills audio, or degrades the safety paths.
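The "log and retain current speed" behavior in the first row can be sketched as a wrapper around the update call. The helper name `apply_speed_safely`, its signature, and the logger name are hypothetical; `update` is any callable that pushes a speed to the TTS.

```python
import logging
from typing import Callable

logger = logging.getLogger("voice_speed_sketch")  # hypothetical logger name


def apply_speed_safely(update: Callable[[float], None],
                       current: float, target: float) -> float:
    """Try to apply `target`; return the speed that is now in effect."""
    if target == current:
        return current
    try:
        update(target)
    except Exception:
        # Silent degradation: keep the last-good speed, never crash the call.
        logger.warning("TTS speed update failed; staying at %.2f",
                       current, exc_info=True)
        return current
    return target
```

Returning the speed in effect, rather than raising, lets the caller store the result unconditionally and keeps the error path out of the safety-critical flow.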

Observability

Every speed change emits one structured log event, voice_speed_changed. Two examples:

{"event": "voice_speed_changed",
 "from_speed": 1.0, "to_speed": 0.85,
 "reason": "explicit_slow_request",
 "language": "nl"}

{"event": "voice_speed_changed",
 "from_speed": 0.75, "to_speed": 0.85,
 "reason": "distress_decayed",
 "language": "nl"}

reason field values:

Value	When it fires
`explicit_slow_request`	Caller said "spreek trager" or equivalent
`distress_detected`	Distress signal on interim or final transcript
`distress_decayed`	Distress counter reached 0
`wpm_offset_change`	Caller's WPM bucket flipped (normal → slow or vice versa)

Greppable via docker logs zol-voice-agent | grep voice_speed_changed. A pilot call where you say "spreek wat trager" should show one event with reason explicit_slow_request and to_speed=0.85.
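A helper that builds this event might look like the sketch below. The function name and the validation step are assumptions; the field names and the four reason values are the ones listed above.

```python
import json

ALLOWED_REASONS = {
    "explicit_slow_request",
    "distress_detected",
    "distress_decayed",
    "wpm_offset_change",
}


def speed_changed_event(from_speed: float, to_speed: float,
                        reason: str, language: str) -> str:
    """Serialize one voice_speed_changed event as a single JSON line."""
    if reason not in ALLOWED_REASONS:
        raise ValueError(f"unknown reason: {reason!r}")
    return json.dumps({
        "event": "voice_speed_changed",
        "from_speed": from_speed,
        "to_speed": to_speed,
        "reason": reason,
        "language": language,
    })


print(speed_changed_event(1.0, 0.85, "explicit_slow_request", "nl"))
```

Keeping each event on one line is what makes the grep command above sufficient for a quick check.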

Settings

# Default ON — flag off bypasses all adaptive speed logic, agent runs
# at baseline 1.0 for every turn. Read once at module import; runtime
# change requires container restart.
VOICE_ADAPTIVE_SPEED_ENABLED=true
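Reading the flag once at import time could be done along these lines. This is a sketch under assumptions: the helper name and the set of accepted "off" spellings are hypothetical, since this document only shows the values true and false.

```python
import os
from typing import Optional

_FALSY = {"false", "0", "no", "off"}  # assumed spellings; only "false" is documented


def parse_bool_flag(raw: Optional[str], default: bool = True) -> bool:
    """Interpret an environment value; unset means the default (ON)."""
    if raw is None:
        return default
    return raw.strip().lower() not in _FALSY


# Evaluated once at module import, so a runtime change needs a restart.
ADAPTIVE_SPEED_ENABLED = parse_bool_flag(
    os.environ.get("VOICE_ADAPTIVE_SPEED_ENABLED")
)
```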

Test it

Speak a slow-down request and listen for the speed change in the agent's next utterance:

nl: "Spreek wat trager alsjeblieft."
en: "Can you speak more slowly?"
fr: "Parlez plus lentement, s'il vous plaît."
it: "Parla più lentamente."

Then check the container logs for the voice_speed_changed event with reason explicit_slow_request. The change is sticky for the rest of the call — every subsequent answer will be at speed 0.85.

For distress: speak a distress signal ("ik wil niet meer leven", "I want to die" — patterns are documented in the safety doc). Listen for the distress handoff template at speed 0.75. The next two turns stay slow; turn 3 decays back to 1.0 (or 0.85 if explicit was also set).

Files

voice_agent/voice_speed.py — pure composition module (testable without LiveKit)
voice_agent/agent.py — integration (~30 lines added across 4 hooks)
voice_agent/tests/unit/test_voice_speed.py — 24 design-locked tests
docs/plans/2026-04-25-adaptive-tts-speed-design.md — full design spec

References

ElevenLabs Multilingual v2 — production TTS model; the VoiceSettings.speed parameter is the lever this feature controls
LiveKit Agents Documentation — the update_options API used to apply per-turn settings
Wang et al. 2017
Nielsen 1993
