
Adaptive TTS speed (A.3)

Status

Designed and implemented 2026-04-25. Third item of the Q2 naturalness sprint, after A.2 (context-aware filler) and A.4 (prosody injection). Enabled by default in the container; set VOICE_ADAPTIVE_SPEED_ENABLED=false to disable.

Why adapt TTS speed

A fixed TTS speed of 1.0 sounds calm-and-professional in the abstract — and wrong in three observable cases:

  1. The caller asks the agent to slow down. Pre-A.3, the agent acknowledged the request ("Of course, I'll speak more slowly") but did not actually change the TTS speed; the elevenlabs plugin didn't expose voice_settings cleanly. As of livekit-plugins-elevenlabs==1.5.6 it does, so A.3 makes the acknowledgement truthful.
  2. The caller is in distress. A distress signal on the inbound transcript should trigger calmer, slower delivery on the outbound — before the handoff template starts speaking. Standard pace mid-distress reads as obtuse at best, callous at worst.
  3. The caller speaks slowly. Elderly callers (the dominant hospital-helpdesk demographic) often speak at < 110 WPM. Matching their pace with a small −0.05 offset reads as attuned without crossing into mockery.

ElevenLabs Multilingual v2 accepts a speed parameter in its VoiceSettings; A.3 composes the three signals into one knob and updates the TTS settings idempotently.

Trade-offs

| Decision | Chosen | Alternative | Rejected because |
|---|---|---|---|
| Composition function | clamp(discrete + offset, 0.70, 1.00) with priority distress > explicit > baseline | Multiplicative composition (distress × explicit_factor × wpm_factor) | Multiplicative composition can drive the speed below the audibly-degraded threshold (~0.65) under stacked discounts; additive-with-clamp gives a hard floor. |
| Speed-up signal | None: the agent never speeds up beyond 1.0 | Symmetric speed control | Hospital callers perceive "sped up" as "rushed" or "uncaring." The only legitimate request is to slow down. |
| Persistence | Sticky for the call (slow-down) / 2-turn decay (distress) / per-turn (WPM) | Reset every turn | Sticky-for-call matches user mental model: "I asked you to slow down" is a one-time request, not a per-turn parameter. The user shouldn't have to repeat it. |
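
To make the floor argument in the composition row concrete, here is a small arithmetic sketch. The multiplicative factors (0.75, 0.85, 0.95) are illustrative assumptions and not values taken from the rejected design; only the additive values and the clamp bounds come from this document.

```python
# Illustrative comparison of the two composition strategies.
# The multiplicative factors passed in below are assumed for this example only.

def multiplicative(distress: float, explicit: float, wpm: float) -> float:
    """Rejected approach: every active signal scales the speed down further."""
    return distress * explicit * wpm


def additive_with_clamp(discrete: float, offset: float,
                        lo: float = 0.70, hi: float = 1.00) -> float:
    """Chosen approach: one discrete speed plus an offset, then clamped."""
    return max(lo, min(hi, discrete + offset))


stacked = multiplicative(0.75, 0.85, 0.95)   # all three signals active at once
chosen = additive_with_clamp(0.75, -0.05)    # distress + slow caller

print(f"multiplicative: {stacked:.3f}")
print(f"additive+clamp: {chosen:.2f}")
```

Under these assumed factors the multiplicative result falls below the ~0.65 threshold, while the additive form cannot go under 0.70 regardless of how many signals are active.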

What it is

The agent's TTS speed adapts to three signals simultaneously:

| Signal | Effect |
|---|---|
| Caller asks "spreek wat trager" / "speak more slowly" | Speed → 0.85, sticky for the rest of the call |
| Distress signal detected on interim or final transcript | Speed → 0.75 (more urgent), sticky for 2 turns then decays |
| Caller speaks slowly (< 110 WPM) | Adds −0.05 offset to whichever speed is active |

The three compose into one knob:

final_speed = clamp(discrete_speed + wpm_offset, 0.70, 1.00)

When both explicit and distress are concurrently active, distress wins (min(0.85, 0.75) = 0.75). The clamp is a defensive floor: the extreme combination of distress + slow caller (0.75 + (−0.05) = 0.70) lands exactly on it, so the agent never goes below 0.70. That keeps a margin above the roughly 0.65 point where delivery becomes audibly degraded.
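
The composition rule can be sketched as a pure function. This is a minimal sketch under stated assumptions, not the contents of voice_speed.py: the SpeedState fields are hypothetical, and rounding to two decimals is a choice made here to keep float noise out of equality checks.

```python
from dataclasses import dataclass

BASELINE = 1.00
EXPLICIT_SLOW = 0.85
DISTRESS_SLOW = 0.75
SLOW_CALLER_OFFSET = -0.05
FLOOR, CEILING = 0.70, 1.00


@dataclass
class SpeedState:
    """Hypothetical state shape; the real module may differ."""
    explicit_slow: bool = False     # caller asked the agent to slow down
    distress_turns_left: int = 0    # > 0 while distress is still sticky
    slow_caller: bool = False       # current turn fell in the slow WPM bucket


def compute_target_speed(state: SpeedState) -> float:
    # Pick the slowest active discrete speed, so distress beats explicit.
    discrete = BASELINE
    if state.explicit_slow:
        discrete = min(discrete, EXPLICIT_SLOW)
    if state.distress_turns_left > 0:
        discrete = min(discrete, DISTRESS_SLOW)
    offset = SLOW_CALLER_OFFSET if state.slow_caller else 0.0
    # Clamp, then round so repeated calls yield comparable values.
    return round(max(FLOOR, min(CEILING, discrete + offset)), 2)


print(compute_target_speed(SpeedState(explicit_slow=True, distress_turns_left=2)))
print(compute_target_speed(SpeedState(distress_turns_left=1, slow_caller=True)))
```

The first call shows distress overriding the explicit request; the second shows the distress + slow-caller combination landing on the floor.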

Why it matters

Two of the three triggers fix concrete caller-experience gaps:

  • Truthfulness. Before A.3, the Phase 3.5 stub acknowledged the slow-down request ("Of course, I'll speak more slowly") but did not actually slow the TTS; a comment in agent.py noted that the elevenlabs plugin didn't expose voice_settings cleanly. As of livekit-plugins-elevenlabs==1.5.6 it does, so A.3 makes the stub truthful: the request now changes the TTS speed for the rest of the call.
  • Patient-care alignment. A caller in distress hearing the handoff template at normal pace is the wrong tempo for the moment. A.3 slows to 0.75 before the distress-handoff TTS speaks, so the calm delivery is in effect from the first word.

The third trigger (the slow-caller WPM offset) is subtler: a −0.05 nudge for slow-speaking callers, who are often elderly. It is bounded so that it never sounds like mockery.

Triggers and persistence

Each trigger has its own natural lifetime:

| Trigger | Lifetime | Reset condition |
|---|---|---|
| Explicit slow-down | sticky | call end (no mid-call speed-up detector) |
| Distress | sticky for 2 turns | non-distressed turns decrement; counter at 0 = decayed |
| Caller WPM bucket | per-turn | re-evaluated from each turn's transcript |

The "no speed-up signal" is deliberate. Elderly callers (the most common beneficiaries) rarely ask the agent to speed up — they asked for slow because they need slow. Younger callers who triggered slow by mistake can hang up and call back.
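
The two stateful lifetimes can be sketched as a small turn-by-turn state holder. The class and method names are hypothetical, and the exact point at which the distress counter decrements is an assumption chosen so that the distress reply plus the next two replies stay slow; the real hook ordering may differ. The WPM bucket is omitted because it carries no state between turns.

```python
from dataclasses import dataclass

DISTRESS_STICKY_TURNS = 2  # "sticky for 2 turns"


@dataclass
class TriggerLifetimes:
    """Hypothetical holder for the sticky and decaying triggers."""
    explicit_slow: bool = False     # sticky until the call ends
    distress_turns_left: int = 0    # counts down on non-distressed turns

    def request_slow_down(self) -> None:
        self.explicit_slow = True   # never cleared mid-call

    def advance_turn(self, distressed: bool) -> float:
        """Process one caller turn; return the discrete speed for the reply."""
        if distressed:
            self.distress_turns_left = DISTRESS_STICKY_TURNS
            distress_active = True
        elif self.distress_turns_left > 0:
            self.distress_turns_left -= 1   # a calm turn uses up one sticky turn
            distress_active = True
        else:
            distress_active = False
        if distress_active:
            return 0.75
        return 0.85 if self.explicit_slow else 1.00


state = TriggerLifetimes()
timeline = [state.advance_turn(d) for d in (True, False, False, False)]
print(timeline)  # [0.75, 0.75, 0.75, 1.0]
```

If request_slow_down() had been called earlier in the same call, the last entry would be 0.85 instead of 1.0, because the explicit request outlives the distress window.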

WPM measurement

Per-turn rough estimate:

wpm = words_in_final_transcript / (final_received_at - first_interim_at) × 60

first_interim_at is set by the first interim transcript of each turn (in on_user_input_transcribed); final_received_at is the turn-completion timestamp. The × 60 factor assumes both timestamps are in seconds.

Bucket cutoffs:

| WPM | Bucket | Offset |
|---|---|---|
| < 110 | slow | −0.05 |
| 110 ≤ wpm ≤ 180 | normal | 0 |
| > 180 | fast | 0 |

The fast bucket gets 0 (not a positive offset) because A.3 never speeds the agent up beyond baseline. Hospital callers may perceive "sped up" as "rushed" or "uncaring."
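
The estimate and the bucket cutoffs can be sketched as follows. Function names are hypothetical, timestamps are assumed to be in seconds, and words are counted by whitespace splitting, which is a simplification.

```python
from typing import Optional

SLOW_WPM_BELOW = 110   # wpm < 110  -> slow
FAST_WPM_ABOVE = 180   # wpm > 180  -> fast


def estimate_wpm(final_text: str, first_interim_at: float,
                 final_received_at: float) -> Optional[float]:
    """Rough per-turn WPM; returns None when the turn cannot be measured."""
    words = len(final_text.split())
    elapsed = final_received_at - first_interim_at
    if words == 0 or elapsed <= 0:
        return None  # caller keeps the last known bucket
    return words / elapsed * 60.0


def wpm_bucket(wpm: float) -> str:
    if wpm < SLOW_WPM_BELOW:
        return "slow"
    if wpm > FAST_WPM_ABOVE:
        return "fast"
    return "normal"


def wpm_offset(bucket: str) -> float:
    # Only the slow bucket moves the speed; fast callers never speed the agent up.
    return -0.05 if bucket == "slow" else 0.0


wpm = estimate_wpm("ik heb een vraag over mijn afspraak morgen alstublieft", 0.0, 6.0)
print(wpm, wpm_bucket(wpm), wpm_offset(wpm_bucket(wpm)))
```

Nine words over six seconds works out to 90 WPM, which falls in the slow bucket and yields the −0.05 offset.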

Plumbing — when update_options is called

The elevenlabs plugin marks its WebSocket connection as non_current whenever update_options(voice_settings=…) is called, forcing a connection rebuild that costs ~200-400 ms on the next say(). A.3 guards against this with an idempotency check:

target = voice_speed.compute_target_speed(self._speed_state)
if target == self._current_tts_speed:
    return  # no change → no API call
self._tts.update_options(voice_settings=VoiceSettings(speed=target))
self._current_tts_speed = target

Every actual speed change emits a structured-log event (see Observability below).
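
The effect of the guard can be checked in isolation with a stand-in TTS object. FakeTTS and SpeedApplier are hypothetical names for this sketch, and the fake takes a plain speed keyword instead of the plugin's voice_settings argument so that the example runs without LiveKit installed.

```python
class FakeTTS:
    """Stand-in for the real TTS object; records each update_options call."""

    def __init__(self) -> None:
        self.calls = []

    def update_options(self, *, speed: float) -> None:
        self.calls.append(speed)


class SpeedApplier:
    """Applies a target speed only when it differs from the current one."""

    def __init__(self, tts: FakeTTS, initial_speed: float = 1.0) -> None:
        self._tts = tts
        self._current = initial_speed

    def apply(self, target: float) -> bool:
        if target == self._current:
            return False            # unchanged: skip the call, keep the connection
        self._tts.update_options(speed=target)
        self._current = target
        return True


tts = FakeTTS()
applier = SpeedApplier(tts)
for target in (1.0, 0.85, 0.85, 0.85, 0.75):
    applier.apply(target)
print(tts.calls)  # [0.85, 0.75]
```

Five turns produce only two update calls, so only two connection rebuilds would be paid for in this sequence.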

Failure modes

| Scenario | Behavior |
|---|---|
| update_options raises (plugin error, network blip) | Log warning, retain current speed (silent degradation, never crash) |
| WPM measurement window is 0 or empty text | Skip WPM update; bucket stays at last known value |
| Feature flag off | All speed-change paths bypassed; agent runs at baseline 1.0 forever |
| ElevenLabs rejects an out-of-range speed value | Plugin's own validation surfaces error; we log and continue at last-good speed |
| Language switch during call | session.tts accessor reflects the new TTS instance; A.3 reads it dynamically (no stale reference) |

No path crashes the session, kills audio, or degrades the safety paths.
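
The "log and keep the last-good speed" behavior can be sketched as a thin wrapper. The function name and the callable-based signature are assumptions for illustration; the real integration calls the plugin directly.

```python
import logging
from typing import Callable

logger = logging.getLogger("voice_speed_sketch")  # hypothetical logger name


def safe_apply_speed(update: Callable[[float], None],
                     target: float, last_good: float) -> float:
    """Try to apply `target`; return the speed that is actually in effect."""
    try:
        update(target)
    except Exception:
        # Degrade silently: the call continues at the previous speed.
        logger.warning("TTS speed update failed; staying at %.2f",
                       last_good, exc_info=True)
        return last_good
    return target


def failing_update(speed: float) -> None:
    raise RuntimeError("simulated plugin error")


print(safe_apply_speed(failing_update, 0.75, last_good=1.0))
print(safe_apply_speed(lambda speed: None, 0.75, last_good=1.0))
```

Returning the effective speed lets the caller store it as the current value, so a failed update is retried naturally the next time the target differs.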

Observability

Every speed change emits one structured log event, voice_speed_changed. Two example payloads:

{"event": "voice_speed_changed",
"from_speed": 1.0, "to_speed": 0.85,
"reason": "explicit_slow_request",
"language": "nl"}

{"event": "voice_speed_changed",
"from_speed": 0.75, "to_speed": 0.85,
"reason": "distress_decayed",
"language": "nl"}

reason field values:

| Value | When it fires |
|---|---|
| explicit_slow_request | Caller said "spreek trager" or equivalent |
| distress_detected | Distress signal on interim or final transcript |
| distress_decayed | Distress counter reached 0 |
| wpm_offset_change | Caller's WPM bucket flipped (normal → slow or vice versa) |

Greppable via docker logs zol-voice-agent | grep voice_speed_changed. A pilot call where you say "spreek wat trager" should show one event with reason explicit_slow_request and to_speed=0.85.
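
For checks that go beyond grep, the event can be built and filtered programmatically. This is a sketch with hypothetical helper names; it assumes each log line is a bare JSON object, which may not hold if the container's logger adds a prefix.

```python
import json
from typing import Iterable, List

EVENT_NAME = "voice_speed_changed"
REASONS = {"explicit_slow_request", "distress_detected",
           "distress_decayed", "wpm_offset_change"}


def format_speed_event(from_speed: float, to_speed: float,
                       reason: str, language: str) -> str:
    """Serialize one event with the fields shown in the examples above."""
    if reason not in REASONS:
        raise ValueError(f"unknown reason: {reason}")
    return json.dumps({"event": EVENT_NAME, "from_speed": from_speed,
                       "to_speed": to_speed, "reason": reason,
                       "language": language})


def speed_events(log_lines: Iterable[str]) -> List[dict]:
    """Keep only lines that parse as JSON objects carrying the speed event."""
    events = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text log line, not ours
        if isinstance(record, dict) and record.get("event") == EVENT_NAME:
            events.append(record)
    return events


lines = ["agent started",
         format_speed_event(1.0, 0.85, "explicit_slow_request", "nl")]
events = speed_events(lines)
print(events[0]["reason"], events[0]["to_speed"])
```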

Settings

# Default ON — flag off bypasses all adaptive speed logic, agent runs
# at baseline 1.0 for every turn. Read once at module import; runtime
# change requires container restart.
VOICE_ADAPTIVE_SPEED_ENABLED=true
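
Reading the flag once at import time might look like the sketch below. The helper name is hypothetical, and the set of accepted "off" spellings is an assumption; only false is documented here.

```python
import os
from typing import Mapping, Optional

FLAG_NAME = "VOICE_ADAPTIVE_SPEED_ENABLED"
_OFF_VALUES = {"false", "0", "no", "off"}  # assumed spellings; only "false" is documented


def read_adaptive_speed_flag(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True unless the flag is explicitly set to an 'off' value."""
    env = os.environ if env is None else env
    raw = env.get(FLAG_NAME)
    if raw is None:
        return True  # default ON
    return raw.strip().lower() not in _OFF_VALUES


# Evaluated once when the module is imported; a later change to the
# environment has no effect until the process restarts.
ADAPTIVE_SPEED_ENABLED = read_adaptive_speed_flag()

print(read_adaptive_speed_flag({FLAG_NAME: "false"}))
print(read_adaptive_speed_flag({}))
```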

Test it

Speak a slow-down request and listen for the speed change in the agent's next utterance:

  • nl: "Spreek wat trager alsjeblieft."
  • en: "Can you speak more slowly?"
  • fr: "Parlez plus lentement, s'il vous plaît."
  • it: "Parla più lentamente."

Then check the container logs for the voice_speed_changed event with reason explicit_slow_request. The change is sticky for the rest of the call — every subsequent answer will be at speed 0.85.

For distress: speak a distress signal ("ik wil niet meer leven", "I want to die" — patterns are documented in the safety doc). Listen for the distress handoff template at speed 0.75. The next two turns stay slow; turn 3 decays back to 1.0 (or 0.85 if explicit was also set).

Files

  • voice_agent/voice_speed.py — pure composition module (testable without LiveKit)
  • voice_agent/agent.py — integration (~30 lines added across 4 hooks)
  • voice_agent/tests/unit/test_voice_speed.py — 24 design-locked tests
  • docs/plans/2026-04-25-adaptive-tts-speed-design.md — full design spec

References