Prosody injection (A.4)

Status

Designed and implemented 2026-04-25. Closes the Phase E prosody gap. Backend-only change in voice_answer_shaper.py; no operator flag, no docker rebuild.

Why prosody injection matters

ElevenLabs Multilingual v2 is a neural TTS model that reads text token-by-token without intrinsic understanding of phone-number digit groups, time ranges, or other domain-specific structures that human speakers prosodise naturally. "089 80 80 80" without explicit punctuation gets read as a single number; "8:00 - 18:00" without a bridge word reads as two disconnected times. The TTS model is doing its job; the input format is wrong.

A.4 rewrites the answer text before TTS in two narrow ways: phone-number normalisation (commas between digit groups for prosody pauses) and time-range bridging (per-language "tot" / "to" / "à" / "alle"). The rewrite is regex-based, deterministic, and unit-tested.

Why punctuation, not SSML

ElevenLabs Multilingual v2 (the production TTS model) does not support SSML break tags. Only eleven_v3 and eleven_turbo_v2_5 honor <break time="…ms"/>, and ZOL uses Multilingual v2 for its superior Dutch/French/Italian voice quality. That leaves one prosody lever: punctuation. ElevenLabs respects commas (~150ms pause), periods (~300ms), ellipses (~500ms). A.4 expresses prosody intent by rewriting answer text BEFORE it reaches the TTS pipeline.

Trade-offs

| Decision | Chosen | Alternative | Rejected because |
|---|---|---|---|
| Prosody mechanism | Punctuation rewriting | SSML <break> tags | Multilingual v2 (the production model) does not honour SSML; the v3 / turbo models that do honour it have audibly weaker Dutch/French/Italian voices. The voice-quality cost outweighs the SSML expressiveness gain. |
| TTS model upgrade path | Multilingual v2 + punctuation | Upgrade to v3 + SSML | Multilingual v2's voice quality is calibrated for the target population (Flemish callers, hospital context). v3's voices are tuned for English-dominant content; the Dutch voice quality regresses. |
| Where the rewrite lives | voice_answer_shaper.py (backend) | voice_agent (TTS adapter) | Centralising in the shaper means all callers (chat as a future fallback, voice now) share the same rewriting; the adapter is too late in the pipeline to also affect log artifacts and test fixtures. |

What it is

Two extensions to the answer-shaper that fix listener-perceptible unnaturalness in TTS output:

| Caller hears (before A.4) | Caller hears (with A.4) |
|---|---|
| "Bel ons op nul-acht-negen-tachtig-tachtig-tachtig" (one mashed long number) | "nul acht negen, tachtig, tachtig, tachtig" (with prosody pauses) |
| "Open van acht uur achttien uur" (two disconnected times) | "Open van acht uur tot achttien uur" (with bridge word) |

Both rewrites are plain-text only (commas and a bridge word): no SSML, no model upgrade, and safe across voices.

Phase E foundation

Phase E (shipped earlier in 2026) already handled two cases:

  • Space-separated phones: "089 80 80 80" → "089, 80, 80, 80"
  • Single times: "8:00" → "8 uur" / "8 o'clock" / "8 heures" / "ore 8"

A.4 extends both: more phone-format coverage and time-RANGE handling.

Phone-format extensions

Belgian printed materials use these formats — all now covered:

| Format | Example | Rewrite |
|---|---|---|
| Slash | 089/80/80/80 | 089, 80, 80, 80 |
| Dash | 089-80-80-80 | 089, 80, 80, 80 |
| Dot (FR/BE) | 089.80.80.80 | 089, 80, 80, 80 |
| Compact 9-digit (BE) | 089808080 | 089, 80, 80, 80 |
| Compact 10-digit (mobile) | 0473123456 | 0473, 12, 34, 56 |
| Space (Phase E) | 089 80 80 80 | 089, 80, 80, 80 |

The compact form requires a leading 0. Without that gate, the regex would catch any 9- or 10-digit run, such as reference IDs or account numbers. Belgian phone numbers in national format start with 0, so the leading zero is the conservative discriminator.
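The phone rewrites in the table above can be sketched with two patterns. This is a minimal illustration, not the code in voice_answer_shaper.py: the names `_SEPARATED`, `_COMPACT`, and `normalise_phones` are hypothetical, and the sketch assumes a number uses one separator throughout, so it leaves mixed-separator numbers untouched.

```python
import re

# Separated form: a 0-led prefix (2 to 4 digits) and three 2-digit groups,
# all joined by the SAME separator (space, slash, dot, or dash).
_SEPARATED = re.compile(
    r"(?<!\d)(0\d{1,3})([ /.\-])(\d{2})\2(\d{2})\2(\d{2})(?!\d)"
)

# Compact form: exactly 9 or 10 digits, and the first digit must be 0.
_COMPACT = re.compile(r"(?<!\d)(0\d{2,3})(\d{2})(\d{2})(\d{2})(?!\d)")


def normalise_phones(text: str) -> str:
    """Join phone digit groups with ', ' so the TTS pauses between them."""
    # Group 2 is the separator itself, so it is skipped when joining.
    text = _SEPARATED.sub(lambda m: ", ".join(m.group(1, 3, 4, 5)), text)
    return _COMPACT.sub(lambda m: ", ".join(m.groups()), text)


print(normalise_phones("Bel 089/80/80/80 of 0473123456"))
```

Because neither pattern accepts a comma as a separator, already-rewritten text passes through unchanged, which is what makes the rewrite idempotent in this sketch.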

Time-range bridge insertion

A range like 8:00 - 18:00 reads naturally only when bridged with the local equivalent of "to":

| Language | Bridge word | Example |
|---|---|---|
| Dutch | tot | 8 uur tot 18 uur |
| English | to | 8 o'clock to 18 o'clock |
| French | à | 8 heures à 18 heures |
| Italian | alle | ore 8 alle 18 |

The separator regex [-–—] covers the ASCII hyphen, en-dash, and em-dash, because Word and InDesign auto-replace hyphens with en-dashes and the substitution survives when content is copied from those tools.

The bridge word is inserted BEFORE _spell_time runs, so the individual times then get spelled in the local language as a normal single-time rewrite.
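A minimal sketch of the bridging step on its own, assuming times appear as H:MM or HH:MM. The names `_BRIDGES`, `_RANGE`, and `bridge_time_ranges` are hypothetical. The sketch stops before the single-time spelling stage, so its output still contains the numeric times that `_spell_time` would rewrite afterwards.

```python
import re

# Per-language bridge words; unsupported codes fall back to Dutch.
_BRIDGES = {"nl": "tot", "en": "to", "fr": "à", "it": "alle"}

_TIME = r"\d{1,2}:\d{2}"
# Two times joined by a hyphen, en-dash, or em-dash, with optional spaces.
_RANGE = re.compile(rf"({_TIME})\s*[-–—]\s*({_TIME})")


def bridge_time_ranges(text: str, lang: str) -> str:
    """Replace the dash in a time range with the language's bridge word."""
    bridge = _BRIDGES.get(lang, _BRIDGES["nl"])
    return _RANGE.sub(lambda m: f"{m.group(1)} {bridge} {m.group(2)}", text)


print(bridge_time_ranges("Open 8:00 - 18:00", "nl"))  # Open 8:00 tot 18:00
```

Running the bridge first keeps the single-time rewriter unaware of ranges: it only ever sees two ordinary times with a word between them.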

Failure modes

| Scenario | Behavior |
|---|---|
| Source text has no phone or time range | Pass-through, no rewrites |
| Phone already comma-formatted | Idempotent: no change |
| Mixed separators in one number | First-match wins, others see comma-formatted text and skip |
| Unsupported language code | Falls back to Dutch (tot) |
| 9-digit number without leading 0 | Pass-through (treated as an ID, not a phone number) |

No path crashes or returns malformed text. Every rewrite is a regex sub call with a fail-safe rewriter that defaults to the original match.

Settings

A.4 introduces no new env flags. The existing master flag still controls the whole shaper:

# Default ON; set to false to bypass Phase E + A.4 + all answer reshaping
VOICE_ANSWER_SHAPER_ENABLED=true
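How the backend parses this flag is not shown here. The sketch below is one plausible reading, assuming that only the literal value false (case-insensitive) disables the shaper and that an unset variable means ON. `shaper_enabled` and `maybe_shape` are hypothetical helpers.

```python
import os
from typing import Callable, Mapping


def shaper_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Default ON; only an explicit 'false' turns the shaper off (assumed)."""
    raw = env.get("VOICE_ANSWER_SHAPER_ENABLED", "true")
    return raw.strip().lower() != "false"


def maybe_shape(
    answer: str,
    shape: Callable[[str], str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Apply the shaper only when the master flag allows it."""
    return shape(answer) if shaper_enabled(env) else answer


# With the flag off, the answer bypasses all reshaping.
print(maybe_shape("089 80 80 80", str.upper, {"VOICE_ANSWER_SHAPER_ENABLED": "false"}))
```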

Test it

Ask the agent a question whose grounded answer mentions a phone number or opening hours, then listen for the prosody:

  • nl: "Wat is het telefoonnummer van cardiologie?" — listen for pauses between digit groups.
  • nl: "Wat zijn de openingsuren van de receptie?" — listen for the word "tot" between the start and end times.
  • en: "What are the visiting hours?" — listen for "to" as the bridge.
  • fr: "Quels sont les horaires d'ouverture?" — listen for "à".
  • it: "Quali sono gli orari di apertura?" — listen for "alle".

The 17 unit tests in backend/tests/unit/services/voice/test_a4_prosody_extensions.py verify each case in isolation.

Files

  • backend/app/services/voice/voice_answer_shaper.py — implementation
  • backend/tests/unit/services/voice/test_a4_prosody_extensions.py — design-locked tests
  • docs/plans/2026-04-25-prosody-injection-design.md — full design spec

References