Prosody injection (A.4)
Designed and implemented 2026-04-25. Closes the Phase E prosody
gap. Backend-only change in voice_answer_shaper.py; no operator
flag, no docker rebuild.
Why prosody injection matters
ElevenLabs Multilingual v2 is a neural TTS model that reads text token-by-token without intrinsic understanding of phone-number digit groups, time ranges, or other domain-specific structures that human speakers prosodise naturally. "089 80 80 80" without explicit punctuation gets read as a single number; "8:00 - 18:00" without a bridge word reads as two disconnected times. The TTS model is doing its job; the input format is wrong.
A.4 rewrites the answer text before TTS in two narrow ways: phone-number normalisation (commas between digit groups for prosody pauses) and time-range bridging (per-language "tot" / "to" / "à" / "alle"). The rewrite is regex-based, deterministic, and unit-tested.
Why punctuation, not SSML
ElevenLabs Multilingual v2 (the production TTS model) does not support SSML break tags. Only eleven_v3 and eleven_turbo_v2_5 honor <break time="…ms"/>, and ZOL uses Multilingual v2 for its superior Dutch/French/Italian voice quality. That leaves one prosody lever: punctuation. ElevenLabs respects commas (~150ms pause), periods (~300ms), ellipses (~500ms). A.4 expresses prosody intent by rewriting answer text BEFORE it reaches the TTS pipeline.
Trade-offs
| Decision | Chosen | Alternative | Rejected because |
|---|---|---|---|
| Prosody mechanism | Punctuation rewriting | SSML <break> tags | Multilingual v2 (the production model) does not honour SSML; the v3 / turbo models that do honour it have audibly weaker Dutch/French/Italian voices. The voice-quality cost outweighs the SSML expressiveness gain. |
| TTS model upgrade path | Multilingual v2 + punctuation | Upgrade to v3 + SSML | Multilingual v2's voice quality is calibrated for the target population (Flemish callers, hospital context). v3's voices are tuned for English-dominant content; the Dutch voice quality regresses. |
| Where the rewrite lives | voice_answer_shaper.py (backend) | voice_agent (TTS adapter) | Centralising in the shaper means all callers (chat as a future fallback, voice now) share the same rewriting; the adapter is too late in the pipeline to also affect log artifacts and test fixtures. |
What it is
Two extensions to the answer-shaper that fix listener-perceptible unnaturalness in TTS output:
| Caller hears (before A.4) | Caller hears (with A.4) |
|---|---|
| "Bel ons op nul-acht-negen-tachtig-tachtig-tachtig" (one mashed long number) | "nul acht negen, tachtig, tachtig, tachtig" (with prosody pauses) |
| "Open van acht uur achttien uur" (two disconnected times) | "Open van acht uur tot achttien uur" (with bridge word) |
Both rewrites are plain-text edits (inserted commas and a bridge word): no SSML, no model upgrade, and safe across voices.
Phase E foundation
Phase E (shipped earlier in 2026) already handled two cases:
- Space-separated phones: "089 80 80 80" → "089, 80, 80, 80"
- Single times: "8:00" → "8 uur" / "8 o'clock" / "8 heures" / "ore 8"
A.4 extends both: more phone-format coverage and time-RANGE handling.
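The single-time rewrite can be sketched as follows. This is a minimal illustration, not the shaper's actual code: the function name `spell_whole_hours`, the template table, and the restriction to whole hours are assumptions made for the example (the per-language wording is taken from the examples above; how minutes are spelled is not described here, so such times are left untouched). The fallback to Dutch for unknown language codes is also an assumption for this step.

```python
import re

# Per-language templates taken from the Phase E examples; {h} is the hour.
_TEMPLATES = {
    "nl": "{h} uur",
    "en": "{h} o'clock",
    "fr": "{h} heures",
    "it": "ore {h}",
}

# Whole hours only (H:00 or HH:00). Times with minutes are left untouched
# in this sketch because their spelling rules are not specified here.
_WHOLE_HOUR = re.compile(r"(?<![\d:])([01]?\d|2[0-3]):00(?![\d:])")


def spell_whole_hours(text, lang="nl"):
    """Replace H:00 times with a spoken form in the given language."""
    template = _TEMPLATES.get(lang, _TEMPLATES["nl"])  # assumed fallback
    return _WHOLE_HOUR.sub(
        lambda m: template.format(h=int(m.group(1))), text
    )
```

For example, `spell_whole_hours("Open om 8:00", "nl")` yields `"Open om 8 uur"`, while `"8:30"` passes through unchanged.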
Phone-format extensions
Belgian printed materials use these formats — all now covered:
| Format | Example | Rewrite |
|---|---|---|
| Slash | 089/80/80/80 | 089, 80, 80, 80 |
| Dash | 089-80-80-80 | 089, 80, 80, 80 |
| Dot (FR/BE) | 089.80.80.80 | 089, 80, 80, 80 |
| Compact 9-digit (BE) | 089808080 | 089, 80, 80, 80 |
| Compact 10-digit (mobile) | 0473123456 | 0473, 12, 34, 56 |
| Space (Phase E) | 089 80 80 80 | 089, 80, 80, 80 |
The compact form requires a leading 0. Without that gate, the
regex would catch any 9- or 10-digit number, such as reference
IDs or account numbers. Belgian phone numbers start with 0, so
this is the conservative discriminator.
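The format table and the leading-0 gate can be sketched with two regular expressions. This is a hedged illustration only: the names `_SEPARATED`, `_COMPACT`, and `normalise_phones` are hypothetical, and the real patterns in voice_answer_shaper.py may differ (for instance in how they treat mixed separators).

```python
import re

# Separated forms: a 3-4 digit prefix starting with 0, then three 2-digit
# groups, each preceded by one space, slash, dot, or dash.
_SEPARATED = re.compile(
    r"(?<!\d)(0\d{2,3})[ /.\-](\d{2})[ /.\-](\d{2})[ /.\-](\d{2})(?!\d)"
)

# Compact forms: a run of exactly 9 or 10 digits that starts with 0.
# The lookarounds stop the pattern from matching inside longer digit runs.
_COMPACT = re.compile(r"(?<!\d)(0\d{2,3})(\d{2})(\d{2})(\d{2})(?!\d)")


def normalise_phones(text):
    """Rewrite phone numbers as comma-separated digit groups."""
    def join(match):
        return ", ".join(match.groups())

    text = _SEPARATED.sub(join, text)
    return _COMPACT.sub(join, text)
```

Because a comma followed by a space is not one of the accepted separators, already-normalised text such as `"089, 80, 80, 80"` does not match again, which gives the idempotent behaviour. A digit run without a leading 0 never matches and passes through.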
Time-range bridge insertion
A range like 8:00 - 18:00 reads naturally only when bridged with
the local equivalent of "to":
| Language | Bridge word | Example |
|---|---|---|
| Dutch | tot | 8 uur tot 18 uur |
| English | to | 8 o'clock to 18 o'clock |
| French | à | 8 heures à 18 heures |
| Italian | alle | ore 8 alle 18 |
The separator regex [-–—] covers ASCII hyphen, en-dash, and
em-dash, because Word and InDesign auto-replace hyphens with
en-dashes when content is copied from those tools.
The bridge word is inserted BEFORE _spell_time runs, so the
individual times then get spelled in the local language as a normal
single-time rewrite.
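A minimal sketch of the bridge step, assuming a hypothetical function `bridge_time_ranges` and a simple H:MM pattern for each side of the range (the actual pattern in the shaper is not shown in this document). It only inserts the bridge word; the time-spelling pass would run on its output afterwards.

```python
import re

# Bridge words from the table above.
_BRIDGE = {"nl": "tot", "en": "to", "fr": "à", "it": "alle"}

# Two H:MM times joined by a hyphen, en dash, or em dash, with optional
# whitespace around the separator.
_RANGE = re.compile(r"(\d{1,2}:\d{2})\s*[-–—]\s*(\d{1,2}:\d{2})")


def bridge_time_ranges(text, lang="nl"):
    """Replace the dash in a time range with the language's bridge word."""
    word = _BRIDGE.get(lang, _BRIDGE["nl"])  # unsupported codes use Dutch
    return _RANGE.sub(
        lambda m: f"{m.group(1)} {word} {m.group(2)}", text
    )
```

With this ordering, `"Open 8:00 - 18:00"` first becomes `"Open 8:00 tot 18:00"`, and each time is then an ordinary single time for the spelling step.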
Failure modes
| Scenario | Behavior |
|---|---|
| Source text has no phone or time range | Pass-through, no rewrites |
| Phone already comma-formatted | Idempotent — no change |
| Mixed separators in one number | First-match wins, others see comma-formatted text and skip |
| Unsupported language code | Falls back to Dutch (tot) |
| 9-digit number without leading 0 | Pass-through (treated as an ID, not a phone number) |
No path crashes or returns malformed text. Every rewrite is a
regex sub call with a fail-safe rewriter that defaults to the
original match.
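The fail-safe pattern described above can be sketched as a small wrapper around a `re.sub` callback. The names `fail_safe` and `risky` are hypothetical and exist only for this illustration; the shaper's own wrapper may be structured differently.

```python
import re


def fail_safe(rewriter):
    """Wrap a re.sub callback so any failure returns the original match."""
    def wrapped(match):
        try:
            result = rewriter(match)
        except Exception:
            return match.group(0)
        # Guard against a rewriter that returns a non-string by accident.
        return result if isinstance(result, str) else match.group(0)
    return wrapped


def risky(match):
    # Hypothetical rewriter that fails on one input to show the fallback.
    if match.group(0) == "13":
        raise ValueError("cannot rewrite")
    return f"<{match.group(0)}>"


out = re.sub(r"\d+", fail_safe(risky), "12 13 14")
```

Here the failing match is left as it was while the others are rewritten, so one bad case cannot corrupt or drop the rest of the answer.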
Settings
A.4 introduces no new env flags. The existing master flag still controls the whole shaper:
# Default ON; set to false to bypass Phase E + A.4 + all answer reshaping
VOICE_ANSWER_SHAPER_ENABLED=true
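As a rough sketch of how such a master flag could gate the shaper: only the variable name and its default of ON come from this document. The helper names and the set of strings treated as "off" are assumptions for the example.

```python
import os

# Assumption: these spellings disable the shaper; anything else enables it.
_OFF_VALUES = {"false", "0", "no", "off"}


def shaper_enabled(environ=os.environ):
    """Return True unless VOICE_ANSWER_SHAPER_ENABLED is set to an off value."""
    raw = environ.get("VOICE_ANSWER_SHAPER_ENABLED", "true")
    return raw.strip().lower() not in _OFF_VALUES


def shape_answer(text, rewrite, environ=os.environ):
    """Apply the rewrite only when the master flag is on."""
    return rewrite(text) if shaper_enabled(environ) else text
```

Passing the environment as a parameter keeps the check easy to exercise in unit tests without mutating the process environment.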
Test it
Ask the agent a question whose grounded answer mentions a phone number or opening hours, then listen for the prosody:
- nl: "Wat is het telefoonnummer van cardiologie?" — listen for pauses between digit groups.
- nl: "Wat zijn de openingsuren van de receptie?" — listen for the word "tot" between the start and end times.
- en: "What are the visiting hours?" — listen for "to" as the bridge.
- fr: "Quels sont les horaires d'ouverture?" — listen for "à".
- it: "Quali sono gli orari di apertura?" — listen for "alle".
The 17 unit tests in
backend/tests/unit/services/voice/test_a4_prosody_extensions.py
verify each case in isolation.
Files
- backend/app/services/voice/voice_answer_shaper.py: implementation
- backend/tests/unit/services/voice/test_a4_prosody_extensions.py: design-locked tests
- docs/plans/2026-04-25-prosody-injection-design.md: full design spec
References
- ElevenLabs Multilingual v2 — production TTS model whose punctuation-prosody behaviour the rewriter targets
- Wang et al. 2017
- ITU-T E.164