How to control TTS pronunciation with SSML
SSML lets you tell VICIdial's text-to-speech engine how to say numbers, letters, and account IDs instead of guessing the pronunciation.
When VICIdial reads text aloud with its text-to-speech engine, it has to guess how to pronounce what you wrote. Most of the time that guess is fine. But the moment you put a number or an account ID in front of it, the guesses get strange fast. SSML, the Speech Synthesis Markup Language, is how you stop guessing and tell the engine exactly what to say.
Why plain text is not enough
The TTS Text field of a TTS (text to speech) entry accepts SSML and passes it straight through to the speech engine. If you skip the markup, the engine treats numbers as quantities. Write 12574 and it reads back "twelve thousand five hundred and seventy four." That is arithmetically correct but useless on a call where you wanted the account number read digit by digit.
This matters because a TTS entry can feed any campaign audio prompt, and those prompts are heard by real people. A mispronounced confirmation number defeats the whole point of reading it back.
Spelling things out digit by digit
The fix is the say-as directive. Wrap the value and tell the engine to treat it as separate characters rather than a single number:
<say-as type='acronym'>12574</say-as>
With that wrapper the engine reads "one two five seven four" instead of a single large number. The same idea covers letters, codes, and anything else you want spoken character by character.
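If you generate TTS text from a script rather than typing it by hand, the wrapper can be built with a small helper. The following is a minimal sketch, not part of VICIdial: the function name `say_as_characters` is hypothetical, and it only produces the string you would place in the TTS Text field, using the same `type='acronym'` form shown above.

```python
from xml.sax.saxutils import escape


def say_as_characters(value: str) -> str:
    """Wrap a value so the engine reads it character by character.

    Hypothetical helper: it only builds the markup string and does not
    talk to VICIdial or the speech engine.
    """
    # Escape &, < and > so a stray character cannot break the markup.
    return f"<say-as type='acronym'>{escape(str(value))}</say-as>"


print(say_as_characters("12574"))
```

Escaping the value first is a defensive choice: an account ID that happens to contain `&` or `<` would otherwise produce malformed markup.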
What else SSML controls
Pronunciation is the headline use, but the markup reaches further. You can shape how a prompt sounds without re-recording anything:
- Pronunciation of numbers, acronyms, and account IDs
- Volume of the spoken output
- Pitch of the voice
- Rate, so the engine slows down for an important detail
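In standard SSML, volume, pitch, and rate are set with the `prosody` element. The sketch below builds that element as a string. It assumes your speech engine honors `prosody` and the standard keyword values such as `slow` or `loud`, which is worth confirming against your engine's documentation. The helper name `prosody` and its parameters are illustrative only.

```python
from xml.sax.saxutils import escape

# Attribute names defined for <prosody> that this sketch supports.
ALLOWED = ("rate", "pitch", "volume")


def prosody(inner: str, **settings: str) -> str:
    """Wrap text or already-built SSML in a <prosody> element.

    `inner` is inserted as-is, so it may contain nested markup.
    """
    unknown = set(settings) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unsupported prosody attributes: {sorted(unknown)}")
    # Emit attributes in a fixed order so the output is predictable.
    attrs = "".join(
        f" {name}='{escape(settings[name], {chr(39): '&apos;'})}'"
        for name in ALLOWED
        if name in settings
    )
    return f"<prosody{attrs}>{inner}</prosody>"


print(prosody("Please write this down.", rate="slow"))
```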
Because a TTS entry can pull values from your default lead tables, you can combine dynamic data with these controls. A confirmation number from the lead record can be wrapped in say-as so every customer hears their own number spelled out cleanly.
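To make the idea concrete, here is a rough sketch that simulates the substitution in Python. The lead record is a plain dictionary and the field name `confirmation_number` is a placeholder. How a lead value actually reaches the TTS entry depends on your VICIdial configuration and is not shown here. The `prosody` wrapper is standard SSML and assumes the engine supports it.

```python
from xml.sax.saxutils import escape


def confirmation_prompt(lead: dict) -> str:
    """Build prompt text that spells out a lead's confirmation number slowly."""
    # Placeholder field name; substitute whatever your lead data uses.
    number = escape(str(lead["confirmation_number"]))
    spelled = f"<say-as type='acronym'>{number}</say-as>"
    # Slow the engine down only for the part the caller needs to write down.
    return f"Your confirmation number is <prosody rate='slow'>{spelled}</prosody>."


lead = {"first_name": "Dana", "confirmation_number": "12574"}
print(confirmation_prompt(lead))
```

Keeping the slow rate scoped to the number alone means the rest of the sentence plays at normal speed.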
How the markup reaches the caller
The path is short. Your campaign prompt points at a TTS entry, the entry's SSML is handed to the speech engine on the dialer, and the rendered audio is played back to the caller through the same Asterisk dialplan that handles every other prompt.
sequenceDiagram
participant C as Campaign Prompt
participant T as TTS Entry SSML
participant E as Speech Engine
participant A as Asterisk Dialplan
participant P as Caller
C->>T: Request rendered audio
T->>E: Pass SSML markup
E->>A: Return spoken audio
A->>P: Play prompt
Where to go next
SSML controls how text is spoken, but the rendered output still lands in your audio store as a normal sound file. To see how that store works, read the VICIdial audio store guide. For the full picture of prompts, voicemail, and TTS together, see the audio prompts and TTS guide. The engine that interprets this SSML is Cepstral, which we cover in our article on what Cepstral is.
TTS shines in an IVR (interactive voice response) or a survey-style flow where dynamic data has to be spoken back accurately. If you would rather run a managed dialer where this is already wired up and tested, see our plans and pricing.
About VICIfast LLC
VICIfast LLC operates a managed VICIdial hosting + BYOI service for outbound and inbound call centers. We run the dialers, the carriers, the recordings pipeline, and the compliance plumbing so operators don’t have to.
Citing this article
VICIfast Engineering. “How to control TTS pronunciation with SSML”. VICIfast LLC, June 27, 2026. Retrieved from https://vicifast.com/blog/vicidial-tts-ssml-explained