FreeClimb TTS
Learn best practices to obtain the best quality text-to-speech renderings.
What is FreeClimb TTS?
FreeClimb TTS was built on the foundation of the open-source Coqui TTS project. With FreeClimb TTS, users can create realistic, generative AI voices, leading to more personalized and engaging interactions.
How to execute FreeClimb TTS
To use this feature in your FreeClimb application, you'll utilize the PerCL Say command. First, select your preferred voice, then configure the API to define the specific attributes. When a caller interacts with your application, the chosen voice, such as "Eva" in the example below, will deliver the specified text.
{
"Say" : {
"text" : "Welcome to FreeClimb.",
"voice" : "freeclimb.neural.eva",
"language" : "en-us"
}
}
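If you generate PerCL from application code, a small helper can keep the Say attributes consistent. The sketch below is only an illustration: `build_say_command` is a hypothetical name, and the default voice and language are simply the values from the example above.

```python
import json

def build_say_command(text, voice="freeclimb.neural.eva", language="en-us"):
    """Return a PerCL Say command as a dict (hypothetical helper, not part of any SDK)."""
    return {"Say": {"text": text, "voice": voice, "language": language}}

# Serialize the command to JSON, matching the shape of the example above.
print(json.dumps(build_say_command("Welcome to FreeClimb."), indent=2))
```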
General TTS Best Practices
The following guidelines describe how to obtain the best-quality text-to-speech renderings from FreeClimb TTS.
Foreign Loan Words, Archaic Words, and Proper Nouns
Because the current model is trained on domain-general English words, it may struggle to pronounce foreign loan words, archaic words, and proper nouns such as people's names and locations, particularly when those names are uncommon (and therefore unlikely to appear frequently in the training data). Examples include:
Eau Claire
Doppelgänger
chuse
There is currently no method to ensure such words are rendered correctly, but it may help to experiment with using similar-sounding, higher-frequency words in their place.
Abbreviations and Acronyms
The current model has limited knowledge of various abbreviations and acronyms.
a. Abbreviations
Abbreviations should be expanded to their full word(s).
abbreviation → suggested input to TTS
Dr. -> Doctor or Drive (depending on context)
St. -> Street or Saint (depending on context)
Ct. -> Court
b. Acronyms
For acronyms, the specific instructions depend on how exactly the acronym should be spoken.
c. Acronyms Spoken as Letters
For acronyms that should be spoken as a series of letters, capitalize the acronym (FBI). If the acronym is still being spoken as if it were a word, put a space between each letter.
acronym → suggested input to TTS
fbi -> F B I
usa -> USA
am -> AM
pm -> PM
d. Acronyms Spoken as Words
For acronyms that should be spoken as a whole word, use the lowercase version of how the word should be said. Using the fully-capitalized version of the acronym will always cause the acronym to be spelled out. For example, “PIN” would be spoken as “P(pee) I(eye) N(en)”.
acronym → suggested input to TTS
PIN -> pin
SIM -> sim
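The two acronym rules can be applied programmatically once you know how each acronym should be spoken. This is a minimal sketch with hypothetical helper names; deciding which rule applies to a given acronym is still up to you.

```python
def spell_as_letters(acronym):
    """Format an acronym to be read letter by letter, e.g. 'fbi' -> 'F B I'."""
    return " ".join(acronym.upper())

def speak_as_word(acronym):
    """Format an acronym to be read as a word, e.g. 'PIN' -> 'pin'."""
    return acronym.lower()

print(spell_as_letters("fbi"))
print(speak_as_word("PIN"))
```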
Punctuation
Proper punctuation should be used for the best-quality rendering. Sentences should always end with the appropriate punctuation depending on whether it is a statement, question, or exclamation. Beyond sentence-level punctuation, every input utterance (even if only a single word) should end in a period.
A as in Alfa -> A as in Alfa.
Yes -> Yes.
Help -> Help.
There are situations in which the model will incorrectly render text that contains punctuation other than commas, periods, question marks, or exclamation points. In such cases, you should remove all punctuation except for the aforementioned commas and periods/question marks/exclamation points and try rendering again. If the problem persists, remove all punctuation except for periods/question marks/exclamation points and try once more.
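The punctuation guidance above could be automated along these lines. This is a sketch under stated assumptions: the function names are hypothetical, and the stripping step removes every character that is not a letter, digit, underscore, whitespace, or one of the allowed marks, which also removes apostrophes and hyphens.

```python
import re

def ensure_terminal_punctuation(text):
    """Append a period unless the text already ends in '.', '?' or '!'."""
    text = text.strip()
    if not text:
        return text
    return text if text[-1] in ".?!" else text + "."

def strip_unsupported_punctuation(text, keep_commas=True):
    """Remove punctuation other than . ? ! (and optionally commas)."""
    allowed = ".?!," if keep_commas else ".?!"
    # Keep word characters, whitespace, and the allowed punctuation marks.
    return re.sub(rf"[^\w\s{re.escape(allowed)}]", "", text)

print(ensure_terminal_punctuation("A as in Alfa"))
print(strip_unsupported_punctuation('Say "hi" (now), please!'))
```

Try the first pass with `keep_commas=True`, and fall back to `keep_commas=False` only if the rendering problem persists.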
Newline and Carriage Return
The system cannot handle newline or carriage return characters, so they should not be used. The examples below show input to avoid.
This is an example of a\n newline character.
This is an example of a carriage return\r character.
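One way to guard against these characters is to normalize whitespace before sending text. The helper below is a hypothetical sketch; note that it collapses every run of whitespace (including tabs) into a single space, which is broader than strictly required.

```python
def remove_line_breaks(text):
    """Replace newlines, carriage returns, and other whitespace runs with single spaces."""
    return " ".join(text.split())

print(remove_line_breaks("This is an example of a\n newline character."))
```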
Input Text Length
Individual sentences should not exceed 200 characters (letters and spaces). Multisentential utterances will automatically be split by the system at sentence boundaries, generated separately, and then concatenated to produce a single .wav file. Latency grows with input length, so a good rule of thumb is to limit input to 35 words per sample generated.
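A simple pre-flight check can flag input that exceeds these limits. The sketch below makes several assumptions: the names are hypothetical, sentences are split naively on whitespace that follows ".", "?" or "!", and length is counted over all characters rather than only letters and spaces.

```python
import re

MAX_SENTENCE_CHARS = 200  # per-sentence limit from the guideline above
MAX_WORDS = 35            # rule-of-thumb word limit per generated sample

def split_sentences(text):
    """Naive sentence splitter: break on whitespace that follows . ? or !"""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

def check_input_length(text):
    """Return a list of warnings; an empty list means the input is within limits."""
    warnings = []
    for sentence in split_sentences(text):
        if len(sentence) > MAX_SENTENCE_CHARS:
            warnings.append(f"Sentence exceeds {MAX_SENTENCE_CHARS} characters: {sentence[:30]}...")
    word_count = len(text.split())
    if word_count > MAX_WORDS:
        warnings.append(f"Input has {word_count} words; consider splitting it.")
    return warnings

print(check_input_length("Welcome to FreeClimb. How can we help you today?"))
```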
Numbers
Numbers should be spelled out rather than typed in numeric form, as numerals can sometimes lead to rendering errors.
number → suggested input to TTS
1 -> one
1,000 -> one thousand
64 -> sixty-four
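If your text contains numerals, you could spell them out before sending it to TTS. The converter below is a minimal sketch limited to whole numbers from 0 to 999,999; `number_to_words` is a hypothetical helper, and it uses the American style without "and" ("one hundred twenty-five").

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer in the range 0..999,999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0 to 999,999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")

print(number_to_words(64))
```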
SSML TTS Best Practices
Coqui TTS supports selected elements of the Speech Synthesis Markup Language (SSML). Interested readers can find out more in the W3C specification, Speech Synthesis Markup Language (SSML) Version 1.1. SSML is used to specify exactly how a piece of text is to be spoken by the TTS system. Each section below covers in detail a specific SSML feature we currently support and how to use it. All SSML content must be contained within a single set of <speak> tags.
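Because SSML is XML, you can catch unbalanced or malformed tags locally before submitting text. The sketch below wraps a fragment in `<speak>` tags (as the full examples later in this guide do) and parses it with Python's standard XML parser. `wrap_in_speak` is a hypothetical helper, and passing this check only shows the markup is well-formed, not that every tag is supported.

```python
import xml.etree.ElementTree as ET

def wrap_in_speak(ssml_body):
    """Wrap an SSML fragment in <speak> tags and verify it is well-formed XML."""
    document = f"<speak>{ssml_body}</speak>"
    ET.fromstring(document)  # raises xml.etree.ElementTree.ParseError if malformed
    return document

print(wrap_in_speak('Please hold. <break time="1s"/> Thank you.'))
```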
Say-As
The say-as SSML tag allows users to control how a piece of data, such as a date, time, or plain number, should be spoken. The types of data it supports are detailed in the subsections below.
Dates
You can use the <say-as interpret-as="date"> tag to render a date in the form MM/DD/YYYY where:
- MM is a value between 01 and 12 (can be single-digit)
- DD is a value between 01 and 31 (can be single-digit)
- YYYY is a value greater than 0001 and less than or equal to 9999.
While the method we use to parse the date will attempt to infer the correct date from different formats, you should use the MM/DD/YYYY format for the most accurate rendering.
EXAMPLES:
<say-as interpret-as="date">03/28/2024</say-as>
would be rendered as “March twenty-eighth, twenty twenty-four”
<say-as interpret-as="date">5/6/2027</say-as>
would be rendered as “May sixth, twenty twenty-seven”
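When the date comes from application data, formatting it from a date object avoids ambiguous input. This is a minimal sketch; `say_as_date` is a hypothetical helper that always emits the recommended MM/DD/YYYY form.

```python
from datetime import date

def say_as_date(d):
    """Render a datetime.date as a say-as date tag in MM/DD/YYYY form."""
    return f'<say-as interpret-as="date">{d.month:02d}/{d.day:02d}/{d.year:04d}</say-as>'

print(say_as_date(date(2024, 3, 28)))
```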
Times
You can use the <say-as interpret-as="time"> tag to render a time in the form HH:MM [AM/PM] where:
- HH is between 01 and 12 (can be single-digit)
- MM is between 00 and 59
- [AM/PM] is optional and will not be present in the rendered output if it is not present in the SSML
- If included, it must be capitalized and separated from the time with a single space
EXAMPLES:
<say-as interpret-as="time">3:15 PM</say-as>
would be rendered as “three fifteen (pee) (em)”
<say-as interpret-as="time">12:00 AM</say-as>
would be rendered as “twelve (ayy) (em)”
<say-as interpret-as="time">5:47</say-as>
would be rendered as “five forty-seven”
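Application code usually holds times in 24-hour form, so a conversion step is needed to meet the 12-hour format above. The sketch below uses a hypothetical `say_as_time` helper; the capitalized AM/PM suffix separated by a single space follows the rules listed above.

```python
from datetime import time

def say_as_time(t, include_period=True):
    """Render a datetime.time as a 12-hour say-as time tag."""
    hour12 = t.hour % 12 or 12  # 0 and 12 both map to 12
    text = f"{hour12}:{t.minute:02d}"
    if include_period:
        text += " AM" if t.hour < 12 else " PM"
    return f'<say-as interpret-as="time">{text}</say-as>'

print(say_as_time(time(15, 15)))
```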
Currency
You can use the <say-as interpret-as="currency"> tag to render a monetary amount in the form $D[.CC] where:
- $ - the dollar sign should be present before the currency amount to ensure that the amount is rendered as US dollars
- You can specify other currencies by prepending the specific currency symbol (such as the Euro - €) before the currency value.
- Only US dollars are officially supported at this time and other currencies may have issues with rendering
- D is a dollar amount greater than or equal to 0
- Exceptionally large numbers (quadrillion+) may not be rendered correctly
- If set to 0, the dollar amount will be read (“zero dollars”) along with the cent amount
- [.CC] is an optional value in cents between 00 and 99
- If not present, only the dollar amount will be rendered
EXAMPLES:
<say-as interpret-as="currency">$5.55</say-as>
would be rendered as “five dollars and fifty-five cents”
<say-as interpret-as="currency">$125</say-as>
would be rendered as “one hundred and twenty-five dollars”
<say-as interpret-as="currency">$0.25</say-as>
would be rendered as “zero dollars and twenty-five cents”
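Monetary values are best formatted from a decimal type so that cents are not distorted by floating-point rounding. The following is a sketch only: `say_as_currency` is a hypothetical helper, it supports US dollars alone, and it omits the cents when the amount is a whole number of dollars.

```python
from decimal import Decimal, ROUND_HALF_UP

def say_as_currency(amount):
    """Render a non-negative US dollar amount as a say-as currency tag."""
    value = Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    if value < 0:
        raise ValueError("amount must not be negative")
    if value == value.to_integral_value():
        text = f"${int(value)}"   # whole dollars: omit the cents
    else:
        text = f"${value}"        # always two decimal places after quantize
    return f'<say-as interpret-as="currency">{text}</say-as>'

print(say_as_currency("5.55"))
```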
Numbers
Numbers can be rendered in a variety of ways, depending on the specific use-case. The subsections below detail the currently supported methods of rendering numbers.
Negative numbers (numbers preceded with a minus sign/hyphen [-]) will be rendered as “minus X” and not “negative X”. If you want to render a number as “negative X” then you will need to explicitly include “negative X” in the text, where X is the positive version of the number.
a. Cardinal
You can use the <say-as interpret-as="number" format="cardinal"> tag to render a positive integer or decimal number in its cardinal format. Note that for numbers between 1000 and 9999, you must use commas in the numbers to prevent them from being interpreted as dates (this is a known bug/feature with the tool we use to parse the SSML).
EXAMPLES:
<say-as interpret-as="number" format="cardinal">5</say-as>
would be rendered as “five”
<say-as interpret-as="number" format="cardinal">1,234</say-as>
would be rendered as “one thousand two hundred and thirty-four”
<say-as interpret-as="number" format="cardinal">27.3</say-as>
would be rendered as “twenty-seven point three”
<say-as interpret-as="number" format="cardinal">-7</say-as>
would be rendered as “minus seven”
b. Ordinal
You can use the <say-as interpret-as="number" format="ordinal"> tag to render a positive integer number in its ordinal format. Note that for numbers between 1000 and 9999, you must use commas in the numbers to prevent them from being interpreted as dates (this is a known bug/feature with the tool we use to parse the SSML).
EXAMPLES:
<say-as interpret-as="number" format="ordinal">5</say-as>
would be rendered as “fifth”
<say-as interpret-as="number" format="ordinal">1,234</say-as>
would be rendered as “one thousand two hundred and thirty-fourth”
<say-as interpret-as="number" format="ordinal">33</say-as>
would be rendered as “thirty-third”
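Python's "," format specifier inserts thousands separators, which covers the comma requirement for numbers between 1000 and 9999 in both the cardinal and ordinal formats. The helper name below is hypothetical and the sketch performs no range checking.

```python
def say_as_number(n, fmt="cardinal"):
    """Render a number as a cardinal or ordinal say-as tag with thousands separators."""
    if fmt not in ("cardinal", "ordinal"):
        raise ValueError("fmt must be 'cardinal' or 'ordinal'")
    return f'<say-as interpret-as="number" format="{fmt}">{n:,}</say-as>'

print(say_as_number(1234))
print(say_as_number(33, fmt="ordinal"))
```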
c. Digits
You can use the <say-as interpret-as="number" format="digits"> tag to render a series of digits. Each digit must be separated by a space for the digits to be correctly rendered.
If you need to render a single number without any spaces as a series of digits (ex: 98765), see the Spell-Out section.
EXAMPLES:
<say-as interpret-as="number" format="digits">5</say-as>
would be rendered as “five”
<say-as interpret-as="number" format="digits">1 2 3 4</say-as>
would be rendered as “one two three four”
<say-as interpret-as="number" format="digits">3 3</say-as>
would be rendered as “three three”
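Since each digit must be separated by a space, a digit string such as a PIN or code needs to be spaced out first. A minimal sketch, with `say_as_digits` as a hypothetical helper:

```python
def say_as_digits(digits):
    """Render a string of ASCII digits as a space-separated say-as digits tag."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("digits must contain only the characters 0-9")
    spaced = " ".join(digits)
    return f'<say-as interpret-as="number" format="digits">{spaced}</say-as>'

print(say_as_digits("1234"))
```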
d. Year
You can use the <say-as interpret-as="number" format="year"> tag to render a positive integer number as a spoken year.
EXAMPLES:
<say-as interpret-as="number" format="year">2024</say-as>
would be rendered as “twenty twenty-four”
<say-as interpret-as="number" format="year">2002</say-as>
would be rendered as “two thousand and two”
<say-as interpret-as="number" format="year">1996</say-as>
would be rendered as “nineteen ninety-six”
<say-as interpret-as="number" format="year">512</say-as>
would be rendered as “five twelve”
Spell-Out
You can use the <say-as interpret-as="spell-out"> tag to render a number as a series of digits, and to render special characters. All supported characters are listed in the examples below.
EXAMPLES:
<say-as interpret-as="spell-out">-</say-as>
( - ) would be rendered as “dash”
<say-as interpret-as="spell-out">/</say-as>
( / ) would be rendered as “slash”
<say-as interpret-as="spell-out">+</say-as>
( + ) would be rendered as “plus”
<say-as interpret-as="spell-out">.</say-as>
( . ) would be rendered as “dot”
<say-as interpret-as="spell-out">*</say-as>
( * ) would be rendered as “star”
<say-as interpret-as="spell-out">@</say-as>
( @ ) would be rendered as “at”
<say-as interpret-as="spell-out">12345</say-as>
would be rendered as “one two three four five”
Websites and Email Addresses
While rendering a website or email address is not supported as a single say-as tag, you can combine multiple say-as tags to render either type of data.
EXAMPLES:
support <say-as interpret-as="spell-out">@</say-as> free climb <say-as interpret-as="spell-out">.</say-as> com
(support@freeclimb.com) would be rendered as “support at free climb dot com”
<say-as interpret-as="spell-out">www.</say-as> vailsis <say-as interpret-as="spell-out">.</say-as> com
(www.vailsys.com) would be rendered as “double-u double-u double-u dot vail sis dot com”
Note that in the above example vailsys is spelled as “vailsis” to properly render the word as it should be spoken. Similar transformations might need to be done for website and email domains that are not normal words in the English language.
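The email pattern above can be assembled from its parts. In this sketch the caller supplies the spoken form of the domain (for example "free climb"), because choosing a pronounceable respelling cannot be automated reliably. Both function names are hypothetical.

```python
def spell_out(symbol):
    """Wrap a character or string in a say-as spell-out tag."""
    return f'<say-as interpret-as="spell-out">{symbol}</say-as>'

def email_to_ssml(local_part, spoken_domain, tld):
    """Combine spell-out tags for '@' and '.' with plain words for the rest."""
    return f"{local_part} {spell_out('@')} {spoken_domain} {spell_out('.')} {tld}"

print(email_to_ssml("support", "free climb", "com"))
```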
Say-As General Examples
Below are several examples of using fully-formed SSML to control the rendering of text for common use-cases.
Checking an Appointment
<speak>
Your appointment is currently scheduled for
<say-as interpret-as="time">12:15 PM</say-as> on <say-as interpret-as="date">6/15/2024.</say-as>
</speak>
Business Hours
<speak>
This business location is currently closed. Our hours of operation are
<say-as interpret-as="time">8:00 AM</say-as> to <say-as interpret-as="time">7:00 PM</say-as> on Monday through Thursday.
<say-as interpret-as="time">8:00 AM</say-as> to <say-as interpret-as="time">6:00 PM</say-as> on Friday.
<say-as interpret-as="time">10:00 AM</say-as> to <say-as interpret-as="time">8:00 PM</say-as> on Saturday.
We are closed on Sunday.
</speak>
Account Balance
<speak>
Your current balance is
<say-as interpret-as="currency">$125.60.</say-as>
</speak>
Prosodic Modifications
Currently, we support modifying several prosodic features of the audio using the <prosody> tag. The supported prosodic modifications are detailed in the subsections below.
Speaking Rate
You can use the <prosody rate=""> tag to increase or decrease the speed of the audio. The value of rate must either be a positive integer or decimal percentage (including the % symbol) or one of ["x-slow", "slow", "medium", "fast", "x-fast", "default"] where:
- default is 100% speed
- x-slow is 60% of the default
- slow is 80% of the default
- medium is 90% of the default
- fast is 150% of the default
- x-fast is 200% of the default
The minimum value for the rate percentage is 10%, while the maximum value is 10000%.
EXAMPLES:
<prosody rate="50%">Hello World.</prosody>
would render the text at half the normal speed
<prosody rate="120%">Hello World.</prosody>
would render the text at 1.2 times the normal speed
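Validating the rate before building the tag catches out-of-range values early. The sketch below encodes the keyword list and the 10% to 10000% bounds stated above; `prosody_rate` is a hypothetical helper.

```python
RATE_KEYWORDS = {"x-slow", "slow", "medium", "fast", "x-fast", "default"}

def prosody_rate(text, rate):
    """Wrap text in a prosody rate tag after validating the rate value."""
    if rate not in RATE_KEYWORDS:
        if not rate.endswith("%"):
            raise ValueError("rate must be a keyword or a percentage such as '120%'")
        percent = float(rate[:-1])  # raises ValueError for non-numeric input
        if not 10 <= percent <= 10000:
            raise ValueError("rate percentage must be between 10% and 10000%")
    return f'<prosody rate="{rate}">{text}</prosody>'

print(prosody_rate("Hello World.", "120%"))
```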
Volume
You can use the <prosody volume=""> tag to increase or decrease the volume of the audio. The value of volume must either be a positive or negative integer or decimal value (with corresponding + or - sign before the number) followed by dB or one of ["silent", "x-soft", "soft", "medium", "loud", "x-loud", "default"] where:
- silent is, as the name implies, silence
- default is normal volume
- x-soft is 50% of the default
- soft is 75% of the default
- medium is 90% of the default
- loud is 150% of the default
- x-loud is 200% of the default
The value in decibels is relative to the normal volume. Roughly every (+/-)6 decibels corresponds to a doubling/halving of the original volume.
There is no hard minimum or maximum value for the volume, though very small values (large negative values) will essentially make the audio silent, while very large values will stop adjusting the audio at a certain (unknown) point.
EXAMPLES:
<prosody volume="+6dB">Hello World.</prosody>
would render the text at 6 decibels louder than normal (roughly twice the original volume)
<prosody volume="-6dB">Hello World.</prosody>
would render the text at 6 decibels quieter than normal (roughly half the original volume)
<prosody volume="loud">Hello World.</prosody>
would render the text at 1.5 times the normal volume
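The "roughly 6 decibels per doubling" rule matches the standard relationship between decibels and amplitude, ratio = 10^(dB/20). The sketch below applies that general formula; whether the engine scales audio in exactly this way is an assumption, so treat the numbers as estimates.

```python
import math

def db_to_amplitude_ratio(db):
    """Convert a relative decibel value to an amplitude ratio (standard 20*log10 rule)."""
    return 10 ** (db / 20)

def amplitude_ratio_to_db(ratio):
    """Convert an amplitude ratio (for example 1.5 for 150%) to relative decibels."""
    return 20 * math.log10(ratio)

print(round(db_to_amplitude_ratio(6), 3))
print(round(amplitude_ratio_to_db(1.5), 2))
```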
Pitch Shifting
You can use the <prosody pitch=""> tag to shift the pitch of the audio up or down. The value of pitch must either be a positive or negative integer or decimal value (with corresponding + or - sign before the number) followed by st or one of ["x-low", "low", "medium", "high", "x-high", "default"] where:
- default is normal pitch
- x-low is 4 semitones lower than the default
- low is 2 semitones lower than the default
- medium is 1 semitone lower than the default
- high is 2 semitones higher than the default
- x-high is 4 semitones higher than the default
The value in semitones is relative to the default pitch. The conversion from semitone values to relative increases in pitch (double/half the original pitch, etc.) is not exactly known. We recommend testing variable semitone values to obtain the desired change in pitch.
The minimum value for the pitch shift is -79st, while the maximum value is +39st.
EXAMPLES:
<prosody pitch="+4st">Hello World.</prosody>
would render the text at 4 semitones higher than normal (squeakier voice)
<prosody pitch="-4st">Hello World.</prosody>
would render the text at 4 semitones lower than normal (deeper voice)
<prosody pitch="high">Hello World.</prosody>
would render the text at 2 semitones higher than normal (squeakier voice)
Pitch Contouring
You can use the <prosody contour=""> tag to shift the pitch of the audio up or down at a fixed rate over time. The value of contour must be a set of value pairs (T,P), where each pair is contained by a set of parentheses () and separated from other pairs by a space, and each value in the pair is separated by a comma. The first value in each pair (T) corresponds to a time, as a percentage of the overall time that the contour effect applies to. The second value in each pair (P) corresponds to the target pitch value at time T, and the value of P follows the same format as the pitch value in the Pitch Shifting section. A detailed example of a pitch contour, and its specific effects, is demonstrated below:
<prosody contour="(0%, +0st) (33%, +2st) (66%, -2st) (100%, +0st)">Hello World.</prosody>
There are 4 contouring segments applied to the audio: (0%, +0st), (33%, +2st), (66%, -2st), and (100%, +0st).
Assuming the total audio duration is 3 seconds:
The first segment (0%, +0st) states to set a target pitch of +0 semitones at the start of the contour (start at the default pitch)
The second segment (33%, +2st) states to set a target pitch of +2 semitones at 1 second into the audio (gradually move up 2 semitones from 0 to 1 second)
The third segment (66%, -2st) states to set a target pitch of -2 semitones at 2 seconds into the audio (gradually move down 4 semitones from 1 to 2 seconds)
The last segment (100%, +0st) states to set a target pitch of +0 semitones at 3 seconds into the audio (gradually move up 2 semitones from 2 to 3 seconds)
Overall, this contour will cause the audio pitch to shift up, down, and back to normal over the entire audio.
All semitone values are relative to the default pitch. The conversion from semitone values to relative increases in pitch (double/half the original pitch, etc.) is not exactly known, so play around with different semitone values to obtain the desired change in pitch.
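A contour string can be generated from a list of (percent, semitones) pairs, and the same pairs can be mapped to approximate times for a known audio duration. This is a sketch with hypothetical helper names, limited to whole-number percentages and semitone values.

```python
def build_contour(points):
    """Format (percent, semitones) pairs as a contour attribute value."""
    return " ".join(f"({percent}%, {semitones:+d}st)" for percent, semitones in points)

def contour_times(points, duration_seconds):
    """Map each pair's percentage to a time offset in seconds."""
    return [(duration_seconds * percent / 100, semitones) for percent, semitones in points]

points = [(0, 0), (33, 2), (66, -2), (100, 0)]
print(f'<prosody contour="{build_contour(points)}">Hello World.</prosody>')
print(contour_times(points, 3))
```

For a 3-second clip, 33% corresponds to 0.99 seconds, which is the "1 second" used in the walkthrough above.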
Inflection and Pitch
It is often desirable to add an upward inflection at the end of a question, or before pausing briefly to read off dynamically generated values (such as account balances or appointment times). While this inflection would ideally be added by simply shifting or contouring the pitch towards the end of a statement, our current implementation of pitch shifting and contouring falls short of creating the desired effect. You may try pitch shifting and/or contouring to create an upward inflection, but we caution against using them for this purpose. We are looking into adding proper support for inflections in the future.
Prosody General Examples
Below are several examples of using fully-formed SSML to apply audio effects for common use-cases.
Emphasis (+Volume)
<speak>
If you would like to cancel your plan,
<prosody volume="loud">press 1</prosody> or <prosody volume="loud">say "cancel".</prosody>
</speak>
Emphasis (Slowed Speech)
<speak>
Your confirmation code is
<prosody rate="slow"><say-as interpret-as="number" format="digits">1 2 3 4 5.</say-as></prosody>
</speak>
Adding Pauses
You can use the <break> tag to insert pauses of a specified length into the text. Pause values can be specified in either seconds or milliseconds, with a positive numerical value followed by either s or ms, respectively.
EXAMPLES:
<break time="500ms"/>
would add a pause of 500ms, or 0.5 seconds
<break time="3s"/>
would add a pause of 3 seconds
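If pause lengths are computed in code, a helper can pick the unit for you. In this sketch `break_tag` is a hypothetical name; it takes whole milliseconds and emits seconds only when the value divides evenly.

```python
def break_tag(milliseconds):
    """Build a break tag from a positive whole number of milliseconds."""
    if milliseconds <= 0:
        raise ValueError("pause length must be positive")
    if milliseconds % 1000 == 0:
        return f'<break time="{milliseconds // 1000}s"/>'
    return f'<break time="{milliseconds}ms"/>'

print(break_tag(500))
print(break_tag(3000))
```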
Pausing General Examples
Below are several examples of using fully-formed SSML to apply pauses for common use-cases.
a. Creating Pure Silence
In situations where you want to create audio that only consists of silence, you will need to wrap the <break/> tag in a <s> (sentence) tag.
<speak>
<s>
<break time="5s"/>
</s>
</speak>
b. Emphasis (Pause Between Words)
<speak>
Your confirmation code is
<say-as interpret-as="number">1</say-as>
<break time="500ms"/>
<say-as interpret-as="number">2</say-as>
<break time="500ms"/>
<say-as interpret-as="number">3</say-as>
<break time="500ms"/>
<say-as interpret-as="number">4</say-as>
<break time="500ms"/>
<say-as interpret-as="number">5</say-as>
</speak>
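The pattern above is repetitive enough that it is worth generating for codes of any length. The following is a sketch with a hypothetical `confirmation_code_ssml` helper; it joins the pieces with spaces rather than line breaks so the output contains no newline characters.

```python
def confirmation_code_ssml(code, pause_ms=500):
    """Build SSML that reads each digit of a code with a pause between digits."""
    if not (code.isascii() and code.isdigit()):
        raise ValueError("code must contain only the characters 0-9")
    pause = f' <break time="{pause_ms}ms"/> '
    digits = pause.join(f'<say-as interpret-as="number">{d}</say-as>' for d in code)
    return f"<speak>Your confirmation code is {digits}</speak>"

print(confirmation_code_ssml("12345"))
```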