What is FreeClimb TTS?

FreeClimb TTS was built on the foundation of the open-source Coqui TTS project. With FreeClimb TTS, users can create realistic, generative AI voices, leading to more personalized and engaging interactions.

How to execute FreeClimb TTS

To use this feature in your FreeClimb application, use the PerCL Say command. First select your preferred voice, then set the command's attributes (text, voice, and language). When a caller interacts with your application, the chosen voice, such as "Eva" in the example below, will deliver the specified text.


   {
      "Say" : {
         "text" : "Welcome to FreeClimb.",
         "voice" : "freeclimb.neural.eva",
         "language" : "en-us"
      }
   }
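
As a rough sketch of how an application might produce this command, the Python snippet below builds the same Say object and serializes it to JSON. The helper name `build_say_command` and its default arguments are illustrative assumptions, not part of any FreeClimb SDK; only the `Say`, `text`, `voice`, and `language` keys come from the example above.

```python
import json


def build_say_command(text, voice="freeclimb.neural.eva", language="en-us"):
    """Return a Say command as a JSON-serializable dict (hypothetical helper)."""
    return {"Say": {"text": text, "voice": voice, "language": language}}


command = build_say_command("Welcome to FreeClimb.")
# json.dumps inserts the commas between attributes, avoiding hand-written JSON errors.
print(json.dumps(command, indent=3))
```

Generating the JSON from a dictionary rather than by string concatenation keeps the output well formed even when the text contains quotes.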

General TTS Best Practices

The following guidelines describe how to obtain the best-quality text-to-speech renderings from FreeClimb TTS.

Foreign Loan Words, Archaic Words, and Proper Nouns

Because the current model is trained on domain-general English, it may mispronounce foreign loan words, archaic words, and proper nouns such as the names of people and places. The less common a word is, the less frequently it is likely to appear in the training data, and the more likely it is to be mispronounced. For example:

Eau Claire

Doppelgänger

chuse

There is currently no method to ensure such words are rendered correctly, but it may help to experiment with using similar-sounding, higher-frequency words in their place.

Abbreviations and Acronyms

The current model has limited knowledge of various abbreviations and acronyms.

a. Abbreviations

Abbreviations should be expanded to their full word(s).

abbreviation → suggested input to TTS

Dr. -> Doctor or Drive (depending on context)

St. -> Street or Saint (depending on context)

Ct. -> Court
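
One way to apply these expansions before sending text to the Say command is sketched below. The function name `expand_abbreviations` is hypothetical, and because abbreviations such as "Dr." and "St." depend on context, the sketch assumes the caller supplies the correct expansion for each utterance.

```python
import re


def expand_abbreviations(text, expansions):
    """Replace each abbreviation key in `expansions` with its full form."""
    if not expansions:
        return text
    # Longest keys first so overlapping abbreviations resolve predictably.
    keys = sorted(expansions, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")")
    result = pattern.sub(lambda m: expansions[m.group(0)], text)
    # An abbreviation at the very end takes the final period with it, so restore one.
    if result and result[-1] not in ".?!":
        result += "."
    return result


# The caller decides that "St." means "Street" in this particular utterance.
print(expand_abbreviations("Turn left on Main St. near Elm Ct.",
                           {"St.": "Street", "Ct.": "Court"}))
```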

b. Acronyms

For acronyms, the specific instructions depend on how exactly the acronym should be spoken.

c. Acronyms Spoken as Letters

For acronyms that should be spoken as a series of letters, capitalize the acronym (FBI). If the acronym is still being spoken as if it were a word, put a space between each letter.

acronym → suggested input to TTS

fbi -> F B I

usa -> USA

am -> AM

pm -> PM

d. Acronyms Spoken as Words

For acronyms that should be spoken as a whole word, use the lowercase version of how the word should be said. Using the fully-capitalized version of the acronym will always cause the acronym to be spelled out. For example, “PIN” would be spoken as “P(pee) I(eye) N(en)”.

acronym → suggested input to TTS

PIN -> pin

SIM -> sim
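
The two acronym conventions above can be sketched as a pair of small helpers. The function names are illustrative assumptions; the transformations themselves (capitalize and space out the letters, or lowercase the whole word) follow the tables above.

```python
def acronym_as_letters(acronym):
    """Format an acronym so it is read letter by letter, e.g. 'fbi' -> 'F B I'."""
    letters = [ch for ch in acronym.upper() if ch.isalnum()]
    return " ".join(letters)


def acronym_as_word(acronym):
    """Format an acronym so it is read as a single word, e.g. 'PIN' -> 'pin'."""
    return acronym.lower()


print(acronym_as_letters("fbi"))
print(acronym_as_word("PIN"))
```

The letter-spaced form is the more defensive choice here, since the text above notes that a capitalized acronym may still be read as a word.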

Punctuation

Proper punctuation should be used for the best-quality rendering. Sentences should always end with the appropriate punctuation depending on whether it is a statement, question, or exclamation. Beyond sentence-level punctuation, every input utterance (even if only a single word) should end in a period.

A as in Alfa -> A as in Alfa.

Yes -> Yes.

Help -> Help.

The model may incorrectly render text that contains punctuation other than commas, periods, question marks, or exclamation points. If that happens, remove all other punctuation and try rendering again. If the problem persists, also remove the commas, leaving only periods, question marks, and exclamation points, and try once more.
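
The punctuation rules in this section could be automated along the following lines. This is a minimal sketch with hypothetical function names; it only recognizes ASCII punctuation (Python's `string.punctuation`), so curly quotes and similar characters would need separate handling.

```python
import string


def ensure_terminal_punctuation(text):
    """Append a period if the utterance does not already end a sentence."""
    text = text.strip()
    if text and text[-1] not in ".?!":
        text += "."
    return text


def strip_unsupported_punctuation(text, keep_commas=True):
    """Drop ASCII punctuation other than . ? ! (and , unless keep_commas=False)."""
    allowed = set(".?!") | ({","} if keep_commas else set())
    # Note: this also drops apostrophes and hyphens, which are punctuation too.
    kept = [ch for ch in text if ch not in string.punctuation or ch in allowed]
    return " ".join("".join(kept).split())


print(ensure_terminal_punctuation("Yes"))
print(strip_unsupported_punctuation('Press "1" (one); then wait, please.'))
```

Calling `strip_unsupported_punctuation` with `keep_commas=False` corresponds to the second, stricter retry described above.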

Newline and Carriage Return

The system cannot handle newline (\n) or carriage return (\r) characters, such as those in the examples below, so do not include them in the input text.

This is an example of a\n newline character.

This is an example of a carriage return\r character.
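
A simple way to guard against these characters is to replace each line break, together with any whitespace around it, with a single space before the text is submitted. The helper name below is an assumption made for illustration.

```python
import re


def remove_line_breaks(text):
    """Replace newline and carriage return runs (plus adjacent spaces) with one space."""
    return re.sub(r"\s*[\r\n]+\s*", " ", text).strip()


print(remove_line_breaks("This is an example of a\n newline character."))
```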

Input Text Length

Individual sentences should not exceed 200 characters (letters and spaces). Multisentential utterances will automatically be split by the system at sentence boundaries, generated separately, and then concatenated to produce a single .wav file. Latency grows with input length, so a good rule of thumb is to limit input to 35 words per sample generated.
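
The limits above can be checked before a request is made. The sketch below uses a naive sentence splitter of its own (splitting after ".", "?", or "!"), which is an assumption and not necessarily how the system detects sentence boundaries; the function and constant names are likewise hypothetical.

```python
import re

MAX_SENTENCE_CHARS = 200  # per-sentence limit stated above
MAX_WORDS = 35            # rule-of-thumb limit per generated sample


def split_sentences(text):
    """Naively split text after sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def check_length(text):
    """Return a list of human-readable warnings; an empty list means no issues."""
    problems = []
    for sentence in split_sentences(text):
        if len(sentence) > MAX_SENTENCE_CHARS:
            problems.append(
                f"sentence exceeds {MAX_SENTENCE_CHARS} characters: {sentence[:30]}..."
            )
    words = len(text.split())
    if words > MAX_WORDS:
        problems.append(f"input has {words} words (suggested limit {MAX_WORDS})")
    return problems


print(check_length("Welcome to FreeClimb."))
```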

Numbers

Numbers should be spelled out rather than typed in their numeric form, because numerals can sometimes be rendered incorrectly.

number → suggested input to TTS

1 -> one

1,000 -> one thousand

64 -> sixty-four
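
For applications that insert numeric values dynamically, the spelling-out step can be automated. The following is a small self-contained sketch for whole numbers from 0 to 999,999; the function name is hypothetical, and the wording style ("one hundred and twenty-five") is chosen to match the renderings shown elsewhere in this guide.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def _below_hundred(n):
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (f"-{ONES[ones]}" if ones else "")


def _below_thousand(n):
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return _below_hundred(rest)
    words = f"{ONES[hundreds]} hundred"
    return f"{words} and {_below_hundred(rest)}" if rest else words


def number_to_words(n):
    """Spell out a whole number between 0 and 999,999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0 to 999,999 is handled by this sketch")
    thousands, rest = divmod(n, 1000)
    if thousands == 0:
        return _below_thousand(rest)
    words = f"{_below_thousand(thousands)} thousand"
    return f"{words} {_below_thousand(rest)}" if rest else words


# Strip thousands separators before converting a typed number such as "1,000".
print(number_to_words(int("1,000".replace(",", ""))))
```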

SSML TTS Best Practices

Coqui TTS supports selected elements of the Speech Synthesis Markup Language (SSML). Interested readers can find out more in the W3C specification, Speech Synthesis Markup Language (SSML) Version 1.1. SSML specifies exactly how a piece of text is to be spoken by the TTS system. Each section below covers one SSML feature that is currently supported and explains how to use it. All SSML content must be contained within a single set of <speak> tags.
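
Because a single mistyped quote or unclosed tag makes SSML invalid, it can help to check that the markup is well formed before sending it. The sketch below wraps a fragment in one `<speak>` element, as used in the full examples in this guide, and parses it with Python's standard XML parser. The helper name is an assumption, and passing this check only shows the XML is well formed, not that every tag is supported.

```python
import xml.etree.ElementTree as ET


def wrap_in_speak(fragment):
    """Wrap an SSML fragment in a single <speak> element and verify it parses."""
    document = f"<speak>{fragment}</speak>"
    root = ET.fromstring(document)  # raises ET.ParseError if malformed
    if root.find(".//speak") is not None:
        raise ValueError("fragment already contains a <speak> element")
    return document


print(wrap_in_speak('Hello <break time="500ms"/> world.'))
```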

Say-As

The say-as SSML tag allows users to control how a piece of data, such as a date, time, or plain number should be spoken. The types of data it supports are detailed in the subsections below.

Dates

You can use the <say-as interpret-as="date"> tag to render a date in the form MM/DD/YYYY where:

MM is a value between 01 and 12 (can be single-digit)
DD is a value between 01 and 31 (can be single-digit)
YYYY is a value greater than 0001 and less than or equal to 9999.

While the method we use to parse the date will attempt to infer the correct date from different formats, you should use the MM/DD/YYYY format for the most accurate rendering.

EXAMPLES:

<say-as interpret-as="date">03/28/2024</say-as> would be rendered as “March twenty-eighth, twenty twenty-four”

<say-as interpret-as="date">5/6/2027</say-as> would be rendered as “May sixth, twenty twenty-seven”
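
When dates come from application data, formatting them programmatically helps ensure the recommended MM/DD/YYYY form is always used. The helper below is a hypothetical sketch built on Python's `datetime.date`.

```python
from datetime import date


def date_say_as(d):
    """Return a say-as date tag in the recommended MM/DD/YYYY form."""
    value = f"{d.month:02d}/{d.day:02d}/{d.year:04d}"
    return f'<say-as interpret-as="date">{value}</say-as>'


print(date_say_as(date(2024, 3, 28)))
```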

Times

You can use the <say-as interpret-as="time"> tag to render a time in the form HH:MM [AM/PM] where:

HH is between 01 and 12 (can be single-digit)
MM is between 00 and 59
[AM/PM] is optional and will not be present in the rendered output if it is not present in the SSML
1. If included, it must be capitalized and separated from the time with a single space

EXAMPLES:

<say-as interpret-as="time">3:15 PM</say-as> would be rendered as “three fifteen (pee) (em)”

<say-as interpret-as="time">12:00 AM</say-as> would be rendered as “twelve (ayy) (em)”

<say-as interpret-as="time">5:47</say-as> would be rendered as “five forty-seven”
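
Applications usually hold times in 24-hour form, so a conversion to the 12-hour HH:MM [AM/PM] format described above is needed. The following sketch, with a hypothetical helper name, performs that conversion using `datetime.time`.

```python
from datetime import time


def time_say_as(t, include_meridiem=True):
    """Return a say-as time tag in 12-hour HH:MM [AM/PM] form."""
    hour = t.hour % 12 or 12  # 0 -> 12 (AM), 13 -> 1 (PM)
    value = f"{hour}:{t.minute:02d}"
    if include_meridiem:
        # Capitalized and separated from the time by a single space.
        value += " AM" if t.hour < 12 else " PM"
    return f'<say-as interpret-as="time">{value}</say-as>'


print(time_say_as(time(15, 15)))
```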

Currency

You can use the <say-as interpret-as="currency"> tag to render a monetary amount in the form $D[.CC] where:

$ - the dollar sign should precede the amount to ensure that it is rendered as US dollars
1. You can specify other currencies by prepending the specific currency symbol (such as the Euro - €) to the currency value.
  1. Only US dollars are officially supported at this time, and other currencies may have issues with rendering
D is a dollar amount greater than or equal to 0
1. Exceptionally large numbers (quadrillion+) may not be rendered correctly
2. If set to 0, the dollar amount will be read (“zero dollars”) along with the cent amount
[.CC] is an optional value in cents between 00 and 99
1. If not present, only the dollar amount will be rendered

EXAMPLES:

<say-as interpret-as="currency">$5.55</say-as> would be rendered as “five dollars and fifty-five cents”

<say-as interpret-as="currency">$125</say-as> would be rendered as “one hundred and twenty-five dollars”

<say-as interpret-as="currency">$0.25</say-as> would be rendered as “zero dollars and twenty-five cents”
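
A monetary amount held as a number can be converted to the $D[.CC] form as sketched below. The helper name is an assumption; it uses `Decimal` to avoid floating-point rounding surprises and omits the cents when the amount is a whole number of dollars.

```python
from decimal import Decimal


def currency_say_as(amount):
    """Return a say-as currency tag for a non-negative US dollar amount."""
    value = Decimal(str(amount)).quantize(Decimal("0.01"))
    if value < 0:
        raise ValueError("amount must not be negative")
    # Whole-dollar amounts are written without the optional .CC part.
    text = f"{int(value)}" if value == value.to_integral_value() else f"{value}"
    return f'<say-as interpret-as="currency">${text}</say-as>'


print(currency_say_as("5.55"))
```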

Numbers

Numbers can be rendered in a variety of ways, depending on the specific use-case. The subsections below detail the currently supported methods of rendering numbers.

Negative numbers (numbers preceded with a minus sign/hyphen [-]) will be rendered as “minus X” and not “negative X”. If you want to render a number as “negative X” then you will need to explicitly include “negative X” in the text, where X is the positive version of the number.

a. Cardinal

You can use the <say-as interpret-as="number" format="cardinal"> tag to render a positive integer or decimal number in its cardinal format. Note that for numbers between 1000 and 9999, you must use commas in the numbers to prevent them from being interpreted as dates (this is a known bug/feature with the tool we use to parse the SSML).

EXAMPLES:

<say-as interpret-as="number" format="cardinal">5</say-as> would be rendered as “five”

<say-as interpret-as="number" format="cardinal">1,234</say-as> would be rendered as “one thousand two hundred and thirty-four”

<say-as interpret-as="number" format="cardinal">27.3</say-as> would be rendered as “twenty-seven point three”

<say-as interpret-as="number" format="cardinal">-7</say-as> would be rendered as “minus seven”

b. Ordinal

You can use the <say-as interpret-as="number" format="ordinal"> tag to render a positive integer number in its ordinal format. Note that for numbers between 1000 and 9999, you must use commas in the numbers to prevent them from being interpreted as dates (this is a known bug/feature with the tool we use to parse the SSML).

EXAMPLES:

<say-as interpret-as="number" format="ordinal">5</say-as> would be rendered as “fifth”

<say-as interpret-as="number" format="ordinal">1,234</say-as> would be rendered as “one thousand two hundred and thirty-fourth”

<say-as interpret-as="number" format="ordinal">33</say-as> would be rendered as “thirty-third”
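
The comma requirement for four-digit cardinal and ordinal numbers is easy to overlook, so it may be worth encoding in a helper. The sketch below (hypothetical function name) adds the thousands separator only for values between 1000 and 9999, which is the range the note above calls out.

```python
def number_say_as(value, fmt="cardinal"):
    """Return a say-as number tag, adding a comma to four-digit values."""
    if fmt not in ("cardinal", "ordinal"):
        raise ValueError("fmt must be 'cardinal' or 'ordinal'")
    # Without the comma, values from 1000 to 9999 may be interpreted as dates.
    text = f"{value:,}" if 1000 <= abs(value) <= 9999 else str(value)
    return f'<say-as interpret-as="number" format="{fmt}">{text}</say-as>'


print(number_say_as(1234))
print(number_say_as(33, "ordinal"))
```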

c. Digits

You can use the <say-as interpret-as="number" format="digits"> tag to render a series of digits. Each digit must be separated by a space for the digits to be correctly rendered.

If you need to render a single number without any spaces as a series of digits (ex: 98765), see the Spell-Out section.

EXAMPLES:

<say-as interpret-as="number" format="digits">5</say-as> would be rendered as “five”

<say-as interpret-as="number" format="digits">1 2 3 4</say-as> would be rendered as “one two three four”

<say-as interpret-as="number" format="digits">3 3</say-as> would be rendered as “three three”
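
Since each digit must be separated by a space, a value such as a confirmation code needs to be spaced out before it is placed in the tag. A minimal sketch, using a hypothetical helper name:

```python
def digits_say_as(digits):
    """Return a say-as digits tag with a space between every digit."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a string of ASCII digits")
    return f'<say-as interpret-as="number" format="digits">{" ".join(digits)}</say-as>'


print(digits_say_as("1234"))
```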

d. Year

You can use the <say-as interpret-as="number" format="year"> tag to render a positive integer number as a spoken year.

EXAMPLES:

<say-as interpret-as="number" format="year">2024</say-as> would be rendered as “twenty twenty-four”

<say-as interpret-as="number" format="year">2002</say-as> would be rendered as “two thousand and two”

<say-as interpret-as="number" format="year">1996</say-as> would be rendered as “nineteen ninety-six”

<say-as interpret-as="number" format="year">512</say-as> would be rendered as “five twelve”

Spell-Out

You can use the <say-as interpret-as="spell-out"> tag to render a number as a series of digits, and to render special characters. All supported characters are listed in the examples below.

EXAMPLES:

<say-as interpret-as="spell-out">-</say-as> ( - ) would be rendered as “dash”

<say-as interpret-as="spell-out">/</say-as> ( / ) would be rendered as “slash”

<say-as interpret-as="spell-out">+</say-as> ( + ) would be rendered as “plus”

<say-as interpret-as="spell-out">.</say-as> ( . ) would be rendered as “dot”

<say-as interpret-as="spell-out">*</say-as> ( * ) would be rendered as “star”

<say-as interpret-as="spell-out">@</say-as> ( @ ) would be rendered as “at”

<say-as interpret-as="spell-out">12345</say-as> would be rendered as “one two three four five”

Websites and Email Addresses

While rendering a website or email address is not supported as a single say-as tag, you can combine multiple say-as tags to render either type of data.

EXAMPLES:

support <say-as interpret-as="spell-out">@</say-as> free climb <say-as interpret-as="spell-out">.</say-as> com (support@freeclimb.com) would be rendered as “support at free climb dot com”

<say-as interpret-as="spell-out">www.</say-as> vailsis <say-as interpret-as="spell-out">.</say-as> com (www.vailsys.com) would be rendered as “double-u double-u double-u dot vail sis dot com”

Note that in the above example vailsys is spelled as “vailsis” to properly render the word as it should be spoken. Similar transformations might need to be done for website and email domains that are not normal words in the English language.
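
The combination of plain text and spell-out tags can be generated from an address as sketched below. The function name and the `respellings` parameter are assumptions made for illustration; the set of symbols is taken from the Spell-Out examples, and any respelling of non-English domain parts still has to be supplied by the caller.

```python
import re
from xml.sax.saxutils import escape

SPELL_OUT_SYMBOLS = set("-/+.*@")  # characters listed in the Spell-Out section


def address_to_ssml(address, respellings=None):
    """Split an address on supported symbols and wrap each symbol in spell-out."""
    respellings = respellings or {}
    parts = []
    for token in re.split(r"([-/+.*@])", address):
        if not token:
            continue
        if token in SPELL_OUT_SYMBOLS:
            parts.append(f'<say-as interpret-as="spell-out">{token}</say-as>')
        else:
            # Use the caller's phonetic respelling when one is provided.
            parts.append(escape(respellings.get(token, token)))
    return " ".join(parts)


print(address_to_ssml("support@freeclimb.com", {"freeclimb": "free climb"}))
```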

Say-As General Examples

Below are several examples of using fully-formed SSML to control the rendering of text for common use-cases.

Checking an Appointment

<speak>
  Your appointment is currently scheduled for
  <say-as interpret-as="time">12:15 PM</say-as> on <say-as interpret-as="date">6/15/2024.</say-as>
</speak>

Business Hours

<speak>
  This business location is currently closed. Our hours of operation are
  <say-as interpret-as="time">8:00 AM</say-as> to <say-as interpret-as="time">7:00 PM</say-as> on Monday through Thursday.
  <say-as interpret-as="time">8:00 AM</say-as> to <say-as interpret-as="time">6:00 PM</say-as> on Friday.
  <say-as interpret-as="time">10:00 AM</say-as> to <say-as interpret-as="time">8:00 PM</say-as> on Saturday.
  We are closed on Sunday.
</speak>

Account Balance

<speak>
  Your current balance is
  <say-as interpret-as="currency">$125.60.</say-as>
</speak>

Prosodic Modifications

Currently, we support modifying several prosodic features of the audio using the <prosody> tag. The supported prosodic modifications are detailed in the subsections below.

Speaking Rate

You can use the <prosody rate=""> tag to increase or decrease the speed of the audio. The value of rate must either be a positive integer or decimal percentage (including the % symbol) or one of ["x-slow", "slow", "medium", "fast", "x-fast", "default"] where:

default is 100% speed
x-slow is 60% of the default
slow is 80% of the default
medium is 90% of the default
fast is 150% of the default
x-fast is 200% of the default

The minimum value for the rate percentage is 10%, while the maximum value is 10000%.

EXAMPLES:

<prosody rate="50%">Hello World.</prosody> would render the text at half the normal speed

<prosody rate="120%">Hello World.</prosody> would render the text at 1.2 times the normal speed
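
Validating the rate before building the tag catches out-of-range values early. The sketch below accepts either one of the keywords listed above or a numeric percentage within the stated 10% to 10000% range; the helper name is hypothetical.

```python
from xml.sax.saxutils import escape

RATE_KEYWORDS = {"x-slow", "slow", "medium", "fast", "x-fast", "default"}


def prosody_rate(text, rate):
    """Wrap text in a prosody tag with a keyword or percentage speaking rate."""
    if isinstance(rate, str):
        if rate not in RATE_KEYWORDS:
            raise ValueError(f"unknown rate keyword: {rate}")
        value = rate
    else:
        if not 10 <= rate <= 10000:
            raise ValueError("rate percentage must be between 10 and 10000")
        value = f"{rate:g}%"  # 50 -> "50%", 12.5 -> "12.5%"
    return f'<prosody rate="{value}">{escape(text)}</prosody>'


print(prosody_rate("Hello World.", 120))
```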

Volume

You can use the <prosody volume=""> tag to increase or decrease the volume of the audio. The value of volume must either be a positive or negative integer or decimal value (with corresponding + or - sign before the number) followed by dB or one of ["silent", "x-soft", "soft", "medium", "loud", "x-loud", "default"] where:

silent is, as the name implies, silence
default is normal volume
x-soft is 50% of the default
soft is 75% of the default
medium is 90% of the default
loud is 150% of the default
x-loud is 200% of the default

The value in decibels is relative to the normal volume. Roughly every (+/-)6 decibels corresponds to a doubling/halving of the original volume.

There is no hard minimum or maximum value for the volume, though very small values (large negative values) will essentially make the audio silent, while very large values will stop adjusting the audio at a certain (unknown) point.

EXAMPLES:

<prosody volume="+6dB">Hello World.</prosody> would render the text at 6 decibels louder than normal (roughly twice the original volume)

<prosody volume="-6dB">Hello World.</prosody> would render the text at 6 decibels quieter than normal (roughly half the original volume)

<prosody volume="loud">Hello World.</prosody> would render the text at 1.5 times the normal volume
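
The "roughly 6 dB per doubling" rule follows from the standard decibel formula for amplitude, ratio = 10^(dB/20). The sketch below applies that formula; it assumes the volume percentages above refer to amplitude, which this guide does not state explicitly, and the function names are illustrative.

```python
import math


def db_to_amplitude_ratio(db):
    """Convert a relative decibel value to an amplitude ratio (1.0 = unchanged)."""
    return 10 ** (db / 20)


def amplitude_ratio_to_db(ratio):
    """Convert an amplitude ratio back to relative decibels."""
    return 20 * math.log10(ratio)


print(f"+6 dB is about {db_to_amplitude_ratio(6):.3f} times the original amplitude")
print(f"1.5 times the amplitude is about {amplitude_ratio_to_db(1.5):+.2f} dB")
```

Under that assumption, the conversion can be used to pick a decibel value that approximates one of the keyword levels, or a level in between them.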

Pitch Shifting

You can use the <prosody pitch=""> tag to shift the pitch of the audio up or down. The value of pitch must either be a positive or negative integer or decimal value (with corresponding + or - sign before the number) followed by st or one of ["x-low", "low", "medium", "high", "x-high", "default"] where:

default is normal pitch
x-low is 4 semitones lower than the default
low is 2 semitones lower than the default
medium is 1 semitone lower than the default
high is 2 semitones higher than the default
x-high is 4 semitones higher than the default

The value in semitones is relative to the default pitch. The conversion from semitone values to relative increases in pitch (double/half the original pitch, etc.) is not exactly known. We recommend testing variable semitone values to obtain the desired change in pitch.

The minimum value for the pitch shift is -79st, while the maximum value is +39st.

EXAMPLES:

<prosody pitch="+4st">Hello World.</prosody> would render the text at 4 semitones higher than normal (squeakier voice)

<prosody pitch="-4st">Hello World.</prosody> would render the text at 4 semitones lower than normal (deeper voice)

<prosody pitch="high">Hello World.</prosody> would render the text at 2 semitones higher than normal (squeakier voice)

Pitch Contouring

You can use the <prosody contour=""> tag to shift the pitch of the audio up or down at a fixed rate over time. The value of contour must be a set of value pairs (T,P), where each pair is contained by a set of parentheses () and separated from other pairs by a space, and each value in the pair is separated by a comma. The first value in each pair (T) corresponds to a time, as a percentage of the overall time that the contour effect applies to. The second value in each pair (P) corresponds to the target pitch value at time T, and the value of P follows the same format as the pitch value in the Pitch Shifting section. A detailed example of a pitch contour, and its specific effects, is demonstrated below:

<prosody contour="(0%, +0st) (33%, +2st) (66%, -2st) (100%, +0st)">Hello World.</prosody>

There are 4 contouring segments applied to the audio: (0%, +0st), (33%, +2st), (66%, -2st), and (100%, +0st).

Assuming the total audio duration is 3 seconds:

The first segment (0%, +0st) states to set a target pitch of +0 semitones at the start of the contour (start at the default pitch)

The second segment (33%, +2st) states to set a target pitch of +2 semitones at 1 second into the audio (gradually move up 2 semitones from 0 to 1 second)

The third segment (66%, -2st) states to set a target pitch of -2 semitones at 2 seconds into the audio (gradually move down 4 semitones from 1 to 2 seconds)

The last segment (100%, +0st) states to set a target pitch of +0 semitones at 3 seconds into the audio (gradually move up 2 semitones from 2 to 3 seconds)

Overall, this contour will cause the audio pitch to shift up, down, and back to normal over the entire audio.
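
To make the walkthrough above concrete, the sketch below builds the contour attribute from a list of (percent, semitone) pairs and estimates the target pitch at any point in between. The linear interpolation is an assumption based on the "fixed rate" wording, not a confirmed description of the implementation, and both function names are hypothetical. Time values are assumed to be strictly increasing.

```python
def contour_attribute(points):
    """Format (percent, semitones) pairs as a contour attribute value."""
    return " ".join(f"({t:g}%, {p:+g}st)" for t, p in points)


def pitch_at(points, percent):
    """Linearly interpolate the target pitch in semitones at a given percent."""
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        if t0 <= percent <= t1:
            return p0 + (p1 - p0) * (percent - t0) / (t1 - t0)
    raise ValueError("percent lies outside the contour")


points = [(0, 0), (33, 2), (66, -2), (100, 0)]
print(contour_attribute(points))
# Halfway between the second and third points the pitch crosses the default.
print(pitch_at(points, 49.5))
```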

All semitone values are relative to the default pitch. The conversion from semitone values to relative increases in pitch (double/half the original pitch, etc.) is not exactly known, so play around with different semitone values to obtain the desired change in pitch.

Inflection and Pitch

It is often desirable to add an upward inflection at the end of a question, or before a short pause that precedes dynamically generated values (such as account balances or appointment times). Ideally, this inflection would be added by shifting or contouring the pitch towards the end of a statement, but our current implementation of pitch shifting and contouring falls short of creating the desired effect. You may try pitch shifting and/or contouring to create an upward inflection, but we caution against using them for this purpose. We are looking into adding proper support for inflections in the future.

Prosody General Examples

Below are several examples of using fully-formed SSML to apply audio effects for common use-cases.

Emphasis (+Volume)

<speak>
  If you would like to cancel your plan, 
  <prosody volume="loud">press 1</prosody> or <prosody volume="loud">say "cancel".</prosody>
</speak>

Emphasis (Slowed Speech)

<speak>
  Your confirmation code is
  <prosody rate="slow"><say-as interpret-as="number" format="digits">1 2 3 4 5.</say-as></prosody>
</speak>

Adding Pauses

You can use the <break> tag to insert pauses of a specified length into the text. Pause values can be specified in either seconds or milliseconds, with a positive numerical value followed by either s or ms, respectively.

EXAMPLES:

<break time="500ms"/> would add a pause of 500ms, or 0.5 seconds

<break time="3s"/> would add a pause of 3 seconds

Pausing General Examples

Below are several examples of using fully-formed SSML to apply pauses for common use-cases.

a. Creating Pure Silence

In situations where you want to create audio that only consists of silence, you will need to wrap the <break/> tag in a <s> (sentence) tag.

<speak>
  <s>
    <break time="5s"/>
  </s>
</speak>

b. Emphasis (Pause Between Words)

<speak>
  Your confirmation code is
  <say-as interpret-as="number">1</say-as>
  <break time="500ms"/>
  <say-as interpret-as="number">2</say-as>
  <break time="500ms"/>
  <say-as interpret-as="number">3</say-as>
  <break time="500ms"/>
  <say-as interpret-as="number">4</say-as>
  <break time="500ms"/>
  <say-as interpret-as="number">5</say-as>
</speak>
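
Writing out one tag per digit by hand becomes tedious for longer codes. The following sketch generates the same structure as the example above for any digit string and then parses the result to confirm it is well-formed XML. The function name and the `pause_ms` parameter are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def confirmation_code_ssml(code, pause_ms=500):
    """Build SSML that reads a digit string with a pause between digits."""
    if not (code.isascii() and code.isdigit()):
        raise ValueError("code must contain only ASCII digits")
    pause = f'<break time="{pause_ms}ms"/>'
    digits = [f'<say-as interpret-as="number">{d}</say-as>' for d in code]
    # Joining with the break tag places a pause between digits, not after the last.
    return "<speak>Your confirmation code is " + pause.join(digits) + "</speak>"


ssml = confirmation_code_ssml("12345")
ET.fromstring(ssml)  # raises ET.ParseError if the markup is malformed
print(ssml)
```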