SSML Elements

Using Speech Synthesis Markup Language (SSML), you can make responses sound more like natural speech.

To get started, set the type attribute of the <say> element to ssml (<say type='ssml'>). See https://docs.ytel.com/docs/say for more information.

<break>

The <break> element controls pausing or other prosodic boundaries between words. Using <break> between any pair of tokens is optional. If this element is not present between words, the break is determined automatically from the linguistic context.

Element Attributes

| Attribute | Description |
| --- | --- |
| time (optional) | Sets the length of the break in seconds or milliseconds (e.g. "3s" or "250ms"). |
| strength (optional) | Sets the strength of the output's prosodic break in relative terms. Valid values are "none", "x-weak", "weak", "medium", "strong", and "x-strong". The value "none" indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break that the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between tokens. The stronger boundaries are typically accompanied by pauses. |

Example

The following example shows how to use the <break> element to pause between steps:

<response>  
 <say type='ssml'>
  Step 1, take a deep breath. <break time="200ms"/>
  Step 2, exhale.
  Step 3, take a deep breath again. <break strength="weak"/>
  Step 4, exhale.
  </say>
</response>
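
As a further sketch, the two <break> attributes described in the table above can also be used with a duration in seconds and with strength="none". The sentences are made up for illustration, and how noticeably strength="none" suppresses a pause is an assumption that depends on the synthesizer:

```xml
<response>
 <say type='ssml'>
  <!-- strength="none" asks the processor not to insert a break here -->
  Your code is 4 2 <break strength="none"/> 7 9.
  <!-- time accepts seconds as well as milliseconds -->
  Please hold. <break time="3s"/>
  Thank you for waiting.
 </say>
</response>
```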

<say-as>

The <say-as> element lets you indicate the type of text construct contained within the element. It also helps specify the level of detail for rendering the contained text.

Element Attributes

| Attribute | Description |
| --- | --- |
| interpret-as | Determines how the value is spoken. |
| format (optional) | May be used depending on the particular interpret-as value. See the date value below. |
| detail (optional) | May be used depending on the particular interpret-as value. See the date value below. |

The interpret-as attribute supports the following values:

| Value | Example |
| --- | --- |
| cardinal | "12345" would be spoken as "Twelve thousand three hundred forty five" (for US English) or "Twelve thousand three hundred and forty five" (for UK English). |
| ordinal | "1" would be spoken as "First". |
| characters | "can" would be spoken as "C A N". |
| fraction | "5+1/2" would be spoken as "five and a half". |
| expletive or beep | Text comes out as a beep, as though it has been censored. |
| unit | Converts units to singular or plural depending on the number. "10 foot" would be spoken as "10 feet". |
| verbatim or spell-out | Text is spoken letter by letter. |
| time | "2:30pm" would be spoken as "Two thirty P.M." |
| date | See below for more details. |

Date
The format attribute is a sequence of date field character codes. Supported field character codes in format are {y, m, d} for year, month, and day (of the month) respectively. If the field code appears once for year, month, or day, then the number of digits expected is 4, 2, and 2 respectively. If the field code is repeated, then the number of expected digits is the number of times the code is repeated. Fields in the date text may be separated by punctuation and/or spaces.

The detail attribute controls the spoken form of the date. For detail='1', only the day field and one of the month or year fields are required, although both may be supplied. This is the default when fewer than all three fields are given. The spoken form is "The {ordinal day} of {month}, {year}".

Example

The following example shows how to use the <say-as> element and its interpret-as values:

<response>  
 <say type='ssml'>
 Cardinal value <say-as interpret-as="cardinal">12345</say-as>
 Ordinal value <say-as interpret-as="ordinal">1</say-as>
 Characters value <say-as interpret-as="characters">can</say-as>
 Fraction value <say-as interpret-as="fraction">5+1/2</say-as>
 Expletive or beep value <say-as interpret-as="expletive">censor this</say-as>
 Unit value <say-as interpret-as="unit">10 foot</say-as>
 Verbatim or spell-out value <say-as interpret-as="verbatim">abcdefg</say-as>
 Time value <say-as interpret-as="time" format="hms12">2:30pm</say-as>  
 Date value with detail 1 <say-as interpret-as="date" format="yyyymmdd" detail="1"> 1960-09-10</say-as> 
 Date value with detail 2 <say-as interpret-as="date" format="dmy" detail="2"> 10-9-1960 </say-as>    
 </say>
</response>
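
The date format rules described earlier (single versus repeated field codes, and separators) can be sketched as follows. The dates are illustrative only; the first entry omits the year, so under the detail rules above it would be expected to fall back to the detail='1' spoken form, though the exact output is an assumption that depends on the synthesizer:

```xml
<response>
 <say type='ssml'>
  <!-- Single codes: m and d each expect 2 digits; the year is omitted -->
  Short date <say-as interpret-as="date" format="md" detail="1">09-10</say-as>
  <!-- Repeated codes: dd, mm, yyyy expect 2, 2, and 4 digits; fields separated by spaces -->
  Full date <say-as interpret-as="date" format="ddmmyyyy" detail="1">10 09 1960</say-as>
 </say>
</response>
```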

<p> and <s>

The <p> and <s> elements let you create paragraphs and sentences.

Use <s>...</s> tags to wrap full sentences, especially if they contain SSML elements that change prosody (such as <audio>, <break>, <prosody>, and <say-as>).

If a break in speech is intended to be long enough that you can hear it, use <s>...</s> tags and put that break between sentences.

Example

The following example shows how to use the paragraph and sentence elements:

<response>  
 <say type='ssml'>
 <p><s>This is sentence one.</s><s>This is sentence two.</s></p>
 </say>
</response>
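
The advice above about placing an audible break between sentences, rather than inside one, might look like the sketch below. The one-second duration and the sentences are arbitrary choices for illustration:

```xml
<response>
 <say type='ssml'>
  <p>
   <s>Your appointment is confirmed.</s>
   <!-- The audible pause sits between the two sentences -->
   <break time="1s"/>
   <s>Press 1 to hear the details again.</s>
  </p>
 </say>
</response>
```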

<prosody>

The <prosody> element lets you customize the pitch, speaking rate, and volume of the text it contains. Currently the rate, pitch, and volume attributes are supported.

The rate and volume attributes can be set according to the W3C specification (www.w3.org/TR/speech-synthesis11/#S3.2.4). There are three options for setting the value of the pitch attribute:

| Option | Description |
| --- | --- |
| Relative | Specify a relative value (e.g. "low", "medium", "high", etc.) where "medium" is the default pitch. |
| Semitones | Increase or decrease pitch by "N" semitones using "+Nst" or "-Nst" respectively. Note that "+/-" and "st" are required. |
| Percentage | Increase or decrease pitch by "N" percent by using "+N%" or "-N%" respectively. Note that "%" is required but "+/-" is optional. |

Example

The following example shows how to use the <prosody> element and its options:

<response>  
 <say type='ssml'>
 <prosody rate="slow" pitch="-2st">Can you hear me now?</prosody>
 </say>
</response>
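
For comparison, here is a sketch that uses each of the three pitch options (relative, semitones, percentage) once, plus rate and volume labels taken from the W3C SSML 1.1 value sets. The sentences are placeholders, and the audible difference between settings depends on the voice:

```xml
<response>
 <say type='ssml'>
  <!-- Relative pitch value -->
  <prosody pitch="high">This is spoken at a higher pitch.</prosody>
  <!-- Semitones: the sign and "st" are both required -->
  <prosody pitch="+2st">This is two semitones higher.</prosody>
  <!-- Percentage: "%" is required, the sign is optional -->
  <prosody pitch="-10%">This is ten percent lower.</prosody>
  <!-- Rate and volume use W3C-defined labels -->
  <prosody rate="fast" volume="loud">This is faster and louder.</prosody>
 </say>
</response>
```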

<audio>

The <audio> element supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output. To learn more about the audio element, see the W3C specification at https://www.w3.org/TR/speech-synthesis/#S3.3.1

| Attribute | Required | Default | Values |
| --- | --- | --- | --- |
| src | yes | n/a | A URI referring to the audio media source. Supported protocol is https. |
| clipBegin | no | 0 | A TimeDesignation that is the offset from the audio source's beginning to start playback from. If this value is greater than or equal to the audio source's actual duration, then no audio is inserted. |
| clipEnd | no | infinity | A TimeDesignation that is the offset from the audio source's beginning to end playback at. If the audio source's actual duration is less than this value, then playback ends at that time. If clipBegin is greater than or equal to clipEnd, then no audio is inserted. |
| speed | no | 100% | The ratio of the output playback rate to the normal input rate, expressed as a percentage. The format is a positive Real Number followed by %. The currently supported range is [50% (slow - half speed), 200% (fast - double speed)]. Values outside that range may (or may not) be adjusted to be within it. |
| repeatCount | no | 1, or 10 if repeatDur is set | A Real Number specifying how many times to insert the audio (after clipping, if any, by clipBegin and/or clipEnd). Fractional repetitions aren't supported, so the value is rounded to the nearest integer. Zero is not a valid value, so it is treated as unspecified and the default value applies. |
| repeatDur | no | infinity | A TimeDesignation that is a limit on the duration of the inserted audio after the source is processed for the clipBegin, clipEnd, repeatCount, and speed attributes (rather than the normal playback duration). If the duration of the processed audio is less than this value, then playback ends at that time. |
| soundLevel | no | +0dB | Adjusts the sound level of the audio by soundLevel decibels. Maximum range is +/-40dB, but the actual range may be effectively less, and output quality may not be good over the entire range. |

Example

The following example shows how to use the <audio> element and its attributes:

<response>  
 <say type='ssml'>
 <audio src="cat_purr_close.ogg">
 <desc>a cat purring</desc>
 PURR (sound didn't load)
 </audio>
 </say>
</response>
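
The optional attributes from the table above can be combined as in the sketch below. The https://example.com URL is a placeholder and not a real audio file, and the clip, repeat, speed, and level values are arbitrary picks within the documented ranges:

```xml
<response>
 <say type='ssml'>
  <!-- Plays seconds 1 to 4 of the file, twice, at 150% speed and 6 dB quieter -->
  <audio src="https://example.com/cat_purr_close.ogg"
         clipBegin="1s" clipEnd="4s"
         repeatCount="2" speed="150%" soundLevel="-6dB">
   <desc>a cat purring</desc>
   PURR (sound didn't load)
  </audio>
 </say>
</response>
```

The text inside the element after <desc> serves the same fallback role as in the example above.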