SSML Elements

Using Speech Synthesis Markup Language (SSML), you can make responses sound more like natural speech.

To get started, make sure you set the type attribute of the <say> element to SSML (<say type='ssml'>). See https://docs.ytel.com/docs/say for more information.

<break>

The <break> element controls pausing or other prosodic boundaries between words. Using <break> between any pair of tokens is optional. If this element is not present between words, the break is determined automatically from the linguistic context.

Element Attributes

| Attribute | Description |
| --- | --- |
| time (optional) | Sets the length of the break in seconds or milliseconds (e.g. "3s" or "250ms"). |
| strength (optional) | Sets the strength of the prosodic break in relative terms. Valid values are "none", "x-weak", "weak", "medium", "strong", and "x-strong". The value "none" indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break that the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between tokens. Stronger boundaries are typically accompanied by pauses. |

Example

The following example shows how to use the <break> element to pause between steps:

<response>  
 <say type='ssml'>
  Step 1, take a deep breath. <break time="200ms"/>
  Step 2, exhale.
  Step 3, take a deep breath again. <break strength="weak"/>
  Step 4, exhale.
  </say>
</response>
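
As a further sketch, the time attribute also accepts whole seconds, and strength="none" can suppress a pause that the processor might otherwise insert. The spoken text here is invented for illustration, and how audibly the strength values differ is likely to depend on the voice in use.

```xml
<response>
 <say type='ssml'>
  <!-- A three-second pause, given in seconds rather than milliseconds -->
  Please hold while I look that up. <break time="3s"/>
  <!-- Ask the processor not to add a prosodic break here -->
  Thanks for waiting. <break strength="none"/>
  Your order has shipped.
  <!-- The strongest relative boundary, typically heard as a clear pause -->
  <break strength="x-strong"/>
  Is there anything else I can help with?
 </say>
</response>
```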

<say-as>

The <say-as> element lets you indicate the type of text construct contained within the element. It also helps specify the level of detail for rendering the contained text.

Element Attributes

| Attribute | Description |
| --- | --- |
| interpret-as | Determines how the value is spoken. |
| format (optional) | May be used depending on the particular interpret-as value. See the date value below. |
| detail (optional) | May be used depending on the particular interpret-as value. See the date value below. |

The interpret-as attribute supports the following values:

| Value | Example |
| --- | --- |
| cardinal | "12345" would be spoken as "Twelve thousand three hundred forty-five" (US English) or "Twelve thousand three hundred and forty-five" (UK English). |
| ordinal | "1" would be spoken as "First". |
| characters | "can" would be spoken as "C A N". |
| fraction | "5+1/2" would be spoken as "five and a half". |
| expletive or beep | Text comes out as a beep, as though it has been censored. |
| unit | Converts units to singular or plural depending on the number. "10 foot" would be spoken as "10 feet". |
| verbatim or spell-out | Text is spoken letter by letter. |
| time | "2:30pm" would be spoken as "Two thirty P.M." |
| date | See below for more details. |

Date

The format attribute is a sequence of date field character codes. The supported codes are y, m, and d for year, month, and day (of the month) respectively. If a code appears once, the number of digits expected is 4 for year, 2 for month, and 2 for day. If a code is repeated, the number of digits expected equals the number of times the code is repeated. Fields in the date text may be separated by punctuation and/or spaces.

The detail attribute controls the spoken form of the date. For detail='1', only the day field and one of the month or year fields are required, although both may be supplied. This is the default when fewer than all three fields are given. The spoken form is "The {ordinal day} of {month}, {year}".

Example

The following example shows how to use the <say-as> element and its interpret-as values:

<response>  
 <say type='ssml'>
 Cardinal value <say-as interpret-as="cardinal">12345</say-as>
 Ordinal value <say-as interpret-as="ordinal">1</say-as>
 Characters value <say-as interpret-as="characters">can</say-as>
 Fraction value <say-as interpret-as="fraction">5+1/2</say-as>
 Expletive or beep value <say-as interpret-as="expletive">censor this</say-as>
 Unit value <say-as interpret-as="unit">10 foot</say-as>
 Verbatim or spell-out value <say-as interpret-as="verbatim">abcdefg</say-as>
 Time value <say-as interpret-as="time" format="hms12">2:30pm</say-as>  
 Date value with detail 1 <say-as interpret-as="date" format="yyyymmdd" detail="1"> 1960-09-10</say-as> 
 Date value with detail 2 <say-as interpret-as="date" format="dmy" detail="2"> 10-9-1960 </say-as>    
 </say>
</response>
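
The field-code rules for date can be sketched as follows. The dates are arbitrary sample values, and the exact spoken output is not shown because it may vary by voice and locale.

```xml
<response>
 <say type='ssml'>
  <!-- Single codes: y expects 4 digits, m and d expect 2 digits each -->
  <say-as interpret-as="date" format="ymd">1960-09-10</say-as>
  <!-- Repeated codes: each code expects as many digits as it is repeated -->
  <say-as interpret-as="date" format="mmddyy">09-10-60</say-as>
  <!-- detail="1" needs only the day plus one of month or year -->
  <say-as interpret-as="date" format="dm" detail="1">10-09</say-as>
 </say>
</response>
```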

<p>, <s>

The <p> and <s> elements let you create paragraphs and sentences.

Use <s>...</s> tags to wrap full sentences, especially if they contain SSML elements that change prosody, such as <break>, <say-as>, <prosody>, or <audio>.

If a break in speech is intended to be long enough that you can hear it, use <s>...</s> tags and put that break between sentences.

Example

The following example shows how to use the paragraph and sentence elements:

<response>  
 <say type='ssml'>
 <p><s>This is sentence one.</s><s>This is sentence two.</s></p>
 </say>
</response>
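
To illustrate the advice about audible pauses, the sketch below places a <break> between two <s> elements rather than inside either sentence. The wording and the one-second duration are illustrative choices only.

```xml
<response>
 <say type='ssml'>
  <p>
   <s>Your appointment is confirmed.</s>
   <!-- The audible pause sits between sentences, not inside one -->
   <break time="1s"/>
   <s>Press 1 to hear the details again.</s>
  </p>
 </say>
</response>
```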

<prosody>

The <prosody> element lets you customize the pitch, speaking rate, and volume of the text it contains. Currently the rate, pitch, and volume attributes are supported.

The rate and volume attributes can be set according to the W3C specification at www.w3.org/TR/speech-synthesis11/#S3.2.4. There are three options for setting the value of the pitch attribute:

| Option | Description |
| --- | --- |
| Relative | Specify a relative value (e.g. "low", "medium", "high") where "medium" is the default pitch. |
| Semitones | Increase or decrease pitch by N semitones using "+Nst" or "-Nst" respectively. Note that "+/-" and "st" are required. |
| Percentage | Increase or decrease pitch by N percent using "+N%" or "-N%" respectively. Note that "%" is required but "+/-" is optional. |

Example

The following example shows how to use the <prosody> element and its options:

<response>  
 <say type='ssml'>
 <prosody rate="slow" pitch="-2st">Can you hear me now?</prosody>
 </say>
</response>
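
For comparison, here is a sketch that uses each of the three pitch forms once, followed by rate and volume set with labels from the W3C specification. The spoken phrases are placeholders.

```xml
<response>
 <say type='ssml'>
  <!-- Relative value -->
  <prosody pitch="low">This is spoken at a low pitch.</prosody>
  <!-- Semitones: the sign and the "st" suffix are both required -->
  <prosody pitch="+2st">This is two semitones higher.</prosody>
  <!-- Percentage: the "%" is required, the sign is optional -->
  <prosody pitch="-10%">This is ten percent lower.</prosody>
  <!-- Rate and volume using W3C label values -->
  <prosody rate="fast" volume="loud">This is faster and louder.</prosody>
 </say>
</response>
```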

<audio>

The <audio> element supports the insertion of recorded audio files and other audio formats in conjunction with synthesized speech output. To learn more about the audio element, see the W3C specification at https://www.w3.org/TR/speech-synthesis/#S3.3.1

| Attribute | Required | Default | Values |
| --- | --- | --- | --- |
| src | yes | n/a | A URI referring to the audio media source. The supported protocol is https. |
| clipBegin | no | 0 | A TimeDesignation that is the offset from the audio source's beginning to start playback from. If this value is greater than or equal to the audio source's actual duration, then no audio is inserted. |
| clipEnd | no | infinity | A TimeDesignation that is the offset from the audio source's beginning to end playback at. If the audio source's actual duration is less than this value, then playback ends at that time. If clipBegin is greater than or equal to clipEnd, then no audio is inserted. |
| speed | no | 100% | The ratio of the output playback rate to the normal input rate, expressed as a percentage. The format is a positive real number followed by %. The currently supported range is 50% (half speed) to 200% (double speed). Values outside that range may or may not be adjusted to fall within it. |
| repeatCount | no | 1, or 10 if repeatDur is set | A real number specifying how many times to insert the audio (after clipping, if any, by clipBegin and/or clipEnd). Fractional repetitions aren't supported, so the value is rounded to the nearest integer. Zero is not a valid value, so it is treated as unspecified and the default applies. |
| repeatDur | no | infinity | A TimeDesignation that limits the duration of the inserted audio after the source is processed for the clipBegin, clipEnd, repeatCount, and speed attributes (rather than the normal playback duration). If the duration of the processed audio is less than this value, then playback ends at that time. |
| soundLevel | no | +0dB | Adjusts the sound level of the audio by soundLevel decibels. The maximum range is +/-40dB, but the effective range may be smaller, and output quality may not be good over the entire range. |

Example

The following example shows how to use the <audio> element and its attributes:

<response>  
 <say type='ssml'>
 <audio src="cat_purr_close.ogg">
 <desc>a cat purring</desc>
 PURR (sound didn't load)
 </audio>
 </say>
</response>
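
A sketch combining several of the optional attributes might look like the following. The https URL is a placeholder rather than a real hosted file, and the attribute values are arbitrary choices within the ranges described in the table.

```xml
<response>
 <say type='ssml'>
  <!-- Placeholder URL: replace with an https link to your own audio file -->
  <audio src="https://example.com/audio/cat_purr_close.ogg"
         clipBegin="2s" clipEnd="5s"
         repeatCount="3" speed="150%" soundLevel="-6dB">
   <desc>a cat purring</desc>
   PURR (sound didn't load)
  </audio>
 </say>
</response>
```

Here the clip is trimmed to the span between 2 and 5 seconds of the source, repeated three times, sped up, and attenuated by 6 dB.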
