SSML Elements

Using Speech Synthesis Markup Language (SSML), you can make responses sound more like natural speech.

To get started, make sure you set the Say element type attribute to SSML. See https://docs.ytel.com/docs/say for more information.
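
As a minimal sketch of that setup, modeled on the examples later on this page, the SSML markup goes inside a Say element whose type attribute is set to ssml. The spoken text is only a placeholder, and the Break element used here is described in the next section.

```xml
<response>
  <say type='ssml'>
    Thanks for calling. <break time="500ms"/> How can we help?
  </say>
</response>
```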

Break

The Break element controls pausing or other prosodic boundaries between words. Using Break between any pair of tokens is optional. If this element is not present between words, the break is automatically determined based on the linguistic context.

Element Attributes

  • time (optional): Sets the length of the break in seconds or milliseconds (e.g. "3s" or "250ms").
  • strength (optional): Sets the strength of the prosodic break in relative terms. Valid values are "x-weak", "weak", "medium", "strong", and "x-strong". The value "none" indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break that the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between tokens. The stronger boundaries are typically accompanied by pauses.

Example

The following example shows how to use the Break element to pause between steps:

<response>  
  <say type='ssml'>
    Step 1, take a deep breath. <break time="200ms"/>
    Step 2, exhale.
    Step 3, take a deep breath again. <break strength="weak"/>
    Step 4, exhale.
  </say>
</response>
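
The same two attributes can also lengthen or suppress a pause. The sketch below assumes the platform honors the time and strength values described above; the spoken text is illustrative only.

```xml
<response>
  <say type='ssml'>
    Please hold while we connect you. <break time="3s"/>
    <!-- strength="none" asks the processor not to insert a pause here -->
    Your reference number is <break strength="none"/> 4 5 1 2.
    <!-- a stronger break is typically accompanied by a noticeable pause -->
    Thank you. <break strength="x-strong"/> Goodbye.
  </say>
</response>
```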

<say-as>

The <say-as> element lets you specify information about the type of text construct that is contained within the element.

Attributes

  • interpret-as: Determines how the value is spoken
  • format: Optional formatting for specific interpret-as values
  • detail: Optional detail level for specific interpret-as values

Supported Values

  • cardinal: Speaks numbers as words
  • ordinal: Speaks numbers as ordinal terms
  • characters: Spells out words letter by letter
  • fraction: Converts fractions to spoken words
  • expletive or beep: Censors text
  • unit: Converts units to appropriate form
  • verbatim or spell-out: Spells out text
  • time: Speaks time in a natural format
  • date: Speaks dates with configurable detail

Example

<response>  
  <say type='ssml'>
    Cardinal value <say-as interpret-as="cardinal">12345</say-as>
    Ordinal value <say-as interpret-as="ordinal">1</say-as>
    Characters value <say-as interpret-as="characters">can</say-as>
    Fraction value <say-as interpret-as="fraction">5+1/2</say-as>
    Expletive or beep value <say-as interpret-as="expletive">censor this</say-as>
    Unit value <say-as interpret-as="unit">10 foot</say-as>
    Verbatim or spell-out value <say-as interpret-as="verbatim">abcdefg</say-as>
    Time value <say-as interpret-as="time" format="hms12">2:30pm</say-as>  
    Date value with detail 1 <say-as interpret-as="date" format="yyyymmdd" detail="1"> 1960-09-10</say-as> 
    Date value with detail 2 <say-as interpret-as="date" format="dmy" detail="2"> 10-9-1960 </say-as>    
  </say>
</response>
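
In practice these values are usually mixed into ordinary sentences. The following is a sketch only: the code, position, date, and time are made-up sample values, and the exact spoken rendering depends on the speech engine.

```xml
<response>
  <say type='ssml'>
    Your confirmation code is <say-as interpret-as="characters">X7Q2</say-as>.
    You are <say-as interpret-as="ordinal">3</say-as> in line.
    Your appointment is on <say-as interpret-as="date" format="dmy" detail="2">10-9-1960</say-as>
    at <say-as interpret-as="time" format="hms12">2:30pm</say-as>.
  </say>
</response>
```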

<p> and <s>

The <p> and <s> elements let you create paragraphs and sentences.

Use <s>...</s> tags to wrap full sentences, especially if they contain SSML elements that change prosody (that is, <audio>, <break>, <emphasis>, <par>, <prosody>, <say-as>, <seq>, and <sub>).

If a break in speech is intended to be long enough that you can hear it, use <s>...</s> tags and put that break between sentences.

Example

<response>  
  <say type='ssml'>
    <p>
      <s>This is sentence one.</s>
      <s>This is sentence two.</s>
    </p>
  </say>
</response>

Additional Notes

  • The <s> tag helps define sentence boundaries
  • SSML elements within <s> tags can modify how the sentence is spoken
  • Proper use of these tags can improve speech synthesis clarity

Prosody

The Prosody element lets you customize the pitch, speaking rate, and volume of text contained by the element. Currently the rate, pitch, and volume attributes are supported.

The rate and volume attributes can be set according to the W3C specification. There are three options for setting the value of the pitch attribute:

  • Relative: Specify a relative value (e.g. "low", "medium", "high", etc.) where "medium" is the default pitch.
  • Semitones: Increase or decrease pitch by "N" semitones using "+Nst" or "-Nst" respectively. Note that "+/-" and "st" are required.
  • Percentage: Increase or decrease pitch by "N" percent by using "+N%" or "-N%" respectively. Note that "%" is required but "+/-" is optional.

Example

<response>  
  <say type='ssml'>
    <prosody rate="slow" pitch="-2st">Can you hear me now?</prosody>
  </say>
</response>
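
The relative and percentage pitch options, and the volume attribute, follow the same pattern. This sketch assumes the W3C label values "fast" and "loud" are accepted for rate and volume; check them against the platform before relying on them.

```xml
<response>
  <say type='ssml'>
    <!-- relative pitch -->
    <prosody pitch="high">This is spoken at a higher pitch.</prosody>
    <!-- percentage pitch: "%" is required, the sign is optional -->
    <prosody pitch="+10%" rate="fast">This is slightly higher and faster.</prosody>
    <!-- volume label taken from the W3C specification (assumed to be supported) -->
    <prosody volume="loud">This is louder.</prosody>
  </say>
</response>
```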

Audio

The Audio element supports the insertion of recorded audio files and other audio formats in conjunction with synthesized speech output.

Refer to the W3C specification for detailed information: https://www.w3.org/TR/speech-synthesis/#S3.3.1

Attributes

  • src (required, no default): URI referring to the audio media source (HTTPS only)
  • clipBegin (optional, default 0): Offset from audio source's beginning to start playback
  • clipEnd (optional, default Infinity): Offset from audio source's beginning to end playback
  • speed (optional, default 100%): Playback rate relative to normal input rate
  • repeatCount (optional, default 1): Number of times to insert the audio
  • repeatDur (optional, default Infinity): Duration limit for inserted audio
  • soundLevel (optional, default +0dB): Sound level adjustment in decibels

Example

<response>  
  <say type='ssml'>
    <audio src="https://www.example.com/cat_purr_close.ogg">
      <desc>a cat purring</desc>
      PURR (sound didn't load)
    </audio>
  </say>
</response>
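
The optional attributes can be combined to trim, repeat, and adjust a clip. In this sketch the URL is a hypothetical placeholder and the attribute values are illustrative; the time format ("1s", "4s") is assumed to match the one used by the Break element.

```xml
<response>
  <say type='ssml'>
    <!-- hypothetical HTTPS URL; replace with your own hosted file -->
    <audio src="https://www.example.com/audio/hold_music.mp3"
           clipBegin="1s" clipEnd="4s" repeatCount="2" soundLevel="+6dB">
      <desc>hold music</desc>
      Please continue to hold.
    </audio>
  </say>
</response>
```

If the attributes behave as described above, the span from 1s to 4s of the file would be played twice at a raised sound level, and the text inside the element would be spoken only if the audio cannot be loaded.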