Here we specify how Speech Synthesis Markdown (SSMD) works.
SSMD is mapped to SSML using the following rules.
Any text written is implicitly wrapped in a <speak> root element.
This will be omitted in the rest of the examples shown in this section.
SSMD:
text
SSML:
<speak>text</speak>
SSMD:
*command* & conquer
SSML:
<emphasis level='moderate'>command</emphasis> & conquer
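The two rules above (implicit root element, emphasis via asterisks) can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `emphasisToSsml` and the regular expression are assumptions, and XML escaping of characters such as & is deliberately left out so the output mirrors the example above.

```javascript
// Hypothetical helper, not part of any library API: converts *text* spans
// to <emphasis> and wraps the result in the implicit <speak> root element.
function emphasisToSsml(ssmdText) {
  // An asterisk pair only counts when the enclosed text neither starts nor
  // ends with whitespace, so "2 * 3 * 4" is left untouched.
  const body = ssmdText.replace(
    /\*([^*\s](?:[^*]*[^*\s])?)\*/g,
    "<emphasis level='moderate'>$1</emphasis>"
  );
  return `<speak>${body}</speak>`;
}

console.log(emphasisToSsml("*command* & conquer"));
```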
Pauses can be indicated by using ... (three dots). Several modifications to the duration are allowed as shown below.
SSMD:
Hello ... world (default: x-strong break like after a paragraph)
Hello - ...0 world (skip break when there would otherwise be one like after this dash)
Hello ...c world (medium break like after a comma)
Hello ...s world (strong break like after a sentence)
Hello ...p world (extra strong break like after a paragraph)
Hello ...5s world (5 second break (max 10s))
Hello ...100ms world (100 millisecond break (max 10000ms))
Hello ...100 world (100 millisecond break (max 10000ms))
SSML:
<s>Hello <break time='1000ms'/> world (default: x-strong break like after a paragraph)</s>
<s>Hello - <break time='1000ms'/>0 world (skip break when there would otherwise be one like after this dash)</s>
<s>Hello <break time='1000ms'/>c world (medium break like after a comma)</s>
<s>Hello <break time='1000ms'/>s world (strong break like after a sentence)</s>
<s>Hello <break time='1000ms'/>p world (extra strong break like after a paragraph)</s>
<s>Hello <break time='5s'/> world (5 second break (max 10s))</s>
<s>Hello <break time='100ms'/> world (100 millisecond break (max 10000ms))</s>
<s>Hello <break time='1000ms'/>100 world (100 millisecond break (max 10000ms))</s>
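The sketch below shows one way the pause modifiers could be interpreted, following the parenthetical descriptions in the SSMD example (strength letters, explicit durations, a 10 second cap). The listed SSML output leaves the letter modifiers in the text, so this is an assumption about the intended behaviour and not a description of the current output. The names `STRENGTHS`, `MAX_MS` and `breakTag` are hypothetical, and durations are normalized to milliseconds.

```javascript
// Hypothetical mapping from a pause modifier (the part after "...") to a
// <break> tag, based on the descriptions in the example, not on library code.
const STRENGTHS = new Map([
  ["", "x-strong"], // plain "..." defaults to a paragraph-like break
  ["0", "none"],    // suppress a break that would otherwise occur
  ["c", "medium"],  // like after a comma
  ["s", "strong"],  // like after a sentence
  ["p", "x-strong"] // like after a paragraph
]);
const MAX_MS = 10000; // durations are capped at 10 seconds

function breakTag(modifier = "") {
  if (STRENGTHS.has(modifier)) {
    return `<break strength='${STRENGTHS.get(modifier)}'/>`;
  }
  // Explicit duration: a number with an optional "s" or "ms" unit.
  const match = /^(\d+)(ms|s)?$/.exec(modifier);
  if (!match) return null; // unknown modifier: the caller keeps the text as-is
  const value = Number(match[1]);
  const ms = match[2] === "s" ? value * 1000 : value; // bare numbers are ms
  return `<break time='${Math.min(ms, MAX_MS)}ms'/>`;
}

console.log(breakTag("c"));     // strength-based break
console.log(breakTag("100ms")); // time-based break
```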
Empty lines indicate a paragraph.
SSMD:
First prepare the ingredients.
Don't forget to wash them first.

Lastly mix them all together.

Don't forget to do the dishes after!
SSML:
<p><s>First prepare the ingredients.</s>
<s>Don't forget to wash them first.</s></p>
<p>Lastly mix them all together.</p>
<p>Don't forget to do the dishes after!</p>
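The paragraph rule can be sketched as below, assuming that blank lines separate paragraphs and that remaining line breaks separate sentences. The function name `paragraphsToSsml` is hypothetical; single-line paragraphs are emitted without an <s> wrapper only because the example output above does so.

```javascript
// Hypothetical helper: blank lines split paragraphs, line breaks inside a
// paragraph split sentences. Not taken from the library's source.
function paragraphsToSsml(ssmdText) {
  return ssmdText
    .split(/\n\s*\n/) // one or more blank lines end a paragraph
    .map((block) => block.split("\n").map((line) => line.trim()).filter(Boolean))
    .filter((lines) => lines.length > 0)
    .map((lines) => {
      // Multi-line paragraphs wrap each line in <s>; single lines stay bare.
      const body =
        lines.length === 1
          ? lines[0]
          : lines.map((line) => `<s>${line}</s>`).join("\n");
      return `<p>${body}</p>`;
    })
    .join("\n");
}

const recipe = [
  "First prepare the ingredients.",
  "Don't forget to wash them first.",
  "",
  "Lastly mix them all together.",
].join("\n");
console.log(paragraphsToSsml(recipe));
```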
The prosody, or rhythm, depends on the volume, rate and pitch of the delivered text.
Each of those values can be defined by a number between 1 and 5 (volume also accepts 0 for silent), with the following meanings:
number | volume | rate | pitch |
---|---|---|---|
0 | silent | | |
1 | x-soft | x-slow | x-low |
2 | soft | slow | low |
3 | medium | medium | medium |
4 | loud | fast | high |
5 | x-loud | x-fast | x-high |
SSMD:
Volume:
~silent~
--extra soft--
-soft-
medium
+loud+
++extra loud++
Rate:
<<extra slow<<
<slow<
medium
fast: >fast>
extra fast: >>extra fast>>
Pitch:
__extra low__
_low_
medium
^high^
^^extra high^^
[extra loud, fast, and high](vrp: 555)
SSML:
<s>Volume:</s>
<s><prosody volume='silent'>silent</prosody></s>
<s><prosody volume='x-soft'>extra soft</prosody></s>
<s><prosody volume='soft'>soft</prosody></s>
<s>medium</s>
<s><prosody volume='loud'>loud</prosody></s>
<s><prosody volume='x-loud'>extra loud</prosody></s>
<s>Rate:</s>
<s><prosody rate='x-slow'>extra slow</prosody></s>
<s><prosody rate='slow'>slow</prosody></s>
<s>medium</s>
<s>fast: <prosody rate='fast'>fast</prosody></s>
<s>extra fast: <prosody rate='x-fast'>extra fast</prosody></s>
<s>Pitch:</s>
<s><prosody pitch='x-low'>extra low</prosody></s>
<s><prosody pitch='low'>low</prosody></s>
<s>medium</s>
<s><prosody pitch='high'>high</prosody></s>
<s><prosody pitch='x-high'>extra high</prosody></s>
<s><prosody rate='x-fast' pitch='x-high' volume='x-loud'>extra loud, fast, and high</prosody></s>
The shortcuts are listed first. While they can be combined, it is sometimes easier and shorter to use
the explicit form shown in the last line of the example. All of them can be nested, too.
Moreover, changes in volume ([louder](v: +10dB)) and pitch ([lower](p: -4%)) can also be given explicitly as relative values.
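The number table and the explicit vrp form can be sketched as a simple lookup. This is an illustration under assumptions: the names `VOLUME`, `RATE`, `PITCH` and `vrpToProsody` are hypothetical, the three digits are read in the order volume, rate, pitch, and relative values such as +10dB are not handled.

```javascript
// Lookup tables taken from the number table above; index = number.
const VOLUME = ["silent", "x-soft", "soft", "medium", "loud", "x-loud"];
const RATE = [null, "x-slow", "slow", "medium", "fast", "x-fast"];
const PITCH = [null, "x-low", "low", "medium", "high", "x-high"];

// Hypothetical helper: turns a three-digit vrp value into a <prosody> tag.
function vrpToProsody(text, vrp) {
  // Volume accepts 0-5, rate and pitch accept 1-5.
  if (!/^[0-5][1-5][1-5]$/.test(vrp)) {
    throw new RangeError(`invalid vrp value: ${vrp}`);
  }
  const [v, r, p] = [...vrp].map(Number);
  return `<prosody rate='${RATE[r]}' pitch='${PITCH[p]}' volume='${VOLUME[v]}'>${text}</prosody>`;
}

console.log(vrpToProsody("extra loud, fast, and high", "555"));
```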
You can give the speech synthesis engine hints as to what it is supposed to read by using the as annotation.
Possible values:
- character - spell out each single character, e.g. for KGB
- number - cardinal number, e.g. 100
- ordinal - ordinal number, e.g. 1st
- digits - spell out each single digit, e.g. 123 as 1 - 2 - 3
- fraction - pronounce number as fraction, e.g. 3/4 as three quarters
- unit - e.g. 1meter
- date - read content as a date, must provide format
- time - duration in minutes and seconds
- address - read as part of an address
- telephone - read content as a telephone number
- expletive - beeps out the content
SSMD:
telephone number is [+49 123456](as: telephone).
You can't say [fuck](as: expletive) on television.
SSML:
<s>telephone number is <say-as interpret-as='telephone'>+49 123456</say-as>.</s>
<s>You can't say <say-as interpret-as='expletive'>fuck</say-as> on television.</s>
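A minimal sketch of the as annotation, assuming the values listed above are the only ones accepted. The names `INTERPRET_AS` and `sayAsToSsml` are hypothetical, unknown values are left untouched, and the format that date requires is not handled here.

```javascript
// Values from the list above; anything else is left as plain text.
const INTERPRET_AS = new Set([
  "character", "number", "ordinal", "digits", "fraction", "unit",
  "date", "time", "address", "telephone", "expletive",
]);

// Hypothetical helper: rewrites [text](as: value) into a <say-as> element.
function sayAsToSsml(ssmdText) {
  return ssmdText.replace(
    /\[([^\]]*)\]\(as:\s*([\w-]+)\)/g,
    (whole, text, kind) =>
      INTERPRET_AS.has(kind)
        ? `<say-as interpret-as='${kind}'>${text}</say-as>`
        : whole // unknown value: keep the original markup
  );
}

console.log(sayAsToSsml("telephone number is [+49 123456](as: telephone)."));
```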
Audio
Syntax: [description of sound](urlOfSound.mp3 alternative text)
The description text is used for display.
Following the URL, an alternative text may be provided in case the file cannot be read.
SSMD:
Here's a fun sound [boing](https://example.com/sounds/boing.mp3)
[a cat purring](cat_purr_close.ogg Purr (sound didn't load))
[](miaou.mp3)
SSML:
<s>Here's a fun sound <audio src="https://example.com/sounds/boing.mp3"><desc>boing</desc></audio></s>
<s><audio src="cat_purr_close.ogg"><desc>a cat purring</desc>Purr (sound didn't load)</audio></s>
<s><audio src="miaou.mp3"></audio></s>
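The audio syntax can be sketched as a parser for a single link. This is an illustration, not the library's parser: the function name `audioLinkToSsml` is hypothetical, the input is assumed to be exactly one [description](url alternative text) link, and the check that skips annotation keys such as v: or as: is an assumption about how the two forms might be told apart.

```javascript
// Hypothetical helper: converts one [description](url alt text) link into an
// <audio> element. Returns null when the input is not an audio link.
function audioLinkToSsml(link) {
  // Group 1: description, group 2: URL (no spaces), group 3: optional
  // alternative text, which may itself contain parentheses.
  const match = /^\[([^\]]*)\]\((\S+?)(?:\s+(.*))?\)$/.exec(link.trim());
  if (!match) return null;
  const [, description, src, alt = ""] = match;
  // Assumption: "key:" tokens such as "v:" or "as:" mark annotations, not URLs.
  if (/^[a-z]+:$/.test(src)) return null;
  const desc = description ? `<desc>${description}</desc>` : "";
  return `<audio src="${src}">${desc}${alt}</audio>`;
}

console.log(audioLinkToSsml("[boing](https://example.com/sounds/boing.mp3)"));
console.log(audioLinkToSsml("[a cat purring](cat_purr_close.ogg Purr (sound didn't load))"));
```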
Headings add emphasis and a short break by default, but you can configure this as you like:
const ssml = ssmd("# My first heading 1", {
  headingLevels: {
    1: [
      { tag: "emphasis", value: 'strong' },
      { tag: "pause", value: '300ms' },
    ],
    // If key 2 is omitted, heading 2 uses its default parameters.
    3: [
      { tag: "pause", value: '50ms' },
      { tag: "prosody", value: {rate: 'slow'} },
      { tag: "pause", value: '200ms' },
    ],
  }
});
You can use any tag and value supported by the ssml-builder project.
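By default, headings give:
- # Heading 1 -> strong emphasis and a 100ms pause after
- ## Heading 2 -> moderate emphasis and a 75ms pause after
- ### Heading 3 -> reduced emphasis and a 50ms pause after
The example below uses the custom configuration shown above, so headings 1 and 3 differ from these defaults.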
SSMD:
# Heading 1
## Heading 2
##Heading 2
### Heading 3
#### Heading 4 // Not handled by default
SSML:
<s><emphasis level='strong'>Heading 1</emphasis> <break time='300ms'/></s>
<s><emphasis level='moderate'>Heading 2</emphasis> <break time='75ms'/></s>
<s><emphasis level='moderate'>Heading 2</emphasis> <break time='75ms'/></s>
<s><break time='50ms'/> <prosody rate='slow'>Heading 3</prosody> <break time='200ms'/></s>
<s>#### Heading 4 // Not handled by default</s>
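The way a headingLevels configuration could be applied is sketched below. Everything here is an assumption inferred from the example output: the names `DEFAULT_HEADING_LEVELS` and `renderHeading` are hypothetical, only the emphasis, prosody and pause tags are covered, and pauses listed before the first wrapping tag are placed before the text while later pauses follow it.

```javascript
// Defaults as described in the list above (hypothetical constant name).
const DEFAULT_HEADING_LEVELS = {
  1: [{ tag: "emphasis", value: "strong" }, { tag: "pause", value: "100ms" }],
  2: [{ tag: "emphasis", value: "moderate" }, { tag: "pause", value: "75ms" }],
  3: [{ tag: "emphasis", value: "reduced" }, { tag: "pause", value: "50ms" }],
};

// Hypothetical helper: renders one heading line with the merged configuration.
function renderHeading(line, headingLevels = {}) {
  const match = /^(#+)\s*(.*)$/.exec(line); // the space after # is optional
  if (!match) return line;
  const levels = { ...DEFAULT_HEADING_LEVELS, ...headingLevels };
  const steps = levels[match[1].length];
  if (!steps) return line; // e.g. level 4 has no default and stays as text

  let body = match[2];
  const before = [];
  const after = [];
  let wrapped = false;
  for (const { tag, value } of steps) {
    if (tag === "pause") {
      // Pauses before the first wrapping tag precede the text.
      (wrapped ? after : before).push(`<break time='${value}'/>`);
    } else if (tag === "emphasis") {
      body = `<emphasis level='${value}'>${body}</emphasis>`;
      wrapped = true;
    } else if (tag === "prosody") {
      const attrs = Object.entries(value).map(([k, v]) => `${k}='${v}'`).join(" ");
      body = `<prosody ${attrs}>${body}</prosody>`;
      wrapped = true;
    }
  }
  return [...before, body, ...after].join(" ");
}

console.log(renderHeading("## Heading 2"));
```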
Amazon SSML
SSMD:
If he [whispers](ext: whisper), he lies.
Listen this [https://example.com/test.mp3](ext: audio).
[Waouh](as: interjection) trop bien !
SSML:
<s>If he <amazon:effect name="whispered">whispers</amazon:effect>, he lies.</s>
<s>Listen this <audio src='https://example.com/test.mp3'/>.</s>
<s><say-as interpret-as='interjection'>Waouh</say-as> trop bien !</s>
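The two ext annotations shown above can be sketched as a small substitution. The function name `amazonExtToSsml` is hypothetical, and only the whisper and audio values from the example are covered; any other value is left untouched.

```javascript
// Hypothetical helper: rewrites [text](ext: name) for the two extensions
// shown in the example. Not taken from the library's source.
function amazonExtToSsml(ssmdText) {
  return ssmdText.replace(
    /\[([^\]]*)\]\(ext:\s*(\w+)\)/g,
    (whole, text, ext) => {
      if (ext === "whisper") {
        return `<amazon:effect name="whispered">${text}</amazon:effect>`;
      }
      if (ext === "audio") {
        return `<audio src='${text}'/>`; // the bracketed text is the URL
      }
      return whole; // unknown extension: keep the original markup
    }
  );
}

console.log(amazonExtToSsml("If he [whispers](ext: whisper), he lies."));
```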