This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Read entire stdin with --ssml, gruut bumped to 2.1
synesthesiam committed Nov 10, 2021
1 parent e03cf6c commit 44ecac6
Showing 11 changed files with 166 additions and 31 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG
@@ -1,3 +1,10 @@
## [1.1.0]

### Changed

- With --ssml, input from stdin is assumed to be one document instead of lines (override with --stdin-format lines)
- Bump gruut to version 2.1 for inline lexicons

## [1.0.0] - 20 Oct 2021

### Added
152 changes: 125 additions & 27 deletions README.md
@@ -143,33 +143,6 @@ Voices and vocoders are automatically downloaded when used on the command-line o

---

## SSML

A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported (use `--ssml`):

* `<speak>` - wrap around SSML text
    * `lang` - set language for document
* `<s>` - sentence (disables automatic sentence breaking)
    * `lang` - set language for sentence
* `<w>` / `<token>` - word (disables automatic tokenization)
* `<voice name="...">` - set voice of inner text
    * `name` - name or language of voice
* `<say-as interpret-as="">` - force interpretation of inner text
    * `interpret-as` - one of "spell-out", "date", "number", "time", or "currency"
    * `format` - way to format text depending on `interpret-as`
        * number - one of "cardinal", "ordinal", "digits", "year"
        * date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
* `<break time="">` - pause for given amount of time
    * `time` - seconds ("123s") or milliseconds ("123ms")
* `<mark name="">` - user-defined mark (written to `--mark-file` or part of `TextToSpeechResult`)
    * `name` - name of mark
* `<sub alias="">` - substitute `alias` for inner text
* `<phoneme ph="...">` - supply phonemes for inner text
    * `ph` - phonemes for each word of inner text, separated by whitespace
    * `alphabet` - if "ipa", phonemes are intelligently split ("aːˈb" -> "aː", "ˈb")

---

## Command-Line Interface

Larynx has a flexible command-line interface, available with:
@@ -314,6 +287,131 @@ You can specify the vocoder quality by adding `;<QUALITY>` to the MaryTTS voice

For example: `en;low` will use the lowest quality (but fastest) vocoder. This is usually necessary to get decent performance on a Raspberry Pi.

---

## SSML

A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported (use `--ssml`):

* `<speak>` - wrap around SSML text
    * `lang` - set language for document
* `<s>` - sentence (disables automatic sentence breaking)
    * `lang` - set language for sentence
* `<w>` / `<token>` - word (disables automatic tokenization)
* `<voice name="...">` - set voice of inner text
    * `name` - name or language of voice
* `<say-as interpret-as="">` - force interpretation of inner text
    * `interpret-as` - one of "spell-out", "date", "number", "time", or "currency"
    * `format` - way to format text depending on `interpret-as`
        * number - one of "cardinal", "ordinal", "digits", "year"
        * date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
* `<break time="">` - pause for given amount of time
    * `time` - seconds ("123s") or milliseconds ("123ms")
* `<mark name="">` - user-defined mark (written to `--mark-file` or part of `TextToSpeechResult`)
    * `name` - name of mark
* `<sub alias="">` - substitute `alias` for inner text
* `<phoneme ph="...">` - supply phonemes for inner text
    * `ph` - phonemes for each word of inner text, separated by whitespace
* `<lexicon id="...">` - inline pronunciation lexicon
    * `id` - unique id of lexicon (used in `<lookup ref="...">`)
    * One or more `<lexeme>` child elements with:
        * `<grapheme role="...">WORD</grapheme>` - word text (optional [role](#word-roles))
        * `<phoneme>P H O N E M E S</phoneme>` - word pronunciation (phonemes separated by whitespace)
* `<lookup ref="...">` - use inline pronunciation lexicon for child elements
    * `ref` - id from a `<lexicon id="...">`
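
As a rough illustration of the elements listed above, the following Python sketch assembles a small SSML document and checks that it is well-formed XML using only the standard library. The sample text and the mark name `after_pause` are made up for illustration, and the sketch does not call Larynx or gruut; it only shows how the tags nest.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML that uses elements from the list above.
# The sentences and the mark name are invented for this example.
SSML = """<speak xml:lang="en-US">
  <s>
    She finished <say-as interpret-as="number" format="ordinal">3</say-as>
    in the <say-as interpret-as="spell-out">UK</say-as> final.
  </s>
  <break time="500ms" />
  <mark name="after_pause" />
  <s><sub alias="World Wide Web Consortium">W3C</sub> maintains SSML.</s>
</speak>"""

# fromstring raises xml.etree.ElementTree.ParseError if the text is not well-formed
root = ET.fromstring(SSML)

# Element names in document order
tags = [element.tag for element in root.iter()]
print(tags)
```

Checking well-formedness before handing a document to `--ssml` can catch unclosed tags early, since the whole of stdin is treated as one document in that mode.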

### Word Roles

During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as `gruut:<TAG>`. For initialisms and `spell-out`, the role `gruut:letter` is used to indicate that e.g., "a" should be spoken as `/eɪ/` instead of `/ə/`.

For `en-us`, the following additional roles are available from the part-of-speech tagger:

* `gruut:CD` - number
* `gruut:DT` - determiner
* `gruut:IN` - preposition or subordinating conjunction
* `gruut:JJ` - adjective
* `gruut:NN` - noun
* `gruut:PRP` - personal pronoun
* `gruut:RB` - adverb
* `gruut:VB` - verb
* `gruut:VBD` - verb (past tense)
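
The role derivation described above can be sketched in a few lines of Python. This is a minimal model, not gruut's implementation: `default_role` and the tiny `PRONUNCIATIONS` table are hypothetical names introduced here, and only the `gruut:<TAG>` naming scheme, the `gruut:letter` role, and the `/eɪ/` versus `/ə/` example come from the text.

```python
from typing import Optional

LETTER_ROLE = "gruut:letter"


def default_role(pos_tag: Optional[str], spell_out: bool = False) -> Optional[str]:
    """Return the role a word gets when none is specified manually (sketch)."""
    if spell_out:
        # Initialisms and spell-out use the letter role
        return LETTER_ROLE
    if pos_tag:
        # Otherwise the role is derived from the part-of-speech tag
        return f"gruut:{pos_tag}"
    return None


# Hypothetical role-keyed pronunciations for the word "a"
PRONUNCIATIONS = {
    ("a", "gruut:letter"): "eɪ",
    ("a", "gruut:DT"): "ə",
}

print(PRONUNCIATIONS[("a", default_role("DT"))])
print(PRONUNCIATIONS[("a", default_role("DT", spell_out=True))])
```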

### Inline Lexicons

Inline [pronunciation lexicons](https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/) are supported via the `<lexicon>` and `<lookup>` tags. gruut diverges slightly from the [SSML standard](https://www.w3.org/TR/speech-synthesis11/) here by only allowing lexicons to be defined within the SSML document itself. Additionally, the `id` attribute of the `<lexicon>` element can be left off to indicate a "default" inline lexicon that does not require a corresponding `<lookup>` tag.

For example, the following document will yield three different pronunciations for the word "tomato":

``` xml
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        <!-- Individual phonemes are separated by whitespace -->
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">
        tomato
      </grapheme>
      <phoneme>
        <!-- Made up pronunciation for fake word role -->
        t ə m ˈi t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
  <lookup ref="test">
    <w>tomato</w>
    <w role="fake-role">tomato</w>
  </lookup>
</speak>
```

The first "tomato" will be looked up in the U.S. English lexicon (`/t ə m ˈeɪ t oʊ/`). Within the `<lookup>` tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a [role](#word-roles) attached, which selects the made-up pronunciation defined for `fake-role`.
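
To make the structure of an inline lexicon concrete, here is a minimal Python sketch that reads `<lexeme>` entries into a dictionary keyed by word and role. It is not how gruut parses SSML: `read_lexicons` is a hypothetical helper, and the sample document omits the `xmlns` declarations and XML comments from the example above so that plain tag names can be used with the standard library parser.

```python
import xml.etree.ElementTree as ET

# Simplified version of the document above (no namespaces, no comments)
SSML = """<speak>
  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>tomato</grapheme>
      <phoneme>t ə m ˈɑ t oʊ</phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">tomato</grapheme>
      <phoneme>t ə m ˈi t oʊ</phoneme>
    </lexeme>
  </lexicon>
</speak>"""

# ElementTree expands the built-in xml: prefix to this namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def read_lexicons(ssml: str) -> dict:
    """Map lexicon id -> {(word, role): [phoneme, ...]} (hypothetical helper)."""
    lexicons = {}
    for lexicon in ET.fromstring(ssml).iter("lexicon"):
        entries = {}
        for lexeme in lexicon.iter("lexeme"):
            grapheme = lexeme.find("grapheme")
            phoneme = lexeme.find("phoneme")
            # Role is None when the grapheme has no role attribute
            key = (grapheme.text.strip(), grapheme.get("role"))
            # Phonemes are separated by whitespace
            entries[key] = phoneme.text.split()
        lexicons[lexicon.get(XML_ID)] = entries
    return lexicons


lexicons = read_lexicons(SSML)
print(lexicons["test"][("tomato", "fake-role")])
```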

Diverging further from the SSML standard, gruut allows you to omit the `<lexicon>` `id` entirely. Without an `id`, a `<lookup>` tag is no longer needed, so the lexicon can override the pronunciation of any word in the document:

``` xml
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- No id means change all words without a lookup -->
  <lexicon>
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
</speak>
```

This will yield a pronunciation of `/t ə m ˈɑ t oʊ/` for all instances of "tomato" in the document (unless they are inside a `<lookup>`).
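
The lookup behaviour described in this section can be modelled roughly as follows. This is a sketch under stated assumptions rather than gruut's actual algorithm: `pronounce` is a hypothetical function, the default lexicon is stored under the key `None`, and falling back to the built-in lexicon when an inline lexicon has no matching entry is an assumption that the text does not confirm.

```python
from typing import Dict, List, Optional, Tuple

# (word, role) -> phonemes
Lexicon = Dict[Tuple[str, Optional[str]], List[str]]


def pronounce(
    word: str,
    role: Optional[str],
    lookup_ref: Optional[str],
    inline: Dict[Optional[str], Lexicon],
    builtin: Lexicon,
) -> Optional[List[str]]:
    """Pick a pronunciation for one word (illustrative model only)."""
    if lookup_ref is not None:
        # Inside <lookup ref="...">: use the lexicon with that id
        first = inline.get(lookup_ref, {})
    else:
        # Outside any <lookup>: use the id-less "default" lexicon, if present
        first = inline.get(None, {})

    # ASSUMPTION: fall back to the built-in lexicon when nothing matches
    for lexicon in (first, builtin):
        if (word, role) in lexicon:
            return lexicon[(word, role)]
    return None


inline = {
    None: {("tomato", None): "t ə m ˈɑ t oʊ".split()},
    "test": {("tomato", None): "t ə m ˈi t oʊ".split()},
}
builtin = {("tomato", None): "t ə m ˈeɪ t oʊ".split()}

print(pronounce("tomato", None, None, inline, builtin))
print(pronounce("tomato", None, "test", inline, builtin))
```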


---

## Text to Speech Models
2 changes: 1 addition & 1 deletion larynx/VERSION
@@ -1 +1 @@
1.0.3
1.1.0
2 changes: 1 addition & 1 deletion larynx/__init__.py
@@ -8,8 +8,8 @@
import gruut
import numpy as np
import onnxruntime

import phonemes2ids

from larynx.audio import AudioSettings
from larynx.constants import (
    InferenceBackend,
32 changes: 31 additions & 1 deletion larynx/__main__.py
@@ -42,6 +42,19 @@ class OutputNaming(str, Enum):
    ID = "id"


class StdinFormat(str, Enum):
    """Format of standard input"""

    AUTO = "auto"
    """Choose based on SSML state"""

    LINES = "lines"
    """Each line is a separate sentence/document"""

    DOCUMENT = "document"
    """Entire input is one document"""


# -----------------------------------------------------------------------------


@@ -170,7 +183,18 @@ def main():
        texts = args.text
    else:
        # Use stdin
        texts = sys.stdin
        stdin_format = StdinFormat.LINES

        if (args.stdin_format == StdinFormat.AUTO) and args.ssml:
            # Assume SSML input is entire document
            stdin_format = StdinFormat.DOCUMENT

        if stdin_format == StdinFormat.DOCUMENT:
            # One big line
            texts = [sys.stdin.read()]
        else:
            # Multiple lines
            texts = sys.stdin

        if os.isatty(sys.stdin.fileno()):
            print("Reading text from stdin...", file=sys.stderr)
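
In the hunk as shown, `DOCUMENT` is only selected when the format is `auto` and `--ssml` is set, so an explicit `--stdin-format document` without `--ssml` appears to fall through to line-by-line reading. The following standalone sketch shows one possible variant in which an explicit choice is always honoured. It is not the committed code: `resolve_stdin_format` and `read_texts` are hypothetical helpers, and the enum is redeclared so the example runs on its own.

```python
import io
from enum import Enum
from typing import Iterable, TextIO


class StdinFormat(str, Enum):
    """Format of standard input (redeclared for a self-contained example)"""

    AUTO = "auto"
    LINES = "lines"
    DOCUMENT = "document"


def resolve_stdin_format(requested: StdinFormat, ssml: bool) -> StdinFormat:
    """Turn AUTO into a concrete format; keep explicit choices unchanged."""
    if requested == StdinFormat.AUTO:
        # SSML input is assumed to be one document, plain text to be lines
        return StdinFormat.DOCUMENT if ssml else StdinFormat.LINES
    return requested


def read_texts(stream: TextIO, stdin_format: StdinFormat) -> Iterable[str]:
    """Return the texts to synthesize from an input stream."""
    if stdin_format == StdinFormat.DOCUMENT:
        # Entire input is a single text
        return [stream.read()]
    # Each line is a separate text (file objects iterate over lines)
    return stream


stream = io.StringIO("<speak>\n<s>Hello</s>\n</speak>\n")
texts = list(read_texts(stream, resolve_stdin_format(StdinFormat.AUTO, ssml=True)))
print(len(texts))
```

Separating the decision from the reading keeps the `auto` rule testable without touching `sys.stdin`.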
@@ -417,6 +441,12 @@ def get_args():
    parser.add_argument(
        "text", nargs="*", help="Text to convert to speech (default: stdin)"
    )
    parser.add_argument(
        "--stdin-format",
        choices=[str(v.value) for v in StdinFormat],
        default=StdinFormat.AUTO,
        help="Format of stdin text (default: auto)",
    )
    parser.add_argument(
        "--voice",
        "-v",
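
With string `choices`, the parsed `--stdin-format` value is a plain `str`; comparisons against `StdinFormat` members still work because the enum subclasses `str`. As an alternative sketch, `argparse` can convert the value to the enum directly via `type=`. This is an illustration and not the project's parser: `build_parser`, the `larynx-sketch` program name, and the `store_true` action for `--ssml` are assumptions.

```python
import argparse
from enum import Enum


class StdinFormat(str, Enum):
    """Format of standard input (redeclared for a self-contained example)"""

    AUTO = "auto"
    LINES = "lines"
    DOCUMENT = "document"


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser with only the options relevant here."""
    parser = argparse.ArgumentParser(prog="larynx-sketch")
    parser.add_argument("text", nargs="*", help="Text to convert to speech")
    # ASSUMPTION: --ssml is a boolean flag
    parser.add_argument("--ssml", action="store_true")
    parser.add_argument(
        "--stdin-format",
        type=StdinFormat,           # "lines" -> StdinFormat.LINES
        choices=list(StdinFormat),  # checked after conversion
        default=StdinFormat.AUTO,
        metavar="FORMAT",
        help="Format of stdin text: auto, lines, or document (default: auto)",
    )
    return parser


args = build_parser().parse_args(["--ssml", "--stdin-format", "lines"])
print(args.stdin_format)
```

An unknown value makes `StdinFormat(...)` raise `ValueError`, which `argparse` reports as an invalid argument, so the rest of the program only ever sees enum members.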
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,5 +1,5 @@
dataclasses-json~=0.5.0
gruut[de,es,fr,it,nl,ru,sv,sw]~=2.0.0
gruut[de,es,fr,it,nl,ru,sv,sw]~=2.1.0
numpy>=1.20.0
onnxruntime>=1.6.0,<2.0
phonemes2ids~=1.0.0