Read entire stdin with --ssml, gruut bumped to 2.1

rhasspy · Nov 10, 2021 · commit 44ecac6
1 parent e03cf6c

Showing 11 changed files with 166 additions and 31 deletions.
diff --git a/CHANGELOG b/CHANGELOG
@@ -1,3 +1,10 @@
+## [1.1.0]
+
+### Changed
+
+- With --ssml, input from stdin is assumed to be one document instead of lines (override with --stdin-format lines)
+- Bump gruut to version 2.1 for inline lexicons
+
 ## [1.0.0] - 20 Oct 2021
 
 ### Added

diff --git a/README.md b/README.md
@@ -143,33 +143,6 @@ Voices and vocoders are automatically downloaded when used on the command-line o
 
 ---
 
-## SSML
-
-A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported (use `--ssml`):
-
-* `<speak>` - wrap around SSML text
-    * `lang` - set language for document
-* `<s>` - sentence (disables automatic sentence breaking)
-    * `lang` - set language for sentence
-* `<w>` / `<token>` - word (disables automatic tokenization)
-* `<voice name="...">` - set voice of inner text
-    * `voice` - name or language of voice
-* `<say-as interpret-as="">` - force interpretation of inner text
-    * `interpret-as` one of "spell-out", "date", "number", "time", or "currency"
-    * `format` - way to format text depending on `interpret-as`
-        * number - one of "cardinal", "ordinal", "digits", "year"
-        * date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
-* `<break time="">` - Pause for given amount of time
-    * time - seconds ("123s") or milliseconds ("123ms")
-* `<mark name="">` - User-defined mark (written to `--mark-file` or part of `TextToSpeechResult`)
-    * name - name of mark
-* `<sub alias="">` - substitute `alias` for inner text
-* `<phoneme ph="...">` - supply phonemes for inner text
-    * `ph` - phonemes for each word of inner text, separated by whitespace
-    * `alphabet` - if "ipa", phonemes are intelligently split ("aːˈb" -> "aː", "ˈb")
-
----
-
 ## Command-Line Interface
 
 Larynx has a flexible command-line interface, available with:
@@ -314,6 +287,131 @@ You can specify the vocoder quality by adding `;<QUALITY>` to the MaryTTS voice
 
 For example: `en;low` will use the lowest quality (but fastest) vocoder. This is usually necessary to get decent performance on a Raspberry Pi.
 
+---
+
+## SSML
+
+A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported (use `--ssml`):
+
+* `<speak>` - wrap around SSML text
+    * `lang` - set language for document
+* `<s>` - sentence (disables automatic sentence breaking)
+    * `lang` - set language for sentence
+* `<w>` / `<token>` - word (disables automatic tokenization)
+* `<voice name="...">` - set voice of inner text
+    * `name` - name or language of voice
+* `<say-as interpret-as="">` - force interpretation of inner text
+    * `interpret-as` one of "spell-out", "date", "number", "time", or "currency"
+    * `format` - way to format text depending on `interpret-as`
+        * number - one of "cardinal", "ordinal", "digits", "year"
+        * date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
+* `<break time="">` - Pause for given amount of time
+    * time - seconds ("123s") or milliseconds ("123ms")
+* `<mark name="">` - User-defined mark (written to `--mark-file` or part of `TextToSpeechResult`)
+    * name - name of mark
+* `<sub alias="">` - substitute `alias` for inner text
+* `<phoneme ph="...">` - supply phonemes for inner text
+    * `ph` - phonemes for each word of inner text, separated by whitespace
+* `<lexicon id="...">` - inline pronunciation lexicon
+    * `id` - unique id of lexicon (used in `<lookup ref="...">`)
+    * One or more `<lexeme>` child elements with:
+        * `<grapheme role="...">WORD</grapheme>` - word text (optional [role](#word-roles))
+        * `<phoneme>P H O N E M E S</phoneme>` - word pronunciation (phonemes separated by whitespace)
+* `<lookup ref="...">` - use inline pronunciation lexicon for child elements
+    * `ref` - id from a `<lexicon id="...">`
+
+### Word Roles
+
+During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part-of-speech tag as `gruut:<TAG>`. For initialisms and `spell-out`, the role `gruut:letter` is used to indicate that, for example, "a" should be spoken as `/eɪ/` instead of `/ə/`.
+
+For `en-us`, the following additional roles are available from the part-of-speech tagger:
+
+* `gruut:CD` - number
+* `gruut:DT` - determiner
+* `gruut:IN` - preposition or subordinating conjunction 
+* `gruut:JJ` - adjective
+* `gruut:NN` - noun
+* `gruut:PRP` - personal pronoun
+* `gruut:RB` - adverb
+* `gruut:VB` - verb
+* `gruut:VBD` - verb (past tense)
+
+### Inline Lexicons
+
+Inline [pronunciation lexicons](https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/) are supported via the `<lexicon>` and `<lookup>` tags. gruut diverges slightly from the [SSML standard](https://www.w3.org/TR/speech-synthesis11/) here by only allowing lexicons to be defined within the SSML document itself. Additionally, the `id` attribute of the `<lexicon>` element can be left off to indicate a "default" inline lexicon that does not require a corresponding `<lookup>` tag.
+
+For example, the following document will yield three different pronunciations for the word "tomato":
+
+``` xml
+<?xml version="1.0"?>
+<speak version="1.1"
+       xmlns="http://www.w3.org/2001/10/synthesis"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+       xml:lang="en-US">
+
+  <lexicon xml:id="test" alphabet="ipa">
+    <lexeme>
+      <grapheme>
+        tomato
+      </grapheme>
+      <phoneme>
+        <!-- Individual phonemes are separated by whitespace -->
+        t ə m ˈɑ t oʊ
+      </phoneme>
+    </lexeme>
+    <lexeme>
+      <grapheme role="fake-role">
+        tomato
+      </grapheme>
+      <phoneme>
+        <!-- Made up pronunciation for fake word role -->
+        t ə m ˈi t oʊ
+      </phoneme>
+    </lexeme>
+  </lexicon>
+
+  <w>tomato</w>
+  <lookup ref="test">
+    <w>tomato</w>
+    <w role="fake-role">tomato</w>
+  </lookup>
+</speak>
+```
+
+The first "tomato" will be looked up in the U.S. English lexicon (`/t ə m ˈeɪ t oʊ/`). Within the `<lookup>` tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a [role](#word-roles) attached (selecting a made-up pronunciation in this case).
+
+Diverging further from the SSML standard, gruut allows you to omit the `<lexicon>` `id` entirely. With no `id`, a `<lookup>` tag is no longer needed, so the lexicon overrides the pronunciation of any matching word in the document:
+
+``` xml
+<?xml version="1.0"?>
+<speak version="1.1"
+       xmlns="http://www.w3.org/2001/10/synthesis"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+       xml:lang="en-US">
+
+  <!-- No id means change all words without a lookup -->
+  <lexicon>
+    <lexeme>
+      <grapheme>
+        tomato
+      </grapheme>
+      <phoneme>
+        t ə m ˈɑ t oʊ
+      </phoneme>
+    </lexeme>
+  </lexicon>
+
+  <w>tomato</w>
+</speak>
+```
+
+This will yield a pronunciation of `/t ə m ˈɑ t oʊ/` for all instances of "tomato" in the document (unless they are inside a `<lookup>` that refers to a different lexicon).
+
+
 ---
 
 ## Text to Speech Models
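
Reviewer note on the README hunk above: the `<lexicon>` / `<lookup>` semantics can be illustrated with a small standard-library sketch. This is not gruut's implementation. The helper names `read_inline_lexicons` and `lookup` are hypothetical, and the fallback from a role-specific entry to a role-less one is an assumption made for illustration only.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

SSML_NS = "{http://www.w3.org/2001/10/synthesis}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed version of the README example (comments inside <phoneme> removed)
SSML = """<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>tomato</grapheme>
      <phoneme>t ə m ˈɑ t oʊ</phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">tomato</grapheme>
      <phoneme>t ə m ˈi t oʊ</phoneme>
    </lexeme>
  </lexicon>
  <w>tomato</w>
</speak>
"""

LexiconKey = Tuple[str, Optional[str]]  # (word, role)
Lexicons = Dict[Optional[str], Dict[LexiconKey, List[str]]]


def read_inline_lexicons(ssml: str) -> Lexicons:
    """Hypothetical helper: collect inline lexicons keyed by their id.

    A lexicon without an id is stored under None (the "default" lexicon).
    """
    root = ET.fromstring(ssml)
    lexicons: Lexicons = {}
    for lexicon in root.iter(f"{SSML_NS}lexicon"):
        lexicon_id = lexicon.get(XML_ID) or lexicon.get("id")
        entries = lexicons.setdefault(lexicon_id, {})
        for lexeme in lexicon.iter(f"{SSML_NS}lexeme"):
            grapheme = lexeme.find(f"{SSML_NS}grapheme")
            phoneme = lexeme.find(f"{SSML_NS}phoneme")
            if grapheme is None or phoneme is None:
                continue
            word = (grapheme.text or "").strip()
            # Phonemes are separated by whitespace
            entries[(word, grapheme.get("role"))] = (phoneme.text or "").split()
    return lexicons


def lookup(
    lexicons: Lexicons,
    word: str,
    role: Optional[str] = None,
    ref: Optional[str] = None,
) -> Optional[List[str]]:
    """Hypothetical helper: find a pronunciation in the lexicon named by ref.

    Assumption: a role-specific entry wins, then the role-less entry.
    """
    entries = lexicons.get(ref, {})
    return entries.get((word, role)) or entries.get((word, None))


lexicons = read_inline_lexicons(SSML)
print(lookup(lexicons, "tomato", ref="test"))
print(lookup(lexicons, "tomato", role="fake-role", ref="test"))
```

A `None` result here stands for "fall back to the language's built-in lexicon", which is how the first `<w>tomato</w>` in the README example is described.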

diff --git a/larynx/VERSION b/larynx/VERSION
@@ -1 +1 @@
-1.0.3
+1.1.0
diff --git a/larynx/__init__.py b/larynx/__init__.py
@@ -8,8 +8,8 @@
 import gruut
 import numpy as np
 import onnxruntime
-
 import phonemes2ids
+
 from larynx.audio import AudioSettings
 from larynx.constants import (
     InferenceBackend,

diff --git a/larynx/__main__.py b/larynx/__main__.py
@@ -42,6 +42,19 @@ class OutputNaming(str, Enum):
     ID = "id"
 
 
+class StdinFormat(str, Enum):
+    """Format of standard input"""
+
+    AUTO = "auto"
+    """Choose based on SSML state"""
+
+    LINES = "lines"
+    """Each line is a separate sentence/document"""
+
+    DOCUMENT = "document"
+    """Entire input is one document"""
+
+
 # -----------------------------------------------------------------------------
 
 
@@ -170,7 +183,18 @@ def main():
         texts = args.text
     else:
         # Use stdin
-        texts = sys.stdin
+        stdin_format = StdinFormat(args.stdin_format)
+
+        if stdin_format == StdinFormat.AUTO:
+            # Assume SSML input is an entire document, plain text is lines
+            stdin_format = StdinFormat.DOCUMENT if args.ssml else StdinFormat.LINES
+
+        if stdin_format == StdinFormat.DOCUMENT:
+            # Entire input is one document
+            texts = [sys.stdin.read()]
+        else:
+            # One sentence/document per line
+            texts = sys.stdin
 
         if os.isatty(sys.stdin.fileno()):
             print("Reading text from stdin...", file=sys.stderr)
@@ -417,6 +441,12 @@ def get_args():
     parser.add_argument(
         "text", nargs="*", help="Text to convert to speech (default: stdin)"
     )
+    parser.add_argument(
+        "--stdin-format",
+        choices=[str(v.value) for v in StdinFormat],
+        default=StdinFormat.AUTO,
+        help="Format of stdin text (default: auto)",
+    )
     parser.add_argument(
         "--voice",
         "-v",

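Reviewer note on the `larynx/__main__.py` hunks above: the way `--stdin-format` and `--ssml` combine can be sketched as a standalone function. The helpers `resolve_stdin_format` and `read_texts` are hypothetical names, not part of Larynx. In this sketch an explicit `lines` or `document` always wins and only `auto` consults the SSML flag, which is an assumption about the intended behavior.

```python
import io
from enum import Enum
from typing import Iterable, TextIO


class StdinFormat(str, Enum):
    """Format of standard input (mirrors the enum added in this commit)."""

    AUTO = "auto"
    LINES = "lines"
    DOCUMENT = "document"


def resolve_stdin_format(requested: str, ssml: bool) -> StdinFormat:
    """Hypothetical helper: reduce the CLI value to LINES or DOCUMENT."""
    stdin_format = StdinFormat(requested)
    if stdin_format == StdinFormat.AUTO:
        # SSML is assumed to be one document, plain text one item per line
        return StdinFormat.DOCUMENT if ssml else StdinFormat.LINES
    return stdin_format


def read_texts(stream: TextIO, stdin_format: StdinFormat) -> Iterable[str]:
    """Hypothetical helper: yield the texts to synthesize from a stream."""
    if stdin_format == StdinFormat.DOCUMENT:
        # Entire input becomes a single text
        return [stream.read()]
    # Iterating a text stream yields one line at a time
    return stream


ssml_doc = "<speak>\n  <s>Hello</s>\n</speak>\n"
fmt = resolve_stdin_format("auto", ssml=True)
texts = list(read_texts(io.StringIO(ssml_doc), fmt))
print(fmt.value, len(texts))
```

Reading the whole stream matters for SSML because a multi-line `<speak>` document split per line would hand the parser fragments that are not well-formed XML.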
diff --git a/local/en-us/glados-glow_tts/samples/be_a_voice_not_an_echo.wav b/local/en-us/glados-glow_tts/samples/be_a_voice_not_an_echo.wav
diff --git a/local/en-us/glados-glow_tts/samples/im_sorry_dave.wav b/local/en-us/glados-glow_tts/samples/im_sorry_dave.wav
diff --git a/local/en-us/glados-glow_tts/samples/it_took_me_quite_a_long_time_to_develop_a_voice.wav b/local/en-us/glados-glow_tts/samples/it_took_me_quite_a_long_time_to_develop_a_voice.wav
diff --git a/local/en-us/glados-glow_tts/samples/prior_to_november.wav b/local/en-us/glados-glow_tts/samples/prior_to_november.wav
diff --git a/local/en-us/glados-glow_tts/samples/this_cake_is_great.wav b/local/en-us/glados-glow_tts/samples/this_cake_is_great.wav
diff --git a/requirements.txt b/requirements.txt
@@ -1,5 +1,5 @@
 dataclasses-json~=0.5.0
-gruut[de,es,fr,it,nl,ru,sv,sw]~=2.0.0
+gruut[de,es,fr,it,nl,ru,sv,sw]~=2.1.0
 numpy>=1.20.0
 onnxruntime>=1.6.0,<2.0
 phonemes2ids~=1.0.0
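
Reviewer note on the `requirements.txt` hunk above: the `~=` operator is the PEP 440 "compatible release" clause, so `~=2.1.0` means `>=2.1.0` and `==2.1.*`. The sketch below shows why the old `~=2.0.0` pin could not pick up gruut 2.1. The function `is_compatible` is a hypothetical, simplified check that only handles plain dotted numeric versions with at least two segments; it is not how pip evaluates specifiers.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain dotted version such as "2.1.0" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(candidate: str, spec: str) -> bool:
    """Hypothetical check for `candidate ~= spec` on numeric versions only."""
    base = parse_version(spec)
    version = parse_version(candidate)

    # Pad with zeros so that "2.1" compares equal to "2.1.0"
    width = max(len(base), len(version))
    padded_base = base + (0,) * (width - len(base))
    padded_version = version + (0,) * (width - len(version))

    # ~=X.Y.Z means >=X.Y.Z and ==X.Y.* (all but the last segment are fixed)
    prefix = base[:-1]
    return padded_version >= padded_base and padded_version[: len(prefix)] == prefix


for candidate in ("2.0.9", "2.1.0", "2.1.5", "2.2.0"):
    print(candidate, is_compatible(candidate, "2.1.0"))
```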