diff --git a/CHANGELOG b/CHANGELOG
index 2c4b527..3e53f5c 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,10 @@
+## [1.1.0]
+
+### Changed
+
+- With --ssml, input from stdin is assumed to be one document instead of lines (override with --stdin-format lines)
+- Bump gruut to version 2.1 for inline lexicons
+
 ## [1.0.0] - 20 Oct 2021
 
 ### Added
diff --git a/README.md b/README.md
index 0a7fdf7..624c7ef 100644
--- a/README.md
+++ b/README.md
@@ -143,33 +143,6 @@ Voices and vocoders are automatically downloaded when used on the command-line o
 
 ---
 
-## SSML
-
-A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported (use `--ssml`):
-
-* `<speak>` - wrap around SSML text
-  * `lang` - set language for document
-* `<s>` - sentence (disables automatic sentence breaking)
-  * `lang` - set language for sentence
-* `<w>` / `<token>` - word (disables automatic tokenization)
-* `<voice>` - set voice of inner text
-  * `voice` - name or language of voice
-* `<say-as>` - force interpretation of inner text
-  * `interpret-as` one of "spell-out", "date", "number", "time", or "currency"
-  * `format` - way to format text depending on `interpret-as`
-    * number - one of "cardinal", "ordinal", "digits", "year"
-    * date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
-* `<break>` - Pause for given amount of time
-  * time - seconds ("123s") or milliseconds ("123ms")
-* `<mark>` - User-defined mark (written to `--mark-file` or part of `TextToSpeechResult`)
-  * name - name of mark
-* `<sub>` - substitute `alias` for inner text
-* `<phoneme>` - supply phonemes for inner text
-  * `ph` - phonemes for each word of inner text, separated by whitespace
-  * `alphabet` - if "ipa", phonemes are intelligently split ("aːˈb" -> "aː", "ˈb")
-
----
-
 ## Command-Line Interface
 
 Larynx has a flexible command-line interface, available with:
@@ -314,6 +287,131 @@ You can specify the vocoder quality by adding `;` to the MaryTTS voice
 
 For example: `en;low` will use the lowest quality (but fastest) vocoder.
 This is usually necessary to get decent performance on a Raspberry Pi.
+---
+
+## SSML
+
+A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported (use `--ssml`):
+
+* `<speak>` - wrap around SSML text
+  * `lang` - set language for document
+* `<s>` - sentence (disables automatic sentence breaking)
+  * `lang` - set language for sentence
+* `<w>` / `<token>` - word (disables automatic tokenization)
+* `<voice>` - set voice of inner text
+  * `voice` - name or language of voice
+* `<say-as>` - force interpretation of inner text
+  * `interpret-as` one of "spell-out", "date", "number", "time", or "currency"
+  * `format` - way to format text depending on `interpret-as`
+    * number - one of "cardinal", "ordinal", "digits", "year"
+    * date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
+* `<break>` - Pause for given amount of time
+  * time - seconds ("123s") or milliseconds ("123ms")
+* `<mark>` - User-defined mark (written to `--mark-file` or part of `TextToSpeechResult`)
+  * name - name of mark
+* `<sub>` - substitute `alias` for inner text
+* `<phoneme>` - supply phonemes for inner text
+  * `ph` - phonemes for each word of inner text, separated by whitespace
+* `<lexicon>` - inline pronunciation lexicon
+  * `id` - unique id of lexicon (used in `<lookup ref="...">`)
+  * One or more `<lexeme>` child elements with:
+    * `<grapheme role="...">WORD</grapheme>` - word text (optional [role](#word-roles))
+    * `<phoneme>P H O N E M E S</phoneme>` - word pronunciation (phonemes separated by whitespace)
+* `<lookup ref="...">` - use inline pronunciation lexicon for child elements
+  * `ref` - id from a `<lexicon id="...">`
+
+### Word Roles
+
+During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as `gruut:<TAG>`. For initialisms and `spell-out`, the role `gruut:letter` is used to indicate that e.g., "a" should be spoken as `/eɪ/` instead of `/ə/`.
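+
+For `en-us`, the following additional roles are available from the part-of-speech tagger:
+
+* `gruut:CD` - number
+* `gruut:DT` - determiner
+* `gruut:IN` - preposition or subordinating conjunction
+* `gruut:JJ` - adjective
+* `gruut:NN` - noun
+* `gruut:PRP` - personal pronoun
+* `gruut:RB` - adverb
+* `gruut:VB` - verb
+* `gruut:VBD` - verb (past tense)
+
+### Inline Lexicons
+
+Inline [pronunciation lexicons](https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/) are supported via the `<lexicon>` and `<lookup>` tags. gruut diverges slightly from the [SSML standard](https://www.w3.org/TR/speech-synthesis11/) here by only allowing lexicons to be defined within the SSML document itself. Additionally, the `id` attribute of the `<lexicon>` element can be left off to indicate a "default" inline lexicon that does not require a corresponding `<lookup>` tag.
+
+For example, the following document will yield three different pronunciations for the word "tomato":
+
+``` xml
+<?xml version="1.0"?>
+<speak version="1.1"
+       xmlns="http://www.w3.org/2001/10/synthesis"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+       xml:lang="en-US">
+
+  <lexicon xml:id="test" alphabet="ipa">
+    <lexeme>
+      <grapheme>
+        tomato
+      </grapheme>
+      <phoneme>
+        <!-- Individual phonemes are separated by whitespace -->
+        t ə m ˈɑ t oʊ
+      </phoneme>
+    </lexeme>
+    <lexeme>
+      <grapheme role="fake-role">
+        tomato
+      </grapheme>
+      <phoneme>
+        <!-- Made up pronunciation for fake word role -->
+        t ə m ˈi t oʊ
+      </phoneme>
+    </lexeme>
+  </lexicon>
+
+  <w>tomato</w>
+  <lookup ref="test">
+    <w>tomato</w>
+    <w role="fake-role">tomato</w>
+  </lookup>
+</speak>
+```
+
+The first "tomato" will be looked up in the U.S. English lexicon (`/t ə m ˈeɪ t oʊ/`). Within the `<lookup>` tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a [role](#word-roles) attached (selecting a made up pronunciation in this case).
+
+Even further from the SSML standard, gruut allows you to leave off the `<lexicon>` id entirely. With no `id`, a `<lookup>` tag is no longer needed, allowing you to override the pronunciation of any word in the document:
+
+``` xml
+<?xml version="1.0"?>
+<speak version="1.1"
+       xmlns="http://www.w3.org/2001/10/synthesis"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+       xml:lang="en-US">
+
+  <!-- No id means change defaults -->
+  <lexicon>
+    <lexeme>
+      <grapheme>
+        tomato
+      </grapheme>
+      <phoneme>
+        t ə m ˈɑ t oʊ
+      </phoneme>
+    </lexeme>
+  </lexicon>
+
+  <w>tomato</w>
+</speak>
+```
+
+This will yield a pronunciation of `/t ə m ˈɑ t oʊ/` for all instances of "tomato" in the document (unless they have a `<lookup>`).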
+
+
 ---
 
 ## Text to Speech Models
diff --git a/larynx/VERSION b/larynx/VERSION
index 21e8796..9084fa2 100644
--- a/larynx/VERSION
+++ b/larynx/VERSION
@@ -1 +1 @@
-1.0.3
+1.1.0
diff --git a/larynx/__init__.py b/larynx/__init__.py
index e272ae1..4e22007 100644
--- a/larynx/__init__.py
+++ b/larynx/__init__.py
@@ -8,8 +8,8 @@
 import gruut
 import numpy as np
 import onnxruntime
-
 import phonemes2ids
+
 from larynx.audio import AudioSettings
 from larynx.constants import (
     InferenceBackend,
diff --git a/larynx/__main__.py b/larynx/__main__.py
index 7bc3e87..96d7875 100644
--- a/larynx/__main__.py
+++ b/larynx/__main__.py
@@ -42,6 +42,19 @@ class OutputNaming(str, Enum):
     ID = "id"
 
 
+class StdinFormat(str, Enum):
+    """Format of standard input"""
+
+    AUTO = "auto"
+    """Choose based on SSML state"""
+
+    LINES = "lines"
+    """Each line is a separate sentence/document"""
+
+    DOCUMENT = "document"
+    """Entire input is one document"""
+
+
 # -----------------------------------------------------------------------------
 
 
@@ -170,7 +183,18 @@ def main():
         texts = args.text
     else:
         # Use stdin
-        texts = sys.stdin
+        stdin_format = StdinFormat.LINES
+
+        if (args.stdin_format == StdinFormat.AUTO) and args.ssml:
+            # Assume SSML input is entire document
+            stdin_format = StdinFormat.DOCUMENT
+
+        if stdin_format == StdinFormat.DOCUMENT:
+            # One big line
+            texts = [sys.stdin.read()]
+        else:
+            # Multiple lines
+            texts = sys.stdin
 
         if os.isatty(sys.stdin.fileno()):
             print("Reading text from stdin...", file=sys.stderr)
@@ -417,6 +441,12 @@ def get_args():
     parser.add_argument(
         "text", nargs="*", help="Text to convert to speech (default: stdin)"
     )
+    parser.add_argument(
+        "--stdin-format",
+        choices=[str(v.value) for v in StdinFormat],
+        default=StdinFormat.AUTO,
+        help="Format of stdin text (default: auto)",
+    )
     parser.add_argument(
         "--voice",
         "-v",
diff --git a/local/en-us/glados-glow_tts/samples/be_a_voice_not_an_echo.wav b/local/en-us/glados-glow_tts/samples/be_a_voice_not_an_echo.wav
new file mode 100644
index 0000000..d450159
Binary files /dev/null and b/local/en-us/glados-glow_tts/samples/be_a_voice_not_an_echo.wav differ
diff --git a/local/en-us/glados-glow_tts/samples/im_sorry_dave.wav b/local/en-us/glados-glow_tts/samples/im_sorry_dave.wav
new file mode 100644
index 0000000..877a827
Binary files /dev/null and b/local/en-us/glados-glow_tts/samples/im_sorry_dave.wav differ
diff --git a/local/en-us/glados-glow_tts/samples/it_took_me_quite_a_long_time_to_develop_a_voice.wav b/local/en-us/glados-glow_tts/samples/it_took_me_quite_a_long_time_to_develop_a_voice.wav
new file mode 100644
index 0000000..258409a
Binary files /dev/null and b/local/en-us/glados-glow_tts/samples/it_took_me_quite_a_long_time_to_develop_a_voice.wav differ
diff --git a/local/en-us/glados-glow_tts/samples/prior_to_november.wav b/local/en-us/glados-glow_tts/samples/prior_to_november.wav
new file mode 100644
index 0000000..da19b17
Binary files /dev/null and b/local/en-us/glados-glow_tts/samples/prior_to_november.wav differ
diff --git a/local/en-us/glados-glow_tts/samples/this_cake_is_great.wav b/local/en-us/glados-glow_tts/samples/this_cake_is_great.wav
new file mode 100644
index 0000000..1611777
Binary files /dev/null and b/local/en-us/glados-glow_tts/samples/this_cake_is_great.wav differ
diff --git a/requirements.txt b/requirements.txt
index 919f62a..9b091c4 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,5 +1,5 @@
 dataclasses-json~=0.5.0
-gruut[de,es,fr,it,nl,ru,sv,sw]~=2.0.0
+gruut[de,es,fr,it,nl,ru,sv,sw]~=2.1.0
 numpy>=1.20.0
 onnxruntime>=1.6.0,<2.0
 phonemes2ids~=1.0.0
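
Note on the `main()` hunk in `larynx/__main__.py`: as written, `stdin_format` only leaves its `lines` default when `--stdin-format` is `auto` and `--ssml` is set, so an explicit `--stdin-format document` appears to be read line by line. The sketch below shows one way the choice could be resolved so that explicit values are honored. It is an illustration under stated assumptions, not the project's code: `resolve_stdin_format` and `read_texts` are hypothetical helper names, and only the `StdinFormat` values are taken from the patch.

```python
import io
from enum import Enum
from typing import Iterable, TextIO, Union


class StdinFormat(str, Enum):
    """Format of standard input (values copied from the patch)."""

    AUTO = "auto"
    LINES = "lines"
    DOCUMENT = "document"


def resolve_stdin_format(
    requested: Union[str, StdinFormat], ssml: bool
) -> StdinFormat:
    """Hypothetical helper: pick the effective stdin format.

    "auto" becomes "document" for SSML input and "lines" otherwise.
    An explicit "lines" or "document" is returned unchanged.
    """
    fmt = StdinFormat(requested)
    if fmt == StdinFormat.AUTO:
        return StdinFormat.DOCUMENT if ssml else StdinFormat.LINES
    return fmt


def read_texts(stream: TextIO, fmt: StdinFormat) -> Iterable[str]:
    """Hypothetical helper: return the texts to synthesize from a stream."""
    if fmt == StdinFormat.DOCUMENT:
        # Entire input is one document
        return [stream.read()]
    # Each line is a separate text; the stream iterates lazily by line
    return stream


if __name__ == "__main__":
    fmt = resolve_stdin_format("auto", ssml=True)
    texts = read_texts(io.StringIO("<speak>\n  hello\n</speak>\n"), fmt)
    print(fmt.value, len(list(texts)))
```

Keeping the decision in a small pure function makes the four combinations of `--ssml` and `--stdin-format` easy to unit test without touching `sys.stdin`.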