refactor phonemizer to operate at Doc level #16

thatbudakguy · 2022-03-18T21:45:38Z

spaCy's general design philosophy is that the Doc owns the data, and Spans and Tokens are just views of it. it makes sense to replicate this for phoneme data, especially to handle cases where the phonemes don't align cleanly to Tokens (for which we could maybe even employ Alignment).


thatbudakguy · 2022-03-21T16:25:23Z

another possibility is leaving out non-phonetic tokens entirely and using an Alignment, so that e.g.:

>>> doc.text
'北冥有魚，其名為鯤。'
>>> doc._.phonemes
'pok meang hjuwX ngjo tshen mjieng sjew kwon'
>>> doc[4].text
'，'
>>> print(doc[4]._.phonemes)
None
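
One way to picture the mapping in the example above is a per-token index into the phoneme sequence, with None for tokens that carry no phonetic content. The sketch below is plain Python, not spaCy's Alignment class; `is_phonetic`, `align_tokens`, `token_phonemes`, and the one-syllable-per-token assumption are all hypothetical and only fit this example sentence.

```python
import unicodedata


def is_phonetic(token: str) -> bool:
    """Treat a token as phonetic unless it is only punctuation or separators."""
    return not all(unicodedata.category(ch)[0] in ("P", "Z") for ch in token)


def align_tokens(tokens, phonemes):
    """Map each token index to an index into `phonemes`, or None.

    Assumes (hypothetically) exactly one phoneme string per phonetic token.
    """
    mapping = []
    next_phoneme = 0
    for token in tokens:
        if is_phonetic(token):
            mapping.append(next_phoneme)
            next_phoneme += 1
        else:
            # Dangling token: no phoneme data to point at.
            mapping.append(None)
    if next_phoneme != len(phonemes):
        raise ValueError("phoneme count does not match phonetic token count")
    return mapping


# One token per character, as in the example above.
tokens = list("北冥有魚，其名為鯤。")
phonemes = "pok meang hjuwX ngjo tshen mjieng sjew kwon".split()
mapping = align_tokens(tokens, phonemes)


def token_phonemes(i):
    """Look up the phonemes for token i through the mapping."""
    j = mapping[i]
    return None if j is None else phonemes[j]
```

With this mapping, `token_phonemes(4)` returns None for the comma, while the doc-level phoneme sequence stays free of placeholder entries.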

thatbudakguy · 2022-04-12T04:17:16Z

Doc._.phon is the tensor representing all the phonological data for the doc
Doc._.phon_ is a string (analogous to doc.text) that passes the phonological data through the configured transcription
Token._.phon is a vector view into the doc phonological data for a single token
Token._.phon_ is a string that transcribes a single token
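
The relationship between these four attributes could look roughly like the following. This is a stand-in built from plain dataclasses rather than spaCy's extension mechanism: `PhonDoc`, `PhonToken`, `offsets`, and the toy `TRANSCRIPTION` table are hypothetical, and a list of integer IDs stands in for the tensor.

```python
from dataclasses import dataclass

# Hypothetical transcription table: phoneme ID -> symbol.
TRANSCRIPTION = {0: "p", 1: "o", 2: "k", 3: "m", 4: "ea", 5: "ng"}


@dataclass
class PhonDoc:
    words: list
    phon: list     # all phoneme IDs for the doc; the doc owns this data
    offsets: list  # (start, end) into `phon` for each token

    def __len__(self):
        return len(self.words)

    def __getitem__(self, i):
        if not 0 <= i < len(self.words):
            raise IndexError(i)
        return PhonToken(self, i)

    def __iter__(self):
        return (PhonToken(self, i) for i in range(len(self.words)))

    @property
    def phon_(self) -> str:
        """Transcribe the whole doc, skipping tokens with no phoneme data."""
        return " ".join(tok.phon_ for tok in self if tok.phon)


@dataclass
class PhonToken:
    doc: PhonDoc
    i: int

    @property
    def phon(self):
        """Slice of the doc-level data for this token.

        A list slice copies; a slice of a real tensor would be a view.
        """
        start, end = self.doc.offsets[self.i]
        return self.doc.phon[start:end]

    @property
    def phon_(self) -> str:
        """Pass this token's phoneme IDs through the transcription table."""
        return "".join(TRANSCRIPTION[p] for p in self.phon)


doc = PhonDoc(words=["北", "冥"], phon=[0, 1, 2, 3, 4, 5], offsets=[(0, 3), (3, 6)])
```

Here `doc[1].phon` is `[3, 4, 5]` and `doc[1].phon_` is `"meang"`; the token stores only a reference to the doc and its own index.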

thatbudakguy · 2022-04-12T17:15:01Z

might need to subclass Alignment to allow for null/dangling tokens, if we care about that. docs say:

The current implementation of the alignment algorithm assumes that both tokenizations add up to the same string. For example, you’ll be able to align ["I", "'", "m"] and ["I", "'m"], which both add up to "I'm", but not ["I", "'m"] and ["I", "am"].
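
The constraint quoted above can be illustrated with a small character-based alignment. This is a simplified sketch, not spaCy's algorithm; `char_owner` and `align` are hypothetical names. It refuses any pair of tokenizations that do not concatenate to the same string, which is exactly the case a null/dangling token would create.

```python
def char_owner(tokens):
    """For each character position, the index of the token that contains it."""
    return [i for i, tok in enumerate(tokens) for _ in tok]


def align(a, b):
    """Map each token in `a` to the indices of the overlapping tokens in `b`."""
    if "".join(a) != "".join(b):
        raise ValueError("tokenizations do not add up to the same string")
    owner_b = char_owner(b)
    a2b = []
    pos = 0
    for tok in a:
        # Collect every b-token touched by this a-token's character span.
        a2b.append(sorted(set(owner_b[pos:pos + len(tok)])))
        pos += len(tok)
    return a2b


ok = align(["I", "'", "m"], ["I", "'m"])

try:
    align(["I", "'m"], ["I", "am"])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

The first call succeeds because both sides spell "I'm"; the second raises because "I'm" and "Iam" differ, so there is no shared character sequence to align on.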

thatbudakguy · 2022-04-16T18:12:35Z

This was referenced Apr 11, 2022

add phonology module #24

Open

support pretrained seq2seq models for Phonemizer #12

Open

thatbudakguy added the enhancement New feature or request label Apr 11, 2022


phonologizer

training

tokens

data (och-g2p)
