refactor phonemizer to operate at Doc level #16

Open
thatbudakguy opened this issue Mar 18, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@thatbudakguy
Member

spaCy's general design philosophy is that the Doc owns the data, and Spans and Tokens are just views of it. it makes sense to replicate this here, especially to handle cases where the phoneme data doesn't align cleanly to Tokens (for which we could maybe even use spaCy's Alignment).

@thatbudakguy
Member Author

another possibility is leaving out non-phonetic tokens entirely and using an Alignment, so that e.g.:

>>> doc.text
'北冥有魚,其名為鯤。'
>>> doc._.phonemes
'pok meang hjuwX ngjo tshen mjieng sjew kwon'
>>> doc[4].text
','
>>> print(doc[4]._.phonemes)
None
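
a minimal, spaCy-free sketch of the idea in the example above, assuming one token per character. `token_to_syllable` and `token_phonemes` are hypothetical names that stand in for a real alignment and a real extension attribute; nothing here is the project's actual API.

```python
import unicodedata

text = "北冥有魚,其名為鯤。"
tokens = list(text)  # assumption: one token per character
syllables = "pok meang hjuwX ngjo tshen mjieng sjew kwon".split()

# hypothetical alignment: token index -> syllable index.
# punctuation tokens get no entry, i.e. they are left dangling.
token_to_syllable = {}
next_syllable = 0
for i, tok in enumerate(tokens):
    if unicodedata.category(tok).startswith("P"):
        continue  # non-phonetic token
    token_to_syllable[i] = next_syllable
    next_syllable += 1


def token_phonemes(i):
    """Return the syllable aligned to token i, or None if it has none."""
    j = token_to_syllable.get(i)
    return None if j is None else syllables[j]
```

with this layout the syllable list lives in one place (the doc level), and a token lookup is just an index into it, so `token_phonemes(4)` gives `None` for the comma while `token_phonemes(5)` gives the syllable for 其.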

@thatbudakguy added the enhancement (New feature or request) label on Apr 11, 2022
@thatbudakguy
Member Author

  • Doc._.phon is the tensor representing all the phonological data for the doc
  • Doc._.phon_ is a string (analogous to doc.text) that passes the phonological data through the configured transcription
  • Token._.phon is a vector view into the doc phonological data for a single token
  • Token._.phon_ is a string that transcribes a single token

@thatbudakguy
Member Author

might need to subclass Alignment to allow for null/dangling tokens (tokens with no phoneme data), if we care about that. the spaCy docs say:

The current implementation of the alignment algorithm assumes that both tokenizations add up to the same string. For example, you’ll be able to align ["I", "'", "m"] and ["I", "'m"], which both add up to "I'm", but not ["I", "'m"] and ["I", "am"].
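
to make the quoted constraint concrete, here is a small pure-Python check of the precondition. it is only an illustration: `can_align` and `char_to_token` are hypothetical helpers, not spaCy functions, and this is not how spaCy's alignment algorithm is implemented.

```python
def can_align(tokens_a, tokens_b):
    """Check the quoted precondition: both tokenizations join to the same string."""
    return "".join(tokens_a) == "".join(tokens_b)


def char_to_token(tokens):
    """Map each character offset to the index of the token that covers it."""
    mapping = []
    for i, tok in enumerate(tokens):
        mapping.extend([i] * len(tok))
    return mapping


ok = can_align(["I", "'", "m"], ["I", "'m"])   # both join to "I'm"
bad = can_align(["I", "'m"], ["I", "am"])      # "I'm" vs "Iam"
```

a phoneme string with the punctuation dropped can never join to the same string as the original text, which is why dangling tokens would need something beyond the stock behaviour.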

@thatbudakguy
Member Author

phonologizer

  • make set_annotations just set the annotations on the Doc instead of the Tokens
  • maybe update initialize?

training

  • maybe update get_aligned_phonemes?
  • update example_from_phonemes_dict to generate the correct alignment?

tokens

  • Doc._.phon is the tensor representing all the phonological data for the doc (floats2d). iterating over it yields a single floats1d per syllable (row)
  • Doc._.syllables is an iterator over Syllable objects; one per row in Doc._.phon
  • Doc._.phon_ is a string (analogous to Doc.text) that passes the phonological data through the configured transcription provider
  • Token._.phon is an aligned view into the doc phonological data for a single token (floats2d), which is one or multiple syllables. iterating over it yields a single floats1d per syllable (row)
  • Token._.syllables is an iterator over Syllable objects; one per row in Token._.phon
  • Token._.phon_ is a string (analogous to Token.text) that passes the phonological data through the configured transcription provider
  • Span._.phon is an aligned view into the doc phonological data for a contiguous range of tokens (floats2d), which is multiple syllables
  • Span._.syllables is an iterator over Syllable objects; one per row in Span._.phon
  • Span._.phon_ is a string (analogous to Span.text) that passes the phonological data through the configured transcription provider
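
a rough sketch of how the attributes listed above could hang together, using plain Python lists in place of the floats2d/floats1d arrays. `PhonDoc`, `token_rows`, and the lookup-table "transcription provider" are all assumptions made up for illustration; a real version would register spaCy extension attributes and slice an array (which gives a view, where list slicing copies).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]  # stand-in for a floats1d: one syllable


@dataclass
class PhonDoc:
    phon: List[Row]                    # stand-in for Doc._.phon (one row per syllable)
    token_rows: List[Tuple[int, int]]  # per token: [start, end) rows in phon
    transcribe: Callable[[Row], str]   # stand-in for the transcription provider

    def token_phon(self, i: int) -> List[Row]:
        """Rows for a single token; empty for non-phonetic tokens."""
        start, end = self.token_rows[i]
        return self.phon[start:end]

    def span_phon(self, start: int, end: int) -> List[Row]:
        """Rows for the contiguous token range [start, end)."""
        first_row = self.token_rows[start][0]
        last_row = self.token_rows[end - 1][1]
        return self.phon[first_row:last_row]

    def to_text(self, rows: List[Row]) -> str:
        """Analogue of the phon_ strings: transcribe each syllable row."""
        return " ".join(self.transcribe(row) for row in rows)


# toy data: token 0 has one syllable, token 1 none, token 2 two
table = {(0.0, 1.0): "pok", (1.0, 0.0): "meang", (0.5, 0.5): "hjuwX"}
doc = PhonDoc(
    phon=[[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
    token_rows=[(0, 1), (1, 1), (1, 3)],
    transcribe=lambda row: table[tuple(row)],
)
```

the point of the layout is that tokens and spans store only row offsets, so the doc-level data stays the single source of truth and the string forms are derived on demand.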

data (och-g2p)

  • make sure you are generating valid training data here
