Frequently asked questions

How to treat word-initial or word-final graphemes in a special way

Although it may seem a bit hacky, treating word-initial or word-final graphemes differently is straightforward. We mark word boundaries with the common regular expression anchors ^ (start) and $ (end) and list these markers as graphemes in the profile.

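  1. Create the orthography profile:
>>> from segments.tokenizer import Profile
>>> prf = Profile(
...     {'Grapheme': 'c', 'IPA': 'c'},
...     {'Grapheme': 'th', 'IPA': 'tH'},  # needed for the 'tH' outputs shown in this FAQ
...     {'Grapheme': '^', 'IPA': 'NULL'},  # a bare start marker produces no output
...     {'Grapheme': '$', 'IPA': 'NULL'},  # a bare end marker produces no output
...     {'Grapheme': 'a', 'IPA': 'b'},
...     {'Grapheme': '^a', 'IPA': 'A'})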

Note: Word-initial a is treated differently! The ^a entry maps it to A, while a in any other position maps to b.

  2. Create the tokenizer:
>>> from segments.tokenizer import Tokenizer
>>> t = Tokenizer(prf)
>>> t('tha', 'IPA')
'tH b'
>>> t('ath', 'IPA')  # no ^ marker, so the initial a maps to b
'b tH'
>>> t('^ath', 'IPA')  # the ^a entry matches, so the initial a maps to A
'A tH'
  3. Make sure to pass properly marked-up words to the tokenizer:
>>> t(' '.join('^' + s + '$' for s in 'tha ath'.split()), 'IPA')
'tH b # A tH'
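
To see why the markers have this effect, here is a minimal standard-library sketch. It is not how segments is implemented; it assumes greedy longest-match segmentation, which is enough to reproduce the outputs above. The names PROFILE, mark_words, toy_transform and toy_tokenize are hypothetical and exist only for this illustration, and None stands in for NULL.

```python
# Toy stand-in for the profile above (assumption: None plays the role of NULL,
# i.e. the grapheme is matched but contributes nothing to the output).
PROFILE = {
    'th': 'tH',
    '^': None,
    '$': None,
    'a': 'b',
    '^a': 'A',
}


def mark_words(text):
    """Wrap every whitespace-separated word in ^ and $ boundary markers."""
    return ' '.join('^' + word + '$' for word in text.split())


def toy_transform(word, profile=PROFILE):
    """Segment one word by greedy longest match and map it through the profile."""
    longest = max(len(grapheme) for grapheme in profile)
    out, i = [], 0
    while i < len(word):
        # Try the longest candidate first, so '^a' wins over '^' followed by 'a'.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in profile:
                if profile[chunk] is not None:
                    out.append(profile[chunk])
                i += size
                break
        else:
            out.append('?')  # unknown character; placeholder chosen for this sketch
            i += 1
    return ' '.join(out)


def toy_tokenize(text, profile=PROFILE):
    """Transform each word and join the results with ' # '."""
    return ' # '.join(toy_transform(word, profile) for word in text.split())


print(mark_words('tha ath'))                # ^tha$ ^ath$
print(toy_tokenize(mark_words('tha ath')))  # tH b # A tH
print(toy_tokenize('ath'))                  # b tH
```

Because the longest candidate is tried first, ^a is consumed as one grapheme whenever a word starts with a, and the bare ^ and $ entries only absorb markers that are not part of a longer grapheme. A helper like mark_words can also be used to prepare input for the real tokenizer, as in step 3.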