Combining accents #147

Closed
mnater opened this issue Oct 13, 2020 · 0 comments

mnater commented Oct 13, 2020

Playing around with Intl.Segmenter (#145) revealed an issue with combining accents.
E.g. the word "БЕЛАРУ́СКАЯ" in test6.html contains the letter "У" followed by a COMBINING ACUTE ACCENT (◌́, U+0301).

The current implementation finds words by first matching runs of consecutive characters that have the Unicode property escape \p{Letter}, and then checks whether the match contains a character that is not in the alphabet defined by the .wasm file of the respective language.
Combining accents are not part of \p{Letter}, and there is no precomposed (normalized) character for "У" with an acute accent. So the current implementation finds "БЕЛАР" and hyphenates just this part of the word. This could lead to hyphenation errors.
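
To make the problem reproducible outside the library, here is a minimal sketch. It assumes a simplified stand-in pattern (`\p{Letter}+`), not the actual regex used in the code, so the exact fragments it reports may differ from what the library finds. The point is only that the match breaks at the combining mark and that NFC normalization does not help for this word.

```javascript
// "У" followed by U+0301 COMBINING ACUTE ACCENT
const word = "БЕЛАРУ\u0301СКАЯ";

// Simplified stand-in for the word-finding regex (an assumption for this sketch).
const lettersOnly = /\p{Letter}+/gu;
const fragments = word.match(lettersOnly);

// The accent is a nonspacing mark (General_Category Mn), not a letter,
// so the run of letters ends in front of it and the word is split.
const accentIsLetter = /^\p{Letter}$/u.test("\u0301");
const accentIsMark = /^\p{Mn}$/u.test("\u0301");

// NFC cannot compose "У" + acute into a single code point,
// so normalizing before matching leaves the string unchanged.
const unchangedByNfc = word.normalize("NFC") === word;

console.log(fragments, accentIsLetter, accentIsMark, unchangedByNfc);
```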

How to solve (ideas):

  • Include \p{Mn} (or a subset) in the regex at line 562 (-> the accent is then found as part of the word but is not in the alphabet, so don't hyphenate this word at all)
  • Also include \p{Mn} in the "alphabet" but omit the marks while hyphenating
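
Both ideas could be sketched roughly as follows. Everything here is hypothetical and only for illustration: `wordRegex`, `alphabet`, `containsForeignChar`, `splitMarks` and `hyphenateIgnoringMarks` are made-up names, the alphabet is a hand-written set rather than the one read from the .wasm file, and `hyphenate` is a callback that is assumed to only insert soft hyphens into the word it is given.

```javascript
const SHY = "\u00AD"; // soft hyphen

// Idea 1: let the word pattern also accept nonspacing marks,
// so the accented word is found as one unit.
const wordRegex = /[\p{Letter}\p{Mn}]+/gu;

// Hypothetical alphabet for this sketch (not taken from a .wasm file).
const alphabet = new Set("абвгдеёжзійклмнопрстуўфхцчшыьэюя");

// True if the word contains a character outside the alphabet;
// such a word would be left unhyphenated.
function containsForeignChar(word, letters) {
  for (const ch of word.toLowerCase()) {
    if (!letters.has(ch)) return true;
  }
  return false;
}

// Idea 2: separate base characters from the marks that follow them.
function splitMarks(word) {
  const bases = [];
  const marks = [];
  for (const ch of word) {
    if (/\p{Mn}/u.test(ch) && bases.length > 0) {
      marks[marks.length - 1] += ch; // attach mark to the preceding base
    } else {
      bases.push(ch);
      marks.push("");
    }
  }
  return { bases, marks };
}

// Hyphenate the word without its marks, then put the marks back
// directly after their base characters.
function hyphenateIgnoringMarks(word, hyphenate) {
  const { bases, marks } = splitMarks(word);
  const hyphenated = hyphenate(bases.join(""));
  let out = "";
  let i = 0;
  for (const ch of hyphenated) {
    if (ch === SHY) {
      out += ch;
    } else {
      out += ch + marks[i];
      i += 1;
    }
  }
  return out;
}
```

With the first idea the accented word is matched whole and then rejected by the alphabet check. With the second idea the marks never reach the hyphenation step, at the cost of the extra bookkeeping needed to restore them afterwards.
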
@mnater mnater added the bug label Oct 13, 2020
@mnater mnater self-assigned this Oct 13, 2020
@mnater mnater closed this as completed in 75045f8 Nov 5, 2020