Combining accents #147

mnater · 2020-10-13T20:58:53Z

Playing around with Intl.Segmenter (#145) revealed an issue with combining accents.
E.g. the word "БЕЛАРУ́СКАЯ" in test6.html contains the letter "У" followed by a "COMBINING ACUTE ACCENT" (◌́).

The current implementation for finding words first looks for runs of consecutive characters matching the Unicode property escape \p{Letter} and then checks whether the run contains a character that is not in the alphabet defined by the .wasm file of the respective language.
Since combining accents are not part of \p{Letter} and there is no precomposed (normalized) character for this combination, the current implementation finds "БЕЛАР" and hyphenates just this part of the word. This could lead to errors.
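
The effect can be reproduced in isolation. The snippet below is a minimal sketch that uses a plain \p{Letter}+ run as a stand-in for the word-finding step; it is not the regex from the project, so the exact fragment it finds may differ from the one reported above. It only illustrates that the match ends at the accent and that NFC normalization does not help here.

```javascript
// Simplified stand-in for the word-finding step (assumption: NOT the project's regex).
const word = "БЕЛАРУ\u0301СКАЯ"; // "У" followed by U+0301 COMBINING ACUTE ACCENT

// U+0301 has General_Category Mn (Nonspacing_Mark), so it is not a Letter.
const accentIsLetter = /^\p{Letter}$/u.test("\u0301");
const accentIsMark = /^\p{Mn}$/u.test("\u0301");

// A run of letters therefore ends at the accent and the word falls apart.
const fragments = word.match(/\p{Letter}+/gu);

// NFC cannot repair this: Unicode has no precomposed Cyrillic "У" with acute,
// so the normalized string still contains the separate combining mark.
const nfcUnchanged = word.normalize("NFC") === word;

console.log({ accentIsLetter, accentIsMark, fragments, nfcUnchanged });
```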

How to solve (ideas):

Include \p{Mn} (or a subset) in the regex at line 562 (-> the word is not hyphenated at all)
Also include \p{Mn} in the "alphabet" but omit the marks while hyphenating
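
Both ideas can be sketched roughly as follows. This is an illustration under stated assumptions, not the fix that was eventually committed: WORD_RE, alphabet, isHyphenatable and hyphenateIgnoringMarks are hypothetical names, the alphabet is a hand-written stand-in for the one read from the .wasm file, and the hyphenate callback is assumed to do nothing but insert soft hyphens.

```javascript
// Hypothetical sketch; none of these names come from the project.

// Idea 1: let the word regex also accept nonspacing marks, so the whole
// word is found as one unit.
const WORD_RE = /[\p{Letter}\p{Mn}]+/gu;

// Assumed stand-in for the alphabet that the language's .wasm file defines.
const alphabet = new Set("абвгдежзійклмнопрстуўфхцчшыьэюяё");

// With the marks inside the match but not in the alphabet, a word that
// carries an accent fails this check and is left unhyphenated.
function isHyphenatable(word) {
  return [...word.toLowerCase()].every((ch) => alphabet.has(ch));
}

// Idea 2: hide the marks from the hyphenator and put them back afterwards.
// Assumes hyphenate(str) only inserts hyphenChar and changes nothing else.
function hyphenateIgnoringMarks(word, hyphenate, hyphenChar = "\u00AD") {
  const chars = [...word];
  const isMark = (ch) => /^\p{Mn}$/u.test(ch);
  const stripped = chars.filter((ch) => !isMark(ch)).join("");
  let out = "";
  let i = 0; // position in the original (accented) word
  const copyMarks = () => {
    while (i < chars.length && isMark(chars[i])) out += chars[i++];
  };
  copyMarks(); // marks before the first letter, if any
  for (const ch of hyphenate(stripped)) {
    if (ch === hyphenChar) {
      out += ch;
      continue;
    }
    out += chars[i++]; // the original letter
    copyMarks(); // followed by the marks attached to it
  }
  return out;
}
```

The first variant is the conservative one: an accented word is simply skipped. The second keeps the word hyphenatable, at the cost of an extra pass that maps the hyphenation points back onto the accented string.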


mnater added the bug label Oct 13, 2020

mnater self-assigned this Oct 13, 2020

mnater closed this as completed in 75045f8 Nov 5, 2020
