properties: add "ambiwidth" property for ambiguous East Asian Width #270

Merged
merged 1 commit into from
Aug 30, 2024

Contributor

@bfredl bfredl commented Aug 12, 2024

Some characters have their width defined as "Ambiguous" in UAX#11. These are typically rendered as single-width by modern monospace fonts, and utf8proc correctly returns charwidth==1 for these.

However, some applications might need to support older CJK fonts where two-byte characters in legacy encodings were rendered as double-width. An example of this is the 'ambiwidth' option of vim and neovim, which supports rendering in terminals using such width rules.

Add an 'ambiwidth' property to utf8proc_property_t for such characters, by using a previously unused padding bit.
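For readers unfamiliar with the padding-bit approach, here is a minimal sketch of the idea. The two structs are hypothetical, heavily reduced stand-ins and do not reproduce the real utf8proc_property_t layout; every field name and width here is an assumption made for illustration. The point is only that carving a one-bit flag out of spare padding normally leaves the struct size unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "before" layout: a 2-bit width plus 6 spare padding bits.
 * This is NOT the real utf8proc_property_t; it is a toy stand-in. */
struct toy_props_before {
    unsigned charwidth : 2;
    unsigned pad       : 6;
};

/* Hypothetical "after" layout: one padding bit becomes a flag. */
struct toy_props_after {
    unsigned charwidth       : 2;
    unsigned ambiguous_width : 1; /* set for EAW class "A" characters */
    unsigned pad             : 5;
};

/* Bit-field layout is implementation-defined, but on common compilers
 * both structs occupy the same storage unit, so the size is unchanged. */
_Static_assert(sizeof(struct toy_props_before) == sizeof(struct toy_props_after),
               "reusing a padding bit should not grow the struct");

/* Toy accessor over the toy struct (name is illustrative only). */
int toy_props_is_ambiguous(const struct toy_props_after *p) {
    assert(p != NULL);
    return p->ambiguous_width ? 1 : 0;
}
```

Because the flag lives in bits that were previously padding, the charwidth value that existing consumers read is unaffected in this sketch.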

alternatives

  • set charwidth==3 for such characters (those that are not zero-width); this value is presently unused. I think this would be too much of a breaking change for existing consumers.

  • return the full set of EAW classes (W, F, N, H, Na, A). This could be more future-proof if some consumers need this info, but would require more space.

@stevengj
Member

> older CJK fonts where two-byte characters in legacy encodings were rendered as double-width

If this is font-dependent, it doesn't seem like something you can infer from codepoint alone?

I'm a little confused about how people would use this new property in practice.

@bfredl
Contributor Author

bfredl commented Aug 12, 2024

Sure, but it is font-dependent either way. Currently utf8proc represents all (non-zero-width) ambiguous-width characters as single-width, which is a fine first approximation but not guaranteed to be correct either. Knowing which characters are considered ambiguous allows applications to treat them more carefully; e.g., in a TUI you could reposition the cursor after each such codepoint to make sure the TUI and terminal emulator cursors stay in sync regardless of the actual width in the user's font.

More specifically, this was motivated by ongoing work in neovim to migrate all Unicode table lookups to utf8proc, and ambiguous EAW is something we need to know in order not to regress functionality. Whether these characters are treated as single- or double-width is configurable via an option, and regardless we apply the workaround described above to handle discrepancies between fonts.
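A rough sketch of how a consumer might combine a configurable width policy with the cursor-resync workaround follows. It is not neovim's implementation. The two `toy_` lookups are hypothetical stand-ins covering only a few codepoint ranges; in a real program they would be replaced by the library's width lookup (such as utf8proc_charwidth) and the new ambiguous-width property. The function names and the `ambiguous_is_double` parameter are likewise assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the ambiguous-width property. Covers only a few ranges
 * that UAX #11 classifies as "A"; a real table is far larger. */
bool toy_is_ambiguous(uint32_t cp) {
    return cp == 0x00A1                      /* INVERTED EXCLAMATION MARK */
        || (cp >= 0x03B1 && cp <= 0x03C1)    /* Greek small alpha..rho */
        || (cp >= 0x2500 && cp <= 0x254B);   /* box drawing */
}

/* Stand-in for the base width lookup: 2 for a small CJK range, else 1. */
int toy_base_width(uint32_t cp) {
    return (cp >= 0x4E00 && cp <= 0x9FFF) ? 2 : 1;
}

/* Width policy: ambiguous characters take 2 cells only if the user asked
 * for that; otherwise the base width is used unchanged. */
int cell_width(uint32_t cp, bool ambiguous_is_double) {
    if (ambiguous_is_double && toy_is_ambiguous(cp)) {
        return 2;
    }
    return toy_base_width(cp);
}

/* Workaround: after printing an ambiguous character, write an explicit
 * cursor-position sequence (ESC [ row ; col H, 1-based) so the terminal's
 * cursor matches the application's idea of the column.
 * Returns the snprintf result (length of the sequence). */
int resync_cursor(char *buf, size_t cap, int row, int col) {
    assert(buf != NULL && row >= 1 && col >= 1);
    return snprintf(buf, cap, "\x1b[%d;%dH", row, col);
}
```

A renderer built on this would advance its own column counter by `cell_width(...)` and emit the `resync_cursor` sequence only after codepoints for which the ambiguity check is true, so ordinary text pays no extra cost.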

@bfredl
Contributor Author

bfredl commented Aug 14, 2024

This is an example of how this property will be used in neovim: neovim/neovim#30042.

@clason

clason commented Aug 29, 2024

@stevengj any input? This is a bit of a blocker for us.

@stevengj
Member

Seems fine to me; can you add an accessor function to the API? e.g. utf8proc_charwidth_ambiguous

…idth

Some characters have their width defined as "Ambiguous" in UAX#11.
These are typically rendered as single-width by modern monospace fonts,
and utf8proc correctly returns charwidth==1 for these.

However, some applications might need to support older CJK fonts where
characters which were two-byte in legacy encodings were rendered as
double-width. An example of this is the 'ambiwidth' option of vim
and neovim, which supports rendering in terminals using such width
rules.

Add an 'ambiguous_width' property to utf8proc_property_t for such characters.
@bfredl
Contributor Author

bfredl commented Aug 30, 2024

done.

@stevengj stevengj merged commit 3de4596 into JuliaStrings:master Aug 30, 2024
12 checks passed
@stevengj
Member

Note that Unicode 16 looks like it is scheduled to be released on September 10, so it might be good to hold off on a new release for a couple of weeks until we can update the Unicode tables.
