Upgrade unicode-width and handle width more correctly #430

joshtriplett · 2024-10-16T00:29:45Z

Current versions of unicode-width have more handling for characters and
ligatures that typically render at a different width than a
character-by-character width sum would produce, such as ZWJ emoji
sequences.

Current versions of unicode-width also punt on control characters (as
somewhat recommended by Unicode standards), making the assumption that
if not specially handled they might render as width-1 substitution
characters. unicode-width recommends filtering them out and handling
them specially if that isn't the desired behavior.

Update all the dimension-checking functions to always check whole lines
at a time, splitting on newlines, and filtering out all other control
characters. This makes most tests pass.

Remove one test that appears to have depended on the exact rendering
width of Devanagari, which unicode-width notes is not consistent or
feasible, and doesn't have consistent monospace handling.

Add a test for emoji width handling, including a ZWJ-based emoji
cluster that should render as a single emoji.

Note that terminals are inconsistent about how they render ZWJ-based
emoji clusters. There's a proposal "mode 2027" to help applications tell
the terminal how they want to handle such sequences. In the future,
tabled/papergrid might want to add support for both modes, so that
applications built on tabled/papergrid can then choose which handling
they want (either based on user configurability or terminal
detection/configuration based on "mode 2027"). See
https://mitchellh.com/writing/grapheme-clusters-in-terminals for
details.

In the meantime, this implements the ideal behavior of assuming
ZWJ-based emoji clusters are rendered as a single emoji, and will
produce incorrect widths on terminals that render such emoji clusters
incorrectly. See the table at
https://mitchellh.com/writing/grapheme-clusters-in-terminals#terminal-comparison
for a comparison of terminal behavior.

If you want the behavior of handling emoji clusters as multiple emoji (which will match the behavior of the terminals that do so), and ignoring ligatures, I think that behavior would correspond to get_string_width summing the widths of each character individually. However, I've sent a feature request to unicode-width upstream to request a function that gives the exact behavior.

I haven't made any attempt here to add configurability to tabled to allow choosing between those two behaviors. That'd be a much larger change.

In any case, this change may be sufficiently disruptive that you might want to publish it as a new semver-major version. It's more correct according to Unicode, and should produce better results for some people and somewhat less incorrect results for others (it shouldn't be worse for anyone), but it's a substantial behavior change that people may want to knowingly adopt.

Current versions of unicode-width have more handling for characters and ligatures that typically render at a different width than a character-by-character width sum would produce, such as ZWJ emoji sequences. Current versions of unicode-width also punt on control characters (as somewhat recommended by Unicode standards), making the assumption that if not specially handled they might render as width-1 substitution characters. unicode-width recommends filtering them out and handling them specially if that isn't the desired behavior. Update all the dimension-checking functions to always check whole lines at a time, splitting on newlines, and filtering out all other control characters. This makes most tests pass. Remove one test that appears to have depended on the exact rendering width of Devanagari, which unicode-width notes is not consistent or feasible, and doesn't have consistent monospace handling. Add a test for emoji width handling, including a ZWJ-based emoji cluster that should render as a single emoji. Note that terminals are inconsistent about how they render ZWJ-based emoji clusters. There's a proposal "mode 2027" to help applications tell the terminal how they want to handle such sequences. In the future, tabled/papergrid *might* want to add support for both modes, so that applications built on tabled/papergrid can then choose which handling they want (either based on user configurability or terminal detection/configuration based on "mode 2027"). See https://mitchellh.com/writing/grapheme-clusters-in-terminals for details. In the meantime, this implements the ideal behavior of assuming ZWJ-based emoji clusters are rendered as a single emoji, and will produce incorrect widths on terminals that render such emoji clusters incorrectly. See the table at https://mitchellh.com/writing/grapheme-clusters-in-terminals#terminal-comparison for a comparison of terminal behavior.

zhiburt · 2024-10-21T04:55:43Z

Hi @joshtriplett

Thanks for comment.
I think we shall land it.

I don't know why I didn't got an email of it....

I'll try to release it in the nearest.
I Just have some mixed thoughts about Theme interface. It's just got bloated a bit. Probably need a bit improvement.
Also there's some potential performance improvements I have in mind for some ops.

You kind of a hero, cause I was sort of scary to get to deal with this issue 😅
The best I thought is to fork it 🤣

Take care

zhiburt · 2024-10-21T05:03:35Z

papergrid/src/util/string.rs

@@ -72,7 +52,7 @@ pub fn get_char_width(c: char) -> usize {

 /// Returns a string width (accouting all characters).
 pub fn get_string_width(text: &str) -> usize {
-    unicode_width::UnicodeWidthStr::width(text)
+    unicode_width::UnicodeWidthStr::width(text.replace(|c| c < ' ', "").as_str())


Though now it became a heavy operation :(

I wonder 💭

Maybe we rather must 'enforce' this in a way we did with
tabled::settings::formatting::Charset::clean?

Just add a preprocessing step with this replacement and for those who are not sure about there sources it "assumed" to be used.
But for those who are pretty sure or did some processing themself we wouldn't waste this allocation?

What I mean is famous - "why need to pay for what we are not using" (don't remember how the quote goes exactly).

What do you think?

Once I had an idea about a generic which would describe a table content as Clean | Dirty so we could do it out of the box.
And who are sure about their actions could use optimized version.

I don't think the search portion of the replace operation is important to avoid; it's just a single search over the string contents, and when rendering text for human consumption, a character search won't add any significant overhead. The memory allocation, though, is worth avoiding if possible.

Ideally, there'd be a version of str::replace that returns a Cow<str>, and only allocates if it does any replacements. There isn't one in the standard library, but you could add one in util::string. That would eliminate the allocation overhead, and the overhead of a search for control characters isn't worth avoiding.

I'll merge it, and do Cow<str>.
Will run benchmarks and look at it.

As I quite a bit favor the movement of it into its own setting 😄
*All though the one downside......we will need to recalculate it twice....... in most cases... but it's details.

Once again.
Thanks a lot.

joshtriplett mentioned this pull request Oct 16, 2024

upgrade papergrid to a later version of unicode-width #426

Closed

zhiburt reviewed Oct 21, 2024

View reviewed changes

zhiburt merged commit 68f2538 into zhiburt:master Oct 21, 2024
68 of 75 checks passed

joshtriplett deleted the upgrade-unicode-width branch October 21, 2024 15:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrade unicode-width and handle width more correctly #430

Upgrade unicode-width and handle width more correctly #430

joshtriplett commented Oct 16, 2024 •

edited

Loading

zhiburt commented Oct 21, 2024 •

edited

Loading

zhiburt Oct 21, 2024

zhiburt Oct 21, 2024 •

edited

Loading

zhiburt Oct 21, 2024 •

edited

Loading

joshtriplett Oct 21, 2024

zhiburt Oct 21, 2024

Upgrade unicode-width and handle width more correctly #430

Upgrade unicode-width and handle width more correctly #430

Conversation

joshtriplett commented Oct 16, 2024 • edited Loading

zhiburt commented Oct 21, 2024 • edited Loading

zhiburt Oct 21, 2024

Choose a reason for hiding this comment

zhiburt Oct 21, 2024 • edited Loading

Choose a reason for hiding this comment

zhiburt Oct 21, 2024 • edited Loading

Choose a reason for hiding this comment

joshtriplett Oct 21, 2024

Choose a reason for hiding this comment

zhiburt Oct 21, 2024

Choose a reason for hiding this comment

joshtriplett commented Oct 16, 2024 •

edited

Loading

zhiburt commented Oct 21, 2024 •

edited

Loading

zhiburt Oct 21, 2024 •

edited

Loading

zhiburt Oct 21, 2024 •

edited

Loading