-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade unicode-width and handle width more correctly #430
Conversation
Current versions of unicode-width have more handling for characters and ligatures that typically render at a different width than a character-by-character width sum would produce, such as ZWJ emoji sequences. Current versions of unicode-width also punt on control characters (as somewhat recommended by Unicode standards), making the assumption that if not specially handled they might render as width-1 substitution characters. unicode-width recommends filtering them out and handling them specially if that isn't the desired behavior. Update all the dimension-checking functions to always check whole lines at a time, splitting on newlines, and filtering out all other control characters. This makes most tests pass. Remove one test that appears to have depended on the exact rendering width of Devanagari, which unicode-width notes is not consistent or feasible, and doesn't have consistent monospace handling. Add a test for emoji width handling, including a ZWJ-based emoji cluster that should render as a single emoji. Note that terminals are inconsistent about how they render ZWJ-based emoji clusters. There's a proposal "mode 2027" to help applications tell the terminal how they want to handle such sequences. In the future, tabled/papergrid *might* want to add support for both modes, so that applications built on tabled/papergrid can then choose which handling they want (either based on user configurability or terminal detection/configuration based on "mode 2027"). See https://mitchellh.com/writing/grapheme-clusters-in-terminals for details. In the meantime, this implements the ideal behavior of assuming ZWJ-based emoji clusters are rendered as a single emoji, and will produce incorrect widths on terminals that render such emoji clusters incorrectly. See the table at https://mitchellh.com/writing/grapheme-clusters-in-terminals#terminal-comparison for a comparison of terminal behavior.
Thanks for comment. I don't know why I didn't got an email of it.... I'll try to release it in the nearest. You kind of a hero, cause I was sort of scary to get to deal with this issue 😅 Take care |
@@ -72,7 +52,7 @@ pub fn get_char_width(c: char) -> usize { | |||
|
|||
/// Returns a string width (accouting all characters). | |||
pub fn get_string_width(text: &str) -> usize { | |||
unicode_width::UnicodeWidthStr::width(text) | |||
unicode_width::UnicodeWidthStr::width(text.replace(|c| c < ' ', "").as_str()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though now it became a heavy operation :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder 💭
Maybe we rather must 'enforce' this in a way we did with
tabled::settings::formatting::Charset::clean
?
Just add a preprocessing step with this replacement and for those who are not sure about there sources it "assumed" to be used.
But for those who are pretty sure or did some processing themself we wouldn't waste this allocation?
What I mean is famous - "why need to pay for what we are not using" (don't remember how the quote goes exactly).
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once I had an idea about a generic which would describe a table content as Clean | Dirty
so we could do it out of the box.
And who are sure about their actions could use optimized version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the search portion of the replace operation is important to avoid; it's just a single search over the string contents, and when rendering text for human consumption, a character search won't add any significant overhead. The memory allocation, though, is worth avoiding if possible.
Ideally, there'd be a version of str::replace
that returns a Cow<str>
, and only allocates if it does any replacements. There isn't one in the standard library, but you could add one in util::string
. That would eliminate the allocation overhead, and the overhead of a search for control characters isn't worth avoiding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll merge it, and do Cow<str>
.
Will run benchmarks and look at it.
As I quite a bit favor the movement of it into its own setting 😄
*All though the one downside......we will need to recalculate it twice....... in most cases... but it's details.
Once again.
Thanks a lot.
Current versions of unicode-width have more handling for characters and
ligatures that typically render at a different width than a
character-by-character width sum would produce, such as ZWJ emoji
sequences.
Current versions of unicode-width also punt on control characters (as
somewhat recommended by Unicode standards), making the assumption that
if not specially handled they might render as width-1 substitution
characters. unicode-width recommends filtering them out and handling
them specially if that isn't the desired behavior.
Update all the dimension-checking functions to always check whole lines
at a time, splitting on newlines, and filtering out all other control
characters. This makes most tests pass.
Remove one test that appears to have depended on the exact rendering
width of Devanagari, which unicode-width notes is not consistent or
feasible, and doesn't have consistent monospace handling.
Add a test for emoji width handling, including a ZWJ-based emoji
cluster that should render as a single emoji.
Note that terminals are inconsistent about how they render ZWJ-based
emoji clusters. There's a proposal "mode 2027" to help applications tell
the terminal how they want to handle such sequences. In the future,
tabled/papergrid might want to add support for both modes, so that
applications built on tabled/papergrid can then choose which handling
they want (either based on user configurability or terminal
detection/configuration based on "mode 2027"). See
https://mitchellh.com/writing/grapheme-clusters-in-terminals for
details.
In the meantime, this implements the ideal behavior of assuming
ZWJ-based emoji clusters are rendered as a single emoji, and will
produce incorrect widths on terminals that render such emoji clusters
incorrectly. See the table at
https://mitchellh.com/writing/grapheme-clusters-in-terminals#terminal-comparison
for a comparison of terminal behavior.
If you want the behavior of handling emoji clusters as multiple emoji (which will match the behavior of the terminals that do so), and ignoring ligatures, I think that behavior would correspond to
get_string_width
summing the widths of each character individually. However, I've sent a feature request to unicode-width upstream to request a function that gives the exact behavior.I haven't made any attempt here to add configurability to tabled to allow choosing between those two behaviors. That'd be a much larger change.
In any case, this change may be sufficiently disruptive that you might want to publish it as a new semver-major version. It's more correct according to Unicode, and should produce better results for some people and somewhat less incorrect results for others (it shouldn't be worse for anyone), but it's a substantial behavior change that people may want to knowingly adopt.