-
Notifications
You must be signed in to change notification settings - Fork 748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for East Asian Width #375
base: master
Are you sure you want to change the base?
Conversation
It would be nice to be able to switch to using this implementation of wcwidth by default at compiletime. |
This looks like the same wcwidth() implementation as in my "unichar" branch that gets rid of wchar_t. I think better to just use this new wcwidth() implementation unconditionally. https://github.com/sirainen/mosh/tree/unichar About ICU: I don't think it implements wcwidth() directly either, or does much else that is helpful in implementing this in a cleaner way? At least I didn't notice anything obvious when I looked into it, but I didn't look very hard. A dependency on ICU just for this small functionality seems like an overkill anyway. |
The reason I mentioned ICU was that once you use it you are free from maintaining the character width database updated in future releases of Unicode, which ICU has been and will be following in a timely manner. The wcwidth() implementation in question is already so outdated; it is based on Unicode 2.0 when we have 6.2 by now (although I haven't looked closely into the differences that may affect this feature). However, I have to admit that it would be an overkill to adopt ICU just for this. Maybe we should just live with the wcswidth() solution for the moment and revisit ICU in future when we end up with a desperate need for normalization, combining, bi-di and so on. |
Why doesn’t glibc support this natively? Is this http://sourceware.org/bugzilla/show_bug.cgi?id=4335? |
Rebased. |
Is there any hope of just getting glibc to fix the CJK locales, or make alternate CJK locales in which wcwidth() reports 2 (instead of 1) for ambiguous characters? I'm reluctant to include our own implementation of a standard library function (wcwidth() in this case), which we will then have to keep updating as Unicode evolves, especially if this is really a bug in the locale. Either way, we still have no guarantee that wcwidth() (whether internal or from the locale) matches what the outer terminal is doing. |
One crazy possibility would be to allow the user to build their own wcwidth() table, based on what their outer terminal actually does, and have the client upload it to the server. For example, using the following. (We'd have to pull the real ranges of assigned characters from the latest Unicode spec instead of just using 0..65535.) I honestly think this may be more the "right thing" than hardcoding our own wcwidth table. #!/usr/bin/perl -w
use strict;
use Term::ReadKey;
use bytes;
ReadMode 5;
my %wc;
for ( my $query = 0; $query <= 65535; $query++ ) {
print qq{\033[Hx}, (pack q{U}, $query), qq{\033[6n};
my $str; while ( 1 ) { $str .= ReadKey 0; last if $str =~ m{R$} }
my ( $row, $col ) = $str =~ m{\[(\d+);(\d+)R};
$wc{ $query } = $col - 2;
}
ReadMode 0;
print qq{\033[H\033[2J};
open OUT, q{>wcwidth.txt} or die qq{$!};
printf OUT qq{U+%04X %d\n}, $_, $wc{ $_ } for sort { $a <=> $b } keys %wc; |
I think allowing user to inject a width table is a good idea. As I mentioned in the first post, most major terminal based applications already include their own alternatives to the "broken" wcwidth(), and the fact itself is part of the negative cycle, keeping the vendors of libc reluctant to fix wcwidth() on their side. So, even if glibc decided to fix the bug, portable apps like mosh could not depend on it unless the same decision is made in OS X, *BSD, SunOS, etc. etc. |
Yes, I took your post to heart and am grateful for your attention to these issues. (Any Unicode language lawyer is a friend of ours here at Mosh HQ.) I guess tentatively let's go forward with the plan to allow the user to supply an external wcwidth table to mosh-client! And figure out a graceful way to upload this to the server, and supply an executable (possible external to mosh or mosh-client) that builds the table. |
Great! Let me help test it when it's getting ready. |
This is a real quick hack to replace wcwidth(3) with a version that takes East Asian (CJK) ambiguous characters into account. A new option -w is added to mosh(1), which is enabled by default if run under a CJK locale.
Mosh currently uses wcwidth(3) for computing the width of each character, but the somewhat outdated standard function does not take East Asian Width (http://www.unicode.org/reports/tr11/) into account.
Characters marked as having an "ambiguous" width should be displayed as wide character under East Asian (CJK) locales while they should be narrow under non-CJK locales.
Most major terminal emulators, terminal multiplexers and terminal-based text editors support this option, so I believe mosh should too. Here's an incomplete list of those that support it:
…
My patch is a quick hack that simply does the job with a least effort and hopefully with a least overhead.
To do it a cleaner way, you may want to consider using ICU (http://site.icu-project.org) which offers a native C++ API to the Unicode database, though I'm not sure how it impacts on performance.