Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for East Asian Width #375

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

knu
Copy link

@knu knu commented Jan 16, 2013

Mosh currently uses wcwidth(3) for computing the width of each character, but the somewhat outdated standard function does not take East Asian Width (http://www.unicode.org/reports/tr11/) into account.

Characters marked as having an "ambiguous" width should be displayed as wide character under East Asian (CJK) locales while they should be narrow under non-CJK locales.

Most major terminal emulators, terminal multiplexers and terminal-based text editors support this option, so I believe mosh should too. Here's an incomplete list of those that support it:

  • xterm
  • GNOME terminal
  • Konsole + unofficial patch
  • PuTTY + unofficial patch
  • iTerm2
  • GNU screen (4.01.00+)
  • tmux + unofficial patch
  • GNU Emacs
  • Vim

My patch is a quick hack that simply does the job with a least effort and hopefully with a least overhead.
To do it a cleaner way, you may want to consider using ICU (http://site.icu-project.org) which offers a native C++ API to the Unicode database, though I'm not sure how it impacts on performance.

@glance-
Copy link
Contributor

glance- commented Jan 21, 2013

It would be nice to be able to switch to using this implementation of wcwidth by default at compiletime.
That would solve #361.

@sirainen
Copy link
Contributor

This looks like the same wcwidth() implementation as in my "unichar" branch that gets rid of wchar_t. I think better to just use this new wcwidth() implementation unconditionally. https://github.com/sirainen/mosh/tree/unichar

About ICU: I don't think it implements wcwidth() directly either, or does much else that is helpful in implementing this in a cleaner way? At least I didn't notice anything obvious when I looked into it, but I didn't look very hard. A dependency on ICU just for this small functionality seems like an overkill anyway.

@knu
Copy link
Author

knu commented Jan 25, 2013

The reason I mentioned ICU was that once you use it you are free from maintaining the character width database updated in future releases of Unicode, which ICU has been and will be following in a timely manner. The wcwidth() implementation in question is already so outdated; it is based on Unicode 2.0 when we have 6.2 by now (although I haven't looked closely into the differences that may affect this feature).

However, I have to admit that it would be an overkill to adopt ICU just for this. Maybe we should just live with the wcswidth() solution for the moment and revisit ICU in future when we end up with a desperate need for normalization, combining, bi-di and so on.

@andersk
Copy link
Member

andersk commented Feb 21, 2013

Why doesn’t glibc support this natively? Is this http://sourceware.org/bugzilla/show_bug.cgi?id=4335?

@ghost
Copy link

ghost commented Feb 21, 2013

@knu
Copy link
Author

knu commented May 17, 2013

Rebased.

@keithw
Copy link
Member

keithw commented May 17, 2013

Is there any hope of just getting glibc to fix the CJK locales, or make alternate CJK locales in which wcwidth() reports 2 (instead of 1) for ambiguous characters?

I'm reluctant to include our own implementation of a standard library function (wcwidth() in this case), which we will then have to keep updating as Unicode evolves, especially if this is really a bug in the locale. Either way, we still have no guarantee that wcwidth() (whether internal or from the locale) matches what the outer terminal is doing.

@keithw
Copy link
Member

keithw commented May 17, 2013

One crazy possibility would be to allow the user to build their own wcwidth() table, based on what their outer terminal actually does, and have the client upload it to the server. For example, using the following. (We'd have to pull the real ranges of assigned characters from the latest Unicode spec instead of just using 0..65535.)

I honestly think this may be more the "right thing" than hardcoding our own wcwidth table.

#!/usr/bin/perl -w

use strict;
use Term::ReadKey;
use bytes;

ReadMode 5;

my %wc;
for ( my $query = 0; $query <= 65535; $query++ ) {
  print qq{\033[Hx}, (pack q{U}, $query), qq{\033[6n};
  my $str; while ( 1 ) { $str .= ReadKey 0; last if $str =~ m{R$} }
  my ( $row, $col ) = $str =~ m{\[(\d+);(\d+)R};
  $wc{ $query } = $col - 2;
}

ReadMode 0;

print qq{\033[H\033[2J};

open OUT, q{>wcwidth.txt} or die qq{$!};
printf OUT qq{U+%04X %d\n}, $_, $wc{ $_ } for sort { $a <=> $b } keys %wc;

@knu
Copy link
Author

knu commented May 18, 2013

I think allowing user to inject a width table is a good idea.

As I mentioned in the first post, most major terminal based applications already include their own alternatives to the "broken" wcwidth(), and the fact itself is part of the negative cycle, keeping the vendors of libc reluctant to fix wcwidth() on their side. So, even if glibc decided to fix the bug, portable apps like mosh could not depend on it unless the same decision is made in OS X, *BSD, SunOS, etc. etc.

@keithw
Copy link
Member

keithw commented May 18, 2013

Yes, I took your post to heart and am grateful for your attention to these issues. (Any Unicode language lawyer is a friend of ours here at Mosh HQ.) I guess tentatively let's go forward with the plan to allow the user to supply an external wcwidth table to mosh-client! And figure out a graceful way to upload this to the server, and supply an executable (possible external to mosh or mosh-client) that builds the table.

@knu
Copy link
Author

knu commented May 18, 2013

Great! Let me help test it when it's getting ready.

This is a real quick hack to replace wcwidth(3) with a version that
takes East Asian (CJK) ambiguous characters into account.

A new option -w is added to mosh(1), which is enabled by default if
run under a CJK locale.
@knu knu force-pushed the east-asian-width branch from dfe82d2 to 21e8df6 Compare April 25, 2018 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants