Another implementation for utf8 support. #187

yhirose · 2020-04-19T23:01:26Z

Since I saw #186, I would like to post my implementation for utf8 support, too. It supports Emoji and CJK characters which have wide width as well. When the Unicode standard gets updated, the only thing we need is to update tables in `encodings/utf8.c'. We can also easily add another encodings by implementing by providing the following functions:

typedef size_t (linenoisePrevCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseNextCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseReadCode)(int fd, char *buf, size_t buf_len, int* c);

You can see more detail from my comments in #25.

I don't have any intention to promote my implementation at all with this pull request. But hope this implementation helps you when designing the utf8 support in linenoise in addition to @ManzoniGiuseppe's excellent implementation.

Thank you!

ManzoniGiuseppe · 2020-04-20T14:03:15Z

Nah, mine was just a temporary fix for basic utf-8 support until someone wrote something more complete (which apparently was already written long ago). Anyway, I tried yours out, and I found two things. It seems I can't create issues in yours, so, I'll put them here:

The wideCharTable seems bugged to me. Things like °, ¡, €, ¤, ×, ¹, á, ¶ are shown as single column in my xterm, but your impl considers them as two columns large. Maybe it's my xterm to be bugged, but consider the following: Your impl considers the lower case á to be two columns wide, but its uppercase Á is a single column... same strange thing for the pairs ǎ/Ǎ, à/À, while ã/Ã, ä/Ä, ȧ/Ȧ are all single column.
You put this codepoint in your example, and you said to support unicode 12.1, but the Zero Width Joiner (U+200D, added in unicode 1.1.5) is not supported width-wise. I mean, it's neither a zero width nor a joiner.

yhirose · 2020-04-20T16:52:04Z

@ManzoniGiuseppe, thanks for the thorough checking!

The wideCharTable seems bugged to me. Things like °, ¡, €, ¤, ×, ¹, á, ¶ are shown as single column in my xterm, but your impl considers them as two columns large. Maybe it's my xterm to be bugged, but consider the following: Your impl considers the lower case á to be two columns wide, but its uppercase Á is a single column... same strange thing for the pairs ǎ/Ǎ, à/À, while ã/Ã, ä/Ä, ȧ/Ȧ are all single column.

You are right. My the wideCharTable generator (generate_wide_char_table.rb) treated 'Ambiguous' (A) entries as Wide chars. The new generator code now excludes them from wideCharTable. Please check with the latest code.

The reason why I considered Ambiguous entries as Wide chars is based on the following information from UAX #11 'East Asian Width' document for the Ambiguous characters:

In a broad sense, wide characters include W, F, and A (when in East Asian context), and narrow characters include N, Na, H, and A (when not in East Asian context).

But it seems like there are only a few cases where the ambiguous entries should be treated as wide chars. So I think this change is ok, and hope it fixes the problems that you reported.

You put this codepoint in your example, and you said to support unicode 12.1, but the Zero Width Joiner (U+200D, added in unicode 1.1.5) is not supported width-wise. I mean, it's neither a zero width nor a joiner.

In the EastAsianWIdth.txt in Unicode 13.0 Database has the following entry:

200B..200F;N     # Cf     [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK

U+200D is declared as 'Narrow' (N). So I am not sure if U+200D should be treated as a wide character at this point...

Anyway I didn't make any decision by myself as to whether characters should be wide or narrow, rather the Unicode East Asian Width database does all. The benefit of this approach is very easy to catch up to the latest Unicode standard. (Just download UnicodeData.txt and EastAsianWidth.txt from Unicode.org, and re-generate combiningCharTable and WideCharTable with the ruby scripts.)

But if there are cases that this auto-generated-table-driven way cannot handle, it seems like such cases should be handled manually...

I included tools directory where you can see the latest Unicode 13.0 UnicodeData.txt and EastAsianWidth.txt and ruby scripts to generate the C++ tables. Hope it helps you to understand my code better. :)

craigbarnes · 2020-04-28T14:28:09Z

U+200D is declared as 'Narrow' (N). So I am not sure if U+200D should be treated as a wide character at this point...

N means neutral, not narrow. U+200D is a default ignorable codepoint, so in the context of a terminal it should be zero width.

yhirose · 2020-04-28T15:09:28Z

@craigbarnes, thanks for finding my incorrect understanding for N. :)

craigbarnes · 2020-04-28T15:17:59Z

It's the same difference really. It's the default ignorable part that matters. The most straight forward way to support those codepoints is to parse DerivedCoreProperties.txt and make all Default_Ignorable_Code_Point entries have a width of 0.

antirez · 2020-04-28T15:36:56Z

Hey folks, thanks for your efforts. What do you think the best support is so far among the ones contributed? I would like to spend some time to review.

yhirose · 2020-05-01T03:01:33Z

@antirez, thank you for your interest in adding UTF-8 support.

I don't say that my implementation is the best though, you might find something helpful in it when supporting UTF-8 in linenoise.

Here is the diff between the latest master and my branch:
yhirose@72bf4bb#diff-72af1757e65272a44fc4aa3b479e2f3b

Most changes in 'linenoise.c' are basically for calculation of correct column position for multi-bytes characters. (Size of a character could now be multi-bytes, and it also could be a 'wide' char which consumes 2 column widths like Emoji and CJK characters.)

The actual UTF-8 specific code is separated in `encodings/utf8.h and c'. It implements the following interfaces which are used in 'linenoise.c'.

typedef size_t (linenoisePrevCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseNextCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseReadCode)(int fd, char *buf, size_t buf_len, int* c);

So if you want to support another encoding, what we need to do is to implement the above interfaces for the encoding and integrate them to linenoise by calling linenoiseSetEncodingFunctions function in 'linenoise.h'.

Hope it helps when you review the code. Please let me know if you have any questions.
Thank you!

ManzoniGiuseppe · 2020-05-31T16:57:47Z

Of the two I know of, this has an obviously better utf-8 support then mine.
If anything I think a function to detect if utf-8 is supported would be nice and easy. For example mine internally (didn't want to change the API) checks it by printing a U+1EE4 and detecting of how many positions the cursor was moved. Many other encodings like all ISO 8859,Windows-1252, and Mac OS Roman would print it as 3 valid chars instead of 1. I think it'd be useful for many applications as they may not know which encoding their terminal is using.

antirez · 2020-07-03T09:02:23Z

Ok I took some time to read both the implementations. They are both nice in different ways. But my feeling is that this implementation is too complex for the linenoise design idea of minimality, and @ManzoniGiuseppe implementation, which is more in the right direction, is too conservative in the abstractions, so it ends making a lot of math with buffer positions. I want to attempt a more minimal implementation, and I'll post it as a PR as well. Let's see if there is a possible, really minimal approach. Thanks.

yhirose · 2020-07-03T12:55:23Z

@antirez, sounds good to me!

dzwdz · 2023-08-09T17:43:06Z

In case anyone cares, I've updated this branch to work with the multiplexing API / resolved the current merge conflicts. yhirose#1

yhirose · 2023-08-29T04:04:33Z

I have rebased the utf8-support branch on yhirose/linenoise with the latest master branch, and I confirmed that the utf8-support branch works with Japanese and Emoji.

This is the first batch of changes from an upstream PR that provides UTF8 support. As a side effect, it means that we now will have the correct support for escape sequences in single and multiline mode. We previously merged in a smaller change which provided support for single line mode with escape sequences by fixing the cursor position but this is more comprehensive. The next commit will add the UTF8 library. Author: https://github.com/yhirose PR: antirez/linenoise#187

This is the second half of the PR that I'm merging in from upstream. It adds UTF8 13.0 support by including an extra library to get the char width of those characters (the visual width that is). During testing things seem to work quite well. I'm impressed. Author: https://github.com/yhirose PR: antirez/linenoise#187

yhirose · 2024-03-27T18:06:38Z

Rebased the utf8-support branch on yhirose/linenoise with the latest master branch.

zlaazlaa · 2024-08-19T09:21:28Z

What a nice PR!, is that means we can input Chinese and delete Chinese correctly? In master branch I even need to press Backspace for three times to delete Chinese!

yhirose · 2024-08-19T11:46:18Z

Yes, you can. Here is my branch which includes the UTF-8 support. Hope you enjoy it!
https://github.com/yhirose/linenoise/tree/utf8-support

zlaazlaa · 2024-08-20T04:28:13Z

Yes, you can. Here is my branch which includes the UTF-8 support. Hope you enjoy it! https://github.com/yhirose/linenoise/tree/utf8-support

Thanks for your reply, I'll try it.👍

yhirose force-pushed the utf8-support branch 3 times, most recently from 56859ae to 72bf4bb Compare April 20, 2020 16:20

yhirose force-pushed the utf8-support branch 2 times, most recently from 8d05ca3 to 8139f7a Compare April 7, 2022 03:51

fasterit mentioned this pull request Mar 29, 2023

Create a line editor for filter and search strings htop-dev/htop#1212

Open

dzwdz mentioned this pull request Aug 5, 2023

line editing: utf8 support, ctrl+arrows, all that jazz dzwdz/hewwo#2

Closed

yhirose force-pushed the utf8-support branch from 8139f7a to 34fbffe Compare August 29, 2023 04:01

apainintheneck mentioned this pull request Feb 5, 2024

Research pulling in more useful patches not included in upstream apainintheneck/crystal-linenoise#21

Open

9 tasks

apainintheneck mentioned this pull request Feb 27, 2024

Pull in utf8 support apainintheneck/crystal-linenoise#27

Merged

4 tasks

UTF8 encoding support.

b35616d

yhirose force-pushed the utf8-support branch from 34fbffe to b35616d Compare March 27, 2024 18:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Another implementation for utf8 support. #187

Another implementation for utf8 support. #187

yhirose commented Apr 19, 2020

ManzoniGiuseppe commented Apr 20, 2020

yhirose commented Apr 20, 2020

craigbarnes commented Apr 28, 2020 •

edited

Loading

yhirose commented Apr 28, 2020

craigbarnes commented Apr 28, 2020 •

edited

Loading

antirez commented Apr 28, 2020

yhirose commented May 1, 2020

ManzoniGiuseppe commented May 31, 2020

antirez commented Jul 3, 2020

yhirose commented Jul 3, 2020

dzwdz commented Aug 9, 2023

yhirose commented Aug 29, 2023 •

edited

Loading

yhirose commented Mar 27, 2024

zlaazlaa commented Aug 19, 2024

yhirose commented Aug 19, 2024

zlaazlaa commented Aug 20, 2024

Another implementation for utf8 support. #187

Are you sure you want to change the base?

Another implementation for utf8 support. #187

Conversation

yhirose commented Apr 19, 2020

ManzoniGiuseppe commented Apr 20, 2020

yhirose commented Apr 20, 2020

craigbarnes commented Apr 28, 2020 • edited Loading

yhirose commented Apr 28, 2020

craigbarnes commented Apr 28, 2020 • edited Loading

antirez commented Apr 28, 2020

yhirose commented May 1, 2020

ManzoniGiuseppe commented May 31, 2020

antirez commented Jul 3, 2020

yhirose commented Jul 3, 2020

dzwdz commented Aug 9, 2023

yhirose commented Aug 29, 2023 • edited Loading

yhirose commented Mar 27, 2024

zlaazlaa commented Aug 19, 2024

yhirose commented Aug 19, 2024

zlaazlaa commented Aug 20, 2024

craigbarnes commented Apr 28, 2020 •

edited

Loading

craigbarnes commented Apr 28, 2020 •

edited

Loading

yhirose commented Aug 29, 2023 •

edited

Loading