Fix Emoji handling for wide.py #204

singingtelegram · 2021-04-25T03:36:25Z

Some emojis are composed of multiple Unicode code points joined by the Zero Width Joiner (U+200D). In its current form, create breaks those emojis apart (for example, !w2 👨‍✈️ returns 👨　‍　✈　️　).

My pull request aims to fix this behavior. Everything seems to work as intended during my limited testing.
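To make the failure concrete, here is a minimal sketch of the behaviour described above. It is not the code from wide.py: `naive_widen` and `WIDE_SPACE` are hypothetical names, and the spacer is assumed to be U+3000 (IDEOGRAPHIC SPACE) based on the output shown.

```python
# Assumption: the spacer is U+3000 IDEOGRAPHIC SPACE, as the output above suggests.
WIDE_SPACE = "\u3000"


def naive_widen(text, spaces=1):
    """Hypothetical stand-in: append `spaces` spacers after every code point."""
    return "".join(ch + WIDE_SPACE * spaces for ch in text)


# The pilot emoji: man + ZWJ + airplane + variation selector-16.
pilot = "\U0001F468\u200D\u2708\uFE0F"

print([f"U+{ord(ch):04X}" for ch in pilot])
# ['U+1F468', 'U+200D', 'U+2708', 'U+FE0F']

# Each of the four code points gets its own spacer, so the sequence falls apart.
print(len(naive_widen(pilot)))  # 8
```

What looks like one emoji is four code points to Python, which is why iterating over the string character by character separates them.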

ja5087 · 2021-04-25T04:18:18Z

Oh, good catch. This problem isn't limited to emojis, and handling only ZWJs won't solve it for all emojis either. Looking at this Wikipedia page, there are also things like skin color modifiers and accents, and I'm not sure how your code would handle those.

The actual problem is segmenting Unicode text into composed symbols (grapheme clusters) rather than individual code points. I'd suggest playing around with the regex module (not re), which supports nifty patterns like regex.findall(r'\X', word) that separate these properly:

Python 3.8.5 (default, Jan 27 2021, 15:41:15) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.findall(r'\X','👨‍👩‍👦🏳️‍🌈🏃🏻‍♀️')
['👨\u200d👩\u200d👦', '🏳️\u200d🌈', '🏃🏻\u200d♀️']
>>>

The bad output is because GNOME terminal converts the ZWJ to text.

steven676 · 2021-04-25T06:36:10Z

When I had to solve a very similar problem for the old Terminal Emulator for Android ages ago, the key observation was this: at least when it comes to everything other than emojis, Unicode code points with a display width of zero (combining diacritics, non-spacing marks and the like) generally can be thought of as if they attach to the previous code point. In other words, you'd want something like (UNTESTED)

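# response, WIDE_SPACE_CHAR and width come from the surrounding code
for c in text:
  if char_width(c) == 0:
    # zero-width code point: keep it attached to what came before
    response += c
  else:
    response += c + WIDE_SPACE_CHAR * width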

where char_width looks something like this:

import unicodedata

def char_width(c):
  cp = ord(c)
  # special cases:
  if cp == 0x00ad:
    # SOFT HYPHEN: format character of width 1
    return 1
  elif 0x1160 <= cp <= 0x11ff:
    # Hangul medial vowels and final consonants are conjoining when part of a Korean syllable block
    return 0
  elif 0xd7b0 <= cp <= 0xd7ff:
    # Hangul medial vowels/final consonants in Hangul Jamo Extended-B
    return 0
  category = unicodedata.category(c)
  if category in ('Cf', 'Mn', 'Me'):
    # format character ('Cf'), non-spacing mark ('Mn') or enclosing mark ('Me')
    return 0
  else:
    # all other characters have nonzero width
    return 1

(Alternately, look into the wcwidth package.)
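For reference, the two pieces above can be put together into something runnable. This is only a sketch under a few assumptions: `widen` is a hypothetical name, the spacer is assumed to be U+3000, and the spacers are deferred until the next visible character so that a combining mark lands directly after its base rather than after the padding.

```python
import unicodedata

WIDE_SPACE_CHAR = "\u3000"  # assumption: ideographic space as the spacer


def char_width(c):
    """Rough display width: 0 for code points that attach to the previous one."""
    cp = ord(c)
    if cp == 0x00AD:  # SOFT HYPHEN is 'Cf' but has width 1
        return 1
    if 0x1160 <= cp <= 0x11FF or 0xD7B0 <= cp <= 0xD7FF:  # conjoining Hangul jamo
        return 0
    if unicodedata.category(c) in ("Cf", "Mn", "Me"):
        return 0
    return 1


def widen(text, width=1):
    """Hypothetical helper: pad each visible character with `width` spacers."""
    out = []
    pending = ""  # spacers owed to the previous visible character
    for c in text:
        if char_width(c) != 0:
            # A new visible character starts: flush the padding first.
            out.append(pending)
            pending = WIDE_SPACE_CHAR * width
        out.append(c)
    out.append(pending)
    return "".join(out)


# 'e' + COMBINING ACUTE ACCENT stays together, then the spacer follows.
print(widen("e\u0301a") == "e\u0301\u3000a\u3000")  # True
```

Note that this still splits ZWJ emoji sequences: in man + ZWJ + airplane, the ZWJ stays attached to the man, but the airplane has nonzero width and starts a new cell.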

Unfortunately, because emojis were a terrible mistake, you're still going to have to special-case emoji sequences. As far as I can tell, if you want to limit special-case handling to sequences that clients will actually display specially, the only way to do this is with a giant table (https://www.unicode.org/emoji/charts/emoji-zwj-sequences.html, https://www.unicode.org/emoji/charts/full-emoji-modifiers.html, and https://www.unicode.org/emoji/charts/emoji-list.html#country-flag at least; there may be others). If you're okay with your algorithm picking up potential sequences that clients don't recognize as special, you need to read and digest https://www.unicode.org/reports/tr51/ .
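As an illustration of the second, looser option, the sketch below groups code points using a handful of hand-picked rules. It is an assumption-laden simplification, not an implementation of TR51: the `clusters` function and its helpers are hypothetical, it accepts sequences no client would render specially, and it ignores the soft hyphen and Hangul special cases from char_width.

```python
import unicodedata

ZWJ = "\u200D"


def is_zero_width(c):
    # Format characters (ZWJ, tags), non-spacing marks (accents,
    # variation selectors) and enclosing marks.
    return unicodedata.category(c) in ("Cf", "Mn", "Me")


def is_skin_tone(c):
    # Emoji modifiers U+1F3FB..U+1F3FF follow the emoji they modify.
    return 0x1F3FB <= ord(c) <= 0x1F3FF


def is_regional_indicator(c):
    # Flags are pairs of regional indicator symbols U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(c) <= 0x1F1FF


def clusters(text):
    """Group code points into display units using a few loose rules."""
    result = []
    for c in text:
        prev = result[-1] if result else ""
        second_half_of_flag = (
            len(prev) == 1
            and is_regional_indicator(prev)
            and is_regional_indicator(c)
        )
        if prev and (
            is_zero_width(c)
            or is_skin_tone(c)
            or prev.endswith(ZWJ)  # whatever follows a ZWJ joins the sequence
            or second_half_of_flag
        ):
            result[-1] = prev + c
        else:
            result.append(c)
    return result


family = "\U0001F468\u200D\U0001F469\u200D\U0001F466"
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
runner = "\U0001F3C3\U0001F3FB\u200D\u2640\uFE0F"
print(len(clusters(family + rainbow_flag + runner)))  # 3
```

Spacers would then be inserted between the returned clusters rather than between individual code points. The table-based approach trades this permissiveness for exactness, at the cost of keeping the tables up to date.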

singingtelegram and others added 9 commits April 24, 2021 19:40

fix emoji with zwj handling

81349e8

rework

3e4688e

missed ord

8c5a061

oops

5b291f7

oops

e0d5a6b

oooops

fe62a4a

cleanup

cf28909

consistency

8e60b2e

[pre-commit.ci] auto fixes from pre-commit.com hooks

5bdeca5

for more information, see https://pre-commit.ci
