This repository has been archived by the owner on Oct 2, 2024. It is now read-only.

Fix Emoji handling for wide.py #204

Open
wants to merge 9 commits into
base: master
from

Conversation

singingtelegram
Member

@singingtelegram singingtelegram commented Apr 25, 2021

Some emojis are composed of multiple Unicode characters joined by the Zero Width Joiner (U+200D). In its current form, create breaks those emojis apart (for example, !w2 👨‍✈️ returns 👨　‍　✈　️　).
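
To make the breakage concrete, here is a small sketch that lists the code points behind the example emoji. The helper name describe is only for illustration and is not part of wide.py.

```python
import unicodedata

# The emoji from the example above, spelled out with escapes so the
# individual code points stay visible in the source.
pilot = "\U0001F468\u200D\u2708\uFE0F"


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in text]


for code_point, name in describe(pilot):
    print(code_point, name)
# U+1F468 MAN
# U+200D ZERO WIDTH JOINER
# U+2708 AIRPLANE
# U+FE0F VARIATION SELECTOR-16
```

Iterating over the string character by character therefore sees four separate items, which is why padding inserted after each one pulls the emoji apart.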

My pull request aims to fix this behavior. Everything seems to work as intended during my limited testing.

@ja5087
Member

ja5087 commented Apr 25, 2021

Oh, good catch. This problem isn't limited to emojis, and handling only ZWJs won't solve it for all emojis either. Looking at this Wikipedia page, there are also things like skin tone modifiers and accents, and I'm not sure how your code would handle those.

The actual problem is about segmenting Unicode text into composed symbols (grapheme clusters) rather than individual code points, and I might suggest playing around with the regex module (not re), which supports nifty patterns like regex.findall(r'\X', word) that separate these properly:

Python 3.8.5 (default, Jan 27 2021, 15:41:15) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.findall(r'\X','👨‍👩‍👦🏳️‍🌈🏃🏻‍♀️')
['👨\u200d👩\u200d👦', '🏳️\u200d🌈', '🏃🏻\u200d♀️']
>>> 

The bad output is because GNOME terminal converts the ZWJ to text.

@steven676
Contributor

When I had to solve a very similar problem for the old Terminal Emulator for Android ages ago, the key observation was this: at least when it comes to everything other than emojis, Unicode code points with a display width of zero (combining diacritics, non-spacing marks and the like) generally can be thought of as if they attach to the previous code point. In other words, you'd want something like (UNTESTED)

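response = ""
pending = ""
for c in text:
  if char_width(c) == 0:
    # zero-width: attach directly to the previous code point
    response += c
  else:
    # emit the padding owed to the previous character first, so marks stay attached
    response += pending + c
    pending = WIDE_SPACE_CHAR * width
response += pending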

where char_width looks something like this:

import unicodedata
def char_width(c):
  # special cases:
  if ord(c) == 0x00ad:
    # SOFT HYPHEN: format character of width 1
    return 1
  elif 0x1160 <= ord(c) <= 0x11ff:
    # Hangul medial vowels and final consonants are conjoining when part of a Korean syllable block
    return 0
  elif 0xd7b0 <= ord(c) <= 0xd7ff:
    # Hangul medial vowels/final consonants in Hangul Jamo Extended-B
    return 0
  category = unicodedata.category(c)
  if category in ('Cf', 'Mn', 'Me'):
    # format character ('Cf'), non-spacing mark ('Mn') or enclosing mark ('Me')
    return 0
  else:
    # all other characters have nonzero width
    return 1

(Alternately, look into the wcwidth package.)
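
For reference, the loop and char_width can be combined into one runnable sketch. The function name widen, its width parameter, and the use of U+3000 IDEOGRAPHIC SPACE for WIDE_SPACE_CHAR are assumptions for illustration and may not match wide.py.

```python
import unicodedata

# Assumption: the padding character is U+3000 IDEOGRAPHIC SPACE.
WIDE_SPACE_CHAR = "\u3000"


def char_width(c):
    """Return 0 for code points that attach to the previous one, else 1."""
    cp = ord(c)
    if cp == 0x00AD:
        # SOFT HYPHEN is a format character but is treated as width 1
        return 1
    if 0x1160 <= cp <= 0x11FF or 0xD7B0 <= cp <= 0xD7FF:
        # conjoining Hangul medial vowels and final consonants
        return 0
    if unicodedata.category(c) in ("Cf", "Mn", "Me"):
        return 0
    return 1


def widen(text, width=1):
    """Pad each visible character, keeping zero-width marks on their base."""
    out = []
    pending = ""  # padding owed after the previous visible character
    for c in text:
        if char_width(c) == 0:
            out.append(c)
        else:
            out.append(pending + c)
            pending = WIDE_SPACE_CHAR * width
    out.append(pending)
    return "".join(out)


# The combining acute accent stays attached to the "e".
print(repr(widen("ne\u0301")))
```

With this approach an accented letter survives intact, but a ZWJ emoji such as "\U0001F468\u200D\u2708\uFE0F" still gets padding between the joiner and the airplane, because the airplane itself has nonzero width.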

Unfortunately, because emojis were a terrible mistake, you're still going to have to special-case emoji sequences. As far as I can tell, if you want to limit special-case handling to sequences that clients will actually display specially, the only way to do this is with a giant table (https://www.unicode.org/emoji/charts/emoji-zwj-sequences.html, https://www.unicode.org/emoji/charts/full-emoji-modifiers.html, and https://www.unicode.org/emoji/charts/emoji-list.html#country-flag at least; there may be others). If you're okay with your algorithm picking up potential sequences that clients don't recognize as special, you need to read and digest https://www.unicode.org/reports/tr51/ .
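
As a rough illustration of the second, table-free option, the sketch below groups code points into clusters with a few simple rules: zero-width characters and skin tone modifiers attach to the previous cluster, whatever follows a ZWJ is joined to it, and regional indicators are paired into flags. It is a deliberately simplified heuristic, not an implementation of TR51 or of full grapheme segmentation, and every function name in it is hypothetical.

```python
import unicodedata

ZWJ = "\u200d"


def is_zero_width(c):
    # Format characters and non-spacing/enclosing marks, except SOFT HYPHEN.
    return c != "\u00ad" and unicodedata.category(c) in ("Cf", "Mn", "Me")


def is_skin_tone(c):
    # Emoji modifiers U+1F3FB..U+1F3FF
    return 0x1F3FB <= ord(c) <= 0x1F3FF


def is_regional_indicator(c):
    # Regional indicator symbols U+1F1E6..U+1F1FF (flag letters)
    return 0x1F1E6 <= ord(c) <= 0x1F1FF


def split_clusters(text):
    """Split text into clusters that should not be pulled apart by padding."""
    clusters = []
    join_next = False  # True immediately after a ZWJ
    for c in text:
        prev = clusters[-1] if clusters else ""
        if not clusters:
            clusters.append(c)
        elif is_zero_width(c) or is_skin_tone(c) or join_next:
            clusters[-1] += c
        elif (is_regional_indicator(c) and len(prev) == 1
              and is_regional_indicator(prev)):
            # second half of a flag pair
            clusters[-1] += c
        else:
            clusters.append(c)
        join_next = (c == ZWJ)
    return clusters


sample = ("\U0001F468\u200D\U0001F469\u200D\U0001F466"  # family
          "\U0001F3F3\uFE0F\u200D\U0001F308"            # rainbow flag
          "\U0001F3C3\U0001F3FB\u200D\u2640\uFE0F")     # woman running, light skin tone
print(len(split_clusters(sample)))  # 3
```

Because it never consults a table, this will also glue together sequences that no client renders as a single glyph, which is the trade-off described above.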

3 participants