Documentation out of sync for grapheme.graphemes call #10

Open
EmilStenstrom opened this issue Mar 8, 2020 · 2 comments

@EmilStenstrom

The documentation gives this code:

>>> rainbow_flag = "🏳️‍🌈"
>>> [codepoint for codepoint in rainbow_flag]
['🏳', '️', '‍', '🌈']
>>> list(grapheme.graphemes("multi codepoint grapheme: " + rainbow_flag))
['m', 'u', 'l', 't', 'i', ' ', 'c', 'o', 'd', 'e', 'p', 'o', 'i', 'n', 't', ' ', 'g', 'r', 'a', 'p', 'h', 'e', 'm', 'e', ':', ' ', '🏳️‍🌈']

In reality, this is how the same code runs locally using Python 3.8 in the default macOS Terminal:

>>> rainbow_flag = "🏳️‍🌈"
>>> [codepoint for codepoint in rainbow_flag]
['🏳', '️', '\u200d', '🌈']
>>> list(grapheme.graphemes("multi codepoint grapheme: " + rainbow_flag))
['m', 'u', 'l', 't', 'i', ' ', 'c', 'o', 'd', 'e', 'p', 'o', 'i', 'n', 't', ' ', 'g', 'r', 'a', 'p', 'h', 'e', 'm', 'e', ':', ' ', '🏳️\u200d🌈']

I would expect the flag emoji to be held together as one character, like in the documentation.

@alvinlindstam
Owner

This is interesting; I wonder when it changed. I'm quite sure the documentation showed the actual output when I originally wrote it.

I consider this a documentation bug, in that the example does not clearly show what the function does. The function does keep the rainbow flag intact as one character/grapheme. The issue is that repr, which the interactive prompt uses to display values, escapes the zero-width joiner (U+200D) inside that string, so the result is not very useful to look at:

>>> print(rainbow_flag)
πŸ³οΈβ€πŸŒˆ
>>> print(repr(rainbow_flag))
'🏳️\u200d🌈'
>>> rainbow_flag
'🏳️\u200d🌈'
>>> repr(rainbow_flag)
"'🏳️\\u200d🌈'"
>>> rainbow_flag.encode('unicode-escape')
b'\\U0001f3f3\\ufe0f\\u200d\\U0001f308'

It should be the case that list(grapheme.graphemes("multi codepoint grapheme: " + rainbow_flag))[-1] == rainbow_flag in your snippet.
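
A minimal sketch of why the prompt shows `\u200d`, using only the standard library. It relies on the documented behaviour that `repr` escapes characters for which `str.isprintable()` is false; whether this behaviour changed between Python versions is not something the sketch establishes. The flag is written with escapes so the snippet does not depend on terminal or editor encoding.

```python
import unicodedata

# Rainbow flag: white flag + variation selector-16 + zero-width joiner + rainbow.
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"

# Inspect each code point: its Unicode category and whether Python
# treats it as printable (non-printable characters are escaped by repr).
for ch in rainbow_flag:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        "printable" if ch.isprintable() else "escaped by repr",
        unicodedata.name(ch, "<unnamed>"),
    )

# repr() escapes only the joiner; str() (used by print) leaves it alone.
escaped = repr(rainbow_flag)
print(escaped)
print(rainbow_flag)
```

The string itself is unchanged by any of this; only the way the interactive prompt displays it differs.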

I'll see if I can understand why repr does this, and whether I can find a different multi-scalar grapheme cluster for the demo that does not look odd when displayed with repr. Input on that is appreciated.
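
One possible workaround for the demo, offered as a sketch rather than a decision for the docs: keep the rainbow flag but display the result with `str` instead of `repr`. The helper `format_graphemes` is hypothetical and not part of the grapheme library, and the `clusters` list is a hardcoded stand-in for what `list(grapheme.graphemes(...))` is described as returning in this thread, so the snippet runs without the library installed.

```python
def format_graphemes(clusters):
    """Render strings like a list literal, using str() for each item
    instead of repr(), so non-printable joiners are not escaped."""
    return "[" + ", ".join(f"'{c}'" for c in clusters) + "]"


rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"

# Assumption: stand-in for list(grapheme.graphemes("a: " + rainbow_flag)),
# with the flag kept together as a single grapheme cluster.
clusters = ["a", ":", " ", rainbow_flag]

formatted = format_graphemes(clusters)
print(formatted)       # flag shown as typed, no escape sequence
print(repr(clusters))  # default list display, joiner escaped
```

The trade-off is that the output is no longer what the bare prompt prints, so the docs would need to show the `print` call explicitly.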

@EmilStenstrom
Author

@alvinlindstam Happy to hear it's only a documentation bug. I'm afraid I don't know when repr changed either, or what a better multi-scalar grapheme cluster would be.

Also: Thanks for this library, it was just what I needed to build my "convert datetimes across time zones with emoji"-library ;) Reference: https://github.com/EmilStenstrom/emojizones/blob/master/emojizones/convert.py#L84
