[BUG] wrong flag for emoji #655

Open
2 of 4 tasks
dseomn opened this issue Feb 22, 2025 · 15 comments
@dseomn
Contributor

dseomn commented Feb 22, 2025

ibus-typing-booster version

2.27.27-1

Distribution and Version

Debian testing/unstable

Desktop Environment and Version

GNOME 47.3-1

Session Type

  • Wayland
  • X11

Application and Version

ptyxis 47.8-1

Summary of the bug

With dictionary='en_US,nl_NL', I get the US flag for Dutch emoji:

Image

How to reproduce the bug?

  1. Use en_US and nl_NL dictionaries.
  2. Type grappig to search for emoji in Dutch.

Always reproducible?

  • Yes
  • No

Which Typing Booster options/settings do you use?

[/]
addspaceoncommit=false
autosettings=[['prefercommit', 'false', '^SDL2_Application:']]
candidatesdelaymilliseconds=uint32 0
dictionary='en_US,nl_NL'
emojipredictions=true
emojitriggercharacters=''
flagdictionary=true
inputmethod='t-rfc1345-plus'
keybindings={'commit': <['Return']>, 'commit_and_forward_key': <['Left']>, 'commit_candidate_1': <['KP_1', 'F1']>, 'commit_candidate_1_plus_space': <@as []>, 'commit_candidate_2': <['KP_2', 'F2']>, 'commit_candidate_2_plus_space': <@as []>, 'commit_candidate_3': <['KP_3', 'F3']>, 'commit_candidate_3_plus_space': <@as []>, 'commit_candidate_4': <['KP_4', 'F4']>, 'commit_candidate_4_plus_space': <@as []>, 'commit_candidate_5': <['KP_5', 'F5']>, 'commit_candidate_5_plus_space': <@as []>, 'commit_candidate_6': <['KP_6', 'F6']>, 'commit_candidate_6_plus_space': <@as []>, 'commit_candidate_7': <['KP_7', 'F7']>, 'commit_candidate_7_plus_space': <@as []>, 'commit_candidate_8': <['KP_8', 'F8']>, 'commit_candidate_8_plus_space': <@as []>, 'commit_candidate_9': <['KP_9', 'F9']>, 'commit_candidate_9_plus_space': <@as []>, 'toggle_emoji_prediction': <['Shift+Control+E']>, 'toggle_off_the_record': <@as []>}
offtherecord=true
pagesize=9
recordmode=3
shownumberofcandidates=true
wordpredictions=false

Anything else?

No response

@mike-fabian
Owner

I cannot reproduce this ☹.

I think matches in the emoji data should not display dictionary flags at all:

Image

I think the relevant settings were the same as yours when I created this screenshot.

https://github.com/mike-fabian/ibus-typing-booster/blob/main/engine/hunspell_table.py#L1130

    def _append_candidate_to_lookup_table(
            self, phrase: str = '',
            user_freq: int = 0,
            comment: str = '',
            from_user_db: bool = False,
            spell_checking: bool = False) -> None:
        '''append candidate to lookup_table'''
        if not phrase:
            return
        phrase = itb_util.normalize_nfc_and_composition_exclusions(phrase)
        dictionary_matches: List[str] = (
            self.database.hunspell_obj.spellcheck_match_list(phrase))
        [...]
        if dictionary_matches:
            [...]
        if self._flag_dictionary:
            [...]
            for dictionary in dictionary_matches:
                    phrase += self._dictionary_flags.get(dictionary, '')

So if dictionary_matches is empty, no flags should be appended.

And for emoji, there are usually no matches in any dictionary.
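
The flag-appending step from the excerpt above can be modelled in isolation. This is only a minimal sketch, not the code from hunspell_table.py: append_flags and its arguments are hypothetical names, and the list of matches is passed in directly instead of coming from a spellchecker.

```python
from typing import Dict, List


def append_flags(phrase: str,
                 dictionary_matches: List[str],
                 dictionary_flags: Dict[str, str]) -> str:
    '''Return phrase followed by the flag of every matching dictionary.

    Hypothetical stand-in for the loop at the end of
    _append_candidate_to_lookup_table().
    '''
    for dictionary in dictionary_matches:
        # A dictionary without a known flag contributes nothing.
        phrase += dictionary_flags.get(dictionary, '')
    return phrase


flags = {'en_US': '🇺🇸', 'nl_NL': '🇳🇱'}
# No spellcheck match: the candidate is left unchanged.
print(append_flags('💩', [], flags))
# A backend that reports an en_US match produces the symptom from the report.
print(append_flags('💩', ['en_US'], flags))
```

So a flag on an emoji can only appear if the spellchecker reports a match for it.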

Is the following different on your system?:

(I am doing this in /usr/share/ibus-typing-booster/engine to be able to do import hunspell_suggest)

mfabian@f41:/usr/share/ibus-typing-booster/engine$ python
Python 3.13.2 (main, Feb  4 2025, 00:00:00) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hunspell_suggest
>>> hunspell_suggest.IMPORT_HUNSPELL_SUCCESSFUL
False
>>> hunspell_suggest.IMPORT_ENCHANT_SUCCESSFUL
True
>>> h = hunspell_suggest.Hunspell(['en_US', 'nl_NL'])
>>> h.spellcheck_match_list('💩')
[]

So 💩 is found in neither the en_US nor the nl_NL dictionary.

I thought that maybe you were using python3-pyhunspell instead of python3-enchant. Typing Booster prefers python3-enchant but falls back to python3-pyhunspell if python3-enchant is not available. Near the beginning of /usr/share/ibus-typing-booster/engine/hunspell_suggest.py there is:

IMPORT_ENCHANT_SUCCESSFUL = False
IMPORT_HUNSPELL_SUCCESSFUL = False
try:
    import enchant # type: ignore
    IMPORT_ENCHANT_SUCCESSFUL = True
except (ImportError,):
    try:
        import hunspell # type: ignore
        IMPORT_HUNSPELL_SUCCESSFUL = True
    except (ImportError,):
        pass

Depending on which import succeeded there, the code that follows uses either python3-enchant or python3-hunspell.

But I have now tried both and it makes no difference: in both cases I get no match for 💩 in either the en_US or the nl_NL dictionary.

@mike-fabian
Owner

Really weird, I tried on Debian testing/unstable now and cannot reproduce it there either:

Image

mfabian@debian-testing:~$ dconf dump /org/freedesktop/ibus/engine/typing-booster/
[/]
addspaceoncommit=true
dictionary='nl_NL,en_US'
emojipredictions=true
inputmethod='t-rfc1345'
shownumberofcandidates=true
wordpredictions=false
mfabian@debian-testing:~$ 

@mike-fabian
Owner

I forgot the flagdictionary=true setting:

mfabian@debian-testing:~$ dconf dump /org/freedesktop/ibus/engine/typing-booster/
[/]
addspaceoncommit=true
dictionary='nl_NL,en_US'
emojipredictions=true
flagdictionary=true
inputmethod='t-rfc1345'
shownumberofcandidates=true
wordpredictions=false
mfabian@debian-testing:~$ 

Now I can reproduce it:

Image

@mike-fabian
Owner

fabian@debian-testing:/usr/share/ibus-typing-booster/engine$ python3
Python 3.13.2 (main, Feb  5 2025, 01:23:35) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hunspell_suggest
>>> hunspell_suggest.IMPORT_HUNSPELL_SUCCESSFUL
False
>>> hunspell_suggest.IMPORT_ENCHANT_SUCCESSFUL
True
>>> h = hunspell_suggest.Hunspell(['en_US', 'nl_NL'])
>>> h.spellcheck_match_list('💩')
['en_US']
>>> 

@mike-fabian
Owner

Debian testing:

mfabian@debian-testing:~$ python3
Python 3.13.2 (main, Feb  5 2025, 01:23:35) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import enchant
>>> d = enchant.Dict('en_US')
>>> d.check('💩')
True
>>> 

Fedora 41:

mfabian@f41:~$ python3
Python 3.13.2 (main, Feb  4 2025, 00:00:00) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import enchant
>>> d = enchant.Dict('en_US')
>>> d.check('💩')
False
>>> 

Hm, what does that mean?
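
One way to narrow this down is to ask enchant which provider backs the dictionary on each system and to run the same probe words through it. The following is a rough sketch, assuming pyenchant is installed. probe_enchant is a hypothetical helper name; enchant.dict_exists(), enchant.Dict(), Dict.provider and Dict.check() are the pyenchant calls it relies on.

```python
from typing import Dict, List, Optional


def probe_enchant(tag: str, samples: List[str]) -> Optional[Dict[str, object]]:
    '''Report the enchant provider for tag and the check() result per sample.

    Returns None if pyenchant or the requested dictionary is unavailable.
    '''
    try:
        import enchant
    except ImportError:
        return None
    if not enchant.dict_exists(tag):
        return None
    dictionary = enchant.Dict(tag)
    return {
        # Name of the backend library enchant picked for this dictionary.
        'provider': dictionary.provider.name,
        'checks': {sample: dictionary.check(sample) for sample in samples},
    }


print(probe_enchant('en_US', ['💩', 'z', 'zz']))
```

Running this on both machines shows whether the different results come from different backends behind the same enchant API.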

@mike-fabian
Owner

I think for emoji, it makes no sense to show the dictionary flag. The dictionary flags are mostly useful if you write in more than one language at the same time to see which matches are valid words in which language. Something like this:

Image

Here I can see that “arrive” is a valid word in both French and English but “arriver” is only valid in French.

@mike-fabian
Owner

For emoji, the flags seem to make no sense anyway; emoji are valid in any language.

So I should probably just omit the flags for candidates which are emoji.

I am thinking about how to check "is this candidate an emoji?" quickly, without causing performance issues when filling the lookup table.
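
For illustration, one cheap heuristic would be to look at the Unicode general category of each character. This is only a sketch under that assumption, not the check that was eventually adopted in this issue. looks_like_emoji is a hypothetical name, and the category "So" (Symbol, other) also covers non-emoji symbols such as ©, while keycap sequences are missed.

```python
import unicodedata


def looks_like_emoji(candidate: str) -> bool:
    '''Heuristic: True if any character has the category "Symbol, other".'''
    return any(unicodedata.category(char) == 'So' for char in candidate)


print(looks_like_emoji('💩'))       # pictographs are in category So
print(looks_like_emoji('grappig'))  # plain letters are not
```

Each call is a table lookup per character, so the cost per candidate is small, but the false positives make it only an approximation.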

mike-fabian added a commit that referenced this issue Feb 22, 2025
…ve comments

Resolves: #655

If a candidate in a lookup table has a comment, then it must be

- an emoji
- a single “unusual” character or symbol
- a related word found by itb_nltk.py

In all these cases, it is not interesting to apply labels to the
lookup table for such candidates to show whether a spellcheck against
some language specific dictionaries returns True or False.

On some systems, something like 💩 might pass a spellcheck, for example on Debian:

    mfabian@debian-testing:~$ python3
    Python 3.13.2 (main, Feb  5 2025, 01:23:35) [GCC 14.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import enchant
    >>> d = enchant.Dict('en_US')
    >>> d.check('💩')
    True
    >>>

But that doesn’t mean it is useful to append a 🇺🇸 label to a 💩 to
indicate that 💩 is a valid word in the en_US dictionary.
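
The rule described in this commit message can be sketched as follows. This is not the diff of the commit, which is not shown in this thread. label_candidate and its parameters are hypothetical names, and the only behaviour taken from the message is "a candidate with a comment gets no dictionary labels".

```python
from typing import Dict, List


def label_candidate(phrase: str,
                    comment: str,
                    dictionary_matches: List[str],
                    dictionary_flags: Dict[str, str]) -> str:
    '''Append dictionary labels only to candidates without a comment.'''
    if comment:
        # Emoji, unusual symbols and related words carry a comment,
        # so the spellcheck result is not shown for them.
        return phrase
    return phrase + ''.join(
        dictionary_flags.get(dictionary, '')
        for dictionary in dictionary_matches)


flags = {'en_US': '🇺🇸', 'nl_NL': '🇳🇱'}
print(label_candidate('💩', 'pile of poo', ['en_US'], flags))
print(label_candidate('water', '', ['en_US', 'nl_NL'], flags))
```

The appeal of this rule is that it needs no emoji detection at all: the comment is already known when the candidate is appended.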
@mike-fabian
Owner

With the patch ae92568 applied it looks like this on Debian testing:

Image

Image

Image

@mike-fabian
Owner

In the last of the 3 screenshots in the previous comment one can see that the flags are now omitted for the 💩 emoji but not for the “normal” words which match in the nl_NL and/or en_US dictionaries.

@dseomn
Contributor Author

dseomn commented Feb 22, 2025

I think I mostly figured out the difference between distros. What does this show on Fedora? I'm guessing it's not Aspell?

In [41]: import enchant

In [42]: enchant.Dict('en_US').provider
Out[42]: <Enchant: Aspell Provider>

From http://aspell.net/0.50-doc/man-html/4_Customizing.html#SECTION00522000000000000000

ignore,-W
(integer) ignore words <= n chars

I found that by digging through the code first, and https://github.com/GNUAspell/aspell/blob/4295413512cb1ceeba741876d12612e74c77f14b/modules/speller/default/speller_impl.cpp#L141 is what stood out to me as possibly causing this difference. But that uses a char, which I wouldn't expect to work with an emoji in UTF-8. Maybe my system is using a different implementation though, not the one in modules/speller/default/speller_impl.cpp, and maybe that other implementation treats ignore as a number of code points instead of bytes. Just to try out my theory a bit more:

In [49]: d = enchant.Dict('en_US')

In [50]: d.check('z')
Out[50]: True

In [51]: d.check('ñ')
Out[51]: True

I think your patch to just not show flags for emoji makes sense though, I was just curious what was going on with the dictionaries.

@dseomn
Contributor Author

dseomn commented Feb 22, 2025

Oops, forgot to show why I think ignore is set to 1 on my system:

In [52]: d.check('ña')
Out[52]: False

In [53]: d.check('zz')
Out[53]: False
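
The hypothesis above can be written down as a small model. This is a sketch of the theory only, not of Aspell's implementation. passes_ignore_rule is a hypothetical name, and ignore=1 is the value guessed in this comment.

```python
def passes_ignore_rule(word: str, ignore: int, count_bytes: bool = False) -> bool:
    '''True if word would be skipped by an "ignore words <= n chars" rule.

    count_bytes selects whether the length is measured in UTF-8 bytes
    or in Unicode code points.
    '''
    length = len(word.encode('utf-8')) if count_bytes else len(word)
    return length <= ignore


for word in ('z', 'ñ', '💩', 'zz', 'ña'):
    print(word,
          passes_ignore_rule(word, 1),
          passes_ignore_rule(word, 1, count_bytes=True))
```

Counted in code points, 'z', 'ñ' and '💩' all have length 1 and would be accepted, while 'zz' and 'ña' would not, which is the pattern seen in the sessions above. Counted in bytes, 'ñ' (2 bytes) and '💩' (4 bytes) would not pass.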

@mike-fabian
Owner

On Fedora:

mfabian@f41:~$ python3
Python 3.13.2 (main, Feb  4 2025, 00:00:00) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import enchant
>>> enchant.Dict('en_US').provider
<Enchant: Hunspell Provider>
>>> d = enchant.Dict('en_US')
>>> d.check('z')
True
>>> d.check('ñ')
False
>>> d.check('ña')
False
>>> d.check('zz')
False
>>> 

@mike-fabian
Owner

Using flags for languages has an obvious shortcoming though: there is no one-to-one mapping between languages and country flags.

India, for example, has a lot of languages but only one flag. To produce something unique if several languages share the same flag, I have this helper function, which produces unique labels for the list of dictionaries used:

https://github.com/mike-fabian/ibus-typing-booster/blob/main/engine/itb_util.py#L2358C1-L2382C16

def get_flags(dictionaries: List[str]) -> Dict[str, str]:
    # pylint: disable=line-too-long
    '''
    Examples:

    >>> get_flags(['de_DE', 'fr_FR', 'eo'])
    {'de_DE': '🇩🇪', 'fr_FR': '🇫🇷', 'eo': '🌍'}
    >>> get_flags(['fr_FR', 'de_DE', 'fy_DE', 'eo', 'de', '150'])
    {'fr_FR': '🇫🇷fr_FR', 'de_DE': '🇩🇪de_DE', 'fy_DE': '🇩🇪fy_DE', 'eo': '🌍eo', 'de': '🌍de', '150': '🌍150'}
    '''
    # pylint: enable=line-too-long
    flags: Dict[str, str] = {}
    flags_seen: Set[str] = set()
    duplicate_flags = False
    for dictionary in dictionaries:
        new_flag = get_flag(dictionary)
        flags[dictionary] = new_flag
        if new_flag in flags_seen:
            duplicate_flags = True
        flags_seen.add(new_flag)
    if duplicate_flags:
        for key, flag in flags.items():
            if not flag.endswith(key):
                flags[key] += key
    return flags
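
The get_flag() helper called above is not shown in this thread. As a rough illustration of why two dictionaries can end up with the same flag, here is a hypothetical stand-in (simple_flag is not the real function) that builds the flag from the region part of the name using regional indicator symbols and falls back to the globe seen in the docstring.

```python
import re

REGIONAL_INDICATOR_A = 0x1F1E6  # code point of REGIONAL INDICATOR SYMBOL LETTER A
FALLBACK = '🌍'


def simple_flag(dictionary: str) -> str:
    '''Return a flag emoji for names like "de_DE", else the fallback.'''
    match = re.fullmatch(r'[a-z]{2,3}_([A-Z]{2})', dictionary)
    if not match:
        return FALLBACK
    # Map each region letter A-Z to its regional indicator symbol.
    return ''.join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord('A'))
        for letter in match.group(1))


print(simple_flag('de_DE'), simple_flag('fy_DE'), simple_flag('eo'))
```

Because only the region is used, de_DE and fy_DE map to the same flag, which is the case get_flags() disambiguates by appending the dictionary name.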

@mike-fabian
Owner

mike-fabian commented Feb 22, 2025

And if the results of using

☑️ Use flags for dictionary suggestions

are still not satisfactory with the unique results produced by get_flags(), then one can use something like:

☑️ Use label for dictionary suggestions [ {'*': '📖', 'fy_??': '🛟', 'de_DE': '🇩🇪', 'en_GB': '💂🏻', 'fr_FR': '🗼'} ]
☐ Use flags for dictionary suggestions

The value for the label used for dictionary suggestions can be a simple string but it can also be a Python dictionary specifying exactly which symbols to use for which dictionary.

{'*': '📖', 'fy_??': '🛟', 'de_DE': '🇩🇪', 'en_GB': '💂🏻', 'fr_FR': '🗼'} would use '🇩🇪' for the de_DE dictionary and '🛟' for the fy_NL and fy_DE dictionaries. '*': '📖' is the fallback symbol if none of the more specific dictionary glob patterns match.
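
How such a label dictionary could be resolved can be sketched with the standard fnmatch module. This is an assumption about the matching, not Typing Booster's code. resolve_label is a hypothetical name, and trying the patterns in insertion order with '*' last is a choice made for this sketch.

```python
from fnmatch import fnmatchcase
from typing import Dict


def resolve_label(dictionary: str, labels: Dict[str, str]) -> str:
    '''Return the label of the first specific glob pattern that matches.'''
    for pattern, label in labels.items():
        # The catch-all is handled last so it cannot shadow specific patterns.
        if pattern != '*' and fnmatchcase(dictionary, pattern):
            return label
    return labels.get('*', '')


labels = {'*': '📖', 'fy_??': '🛟', 'de_DE': '🇩🇪', 'en_GB': '💂🏻', 'fr_FR': '🗼'}
for name in ('fy_NL', 'fy_DE', 'de_DE', 'it_IT'):
    print(name, resolve_label(name, labels))
```

With these labels, fy_NL and fy_DE both resolve through 'fy_??', de_DE through its exact entry, and it_IT through the '*' fallback.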

@dseomn
Contributor Author

dseomn commented Feb 22, 2025

Using flags for languages has an obvious shortcoming though: there is no one-to-one mapping between languages and country flags.

Totally non-serious idea that you definitely should not implement: Just pretend the regional indicator symbols used for flags can actually be used for language codes instead. So Swiss German would be the completely sensible 🇬🇸🇼.

mike-fabian added a commit that referenced this issue Feb 25, 2025