Preedit of accented words can be improved #231

psads-git · 2021-08-23T12:04:28Z

Hi, Mike,

Preedit of accented words can be improved. Let me explain my idea. In Portuguese, many words contain an accented character (typically only one). Take the word

estão

The prediction of the word should be fired as soon as the accent

~

is typed and not after the entire accented character is typed (ã), in order to improve the typing speed.

Thanks!

mike-fabian · 2021-08-23T16:20:47Z

Prediction is accent insensitive at the moment (for Portuguese).

I.e. estã and esta give you exactly the same predictions.

This has the advantage that one often doesn’t have to care about typing the accents at all and can just select them from the prediction.

Like this:

mike-fabian · 2021-08-23T16:22:06Z

For me, this is very helpful when typing French or Spanish. I don’t know these languages well and often make mistakes in the accents, but with this feature I can write the word first without the accents if I am not sure and then select the correct version.

mike-fabian · 2021-08-23T16:25:06Z

Here is a list of languages where this accent insensitive matching makes sense:

https://github.com/mike-fabian/ibus-typing-booster/blob/main/engine/hunspell_suggest.py#L67

# List of languages where accent insensitive matching makes sense:
ACCENT_LANGUAGES = {
    'af': '',
    'ast': '',
    'az': '',
    'be': '',
    'bg': '',
    'br': '',
    'bs': '',
    'ca': '',
    'cs': '',
    'csb': '',
    'cv': '',
    'cy': '',
    'da': 'æÆøØåÅ',
    'de': '',

   and more like this ...

mike-fabian · 2021-08-23T16:28:47Z

As you see, pt is in that list.

You might notice that some languages have a list of special characters after the language name, for example:

    'da': 'æÆøØåÅ',

That is because for a native speaker of Danish, “å” is not an accented version of “a” but a completely different letter. In Danish, “å” is also sorted after “z”, not after “a”!

In German, an “ä” is considered as some variation of “a” and therefore sorted as a secondary difference after “a”.

mike-fabian · 2021-08-23T16:36:31Z

For example, when using Danish, typing “Smørrebrød” gives me a match:

mike-fabian · 2021-08-23T16:38:45Z

But typing “Smorrebrod” does not give me a match (When using the Danish dictionary):

mike-fabian · 2021-08-23T16:39:28Z

That is because in Danish, “ø” is considered a different letter, not a variation of “o”.

mike-fabian · 2021-08-23T16:40:37Z

Now this might depend on whether one is a native speaker of a language or not.

A non-native speaker of Danish who is trying to learn Danish might find it helpful if “Smorrebrod” did actually match “Smørrebrød”.

psads-git · 2021-08-23T16:41:25Z

Thanks, Mike. The point is that typing the accent increases the predictive ability of ibus-typing-booster.

out.mp4

mike-fabian · 2021-08-23T16:47:30Z

Currently this behaviour is hardcoded in the list shown above, so when using Portuguese, you have no choice: matching is always accent insensitive. When using Danish, matching for the characters in the exception list above is accent sensitive, while matching for all other accented characters is still accent insensitive.

As users might disagree about this, especially users who are native speakers and users who are not, this should not be just hardcoded.

There should be an option, probably with 3 values:

Accent insensitive matching:   [always | never | according to the language rules]

When this option is set to “always”, accent insensitive matching would occur even for Danish when typing “Smorrebrod”.

When this option is set to “never”, accent insensitive matching would never happen, not even for “grun” -> “grün” in German or “estao” -> “estão” in Portuguese.

When this option is set to “according to language rules” this would be the current behaviour, generally accent insensitive matching is done but some languages may have some characters which are exceptions and are matched accent sensitive.
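A minimal sketch of how such a three-valued option could select the normalization used for matching. This is not the code from hunspell_suggest.py: the function names, the `NON_DECOMPOSABLE` mapping, and the trimmed-down exception table are assumptions made for illustration (only the Danish entry is copied from the list above, and the `pt` value is assumed to be empty).

```python
import unicodedata

# Hypothetical excerpt of the exception table; 'da' is taken from the list above.
ACCENT_EXCEPTIONS = {'da': 'æÆøØåÅ', 'de': '', 'pt': ''}

# Letters without a canonical decomposition need an explicit mapping.
# This small table is an assumption for the sketch, not the real one.
NON_DECOMPOSABLE = {'ø': 'o', 'Ø': 'O', 'æ': 'ae', 'Æ': 'AE'}


def strip_accents(text, keep=''):
    """Remove accents from text, leaving characters listed in keep untouched."""
    result = []
    for char in text:
        if char in keep:
            result.append(char)
            continue
        char = NON_DECOMPOSABLE.get(char, char)
        decomposed = unicodedata.normalize('NFD', char)
        result.append(''.join(c for c in decomposed
                              if not unicodedata.combining(c)))
    return ''.join(result)


def normalize_for_matching(text, language, mode='language rules'):
    """Return the form of text that would be compared when matching."""
    text = unicodedata.normalize('NFC', text)
    if mode == 'never':
        return text  # accent sensitive: compare as typed
    if mode == 'always':
        return strip_accents(text)  # ignore every accent
    if mode == 'language rules':
        return strip_accents(text, keep=ACCENT_EXCEPTIONS.get(language, ''))
    raise ValueError(f'unknown mode: {mode!r}')
```

Two strings match when their normalized forms are equal, so with “always” the input “Smorrebrod” would match “Smørrebrød” even for Danish, while with “language rules” it would not.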

psads-git · 2021-08-23T16:54:27Z

Yeah, Mike, that is an excellent idea!

mike-fabian · 2021-08-23T17:00:29Z

Thanks, Mike. The point is that typing the accent increases the predictive ability of ibus-typing-booster.
out.mp4

Yes, if you can type the accents correctly, the number of matching words is smaller and it is more likely that the correct one is among them. If I look in the Portuguese hunspell dictionary, I find:

$ grep ^agua /usr/share/myspell/pt_PT.dic 
aguaçal	[CAT=nc,G=m,N=s]
aguaça	[CAT=nc,G=f,N=s]
aguaceiro/fp	[CAT=nc,G=m,N=s]
aguada/p	[CAT=nc,G=f,N=s]
aguadeiro/p	[CAT=a_nc,G=m,N=s]
aguardar/XYPLD	[CAT=v,T=inf,TR=t]
aguardenteiro	[CAT=nc,G=m,N=s]
aguardente/p	[CAT=nc,G=f,N=s]
aguardentoso	[CAT=adj,N=s,G=m]
aguarela/p	[CAT=nc,G=f,N=s]
aguarelar/XYPL	[CAT=v,T=inf,TR=t]
aguarelista	[CAT=nc,G=_,N=s]
aguarrás	[CAT=nc,G=f,N=s]
aguar/YPLM	[CAT=v,T=inf,TR=t,I=3]
aguas/PL	[$aguar$CAT=v,T=inf,TR=_$P=2,N=s,T=p]
agua/PL	[$aguar$CAT=v,T=inf,TR=_$P=3,N=s,T=p]
aguamos/PL	[$aguar$CAT=v,T=inf,TR=_$P=1,N=p,T=p]
aguais/PL	[$aguar$CAT=v,T=inf,TR=_$P=2,N=p,T=p]
aguam/PL	[$aguar$CAT=v,T=inf,TR=_$P=3,N=p,T=p]
agua/PL	[$aguar$CAT=v,T=inf,TR=_$P=2,N=s,T=i]
aguai/PL	[$aguar$CAT=v,T=inf,TR=_$P=2,N=p,T=i]

and

$ grep ^água /usr/share/myspell/pt_PT.dic 
água-ardente/p	[CAT=nc,G=f,N=s]
água-chilra	[CAT=nc,G=f,N=s]
água-forte	[CAT=nc,G=f,N=s]
água-furtada	[CAT=nc,G=f,N=s]
água/p	[CAT=nc,G=f,N=s]
água-marinha	[CAT=nc,G=f,N=s]
água-oxigenada	[CAT=nc,G=f,N=s]
água-pé/p	[CAT=nc,G=f,N=s]
águas-furtadas	[$água-furtada$CAT=nc,G=f,N=s$N=p]
água-tinta	[CAT=nc,G=f,N=s]

Accent insensitive matching gives you the results of both, no matter whether you typed “agua” or “água”.

(And what makes it worse is that the hunspell dictionaries have no information about which words are common and which are not!)

But if you know that you want “ág” plus something and not “ag” plus something, then accent sensitive matching would help.

So if you prefer typing the accents correctly yourself, accent sensitive matching is better.
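The difference can be illustrated with a few of the words from the grep output above. This is only a sketch of prefix matching over a plain list; the `complete` and `remove_accents` helpers are hypothetical and do not reflect how the dictionaries are actually searched.

```python
import unicodedata

# A handful of words taken from the two grep outputs above.
WORDS = ['aguada', 'aguardar', 'aguarela',
         'água', 'água-forte', 'águas-furtadas']


def remove_accents(text):
    """Decompose to NFD and drop the combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))


def nfc(text):
    """Keep accents, but in one consistent normalization form."""
    return unicodedata.normalize('NFC', text)


def complete(prefix, words, accent_sensitive):
    """Return the words starting with prefix under the chosen matching."""
    key = nfc if accent_sensitive else remove_accents
    wanted = key(prefix)
    return [word for word in words if key(word).startswith(wanted)]
```

With `accent_sensitive=False`, both “ag” and “ág” return all six words; with `accent_sensitive=True`, “ág” returns only the three words starting with “água”.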

But this is really a matter of choice, I prefer accent insensitive matching a lot, even when typing my native language (which is German). I often type words without the accents and select the correct version.

For German, I can also type accents correctly without problems if I want to, so for German accent sensitive matching would work for me, although I still prefer accent insensitive matching, even for German. For French, accent insensitive matching helps me a lot, as I make too many mistakes when typing the accents and would therefore often not get the right matches at all if the matching were accent sensitive.

So this really should be a user option.

mike-fabian · 2021-08-23T17:09:48Z

I haven't created that option yet, although I thought of it immediately when a user from Scandinavia requested exceptions for characters like ø, å, .... I was thinking about exactly how to make this optional and got confused, because one could make it optional in an even more fine-grained way, and I was wondering whether I should do that and how exactly.

If there is only an option with 3 values like

Accent insensitive matching:   [always | never | according to the language rules]

then it is obviously already better than the current situation with no option. But what if one wants accent insensitive matching for one language and not for others?

For example, I as a native German speaker and learner of French and Spanish have the German, French, and Spanish dictionaries configured in ibus-typing-booster.

Maybe I would want accent sensitive matching in the German dictionary but accent insensitive matching in the French and Spanish dictionaries.

So it might be useful if one could set this up not with a single option for all dictionaries, but with more fine grained options per dictionary.
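As a rough sketch, per-dictionary settings could be a simple mapping from dictionary name to one of the three values, falling back to the language rules for dictionaries without an entry. The dictionary names, the `mode_for` helper, and the default are all assumptions for illustration; nothing like this exists in the setup tool yet.

```python
VALID_MODES = ('always', 'never', 'language rules')
DEFAULT_MODE = 'language rules'  # assumed default: the current behaviour


def mode_for(dictionary, per_dictionary_modes):
    """Look up the accent matching mode configured for one dictionary."""
    mode = per_dictionary_modes.get(dictionary, DEFAULT_MODE)
    if mode not in VALID_MODES:
        raise ValueError(f'invalid mode {mode!r} for {dictionary}')
    return mode


# Hypothetical setup: accent sensitive German, accent insensitive
# French and Spanish.
settings = {'de_DE': 'never', 'fr_FR': 'always', 'es_ES': 'always'}
```

A dictionary that is not mentioned in `settings`, for example `pt_PT`, would keep the current language-rule behaviour.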

And then I wondered what the user interface for this should look like ... and got confused ...

One option is rather easy to implement; making this configurable per language might be even more useful than a single option, but the UI is going to get complicated.

psads-git · 2021-08-23T17:11:45Z

Indeed, Mike, it is essentially a matter of preference.

There is a small detail that may improve ibus-typing-booster performance, in the accent sensitive case. Consider the word

também

ibus-typing-booster should display its prediction as soon as the accent is typed and not after the accented character is typed (é, in this example).

psads-git · 2021-08-23T17:16:19Z

I think there is plenty of space in the gui:

mike-fabian · 2021-08-23T17:20:48Z

You write: “as soon as the accent is typed”.

I wonder how you type your accents. There are many ways. I can type an ü for example:

Hit a key with ü directly on my keyboard
Use the t-latn-post input method (add it in the ibus-typing-booster setup) and type u"
Use the t-latn-pre input method and type "u
Type <dead_diaeresis> <u> (<dead_diaeresis> is a key which produces a dead ", which does nothing at first and becomes ü when a u follows; this is similar to the t-latn-pre input method but a completely different mechanism)
Type <Multi_key> <quotedbl> <u>
Type <Multi_key> <u> <quotedbl>
Type u followed by a combining diaeresis (U+0308 COMBINING DIAERESIS)

On my keyboard layout (a heavily customized version of a US English layout), I can actually type all of the above to get an ü.

My preferred method is the t-latn-post input method, i.e. most of the time I type u" to get an ü.

mike-fabian · 2021-08-23T17:22:04Z

I think there is plenty of space in the gui:

Yes, at the end of each dictionary line in the setup tool, there could be a combobox where you have the above mentioned 3 choices (accent sensitive, insensitive, language rules).

psads-git · 2021-08-23T17:28:25Z

I cannot name the way I and everyone here in Portugal types accented words. However, the process consists of two steps:

One presses the key with the wanted accent;
One presses the key with the letter on which one wants to place the accent.

That is why my suggestion tends to save the second step!

mike-fabian · 2021-08-23T17:29:01Z

My current feeling is, that if I make only one option with these 3 choices for all languages, then I would probably regret it later. Because most likely I would need to expand it later to more fine grained control for each dictionary separately and that would be a nasty change with backwards compatibility problems.

So at the moment I think I should make this optional per language immediately, even if it is far more complicated to implement.

mike-fabian · 2021-08-23T17:29:53Z

I cannot name the way I and everyone here in Portugal types accented words. However, the process consists of two steps:
1. One presses the key with the wanted accent;

2. One presses the key with the letter on which one wants to place the accent.

That seems to me you are using so called “dead keys”.

psads-git · 2021-08-23T17:31:05Z

To add the option per language is even better than to all, since it gives more freedom to the user.

mike-fabian · 2021-08-23T17:31:46Z

To add the option per language is even better than to all, since it gives more freedom to the user.

Yes, and in the long run I will have to do that anyway, so I had better not postpone it.

mike-fabian · 2021-08-23T17:32:50Z

I guess you are using the first keyboard layout from /usr/share/X11/xkb/symbols/pt, which is:

default partial alphanumeric_keys
xkb_symbols "basic" {

    include "latin(type4)"
    name[Group1]="Portuguese";

    key <TLDE> { [     backslash,             bar,        notsign,          notsign ] };
    key <AE03> { [             3,      numbersign,       sterling,         sterling ] };
    key <AE04> { [             4,          dollar,        section,           dollar ] };
    key <AE11> { [    apostrophe,        question,      backslash,     questiondown ] };
    key <AE12> { [ guillemotleft,  guillemotright,   dead_cedilla,      dead_ogonek ] };

    key <AD11> { [          plus,        asterisk, dead_diaeresis,   dead_abovering ] };
    key <AD12> { [    dead_acute,      dead_grave,     dead_tilde,      dead_macron ] };
    key <BKSL> { [    dead_tilde, dead_circumflex,     dead_grave,       dead_breve ] };

    key <AC10> { [      ccedilla,        Ccedilla,     dead_acute, dead_doubleacute ] };
    key <AC11> { [     masculine,     ordfeminine,dead_circumflex,       dead_caron ] };

    key <LSGT> { [          less,         greater,      backslash,        backslash ] };

    include "level3(ralt_switch)"
};

The second layout in that file is one without dead keys:

partial alphanumeric_keys
xkb_symbols "nodeadkeys" {

    include "pt(basic)"
    name[Group1]="Portuguese (no dead keys)";

    key <AE12> { [ guillemotleft,  guillemotright,        cedilla,           ogonek ] };
    key <AD11> { [          plus,        asterisk,       quotedbl,         quotedbl ] };
    key <AD12> { [         acute,           grave                                   ] };
    key <BKSL> { [    asciitilde,     asciicircum                                   ] };
    key <AC10> { [      ccedilla,        Ccedilla,          acute,      doubleacute ] };
    key <AC11> { [     masculine,     ordfeminine,    asciicircum,            caron ] };
    key <AB10> { [         minus,      underscore,  dead_belowdot,         abovedot ] };
};

mike-fabian · 2021-08-23T17:36:01Z

So probably you are using a layout like this one:

https://en.wikipedia.org/wiki/Portuguese_keyboard_layout#/media/File:KB_Portuguese.svg

The keys marked in red on that layout are dead keys.

psads-git · 2021-08-23T17:36:30Z

My keyboard is similar to this one:

https://www.worten.pt/i/8466d924afbaa5bd14e604fa3ca649a377762776.jpg

psads-git · 2021-08-23T17:37:08Z

So probably you are using a layout like this one:

https://en.wikipedia.org/wiki/Portuguese_keyboard_layout#/media/File:KB_Portuguese.svg

Exactly, Mike!

mike-fabian · 2021-08-23T17:42:22Z

The problem with these dead keys is, that they are handled in a very special way.

They don’t go directly into the preëdit. Maybe you noticed the option “Use color for the compose preview” in the setup tool. Dead keys and compose are basically the same mechanism. Try to use that option and choose an obvious colour like I did in this screenshot:

If you do that, it makes it more obvious what is going on.

mike-fabian · 2021-08-23T17:45:40Z

Peek.2021-08-23.19-44.mp4

psads-git · 2021-08-23T17:49:51Z

Thanks, Mike. I have just done that.

mike-fabian · 2021-08-23T17:51:02Z

The normal colour for the preëdit text is black.

I type est and it is black (and some completions are shown in yellow because I did choose that color for completions).

Now I type a dead_tilde and see est in black followed by a ~ in green.

Note that all completions have disappeared while the green ~ is there!

That is because the compose handling is a sort of preëdit inside a preëdit. The external preëdit (black) has to wait until the internal preëdit (green) is finished to continue searching for completions.

After the a has been typed, the green ~ plus the a combine to ã and this ã is black. Because the compose sequence is finished and the internal preëdit is now gone, the ã is now part of the “normal” preëdit.

mike-fabian · 2021-08-23T18:01:47Z

So this shows you how the compose sequence starting with a dead ~ could be completed.

What you see in that list of possible completions might be somewhat different than what I see because I include only those completions in the list which can actually be typed on the current keyboard layout.

If I didn't limit it to those which can be typed on the current keyboard layout, there would almost always be hundreds of possible completions. For example there is this:

$ grep '^<dead_tilde>.*ᾶ'   /usr/share/X11/locale/en_US.UTF-8/Compose
<dead_tilde> <Greek_alpha>       	: "ᾶ"   U1FB6 # GREEK SMALL LETTER ALPHA WITH PERISPOMENI

But if your keyboard layout doesn't even have a key <Greek_alpha>, then I omit this because it is probably not so interesting.

For a user of a Greek keyboard layout, I show this one but omit others which cannot be typed on the Greek keyboard layout.
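The filtering described here could look roughly like the following sketch. The compose table is a tiny hand-written excerpt (keysym sequences mapped to results), and `possible_completions` as well as the sets of layout keysyms are hypothetical; the real code reads the Compose file and the active keyboard layout.

```python
# Tiny illustrative excerpt of a compose table: key sequence -> result.
COMPOSE_TABLE = {
    ('dead_tilde', 'a'): 'ã',
    ('dead_tilde', 'n'): 'ñ',
    ('dead_tilde', 'greater'): '≳',
    ('dead_tilde', 'Greek_alpha'): 'ᾶ',
}


def possible_completions(started, layout_keysyms, table=COMPOSE_TABLE):
    """Return {remaining keys: result} for sequences that begin with the
    keys typed so far and can be finished on the given layout."""
    started = tuple(started)
    completions = {}
    for sequence, result in table.items():
        if len(sequence) <= len(started):
            continue
        if sequence[:len(started)] != started:
            continue
        remaining = sequence[len(started):]
        # Omit sequences needing a key the layout does not have.
        if all(keysym in layout_keysyms for keysym in remaining):
            completions[remaining] = result
    return completions
```

With a layout that has no `Greek_alpha` key, the ᾶ entry is omitted; with a Greek layout it is shown, while entries needing Latin letters are dropped.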

mike-fabian · 2021-08-23T18:05:58Z

In my video, I typed a dead ~ and Tab and then selected ≳ from the candidate list shown.

The candidate list shows in the first column what one could type to get this.

So in case of the ≳ it shows a > in the first column and ≳ in the second column.

This tells you that after typing a dead ~ you could type a > to get ≳.

So typing Tab when a compose sequence has been started but not finished yet tells you what choices you have to finish the sequence. This makes it easier to learn the possible compose sequences, certainly easier than reading the /usr/share/X11/locale/en_US.UTF-8/Compose file where all these sequences are defined.

psads-git · 2021-08-23T18:07:50Z

I guess there is no <Greek_alpha> key in my keyboard, Mike!

mike-fabian · 2021-08-23T18:08:52Z

While an unfinished compose sequence is typed, ibus-typing-booster basically stops everything else it is doing and waits until the compose sequence is finished and then continues with predictions. While the compose sequence is unfinished, the only things you can do are show possible completions with Tab, correct with Backspace, or cancel the compose sequence with Escape.
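The “preëdit inside a preëdit” behaviour can be modelled with a toy state machine. This is only a sketch under simplifying assumptions: the class, its three-entry compose table, and the key names are made up for illustration, Tab handling is left out, and the real engine is considerably more involved.

```python
class PreeditSketch:
    """Toy model of an outer preedit with an inner compose preedit."""

    # Hypothetical three-entry compose table.
    COMPOSE = {
        ('dead_tilde', 'a'): 'ã',
        ('dead_tilde', 'o'): 'õ',
        ('dead_tilde', 'n'): 'ñ',
    }

    def __init__(self):
        self.preedit = ''  # outer ("black") preedit
        self.compose = []  # inner ("green") unfinished compose sequence

    def predictions_active(self):
        """Predictions pause while a compose sequence is unfinished."""
        return not self.compose

    def press(self, key):
        if not self.compose:
            if key.startswith('dead_'):
                self.compose = [key]  # start the inner preedit
            elif len(key) == 1:
                self.preedit += key  # ordinary character
            return
        if key == 'Escape':
            self.compose = []  # cancel the compose sequence
            return
        if key == 'BackSpace':
            self.compose.pop()  # correct the compose sequence
            return
        self.compose.append(key)
        sequence = tuple(self.compose)
        if sequence in self.COMPOSE:
            # Finished: the result joins the normal preedit.
            self.preedit += self.COMPOSE[sequence]
            self.compose = []
        elif not any(s[:len(sequence)] == sequence for s in self.COMPOSE):
            self.compose = []  # invalid sequence: dropped in this sketch
```

Typing e, s, t, dead_tilde leaves the outer preedit at “est” with predictions paused; the following a finishes the sequence, the preedit becomes “estã”, and predictions resume.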

psads-git · 2021-08-23T18:10:05Z

While an unfinished compose sequence is typed, ibus-typing-booster basically stops everything else it is doing and waits until the compose sequence is finished and then continues with predictions. While the compose sequence is unfinished, the only things you can do are show possible completions with Tab, correct with Backspace, or cancel the compose sequence with Escape.

Got it, Mike!

mike-fabian · 2021-08-23T18:16:40Z

Even if I could get the ~ from the unfinished compose sequence, it would not be useful to complete anything. Because neither in the dictionaries nor in the database are things like est~ao which one could match with est~. The dictionary only has estão.

For matching, the dictionary and database are internally converted to NFD (Normalization form D):

https://unicode.org/reports/tr15/#Norm_Forms

I.e. what is matched against is actually something like esta~o where the ~ is a combining tilde.

And then the combining characters like the combining ~ are filtered out in case of accent insensitive matches and kept in case of accent sensitive matches like the Danish ø.
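What NFD does to such a word can be checked with Python's standard `unicodedata` module. The helper names below are made up for this sketch; only `unicodedata.normalize` and `unicodedata.combining` are real library calls.

```python
import unicodedata


def to_nfd(text):
    """Canonical decomposition: base characters followed by combining marks."""
    return unicodedata.normalize('NFD', text)


def drop_combining(text):
    """Accent insensitive form: decompose, then filter combining marks."""
    return ''.join(c for c in to_nfd(text) if not unicodedata.combining(c))


nfd = to_nfd('estão')
# nfd is 'e', 's', 't', 'a', U+0303 COMBINING TILDE, 'o':
# the tilde comes after the base character a, not before it.
starts_with_est_tilde = nfd.startswith('est\u0303')
```

Because the combining tilde follows the a, `starts_with_est_tilde` is false, which is why an est~ prefix has nothing to match against. Note also that ø has no canonical decomposition at all (unlike å, which decomposes to a plus U+030A), so NFD alone neither strips nor separates it.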

mike-fabian · 2021-08-23T18:24:06Z

That's why I asked how you type an ã because there are so many ways.

In case of using something like t-latn-post or t-latn-pre, the ~ would actually be part of the “normal” preëdit, not the compose preëdit. But it can be before or after the base character a. On some keyboard layouts one can actually type a followed by combining ~ (which is the NFD way!). Handwriting is usually done in the same way, writing the accent after the base character.

So I think starting to match something when only a ~ has been typed is near impossible, the possibilities are enormous.

Converting all the dictionaries on reading them to forms having ã, a~, and ~a would make the loaded dictionaries much bigger and the search much slower for very little gain.

psads-git · 2021-08-23T18:24:44Z

Well, Mike, in Portuguese, that would be easy since all accented characters are vowels! Therefore, ibus-typing-booster could search for:

estã
estẽ 
estĩ 
estõ 
estũ

😉

psads-git · 2021-08-23T18:27:36Z

Converting all the dictionaries on reading them to forms having ã, a~, and ~a would make the loaded dictionaries much bigger and the search much slower for very little gain.

I agree that the gain would be small.

mike-fabian · 2021-08-23T18:28:55Z

So a simple ~ might match something if the user uses t-latn-pre and often types est~ao with it: est~ then matches a previously typed estão, because it is remembered that what was actually typed was est~ao and what was committed was estão. The next time est~ is typed, it can complete to estão.

mike-fabian · 2021-08-23T18:33:47Z

See how est~ shows estão among the candidates here:

Peek.2021-08-23.20-32.mp4

This works only because I use t-latn-pre in this example and not a dead ~, that makes a big difference.

And of course I typed est~ao a few times before recording that video to make ibus-typing-booster learn this.
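The learning effect can be sketched with a small in-memory stand-in for the user database. The class and its methods are hypothetical; the real database also records the context (the previous words), which is left out here.

```python
from collections import defaultdict


class UserDatabaseSketch:
    """Remembers what was typed and which text was committed for it."""

    def __init__(self):
        # typed input -> committed text -> how often this happened
        self._counts = defaultdict(lambda: defaultdict(int))

    def learn(self, typed, committed):
        self._counts[typed][committed] += 1

    def candidates(self, prefix):
        """Committed texts whose recorded input starts with prefix,
        most frequently used first."""
        scores = defaultdict(int)
        for typed, commits in self._counts.items():
            if typed.startswith(prefix):
                for committed, count in commits.items():
                    scores[committed] += count
        return sorted(scores, key=lambda text: (-scores[text], text))


db = UserDatabaseSketch()
for _ in range(3):
    db.learn('est~ao', 'estão')  # typed with t-latn-pre, committed as estão
db.learn('esta', 'esta')
```

After this, `db.candidates('est~')` contains estão because the recorded input est~ao starts with est~. With a dead ~ the tilde never reaches the recorded input, so nothing comparable is learned.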

mike-fabian · 2021-08-23T18:40:36Z

On your Portuguese keyboard layout, using t-latn-pre would be very inconvenient though, because you don’t have a normal ~, only a dead ~. To get a normal ~ you would need to type the dead ~ twice; that could then combine with the following letter using t-latn-pre.

I.e. with t-latn-pre on your keyboard layout, you actually would need to type ~~a to get an ã.

That makes no sense for you; I just mentioned t-latn-pre because it is yet another way to type this, which can be very useful on layouts which do not have dead keys.

mike-fabian · 2021-08-23T18:43:26Z

Well, Mike, in Portuguese, that would be easy since all accented characters are vowels! Therefore, ibus-typing-booster could search for:
estã
estẽ 
estĩ 
estõ 
estũ
😉

But ibus-typing-booster doesn't know you are typing Portuguese. One can have several languages configured at the same time.
For example one could have a Spanish and a Portuguese dictionary configured at the same time. And in Spanish typing ~ could mean an n is coming to make a ñ. And while you are typing something into the preëdit, ibus-typing-booster cannot know from which of the several languages you may have configured the word you are typing is going to be.

mike-fabian · 2021-08-23T18:50:41Z

So I think matching something forward when only some accent has been typed like matching estão when only est~ has been typed doesn’t seem reasonably possible (except for special circumstances like when using t-latn-pre).

The amount of calculation for this would be huge, it would depend very much on which languages exactly are configured, lots of special casing, no high speed matching with patterns like regular expressions possible anymore.

So I think matching ~ will probably never work.

But matching accent sensitive, as discussed above, is possible and I think I will do that.

I.e. making ã match something different than what just a would match. That is possible and probably useful as a user option.

And it probably already would do most of what you want.

psads-git · 2021-08-23T18:54:02Z

Thanks, Mike. Your arguments have convinced me that matching ~ is not a good idea.

mike-fabian · 2021-08-23T18:57:28Z

Thanks, Mike. Your arguments have convinced me that matching ~ is not a good idea.

Great, but I'll do the other thing with the accent sensitive matching.

This might take quite a while though as it is really quite difficult to implement.

I think combobox buttons at the end of each dictionary line are a good idea, but I still need to think about how to save that to gsettings and read it back from there. I have a few ideas but I think I need to think about this for a few days before starting to implement anything.

psads-git · 2021-08-23T19:02:28Z

Thanks, Mike. That is nothing really urgent! So, take your time. No hurry at all!

mike-fabian · 2021-08-24T07:28:23Z

I remembered that there needs to be an extra option for the user database.

Each dictionary line needs to have an option whether to match accent insensitive [always | never | language rules].

And there needs to be an option whether to store accents the user typed in the user database.

Currently, accents are removed from the text the user typed when storing in the database.

That means if one types estã and then selects estão and commits, the ~ is dropped from the user input. So what is stored in the database is the user typed esta (without the ~) and then selected estão.

So the next time the user types either esta or estã, estão is a match in both cases because the user input with accents removed is esta in both cases and what was recorded in the database is also esta without the accent, so it matches.

The user database is language agnostic (which is a good thing!), it just records what the user typed in what context and which completion candidate was selected.

As some users may wish to make the matching more strict by matching accent sensitive, such an option has also to be added for the user database.

Maybe a simple checkbox is enough:

[✅] Accent sensitive matching in user database

An option with 3 values like for the dictionaries ([always | never | according to language rules]) doesn’t seem to make sense for the user database because the user database has no language. So maybe that simple checkbox which allows to switch it on or off is enough for the user database.

But one could also make it a more detailed option with more possibilities, maybe even allowing users to specify a list of characters they want matched accent sensitively:

[✅] Accent sensitive matching in user database
Accent sensitive matching in user database only for [ÅåØø]

I.e. when [ ] Accent sensitive matching in user database is off, this would be the current behaviour, all accents are ignored. The second option does not matter then.

When [✅] Accent sensitive matching in user database is on, and the list of exception characters is empty, all accents would be kept in the user database.

When [✅] Accent sensitive matching in user database is on, and the list of exception characters is not empty, the exception characters would be stored with their accents in the user database when the user types such characters but all other accented characters still get their accents stripped.
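The three cases above could be sketched as a single function deciding what form of the user input gets stored. The function name and parameters are hypothetical, and plain NFD stripping is assumed, so letters without a canonical decomposition (such as ø) pass through unchanged either way.

```python
import unicodedata


def input_for_user_db(typed, accent_sensitive=False, only_for=''):
    """Return the form of the typed input to store in the user database.

    accent_sensitive: the proposed checkbox.
    only_for: the proposed list of exception characters; if empty while
    the checkbox is on, all accents are kept.
    """
    typed = unicodedata.normalize('NFC', typed)
    if accent_sensitive and not only_for:
        return typed  # checkbox on, no list: keep every accent
    keep = only_for if accent_sensitive else ''
    result = []
    for char in typed:
        if char in keep:
            result.append(char)  # exception character: stored as typed
        else:
            decomposed = unicodedata.normalize('NFD', char)
            result.append(''.join(c for c in decomposed
                                  if not unicodedata.combining(c)))
    return ''.join(result)
```

With the checkbox off, estã is stored as esta (the current behaviour); with the checkbox on and the list set to ÅåØø, “på café” would be stored as “på cafe”.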

These options for the user database would only have an effect on new input.

Stuff which is already in the user database cannot be changed later. Theoretically I could remove accents from user input which is already in the user database, but there is no way to put them back. Because if there is only a in the user database, one cannot know whether the user really typed a or ã, ä, .... and the accent was stripped.

So after switching the [✅] Accent sensitive matching in user database to a different value, the effect would become visible slowly after more typing as the newly typed words get higher weight.

Side note: I have been thinking for a long time about a kind of expiry feature for the user database: words which have not been typed for a long, long time should fade away from the user database. Currently everything is kept forever. My user database, which is many years old, has 32 megabytes now:

$ ls ~/.local/share/ibus-typing-booster/user.db -lh
-rw-r--r--. 1 mfabian mfabian 32M  8月 24 08:43 /home/mfabian/.local/share/ibus-typing-booster/user.db

There is old junk inside because I tested some stuff many years ago.

That hurts less than one might think because if I never type it again, it never gets a higher score. It stays with a count of 1 in the database forever, but with such a low count it is unlikely that it will ever show up as a candidate. Words which I type often have much higher counts. But a needlessly huge database makes everything slower. I am thinking about something similar to radioactive decay: if an entry has been typed 10 times, don’t keep that count forever but slowly reduce it over time and drop the entry if it reaches 0. As time passes, old junk which is never typed again would be dropped automatically. Words typed recently would get higher scores than words typed months ago ...
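A minimal sketch of such a decay, assuming an exponential half-life. The half-life of 180 days, the drop threshold, and the data layout are arbitrary assumptions for illustration; since an exponentially decayed count never reaches exactly 0, the sketch drops entries below a threshold instead.

```python
def decayed_count(count, days_since_last_use, half_life_days=180.0):
    """Halve the count for every half_life_days without use."""
    return count * 0.5 ** (days_since_last_use / half_life_days)


def expire(entries, now_day, half_life_days=180.0, threshold=0.5):
    """entries maps word -> (count, day it was last typed).

    Returns word -> decayed score, dropping entries that have faded
    below the threshold."""
    kept = {}
    for word, (count, last_used_day) in entries.items():
        score = decayed_count(count, now_day - last_used_day, half_life_days)
        if score >= threshold:
            kept[word] = score
    return kept


# Hypothetical database contents: one word typed often and recently,
# one piece of old junk typed once long ago.
entries = {'estão': (10, 900), 'junk': (1, 0)}
remaining = expire(entries, now_day=1000)
```

Here the junk entry is dropped, while estão survives with a reduced score.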

psads-git · 2021-08-24T12:24:46Z

I find your radioactive decay idea quite interesting, Mike! However, I suspect that it may need some adjustment.

Imagine a person who has not used ibus-typing-booster for, say, two years and now resumes using it. With the radioactive decay idea, that person's database would be empty after two years!

I do agree that it is very important to reduce the database size, as the larger the size of the database the slower the queries.

The idea of allowing users to specify a list of characters they want matched accent sensitively is an excellent one: the more freedom the user is given, the better, up to a certain degree of GUI complexity.

mike-fabian · 2021-11-30T13:05:36Z

According to the tests done while working on

#251

I have doubts that fine grained configurability of accent sensitive matching is helpful at all.

It is quite a complicated change, with a complicated user interface to set it up which most users might not even understand.

And according to the tests I did recently, it will probably accomplish very little in improving the predictions. Doing completely accent insensitive predictions (accent insensitive user input and context) did not make the percent of characters saved any worse when I tested with the French book “Notre Dame de Paris”. There may be some cases where typing the correct accents would reduce the number of candidates a bit and one would get the correct candidate a bit earlier, but that really doesn’t seem to be the case often. If that happened often, probably it would have made the percent of characters saved worse when testing with that French book.

When doing this test with the French book, I had always perfect context, always the two previous words were remembered correctly in the database. In reality, this is not always the case. Sometimes because surrounding text didn’t work and maybe the fallback to remember the last two words didn’t help either because the cursor was moved or the focus was moved to a different window. In reality, it happens more often that the context is missing or even wrong than when doing such a test of reading from a book and then retyping that exact book.

When the context is empty, typing the accents correctly of the current input might help a bit more compared to when two words of perfect context are available.

I am not 100% sure, but I need to look at this again carefully. Maybe it is not worth doing.

psads-git · 2021-11-30T13:15:34Z

Given the improvements and the evidence meanwhile learned, I agree with your arguments, Mike.

mike-fabian self-assigned this Aug 23, 2021

mike-fabian added the enhancement label Aug 23, 2021

mike-fabian mentioned this issue Nov 15, 2021

Should search be case-sensitive? #251

Closed

mike-fabian added this to Mike’s project Jun 25, 2024

mike-fabian moved this to Todo in Mike’s project Jun 25, 2024
