Preedit of accented words can be improved #231
For me, this is very helpful when typing French or Spanish, I don’t know these languages well and often make mistakes in the accents, but with this feature, I can write the word first without the accents if I am not sure and then select the correct version. |
Here is a list of languages where this accent insensitive matching makes sense: https://github.com/mike-fabian/ibus-typing-booster/blob/main/engine/hunspell_suggest.py#L67 |
As you see, pt is in that list. You might notice that some languages have a list of special characters after the language name, for example:
That is because for a native speaker of Danish, “å” is not an accented version of “a” but a completely different letter. In Danish, “å” is also sorted after “z”, not after “a”! In German, an “ä” is considered as some variation of “a” and therefore sorted as a secondary difference after “a”. |
That is because in Danish, “ø” is considered a different letter, not a variation of “o”. |
Now this might depend on whether one is a native speaker of a language or not. A non-native speaker of Danish, who is trying to type in Danish, might find it helpful if “Smorrebrod” did actually match “Smørrebrød”. |
Thanks, Mike. The point is that typing the accent increases the predictive ability of ibus-typing-booster. |
Currently this behaviour is hardcoded in the list shown above, so when using Portuguese, you have no choice: matching is always accent insensitive. When using Danish, the characters in the above exception list are matched accent sensitive, while all other accented characters are matched accent insensitive. As users might disagree about this, especially native speakers versus non-native speakers, this should not be hardcoded. There should be an option, probably with 3 values:
When this option is set to “always”, accent insensitive matching would occur even for Danish when typing “Smorrebrod”. When this option is set to “never”, accent insensitive matching would never happen, not even for “grun” -> “grün” in German or “estao” -> “estão” in Portuguese. When this option is set to “according to language rules”, this would be the current behaviour: generally accent insensitive matching is done, but some languages may have some characters which are exceptions and are matched accent sensitive. |
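A minimal sketch of how the three values could behave, assuming a prefix match on accent-stripped keys. The names `LANGUAGE_EXCEPTIONS`, `NO_DECOMPOSITION`, `remove_accents`, `matching_key` and `matches` are hypothetical and are not taken from the ibus-typing-booster source; the Danish exception string is only an assumed example of what the hardcoded list might contain.

```python
import unicodedata

# Assumed exception table for illustration only; the real list is the one
# linked above in hunspell_suggest.py.
LANGUAGE_EXCEPTIONS = {'da': 'æøåÆØÅ', 'de': '', 'pt': ''}

# 'ø' has no canonical Unicode decomposition, so it needs an explicit mapping.
NO_DECOMPOSITION = {'ø': 'o', 'Ø': 'O'}

def remove_accents(text: str, keep: str = '') -> str:
    """Strip accents from text, leaving characters listed in `keep` untouched."""
    result = []
    for char in unicodedata.normalize('NFC', text):
        if char in keep:
            result.append(char)
        elif char in NO_DECOMPOSITION:
            result.append(NO_DECOMPOSITION[char])
        else:
            decomposed = unicodedata.normalize('NFD', char)
            result.append(''.join(
                c for c in decomposed if not unicodedata.combining(c)))
    return ''.join(result)

def matching_key(text: str, language: str, mode: str) -> str:
    """Return the key used for comparison under 'always', 'never' or 'language'."""
    if mode == 'never':
        return unicodedata.normalize('NFC', text)
    if mode == 'always':
        return remove_accents(text)
    # mode == 'language': strip accents except the language's own letters
    return remove_accents(text, keep=LANGUAGE_EXCEPTIONS.get(language, ''))

def matches(typed: str, word: str, language: str, mode: str) -> bool:
    """True if `typed` is a prefix of `word` under the given mode."""
    return matching_key(word, language, mode).startswith(
        matching_key(typed, language, mode))
```

With this sketch, `matches('Smorrebrod', 'Smørrebrød', 'da', 'always')` is true, while the same call with `'language'` is false because “ø” is kept as a separate letter for Danish.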
Yeah, Mike, that is an excellent idea! |
Yes, if you can type the accents correctly, the number of matching words is smaller and it is more likely that the correct one is among them. If I look in the Portuguese hunspell dictionary, I find:
and
Accent insensitive matching gives you the results of both, no matter whether you typed “agua” or “água”. (And what makes it worse is that the hunspell dictionaries have no information about which words are common and which are not!) But if you know that you want “ág” plus something and not “ag” plus something, then accent sensitive matching would help. So if you prefer typing the accents correctly yourself, accent sensitive matching is better. But this is really a matter of choice. I much prefer accent insensitive matching, even when typing my native language (which is German): I often type words without the accents and select the correct version. For German, I can also type accents correctly without problems if I want to, so accent sensitive matching would work for me there, although I still prefer accent insensitive matching. For French, accent insensitive matching helps me a lot, as I make too many mistakes when typing the accents and would therefore often not get the right matches at all if the matching were accent sensitive. So this really should be a user option. |
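To make the trade-off concrete, here is a small sketch that compares the candidate lists for “ag” and “ág” under both kinds of matching. The word list is an illustrative assumption, not an excerpt from the Portuguese hunspell dictionary, and `strip_accents` and `candidates` are hypothetical helper names.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove combining marks after decomposing to NFD."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

# Illustrative word list only; not taken from the hunspell dictionary.
WORDS = ['agora', 'agosto', 'agulha', 'água', 'águia']

def candidates(typed: str, accent_sensitive: bool) -> list:
    """Return the words from WORDS that start with the typed prefix."""
    typed = unicodedata.normalize('NFC', typed)
    if accent_sensitive:
        return [word for word in WORDS
                if unicodedata.normalize('NFC', word).startswith(typed)]
    key = strip_accents(typed)
    return [word for word in WORDS if strip_accents(word).startswith(key)]
```

In this toy list, accent insensitive matching returns all five words for either prefix, while accent sensitive matching narrows “ág” down to two candidates. The same narrowing means a mistyped accent finds nothing at all.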
I haven't created that option yet, although I immediately thought of it when a user from Scandinavia requested exceptions for characters like ø, å, .... I was thinking about how exactly to make this optional and got confused, because one could make it optional in a more fine grained way, and I was wondering whether I should do that and how exactly. If there is only an option with 3 values like
then it is obviously already better than the current situation with no option. But what if one wants accent insensitive matching for one language and not for others? For example, I, as a native German speaker and learner of French and Spanish, have the German, French, and Spanish dictionaries configured in ibus-typing-booster. Maybe I would want accent sensitive matching in the German dictionary but accent insensitive matching in the French and Spanish dictionaries. So it might be useful if one could set this up not with a single option for all dictionaries, but with more fine grained options per dictionary. And then I wondered what the user interface for this should look like ... and got confused ... A single option is rather easy to implement; making this configurable per language might be even more useful, but the UI is going to get complicated. |
Indeed, Mike, it is essentially a matter of preference. There is a small detail that may improve ibus-typing-booster performance in the accent sensitive case. Consider the word “também”: ibus-typing-booster should display its prediction as soon as the accent is typed, not after the accented character (é, in this example) is typed. |
You write: “as soon as the accent is typed”. I wonder how you type your accents. There are many ways. I can type an
On my keyboard layout (a heavily customized version of a US English layout), I can actually type all of the above to get an ü. My preferred method is the t-latn-post input method, i.e. most of the time I type |
I cannot name the way I and everyone here in Portugal types accented words. However, the process consists of two steps:
That is why my suggestion tends to save the second step! |
My current feeling is, that if I make only one option with these 3 choices for all languages, then I would probably regret it later. Because most likely I would need to expand it later to more fine grained control for each dictionary separately and that would be a nasty change with backwards compatibility problems. So at the moment I think I should make this optional per language immediately, even if it is far more complicated to implement. |
That seems to me you are using so called “dead keys”. |
To add the option per language is even better than to all, since it gives more freedom to the user. |
Yes, and in the long run I will have to do that anyway, so I had better not postpone it. |
I guess you are using the first keyboard layout from
The second layout in that file is one without dead keys:
|
So probably you are using a layout like this one: https://en.wikipedia.org/wiki/Portuguese_keyboard_layout#/media/File:KB_Portuguese.svg The keys marked in red on that layout are dead keys. |
My keyboard is similar to this one: https://www.worten.pt/i/8466d924afbaa5bd14e604fa3ca649a377762776.jpg |
Exactly, Mike! |
The problem with these dead keys is that they are handled in a very special way. They don’t go directly into the preëdit. Maybe you noticed the option “Use color for the compose preview” in the setup tool. Dead keys and compose are basically the same mechanism. Try to use that option and choose an obvious colour like I did in this screenshot: If you do that, it makes it more obvious what is going on. |
Peek.2021-08-23.19-44.mp4 |
Thanks, Mike. I have just done that. |
The normal colour for the preëdit text is black. I type Now I type a dead_tilde and see Note that all completions have disappeared while the green That is because the compose handling is a sort of preëdit inside a preëdit. The external preëdit (black) has to wait until the internal preëdit (green) is finished to continue searching for completions. After the |
So this shows you how the compose sequence starting with a dead What you see in that list of possible completions might be somewhat different than what I see because I include only those completions in the list which can actually be typed on the current keyboard layout. If I didn't limit it to those possible to type on the current keyboard layout, there would almost always be hundreds of possible completions. For example there is this:
But if your keyboard layout doesn't even have a key For a user of a Greek keyboard layout, I show this one but omit others which cannot be typed on the Greek keyboard layout. |
In my video, I typed a dead The candidate list shows in the first column what one could type to get this. So in case of the This tells you that after typing a dead So this typing of Tab when a compose sequence is started but not finished yet tells you what choices you have to finish the sequences, makes it easier to learn the possible compose sequences, certainly easier than reading the |
I guess there is no |
While an unfinished compose sequence is typed, ibus-typing-booster basically stops everything else it is doing and waits until the compose sequence is finished, and then continues with predictions. While the compose sequence is unfinished, the only things you can do are show possible completions with Tab, correct with Backspace, or cancel the compose sequence with Escape. |
Got it, Mike! |
Even if I could get the For matching, the dictionary and database are internally converted to NFD (Normalization form D): https://unicode.org/reports/tr15/#Norm_Forms I.e. what is matched against is actually something like And then the combining characters like the combining |
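To illustrate what NFD does to an accented word, here is a short sketch using Python's standard `unicodedata` module. The helper names `to_nfd` and `describe` are hypothetical; the point is only that in NFD the base letter comes first and the combining mark follows it.

```python
import unicodedata

def to_nfd(text: str) -> str:
    """Convert text to Normalization Form D (base letters + combining marks)."""
    return unicodedata.normalize('NFD', text)

def describe(text: str) -> list:
    """List the Unicode names of the code points of the NFD form of text."""
    return [unicodedata.name(char) for char in to_nfd(text)]

# 'também' has 6 characters in composed form but 7 code points in NFD:
# t, a, m, b, e, COMBINING ACUTE ACCENT, m
nfd = to_nfd('também')
```

Because the combining mark follows its base letter in NFD, a dead key typed before the vowel does not correspond to any prefix of the normalized word, which seems to be why matching on the accent alone is awkward.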
That's why I asked how you type an In case of using something like So I think starting to match something when only a Converting all the dictionaries on reading them to forms having |
Well, Mike, in Portuguese, that would be easy since all accented characters are vowels! Therefore, ibus-typing-booster could search for:
😉 |
I agree that the gain would be small. |
So while a simple |
See how Peek.2021-08-23.20-32.mp4This works only because I use And of course I typed |
On your Portuguese keyboard layout, using I.e. with Makes no sense for you, I just mentioned |
But ibus-typing-booster doesn't know you are typing Portuguese. One can have several languages configured at the same time. |
So I think matching something forward when only some accent has been typed like matching The amount of calculation for this would be huge, it would depend very much on which languages exactly are configured, lots of special casing, no high speed matching with patterns like regular expressions possible anymore. So I think matching But matching accent sensitive, as discussed above, is possible and I think I will do that. I.e. making And it probably already would do most of what you want. |
Thanks, Mike. Your arguments have convinced me that matching |
Great, but I'll do the other thing with the accent sensitive matching. This might take quite a while though as it is really quite difficult to implement. I think combobox buttons at the end of each dictionary line are a good idea, but I still need to think about how to save that to gsettings and read it back from there. I have a few ideas but I think I need to think about this for a few days before starting to implement anything. |
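One possible way to persist a per-dictionary choice is a flat list of strings, one per dictionary, which a settings backend that stores string arrays could hold. This is only a sketch under that assumption: the `name:mode` encoding, `VALID_MODES`, `encode_modes` and `decode_modes` are all hypothetical and no actual gsettings key or schema is implied.

```python
from typing import Dict, List

VALID_MODES = ('always', 'never', 'language')
DEFAULT_MODE = 'language'  # assumed default: keep the current behaviour

def encode_modes(modes: Dict[str, str]) -> List[str]:
    """Serialize {dictionary: mode} to a sorted list of 'dictionary:mode' strings."""
    return [f'{name}:{mode}' for name, mode in sorted(modes.items())]

def decode_modes(entries: List[str]) -> Dict[str, str]:
    """Parse 'dictionary:mode' strings, falling back to DEFAULT_MODE when the
    mode is missing or unknown."""
    modes = {}
    for entry in entries:
        name, separator, mode = entry.partition(':')
        if not name:
            continue  # skip malformed entries without a dictionary name
        if separator and mode in VALID_MODES:
            modes[name] = mode
        else:
            modes[name] = DEFAULT_MODE
    return modes
```

Falling back to the language rules for missing or unknown values would keep old configurations working, which speaks to the backwards compatibility worry raised earlier in the thread.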
Thanks, Mike. That is nothing really urgent! So, take your time. No hurry at all! |
I remembered that there needs to be an extra option for the user database. Each dictionary line needs to have an option whether to match accent insensitive [always | never | language rules]. And there needs to be an option whether to store accents the user typed in the user database. Currently, accents are removed from the text the user typed when storing in the database. That means if one types So the next time the user types either The user database is language agnostic (which is a good thing!), it just records what the user typed in what context and which completion candidate was selected. As some users may wish to make the matching more strict by matching accent sensitive, such an option has also to be added for the user database. Maybe a simple checkbox is enough:
An option with 3 values like for the dictionaries ( But one could also make it a more detailed option with more possibilities, maybe even allowing the user to specify a list of characters to be matched accent sensitive:
I.e. when When When These options for the user database would only have an effect on new input. Stuff which is already in the user database cannot be changed later. Theoretically I could remove accents from user input which is already in the user database, but there is no way to put them back. Because if there is onl So after switching the Side note: For a long time already I am thinking of a kind of expire feature in the user database, words which have not been typed for a long, long time should fade away from the user database. Currently everything is kept forever. My user database which is many years old has 32 megabytes now:
There is old junk inside because I tested some stuff many years ago. That hurts less than one might think, because if I never type it again, it never gets a higher score. It stays in the database with a count of 1 forever, but with such a low count it is unlikely that it will ever show up as a candidate. Words which I type often have much higher counts. But a needlessly huge database makes everything slower. I am thinking about something similar to radioactive decay: if an entry has been typed 10 times, don’t keep that count forever but slowly reduce it over time, and drop the entry if it reaches 0. As time passes, old junk which is never typed again would be dropped automatically. Words typed recently would get higher scores than words typed months ago ... |
I find your radioactive decay idea quite interesting, Mike! However, I suspect that it may need some adjustment. Imagine a person who stops using ibus-typing-booster for, say, two years and then resumes using it. With decay applied purely over time, that person's database would be empty after two years! I do agree that it is very important to reduce the database size, as the larger the database, the slower the queries. The idea of allowing the user to specify a list of characters to be matched accent sensitive is an excellent one: the more freedom the user is given, the better, up to a certain degree of GUI complexity. |
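The decay idea could be sketched as exponential decay with a half-life. Everything here is an assumption for illustration: the 180-day half-life, the drop threshold, and the function names `decayed_count` and `expire` are hypothetical. Since an exponentially decayed count never reaches exactly 0, the sketch drops entries once they fall below a threshold instead.

```python
from typing import Dict, Tuple

def decayed_count(count: float, days_idle: float,
                  half_life_days: float = 180.0) -> float:
    """Halve the count for every `half_life_days` since the entry was last typed."""
    return count * 0.5 ** (days_idle / half_life_days)

def expire(entries: Dict[str, Tuple[float, float]],
           threshold: float = 0.5) -> Dict[str, float]:
    """Apply decay to {phrase: (count, days_idle)} and drop entries whose
    decayed count has fallen below `threshold`."""
    kept = {}
    for phrase, (count, days_idle) in entries.items():
        new_count = decayed_count(count, days_idle)
        if new_count >= threshold:
            kept[phrase] = new_count
    return kept
```

For example, an entry typed 10 times and untouched for one half-life would keep a count of 5, while a test entry with a count of 1 that has been idle for four half-lives would fall to 0.0625 and be dropped.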
According to the tests done while working on I have doubts that fine grained configurability of accent sensitive matching is helpful at all. It is quite a complicated change, with a complicated user interface to set it up which most users might not even understand. And according to the tests I did recently, it will probably accomplish very little in improving the predictions. Doing completely accent insensitive predictions (accent insensitive user input and context) did not make the percent of characters saved any worse when I tested with the French book “Notre Dame de Paris”. There may be some cases where typing the correct accents would reduce the number of candidates a bit and one would get the correct candidate a bit earlier, but that really doesn’t seem to be the case often. If that happened often, probably it would have made the percent of characters saved worse when testing with that French book. When doing this test with the French book, I had always perfect context, always the two previous words were remembered correctly in the database. In reality, this is not always the case. Sometimes because surrounding text didn’t work and maybe the fallback to remember the last two words didn’t help either because the cursor was moved or the focus was moved to a different window. In reality, it happens more often that the context is missing or even wrong than when doing such a test of reading from a book and then retyping that exact book. When the context is empty, typing the accents correctly of the current input might help a bit more compared to when two words of perfect context are available. I am not 100% sure, but I need to look at this again carefully. Maybe it is not worth doing. |
Given the improvements and the evidence meanwhile learned, I agree with your arguments, Mike. |
Hi, Mike,
Preedit of accented words can be improved. Let me explain my idea. In Portuguese, many words contain an accented character (typically only one). Take the word “estão”: the prediction of the word should be triggered as soon as the accent “~” is typed, and not after the entire accented character (ã) is typed, in order to improve typing speed.
Thanks!