Should search be case-sensitive? #251

Closed
psads-git opened this issue Nov 4, 2021 · 110 comments

@psads-git (Contributor)

While I am in the process of changing the timestamps of my database, I have something I would like to suggest: making search case-insensitive. For instance, if I write mike, nothing is found by ibus-typing-booster, but if I start writing Mi... the Mike suggestion emerges immediately. I think it would be useful to be able to write mi... and have the suggestion Mike offered immediately.

What do you think about this?

mike-fabian self-assigned this Nov 5, 2021
@mike-fabian (Owner)

Yes, I think search should be case-insensitive, and if this is possible it should be case-insensitive by default.

@mike-fabian (Owner)

I don’t want to make this optional, though; I think it should always be case-insensitive.

@psads-git (Contributor, Author)

Thanks, Mike. That will significantly improve ibus-typing-booster.

mike-fabian added a commit that referenced this issue Nov 9, 2021
Resolves: #251

- For matching in the dictionaries, case-insensitive regexps are used instead
  of “.startswith()”

- For the sqlite3 database, use “PRAGMA case_sensitive_like = false;”
  instead of “PRAGMA case_sensitive_like = true;” to make the LIKE
  operator work case-insensitively.

See also this old commit:

    commit 0be3d24
    Author: Mike FABIAN <[email protected]>
    Date:   Thu May 23 11:10:10 2013 +0200

        SQL LIKE should behave case sensitively

        http://www.sqlite.org/lang_expr.html#like states:

        > (A bug: SQLite only understands upper/lower case for ASCII characters
        > by default. The LIKE operator is case sensitive by default for unicode
        > characters that are beyond the ASCII range. For example, the
        > expression 'a' LIKE 'A' is TRUE but 'æ' LIKE 'Æ' is FALSE.)

        That makes no sense at all for us. But even if LIKE behaved
        case insensitively for all Unicode characters, we would still
        prefer it to behave case-sensitively.

        So add the PRAGMA to make it case sensitive.

https://www.sqlite.org/lang_expr.html#like

now contains:

“The ICU extension to SQLite includes an enhanced version of the LIKE
operator that does case folding across all unicode characters.”

So it might even be possible to make that work.

But even if it works only for ASCII characters, this is already mostly
good enough, because Typing Booster stores the user input with the
accents removed in the input_phrase column of the database.
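
For illustration only (this is a minimal sketch, not the actual ibus-typing-booster code; the word list and toy table are invented), the two changes can be pictured like this: a case-insensitive regex prefix match replacing .startswith(), and the sqlite3 PRAGMA that makes LIKE case-insensitive for ASCII:

import re
import sqlite3

words = ['Mike', 'mike', 'Mikado', 'milk']

# Old behaviour: case-sensitive prefix match
old_matches = [w for w in words if w.startswith('mi')]   # ['mike', 'milk']

# New behaviour: case-insensitive prefix match with a regex
pattern = re.compile(re.escape('mi'), re.IGNORECASE)
new_matches = [w for w in words if pattern.match(w)]     # all four words

# sqlite3: make LIKE case-insensitive (for ASCII) via the PRAGMA
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE phrases (input_phrase TEXT, user_freq INTEGER)')
conn.executemany('INSERT INTO phrases VALUES (?, ?)',
                 [('Mike', 3), ('mike', 5), ('milk', 1)])
conn.execute('PRAGMA case_sensitive_like = false;')
rows = conn.execute(
    "SELECT input_phrase, user_freq FROM phrases"
    " WHERE input_phrase LIKE 'mi%'").fetchall()
print(old_matches, new_matches, rows)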
@mike-fabian (Owner)

https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.14.18 builds now which have the case insensitive matching.

@psads-git (Contributor, Author)

Thanks, Mike. This version of ibus-typing-booster is almost unusable, since the search is extremely slow.

Let me add that my database is about 70 MB (I fed several books into it to help the prediction).

@mike-fabian (Owner) commented Nov 9, 2021

Is it slower than before? I didn’t notice any slowdown.

@psads-git (Contributor, Author)

Yes, Mike, several orders of magnitude slower.

@mike-fabian (Owner)

But that is probably because of your huge database, or do you think the case-insensitive matching changed anything?

@mike-fabian (Owner)

If you downgrade to 2.14.17, is that faster?

@psads-git (Contributor, Author)

If I downgrade to 2.14.17, ibus-typing-booster is very fast.

I do not have any solid idea why that is happening, but, since the result of each search has many more elements, the ordering by use frequency may be much slower because of the number of elements it has to sort -- if I remember correctly, sorting algorithms are O(n log n) at best, so the cost grows faster than linearly with the number of elements.

@mike-fabian (Owner)

> Let me add that my database is about 70 MB (I fed several books into it to help the prediction).

By the way, I found that reading books into Typing Booster is not helpful. For example, in 2014 I let Typing Booster read “The Hitchhiker’s Guide to the Galaxy” and “The Picture of Dorian Gray”. It didn’t seem to help my typing at all. I could open any page in one of these books and type any sentence from it with good prediction, but what I really wanted to type (almost) never seemed to occur in these books. I guess this is true for almost any book, except possibly one you wrote yourself. User input is so variable that it doesn’t help much to read text from other writers; even a huge amount of such text doesn’t seem to help.

Now that I have implemented the expiry of old entries and looked at which entries were expired, I noticed that most entries from both these books had been created in 2014 and had never been touched since.

So I think reading books doesn’t help much.

Maybe I should rethink how to read huge texts; I have no good idea at the moment though. Learning from your own input is helpful, but learning from other people’s input seems to have very limited value. Maybe, when reading a text above a certain size, I should only add material from that text to the database if it really appears very often. Even that might not help very much: for example, “The Picture of Dorian Gray” contains the text “said Lord Henry” 47 times. I will probably never type that; I will probably type “said” often but “Lord” and “Henry” very rarely, and it is unlikely that I will ever type the complete text “said Lord Henry”.

@mike-fabian (Owner)

Now I have measured and indeed it is much slower.

My database has 140,000 rows and a size of 10 MB, and with that it is about 10 times slower.

I measured both the time it takes to do the case insensitive lookup in the hunspell dictionaries and in the database.

The difference for the lookup in the dictionaries is insignificant; the difference in speed between the case-insensitive regular expressions used now and the .startswith() used before seems to be within measurement error.

But the difference when doing the database lookup is huge. As soon as I use

PRAGMA case_sensitive_like = false;

for the sqlite database, it becomes about 10 times slower for me.

So probably there are two things slowing this down:

  • the case-insensitive LIKE operator is probably slower by itself

  • when the LIKE is case-insensitive, the number of records matched is probably a lot bigger, and then calculating the linear combination of the counts of the previous two words has a lot more work to do and becomes much slower
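
A rough way to reproduce this kind of comparison on synthetic data (this is only a sketch with made-up random rows, not the real Typing Booster database; absolute numbers will differ, and the slowdown also depends on indexes and on how many rows match):

import random
import sqlite3
import string
import time

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE phrases (input_phrase TEXT, user_freq INTEGER)')
random.seed(0)
rows = [(''.join(random.choices(string.ascii_letters, k=8)), random.randint(1, 100))
        for _ in range(140_000)]
conn.executemany('INSERT INTO phrases VALUES (?, ?)', rows)
conn.commit()

def timed_lookup(case_sensitive: bool) -> float:
    # Toggle the PRAGMA, then time repeated LIKE lookups.
    conn.execute(f'PRAGMA case_sensitive_like = {"true" if case_sensitive else "false"};')
    start = time.perf_counter()
    for _ in range(200):
        conn.execute("SELECT sum(user_freq) FROM phrases"
                     " WHERE input_phrase LIKE 'Th%'").fetchone()
    return time.perf_counter() - start

print('case-sensitive  :', timed_lookup(True))
print('case-insensitive:', timed_lookup(False))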

@psads-git (Contributor, Author)

So, Mike, case-insensitivity is perhaps a bad idea! ;-)

@mike-fabian (Owner)

Some examples from my current database:

sqlite> PRAGMA case_sensitive_like = true;
sqlite> select sum(user_freq) from phrases where input_phrase like "Thi%" ;
411
sqlite> PRAGMA case_sensitive_like = false;
sqlite> select sum(user_freq) from phrases where input_phrase like "Thi%" ;
2768
sqlite>

sqlite> PRAGMA case_sensitive_like = true;
sqlite> select sum(user_freq) from phrases where input_phrase like "Th%" ;
2712
sqlite> PRAGMA case_sensitive_like = false;
sqlite> select sum(user_freq) from phrases where input_phrase like "Th%" ;
18887
sqlite> 

sqlite> PRAGMA case_sensitive_like = true;
sqlite> select sum(user_freq) from phrases where input_phrase like "Te%" ;
188
sqlite> PRAGMA case_sensitive_like = false;
sqlite> select sum(user_freq) from phrases where input_phrase like "Te%" ;
2199
sqlite> 

sqlite> PRAGMA case_sensitive_like = true;
sqlite> select sum(user_freq) from phrases where input_phrase like "te%" ;
1998
sqlite> PRAGMA case_sensitive_like = false;
sqlite> select sum(user_freq) from phrases where input_phrase like "te%" ;
2199
sqlite>

It looks like the number of entries returned by a case-insensitive LIKE operator is really a lot bigger.

Probably we cannot do case insensitive matching in the database then.

Maybe only in the dictionaries.

So for example when using the en_US dictionary, “mike” and “Mike” would both match:

$ grep -i mike /usr/share/myspell/en_US.dic 
Mike/M
mike/MGDS
mfabian@taka:~
$

but not when using the en_GB dictionary:

mfabian@taka:~
$ grep -i mike /usr/share/myspell/en_GB.dic 
mike/DMGS
mfabian@taka:~
$ 

It looks like I can do the case insensitive match in the dictionaries without measurable slowdown and at least that makes it possible to type “corme” or “Corme” and get “Cormeilles-en-Parisis” when using the French dictionary:

mfabian@taka:~
$ grep Paris /usr/share/myspell/fr_FR.dic 
Cormeilles-en-Parisis
Paris
Seyssinet-Pariset
Tout-Paris
mfabian@taka:~
$ 
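
A sketch of that “halfway” approach (hypothetical helper, not the real hunspell loading code): a case-insensitive prefix lookup over the dictionary words, while the database matching stays as it is:

import re
from typing import List

def dictionary_candidates(prefix: str, words: List[str]) -> List[str]:
    '''Return all dictionary words starting with prefix, ignoring case.'''
    pattern = re.compile(re.escape(prefix), re.IGNORECASE)
    return [word for word in words if pattern.match(word)]

french_words = ['Cormeilles-en-Parisis', 'Paris', 'Seyssinet-Pariset', 'Tout-Paris']
print(dictionary_candidates('corme', french_words))  # ['Cormeilles-en-Parisis']
print(dictionary_candidates('Corme', french_words))  # ['Cormeilles-en-Parisis']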

@mike-fabian (Owner) commented Nov 9, 2021

Would a “halfway case-insensitive” solution, i.e. case-insensitive in the dictionaries but not in the database as described above, be good?

I tend to think this is better than nothing.

@psads-git (Contributor, Author)

Yes, Mike, a “halfway case-insensitive” solution would definitely be better than nothing. And using case insensitivity in the database is, as is now very clear, impractical.

@mike-fabian (Owner)

I just noticed that case-insensitive matching in the database is even possible with no loss in performance at all if I do it the same way I do it for accent-insensitive matching.

Accent insensitive matching in the database is currently done like this:

Before the input the user has typed (input_phrase) is saved to the database, all accents are removed from it, i.e. the input_phrase is always stored without accents in the database.

If the user then types something again, the accents are removed from the input and then the match against the database is done.

This works well and does not cause any loss in performance. The disadvantage is of course that this is an option one could not switch immediately. When I make accent-sensitive matching in the database configurable as discussed in

#231

then a change in that option will only be effective for input entered after the option was changed. Old rows in the database cannot be changed anymore; they would stay as they were.

One could do the same with upper and lower case: Always store input_phrase in lower case in the database, always convert user input to lower case before matching.

That would be fast.
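
A minimal sketch of that idea (not the project’s actual normalization code, which also uses a language-dependent translation table and has exceptions): lower-case and strip accents from input_phrase both when storing and when matching, so Mi, mi and Mí all hit the same row:

import unicodedata

def normalize(text: str) -> str:
    '''Lower-case and strip combining accents (simplified).'''
    decomposed = unicodedata.normalize('NFD', text.lower())
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

stored = {}                        # input_phrase -> user_freq, toy stand-in for the table
for typed in ('Mi', 'mi', 'Mí'):
    key = normalize(typed)         # always 'mi'
    stored[key] = stored.get(key, 0) + 1

print(stored)                                   # {'mi': 3}
print(normalize('Mike') == normalize('mike'))   # True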

But if this were added as an option in the setup tool, changing that option would only have an effect on new input.

@psads-git (Contributor, Author)

Thanks, Mike, but I am not sure whether it will work. Suppose that one wants to write

Mike

and types mike. How can ibus-typing-booster automatically transform mike into Mike?

My idea of case-insensitive search was to avoid typing uppercase letters mainly in names.

I believe that you are thinking of something like the following: One types

Mi (notice the capital M)

and ibus-typing-booster will suggest

Mike

This would be great progress, though, as it would allow a substantial size reduction of the database while maintaining the same prediction performance.

@mike-fabian (Owner)

> Thanks, Mike, but I am not sure whether it will work. Suppose that one wants to write
>
> Mike
>
> and types mike. How can ibus-typing-booster automatically transform mike into Mike?

That can be done easily in Python:

mfabian@taka:~
$ python3
Python 3.10.0 (default, Oct  4 2021, 00:00:00) [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'Mike'.lower()
'mike'
>>> 'Mike'.upper()
'MIKE'
>>> 'mike'.title()
'Mike'
>>> 'mike'.upper()
'MIKE'
>>> 'mIKe'.upper()
'MIKE'
>>> 'mIKe'.lower()
'mike'
>>> 'mIKe'.title()
'Mike'
>>> 

> My idea of case-insensitive search was to avoid typing uppercase letters mainly in names.
>
> I believe that you are thinking of something like the following: One types
>
> Mi (notice the capital M)
>
> and ibus-typing-booster will suggest
>
> Mike

I think that you can type either Mi or mi and in both cases you will get the suggestion Mike.

That is almost the same as with the accents; for example, my database currently contains:

sqlite> select * from phrases where input_phrase  == "deja";
426586|deja|déjà|tu|As|1|1636959600.50043
426719|deja|déjà assisté|tu|as|1|1636960347.05661
427645|deja|déjà|||2|1636983062.65822
sqlite> 

The input_phrase column always contains deja without the accents, even when I actually typed it with the accents.

When saving to the database, that column is always saved without the accents (there are some language-dependent exceptions, we talked about these before, but let’s ignore these for this explanation). And from the user input the accents are also removed before attempting a match in the database. That means that no matter whether the user types de or dé, both will match the above rows.

One could do the same with upper and lower case. For example when the user currently types Mi and selects Mike, or types mi and selects Mike, that would create two database entries like this:

426586|Mi|Mike|||1|1636959600.50043
426719|mi|Mike|||1|1636960347.05661

If the user input was always converted to lower case when saving to the database and when matching against the database, one would have only one entry with count 2 like this:

426719|mi|Mike|||2|1636960347.05661

and typing Mi would still match that row because Mi would be converted to mi before trying the match.

But switching that option would only have an effect on new entries. The old ones would very slowly disappear because of the expiry.

Theoretically one could convert the database to all lower case input_phrase automatically when the option case insensitive matching is switched on.

But when it is switched off again, there would be no way to automatically convert it back as some information was lost when doing the conversion to lower case.

I think it might be better not to attempt any conversion of the existing rows at all and let the value of that option only have an effect on new rows.

> This would be great progress, though, as it would allow a substantial size reduction of the database while maintaining the same prediction performance.

I am not sure whether this is a good idea or not; probably I would make it an option if I implement this. I think the reduction in the number of rows would not be substantial, as there are not so many words which usually start with upper case in most languages. There are more in German because all nouns start with upper case in German, but even in the case of German I think the number of rows which would be merged into one because of this would be rather small.

Whether prediction performance gets better or worse is partly a matter of opinion. If you type more exactly, the prediction can be more exact. If you always type upper-case letters correctly, then the prediction has less choice. It is the same as with the accents, I think. If you type de now, words starting with de and dé can be predicted because we use accent-insensitive matching. Accent-sensitive matching would reduce the number of matching candidates.

This is exactly the same with case insensitive matching, it increases the number of matching candidates.

If there are more candidates, it may take more time to scroll through the candidate list and select the correct one.

So in the long run, both accent-insensitive matching and case-insensitive matching should be options, probably both on by default.

@psads-git (Contributor, Author)

Thanks, Mike. It is not only names that need to be capitalized, but also the first word after every period! Example:

This is a period. Now is the first word after the period.

So the need for capitalization is very frequent!

@mike-fabian (Owner)

There is an autocapitalization feature after sentence endings.

@psads-git (Contributor, Author)

Thanks, Mike, for reminding me of that feature, which I had meanwhile forgotten about.

@mike-fabian (Owner)

By the way, with the current database limited to 50000 rows, doing case insensitive matching by using

 PRAGMA case_sensitive_like = false;

causes a slowdown of about a factor of 2. Enough that I can notice the slowdown while typing.

The advantage of this method is of course that one could switch the option in an instant and it would have an effect on all rows existing in the database.

Whereas the other method, storing input phrases only in lower case in the database and converting each new input to lower case before doing the matching, would affect only new rows.

  1. Use PRAGMA case_sensitive_like = false;
  • Disadvantage: Slowdown by a factor of 2
  • Advantage: If this is made into an option, it can be switched immediately
  2. Use Python lower()
  • Disadvantage: If this is made into an option, switching the option affects only new rows
  • Advantage: No slowdown

I tend to think 2. is better because one will probably not switch that option very often. One will probably figure out what setting one likes and then just keep that. So one will have the disadvantage that it affects only new rows for a limited time, but will have the speed advantage forever.

And I am not even sure whether I want to make this into an option at all; maybe I should just always do case-insensitive matching with no option to switch back to case-sensitive matching.

First I’ll make a build with case-insensitive matching and no option to switch it off, let you test it, and hear your opinion.

@mike-fabian (Owner)

While implementing the case-insensitive match according to method 2, I found this small problem when learning by reading from text files which contain accented words:

#252

@psads-git (Contributor, Author)

Thanks, Mike. I also prefer option 2 (Python lower()). A power user can change all database entries to lowercase...

And I do not see any important reason to leave the choice between case-insensitive and case-sensitive search up to the user.

@mike-fabian (Owner)

A colleague found this:

https://towardsdatascience.com/next-word-prediction-with-nlp-and-deep-learning-48b9fe0a17bf

but reading it I have the impression that they do less than I already do, and in a more complicated way.

They even write in a note:

Note: There are certain cases where the program might not return the expected result. This is obvious because each word is being considered only once. This will cause certain issues for particular sentences and you will not receive the desired output. To improve the accuracy of the model you can consider trying out bi-grams or tri-grams. We have only used uni-grams in this approach. Also, a few more additional steps can be done in the pre-processing steps. Overall, there is a lot of scope for improvement.

@mike-fabian (Owner) commented Nov 22, 2021

I found a side effect of making the user input stored in the database case-insensitive and opened a new issue for it:

#255

@mike-fabian (Owner)

New issue for the case and accent insensitivity in the context:

#256

@mike-fabian (Owner)

I did another interesting test:

  • Let Typing Booster read the book victor_hugo_notre_dame_de_paris.txt into an empty, in-memory database.

After doing that, the database has 156245 rows. We know already from the tests above that a simulated typing of that book will now save about 27% of the characters.

But that database is huge and will be cut down to the 50000 “best” entries on the next restart of Typing Booster. How much will that degrade the prediction quality? These are the next steps in the test:

  • Call cleanup_database() on that in-memory database
  • Now the database has only 50000 entries
  • Now do the typing simulation

Result: only 24% of the characters are saved instead of 27%

But that doesn’t seem bad to me: only a small loss of 3 percentage points in saved characters with a database less than 1/3 of the original size.
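
Purely as an illustration of the metric (a toy sketch with a hypothetical scoring rule, not Typing Booster’s real simulation, which uses context and its actual candidate ranking): “percent of characters saved” can be thought of as counting how many prefix characters must be typed before the intended word becomes the top frequency-ranked candidate:

from collections import Counter
from typing import List

def percent_saved(text_words: List[str], frequencies: Counter) -> float:
    typed = 0
    total = 0
    for word in text_words:
        total += len(word)
        cost = len(word)                      # worst case: type the whole word
        for i in range(1, len(word) + 1):
            prefix = word[:i]
            candidates = sorted((w for w in frequencies if w.startswith(prefix)),
                                key=lambda w: -frequencies[w])
            if candidates and candidates[0] == word:
                cost = i                      # the selection keystroke is ignored here
                break
        typed += cost
    return 100.0 * (1 - typed / total)

freq = Counter(['notre', 'dame', 'de', 'paris', 'paris', 'paris'])
print(f'{percent_saved(["notre", "dame", "de", "paris"], freq):.0f}% saved')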

@mike-fabian (Owner) commented Nov 22, 2021

Getting higher quality data into a smaller database is more useful than a huge database with low quality data.

The database trained by the huge book “Notre dame de Paris” is probably mostly useful only to retype exactly that book and of very limited use to a “normal” user.

A “normal” user is probably much better served by a smaller database with contents which fit better to his style of writing.

@psads-git (Contributor, Author)

> A colleague found this:
>
> https://towardsdatascience.com/next-word-prediction-with-nlp-and-deep-learning-48b9fe0a17bf
>
> but reading it I have the impression that they do less than I already do, and in a more complicated way.
>
> They even write in a note:
>
> Note: There are certain cases where the program might not return the expected result. This is obvious because each word is being considered only once. This will cause certain issues for particular sentences and you will not receive the desired output. To improve the accuracy of the model you can consider trying out bi-grams or tri-grams. We have only used uni-grams in this approach. Also, a few more additional steps can be done in the pre-processing steps. Overall, there is a lot of scope for improvement.

I do not think, Mike, that it would be a good idea to use deep learning methods in ibus-typing-booster, as such methods would require periodic training.

@psads-git (Contributor, Author)

> I did another interesting test:
>
> • Let Typing Booster read the book victor_hugo_notre_dame_de_paris.txt into an empty, in-memory database.
>
> After doing that, the database has 156245 rows. We know already from the tests above that a simulated typing of that book will now save about 27% of the characters.
>
> But that database is huge and will be cut down to the 50000 “best” entries on the next restart of Typing Booster. How much will that degrade the prediction quality? These are the next steps in the test:
>
> • Call cleanup_database() on that in-memory database
> • Now the database has only 50000 entries
> • Now do the typing simulation
>
> Result: only 24% of the characters are saved instead of 27%
>
> But that doesn’t seem bad to me: only a small loss of 3 percentage points in saved characters with a database less than 1/3 of the original size.

Yes, Mike, it is great that saving so much database size impacts accuracy so little!

@psads-git (Contributor, Author) commented Nov 22, 2021

> Getting higher quality data into a smaller database is more useful than a huge database with low quality data.
>
> The database trained by the huge book “Notre dame de Paris” is probably mostly useful only to retype exactly that book and of very limited use to a “normal” user.
>
> A “normal” user is probably much better served by a smaller database with contents which fit better to his style of writing.

I totally agree with you, Mike!

@mike-fabian (Owner)

https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.15.4 now with these changes:

@psads-git (Contributor, Author)

Thanks, Mike. If I find any problem, I will let you know.

@mike-fabian (Owner) commented Nov 23, 2021

The 2.15.7 build at https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ contains an additional small tweak:

While reading training data from a file, the context in the database is converted to lower case and has its accents removed.

So if you want to convert the context in the existing rows of your database, you can read training data from a file.

Size of the file doesn’t matter; it can even be empty.

@psads-git (Contributor, Author)

Thanks, Mike. That is a useful tweak!

@mike-fabian (Owner)

What do you think about this?:

#257

Do you have any opinions? If yes, please comment.

@psads-git (Contributor, Author) commented Nov 25, 2021

I do not think it is a much-needed feature, Mike, given that it only requires copying a file. But, who knows, it may be useful for less sophisticated users.

If you choose to add such a feature, then maybe it should include not only the database but all configuration as well. And maybe it should use a single compressed file as the export format.

I hope this helps, Mike!

@mike-fabian (Owner)

https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.15.8 builds.

Almost everything works 30%-40% faster with this experimental version (but it uses somewhat more memory; I am not sure how much).

@psads-git (Contributor, Author)

Thanks, Mike. By “uses somewhat more memory”, do you mean RAM or disk?

@mike-fabian (Owner)

RAM.

I didn't measure how much more RAM it uses and I am not sure how to measure that.

The change which achieves this big speedup is actually only two lines:

diff --git a/engine/itb_util.py b/engine/itb_util.py
index ea55b7b7..9b4cfee6 100644
--- a/engine/itb_util.py
+++ b/engine/itb_util.py
@@ -33,6 +33,7 @@ from enum import Enum, Flag
 import sys
 import os
 import re
+import functools
 import collections
 import unicodedata
 import locale
@@ -2784,6 +2785,7 @@ TRANS_TABLE = {
     ord('Ŧ'): 'T',
 }
 
+@functools.cache
 def remove_accents(text: str, keep: str = '') -> str:
     # pylint: disable=line-too-long
     '''Removes accents from the text

I noticed that the function which removes accents from a string is a major bottleneck.

I couldn’t find a way to make that function faster but I tried caching the results.
The easiest way to do that is to add that function decorator.

That means if this function is called twice with the same arguments, for example if you call something like this twice:

remove_accents('abcÅøßẞüxyz', keep='åÅØø')

then the second call will return the result

'abcÅøssSSuxyz'

from the cache, which is of course much faster.

As this remove_accents() function is used really a lot in Typing Booster, caching results from only that function already achieves that huge speedup of 30%-40%.

But as I didn’t limit the size of the cache, this means that every time this function is called with a different word during a Typing Booster session, that word gets added to the cache. And this function is really used a lot, i.e. the cache might get quite big.

I could use something like:

@functools.lru_cache(maxsize=100000)

to limit the maximum size of the cache. That would limit the cache to up to the most recent one hundred thousand calls. As each call typically has a word of input and a word of output, that would limit the maximum size of the cache to a few megabytes.

According to the documentation

https://docs.python.org/3/library/functools.html

adding such a limit makes it slightly slower though:

Returns the same as lru_cache(maxsize=None), creating a thin wrapper around a dictionary lookup for the function arguments. Because it never needs to evict old values, this is smaller and faster than lru_cache() with a size limit.

I have not yet measured how much slower.
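
If the cache size ever becomes a concern, the size-limited variant would look roughly like this (a sketch only; remove_accents_cached below is a simplified stand-in for the real remove_accents(), which also uses a translation table for cases like ß → ss):

import functools
import unicodedata

@functools.lru_cache(maxsize=100000)
def remove_accents_cached(text: str, keep: str = '') -> str:
    '''Strip combining accents, except from characters listed in keep (simplified).'''
    result = []
    for char in text:
        if char in keep:
            result.append(char)
            continue
        decomposed = unicodedata.normalize('NFD', char)
        result.append(''.join(c for c in decomposed if not unicodedata.combining(c)))
    return ''.join(result)

remove_accents_cached('déjà')               # computed: 'deja'
remove_accents_cached('déjà')               # served from the cache
print(remove_accents_cached.cache_info())   # hits=1 misses=1 maxsize=100000 currsize=1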

@psads-git (Contributor, Author) commented Nov 28, 2021

Thanks, Mike, for the clarification.

Let us see how much RAM the cache consumes. But there is an upper bound on the cache size: the database size, which is only about 3.5 MB!

@mike-fabian (Owner)

No, the database doesn’t really have much to do with it. The user input and the context are already stored without accents in the database. So for reading from the database, accents don’t need to be removed.

But while the user types and candidates are suggested, each time the user types an additional letter and new candidates are suggested, the remove_accents() function is called 3 times, on the current user input and on the 2 words of context. As the two words of context don’t change while the user types the current word, only for the first letter typed the remove_accents() function might actually do some work to remove accents from the context, for the next letters typed remove_accents() gets the results of the words from the context without accents from the cache already. That makes it faster and I feel the difference is big enough that I notice it while typing.

Each time the user has finished typing a word a new row is added or an existing row is updated. That causes the remove_accents function to be called 3 times.

During startup of Typing Booster, dictionaries are read. In the case of languages using accents, like Portuguese, German, ..., the accents are removed from all words read from the dictionary.
These are typically a few tens of thousands of words per dictionary, 43887 words in the case of Portuguese:

$ wc /usr/share/myspell/pt_PT.dic 
  43887   87788 1459095 /usr/share/myspell/pt_PT.dic
$ wc /usr/share/myspell/de_DE.dic 
  75782   75852 1101641 /usr/share/myspell/de_DE.dic
$

If each word is on average 5 characters long, input and output would be 10 characters on average, plus some overhead when storing in the cache, so probably the cache needs about 1 MB for a typical dictionary. That seems very reasonable to me. And it is unlikely that it grows significantly after that as the number of words a user types until the session is restarted will be much smaller than that. And all the words the user types which were already in the dictionary, are already in the cache and don’t make the cache grow.

@psads-git (Contributor, Author)

Thanks, Mike, for your clarification!

My reasoning is as follows. The need to remove accents arises from the fact that the database works with non-accented words. Hence, as a thought experiment, one could put the entire database, with all words accented, in memory and remove the accents with no trouble -- so the cache size is bounded by the size of the database with all words accented!

I have meanwhile had an idea that may dispense with any cache. As the user types a word, letter by letter, there is no need to remove the accents of the ongoing word but only of the last character if accented!

@mike-fabian (Owner)

The database has the words without accents to make it possible to match the user input without accents.
So I don’t understand what you mean by "the database with all accented words". There are no accented words in the database except for the resulting candidates.

Converting the user input character by character is also not possible, because sometimes accents get created by typing several characters, depending on the input method used. For example, if t-latn-post is used, an ü can be typed by typing u followed by ", both of which are ASCII. Only when the final word is produced by transliteration does the ü appear.

@mike-fabian (Owner)

I think 1 MB of cache for this accent removal is no problem.

@psads-git (Contributor, Author)

> The database has the words without accents to make it possible to match the user input without accents. So I don’t understand what you mean by "the database with all accented words". There are no accented words in the database except for the resulting candidates.

Yes, Mike, but if you worked with accented context in the database records, there would not be any need to remove accents. Note, though, Mike, that I am not advocating that you use accented context -- I am completely on the opposite side. This is just a mental exercise to show that, at most, caching is limited to the size of the database.

> Converting the user input character by character is also not possible, because sometimes accents get created by typing several characters, depending on the input method used. For example, if t-latn-post is used, an ü can be typed by typing u followed by ", both of which are ASCII. Only when the final word is produced by transliteration does the ü appear.

Sorry, Mike, but I was ignorant of that detail.

> I think 1 MB of cache for this accent removal is no problem.

I agree, Mike!

@mike-fabian (Owner)

We can even see in the build times at https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ that it is faster: 2.15.8 and 2.15.9 built in 4 minutes, while builds for older versions took 5-6 minutes.

This is the time saved by the test cases run during the build. In real use the savings should be even greater, because the build times also include the setup of the build (installing required packages into the chroot, etc.), the building itself, and the packaging of the results into the rpms.

@psads-git (Contributor, Author)

Thanks, Mike. If I notice any problem with the new version, I will let you know it.

@mike-fabian (Owner)

> Note, though, Mike, that I am not advocating that you use accented context -- I am completely on the opposite side.

Maybe we should revisit this issue:

#231

and think about whether any fine-grained configurability of accent-insensitive matching is helpful at all.

What we discussed there is quite a complicated change, with a complicated user interface to set it up which most users might not even understand.

And according to the tests I did recently, it will probably accomplish very little in improving the predictions. Doing completely accent-insensitive predictions (accent-insensitive user input and context) did not make the percentage of characters saved any worse when I tested with the French book “Notre Dame de Paris”. There may be some cases where typing the correct accents would reduce the number of candidates a bit and one would get the correct candidate a bit earlier, but that really doesn’t seem to happen often. If it happened often, it would probably have made the percentage of characters saved worse when testing with that French book.

When doing this test with the French book, I always had perfect context; the two previous words were always remembered correctly in the database. In reality this is not always the case, sometimes because surrounding text didn’t work, and maybe the fallback of remembering the last two words didn’t help either because the cursor was moved or the focus was moved to a different window. In reality it happens more often that the context is missing or even wrong than when doing such a test of reading a book and then retyping that exact book. When the context is empty, typing the accents of the current input correctly might help a bit more than when two words of perfect context are available.

I am not 100% sure; I need to look at this again carefully. Maybe it is not worth doing what we discussed in #231.

@psads-git (Contributor, Author)

Given the improvements and what we have learned in the meantime, I tend to agree with your arguments, Mike.
