Should search be case-sensitive? #251
Comments
Yes, I think search should be case-insensitive, and if this is possible it should be case-insensitive by default.
I don’t want to make this optional, though; I think it should always be case-insensitive.
Thanks, Mike. That will significantly improve ibus-typing-booster.
Resolves: #251

- For matching in the dictionaries, case-insensitive regexps are used instead of “.startswith()”.
- For the sqlite3 database, use “PRAGMA case_sensitive_like = false;” instead of “PRAGMA case_sensitive_like = true;” to make the LIKE operator work case-insensitively.

See also this old commit:

commit 0be3d24
Author: Mike FABIAN <[email protected]>
Date: Thu May 23 11:10:10 2013 +0200

    SQL LIKE should behave case sensitively

    http://www.sqlite.org/lang_expr.html#like states:

    > (A bug: SQLite only understands upper/lower case for ASCII characters
    > by default. The LIKE operator is case sensitive by default for unicode
    > characters that are beyond the ASCII range. For example, the
    > expression 'a' LIKE 'A' is TRUE but 'æ' LIKE 'Æ' is FALSE.)

    That makes no sense at all for us. But even if LIKE behaved case
    insensitively for all Unicode characters, we would still prefer it to
    behave case sensitively. So add the PRAGMA to make it case sensitive.

https://www.sqlite.org/lang_expr.html#like now contains: “The ICU extension to SQLite includes an enhanced version of the LIKE operator that does case folding across all unicode characters.” So it might even be possible to make that work. But even if it works only for ASCII characters, this is already mostly good enough, because Typing Booster stores the user input with the accents removed in the input_phrase column in the database.
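For reference, a minimal sketch of how this PRAGMA can be set from Python’s sqlite3 module; the table name “phrases” and the database path are only illustrative, the input_phrase column is the one mentioned above:

```python
import sqlite3

conn = sqlite3.connect('user.db')  # example path, not the real location
# Make LIKE case insensitive (this is in fact SQLite's default, but the
# PRAGMA states the intent explicitly; it only folds case for ASCII
# unless the ICU extension is loaded).
conn.execute('PRAGMA case_sensitive_like = false;')

# 'mike%' now also matches rows whose input_phrase starts with 'Mike'.
rows = conn.execute(
    'SELECT input_phrase FROM phrases WHERE input_phrase LIKE ?',
    ('mike%',)).fetchall()
```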
https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.14.18 builds now which have the case-insensitive matching.
Thanks, Mike. This version of ibus-typing-booster is almost unusable, since the search is extremely slow. Let me add that my database is about 70 MB (I fed several books into it, to help the prediction).
Is it slower than before? I didn’t notice any slowdown.
Yes, Mike, several orders of magnitude slower.
But is that probably because of your huge database, or do you think the case-insensitive matching changed anything?
If you downgrade to 2.14.17, is that faster?
If I downgrade to 2.14.17, ibus-typing-booster is very fast. I do not have any solid idea why that is happening, but, since the result of each search has many more elements, the ordering by use frequency may be extremely slow because of the number of elements that it has to order; if I remember well, the sorting algorithms are O(n^2) (or something close to that).
By the way, I found that reading books into Typing Booster is not helpful. For example, in 2014, I let Typing Booster read the “Hitchhikers Guide to the Galaxy” and “The Picture of Dorian Gray”. It didn’t seem to help my typing at all. I could open any page in one of these books and type any sentence from them with good prediction, but what I really wanted to type (almost) never seemed to occur in these books. I guess this is true for almost any book, except possibly one you wrote yourself. User input is so variable that it doesn’t help much to read text from other writers; even a huge amount of such text doesn’t seem to help.

Now that I have implemented the expiry of old entries and looked at which entries were expired, I noticed that most entries from both these books had been created in 2014 and have never been touched since then. So I think reading books doesn’t help much. Maybe I should rethink how to read huge texts, but I have no good idea at the moment.

Learning from your own input is helpful; learning from other people’s input seems to have very limited value. Maybe, when reading a text above a certain size, I should only add stuff from that text to the database if it really appears very often. That still might not help very much; for example, “The Picture of Dorian Gray” contains the text “said Lord Henry” 47 times. I will probably never type that: I will probably type “said” often, but “Lord” and “Henry” very rarely, and it is unlikely that I will ever type the complete text “said Lord Henry”.
Now I have measured, and indeed it is much slower. My database has 140000 rows and a size of 10 MB, and with that it is about 10 times slower. I measured both the time it takes to do the case-insensitive lookup in the hunspell dictionaries and the time for the lookup in the database. The difference for the lookup in the dictionaries is insignificant; the difference in speed between the case-insensitive regular expressions used now and the “.startswith()” used before is barely measurable. But the difference when doing the database lookup is huge. As soon as I use “PRAGMA case_sensitive_like = false;” for the sqlite database, it becomes about 10 times slower for me. So probably there are two things slowing this down: the case-insensitive matching itself, and the much larger number of matches that then has to be sorted by frequency.
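A rough sketch of how one might measure this from Python with timeit; again, the “phrases” table name and the database path are assumptions, only the input_phrase column is taken from the discussion above:

```python
import sqlite3
import timeit

conn = sqlite3.connect('user.db')  # example path

def lookup(prefix: str) -> list:
    # Prefix match against the stored (accent-stripped) user input.
    return conn.execute(
        'SELECT input_phrase FROM phrases WHERE input_phrase LIKE ?',
        (prefix + '%',)).fetchall()

for value in ('true', 'false'):
    conn.execute(f'PRAGMA case_sensitive_like = {value};')
    seconds = timeit.timeit(lambda: lookup('mike'), number=1000)
    print(f'case_sensitive_like = {value}: {seconds:.3f} s for 1000 lookups')
```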
So, Mike, case-insensitiveness is perhaps a bad idea! ;-)
Some examples from my current database:
Looks like the number of entries returned by a case-insensitive LIKE operator is really a lot bigger. Probably we cannot do case-insensitive matching in the database then. Maybe only in the dictionaries. So, for example, when using the en_US dictionary, “mike” and “Mike” would both match:
but not when using the en_GB dictionary:
It looks like I can do the case-insensitive match in the dictionaries without measurable slowdown, and at least that makes it possible to type “corme” or “Corme” and get “Cormeilles-en-Parisis” when using the French dictionary, as in the sketch below.
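A small sketch of this kind of case-insensitive prefix match against dictionary words; the word list and function name here are made up for the illustration:

```python
import re

# Stand-in for words loaded from a hunspell dictionary; the real list
# is of course much larger.
dictionary_words = ['Mike', 'microphone', 'Cormeilles-en-Parisis', 'corner']

def prefix_matches(typed: str, words: list) -> list:
    # Case-insensitive prefix match, roughly what an IGNORECASE regexp
    # gives compared to the old str.startswith() check.
    pattern = re.compile(re.escape(typed), re.IGNORECASE)
    return [word for word in words if pattern.match(word)]

print(prefix_matches('mike', dictionary_words))   # ['Mike']
print(prefix_matches('corme', dictionary_words))  # ['Cormeilles-en-Parisis']
```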
Would a “halfway case insensitive” solution, i.e. case-insensitive in the dictionaries but not in the database as described above, be good? I tend to think this is better than nothing.
Yes, Mike, a “halfway case insensitive” solution would definitely be better than nothing. And using case insensitivity in the database is, as is very clear, impractical.
I just noticed that case-insensitive matching in the database is even possible with no loss in performance at all if I do it the same way I do it for accent-insensitive matching.

Accent-insensitive matching in the database is currently done like this: before the input the user has typed is saved to the database, the accents are removed from it. If the user then types the same thing again, the accents are removed from the input and then the match against the database is done. This works well and does not cause any loss in performance. The disadvantage is of course that this is an option one cannot switch immediately. When I make accent-sensitive matching in the database configurable as discussed in #231, then a change in that option will only be effective for new input after that option was changed. Old rows in the database cannot be changed anymore; they would stay as they were.

One could do the same with upper and lower case: always store the input in lower case in the database and convert new input to lower case before matching against the database. That would be fast. But if this were added as an option in the setup tool, then changing that option would only have an effect on new input.
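A minimal sketch of this kind of normalization, applied once when saving and once before matching; this is only an illustration, not the exact function Typing Booster uses:

```python
import unicodedata

def normalize(text: str, lower_case: bool = False) -> str:
    # Strip accents by decomposing to NFD and dropping combining marks;
    # optionally also fold to lower case.
    stripped = ''.join(
        ch for ch in unicodedata.normalize('NFD', text)
        if not unicodedata.combining(ch))
    return stripped.lower() if lower_case else stripped

# The same normalization is applied when a typed word is saved and when
# new input is matched against the database, so the two sides compare equal.
assert normalize('désolé') == 'desole'
assert normalize('Sévère', lower_case=True) == 'severe'
```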
Thanks, Mike, but I am not sure whether it will work. Suppose that one wants to write “Mike” and types “mike”. My idea of case-insensitive search was to avoid typing uppercase letters, mainly in names. I believe that you are thinking of something like the following: one types “mike” and ibus-typing-booster will suggest “Mike”.
This would be great progress though, as it would allow a substantial size reduction of the database while maintaining the same prediction performance.
That can be done easily in Python:
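For example, a sketch of the idea using plain str.lower():

```python
for word in ('Mike', 'MIKE', 'Æther'):
    # str.lower() folds case for non-ASCII letters too, unlike SQLite's
    # default LIKE, which only understands upper/lower case for ASCII.
    print(word, '->', word.lower())
# Mike -> mike
# MIKE -> mike
# Æther -> æther
```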
I think that you can type either “mike” or “Mike” and get “Mike” as a candidate. That is almost the same as with the accents; for example, my database currently contains:
The input_phrase column always contains what the user typed. When saving to the database, that column is always saved without the accents (there are some language-dependent exceptions, we talked about these before, but let’s ignore these for this explanation). And from the user input the accents are also removed before attempting a match in the database. That means that no matter whether the user types the accents or not, the same rows match. One could do the same with upper and lower case. For example, when the user currently types both “Mike” and “mike”, there are two separate entries, each with a count of 1:
If the user input was always converted to lower case when saving to the database and when matching against the database, one would have only one entry with count 2, like this:
and typing either “mike” or “Mike” would match it. But switching that option would only have an effect on new entries. The old ones would very slowly disappear because of the expiry. Theoretically one could convert the database to an all-lower-case input_phrase automatically when the option is switched on. But when it is switched off again, there would be no way to automatically convert it back, as some information was lost when doing the conversion to lower case. I think it might be better not to attempt any conversion of the existing rows at all and let the value of that option only have an effect on new rows.
I am not sure whether this is a good idea or not; probably I would make it an option if I implement this. I think the reduction in the number of rows would not be substantial, as there are not so many words which usually start with upper case in most languages. There are more in German, because all nouns start with upper case in German, but even in the case of German I think the number of rows which would be merged into one because of this would be rather small.

Whether prediction performance gets better or worse is partly a matter of opinion. If you type more exactly, the prediction can be more exact. If you always type upper-case letters correctly, then the prediction has less choice. It is the same as with the accents, I think: if you type the accents correctly, there are fewer matching candidates. This is exactly the same with case-insensitive matching, it increases the number of matching candidates. If there are more candidates, it may take more time to scroll through the candidate list and select the correct one.

So in the long run, both accent-insensitive matching and case-insensitive matching should be options. Both on by default, probably.
Thanks, Mike. It is not only names that need to be capitalized, but also the first word of every sentence, after each period!
So the need for capitalization comes up all the time!
There is an autocapitalization feature after sentence endings.
Thanks, Mike, for reminding me of that feature, which I had meanwhile forgotten about.
By the way, with the current database limited to 50000 rows, doing case-insensitive matching by using “PRAGMA case_sensitive_like = false;” causes a slowdown of about a factor of 2. Enough that I can notice the slowdown while typing. The advantage of this method is of course that one could switch the option in an instant and it would have an effect on all rows existing in the database. Whereas the other method, of storing input phrases only in lower case in the database and converting each new input to lower case before doing the matching, would affect only new rows.
I tend to think method 2 is better, because one will probably not switch that option very often. One will probably figure out which setting one likes and then just keep that. So one will have the disadvantage that it affects only new rows for a limited time, but will have the speed advantage forever. And I am not even sure whether I want to make this into an option; maybe just do case-insensitive matching always, with no option to switch to case-sensitive matching. I don’t know yet. First I’ll make a build with case-insensitive matching and no option to switch it off, and let you test it and hear your opinion.
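A sketch of the two methods side by side; the table here is reduced to a single input_phrase column for illustration (the real table has more columns):

```python
import sqlite3

conn = sqlite3.connect('user.db')  # example path

# Method 1: keep the stored input as typed and fold case at query time.
# Switching is instant and affects all existing rows, but every LIKE
# query pays for the case-insensitive comparison.
conn.execute('PRAGMA case_sensitive_like = false;')

# Method 2: normalize to lower case once when writing and once when
# matching. Queries stay cheap, but changing the option later only
# affects rows written after the change.
def save_phrase(typed: str) -> None:
    conn.execute('INSERT INTO phrases (input_phrase) VALUES (?)',
                 (typed.lower(),))

def lookup(typed: str) -> list:
    return conn.execute(
        'SELECT input_phrase FROM phrases WHERE input_phrase LIKE ?',
        (typed.lower() + '%',)).fetchall()
```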
While doing the case-insensitive match according to method 2, I found this small problem when learning by reading from text files which contain accented words:
Thanks, Mike. I also prefer option 2 (Python lower()). A power user can change all existing database entries to lowercase themselves... And I do not see any important reason to leave the choice between case-sensitive and case-insensitive search as an option for the user.
A colleague found this: https://towardsdatascience.com/next-word-prediction-with-nlp-and-deep-learning-48b9fe0a17bf but reading it I have the impression that they do less than I already do, and in a more complicated way. They even write in a note:
I found a side effect of making the user input into the database case-insensitive and opened a new issue for it:
New issue for the case and accent insensitivity in the context:
I did another interesting test:
After doing that, the database has 156245 rows. We know already from the tests above that a simulated typing of that book will now save about 27% of the characters. But that database is huge and will be cut down to the 50000 “best” entries on the next restart of Typing Booster. So how much will that degrade the prediction quality? So I did these next steps in the test:
Result: only 24% of the characters are saved instead of 27%. But that doesn’t seem bad to me: only a small loss of 3 percentage points in prediction accuracy with a database of less than 1/3 the original size.
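For context, a toy sketch of how a “characters saved” percentage of this kind can be computed in a simulation; the stand-in predictor below is made up, the real test of course asks Typing Booster itself for its candidates:

```python
def keystrokes_needed(word: str, vocabulary: set) -> int:
    # Hypothetical stand-in predictor: type letters until the prefix
    # matches only this word in the vocabulary, then one key to select.
    for i in range(1, len(word) + 1):
        prefix = word[:i].lower()
        hits = [w for w in vocabulary if w.lower().startswith(prefix)]
        if hits == [word]:
            return i + 1
    return len(word)

def percent_saved(text_words: list, vocabulary: set) -> float:
    total = sum(len(w) for w in text_words)
    typed = sum(keystrokes_needed(w, vocabulary) for w in text_words)
    return 100.0 * (total - typed) / total

words = ['notre', 'dame', 'de', 'paris', 'cathédrale']
print(f'{percent_saved(words, set(words)):.0f}% of the characters saved')
```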
Getting higher quality data into a smaller database is more useful than a huge database with low quality data. The database trained by the huge book “Notre dame de Paris” is probably mostly useful only for retyping exactly that book and of very limited use to a “normal” user. A “normal” user is probably much better served by a smaller database with contents which fit his style of writing better.
I do not think, Mike, it would be a good idea to use deep learning methods in ibus-typing-booster, as such methods would require periodic training.
Yes, Mike, it is great that saving so much database size has so little impact on accuracy!
I totally agree with you, Mike!
https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.15.4 now with these changes:
Thanks, Mike. If I find any problem, I will let you know.
https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ The 2.15.7 build contains an additional small tweak: while reading training data from a file, the context in the database is converted to lower case and the accents are removed. So if you want to convert the context in the existing rows in your database, you can read training data from a file. The size of the file doesn’t matter, it can even be empty.
Thanks, Mike. That is a useful tweak!
What do you think about this? Do you have any opinions? If yes, please comment.
I do not think that it is a much-needed feature, Mike, given the fact that it only requires copying a file. But, who knows, it may be useful for unsophisticated users. If you choose to add such a feature, then maybe it should include not only the database, but all the configuration as well. And maybe use a single compressed file as the export format. I hope to have helped, Mike!
https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.15.8 builds. Almost everything works 30%-40% faster with this experimental version (but it uses somewhat more memory; I am not sure how much).
Thanks, Mike. With “uses somewhat more memory”, do you mean RAM or disk?
RAM. I didn’t measure how much more RAM it uses and I am not sure how to measure that. The change which achieves this big speedup is actually only two lines:
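Roughly something like this (a sketch; the function name and body stand in for the real accent-removal function, the two added lines being the functools import and the decorator):

```python
import functools
import unicodedata

@functools.lru_cache(maxsize=None)  # the added line: cache every result
def remove_accents(text: str) -> str:
    # Same accent stripping as before, just memoized now.
    return ''.join(ch for ch in unicodedata.normalize('NFD', text)
                   if not unicodedata.combining(ch))
```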
I noticed that the function which removes accents from a string is a major bottleneck. I couldn’t find a way to make that function faster, but I tried caching the results. That means that if this function is called twice with the same argument, for example twice with the same word, then the second call will return the result from the cache, which is of course much faster.

But as I didn’t limit the size of the cache, this means that every time this function is called with a different word during a Typing Booster session, that word gets added to the cache. And this function is really used a lot, i.e. the cache might get quite big. I could use something like “maxsize=100000”, as sketched below, to limit the maximum size of the cache. That would limit the cache to the most recent one hundred thousand calls. As each call typically has a word of input and a word of output, that would limit the maximum size of the cache to a few megabytes. According to the documentation at https://docs.python.org/3/library/functools.html, adding such a limit makes it slightly slower though.

I have not yet measured how much slower.
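The bounded variant would only change the decorator argument (again just a sketch of the idea):

```python
import functools
import unicodedata

@functools.lru_cache(maxsize=100000)  # keep only the 100000 most recent results
def remove_accents(text: str) -> str:
    # Identical body; the bounded cache trades a little speed for a hard
    # upper limit on memory use.
    return ''.join(ch for ch in unicodedata.normalize('NFD', text)
                   if not unicodedata.combining(ch))
```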
Thanks, Mike, for the clarification. Let us see how much RAM the cache consumes. But there is an upper bound on the cache size: the database size, which is only about 3.5 MB!
No, the database doesn’t really have much to do with it. The user input and the context are already stored without accents in the database, so for reading from the database, accents don’t need to be removed. But while the user types and candidates are suggested, each time the user types an additional letter and new candidates are suggested, the accents have to be removed from the current input. Each time the user has finished typing a word, a new row is added or an existing row is updated; that causes the accent removal to be called again. During startup of Typing Booster, dictionaries are read, and in the case of languages using accents, like Portuguese, German, ..., the accents are removed from all words read from the dictionary.
If each word is on average 5 characters long, input and output would be 10 characters on average, plus some overhead when storing in the cache, so probably the cache needs about 1 MB for a typical dictionary. That seems very reasonable to me. And it is unlikely that it grows significantly after that, as the number of words a user types until the session is restarted will be much smaller than that. And all the words the user types which were already in the dictionary are already in the cache and don’t make the cache grow.
Thanks, Mike, for your clarification! My reasoning is as follows. The need to remove accents arises from the fact that the database works with non-accented words. Hence, hypothetically, one could keep the entire database with all words accented in memory, with no need to remove accents at all -- so the cache size is bounded by the size of the database with all words accented! I have meanwhile had an idea that may dispense with any cache: as the user types a word, letter by letter, there is no need to remove the accents of the whole ongoing word, but only of the last character, if it is accented!
The database has the words without accents to make it possible to match the user input without accents. Converting the user input character by character is also not possible, because sometimes accents get created by typing several characters, depending on the input methods used. For example, if t-latn-post is used, an accented character only appears after the accent key is typed following the base letter.
I think 1 MB of cache for this accent removal is no problem.
Yes, Mike, but if you worked with accented context in the database records, there would not be any need to remove accents. Note, though, Mike, that I am not advocating that you use accented context -- I am completely on the opposite side. This is just a mental exercise to show that, at most, caching is limited to the size of the database.
Sorry, Mike, but I was ignorant of that detail.
I agree, Mike!
We can even see in the build times that it is faster. 2.15.8 and 2.15.9 built in 4 minutes; builds of older versions took 5-6 minutes. That is the time saved running the test cases during the build. In real use the savings should be even greater, because the build times also include the setup of the build (installing required packages into the chroot, etc.), the building itself, and the packing of the results into the rpms.
Thanks, Mike. If I notice any problem with the new version, I will let you know.
Maybe we should revisit #231 and think about whether any fine-grained configurability of accent-insensitive matching is helpful at all. What we discussed there is quite a complicated change, with a complicated user interface to set it up which most users might not even understand. And according to the tests I did recently, it will probably accomplish very little in improving the predictions.

Doing completely accent-insensitive predictions (accent-insensitive user input and context) did not make the percentage of characters saved any worse when I tested with the French book “Notre Dame de Paris”. There may be some cases where typing the correct accents would reduce the number of candidates a bit and one would get the correct candidate a bit earlier, but that really doesn’t seem to happen often. If it happened often, it would probably have made the percentage of characters saved worse when testing with that French book.

When doing this test with the French book, I always had perfect context: the two previous words were always remembered correctly in the database. In reality this is not always the case, sometimes because surrounding text didn’t work and maybe the fallback of remembering the last two words didn’t help either, because the cursor was moved or the focus was moved to a different window. In reality it happens more often that the context is missing or even wrong than when doing such a test of reading from a book and then retyping that exact book. When the context is empty, typing the accents of the current input correctly might help a bit more compared to when two words of perfect context are available. I am not 100% sure; I need to look at this again carefully. Maybe it is not worth doing what we discussed in #231.
Given the improvements and the knowledge learned in the meantime, I tend to agree with your arguments, Mike.
While I am in the process of changing the timestamps of my database, I have something that I would like to suggest to you: to make the search case-insensitive. For instance, if I write “mike”, nothing is found by ibus-typing-booster, but if I start writing “Mi...” the “Mike” suggestion emerges immediately. I think it would be useful to be able to write “mi...” and then have the “Mike” suggestion offered immediately.
What do you think about this?