Fuzzy and fullpath fuzzy finding. #12

Open
vredesbyyrd opened this issue Jun 8, 2019 · 2 comments

Comments

@vredesbyyrd

vredesbyyrd commented Jun 8, 2019

Hi. After a fair amount of testing I have some observations regarding full path fuzzy finding and fuzzy search in general. The main search tool I was using before this was angrysearch, which also had the stated goal of being an Everything for Linux. Angrysearch came very close to that goal, minus the always up-to-date database, which is obviously a big downside. But its filtering logic is as good as Everything's, IMO. So any comparisons I make will be with angrysearch/Everything. To me, a good fuzzy finder means always being able to filter down to what you need from lazy queries. I'll show what I mean by lazy queries below.

First, this example does not necessarily concern full_path finding, only filename fuzzy searching. Let's say I am looking for all photographs of the artist 'fka twigs', but I do not know where they are located or how the basename is formatted, so I query just her name: gosearch -c -r -fp fka\ twigs

go-test_01

I only see a couple of cover.png album covers in the results, which I know are not what I am looking for. Making the same query with angrysearch with comparable fuzzy_whole_path settings finds the sought-after photographs (highlighted below) in addition to all strings in the database containing fka twigs.

angry

In my mind that's the benefit of whole path fuzzy finding. It allows the user to make broad 'lazy' searches and 99% of the time successfully filter down to what they are looking for. E.g., if I was looking for all albums by artist fka_twigs, the same query in full_path mode would find the results I am looking for.

Regarding finding the filename fka-twigs_01 from the query fka\ twigs: is this already possible in gosearch by using regex filters via the config? Or is it a bug in the fuzzy finding? I tried a few different regex patterns but could not get the results I wanted.

EDIT: I guess the important question here is what the best way is to avoid troublesome meta-characters. I think angrysearch removes them from the database altogether, which seems a bit overkill perhaps? E.g., you would not be able to query C++. Maybe a flag to replace the most common troublesome characters [hyphen, underscore, period] with spaces would be a good middle road. Just some thoughts. I'll take a look at how Everything handles it.

Performance concerns:

I did not do any real benchmarks between the master branch and fuzzy_path branch, but on my 2014 i5 laptop there is no perceivable performance difference. This is just some anecdotal evidence, but the gosearchServer process uses nearly identical memory on both branches. Not sure how good of a measure that is though. Initial index time was also essentially the same. IMHO, having the option of full-path query in the main search tool I use is very important and would have to come at a large performance cost to consider not implementing it.

Cheers.

  • path fuzzy finding branch index
    • sys = 205mb
    • totalalloc = 231mb
    • alloc = 113mb
    • 472244 files / 38463 dir in 6.996571 sec

Screenshot from 2019-06-08 12-05-36

  • master branch index
    • sys = 205mb
    • totalalloc = 284mb
    • alloc = 106mb
    • 472244 files / 38463 dir in 5.099121 sec

Screenshot from 2019-06-08 12-07-15

@ozeidan
Owner

ozeidan commented Jun 9, 2019

Hi @vredesbyyrd, thanks a lot for your feedback!

For your first issue: I looked at how angrysearch implements the indexing and it seems like they don't really have fuzzy finding (at least as it is defined in this project). Instead, a space in angrysearch seems to split the query into phrases, which can appear in the search result in any order.

The fuzzy searching, as defined in this project (which is also not actual fuzzy searching afaik), takes the query as a single phrase and looks for strings that contain all of the characters of the phrase in the same order, but not necessarily contiguously. So in your case, since you searched for "fka\ twigs", there has to be a space somewhere between the occurrences of "fka" and "twigs". If you don't want that, you can search for "fkatwigs" instead.
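
A minimal sketch of that matching rule, for illustration only. `isSubsequence` is a hypothetical name and this is not the actual gosearch code; it simply checks that the query's characters occur in the target in order.

```go
package main

import "fmt"

// isSubsequence reports whether every rune of query occurs in target
// in the same order, though not necessarily next to each other.
func isSubsequence(query, target string) bool {
	q := []rune(query)
	i := 0 // index of the next query rune still to be found
	for _, r := range target {
		if i < len(q) && r == q[i] {
			i++
		}
	}
	return i == len(q)
}

func main() {
	// The query contains a space, the file name does not: no match.
	fmt.Println(isSubsequence("fka twigs", "fka-twigs_01")) // false
	// Without the space, the extra '-' in the name is simply skipped.
	fmt.Println(isSubsequence("fkatwigs", "fka-twigs_01")) // true
}
```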

Maybe it would be a smart thing to split the query into phrases as angrysearch does it, and then process those phrases differently. I'm also thinking about how we can improve the sort order, i.e. if the filename contains the whole query, it should be prioritized.
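
Roughly, the idea could look like the sketch below. This is only a sketch under assumptions: `matchesAllPhrases` and `search` are hypothetical names, phrases are matched as plain case-sensitive substrings, and the sample paths are made up for the example.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"strings"
)

// matchesAllPhrases reports whether path contains every
// whitespace-separated phrase of query as a substring, in any order.
func matchesAllPhrases(query, path string) bool {
	for _, phrase := range strings.Fields(query) {
		if !strings.Contains(path, phrase) {
			return false
		}
	}
	return true
}

// search keeps the paths that match all phrases and moves results whose
// file name contains the whole query to the front, keeping the original
// order otherwise.
func search(query string, paths []string) []string {
	var hits []string
	for _, p := range paths {
		if matchesAllPhrases(query, p) {
			hits = append(hits, p)
		}
	}
	sort.SliceStable(hits, func(i, j int) bool {
		a := strings.Contains(filepath.Base(hits[i]), query)
		b := strings.Contains(filepath.Base(hits[j]), query)
		return a && !b
	})
	return hits
}

func main() {
	// Made-up sample paths for illustration.
	paths := []string{
		"/music/fka twigs/cover.png",
		"/photos/fka-twigs_01.jpg",
		"/photos/twigs and fka twigs.jpg",
		"/docs/notes.txt",
	}
	for _, p := range search("fka twigs", paths) {
		fmt.Println(p)
	}
}
```

With these sample paths, the file whose name contains the whole query is listed first, and fka-twigs_01.jpg is found as well because "fka" and "twigs" are matched separately.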

Regarding the performance concerns, I think we're good here. I think if we end up creating a useful full path search, I'll merge it into master and keep a build option around to disable the changes.

@vredesbyyrd
Author

> For your first issue: I looked at how angrysearch implements the indexing and it seems like they don't really have fuzzy finding (at least as it is defined in this project). Instead, a space in angrysearch seems to split the query into phrases, which can appear in the search result in any order.

> The fuzzy searching, as defined in this project (which is also not actual fuzzy searching afaik), takes the query as a single phrase and looks for strings that contain all of the characters of the phrase in the same order, but not necessarily contiguously. So in your case, since you searched for "fka\ twigs", there has to be a space somewhere between the occurrences of "fka" and "twigs". If you don't want that, you can search for "fkatwigs" instead.

Thank you for detailing that! The results I was getting, or not getting, make sense now... in hindsight I should have recognized why. Using my earlier example, the query "fkatwigs" does indeed find what one would expect. I am admittedly pretty naive when it comes to the inner workings of search tools, like breadth-first vs depth-first algorithms and how different matching methods really work and differ.

> Maybe it would be a smart thing to split the query into phrases as angrysearch does it, and then process those phrases differently. I'm also thinking about how we can improve the sort order, i.e. if the filename contains the whole query, it should be prioritized.

From a user point of view, I agree that splitting the query into phrases in a similar fashion to angrysearch would be a nice addition; it's just a bit more user friendly, IMO. On prioritizing complete matches, agreed.

> Regarding the performance concerns, I think we're good here. I think if we end up creating a useful full path search, I'll merge it into master and keep a build option around to disable the changes.

Right on 👍
