TOP Tailored search engine #4350
Nice work @Mclilzee, this is impressive! We have an open issue for search, but it's been blocked on design for a while. I think the approach you've taken here of adding search as a feature the bot would consume first is a brilliant way around that. I'm not so sure about building out and maintaining our own search engine. We've got great search tooling available to us by virtue of having PostgreSQL as our database: it has great full-text search support. If we add the pg_search gem and a little bit of config to the lesson model, we'd get a very flexible and powerful search for very little effort and long-term maintenance overhead.
@KevinMulhern Thanks for the kind words. I would say to base your decision on the results: when I tested this myself against Google, it did excellently and went toe to toe with Google on the results. I haven't compared it against pg_search yet, so I don't have an opinion on that. From my understanding, pg_search uses full-text search, which relies on queries matching snippets of text, while my approach uses word weighting per document to return the best result even if the words don't form a full sentence. Likewise, I do understand the maintainability issue, but this is a fairly straightforward rake task that generates a database for searching. If its performance outclasses that of pg_search, it would be a loss to let maintainability stand in the way, and I would gladly lend a hand with the maintenance if necessary. In the end that is up to your preference. As for the work I have already put into this, you should disregard it completely when deciding. By the time you reviewed this, I had built two other versions. One crawls all 2000+ links inside the curriculum to index each page for searching, although I ended up with searchable results that barely return TOP pages. In the other, I separated each lesson section into its own document and indexed it that way, the idea being that a search would send back the specific section that best matches the query. My point is, I had fun doing it and I would do it again. It was a fun experience and I would not personally consider any of it a waste of time, so don't let that influence your decision in any way.
Thanks @Mclilzee, you've been busy! For an internal TOP search, full-text search is likely to be what we need. Off the top of my head, the requirements we'd have for a TOP search would be:
I think that's why my preference would be a general-purpose tool like pg_search: it's equipped to do all that without us needing to deeply understand and maintain the internals. While I can definitely respect that this search engine is simple now, it will inevitably grow bigger and more complex as we need it to do more. That's when the maintenance will start to sting. If prior experience has taught me anything about search, it's to lean on the existing solutions and only make your own if you have no other choice. You don't want to be stuck maintaining a bespoke and complex search engine on your own 😆 But this is just my own opinion / anecdotal experience. I'd like to get other @TheOdinProject/maintainers to weigh in before we make any decisions.
Before we start, there are a few important points that I want you to take into consideration to answer questions you might have at this point:
Now that these questions are out of the way, let me propose my suggestion to add to TOP as an open issue.
The TOP tailored search engine is built on the tf-idf algorithm. It parses all the documents from the Lessons database, extracting the text from the HTML with the Nokogiri library (I could build an HTML parser if we want to avoid using external libraries). It then creates a table holding each word and its scores, linked to the Lessons table by a join on frequency lesson id = lessons.id.

The table in question will have about 130k entries, which take about 16 MB of space. That can be lowered by introducing stop words such as `a the for how what` and so on, to filter out words that are going to be present in most documents. This will not affect the search quality, because the algorithm already gives those words a smaller score for being spread across the whole curriculum.

Another filter we can apply is by HTML tag. I currently only exclude the code tags, but we can filter more tags if necessary. I didn't want to put that filtering in place because I'm not sure how important it is. I'm not an expert when it comes to performance, but moving forward it can be tailored as we see fit.
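To make the scoring above concrete, here is a minimal, self-contained sketch of tf-idf indexing with a stop-word filter. It is not the code from this PR: the method names (`tokenize`, `build_index`, `search`), the tokenising regex, and the in-memory hashes standing in for the database table are all assumptions made for illustration, and it takes plain text as input rather than HTML parsed with Nokogiri.

```ruby
# Hypothetical stop-word list; the description only names these examples.
STOP_WORDS = %w[a the for how what].freeze

# Split text into lowercase word tokens and drop stop words.
def tokenize(text)
  text.downcase.scan(/[a-z0-9_]+/).reject { |word| STOP_WORDS.include?(word) }
end

# Build { word => { doc_id => tf-idf score } } from { doc_id => plain text }.
def build_index(documents)
  term_counts = documents.transform_values { |text| tokenize(text).tally }

  # Number of documents each word appears in.
  doc_frequency = Hash.new(0)
  term_counts.each_value do |counts|
    counts.each_key { |word| doc_frequency[word] += 1 }
  end

  index = Hash.new { |hash, word| hash[word] = {} }
  term_counts.each do |doc_id, counts|
    total = counts.values.sum.to_f
    counts.each do |word, count|
      # Words spread across many documents get an idf close to zero.
      idf = Math.log(documents.size.to_f / doc_frequency[word])
      index[word][doc_id] = (count / total) * idf
    end
  end
  index
end

# Rank documents by the summed score of every query word they contain.
def search(index, query)
  scores = Hash.new(0.0)
  tokenize(query).each do |word|
    index.fetch(word, {}).each { |doc_id, score| scores[doc_id] += score }
  end
  scores.sort_by { |_doc_id, score| -score }.map(&:first)
end

lessons = {
  1 => "Ruby arrays and hashes in Ruby",
  2 => "JavaScript arrays and loops",
  3 => "Git basics for the command line"
}
index = build_index(lessons)
search(index, "ruby hashes") # => [1]
search(index, "arrays")      # => [2, 1]
```

Because scores are summed per word, the query words do not need to form a phrase that appears in the lesson, which is the behaviour the description contrasts with snippet matching.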
You can play around with the queries API. So far I have only created the API and tested it by fetching the JSON data of queries. The API would be a great fit for the top-bot, and later we can add a dedicated view reached through a new search bar in the nav.
The database currently gets indexed by running `rails search:index`. I feel at this point that I'm explaining implementation details that you can look up by reading the code, dear reader.

Be careful: as of now, if you run the same update_content more than once, you will repopulate the database with extra data. I haven't figured out how to reset it yet; I have to read more of the code to understand what's going on.
Let's talk about some problems that I have faced. Right now the search doesn't distinguish between the Ruby and JS paths; it returns whichever result fits best. I tried to make the results distinct using the `identifier_uuid` column, but found that some lessons have different UUIDs while still being the same lesson, so I used the title to make each result unique instead.

Another problem is that some slugs in the lessons table are invalid, like the React ones. The React links don't work because the new React courses have newer links. I'm not sure what the `slug` is at this point; I thought it was the unique path for each lesson, and it does work for most of them.

Finally, tests haven't been written yet and some are needed. Everything so far was tested manually by me.
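The title-based de-duplication mentioned above could look roughly like the sketch below. The `Result` struct and `unique_by_title` helper are hypothetical names, and the rows are made-up sample data; the actual PR works against the lessons table instead of in-memory structs.

```ruby
# Hypothetical result row; real rows would come from the lessons table.
Result = Struct.new(:title, :identifier_uuid, :score)

# Keep only the highest-scoring result for each title, best score first.
def unique_by_title(results)
  results.sort_by { |result| -result.score }.uniq { |result| result.title }
end

results = [
  Result.new("Arrays", "uuid-ruby-path", 0.4),
  Result.new("Arrays", "uuid-js-path", 0.7),
  Result.new("Loops", "uuid-loops", 0.5)
]
unique_by_title(results).map(&:title) # => ["Arrays", "Loops"]
```

One trade-off of keying on the title is that two genuinely different lessons sharing a title would collapse into a single result.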
If you wish to test the quality of the searches, make sure to compare them to Google searches.