TOP Tailored search engine #4350

Mclilzee · 2024-01-19T04:05:20Z

Before we start, there are a few important points that I want you to take into consideration to answer questions you might have at this point:

Why did I not create an issue and wait for approval, as the contributing guidelines require? Because I was building this project for fun, out of my own interest. We have never had a search engine on the website itself, and the bot uses Google instead of a search API, so I wanted to try building one.
Please treat this as if it were the issue I'm opening. Do not take the work I have already done into consideration or feel bad about turning it down; I was going to build it anyway, even if the issue were downvoted.
Why is the Ruby code so terrible? Is this person not ashamed of writing such terrible code? Yes, I 100% agree. I'm very bad when it comes to Ruby (very obvious when you see that I concatenated a hard-coded TOP URL with the slug), and I was dragging myself through the documentation as I went along.

Now that these questions are out of the way, let me propose my suggestion to add to TOP, as if this were an open issue.

The TOP tailored search engine is built on the tf_idf algorithm. It reads all the documents from the lessons table and extracts their text by parsing the HTML with the Nokogiri library (though I could build an HTML parser if we want to avoid external libraries). It then creates a table holding each word and its score, linked to the lessons table by a join on the frequency row's lesson id = lessons.id.
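As a rough illustration of the tf_idf weighting described above, here is a minimal plain-Ruby sketch. It is not the code from this PR: the method names, the sample lessons and the in-memory hash standing in for the database table are all assumptions, and it skips the HTML parsing step entirely.

```ruby
# Minimal tf_idf sketch in plain Ruby (no Rails, no Nokogiri).
# Every name below is hypothetical and not taken from the PR.

# Split text into lowercase word tokens.
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Term frequency: the share of a document's tokens taken up by each word.
def term_frequencies(tokens)
  tokens.tally.transform_values { |count| count.to_f / tokens.size }
end

# Inverse document frequency: log(total documents / documents containing the word).
def inverse_document_frequencies(tokenized_docs)
  doc_counts = Hash.new(0)
  tokenized_docs.each_value do |tokens|
    tokens.uniq.each { |word| doc_counts[word] += 1 }
  end
  total = tokenized_docs.size.to_f
  doc_counts.transform_values { |containing| Math.log(total / containing) }
end

# Build { lesson_id => { word => tf_idf score } } from { lesson_id => text }.
def build_index(lessons)
  tokenized = lessons.transform_values { |text| tokenize(text) }
  idf = inverse_document_frequencies(tokenized)
  tokenized.transform_values do |tokens|
    term_frequencies(tokens).to_h { |word, tf| [word, tf * idf[word]] }
  end
end

# Rank lesson ids by the summed score of the query's words, best first.
def search(index, query)
  words = tokenize(query)
  scores = index.transform_values do |word_scores|
    words.sum { |word| word_scores.fetch(word, 0.0) }
  end
  scores.select { |_, score| score.positive? }
        .sort_by { |_, score| -score }
        .map(&:first)
end

lessons = {
  1 => "Ruby blocks and procs let you pass code to a method",
  2 => "JavaScript arrays and loops in the browser",
  3 => "Ruby arrays and the each method"
}

index = build_index(lessons)
p search(index, "ruby blocks") # => [1, 3]
```

The query words do not need to form a sentence; each word contributes its own score, and a word that appears in every lesson (such as "and" here) scores zero.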

The table in question has about 130k entries, which take up about 16M of space. That can be lowered by introducing a stop_words list (a, the, for, how, what and so on) to filter out words that are going to be present in almost every document. This will not affect search quality, because the algorithm already gives those words a smaller score, since they are spread across the whole curriculum.
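To make the size argument concrete, here is a small sketch of stop word filtering in plain Ruby. The stop word list, the method names and the sample token lists are assumptions for illustration; the real list would need tuning against the curriculum.

```ruby
require "set"

# Hypothetical stop word list, not the one used in the PR.
def default_stop_words
  Set.new(%w[a an the for how what and to in of is])
end

# Drop stop words before counting, so they never become index rows.
def remove_stop_words(tokens, stop_words = default_stop_words)
  tokens.reject { |token| stop_words.include?(token) }
end

# Number of (lesson, word) rows an index over these token lists would store.
def index_row_count(tokenized_docs)
  tokenized_docs.sum { |tokens| tokens.uniq.size }
end

docs = [
  %w[how to use the each method in ruby],
  %w[what is a block in ruby],
  %w[the basics of javascript for loops]
]

filtered = docs.map { |tokens| remove_stop_words(tokens) }
puts "rows before: #{index_row_count(docs)}, rows after: #{index_row_count(filtered)}"
```

Filtering at index time only changes how many rows are stored; the remaining words keep the scores they would have had anyway.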

Another filter we can apply is by HTML tag. I currently only filter out the code tags, but we can filter more tags if necessary. I didn't want to put those filters in place because I'm not sure how important they are. I'm not an expert when it comes to performance, but moving forward it can be tailored as we see fit.

You can play around with the queries API. So far I have only created the API and tested it by fetching the JSON data for queries. The API would be a great fit for the top-bot, and we could add a dedicated view reached through a new search bar on the nav, but that is for later.

The database currently gets indexed by running rails search:index. I feel at this point that I'm explaining implementation details that you can look up by reading the code, dear reader.

Be careful: if you run the same update_content more than once as of now, you will repopulate the database with extra data. I haven't figured out how to reset it yet; I have to read more of the code to understand what's going on.

Let's talk about some problems that I have faced. Right now the search doesn't distinguish between the Ruby and JS paths; it returns the results that fit best. I tried to make the results distinct using the identifier_uuid column, but found that some rows have different UUIDs while still being the same lesson, so I used the title to make each result unique.
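The title-based de-duplication can be sketched like this in plain Ruby. The row shape (hashes with identifier_uuid, title, path and score keys) and the sample values are assumptions for illustration, not the PR's actual query.

```ruby
# Keep only the best-scoring row for each title.
# Rows are plain hashes here; the keys are hypothetical.
def unique_by_title(results)
  results.sort_by { |row| -row[:score] }.uniq { |row| row[:title] }
end

results = [
  { identifier_uuid: "uuid-ruby-arrays", title: "Arrays",     path: "ruby",        score: 0.4 },
  { identifier_uuid: "uuid-js-arrays",   title: "Arrays",     path: "javascript",  score: 0.6 },
  { identifier_uuid: "uuid-git-basics",  title: "Git Basics", path: "foundations", score: 0.5 }
]

unique_by_title(results).each do |row|
  puts "#{row[:title]} (#{row[:path]}): #{row[:score]}"
end
```

One trade-off of keying on the title: two genuinely different lessons that happen to share a title would also collapse into a single result.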

Another problem I have faced is that some slugs in the lessons table are invalid, like the React ones. The React links don't work; the new React courses have newer links. I'm not sure what the slug is at this point. I thought it was the unique path for each lesson, and it does work for most of them.

Another thing: tests haven't been written yet. Some tests still need to be written; so far all of this was tested manually by me.

If you wish to test the quality of the searches, make sure to compare them to Google searches.

KevinMulhern · 2024-01-23T22:06:21Z

Nice work @Mclilzee, this is impressive!

We have an open issue for search, but it's been blocked on design for a while. I think the approach you've taken here of adding search as a feature the bot would consume first is a brilliant way around that.

I'm not so sure about building out and maintaining our own search engine. We've got great search tooling available to us by virtue of having PostgreSQL as our database, which has great full text search support. If we add the pg_search gem and a little bit of config to the lesson model, we'd get a very flexible and powerful search for very little effort and long-term maintenance overhead.

Mclilzee · 2024-01-23T23:40:47Z

@KevinMulhern Thanks for the nice words.

I would say to base your decision on the results. Testing this myself against Google, it was doing excellently and going toe to toe with Google on the results. I haven't yet compared pg search results to this, so I don't have an opinion on that. But from my understanding, pg search uses full text search, which relies on queries matching snippets of text, while my approach uses per-document word weighting to return the best result even if the words don't form a full sentence.

Likewise, I do understand the maintainability issue, but this is a fairly straightforward rake task that generates a database for searching. If its performance outclasses that of PG Search, then it would be a loss to let maintainability stand in the way. I would gladly lend a hand with the maintenance if that's necessary.

In the end that is up to your preference. As for the work that I have done on this, you should completely disregard it. By the time you reviewed this, I had built 2 other versions of it. One crawls all 2000+ links inside the curriculum to index each page for searching, although I ended up with search results that barely include TOP pages. In the other, I separated each lesson's sections into their own documents and indexed them that way. The idea was that a search would send back the specific section that best matches the query.

My point is, I had fun doing it and I would do it again. It was a fun experience, and I would not personally consider any of what I did a waste of time, so don't let that influence your decision in any way.

KevinMulhern · 2024-01-24T01:00:47Z

Thanks @Mclilzee, you've been busy!

For an internal TOP search, full text search is likely to be what we need. Off the top of my head, the requirements we'd have for a TOP search would be:

Searching against different attributes - like titles, descriptions and content; with different weightings for each.
Extending to different models - in the near future we'll have a few different models to search against instead of lessons alone.
Different search tools in different contexts - we'll want to have typical text search for users and more advanced search tools on our admin interface.
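To illustrate the first requirement, per-attribute weighting can be sketched in plain Ruby as below. This is not how pg_search or PostgreSQL full text search works internally; the weights, attribute names, method names and sample records are all assumptions, and the scoring is a simple weighted word count.

```ruby
# Hypothetical weights: a match in the title counts more than one in the content.
def default_weights
  { title: 3.0, description: 2.0, content: 1.0 }
end

# Split text into lowercase word tokens; nil attributes become no tokens.
def tokenize(text)
  text.to_s.downcase.scan(/[a-z0-9']+/)
end

# Score one record: count query word occurrences per attribute, scaled by its weight.
def weighted_score(record, query, weights = default_weights)
  words = tokenize(query)
  weights.sum do |attribute, weight|
    tokens = tokenize(record[attribute])
    weight * words.sum { |word| tokens.count(word) }
  end
end

# Return the matching records, best score first.
def weighted_search(records, query, weights = default_weights)
  records
    .map { |record| [record, weighted_score(record, query, weights)] }
    .select { |_, score| score.positive? }
    .sort_by { |_, score| -score }
    .map(&:first)
end

records = [
  { title: "Arrays", description: "Working with lists", content: "Ruby arrays hold ordered values" },
  { title: "Ruby Basics", description: "Variables and types", content: "An introduction" }
]

weighted_search(records, "ruby").each { |record| puts record[:title] }
```

Because the records are plain hashes, the same methods would work for any model that exposes the weighted attributes, which is the kind of flexibility the second requirement asks for.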

I think that's why my preference would be a general purpose tool like pg_search; it's equipped to do all that without us needing to deeply understand and maintain the internals.

While I can definitely respect that this search engine is simple now, it will inevitably grow bigger and more complex as we need it to do more. That's when the maintenance will start to sting. If prior experience has taught me anything about search, it's to lean on the existing solutions and only make your own if you have no other choice. You don't want to be stuck maintaining a bespoke and complex search engine on your own 😆

But this is just my own opinion / anecdotal experience. I'd like to get other @TheOdinProject/maintainers to weigh in before we make any decisions.

Mclilzee added 30 commits January 17, 2024 20:18

Add tokenizing

1b09a53

Remove printing

2ab26b8

Add words frequency db

dd6a538

Update to latest before migration

8a0ffe4

Update Lesson

1915866

Add words frequency migrate

faeddcc

Fix typo

25fba2c

Run migration

9dcb1bb

Update word frequencies name

76c2261

Add returning tf tuple

a9d3832

Update schema of td_idf

25ae90f

Add calculating and summarizing

d78d104

Fix word_frequencies issue

7c3d950

Update saving to database

6beefe1

Fix issue with calculations

7c4ef71

Change timer locatioin

6ff942b

Update to use bulk insertion

2e1e080

Add stop words

6c7fb86

Remove code tags

7dd568b

Add more stop words

f920253

Add more stop words

4afa635

Fix tf_idf equation

3ba0117

Remove stop words and fix tf_idf calculation

7e351bc

Change iteration to use words iterator

0b27dd2

Update iterating

ad54468

TODO: REMOVE

d147330

Update routes

5f0d03f

Add search api

b71df2a

Complete searching endpoint

3504661

Move indexer into a service

8764fc0

Mclilzee added 27 commits January 20, 2024 16:21

Connect search record with tf_idfs

1dfeefe

Create tf_idf service class

81fff5b

Finalize tf_idf service

fb2c155

Fix service location

e9eb56d

Fix bug returning title duplicated

4c84eff

Update to use array instead of map

baa8a65

Fix calculating tf_idf

d03a53f

Refactor service to extract indexes for new table schema

d515c49

Update tf_idf_service to exclude stop_words

cf93a04

Fix stop words wrongly generated error

44dcf82

Refactor naming

3efaf78

Add extracting external links

826c192

Refactor

4791836

Update service to crawl links

c9a7cdd

Update migratioin

8b8f022

Refactor

0f8c7dd

Update to insert in bulk

9c83511

Add unique index for tf_idf table

9266e47

Refactor naming

27b6806

Refactor tf service

a3e3aee

Refactor structure

fc4c63b

Refactor index service

23b37ba

Fix bugs and migrate

ba578e9

Refactor controller

da086b8

Update regex

3691e16

Fine tune indexing to generate better results for titles

b51c245

Tune down the words scoring of titles and desc

90ef5ac
