How to write a Search Engine Module

milos.medic edited this page Oct 2, 2023 · 2 revisions

Look at some of the already implemented search engine modules to get an idea of how a module should look. An example of a very simple module is mojeek (src/engines/mojeek). The duckduckgo module is also worth reading, as it is unique in multiple ways.

Folder Structure

The folder for a search engine can contain the following files:

  • <name>.go (e.g. qwant.go) - the main Go file for the module
  • options.go - a Go file with:
    • var Info engines.Info
    • var Support engines.SupportedSettings
    • optionally var dompaths engines.DOMPaths
    • optionally var timings config.Timings (this may be moved soon)
  • optionally a <name>.md (e.g. qwant.md) markdown file, explaining things worthy of note
  • optionally a json_response.go Go file that has the structures and functions necessary for parsing JSON responses if the module receives them
  • for anything less standard (for search engines that require a more unique implementation), try to refer to previous implementations

Notes

  • Two things to keep in mind when creating a module:
    • It should be as fast as possible
    • It should minimize the chance of being rate limited.
      • This is achieved by emulating user interaction as closely as possible. For example, omit the page URL parameter (e.g. &s= in mojeek) from the request for the first page.
  • Clean up the retrieved URL by passing it to parse.ParseURL.
  • Different search engines have different formats for the locale, device, safeSearch (and similar) parameters, while the format is standardized in engines.Options. Parsing this should be done in module functions like getLocale, getDevice, getSafeSearch (and similar). Refer to qwant.
  • If the search engine uses "Load More" functionality instead of numbered pages (like yep), the page value for all results should be 1.
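
The first-page and option-mapping notes above can be sketched as follows. This is a minimal illustration, not code from any module: the host, the q and safe parameter names, the offset semantics of s, and the SafeSearch type are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// SafeSearch is a hypothetical stand-in for the standardized option
// that the real code would read from engines.Options.
type SafeSearch bool

// getSafeSearch translates the standardized value into the
// engine-specific parameter value (assumed here to be "1" or "0").
func getSafeSearch(safe SafeSearch) string {
	if safe {
		return "1"
	}
	return "0"
}

// buildURL builds the request URL for a 1-indexed page. The page
// parameter is left out on the first page, as a browser would do.
func buildURL(base, query string, page, perPage int, safe SafeSearch) string {
	params := url.Values{}
	params.Set("q", query)
	params.Set("safe", getSafeSearch(safe))
	if page > 1 {
		// Assumed semantics: s is the 1-based offset of the first result.
		params.Set("s", strconv.Itoa((page-1)*perPage+1))
	}
	return base + "?" + params.Encode()
}

func main() {
	fmt.Println(buildURL("https://search.example/search", "go colly", 1, 10, true))
	fmt.Println(buildURL("https://search.example/search", "go colly", 2, 10, true))
}
```

Note that url.Values.Encode sorts parameters by key. If an engine is sensitive to parameter order, the query string may need to be assembled by hand instead.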

Common problems

And the modules that solve them.

How to keep track of what page a result is from?

Pass a context holding the page number (1-indexed). Almost every module does this; mojeek is one example.
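
A rough sketch of the idea, using only the standard library. The reqContext type is a hypothetical stand-in for the scraper's per-request context (a string key/value store that travels with the request to its callbacks), and the "page" key is an assumption; the existing modules may store it differently.

```go
package main

import (
	"fmt"
	"strconv"
)

// reqContext is a hypothetical stand-in for the per-request context
// that the scraping library hands back to the response callbacks.
type reqContext map[string]string

// newPageContext stores the 1-indexed page number on a fresh context.
func newPageContext(page int) reqContext {
	return reqContext{"page": strconv.Itoa(page)}
}

// pageFromContext reads the page number back inside a callback.
// It falls back to 1 when the value is missing or malformed.
func pageFromContext(ctx reqContext) int {
	page, err := strconv.Atoi(ctx["page"])
	if err != nil || page < 1 {
		return 1
	}
	return page
}

func main() {
	for page := 1; page <= 3; page++ {
		ctx := newPageContext(page) // attached to the request for this page
		fmt.Println("result callback sees page", pageFromContext(ctx))
	}
}
```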

How to keep track of what is the index of a result on some page?

Refer to var pageRankCounter []int from mojeek. This works because the OnHTML callback is invoked for the matches in order and synchronously.
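
The counter idea can be sketched like this. The result struct, the collect helper, and the slice sizing are assumptions for illustration and do not reproduce the mojeek code; the nested loop stands in for callbacks firing in document order.

```go
package main

import "fmt"

// result is a hypothetical, trimmed-down search result.
type result struct {
	URL        string
	Page       int
	OnPageRank int
}

// collect simulates the per-result callback being invoked in order
// for each page and assigns every result its rank on that page.
func collect(pages [][]string) []result {
	// One counter per page; index 0 is unused so pages stay 1-indexed.
	pageRankCounter := make([]int, len(pages)+1)
	var results []result

	onResult := func(page int, url string) {
		// Safe without locking only because the callback runs
		// synchronously and in order for the matches on a page.
		pageRankCounter[page]++
		results = append(results, result{
			URL:        url,
			Page:       page,
			OnPageRank: pageRankCounter[page],
		})
	}

	for i, urls := range pages {
		for _, u := range urls {
			onResult(i+1, u)
		}
	}
	return results
}

func main() {
	pages := [][]string{
		{"https://a.example", "https://b.example"},
		{"https://c.example"},
	}
	for _, r := range collect(pages) {
		fmt.Printf("page %d, rank %d: %s\n", r.Page, r.OnPageRank, r.URL)
	}
}
```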

What if some needed field is not always in the same HTML element?

It's okay to hardcode some elements (instead of putting them in dompaths). Refer to descText from brave.

How to escape a telemetry link?

Various methods may be necessary, refer to duckduckgo and etools.
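
One common case is a redirect link that carries the real target in a query parameter. The sketch below handles only that case; the tracker.example host and the uddg parameter name are assumptions for illustration, and the existing modules may need different handling.

```go
package main

import (
	"fmt"
	"net/url"
)

// unwrapRedirect returns the target URL carried in the query parameter
// param of a redirect/telemetry link. If the link cannot be parsed or
// does not carry the parameter, the input is returned unchanged.
func unwrapRedirect(raw, param string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return raw
	}
	// Query() percent-decodes the value, so no extra unescaping is needed.
	target := u.Query().Get(param)
	if target == "" {
		return raw
	}
	return target
}

func main() {
	wrapped := "//tracker.example/l/?uddg=https%3A%2F%2Fexample.org%2Fpage%3Fa%3D1&rut=abc"
	fmt.Println(unwrapRedirect(wrapped, "uddg"))
	fmt.Println(unwrapRedirect("https://example.org/direct", "uddg"))
}
```

The unwrapped URL would still go through parse.ParseURL afterwards, as described in the notes.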

The Search Engine uses an identifier cookie.

Cookies received through the Set-Cookie header are saved and passed in subsequent requests by colly automatically. We do need to wait for a response that actually sets the cookie though (not everything can be async). Refer to etools.
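
The ordering constraint can be illustrated with the standard library alone. This sketch uses net/http with a cookie jar and a local test server rather than colly, so it shows the principle and not the etools implementation; the cookie name, header name, and paths are made up for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
)

// newTestServer stands in for a search engine that issues an
// identifier cookie on first contact and echoes it back afterwards.
func newTestServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if c, err := r.Cookie("session"); err == nil {
			w.Header().Set("X-Seen-Session", c.Value)
			return
		}
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "abc123", Path: "/"})
	}))
}

// fetchPages makes one blocking request first so the cookie is stored
// in the jar, then requests the pages, which all carry the cookie.
func fetchPages(baseURL string, pages int) ([]string, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	client := &http.Client{Jar: jar}

	// Wait for this response: it is the one that sets the cookie.
	resp, err := client.Get(baseURL + "/")
	if err != nil {
		return nil, err
	}
	resp.Body.Close()

	var seen []string
	for p := 1; p <= pages; p++ {
		resp, err := client.Get(fmt.Sprintf("%s/search?page=%d", baseURL, p))
		if err != nil {
			return nil, err
		}
		seen = append(seen, resp.Header.Get("X-Seen-Session"))
		resp.Body.Close()
	}
	return seen, nil
}

func main() {
	srv := newTestServer()
	defer srv.Close()

	seen, err := fetchPages(srv.URL, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(seen)
}
```

In a module, only the cookie-setting request needs to be waited on; the page requests that follow can still be issued asynchronously.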

How to parse (API) JSON responses?

Through unmarshalling. Refer to qwant and swisscows. If the JSON has an array that doesn't have consistent objects, refer to yep.
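
A minimal sketch of both cases using encoding/json. The response shape and field names are invented for the example and do not match any particular engine. Deferring each array element with json.RawMessage is one way to cope with inconsistent objects, not necessarily the approach taken in yep.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiResult holds the fields of interest from one result object.
type apiResult struct {
	Title string `json:"title"`
	URL   string `json:"url"`
	Desc  string `json:"desc"`
}

// apiResponse keeps the array elements raw so each one can be
// inspected separately before committing to a shape.
type apiResponse struct {
	Items []json.RawMessage `json:"items"`
}

// parseResults unmarshals the body and keeps only the elements that
// look like results; everything else in the array is skipped.
func parseResults(body []byte) ([]apiResult, error) {
	var resp apiResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}

	var results []apiResult
	for _, raw := range resp.Items {
		var r apiResult
		// A bare string fails to unmarshal into the struct, and an
		// object of another shape leaves URL empty; skip both.
		if err := json.Unmarshal(raw, &r); err != nil || r.URL == "" {
			continue
		}
		results = append(results, r)
	}
	return results, nil
}

func main() {
	body := []byte(`{"items":[
		{"title":"A","url":"https://a.example","desc":"first"},
		"Ok",
		{"total":2},
		{"title":"B","url":"https://b.example","desc":"second"}
	]}`)

	results, err := parseResults(body)
	if err != nil {
		panic(err)
	}
	for _, r := range results {
		fmt.Println(r.Title, r.URL)
	}
}
```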

The Search Engine has countermeasures towards scraping.

Good luck. swisscows uses a nonce + signature. You may also refer to metager (not implemented at the time of writing).

The Search Engine has a good Captcha.

Good luck. yandex is an example (not implemented at the time of writing).