Improving a minimum viable search engine as part of an information retrieval course

anaterovic/GIR

GIR Exercise 2022W

Exercise overview

This is the repository for the exercise of Grundlagen des Information Retrieval (2022).

The aim of this exercise is to improve a minimum viable search engine, such that it returns the right documents for predefined user queries.

We were provided with working starter code and tests that check how many of the predefined queries are already satisfied. The exercise is divided into two milestones: in milestone 1 the system was tested only against the milestone 1 queries, whereas in milestone 2 it was tested against all queries.

Data

The data consists of around 25,000 crawled Wikipedia pages about English-language movies released from 2000 until today. It was collected with the code in the wiki folder.

Queries

During the exercise, we had to improve the starter system to satisfy the following queries.

Milestone 1

[
  { 
      "query":  "Russell Crowe in roman age",
      "expected_result":  "Gladiator (2000 film)"
  },
  { 
      "query": "Oscar winning Christopher Nolan movie about 2nd world war",
      "expected_result": "Dunkirk (2017 film)"
  },
  { 
      "query": "moon landing movie with Ryan Gostling",
      "expected_result": "First Man (film)"
  },
  { 
      "query": "an astronaut is left on the Mars, it was shot in Budapest",
      "expected_result": "The Martian (film)"
  },
  { 
      "query": "beautiful mi",
      "expected_result": "A Beautiful Mind (film)"
  }
]

Milestone 2

[
  { 
      "query":  "12 districts fighting in a deathly game",
      "expected_result":  "The Hunger Games (film)"
  },
  { 
      "query": "no mas bebes",
      "expected_result": "No más bebés"
  },
  { 
      "query": "It",
      "expected_result": "It (2017 film)"
  },
  { 
      "query": "2014 american coming-of-age romance based on a novel",
      "expected_result": "The Fault in Our Stars (film)"
  },
  { 
      "query": "documentary about women in STEM",
      "expected_result": "Picture a Scientist"
  }
]
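
Each entry above pairs a query with the title that should appear among the results. The following is a minimal sketch of how such a list could be scored. It assumes a hypothetical `search` callable that returns ranked titles; `satisfied_queries` and `fake_search` are illustrative names, and the repository's actual tests may work differently.

```python
import json
from typing import Callable, Dict, List

# Two entries copied from the milestone lists above.
QUERIES_JSON = """
[
  {"query": "beautiful mi", "expected_result": "A Beautiful Mind (film)"},
  {"query": "It", "expected_result": "It (2017 film)"}
]
"""


def satisfied_queries(
    queries: List[Dict[str, str]],
    search: Callable[[str, int], List[str]],
    k: int = 1,
) -> Dict[str, bool]:
    """Map each query to True if its expected title is among the top-k hits."""
    report = {}
    for entry in queries:
        top_titles = search(entry["query"], k)[:k]
        report[entry["query"]] = entry["expected_result"] in top_titles
    return report


def fake_search(text: str, size: int) -> List[str]:
    """Hypothetical stand-in for the real search service (canned answers only)."""
    canned = {"beautiful mi": ["A Beautiful Mind (film)", "Placeholder title"]}
    return canned.get(text, [])[:size]


if __name__ == "__main__":
    print(satisfied_queries(json.loads(QUERIES_JSON), fake_search, k=1))
```

Replacing `fake_search` with a function that calls the running service would give a quick overview of which queries are still unsatisfied.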

Prerequisites

You need a running Elasticsearch instance on your machine. For details about how to install, set up and start Elasticsearch, see here.

You also have to install the necessary Python packages. For details, see here.

Download the data

Use the preprocessor code to download and save the data onto your machine.

# With downloading the data
python ir_exercise/preprocessor.py -d

# Without downloading the data
python ir_exercise/preprocessor.py

This creates a data folder with a data.json file in it, which serves as input for the indexer.

Run the indexer

Run the indexer code to create and populate an Elasticsearch index.

# If you want to delete and recreate the indices
python ir_exercise/indexer.py -r

# If you only want to update the indices
python ir_exercise/indexer.py

Start the search engine

You can start the service from a terminal. Leave this service running and execute the other commands from a new terminal.

python ir_exercise/search_service.py

You can send requests via curl, e.g.:

# Linux
curl -X GET http://localhost:6000/ir-search-service -H "Content-Type: application/json" -d '{"text": "test queary", "size": 5}'

# Windows
curl -X GET http://localhost:6000/ir-search-service -H "Content-Type: application/json" -d "{\"text\": \"test queary\", \"size\": 5}"
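
The same request can also be sent from Python. This is a minimal sketch using only the standard library; the URL, headers and body mirror the curl calls above, while the function names and the assumption that the service answers with JSON are illustrative rather than taken from the repository.

```python
import json
import urllib.request

# Endpoint taken from the curl examples above.
SEARCH_URL = "http://localhost:6000/ir-search-service"


def build_search_request(text: str, size: int = 5) -> urllib.request.Request:
    """Build a GET request with a JSON body, mirroring the curl call."""
    body = json.dumps({"text": text, "size": size}).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",
    )


def search(text: str, size: int = 5):
    """Send the request and decode the reply (assumes a JSON response)."""
    with urllib.request.urlopen(build_search_request(text, size)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the search service running in another terminal, `search("test queary", size=5)` sends the same payload as the curl examples.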

(Optional) Start frontend

We also provide a simple streamlit frontend that prints the top 10 hits for a search query.

streamlit run ir_exercise/frontend.py

The website opens automatically in your browser, or you can reach it manually at http://localhost:8501/.

Debug Elasticsearch with Kibana Dev Tools

Kibana provides a quick way of querying your Elasticsearch cluster. For details about how to set up and use it, see here.

It is recommended to use Kibana.

Run tests

The tests can be found in the test folder. They can be executed from the command line, or, if you are using an IDE (e.g. PyCharm), from within the IDE (in PyCharm, by clicking the green play buttons).

Failed tests come with detailed logs, i.e. the query sent to the server and the documents retrieved by the server. You can copy and paste these queries into Kibana and experiment with modifications.
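
When modifying a logged query, two common changes are boosting one field and tolerating typos. The sketch below builds such a request body with Elasticsearch's `multi_match` query. The field names `title` and `text` are assumptions about the index mapping, and this is not the query the starter code actually sends.

```python
import json


def build_query_body(text: str, size: int = 10) -> dict:
    """Build a multi_match search body; the field names are assumptions."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                # "title^3" boosts matches in the (assumed) title field.
                "fields": ["title^3", "text"],
                # Tolerate small typos such as "Gostling" for "Gosling".
                "fuzziness": "AUTO",
            }
        },
    }


if __name__ == "__main__":
    # Paste the printed JSON under a `GET <your-index>/_search` line in Dev Tools.
    print(json.dumps(build_query_body("moon landing movie with Ryan Gostling"), indent=2))
```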

All tests

pytest

Milestone 1

pytest ir_exercise/test/test_milestone_1.py

Milestone 2

# Only milestone 2 tests
pytest ir_exercise/test/test_milestone_2.py

# Note: milestone 2 is assessed against all tests
pytest ir_exercise/test/test_milestone_*.py

Specific tests

# All tests from a test class
pytest ir_exercise/test/test_milestone_1.py -k GladiatorTest

# Only one test
pytest ir_exercise/test/test_milestone_1.py::GladiatorTest::test_top_1
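
To illustrate how the selectors above map onto test code, here is a hypothetical sketch of a test class. The class and method names follow the commands above, but the body, the second test and the `search_titles` stub are assumptions; the real tests query the running search service.

```python
import unittest
from typing import List


def search_titles(text: str, size: int) -> List[str]:
    """Hypothetical stub; a real test would query the running search service."""
    return ["Gladiator (2000 film)"][:size]


class GladiatorTest(unittest.TestCase):
    query = "Russell Crowe in roman age"
    expected = "Gladiator (2000 film)"

    def test_top_1(self):
        # Selected on its own by ...::GladiatorTest::test_top_1
        self.assertEqual(search_titles(self.query, 1), [self.expected])

    def test_top_5(self):
        # Hypothetical second test; -k GladiatorTest would run both.
        self.assertIn(self.expected, search_titles(self.query, 5))
```

pytest collects `unittest.TestCase` subclasses, so `-k GladiatorTest` matches every test method in the class, while the `::` form selects exactly one.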

Troubleshooting

If you run into trouble with Elasticsearch, check this document for possible solutions.

(Optional) Code formatting

The requirements also install black, a common Python code formatter.

Possible usages (all optional; use any combination):

  • Run it manually from the console by calling black . from the root folder, or black followed by the path of a specific file.
  • Set your IDE to format your code by using black.
    • e.g. for PyCharm:
      • Settings -> Tools -> External Tools -> click the "+" icon
      • Name: Black
      • Description: Black in PyCharm configuration
      • Program: <path_to_black>, which you can find by running which black
      • Arguments: --config pyproject.toml $FilePath$
      • Working directory: $ProjectFileDir$
  • Set up a pre-commit hook by calling pre-commit install. This creates the .git/hooks/pre-commit file, which automatically reformats all the modified files prior to any commit.
