This is the repository for the exercise of Grundlagen des Information Retrieval (2022).
The aim of this exercise is to improve a minimum viable search engine, such that it returns the right documents for predefined user queries.
As part of this exercise, we were provided with working starter code along with tests that check how many of the predefined queries are already satisfied. The exercise is divided into two milestones. In the first milestone the system was tested only against the milestone 1 queries, whereas in the second milestone it was tested against all queries.
The data contains around 25,000 crawled Wikipedia pages of English-language movies from 2000 until today. The data was collected by the code found in the wiki folder.
During the exercise, we had to improve the starter system to satisfy the following queries. The first list belongs to milestone 1:
[
{
"query": "Russell Crowe in roman age",
"expected_result": "Gladiator (2000 film)"
},
{
"query": "Oscar winning Christopher Nolan movie about 2nd world war",
"expected_result": "Dunkirk (2017 film)"
},
{
"query": "moon landing movie with Ryan Gostling",
"expected_result": "First Man (film)"
},
{
"query": "an astronaut is left on the Mars, it was shot in Budapest",
"expected_result": "The Martian (film)"
},
{
"query": "beautiful mi",
"expected_result": "A Beautiful Mind (film)"
}
]
The remaining queries belong to milestone 2:
[
{
"query": "12 districts fighting in a deathly game",
"expected_result": "The Hunger Games (film)"
},
{
"query": "no mas bebes",
"expected_result": "No más bebés"
},
{
"query": "It",
"expected_result": "It (2017 film)"
},
{
"query": "2014 american coming-of-age romance based on a novel",
"expected_result": "The Fault in Our Stars (film)"
},
{
"query": "documentary about women in STEM",
"expected_result": "Picture a Scientist"
}
]
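To make the goal concrete, the check behind each query can be sketched as follows. This is a minimal illustration, not the repository's test code: the `rank_of` helper and the example hit list are hypothetical, and only the `query` / `expected_result` keys come from the lists above.

```python
import json

# Two of the predefined queries, in the same format as the lists above.
QUERIES_JSON = """
[
  {"query": "Russell Crowe in roman age", "expected_result": "Gladiator (2000 film)"},
  {"query": "It", "expected_result": "It (2017 film)"}
]
"""


def rank_of(expected_title, hit_titles):
    """Return the 1-based rank of expected_title in hit_titles, or None if absent."""
    for position, title in enumerate(hit_titles, start=1):
        if title == expected_title:
            return position
    return None


queries = json.loads(QUERIES_JSON)

# Hypothetical ranked titles, standing in for what a search engine might return.
example_hits = ["Robin Hood (2010 film)", "Gladiator (2000 film)", "Noah (2014 film)"]

rank = rank_of(queries[0]["expected_result"], example_hits)
print(rank)
```

A query counts as satisfied at a given cutoff when the rank is not `None` and does not exceed that cutoff.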
You need an Elasticsearch instance up and running on your machine. For details about how to install, set up and start Elasticsearch, see here.
You have to install the necessary Python packages. For details, see here.
Use the preprocessor code to download and save the data onto your machine.
# With downloading the data
python ir_exercise/preprocessor.py -d
# Without downloading the data
python ir_exercise/preprocessor.py
This creates a data folder with a data.json file in it, which serves as input for the indexer.
Run the indexer code to create and populate an elasticsearch index.
# If you want to delete and recreate the indices
python ir_exercise/indexer.py -r
# If you only want to update the indices
python ir_exercise/indexer.py
You can start the service from the terminal. Leave this service running and execute the other commands from a new terminal.
python ir_exercise/search_service.py
You can send requests via curl, e.g.:
# Linux
curl -X GET http://localhost:6000/ir-search-service -H "Content-Type: application/json" -d '{"text": "test queary", "size": 5}'
# Windows
curl -X GET http://localhost:6000/ir-search-service -H "Content-Type: application/json" -d "{\"text\": \"test queary\", \"size\": 5}"
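The same request can also be sent from Python. The sketch below uses only the standard library and mirrors the curl call (a GET request with a JSON body). The function names are hypothetical, and the assumption that the service answers with JSON is not confirmed by this README.

```python
import json
import urllib.request

SERVICE_URL = "http://localhost:6000/ir-search-service"  # URL from the curl example


def build_search_request(text, size=5):
    """Build a GET request with a JSON body, mirroring the curl call."""
    payload = json.dumps({"text": text, "size": size}).encode("utf-8")
    return urllib.request.Request(
        SERVICE_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="GET",
    )


def search(text, size=5):
    """Send the request; assumes (unverified) that the service replies with JSON."""
    with urllib.request.urlopen(build_search_request(text, size)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact the service, so this line runs offline.
request = build_search_request("Russell Crowe in roman age")
print(request.get_method(), request.full_url)
```

Calling `search("Russell Crowe in roman age")` requires the search service to be running in another terminal.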
We also provide a simple Streamlit frontend that prints the top 10 hits for a search query.
streamlit run ir_exercise/frontend.py
The website opens automatically in your browser, or you can reach it manually at http://localhost:8501/.
Kibana provides a quick way of querying your Elasticsearch cluster. For details about how to set up and use it, see here.
It is recommended to use Kibana.
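As an illustration of the kind of query body one might experiment with in Kibana, the sketch below builds a fuzzy `multi_match` query and prints it as JSON. The field names `title` and `text` are assumptions about the index mapping, and this is not the query the starter code or the final system uses.

```python
import json


def build_query(text, size=10):
    """Build a search request body using a fuzzy multi_match query.

    The field names "title" and "text" are assumptions about the index mapping.
    """
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["title^2", "text"],  # boost matches in the (assumed) title field
                "fuzziness": "AUTO",  # tolerate small typos in query terms
            }
        },
    }


print(json.dumps(build_query("moon landing movie with Ryan Gostling"), indent=2))
```

The printed JSON can be pasted into the Kibana console as the body of a `_search` request against your index. Fuzzy matching may help with misspelled terms such as "Gostling", but whether it is enough to satisfy a given query depends on the mapping and the data.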
The tests can be found in the test folder. They can be executed from the command line, or if you are using an IDE (e.g. PyCharm), you can also execute them within the IDE (in the case of PyCharm by clicking the green play buttons).
Failed tests come with detailed logs, i.e. the query sent to the server and the documents retrieved by the server. You can copy and paste these queries into Kibana and play around with modifications.
# All tests
pytest
# Only milestone 1 tests
pytest ir_exercise/test/test_milestone_1.py
# Only milestone 2 tests
pytest ir_exercise/test/test_milestone_2.py
# But beware, for milestone 2, all tests will be assessed
pytest ir_exercise/test/test_milestone_*.py
# All tests from a test class
pytest ir_exercise/test/test_milestone_1.py -k GladiatorTest
# Only one test
pytest ir_exercise/test/test_milestone_1.py::GladiatorTest::test_top_1
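The `file::Class::method` selector above maps onto a test class and one of its methods. The sketch below shows that shape with a stand-in search function. Only the names `GladiatorTest` and `test_top_1` appear in this README; `fake_search`, `test_top_10` and the returned titles are hypothetical, and the real tests query the running service instead.

```python
import unittest


def fake_search(query, size=10):
    """Hypothetical stand-in for the search service; returns ranked titles."""
    return ["Gladiator (2000 film)", "Robin Hood (2010 film)"][:size]


class GladiatorTest(unittest.TestCase):
    query = "Russell Crowe in roman age"
    expected = "Gladiator (2000 film)"

    def test_top_1(self):
        # The expected document must be the first hit.
        self.assertEqual(fake_search(self.query, size=1), [self.expected])

    def test_top_10(self):
        # Hypothetical looser check: the expected document is among the top 10.
        self.assertIn(self.expected, fake_search(self.query, size=10))
```

pytest collects `unittest.TestCase` classes, so if this were saved as a hypothetical `test_example.py`, `pytest test_example.py::GladiatorTest::test_top_1` would run only that method.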
If you encounter trouble with Elasticsearch, check this document for possible solutions.
The requirements also install black, a common Python code formatter.
Possible usages (you can use none or all of them):
- Run it manually from the console by calling `black .` from the root folder, or apply it to a specific file.
- Set your IDE to format your code using black.
  - e.g. for PyCharm:
    - Settings -> Tools -> External Tools -> click the "+" icon
    - Name: `Black`
    - Description: `Black in PyCharm configuration`
    - Program: `<path_to_black>` (you can find it by calling `which black`)
    - Arguments: `--config pyproject.toml $FilePath$`
    - Working directory: `$ProjectFileDir$`
- Set up a pre-commit hook by calling `pre-commit install`. This creates the `.git/hooks/pre-commit` file, which automatically reformats all modified files prior to any commit.