Code for the generation of MELArt, a Multimodal Entity Linking Dataset for Art.
The code for the experiments with the baselines and for generating model-specific versions of the dataset can be found here.
- Create a `.env` file (you can use `.env_sample` as a template) and set the access token for the Wikimedia API and the user agent.
- Install the required libraries. The easiest way is to use the provided conda environment `environment.yaml`.
- Install the spaCy English model: `python -m spacy download en_core_web_sm`
- Download the Artpedia dataset from https://aimagelab.ing.unimore.it/imagelab/page.asp?IdPage=35 and place the `artpedia.json` file in the `input_files/` folder.
You can also avoid having the `input_files/` folder by adjusting the paths in the `paths.py` script.
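As an illustration of how the `.env` settings might be consumed, the sketch below parses `KEY=VALUE` lines and turns them into HTTP headers for Wikimedia API requests. The key names `WIKIMEDIA_ACCESS_TOKEN` and `USER_AGENT` are hypothetical, as is the use of a bearer-token `Authorization` header; check `.env_sample` for the names the scripts actually expect.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def build_headers(settings: Dict[str, str]) -> Dict[str, str]:
    """Build request headers from the parsed settings.

    WIKIMEDIA_ACCESS_TOKEN and USER_AGENT are hypothetical key names;
    replace them with the names used in .env_sample.
    """
    return {
        "Authorization": f"Bearer {settings['WIKIMEDIA_ACCESS_TOKEN']}",
        "User-Agent": settings["USER_AGENT"],
    }


# Example with placeholder values (not real credentials).
sample = """
# Wikimedia API settings
WIKIMEDIA_ACCESS_TOKEN=my-token
USER_AGENT="MELArtBot/0.1 (me@example.org)"
"""
headers = build_headers(parse_env(sample))
```

A descriptive user agent with contact information is generally expected for automated requests to Wikimedia services, which is likely why the setup asks for one.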
Execute the following scripts to generate the dataset.
- `convert_wikipedia_tables.sh`: Converts the Wikipedia tables from the Wikidata dump to a more readable format. The output is stored in the `aux_files/` folder.
- `art_merging.py`: Matches Artpedia paintings to Wikidata entities using the Wikipedia title, and extracts painting information from Wikidata.
- `text_matcher.py`: Matches the labels of the depicted entities in the visual and contextual sentences.
- `get_candidates.py`: Gets the candidates for the depicted entities in the visual and contextual sentences, using the Wikidata search API. For each candidate, it creates a JSON file with its information in the `aux_files/el_candidates` folder. It also creates a file with the list of entity image paths.
- `get_candidate_types.py`: Reads each candidate's information, builds a set of all the types, and uses Wikidata's API to get the type labels. The type information is stored in the `aux_files/candidate_types_dict.json` file.
- `crawl_images.py`: Crawls the images from Wikimedia Commons based on the `imgs_url.txt` file (from `get_candidates.py`).
- `filter_candidate_images`: Removes the candidate images that correspond to the paintings in MELArt.
- `combine_curated_annotations`: Combines the automatically generated annotations with the manually curated annotations to produce the final dataset in the `output_files/melart_annotations.json` file.
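The steps above can be chained with a small driver. The following is a minimal sketch, assuming the scripts are run in the listed order from the repository root and take no command-line arguments; neither assumption is confirmed by this README. The last two steps are left out because their file extensions are not given here, so append them using their actual file names.

```python
import subprocess
import sys
from typing import List, Sequence

# Steps whose file names (with extension) are stated in the list above.
# filter_candidate_images and combine_curated_annotations are omitted on
# purpose: add them with their real file names.
PIPELINE: List[List[str]] = [
    ["bash", "convert_wikipedia_tables.sh"],
    [sys.executable, "art_merging.py"],
    [sys.executable, "text_matcher.py"],
    [sys.executable, "get_candidates.py"],
    [sys.executable, "get_candidate_types.py"],
    [sys.executable, "crawl_images.py"],
]


def run_pipeline(steps: Sequence[Sequence[str]], dry_run: bool = False) -> List[str]:
    """Run each step in order, stopping at the first failure.

    Returns the command lines that were started (or would be, in a dry run).
    """
    executed: List[str] = []
    for cmd in steps:
        executed.append(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run on incomplete intermediate files.
            subprocess.run(list(cmd), check=True)
    return executed


if __name__ == "__main__":
    # Dry run by default: print the planned commands without executing them.
    for line in run_pipeline(PIPELINE, dry_run=True):
        print(line)
```

Stopping at the first failure matters because later steps read files written by earlier ones, for example `crawl_images.py` reads the `imgs_url.txt` file produced by `get_candidates.py`.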