MELArt. A Multimodal Entity Linking Dataset for Art.

Code for generating MELArt, a Multimodal Entity Linking Dataset for Art.

The code for the experiments with the baselines and for generating model-specific versions of the dataset can be found here

Pre-requisites

Create a .env file (you can use .env_sample as a template) and set the access token for the Wikimedia API and the user agent.
Install the required libraries. The easiest way is to use the provided conda environment file, environment.yaml.
Install the spaCy English model: python -m spacy download en_core_web_sm
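
As a rough illustration of how the values in .env might be consumed, the sketch below parses a KEY=VALUE file with the standard library and builds request headers from it. The variable names WIKIMEDIA_ACCESS_TOKEN and WIKIMEDIA_USER_AGENT and both helper functions are assumptions made for this example; check .env_sample for the names the scripts actually expect.

```python
from pathlib import Path


def load_env(path):
    """Parse a simple KEY=VALUE file, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_headers(env):
    """Build HTTP headers for Wikimedia API requests.

    The key names below are hypothetical; use the ones from .env_sample.
    """
    return {
        "Authorization": f"Bearer {env['WIKIMEDIA_ACCESS_TOKEN']}",
        "User-Agent": env["WIKIMEDIA_USER_AGENT"],
    }
```

Calling build_headers(load_env(".env")) would then give a dictionary that can be passed as the headers of an HTTP request.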

Download the Artpedia dataset from https://aimagelab.ing.unimore.it/imagelab/page.asp?IdPage=35 and place the artpedia.json file in the input_files/ folder.

If you prefer not to use the input_files/ folder, adjust the paths in the paths.py script.
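
A minimal sketch of what centralized path constants and an Artpedia loader could look like. The constant names and both helper functions are hypothetical and not taken from paths.py; the only layout assumed is input_files/artpedia.json, as described above. The sketch also assumes that the Artpedia file is a JSON object keyed by painting id, and that each entry holds visual_sentences and contextual_sentences lists.

```python
import json
from pathlib import Path

# Hypothetical constants; the real names in paths.py may differ.
INPUT_DIR = Path("input_files")
ARTPEDIA_PATH = INPUT_DIR / "artpedia.json"


def load_artpedia(path=ARTPEDIA_PATH):
    """Load the Artpedia JSON file and return its top-level object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_sentences(artpedia):
    """Count visual and contextual sentences (assumed field names)."""
    visual = sum(len(p.get("visual_sentences", [])) for p in artpedia.values())
    contextual = sum(len(p.get("contextual_sentences", [])) for p in artpedia.values())
    return visual, contextual
```

Changing ARTPEDIA_PATH in one place is then enough to point every consumer at a different location.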

Execute the following scripts to generate the dataset.

convert_wikipedia_tables.sh: This script converts the Wikipedia tables from the Wikidata dump to a more readable format. The output is stored in the aux_files/ folder.
art_merging.py: Matches Artpedia paintings to Wikidata entities using the Wikipedia title, and extracts painting information from Wikidata.
text_matcher.py: Matches the labels of the depicted entities in the visual and contextual sentences.
get_candidates.py: Gets the candidates for the depicted entities in the visual and contextual sentences using the Wikidata search API. For each candidate, it creates a JSON file with the candidate's information in the aux_files/el_candidates folder. It also creates a file with the list of entity image paths.
get_candidate_types.py: Reads each candidate's information, builds a set of all the types, and uses Wikidata's API to get the type labels. The type information is stored in the aux_files/candidate_types_dict.json file.
crawl_images.py: Crawls the images from Wikimedia Commons based on the imgs_url.txt file (produced by get_candidates.py).
filter_candidate_images.py: Removes the candidate images that correspond to the paintings in MELArt.
combine_curated_annotations.py: Combines the automatically generated annotations with the manually curated annotations to produce the final dataset in the output_files/melart_annotations.json file.
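
The steps above could be chained with a small driver. This is only a sketch: it assumes the scripts are run from the repository root in the listed order, take no command-line arguments, that the shell script is invoked with bash, and that the last two steps are .py files. The PIPELINE list and both helper functions are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Script names in the order listed above.
PIPELINE = [
    "convert_wikipedia_tables.sh",
    "art_merging.py",
    "text_matcher.py",
    "get_candidates.py",
    "get_candidate_types.py",
    "crawl_images.py",
    "filter_candidate_images.py",
    "combine_curated_annotations.py",
]


def build_command(script):
    """Return the argv list used to launch one pipeline step."""
    if script.endswith(".sh"):
        return ["bash", script]
    return [sys.executable, script]


def run_pipeline(scripts=PIPELINE, dry_run=True):
    """Run each step in order, stopping at the first failure."""
    commands = [build_command(s) for s in scripts]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(cmd, check=True)
    return commands
```

With dry_run=True the driver only prints the commands, which is a cheap way to check the order before starting the long-running crawling steps.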
