This repository contains the solution of Group 3 for the Entity Linking Assignment of the course Web Data Processing Systems 2021 at VU Amsterdam. Our solution recognizes named entity mentions in Web pages and links them to Wikidata. The following sections elaborate on the installation instructions and coding choices. Note that the program was primarily and extensively tested on Windows, since none of the group members owns a MacBook or has a separate Linux installation. It was therefore not possible to verify that everything runs properly on other operating systems.
To understand how the directory is structured, a short summary is provided here. The assets folder contains all the Wikidata information and the optional local ElasticSearch mechanisms, which have not been altered. The data directory contains all the WARC files, including the sample WARC file, and has not been altered either. However, the contents of both directories are not included in the final product because of their size, so the user should move their own assets and data into these directories.
The old files directory contains all the original files, such as the starter_code, test_sparql, etc. of the original directory, which are currently not used by our program. Some of our old code is also stored there. The outputs directory holds all the pickle files that have been saved; these include the processed WARC files and the ElasticSearch candidates for each entity in the processed WARC files. Some additional parallelisation was also tried but did not perform as expected, so those files can be found in the parallel files directory and are not used further. The results of our program are saved separately to avoid confusion with the pickle files and can therefore be found in the results directory. The src directory contains all major functions currently used by the program. The files in the main directory include files that are used by the starter_code.py file, which executes the main program.
Before doing anything, make sure that this directory is unzipped and placed inside the Docker image provided by the course itself; otherwise, crucial packages and code for our program will be missing. Additionally, ensure that all arguments are correctly set in the config.ini file. These arguments include whether to process the WARC files, whether to perform ElasticSearch, whether to use the remote server, etc. For further explanation, check the comments in the config file. This is useful if some part of the process goes wrong and you only want to rerun that specific part.
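As an illustration, here is a minimal sketch of how such flags could be read with Python's configparser. The section and key names below are hypothetical; the actual names and their meaning are documented in config.ini itself.

```python
# Minimal sketch; section/key names are hypothetical, see config.ini for the real ones.
import configparser

config = configparser.ConfigParser()
config.read("config.ini")

# Boolean flags controlling which pipeline stages are (re)run.
process_warcs = config.getboolean("pipeline", "process_warcs", fallback=True)
run_elasticsearch = config.getboolean("pipeline", "run_elasticsearch", fallback=True)
use_remote_server = config.getboolean("elasticsearch", "use_remote_server", fallback=False)
```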
In order to simplify the process of installing everything required for this assignment, a shell script has been created. If this is your first time running this script, simply execute the following command in your Docker image:
sh run_entity_link_script.sh
This will execute the following:
- Installs all packages contained in the requirements.txt file
- Downloads the spaCy model for the named entity recognition process
- Downloads the datasets and corpora required by the various NLTK functions
- Executes the starter_code.py script
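For reference, the spaCy and NLTK downloads in the list above correspond roughly to the following Python calls. This is a sketch only; the exact model and corpus names are assumptions, and the shell script is authoritative.

```python
# Sketch of the setup steps; exact model/corpus names are assumptions.
import nltk
import spacy.cli

spacy.cli.download("en_core_web_sm")   # spaCy English model (assumed name)
for corpus in ("punkt", "wordnet", "omw-1.4", "stopwords", "averaged_perceptron_tagger"):
    nltk.download(corpus)              # tokenizer, WordNet, stopwords and POS tagger data
```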
IMPORTANT NOTE: To build fastText on Windows, Microsoft Visual C++ Build Tools 14.0 or higher must be installed. If this is not yet installed on your machine, do this first or you will get an error.
Our code starts by reading its input from the config.ini file; for more information, check the comments on each specific parameter in that file. The idea is to first process the WARC files found in the data/warcs folder, which should be specified in config.ini. If there are multiple WARC files, the processing is parallelised. This is done within the extraction.py file using the multiprocessing package. Afterwards, entity generation is performed by _entity_generation_ES.py, which generates candidate entities through multiple ElasticSearch queries. Lastly, the candidates are ranked according to the model chosen by the user through the config file. The fastest models are "prominence" and "lesk", and these are the ones we recommend for running the program to prevent excessive load times. For more details, check the Model section in this README.
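Conceptually, the parallel WARC stage looks roughly like the following sketch. The function `process_warc_file` is only a placeholder for the per-file extraction implemented in extraction.py.

```python
# Sketch of the parallel WARC processing stage; process_warc_file is a placeholder.
from multiprocessing import Pool
from pathlib import Path

def process_warc_file(path):
    """Placeholder for the per-file extraction implemented in extraction.py."""
    ...

warc_paths = sorted(Path("data/warcs").glob("*.warc.gz"))
with Pool() as pool:
    # Distribute the WARC files over the available worker processes.
    results = pool.map(process_warc_file, warc_paths)
```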
In this section, the rationale behind all major code components (src folder) will be explained.
Our first task is to process the WARC archives by splitting these files into individual records. This allows us to process the records one by one, extracting all mentions of entities and their context. When multiple archive files are processed, we carry out these steps in parallel.
The crawled records are noisy, and most of this noise originates from a few sources. First, website content does not only include HTML, which in practice contains the continuous text we extract mentions from, but also CSS and JavaScript. This can be partially addressed with the BeautifulSoup library, which allows us to filter for the website bodies; these should generally not contain style or script snippets. In practice, however, we decided to only process records that contain the tag, due to the abundance of examples where CSS and JavaScript appear in the website body. Even so, content with this tag can still contain code. A second source of noise is the record language: the returned texts are in a multitude of languages, but our scope only covers English. To filter for English content only, we employ the language detection model from the fastText library. This is not error-free, especially when there is code switching. Finally, we occasionally run into encoding errors where specific Unicode encodings and ASCII escape characters show up in the WARC records. These problems are addressed by explicit filtering.
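A minimal sketch of this cleaning step, assuming the pretrained lid.176 fastText language identification model; the exact filtering rules are simplified compared to what extraction.py does.

```python
# Sketch of record cleaning; the model path and filtering details are assumptions.
import fasttext
from bs4 import BeautifulSoup

lang_model = fasttext.load_model("lid.176.bin")   # pretrained fastText language ID model

def clean_record(html):
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("body")
    if body is None:
        return None
    # Drop script/style blocks that still ended up inside the body.
    for tag in body(["script", "style"]):
        tag.decompose()
    text = " ".join(body.get_text(separator=" ").split())
    # Keep English records only.
    labels, _ = lang_model.predict(text.replace("\n", " "))
    if labels[0] != "__label__en":
        return None
    return text
```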
To extract candidate mentions from continuous English text, the spaCy library is used, whose models often reach state-of-the-art performance. We define a single sentence as a mention context. To retrieve both candidate mentions and their contexts, we use spaCy to carry out sentence splitting and named entity recognition to extract the specific mentions. Here, we have to make a few considerations. First, spaCy identifies entity types we do not consider proper mentions. These include types referring to numbers such as CARDINAL, QUANTITY, PERCENT, and ORDINAL, but also date and time (DATE and TIME, respectively). There are also fringe cases such as EVENT, MONEY, LANGUAGE, and NORP. After observing the candidate mentions returned for these entity types, we decided that none of them are relevant for our task and thus excluded them. This means we only focus on the following entity types: GPE (geo-political entities), LOC (locations), ORG (organizations), PERSON (person names), PRODUCT, WORK_OF_ART, LAW, and FAC.
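A sketch of this extraction step, assuming the small English spaCy model (the model actually used is configured elsewhere):

```python
# Sketch of mention extraction with spaCy; the model name is an assumption.
import spacy

ALLOWED_TYPES = {"GPE", "LOC", "ORG", "PERSON", "PRODUCT", "WORK_OF_ART", "LAW", "FAC"}
nlp = spacy.load("en_core_web_sm")

def extract_mentions(text):
    doc = nlp(text)
    mentions = []
    for sent in doc.sents:                      # one sentence = one mention context
        for ent in sent.ents:
            if ent.label_ in ALLOWED_TYPES:
                mentions.append((ent.text, ent.label_, sent.text))
    return mentions
```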
The second consideration relates to our inability to exclude all instances of CSS and JavaScript. Both languages contain many special characters such as “=”, “;”, or “{}”, and tokens containing such characters are often erroneously identified as candidate mentions by spaCy. To address this problem, we maintain a list of symbols, and if a candidate mention contains any of them, we exclude it. Even this filtering does not let us reliably catch recurring strings that get identified as candidate mentions. For that, we also maintain a list of such strings, mostly containing tokens from the crawling metadata such as "WARC-Type". The final issue relates to errors in the tokenization of the WARC records, the source of which is likely in the crawling process. It is very common that neighbouring words are not separated by whitespace but instead conflate into a single word. The resulting string often has an idiosyncratic capitalization pattern, making spaCy identify it as a candidate mention, for instance “isMillenial Media”, “FRINGEBarcelona”, and “ConditionsContact”. This is a common phenomenon which we have no way of addressing directly. However, it is addressed indirectly through ElasticSearch later on, since ElasticSearch will most likely not return any entities for these strings.
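A sketch of this filtering; the symbol set and blacklist below are only illustrative excerpts of the lists we maintain.

```python
# Sketch of mention filtering; the lists below are illustrative excerpts.
BAD_SYMBOLS = set("={};<>|\\")                 # characters typical of CSS/JavaScript
BLACKLIST = {"WARC-Type", "WARC-Target-URI"}   # crawling-metadata tokens

def keep_mention(mention):
    if mention in BLACKLIST:
        return False
    return not any(symbol in mention for symbol in BAD_SYMBOLS)
```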
We extract each identified mention in the form of a tuple containing the mention, the entity type spaCy identifies it with, as well as the context, i.e., the sentence it appears in. These are then added to an output dictionary where the keys are the URI ids of the records, and the values are lists of these tuples.
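For illustration, the output dictionary then has roughly the following shape (the record id and values here are made up):

```python
# Illustrative shape of the extraction output; the concrete values are made up.
processed = {
    "<urn:uuid:record-id-1>": [
        ("Glasgow", "GPE", "The band played a sold-out show in Glasgow last year."),
        ("IBM", "ORG", "IBM announced a new research partnership."),
    ],
}
```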
In order to properly generate the links related to each entity, it is crucial to distinguish between ambiguous and non-ambiguous entities. Popular entities, such as "Washington", are difficult to pin down in the Wikidata database, since there are hundreds or thousands of links related to this word alone: "Washington" could refer to the state, the city, the American president and much more. For such entities, disambiguation needs to be performed to obtain relevant entity links. For this purpose, the NLTK package is used and WordNet's synonym sets are checked to disambiguate the entities and generate the correct URIs from the Wikidata database. However, many words, such as "IBM", are not included in the WordNet synsets. Generally speaking, these entities are so specific that it is relatively easy to find the correct entity URIs through ElasticSearch. Therefore, we assume that, if an entity is not found within WordNet, the correct entity will be among the first 30 results from ElasticSearch. The reasoning behind this assumption is that a word absent from WordNet is usually not ambiguous enough to be missed by a plain ElasticSearch query, so the related URIs can be found through these means. This was checked manually with various entities, such as "IBM" and "Roger Federer", and the correct entity was generally found within the first 5 hits. Since this cannot be guaranteed, a safety margin of 25 is added and simply the first 30 hits are returned.
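A sketch of this branching logic; the ElasticSearch index name and query shape are assumptions, not the exact ones used in _entity_generation_ES.py.

```python
# Sketch of the ambiguity check; ES index/query details are assumptions.
from elasticsearch import Elasticsearch
from nltk.corpus import wordnet as wn

es = Elasticsearch("http://localhost:9200")

def candidate_entities(mention):
    if not wn.synsets(mention):
        # Not in WordNet: assume a plain search suffices and take the first 30 hits.
        response = es.search(
            index="wikidata_en",                     # assumed index name
            body={"query": {"query_string": {"query": mention}}},
            size=30,
        )
        return response["hits"]["hits"]
    # Otherwise: disambiguate via WordNet synsets first (see below).
    ...
```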
The problem with WordNet's synsets is that the most popular entities have a large number of synonym sets. To reduce the number of synsets checked in this disambiguation process, the context of the entity from the HTML page is compared with the definitions of the synsets. This comparison is performed with a methodology similar to the Simplified Lesk Algorithm. The program deviates from this algorithm slightly by taking the lemma of all words and removing all words that are not nouns. The lemma is taken to increase the chance of the algorithm finding matches. Additionally, it is assumed that articles, punctuation and stopwords do not contribute to the meaning or context of the entities and synonyms, so they are excluded. After counting the number of overlapping words between the context and the definitions, only the 3 synsets with the highest overlap counts are returned. If no context is given, only the first 3 synsets are considered, which according to WordNet are the most common ones.
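A sketch of this Lesk-style overlap scoring with NLTK; it is a simplification of the actual implementation.

```python
# Sketch of the Lesk-style synset ranking; a simplification of the real code.
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
STOPWORDS = set(stopwords.words("english"))

def noun_lemmas(text):
    tokens = [t for t in word_tokenize(text.lower()) if t.isalpha() and t not in STOPWORDS]
    return {lemmatizer.lemmatize(tok) for tok, tag in pos_tag(tokens) if tag.startswith("NN")}

def best_synsets(mention, context, top_k=3):
    synsets = wn.synsets(mention)
    if not context:
        return synsets[:top_k]                      # WordNet lists the most common senses first
    context_nouns = noun_lemmas(context)
    scored = [(len(context_nouns & noun_lemmas(s.definition())), s) for s in synsets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:top_k]]
```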
However, another problem is that there may not be enough synonyms to choose between. It could be that there is at most one match, which corresponds only to the entity itself; for instance, "Glasgow" has a single synset, which is the Glasgow synset itself. In that case, no extra synonyms can be retrieved via synsets. To still obtain the correct links, the definition entry is checked for nouns. Logically speaking, the definition contains nouns related to Glasgow, so these words can be used, in conjunction with the entity itself, to hopefully generate the most relevant or correct entity links. The nouns are extracted using NLTK tokenization and by removing all stopwords from the sentence with the English "stopwords" corpus. This methodology was tested for various entities, such as Glasgow, with great success, as it generally finds the correct entity within the first 3 hits for a specific combination. Since again no guarantees can be made, the number of hits should be at least 8. In the end, the synonyms and the nouns from the definition are combined and the first 8 nouns/synonyms are checked. Again, this was tested through trial and error, and these numbers showed the best trade-off between processing time and performance.
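For the single-synset case, query terms could be built from the definition roughly as follows; again a simplification, and the exact nouns returned depend on the WordNet version installed.

```python
# Sketch of building query terms from a lone synset's definition.
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet as wn

STOPWORDS = set(stopwords.words("english"))

def definition_query_terms(mention, max_terms=8):
    synsets = wn.synsets(mention)
    if len(synsets) != 1:
        return None                                   # handled by the synset ranking above instead
    tokens = [t for t in word_tokenize(synsets[0].definition()) if t.isalpha()]
    nouns = [tok for tok, tag in pos_tag(tokens)
             if tag.startswith("NN") and tok.lower() not in STOPWORDS]
    synonyms = [lemma.name() for lemma in synsets[0].lemmas() if lemma.name() != mention]
    return (synonyms + nouns)[:max_terms]             # only the first 8 synonyms/nouns are queried

# Example: definition_query_terms("Glasgow") yields nouns such as "city" and "Scotland".
```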
After retrieving the mentions from the WARC files and generating their respective candidate entities, the best candidate needs to be selected for the entity linking process. This selection is done using unsupervised methods, since we are dealing with a large amount of unannotated data on which a supervised model is unlikely to generalize well.
In order to optimize the scalability of the program, we decided to implement a simple and computationally inexpensive approach to candidate selection. The idea behind ranking candidates according to their Wikidata "prominence" is to choose the entity with the lowest 'Q' id within Wikidata. The logic is that more well-known entities, such as "George Washington" and "dog", generally have lower Q values than more obscure ones. For example, if a candidate with the Wikidata id "Q1876321" is generated for the mention "George Washington", it is less likely to be the correct link than "Q23". Therefore, Q23 is chosen as the best candidate for that mention.
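A sketch of this selection rule, assuming each candidate carries its Wikidata Q-id as a string:

```python
# Sketch of prominence-based selection: pick the candidate with the lowest Q number.
def select_by_prominence(candidate_qids):
    """candidate_qids: e.g. ["Q1876321", "Q23"] -> returns "Q23"."""
    return min(candidate_qids, key=lambda qid: int(qid.lstrip("Q")))
```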
Another less expensive approach we implement is a method based on the Simplified Lesk Algorithm. It simply counts how many tokens in the sentential context of the mention are also in the Wikidata entity description. The entity candidate whose description has the highest overlap with the mention context is the best candidate. It uses the same methodology described in the section Generating Candidates with ElasticSearch and will thus not be further elaborated here.
We also opted to experiment with a Vector Space Model (VSM) approach with the 100-dimensional GloVe embeddings. The idea is that both mention and candidate entities are represented with vectors in a shared space. Then, we calculate the similarity between the mention vector and each candidate vector and select the candidate entity with the highest similarity score.
To generate such vectors, we encode the mention and the candidate entities with an embedding model. The mention is represented by concatenating the embeddings averaged over the tokens of the mention span with those averaged over the tokens of its context. The entity is represented by concatenating the embeddings averaged over the tokens of the entity name with those averaged over the tokens of its description. The entity name and description are obtained from the ElasticSearch fields "schema_name" (or "rdfs_label" if there is no "schema_name") and "schema_description". This turns out to be relatively expensive to compute, taking about 18 minutes to encode the mentions of one WARC file.
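A sketch of this construction with GloVe vectors and cosine similarity, loading the vectors from a plain-text GloVe file; the file name and whitespace tokenization are assumptions.

```python
# Sketch of the GloVe-based VSM ranking; file name and tokenization are assumptions.
import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def avg_embedding(text, vectors, dim=100):
    tokens = [t for t in text.lower().split() if t in vectors]
    return np.mean([vectors[t] for t in tokens], axis=0) if tokens else np.zeros(dim)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def rank_candidates(mention, context, candidates, vectors):
    """candidates: list of (entity_name, description, qid) tuples."""
    mention_vec = np.concatenate([avg_embedding(mention, vectors),
                                  avg_embedding(context, vectors)])
    scored = []
    for name, description, qid in candidates:
        entity_vec = np.concatenate([avg_embedding(name, vectors),
                                     avg_embedding(description, vectors)])
        scored.append((cosine(mention_vec, entity_vec), qid))
    return max(scored)[1] if scored else None
```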
We experimented with two language models to generate such embeddings. The first was BERT, which generates contextualized embeddings while processing tokens in parallel. However, contextualized embeddings can be very expensive to compute when dealing with large amounts of data. For that reason, we chose a distilled version of BERT, called DistilBERT, which is around 60% faster while retaining about 97% of BERT's performance. Even with this version, it took about 30 minutes to encode the mentions of a single WARC file.
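A sketch of mean-pooled DistilBERT encoding with the transformers library; the checkpoint name and pooling choice are assumptions about the setup.

```python
# Sketch of DistilBERT mention/entity encoding; checkpoint and pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def encode(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the final hidden states into a single vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

# Mention representation: mention-span vector concatenated with context vector.
mention_vec = torch.cat([encode("George Washington"),
                         encode("George Washington was the first president.")])
```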
In order to calculate the precision, recall and F1 scores, we compare the mention strings and the respective Wikidata entity URIs predicted by the model with the ones in the file sample_annotations.tsv that was provided in the original assignment folder. Since the record ids in sample_annotations.tsv do not correspond to the WARC record ids we access, we only consider the entity names and Wikidata URIs to evaluate performance. To generate the performance scores, the score.py script is used.
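In essence, the scores are computed over (mention, URI) pairs, roughly as follows; this is a sketch of the logic, not of score.py itself.

```python
# Sketch of the evaluation logic over (mention, Wikidata URI) pairs.
def precision_recall_f1(predicted, gold):
    """predicted/gold: sets of (mention, wikidata_uri) tuples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```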
Due to the instability of the public ElasticSearch client, the performance of our models was first tested on a small set of 200 entities. Both the Prominence and Lesk methods yielded 39% precision and 8% recall. Low recall is expected here, given that only the first 200 unique mentions were considered in disambiguation. Using GloVe slightly dropped precision to 33%, while using DistilBERT dropped it to 29%. We speculate that the drop in precision results from the short mention contexts and entity descriptions and from the relatively high amount of noise in the data (e.g. tokenization errors).
Since the results of the VSM approach with dense embeddings were not promising, we only continued testing the Prominence and Lesk methods on the entire candidate dictionary of 4000 unique mentions extracted from the sample WARC file. Again, both methods yielded the same performance, with a precision of 3% and a recall of 17%. The increase in recall is expected, given that all detected mentions are considered here. The fact that Prominence and Lesk give identical performance may be due to the short ElasticSearch entity descriptions: because the descriptions are so short, the overlap between the context and the description is small, and the noisy mention context exacerbates the problem. Overall, Prominence and Lesk are our preferred models, as they are significantly faster and more accurate than the VSM approach with either dense GloVe embeddings or contextualized DistilBERT embeddings.