This repository contains the code and Jupyter notebooks used to experiment with and implement various NLP solutions that power the components of a document and data discovery application.
The objective of this KCP project is to address the knowledge discoverability problem that is common in traditional document databases. The project aims to catalyze the use of the immense volume of knowledge on economic and social development stored across development organizations. We will develop new data and data discovery products using state-of-the-art machine learning techniques and artificial intelligence models. These solutions will rely exclusively on open-source software and algorithms.
The project will develop models that require a massive corpus and intensive computational resources. We aim to make these models easily accessible to organizations that do not have access to such resources. The outputs (scripts, search tools, and derived datasets) will be made openly accessible to allow adaptation and use in resource-constrained environments.
The scripts, codebase, and guidelines are also intended to be used as training materials.
The project is broken down into the following components:
- Web scraping and automation: `/src/wb_nlp/scraping`
- Metadata extraction: `/src/wb_nlp/processing/extraction`
- Document preprocessing and cleaning: `/src/wb_nlp/processing/cleaning`
- NLP modeling: `/src/wb_nlp/modeling`
- Application development: `/src/wb_nlp/app`
The objective of this project is to use NLP models to learn topics, embeddings, and other representations that can help improve the discoverability of documents. A major requirement for this undertaking is the availability of data that can be used to train machine learning models.
We aim to use web scraping and API access, when possible, to aggregate a large collection of documents. Document aggregation will focus on documents produced by international organizations and multilateral development banks in order to build a specialized corpus centered on economics and development.
Technologies such as Scrapy and NiFi, as well as vanilla implementations of API access, are used to implement the web scraping part of the project.
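As a small illustration of the vanilla API-access approach, the sketch below pages through a hypothetical JSON documents API with `requests`; the endpoint, parameters, and response fields are placeholders, not the actual sources or client code used in this repository.

```python
import requests

# Hypothetical documents API; the real sources and parameters used by the project differ.
API_URL = "https://api.example.org/documents"


def fetch_metadata(rows_per_page=100, max_pages=10):
    """Page through a JSON API and collect document metadata records."""
    records = []
    for page in range(max_pages):
        response = requests.get(
            API_URL,
            params={"format": "json", "rows": rows_per_page, "offset": page * rows_per_page},
            timeout=30,
        )
        response.raise_for_status()
        batch = response.json().get("documents", [])
        if not batch:
            break  # no more results
        records.extend(batch)
    return records


if __name__ == "__main__":
    print(f"Fetched {len(fetch_metadata())} metadata records")
```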
Useful metadata that are directly available from the source websites are also collected to enrich the information and expressivity of downstream features. We also implement various knowledge extraction modules designed to capture named entities and extract keywords from the documents to complement the originally published metadata. We use spaCy and regular expressions to power most of our knowledge extraction implementations.
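A minimal sketch of this kind of extraction is shown below, combining spaCy's pretrained pipeline for named entities with a simple regular expression for acronym candidates; the model name and pattern are illustrative, not necessarily the exact ones used in `/src/wb_nlp/processing/extraction`.

```python
import re

import spacy

# Illustrative choices; the repository may use a different spaCy model or custom components.
nlp = spacy.load("en_core_web_sm")
ACRONYM_PATTERN = re.compile(r"\b[A-Z]{2,6}\b")  # e.g., GDP, PPP, IBRD


def extract_knowledge(text):
    """Return named entities and acronym candidates found in the text."""
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    acronyms = sorted(set(ACRONYM_PATTERN.findall(text)))
    return {"entities": entities, "acronyms": acronyms}


sample = "The IBRD and the IMF released a joint report on poverty in Sub-Saharan Africa."
print(extract_knowledge(sample))
```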
Later, outputs from the NLP models will also be integrated into the metadata: automatically learned LDA topics for each document, topic classifications inferred by a model trained to predict topics from a predefined taxonomy, translated titles for non-English documents, etc.
Most of the raw data that we use come in the form of PDF and text documents. We develop a suite of preprocessing and cleaning modules to handle the transformations required to generate high-quality input for our models.
An overview of the pipeline is as follows (a minimal sketch of these steps appears after the list):
- Convert PDF to text.
- Parse the text document and perform sentence tokenization.
- Lemmatize the tokens and remove stop words.
- Drop all non-alphabetical tokens.
- Apply spell check and try to recover misspelled words.
- Normalize tokens by converting to lowercase.
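A minimal sketch of these steps (starting from already-extracted text) is shown below, using spaCy for tokenization and lemmatization; the model name is an assumption, and the spell-check step is only stubbed out since the actual recovery logic lives in `/src/wb_nlp/processing/cleaning`.

```python
import spacy

# Illustrative cleaning pipeline; model choice and spell-check handling are assumptions.
nlp = spacy.load("en_core_web_sm")


def correct_spelling(word):
    """Placeholder for the spell-check/recovery step; returns the word unchanged here."""
    return word


def clean_text(raw_text):
    """Sentence-tokenize, lemmatize, drop stop words and non-alphabetic tokens, lowercase."""
    doc = nlp(raw_text)
    tokens = []
    for sentence in doc.sents:                    # sentence tokenization
        for token in sentence:
            if token.is_stop or not token.is_alpha:
                continue                          # remove stop words and non-alphabetic tokens
            tokens.append(correct_spelling(token.lemma_).lower())  # lemmatize + lowercase
    return tokens


print(clean_text("The World Bank published 120 reports on economic developments in 2020."))
```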
Part of the preprocessing is also the inference of phrases in the documents. Phrases are logical groupings of tokens that together convey a distinct meaning.
We primarily leverage the Gensim NLP toolkit and spaCy to develop the phrase detection algorithms.
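A minimal sketch of phrase detection with Gensim's `Phrases` model is shown below; the toy corpus and the `min_count`/`threshold` values are illustrative, not the settings used in the project.

```python
from gensim.models.phrases import Phrases, Phraser

# Toy tokenized sentences; in practice these come from the cleaned documents.
sentences = [
    ["poverty", "reduction", "strategy", "paper"],
    ["poverty", "reduction", "remains", "a", "priority"],
    ["the", "poverty", "reduction", "program", "expanded"],
    ["access", "to", "clean", "water"],
]

# min_count and threshold are illustrative; tune them on the real corpus.
phrases = Phrases(sentences, min_count=2, threshold=1)
phraser = Phraser(phrases)  # lighter-weight object for applying the detected phrases

# Frequently co-occurring tokens are merged, e.g., ['poverty_reduction', 'is', 'important'].
print(phraser[["poverty", "reduction", "is", "important"]])
```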
Acronyms are fairly common in documents from development organizations and multilateral development banks. In this project, we include in our pipeline an acronym detector and expander. The idea is to detect acronyms in a document and replace all of the acronyms with the appropriate expansion.
We also keep track of acronyms that have multiple possible expansions and generate prototypes for each sense that encode the acronym's information, e.g., PPP -> public-private partnership or purchasing power parity.
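The sketch below shows one simple way such an expander could work, pairing a regex detector with a small lookup of expansions; the dictionary and the single-expansion assumption are placeholders for the fuller prototype-based approach described above.

```python
import re

# Illustrative lookup; the project derives expansions from the documents themselves.
KNOWN_EXPANSIONS = {
    "PPP": "public-private partnership",  # could also be "purchasing power parity"
    "GDP": "gross domestic product",
}

ACRONYM_PATTERN = re.compile(r"\b[A-Z]{2,6}\b")


def expand_acronyms(text, expansions=KNOWN_EXPANSIONS):
    """Replace detected acronyms with their expansions; leave unknown acronyms untouched."""
    return ACRONYM_PATTERN.sub(lambda m: expansions.get(m.group(0), m.group(0)), text)


print(expand_acronyms("The PPP boosted GDP growth in the region."))
# -> The public-private partnership boosted gross domestic product growth in the region.
```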
The processes above are precursors to generating the inputs for the different NLP models. The NLP models are trained to generate derived data that can be used in downstream use cases. The models are primarily unsupervised and fall into two categories: topic models and embedding models.
The topic model employed in this project is the LDA implementation in the MALLET toolkit. A Gensim wrapper is used as an interface to the original Java implementation.
The resulting topics learned by the model are used to characterize documents. With this, we build a topic-based document search tool.
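A minimal sketch of this setup is shown below, using the `LdaMallet` wrapper that ships with Gensim 3.x (the wrappers module was removed in Gensim 4); the MALLET path, toy corpus, and parameters are placeholders.

```python
from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet  # available in Gensim 3.x

# Toy tokenized documents; in practice these are the cleaned corpus files.
documents = [
    ["poverty", "reduction", "growth", "policy"],
    ["infrastructure", "investment", "growth", "financing"],
    ["health", "education", "policy", "spending"],
]

dictionary = Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Path to a local MALLET installation (placeholder).
mallet_path = "/path/to/mallet-2.0.8/bin/mallet"

lda = LdaMallet(mallet_path, corpus=corpus, num_topics=2, id2word=dictionary)

for topic_id, topic in lda.show_topics(num_topics=2, formatted=True):
    print(topic_id, topic)
```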
The Gensim implementation of the word2vec model is used in this project. We extract the learned word embeddings and use them to represent documents by taking the mean of the word vectors in each document. This simple strategy allows us to offer semantic search of documents using keywords or full documents as input.
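The sketch below illustrates this averaging strategy with Gensim's `Word2Vec` (Gensim 4 API); the toy corpus and parameters are placeholders for the real training setup.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice this is the cleaned document collection.
sentences = [
    ["poverty", "reduction", "growth", "policy"],
    ["infrastructure", "investment", "growth", "financing"],
    ["health", "education", "policy", "spending"],
]

# vector_size/epochs are illustrative; Gensim 3.x uses `size` and `iter` instead.
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50, seed=1)


def embed_document(tokens, w2v):
    """Represent a document as the mean of the word vectors of its in-vocabulary tokens."""
    vectors = [w2v.wv[token] for token in tokens if token in w2v.wv]
    if not vectors:
        return np.zeros(w2v.wv.vector_size)
    return np.mean(vectors, axis=0)


query = embed_document(["education", "spending"], model)
print(query.shape)  # (50,)
```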
- Upload the document metadata into the MongoDB `nlp` database under the `docs_metadata` collection. Refer to `load_json_dump_to_db.py` for an example.
- Configure the cleaning config file (`lda.yml`) for the LDA model.
- Run the `clean_corpus.py` script to perform the cleaning of the documents to be used by the LDA model.
- Configure the model config file (`lda/default.yml`) for the LDA model.
- Run `train_lda_model.py` to train the LDA model with the cleaned text.
- Upload the document metadata into the MongoDB `nlp` database under the `docs_metadata` collection. Refer to `load_json_dump_to_db.py` for an example (see also the sketch after this list). Make sure that the `path_original` field corresponds to the path where the corresponding document is stored. Documents should ideally be stored following this convention: `/data/corpus/<corpus_id>/*.txt`.
- Create a cleaning config based on `/configs/cleaning/default.yml` and upload it to MongoDB in the `nlp/cleaning_configs` collection. It is recommended to add some description of the configuration in the `meta` field of the configuration file. You can use the `/scripts/configs/load_configs_to_db.py` script to load the configuration into the database.
- To start cleaning the corpus, run the `scripts/cleaning/clean_docs_from_db.py` script. Provide the `cleaning_config_id` of the configuration that you want to use to clean the documents. This assumes that all the documents in `docs_metadata` will be cleaned. The cleaning script will store the cleaned data at `/data/corpus/cleaned/<cleaning_config_id>/<corpus>/<file_name>`. The `file_name` and the `corpus` will be extracted from the metadata corresponding to the document in the `nlp/docs_metadata` collection.
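As an illustration of what the metadata upload could look like with pymongo, a minimal sketch is shown below; the connection settings, dump path, and the presence of an `_id` field are assumptions, and `load_json_dump_to_db.py` remains the reference implementation.

```python
import json

from pymongo import MongoClient

# Connection settings are placeholders; adjust to the actual MongoDB deployment.
client = MongoClient("mongodb://localhost:27017")
collection = client["nlp"]["docs_metadata"]

# Metadata dump produced by the scraping stage (path is illustrative).
with open("/data/corpus/WB/metadata.json") as json_file:
    records = json.load(json_file)

for record in records:
    # path_original must point to the stored text file, e.g., /data/corpus/<corpus_id>/*.txt
    assert "path_original" in record
    collection.replace_one({"_id": record["_id"]}, record, upsert=True)

print(f"Loaded {len(records)} metadata records into nlp/docs_metadata")
```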
When you want to train an LDA model, you first need to generate a valid configuration. This configuration must be uploaded to MongoDB under the `nlp/model_configs` collection. You can use the `/scripts/configs/load_configs_to_db.py` script to load the configuration into the database.
After the configuration is uploaded, select which cleaned documents (defined by the `cleaning_config_id`) will be used. Also provide the `model_config_id` that will be used for the model.
The script `scripts/models/train_lda_model_from_db.py` can be used to train an LDA model given a valid `cleaning_config_id` and `model_config_id`. The script implements the following steps:
- Perform validity checks on the configuration. This is done by making sure that `model_name = "lda"`. It also checks whether the version of the installed library matches the one defined in the config, which reduces the likelihood of bugs introduced when the model implementation changes across library versions.
- Check whether a processed corpus is already available. If yes, load it. Otherwise, create a dictionary from the cleaned data, build the corpus, and save both to disk so that the corpus does not need to be recreated.
- Create a summary of the model run (`model_run_info`) capturing the details of the experiment. This will be saved in MongoDB to keep track of which models are available.
- Train the model using the parameters in the `model_config`.
- Save the model to disk and insert the `model_run_info` into the database (see the sketch after this list).
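As an illustration of the `model_run_info` bookkeeping, the sketch below assembles a summary record and inserts it with pymongo; the collection name and fields are assumptions for illustration, not the exact schema used by the script.

```python
import datetime
import uuid

from pymongo import MongoClient

# Connection settings, collection name, and fields are illustrative assumptions.
client = MongoClient("mongodb://localhost:27017")
runs_collection = client["nlp"]["model_runs_info"]

model_run_info = {
    "_id": str(uuid.uuid4()),
    "model_name": "lda",
    "model_config_id": "<model_config_id>",
    "cleaning_config_id": "<cleaning_config_id>",
    "library_version": "3.8.3",  # compared against the version pinned in the config
    "trained_at": datetime.datetime.utcnow().isoformat(),
    "model_path": "/models/lda/<model_run_id>",
}

runs_collection.insert_one(model_run_info)
print("Registered model run", model_run_info["_id"])
```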
Please read DEVELOPERS.md for more details about the development workflow.