Wikipedia Data Extraction
The English Wikipedia pages used for this project come in high volume. The latest bz2 archive is about 15 GB, plus another 120 MB for the index file. The dump used for the initial load came from https://dumps.wikimedia.org/enwiki/ using the multistream version. This was preferable because, with multistream, it is possible to get an article from the archive without unpacking the whole thing. The utility used to extract the data was forked from Wikiforia (https://github.com/marcusklang/wikiforia). Several code changes were necessary to extract the desired data and to produce the files in the correct formats, separated for later use.
To extract the data from the archive, the following command was executed:
# Plain text extraction of Wikipedia pages. The output argument points to a directory in which
# each article is contained in its own file. Each file is named by the article ID, which
# will later correspond to the wiki_mapping.csv used by the bigram_indexer
java -jar Wikiforia.jar -index ../wikipages/enwiki-latest-pages-articles-multistream-index.txt.bz2 -pages ../wikipages/enwiki-latest-pages-articles-multistream.xml.bz2 --filesPerDir 100000 --output-format plain-text -o ../all_pages > pages.log
This takes in the archive and index file; in this case the plain-text output format was used. The plain-text modifications extract only the text of each article and strip any other markup that is not needed. The output directory is populated with sub-directories, each containing files named by the article's page ID. At the time of this extraction there were about 7.2 million articles. To keep the batches imported into the indexing database at a manageable size, the filesPerDir argument limits the number of files placed in each sub-directory, and therefore how many are processed at a time.
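As a quick sanity check of the extraction, the output layout can be inspected with a short script like the one below. This is only a sketch; the root path and the files_0, files_1, ... sub-directory layout follow the description above and the -o argument used in the command, and may need adjusting for your setup.
# Sketch only: count extracted articles per batch directory, assuming the
# layout described above (one plain-text file per article, named by page ID).
import os

output_root = os.path.expanduser("~/all_pages")   # adjust to the -o directory used above
for batch in sorted(os.listdir(output_root)):
    batch_dir = os.path.join(output_root, batch)
    if os.path.isdir(batch_dir):
        print(batch, len(os.listdir(batch_dir)), "articles")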
The plain-text articles are then fed into the bigram_indexer.py script to index the page contents. The indexer uses a TF-IDF implementation and stores the resulting data in our PostgreSQL database (a minimal sketch of the TF-IDF idea follows the execution notes below). Importing this data is accomplished using the following command:
python3 bigram_indexer.py --new-docs --type 5 -d ~/all_wiki_pages/files_# -m ~/page_mapping/wiki_mapping.csv
This script can be executed multiple times over different directories; in this case, ~/all_pages/files_0 ... files_n. To increase throughput, GNU Parallel (https://www.gnu.org/software/parallel/) was used to run multiple instances of the bigram_indexer script:
parallel python3 bigram_indexer.py --new-docs --type 5 -d ~/all_wiki_pages/files_{} -m ~/page_mapping/wiki_mapping.csv ::: 0 1 2 3 4
This starts 5 processes, each executing the bigram_indexer script against a separate directory of plain-text Wikipedia articles.
Note: when running this over SSH, use the screen command as follows to keep the processes running without an active connection:
screen -d -m -L parallel python3 bigram_indexer.py --new-docs --type 5 -d ~/all_wiki_pages/files_{} -m ~/page_mapping/wiki_mapping.csv ::: 0 1 2 3 4
This logs the script output to a file and starts the execution in a detached session. Use the "screen -r" command to re-attach.
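For context on what the indexer computes, the snippet below is a minimal sketch of TF-IDF weighting over word bigrams. It is only an illustration of the idea, not the actual bigram_indexer.py implementation, which also handles tokenization details and the PostgreSQL storage.
# Illustrative sketch of bigram TF-IDF; not the real bigram_indexer.py.
import math
from collections import Counter

def bigrams(text):
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

def tfidf(docs):
    # docs: {doc_id: text}; returns {doc_id: {bigram: weight}}
    term_counts = {doc_id: Counter(bigrams(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    weights = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values()) or 1
        weights[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights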
When users search for research papers, the query terms are also searched across the indexed Wikipedia articles. Once articles have been identified, their IDs will point back to the document database, which in this case is MongoDB. Using the modified Wikiforia utility, each Wikipedia article's title, URL, and ID are extracted and stored in JSON format within MongoDB. This process is a little more straightforward and produces a single JSON file which will later be used for import. To extract this data, the following command is executed:
# JSON extraction of Wikipedia pages. The output argument points to a directory where a file
# will be created that contains all of the articles' titles, IDs, and URLs. This file will
# be used by mongoimport to insert the data into the wiki collection.
java -jar Wikiforia.jar -index ../wikipages/enwiki-latest-pages-articles-multistream-index.txt.bz2 -pages ../wikipages/enwiki-latest-pages-articles-multistream.xml.bz2 --output-format json -o ../page_url_output
Notice that the --output-format argument has been changed to json instead of plain-text. This will transform the ID, title, and URL of each article into a JSON document.
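Before importing, the JSON output can be spot-checked with a few lines of Python. This is just a sketch: the exact path and the field names (id, title, url) are assumed from the description above, and the file is assumed to contain one JSON document per line, which is what mongoimport accepts by default.
# Sketch only: print the first few records of the Wikiforia JSON output.
# Path and field names are assumptions based on the commands above.
import json

with open("../page_url_output/wikiOutput.json") as f:
    for i, line in enumerate(f):
        doc = json.loads(line)
        print(doc.get("id"), doc.get("title"), doc.get("url"))
        if i >= 4:
            break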
Once the JSON file for article titles has been created using Wikiforia it can be imported into MongoDB. The first step in this process is to create a wiki collection within the sharesci database:
$ mongo
> db.createCollection("wiki", {size: 1073741824 });
The initial size is intentionally large so that the collection does not need to grow dynamically while data is being imported. Once the collection has been created, the mongoimport utility can be used to read the JSON file and insert the data. Use the following command to import:
# Import Wikipedia URLs, titles, and IDs into MongoDB from Wikipedia page extraction (bz2)
mongoimport --db sharesci --collection wiki --file wikiOutput.json
For searching purposes, an index may be required on the id field. This is useful for sorting output during incremental imports as well as for direct searches through the engine.
> db.wiki.createIndex( {id: 1}, {unique: true} );
To index data using the bigram_indexer script, a mapping file must be created. This is done after the article titles, IDs, and URLs have been imported into MongoDB. The mapping file maps the internal MongoDB object ID of each article to its actual article ID. The output is a CSV file in which the first item is the Mongo object ID and the second item is the document ID. This can be done for either Wikipedia pages or research papers. Use the following command to create the mapping file:
# Create mapping file from wiki docs in MongoDB
mongoexport --db=sharesci --collection=wiki --fields='_id,id' --type=csv --out wiki_mapping.csv
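As a sketch of how this mapping can be consumed, the CSV can be loaded into a dictionary keyed by the Mongo object ID. The real bigram_indexer.py may read it differently; this only illustrates the "object ID first, document ID second" layout described above.
# Sketch only: load wiki_mapping.csv into a {mongo_object_id: article_id} dict.
# mongoexport writes a header row with the field names (_id, id), which is skipped here.
import csv

mapping = {}
with open("wiki_mapping.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                      # skip the _id,id header row
    for object_id, article_id in reader:
        mapping[object_id] = article_id
print(len(mapping), "articles mapped")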
When the Wikipedia data is indexed, any article that matches a search query points back to the PostgreSQL document table (document.text_id) to get the MongoDB object ID. Using the Mongo object ID, we can fetch the article for display in the search results (a sketch of this lookup appears at the end of this section). Similar to Wikipedia articles, this can be done for other document sources such as arXiv using the following command:
# Create mapping file from papers docs in MongoDB
mongoexport --db=sharesci --collection=papers --fields='_id,arXiv_id' --type=csv --out papers_mapping.csv
In either case, the bigram_indexer expects the mapping to use the MongoDB object ID first and the document/article ID second.
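To make the end-to-end lookup concrete, the sketch below resolves a matched article from the PostgreSQL document table back to its MongoDB record. Only document.text_id and the sharesci/wiki names come from the description above; the connection parameters, the document table's key column, and the assumption that text_id stores the Mongo object ID as a hex string are all illustrative, and the actual search engine code may differ.
# Sketch only: resolve a search hit to its MongoDB article document.
# Connection details and the document-table key column ("_id" here) are assumptions.
import psycopg2
from bson import ObjectId
from pymongo import MongoClient

pg = psycopg2.connect(dbname="sharesci")          # connection parameters assumed
mongo = MongoClient()["sharesci"]

def article_for_doc(doc_pk):
    with pg.cursor() as cur:
        cur.execute("SELECT text_id FROM document WHERE _id = %s", (doc_pk,))
        row = cur.fetchone()
    if row is None:
        return None
    # text_id is assumed to hold the MongoDB object ID of the article
    return mongo["wiki"].find_one({"_id": ObjectId(row[0])})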