Convert Wikipedia database dumps into plain-text JSON files. The script can parse essentially all of Wikipedia with high fidelity. A copy of the output is available on Kaggle Datasets.
- Download all the .bz2 files from a dump at https://dumps.wikimedia.org/enwiki/ (a download sketch follows this list). The filenames look like `enwiki-20201120-pages-articles-multistream1.xml-p1p41242.bz2`
- Unzip all the .bz2 files directly into another directory, such as `WikipediaArchive` (see the decompression sketch after this list)
- Install the dependencies listed in `REQUIREMENTS.TXT`, e.g. with `pip install -r REQUIREMENTS.TXT`
- Update the source and destination directory variables in `jsonify_wikipedia.py`
- Run the script: `python jsonify_wikipedia.py`
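If you want to script the download step, here is a minimal sketch using only the standard library. The dump date and the `Dumps` directory name are assumptions for illustration; substitute a current dump and your own paths.

```python
import re
import urllib.request
from pathlib import Path

# Hypothetical dump date; pick a current one from https://dumps.wikimedia.org/enwiki/
DUMP = "https://dumps.wikimedia.org/enwiki/20201120/"
DEST = Path("Dumps")  # hypothetical download directory
DEST.mkdir(exist_ok=True)

# Scrape the dump index page for the multistream article part files.
index = urllib.request.urlopen(DUMP).read().decode("utf-8")
names = sorted(set(re.findall(
    r"enwiki-\d+-pages-articles-multistream\d+\.xml-p\d+p\d+\.bz2", index)))

for name in names:
    target = DEST / name
    if not target.exists():  # resume-friendly: skip files already fetched
        print("downloading", name)
        urllib.request.urlretrieve(DUMP + name, str(target))
```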
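The unzip step can likewise be scripted instead of done by hand. A sketch assuming the archives sit in the same hypothetical `Dumps` directory:

```python
import bz2
import shutil
from pathlib import Path

SRC = Path("Dumps")             # where the .bz2 files were downloaded
DST = Path("WikipediaArchive")  # the directory jsonify_wikipedia.py reads from
DST.mkdir(exist_ok=True)

for archive in sorted(SRC.glob("*.bz2")):
    out = DST / archive.stem  # drops the trailing .bz2
    print("decompressing", archive.name)
    with bz2.open(archive, "rb") as fin, open(out, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream, so memory use stays flat
```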
Running the script deposits JSON files of roughly 40 MB each into the destination folder. Each filename is unique because it is generated from a UUIDv4, which makes collisions vanishingly unlikely.
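The naming presumably reduces to something like the following one-liner; the actual code in `jsonify_wikipedia.py` may differ.

```python
import uuid

filename = f"{uuid.uuid4()}.json"  # one fresh UUIDv4 per output file
```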
Each file is a JSON document whose root element is a list. Every dictionary in the list has exactly three keys: `id`, `title`, and `text`. The `id` field comes from the Wikipedia article ID; `title` and `text` hold the page title and the plain-text parsed article, respectively. An example follows.
```json
[
  {
    "id": "17279752",
    "text": "Hawthorne Road was a cricket and football ground in Bootle in England...",
    "title": "Hawthorne Road"
  }
]
```
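Consuming the output is then a matter of loading each file and iterating over the list. A minimal reading sketch; `WikipediaJson` is a hypothetical destination folder name, so use whatever you configured in the script:

```python
import json
from pathlib import Path

DEST = Path("WikipediaJson")  # hypothetical; match the script's destination variable

for path in sorted(DEST.glob("*.json")):
    with open(path, encoding="utf-8") as f:
        articles = json.load(f)  # a list of {"id", "title", "text"} dicts
    for article in articles:
        print(article["id"], article["title"], len(article["text"]))
```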
Wikipedia content is published under the Creative Commons Attribution-ShareAlike license (CC BY-SA); see https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content for the reuse terms. My script is published under the MIT license, but this does not confer the same privileges on the material you convert with it.
Planned improvements:
- Maintain some article structure in the JSON output
- Better handling when stripping markup
- Better retention of link and image context