A toolkit for moving IPFS Wikipedia articles to the Great Web.
Status: alpha
- This is very early alpha software. It is highly recommended to create a separate account for this crawler.
- Keep data/links.csv and do not delete it, to avoid sending invalid transactions to the network.
- ipfs version 0.4.22
- python3
- Clone this repo and go into it:
git clone https://github.com/SaveTheAles/wiki-crawler.git
cd wiki-crawler
- Install all requirements
- Install python packages:
pip3 install -r requirements.txt
- Fill config.py with your personal credentials.
- Fill data/queries.txt with the keywords you are interested in parsing, one keyword per line.
- Launch the IPFS daemon:
ipfs daemon
- Run:
python3 main.py
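
The one-keyword-per-line format of data/queries.txt can be illustrated with a minimal sketch. The helper name `load_queries` is hypothetical and is not taken from the repo; it only shows how such a file could be read while ignoring blank lines.

```python
from pathlib import Path


def load_queries(path="data/queries.txt"):
    """Return the non-empty, stripped keywords from a one-keyword-per-line file.

    Hypothetical helper for illustration; the crawler's own loader may differ.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Skip blank lines and trim whitespace so stray newlines are harmless.
    return [line.strip() for line in lines if line.strip()]
```
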
The crawler reads keywords from your data/queries.txt, searches Wikipedia for article titles matching those keywords, and creates cyberlinks:
query -> [titles]
[titles] -> query
After that, the crawler fetches every article from the distributed Wikipedia by the titles it found and creates cyberlinks:
[titles] -> [articles]
Finally, it extracts links from the articles with the query keyword and cyberlinks them too:
[articles] -> [links]
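
The link pattern above can be sketched as a small function that builds (from, to) pairs. This is an illustration under assumptions, not the crawler's actual code: the name `build_cyberlinks` and its argument shapes are hypothetical, and plain strings stand in for whatever identifiers the crawler really links. It does not search Wikipedia, touch IPFS, or send transactions.

```python
def build_cyberlinks(query, titles, articles, links):
    """Build (from, to) pairs following the pattern described above.

    Hypothetical argument shapes, chosen for illustration only:
      titles:   list of article titles found for the query
      articles: dict mapping a title to its article identifier
      links:    dict mapping an article identifier to a list of links
    """
    pairs = []
    for title in titles:
        pairs.append((query, title))  # query -> [titles]
        pairs.append((title, query))  # [titles] -> query
    for title in titles:
        article = articles.get(title)
        if article is None:
            continue  # no article was found for this title
        pairs.append((title, article))  # [titles] -> [articles]
        for link in links.get(article, []):
            pairs.append((article, link))  # [articles] -> [links]
    return pairs


example = build_cyberlinks(
    "ipfs",
    ["IPFS"],
    {"IPFS": "article-ipfs"},
    {"article-ipfs": ["link-1"]},
)
# example == [("ipfs", "IPFS"), ("IPFS", "ipfs"),
#             ("IPFS", "article-ipfs"), ("article-ipfs", "link-1")]
```
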
All cyberlinks you create are stored in data/links.csv.
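
Because data/links.csv records what has already been linked, it can be used to skip duplicates before sending anything. The sketch below is a guess at how that might look. It ASSUMES the first two columns of each row are the source and target of a link, which the README does not confirm, and both function names are hypothetical.

```python
import csv
from pathlib import Path


def load_known_links(path="data/links.csv"):
    """Read previously stored links as a set of (from, to) tuples.

    ASSUMPTION: the first two columns of each row hold the link's
    source and target. Check the real file layout before relying on this.
    """
    file = Path(path)
    if not file.exists():
        return set()
    with file.open(newline="", encoding="utf-8") as handle:
        return {(row[0], row[1]) for row in csv.reader(handle) if len(row) >= 2}


def filter_new_links(candidates, known):
    """Keep only candidate pairs that are not already known, preserving order."""
    seen = set(known)
    fresh = []
    for pair in candidates:
        if pair not in seen:
            seen.add(pair)  # also drops duplicates within the candidates
            fresh.append(pair)
    return fresh
```
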
- cids.py - tool for extracting all the CIDs you crawled to data/cids.txt. It should be useful if you need to pin your CIDs on a remote machine running an IPFS node or IPFS cluster.
- rpc_check.py - tool for an extra check of whether your address has already created some cyberlinks. You can use it to avoid invalid transactions for links that already exist.
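
As a rough illustration of what a CID extraction step like the one cids.py performs could look like, here is a minimal sketch. It is not the script's actual code. It ASSUMES the first two columns of each data/links.csv row contain CIDs, and the function name `extract_cids` is hypothetical.

```python
import csv


def extract_cids(links_path="data/links.csv", cids_path="data/cids.txt"):
    """Write each distinct value from the first two CSV columns, one per line.

    ASSUMPTION: those two columns contain the CIDs of each link.
    Returns the values in first-seen order.
    """
    cids = []
    seen = set()
    with open(links_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            for value in row[:2]:
                value = value.strip()
                if value and value not in seen:
                    seen.add(value)
                    cids.append(value)
    with open(cids_path, "w", encoding="utf-8") as out:
        out.write("\n".join(cids) + ("\n" if cids else ""))
    return cids
```

A file in this one-CID-per-line shape could then be fed to a pinning step on the remote node, for example by running `ipfs pin add` for each line.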
- Move wallet.py and transaction.py to the cyber-py library and refactor
- Add Mongo or another db as local storage for cyberlinks
- Include rpc_check.py as a parallel process