piazza-scraper

scrape piazza posts for your class

usage

make and start a virtual environment, e.g. virtualenv env && source env/bin/activate
install requirements, e.g. pip install -r requirements.txt
copy config.txt into config.py: cp config.txt config.py
write your piazza email and password securely into config.py
run the scraper: python3 scraper.py
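
a quick sanity check for config.py might look like the sketch below. the field names (EMAIL, PASSWORD, CLASS_URL, NUM_POSTS) and the example url are assumptions for illustration only; config.txt has the names the scraper actually uses.

```python
from types import SimpleNamespace
from urllib.parse import urlparse

# Hypothetical field names and placeholder values; config.txt is the real reference.
example_config = SimpleNamespace(
    EMAIL="you@example.edu",
    PASSWORD="change-me",
    CLASS_URL="https://piazza.com/class/your_class_id",
    NUM_POSTS=600,
)


def check_config(cfg):
    """Return a list of problems found in cfg; an empty list means it looks sane."""
    problems = []
    if "@" not in getattr(cfg, "EMAIL", ""):
        problems.append("EMAIL does not look like an email address")
    if not getattr(cfg, "PASSWORD", ""):
        problems.append("PASSWORD is empty")
    url = urlparse(getattr(cfg, "CLASS_URL", ""))
    if url.scheme != "https" or not url.netloc.endswith("piazza.com"):
        problems.append("CLASS_URL should be an https piazza.com link")
    num_posts = getattr(cfg, "NUM_POSTS", 0)
    if not isinstance(num_posts, int) or num_posts <= 0:
        problems.append("NUM_POSTS should be a positive integer")
    return problems


if __name__ == "__main__":
    for problem in check_config(example_config):
        print("config problem:", problem)
```

with a real config.py you could pass the imported module itself, e.g. import config and then check_config(config), since the function only reads attributes.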

posts are always written to the same file, outputs/piazza_posts.md, so rerunning the program will overwrite previous results
I used selenium to open links and scrape the html, since piazza doesn't offer a publicly available API for grabbing the data directly
the scraper may take a while (it took 1 hr for 600 posts on a macbook pro); I have thought about parallelizing it with multiprocessing and properly used locks, which would be pretty cool
config.py is kinda important! make sure you put in the correct url for your class's piazza and the number of posts you'd like to scrape
invalid post ids (ones that were not successfully scraped) are printed to the console at the end
some of the selenium methods are deprecated, so the console may be flooded with warnings, oops

you need the right version of chromedriver to run. This means the chromedriver version needs to match the version of chrome that the bot is running
remember to run everything inside the virtual environment and to set config.py correctly
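
both reminders can be turned into a small preflight check. this is only a sketch: it assumes "match" means the major version numbers agree, and it expects you to paste in the version strings yourself (e.g. from chromedriver --version and from chrome's about page). the function names are made up for the example.

```python
import re
import sys


def in_virtualenv():
    """True if the interpreter appears to be running inside a virtual environment."""
    # venv and recent virtualenv make sys.prefix differ from sys.base_prefix;
    # older virtualenv releases set sys.real_prefix instead
    return hasattr(sys, "real_prefix") or sys.prefix != getattr(
        sys, "base_prefix", sys.prefix
    )


def major_version(version_output):
    """Pull the major version out of text like 'ChromeDriver 120.0.6099.71 (...)'."""
    match = re.search(r"(\d+)\.\d+", version_output)
    return int(match.group(1)) if match else None


def versions_match(chrome_output, driver_output):
    """Compare only the major versions of chrome and chromedriver."""
    chrome = major_version(chrome_output)
    driver = major_version(driver_output)
    return chrome is not None and chrome == driver


if __name__ == "__main__":
    if not in_virtualenv():
        print("warning: not running inside a virtual environment")
    # example strings; replace with the output from your own machine
    if not versions_match("Google Chrome 120.0.6099.109", "ChromeDriver 120.0.6099.71"):
        print("warning: chrome and chromedriver major versions differ")
```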
