This repo contains a minimal Python library for parsing posts from the blog of Bayer Zsolt, a well-known political publicist in Hungary. I developed it for Python 2.7 some time ago when I needed a Hungarian text corpus for NLP experiments.
The heart of this library is a class named BayerBlogParser. Its constructor takes one optional argument, depth, which defines how many pages of the blog should be parsed to collect blog posts. If it is not provided, all pages (meaning all blog posts) are parsed. Example usage:
from bayer_blog_parser.blog_parser import BayerBlogParser
post_data = BayerBlogParser().parse_all_posts()
After parsing, .parse_all_posts() will return a list of dictionaries with the following keys:
url - The blog post's URL
title - Title
author - Will be Bayer Zsolt most of the time :)
lead_text - The short summary text visible before opening the particular blog post
thumbnail_url - URL of the thumbnail picture
date - Date published
full_content - The full textual content. Paragraphs are separated by newline characters (\n)
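Since full_content separates paragraphs with newline characters, a paragraph-level corpus can be built with plain string splitting. The snippet below is a minimal sketch of that idea, not part of the library: sample_posts is a hypothetical stand-in for the list returned by parse_all_posts() (all of its values, including the date format, are invented), and split_paragraphs and build_corpus are illustrative helper names.

```python
# Hypothetical sample shaped like the list returned by parse_all_posts().
# Every value below is invented for illustration; the real date format
# and URLs may differ.
sample_posts = [
    {
        "url": "http://example.com/post-1",
        "title": "Sample title",
        "author": "Bayer Zsolt",
        "lead_text": "Short summary.",
        "thumbnail_url": "http://example.com/thumb-1.jpg",
        "date": "2016-01-01",
        "full_content": "First paragraph.\nSecond paragraph.\n\nThird paragraph.",
    },
]


def split_paragraphs(post):
    """Return the non-empty paragraphs of a post's full_content."""
    return [p.strip() for p in post["full_content"].split("\n") if p.strip()]


def build_corpus(posts):
    """Flatten all posts into a single list of paragraphs."""
    corpus = []
    for post in posts:
        corpus.extend(split_paragraphs(post))
    return corpus


corpus = build_corpus(sample_posts)
print(len(corpus))
```

Empty lines are dropped here so that consecutive newlines do not produce empty paragraphs; whether the real data contains such lines is an assumption.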
You can make a pandas DataFrame from it quite easily: pd.DataFrame(post_data).
For an example, check out example_parse_all.py.
To install the library, just run python setup.py install.
To run the tests, install the dependencies with pip install -r requirements.txt and execute nosetests.