Pipeline for article scraping with Scrapy, PostgreSQL, and SQLAlchemy

A simple customizable pipeline to capture articles in a specific format for text analysis.

Hey there! I made this repository to practice basic scraping and SQL. The motivation behind the project was to set up a simple pipeline that stores articles so they can then be analyzed with PostgreSQL and Jupyter notebooks. It is not optimized for advanced text querying or NLP, but it is a good basis for the "obtaining" part of data analysis.

Stack

Scrapy for getting and preprocessing data
PostgreSQL to store data in a structured way
SQLAlchemy to define database models and relationships, and to talk to the database
Alembic to handle migrations
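
To make the SQLAlchemy part of the stack concrete, here is a minimal sketch of what an article model could look like. The Article class, its columns, and the save_article helper are assumptions made for illustration, not the repository's actual schema, and an in-memory SQLite engine stands in for the Postgres instance so the snippet runs without Docker.

```python
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Integer, String, Text, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Article(Base):
    """Hypothetical article table; the repository's real schema may differ."""

    __tablename__ = "articles"

    id = Column(Integer, primary_key=True)
    url = Column(String(2048), unique=True, nullable=False)
    title = Column(String(512), nullable=False)
    body = Column(Text, nullable=False)
    scraped_at = Column(
        DateTime, nullable=False, default=lambda: datetime.now(timezone.utc)
    )


def save_article(session, url, title, body):
    """Insert one article and return its generated primary key."""
    article = Article(url=url, title=title, body=body)
    session.add(article)
    session.commit()
    return article.id


# Assumption: in-memory SQLite replaces Postgres to keep the sketch standalone.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    new_id = save_article(session, "https://example.com/a", "A title", "Some text")
    stored_title = session.get(Article, new_id).title
```

Pointing create_engine at a Postgres URL instead would leave the model code unchanged, which is the main reason to go through an ORM here. In a real project, Alembic migrations would normally create the tables rather than Base.metadata.create_all.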

Features

Set up the database easily with docker-compose
Define your own crawlers and pipelines
Predefined database model to start collecting data
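
As an illustration of defining your own pipeline, the sketch below shows a Scrapy item pipeline that normalizes whitespace and drops incomplete articles. The class name CleanArticlePipeline and the field names url, title, and body are hypothetical; the fallback DropItem class exists only so the snippet can run without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in so the sketch runs without Scrapy installed."""


class CleanArticlePipeline:
    """Hypothetical pipeline: normalizes text and drops incomplete items."""

    # Assumed field names; adapt them to your own item definition.
    required_fields = ("url", "title", "body")

    def process_item(self, item, spider):
        cleaned = {}
        for key, value in dict(item).items():
            if isinstance(value, str):
                # Collapse runs of whitespace and trim both ends.
                value = " ".join(value.split())
            cleaned[key] = value

        missing = [name for name in self.required_fields if not cleaned.get(name)]
        if missing:
            raise DropItem(f"missing fields: {missing}")
        return cleaned


pipeline = CleanArticlePipeline()
result = pipeline.process_item(
    {"url": "https://example.com/a", "title": "  A   title ", "body": "Some\n text"},
    spider=None,
)
```

Scrapy only calls pipelines that are listed in the ITEM_PIPELINES setting, which maps the class's import path to an integer that sets the execution order.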

Entity relationship diagram

Usage

We recommend creating a dedicated virtual environment for the project.
- For instance, with Python 3: python3 -m venv my-env. On Unix, activate it with source my-env/bin/activate.
Run pip install -e . to install the package in editable mode, so you can develop and modify the modules.
Run docker-compose up -d to spin up the Postgres database instance.
Run pytest to check that everything is set up.
Run scrapy crawl example to run the example crawler.
Run the example.ipynb notebook to explore the database.
Write your own scrapers to get your data.
Enjoy :)

