Over the years, Web Scraping has become a personal hobby and a way to challenge myself and practice my skills. Most of the projects from this period were never released to the public, so I decided to organize them and publish the code here on GitHub and the data on Kaggle.
My interest in Data Science encouraged me to use Web Scraping to analyze data I care about, such as games and anime.
This repository contains the code used to produce the data published on Kaggle, along with a step-by-step explanation of the process. Have fun with me as I venture into various sites with unstructured data.
Disclaimer: This repository is a personal project, distributed under the MIT License, for practicing Web Scraping and sharing free data for people to do exploratory data analysis. I do not recommend using it for other purposes. Use at your own risk.
I use Python exclusively, along with some of its packages, such as:
- BeautifulSoup
- Requests
- CloudScraper
Remember to respect each site's request limits so that you do not cause any harm.
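One way to respect a request limit is to enforce a minimum delay between consecutive requests. The sketch below is only an illustration, not the code used in these projects: the `Throttle` class, the `fetch_all` helper, and the idea of a fixed minimum interval are assumptions made for this example. It uses only the standard library and takes the download function as a parameter.

```python
import time
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


class Throttle:
    """Hypothetical helper: enforce a minimum delay between requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval  # seconds between two requests
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Sleep if the last request was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    throttle: Throttle,
) -> List[T]:
    """Call `fetch` on each URL, pausing between calls as the throttle requires."""
    results: List[T] = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

With Requests installed, something like `fetch_all(urls, requests.get, Throttle(2.0))` would download the pages at most once every two seconds. The two-second value is arbitrary; choose the interval based on what each site allows.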
You can recommend any site for this project: just send me an e-mail with the site and the reason it should be part of this repository.
Below are all the projects I have done so far, with their links. I hope you have a lot of fun.
no. | project | category | github | kaggle |
---|---|---|---|---|
01 | anime-planet | comics | Link | Link |
02 | tapas | comics | Link | Link |
03 | toomics | comics | Link | Link |
04 | jmlr | articles | Link | Link |
05 | webtoons | comics | Link | Link |
06 | afk-arena | games | Link | |
07 | arknights | games | Link | Link |
08 | justwatch | streamings | Link | Multiple Links¹ |
09 | funko pop | collectibles | Link | Link |
10 | a24 | movies | | |
11 | | | | |
12 | | | | |
13 | | | | |
14 | | | | |
Ref. 1: Each streaming service has its own Kaggle link. Below is a list of the links for all the streaming services:
streaming service | kaggle |
---|---|
hbo max | Link |
hulu | Link |
netflix | Link |
amazon prime | Link |
paramount | Link |
disney+ | Link |
crunchyroll | Link |
dark matter | Link |
rakuten viki | Link |
Copyright (c) 2022 Victor Soeiro
This project is licensed under the MIT License
If you have any questions or suggestions, send me an email to [email protected]