Both seaborn and sklearn provide built-in sample datasets that you can experiment with. Check out each package's documentation for more information.
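As a minimal sketch of what loading these built-in datasets can look like: the helper names below (load_sklearn_iris, load_seaborn_sample) are hypothetical and exist only for illustration, and the choice of the iris and penguins datasets is an assumption. Note that seaborn's load_dataset() fetches its data from an online repository, so it needs an internet connection the first time it is called, whereas the sklearn loader shown here reads data bundled with the package.

```python
from sklearn.datasets import load_iris


def load_sklearn_iris():
    """Return the iris dataset bundled with scikit-learn as a DataFrame."""
    # as_frame=True makes the returned Bunch carry pandas objects;
    # .frame holds the feature columns plus a 'target' column
    bunch = load_iris(as_frame=True)
    return bunch.frame


def load_seaborn_sample(name="penguins"):
    """Return one of seaborn's sample datasets as a DataFrame.

    seaborn is imported lazily because load_dataset() downloads the
    data from seaborn's online repository (internet access required).
    """
    import seaborn as sns
    return sns.load_dataset(name)


iris = load_sklearn_iris()
print(iris.head())
```

Calling load_seaborn_sample() works the same way and returns a DataFrame that is ready for exploration with pandas.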
The following are a few places you can search for data on a variety of topics:
- DataHub
- Google Dataset Search
- Open data on Amazon Web Services
- OpenML
- SNAP library of datasets collected by Stanford University
- UCI Machine Learning Repository
This section lists selected data resources on various topics, all of which can be accessed through a website. Obtaining the data for an analysis may be as simple as downloading a CSV file, or it may require parsing HTML with pandas. If you must resort to scraping the page (make sure you have tried the approaches we discussed in this book first), be sure that you aren't violating the website's terms of use.
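The two simpler routes mentioned above, reading a CSV file and parsing HTML tables with pandas, might look roughly like the following sketch. The sample data and the helper name tables_from_html are made up for illustration. In practice, you would pass a file path or URL to pd.read_csv() instead of an in-memory string. Note that pd.read_html() relies on an HTML parser (such as lxml) being installed.

```python
import io

import pandas as pd


def tables_from_html(html):
    """Parse every <table> element in an HTML string into a DataFrame.

    pd.read_html() returns a list with one DataFrame per table found;
    it needs an HTML parser such as lxml to be installed.
    """
    return pd.read_html(io.StringIO(html))


# Illustrative stand-in for a downloaded CSV file (made-up values)
csv_text = "date,cases\n2020-03-01,5\n2020-03-02,8\n"

# pd.read_csv() accepts a path, a URL, or a file-like object
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df.dtypes)
```

For a page that contains HTML tables, tables_from_html(page_source) would return a list of DataFrames. Selecting the one you need is typically a matter of indexing into that list.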
In addition to the pandas_datareader and stock_analysis packages we discussed in Chapter 7, consult the following:
- Coronavirus (Covid-19) Data in the United States (NYTimes)
- COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University
- COVID-19 pandemic (European Centre for Disease Prevention and Control)
- Open COVID-19 Datasets
For those interested in text-based data or graph data, check out the following resources on social networks:
- Baseball database (practice working with a DB)
- Baseball player statistics
- Basketball player statistics
- Football (American) player statistics
- Football (soccer) statistics
- Hockey player statistics
The following resources vary in topic; be sure to check them out if nothing so far has piqued your interest:
- Amazon reviews data
- Data extracted from Wikipedia
- Google Trends
- Movies from MovieLens
- Yahoo Webscope (reference library of datasets)