This repository contains training materials focused on fundamental principles of data science. The content is primarily designed for use in training workshops during a two-week bootcamp.
The tutorials in this repository rely on various Python libraries that may not be pre-installed on your system. To ensure you have all the necessary libraries, run the following command in the main directory: pip install -r requirements.txt. This command will install all the dependencies listed in the requirements.txt file.
In this lesson, we focus on ensuring that your data processing steps for large-scale projects are reproducible. We'll cover strategies to avoid common pitfalls in large dataset handling and emphasize the importance of reproducibility in data science workflows.
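As a rough illustration of what a reproducible processing step can look like, the sketch below draws a seeded sample and fingerprints the result so that two runs can be compared. The helper names `sample_rows` and `fingerprint` and the toy rows are hypothetical and are not taken from the lesson. It uses only the Python standard library.

```python
import hashlib
import json
import random


def sample_rows(rows, k, seed):
    """Draw k rows with a private, seeded generator so the draw is repeatable."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(rows, k)


def fingerprint(rows):
    """Return a SHA-256 hex digest of the rows, serialized deterministically."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Toy rows for illustration only
rows = [{"id": i, "value": i * 0.5} for i in range(1000)]

first = sample_rows(rows, k=10, seed=42)
second = sample_rows(rows, k=10, seed=42)

# Same seed and same input give the same sample, hence the same digest
print(fingerprint(first) == fingerprint(second))
```

Recording the seed and the digest alongside the output is one simple way to check later that a rerun produced the same intermediate data.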
This lesson introduces basic functions for exploring datasets. We'll delve into simple statistical analyses and visualize correlations to gain insights into our data. The goal is to equip you with tools for preliminary data analysis and understanding.
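A minimal sketch of this kind of preliminary analysis, assuming two small made-up columns (`rainfall` and `runoff` are illustrative values, not data from the lesson). It uses only the standard library. The lesson itself may rely on other libraries listed in requirements.txt.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up values for illustration only
rainfall = [1.0, 2.0, 3.0, 4.0, 5.0]
runoff = [2.1, 3.9, 6.2, 7.8, 10.1]

# Basic summary statistics for one column
print("mean:", statistics.mean(runoff))
print("median:", statistics.median(runoff))
print("stdev:", statistics.stdev(runoff))

# Strength of the linear relationship between the two columns
print("r:", round(pearson(rainfall, runoff), 3))
```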
Here, we explore methods to examine the probability distributions of your data. Understanding data distributions is crucial for selecting appropriate statistical models and for data preprocessing.
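One simple way to examine a distribution is to fit a candidate model and compare its quantiles with the empirical ones. The sketch below does this for a normal distribution on synthetic data. The sample, its parameters, and the choice of a normal model are assumptions made for illustration, not the lesson's dataset or method. It requires Python 3.8 or later for `statistics.NormalDist` and `statistics.quantiles`.

```python
import random
import statistics
from statistics import NormalDist

rng = random.Random(0)  # seeded so the synthetic sample is repeatable
sample = [rng.gauss(10.0, 2.0) for _ in range(5000)]  # synthetic, not real measurements

# Fit a normal distribution from the sample mean and standard deviation
fitted = NormalDist.from_samples(sample)

# Compare empirical quartiles with those of the fitted distribution
empirical = statistics.quantiles(sample, n=4)
theoretical = [fitted.inv_cdf(p) for p in (0.25, 0.5, 0.75)]

for e, t in zip(empirical, theoretical):
    print(f"empirical={e:.2f}  fitted={t:.2f}")
```

Large gaps between the two columns would suggest that the normal model is a poor fit and that a transformation or a different distribution is worth considering.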
The final lesson covers various aspects of regression analysis. We start with Least Squares Regression, then move to more specific applications like the USGS regression equations for streamflow recurrence. We also explore Random Forest Regression for a CONUS-wide streamflow recurrence model, demonstrating a practical application of machine learning techniques in hydrological studies.
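For orientation, here is a minimal sketch of ordinary least squares for a single predictor, written with closed-form formulas and the standard library only. It does not reproduce the USGS regression equations or the Random Forest model mentioned above. The function names `fit_line` and `r_squared` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up points for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.1, 7.9, 10.2]

slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"R^2={r_squared(xs, ys, slope, intercept):.4f}")
```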
By following these lessons, you'll gain a solid foundation in key data science concepts and techniques, preparing you for more advanced topics and applications.
We welcome contributions and ideas for new modules! If you have suggestions for improvements, additional content, or ideas for entirely new modules, please share them with us.
- Submit Ideas for New Modules or Improvements: If you have an idea for a new module or suggestions for improving existing content, please open an issue in this GitHub repository with a detailed description of your proposal.
- Contribute Directly: If you're interested in directly contributing to the development of new modules or enhancements, please fork this repository, make your changes, and submit a pull request with your contributions.
Your insights and contributions play a significant role in improving and expanding this repository for all learners.
We look forward to working with you to make this resource more useful for everyone interested in data science!