vkoul/Data-Science-Resources

Data Science Resource List 📋

Learning new things has become more accessible thanks to the plethora of material available online. This is particularly true for Data Science and Machine Learning. Since I got interested in the field, I have come across a huge amount of learning material that I found immensely useful. This is an attempt to put it together and make it accessible to others.
There are many wonderful resources that professors have put up online, and this is an attempt to catalogue them. Prakhar has done something similar on GitHub for Software Engineering, so the list below covers resources pertaining to Data Science, with a focus on the R language. I plan to add more Python material going forward. Hope you find this list useful.

Made with ❤️ by Vikesh. Say Hi! 👋

Content

Data Science/Statistics Books 📚 Cheatsheets 🔑 Courses 💻

Data Science/Statistics Books 📚

Statistics Books 📖
Machine Learning Books 📖
DataViz Books 📖
R in Other Fields 📖
R Tool Books 📖
Other R resources

Cheatsheets 🔑

Courses 🏫 💻

R Studio Online Tutorials

Programming with R Software Carpentry Foundation

Courses taught by Hadley Wickham H. Wickham

Statistics courses offered in Harvard Harvard University

PROB 140 Probability for Data Science UC Berkeley 📝 📖 💻

  • Prob 140 (formally Statistics 140 or STAT 140) is a probability course for undergraduates who have taken Data 8, have a math background, and wish to go deeper into the theory of data science. The emphasis on simulation and the bootstrap in Data 8 gives students a concrete sense of randomness and sampling variability. Prob 140 will capitalize on this. Because of the students’ backgrounds, Prob 140 will move swiftly over basics, avoid approximations that are unnecessary when SciPy is at hand, and replace some of the routine calculus by symbolic math done in SymPy. This will create time to focus on the more demanding concepts that are part of the theoretical foundations of data science.

  • Syllabus

  • Textbook

  • Lectures/Slides

  • Assignments

CS 109 Probability for Computer Scientists Stanford University 📝 📖 💻

  • The class starts by providing a fundamental grounding in combinatorics, and then quickly moves into the basics of probability theory. We will then cover many essential concepts in probability theory, including particular probability distributions, properties of probabilities, and mathematical tools for analyzing probabilities. Finally, the last third of the class will focus on data analysis and Machine Learning as a means for seeing direct applications of probability in this exciting and quickly growing subfield of computer science.

  • Syllabus

  • Textbook

  • Lectures/Slides

  • Assignments

DS 101 Data Science 101 Stanford University 📝 📖 💻

  • The course provides a solid introduction to data science, both exposing students to computational tools they can proficiently use to analyze data and exploring the conceptual challenges of inferential reasoning. Each module/week represents a new “data adventure,” analyzing real datasets, exploring different questions and trying out tools.

  • Syllabus

  • Lectures/Slides

  • Assignments

CME/STATS 195 Introduction to R Stanford University 📝 📖 💻

  • The goal of this short course is to familiarize students with R’s tools for scientific computing. Class lectures will have interactive elements, and assignments will be application-driven. Topics covered include basic data structures, file I/O, control structures, functions, visualizations, and packages for statistical analysis.

  • Syllabus

  • Lectures/Slides

  • Assignments

  • Final Project

Stat 48N Riding the data wave Stanford University 📝 📖 💻

  • How can we make sense of all the information we are acquiring about ourselves? During each week, we will consider a different data set to be summarized with a different goal. We will review analyses of similar problems carried out in the past and explore if and how the same tools can be useful today. We will pay attention to contemporary media (newspapers, blogs, etc.) to identify settings similar to the ones we are examining and critique the displays and summaries documented there.

  • Syllabus

  • Lectures/Slides

  • Assignments

MS&E 226 Small Data Stanford University 📝 📖 💻

  • This course is about understanding “small data”: these are datasets that allow interaction, visualization, exploration, and analysis on a local machine. The material provides an introduction to applied data analysis, with an emphasis on providing a conceptual framework for thinking about data from both statistical and machine learning perspectives. Topics will be drawn from the following list, depending on time constraints and class interest: approaches to data analysis: statistics (frequentist, Bayesian) and machine learning; binary classification; regression; bootstrapping; causal inference and experimental design; multiple hypothesis testing.

  • Syllabus

  • Lectures/Slides

  • Datasets

DS100 Principles and Techniques of Data Science UC Berkeley 📝 📖 💻

  • Combining data, computation, and inferential thinking, data science is redefining how people and organizations solve challenging problems and understand their world. This intermediate-level class bridges between Data 8 and upper-division computer science and statistics courses as well as methods courses in other fields.

  • Syllabus

  • Material

  • Assignments

Stats 200 Introduction to Statistical Inference Stanford University 📝 📖 💻

  • The class will introduce the students to formal statistical reasoning. Building on knowledge of probability and calculus, we will explore how limited noisy observations can be used to learn general characteristics of a population. We will study the basics of decision theory, including frequentist and Bayesian solutions to the "paradox of induction."

  • Syllabus

  • Lectures/Slides

  • Assignments

INFO 201A Technical Foundations of Informatics University of Washington 📝 📖

  • This course introduces fundamental tools and technologies necessary to transform data into knowledge. We'll cover skills associated with each component of the information lifecycle, including the collection, storage, analysis, and visualization of data. Core competencies underlying this process, including functional programming, use of databases, data wrangling, version control, and command line proficiency, are acquired through real-world data-driven assignments.

  • Lectures/Slides

  • Assignments

STAT 405 Introduction to Data Analysis (using R, 2012) Rice University 📝 📖 💻

  • This course will teach you to be a data analyst. You will learn how to take a large dataset, break it up into manageable pieces, and use a range of qualitative and quantitative tools to summarise it and learn what it has to tell. You will learn the importance of scepticism and curiosity, and how to communicate your findings. Each section of the course is motivated by a particular dataset, and you will gain experience working with a wide variety of data sources varying in size and quality.

  • Syllabus

  • Lectures/Slides

  • Assignments

STAT 385 Statistics Programming Methods UIUC 📝 📖

MY472 Data for Data Scientists LSE 📝 📖

  • This course will cover the principles of digital methods for storing and structuring data, including data types, relational and non-relational database design, and query languages. Students will learn to build, populate, manipulate and query databases based on datasets relevant to their fields of interest.

  • Syllabus

  • Lectures/Slides

  • Assignments

STAT 149 Generalized Linear Models Harvard University 📝 📖

  • An introduction to methods for analyzing categorical data. Emphasis will be on understanding models and applying them to datasets. Topics include visualizing categorical data, analysis of contingency tables, odds ratios, log-linear models, generalized linear models, logistic regression, Poisson regression and model diagnostics. Examples are drawn from many fields, including biology, medicine and the social sciences.

  • Syllabus

  • Lectures/Slides

  • Assignments

DSO 530 Applied Modern Statistical Learning Techniques Univ. of Southern California 📝 📖 💻

  • This course aims to go far beyond the classical statistical methods, such as linear regression, that are introduced in GSBA 524.

  • Syllabus

  • Lectures/Slides

    • The course follows ISLR and provides a succinct summary of the book in the slides.
  • Assignments

  • Videos

STAT 320 Design and Analysis of Causal Studies Duke University 📝 📖 💻

  • Presents an overview of methods for estimating causal effects: how to answer the question of “What is the effect of A on B?” Includes discussion of randomized designs, but with more emphasis on alternative designs and methods for when randomization is infeasible: matching methods, propensity scores, longitudinal treatments, regression discontinuity, instrumental variables, and principal stratification. Methods are motivated by examples from social sciences, policy and health sciences.

  • Syllabus

  • Lectures/Slides

  • Assignments

  • Webpage of Dr. Kari Lock Morgan for other course links

Statistics 585X Data Technologies for Statistical Analysis Iowa State University 📝 📖 💻

  • Not all data lives in nice, clean spreadsheets, and not all data fits in a computer’s main memory. As statisticians we cannot always rely on other people and sciences to get the data into formats that we can deal with: we will discuss aspects of statistical computing as they are relevant for data analysis. Read and work with data in different formats: flat files, databases, web technologies. Elements of literate programming help us with making our workflow transparent and analyses reproducible. We will discuss communication of results in the form of R packages and interactive web applications.

  • Syllabus

  • Lectures/Slides

  • Assignments

  • Final Project

STATS 202 Data Mining and Analysis (using R) Stanford University 📝 📖 💻

  • Stats 202 is an introduction to Data Mining. Students will:

    • Understand the distinction between supervised and unsupervised learning and be able to identify appropriate tools to answer different research questions.
    • Become familiar with basic unsupervised procedures including clustering and principal components analysis.
    • Become familiar with the following regression and classification algorithms: linear regression, ridge regression, the lasso, logistic regression, linear discriminant analysis, K-nearest neighbors, splines, generalized additive models, tree-based methods, and support vector machines.
    • Gain a practical appreciation of the bias-variance tradeoff and apply model selection methods based on cross-validation and bootstrapping to a prediction challenge.
    • Analyze a real dataset of moderate size using R.
    • Develop the computational skills for data wrangling, collaboration, and reproducible research.
    • Be exposed to other topics in machine learning, such as missing data, prediction using time series and relational data, non-linear dimensionality reduction techniques, web-based data visualizations, anomaly detection, and representation learning.

  • Syllabus

  • Lectures/Slides

  • Assignments

  • Final Project- Kaggle

STATS 203 Introduction to Regression Models and Analysis of Variance Stanford University 📝 📖 💻

6.S085 Statistics for Research Projects MIT 📝 📖 💻

  • This class is a practical introduction to statistical modeling and experimental design, intended to provide essential skills for doing research. We'll cover basic techniques (e.g., hypothesis testing and regression models) for both traditional experiments and newer paradigms such as evaluating simulations. Students with research projects will be encouraged to share their experiences and project-specific questions.

  • Syllabus

  • Lectures/Slides

  • Assignments

  • Case Study

Statistics 36-350 Statistical Computing: Spring 2018 Carnegie Mellon University 📝 📖 💻

  • Computational data analysis is an essential part of modern statistics. Competent statisticians must not just be able to run existing programs, but to understand the principles on which they work. They must also be able to read, modify, and write code, so that they can assemble the computational tools needed to solve their data analysis problems, rather than distorting problems to fit tools provided by others. This class is an introduction to statistically-oriented programming, targeted at statistics majors, without assuming extensive programming background.

  • Syllabus

  • Lectures/Slides

  • Assignments

Statistics 231 Statistical Learning Theory Stanford University 📝 📖 💻

  • Uncover common statistical principles underlying a diverse array of machine learning techniques.

    • Linear algebra
    • Probability
    • Optimization
  • Syllabus

  • Lectures/Slides

  • Assignments

Sta 323 Statistical Programming (2018) Duke University 📝 📖 💻

STATS 401 Applied Statistical Methods II University of Michigan 📝 📖 💻

  • An intermediate course in applied statistics, covering a range of topics in modeling and analysis of data including: review of simple linear regression, two-sample problems, one-way analysis of variance; multiple linear regression, diagnostics and model selection; two-way analysis of variance, multiple comparisons, and other selected topics.

  • Lectures/Slides

  • Assignments

  • Lab Material

Stats 531 Analysis of Time Series University of Michigan 📝 📖 💻

  • This course gives an introduction to time series analysis using time domain methods and frequency domain methods. The goal is to acquire the theoretical and computational skills required to investigate data collected as a time series. The first half of the course will develop classical time series methodology, including auto-regressive moving average (ARMA) models, regression with ARMA errors, and estimation of the spectral density.

  • Lectures/Slides

  • Assignments

  • Projects

AGRON 590RD Data Stewardship for Earth Systems Scientists Iowa State University 📝 📖 💻

  • Learn how to clearly organize, track, and communicate data-based work, collect and house data through analysis and publication, collaborate in a reproducible way, model data structures and wrangle data, and complete the entire research cycle in a responsible way.

  • Syllabus

  • Lectures/Slides

  • Assignments

MPA 635 Data Visualization Brigham Young University 📝 📖 💻

  • Become (1) literate in data and graphic design principles, (2) an ethical data communicator, and (3) a collaborative sharer by producing beautiful, powerful, and clear visualizations of your own data.

  • Syllabus

  • Lectures/Slides

  • Assignments

CME 252 Introduction to Optimization Stanford University 📝 📖 💻

  • This course introduces mathematical optimization and modeling, with a focus on convex optimization. Topics include: varieties of mathematical optimization, convexity of functions and sets, convex optimization modeling with CVXPY, gradient descent and basic distributed optimization, in-depth examples from machine learning, statistics and other fields and applications of bi-convexity and non-convex gradient descent.

  • Lectures/Slides

  • Assignments

CSC 321 Intro to Neural Networks and Machine Learning University of Toronto 📝 📖 💻

  • This course gives an overview of both the foundational ideas and the recent advances in neural net algorithms. Roughly the first 2/3 of the course focuses on supervised learning -- training the network to produce a specified behavior when one has lots of labeled examples of that behavior. The last 1/3 focuses on unsupervised learning and reinforcement learning.

  • Lectures/Slides

  • Assignments

EECS 349 Machine Learning - Spring 2018 Northwestern University 📝 📖 💻

STAT 365/665 Data Mining and Machine Learning (uses R) Yale University 📝 📖 💻

TJ-ML TJHSST Machine Learning Thomas Jefferson High School 📝 📖 💻

  • TJHSST Machine Learning Club aims to bring the complex and vast topic of machine learning to high school students. We teach a variety of topics, including SVMs, Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks, and more.

Note: A great initiative, and from high school students at that. @Mihir Patel

SIGIL Statistical Analysis of Corpus Data with R Potsdam University 📝 📖 💻

CIS 419/519 Applied Machine Learning - Spring 2018 UPenn Engineering 📝 📖 💻

  • This course will introduce some of the key machine learning methods that have proved valuable and successful in practical applications. We will discuss some of the foundational questions in machine learning in order to get a good understanding of the basic issues in this area, and present the main paradigms and techniques needed to obtain successful performance in application areas such as natural language and text understanding, speech recognition, computer vision, data mining, adaptive computer systems and others. The main body of the course will review several supervised and (semi/un)supervised learning approaches. These include methods for learning linear representations, decision-tree methods, Bayesian methods, kernel-based methods and neural network methods, as well as clustering, dimensionality reduction and reinforcement learning methods.