Name	Name	Last commit message	Last commit date
parent directory ..
assets	assets
code	code
README.md	README.md

title

duration

creator

What is Data Science?

2:50

name	city
K. Nathaniel Tucker	SF

Welcome to Data Science

DS | Lesson 1

LEARNING OBJECTIVES

After this lesson, you will be able to:

Describe the roles and components of a successful learning environment
Define data science and the data science workflow
Apply the data science workflow to meet your classmates
Setup your development environment and review python basics

STUDENT PRE-WORK

Before this lesson, you should already be able to:

Define basic data types used in object-oriented programming
Recall the Python syntax for lists, dictionaries, and functions
Create files and navigate directories using the command line interface (for your specific environment)

LESSON GUIDE

TIMING	TYPE	TOPIC
20 min	Opening	Welcome to GA
20 min	Introduction	What is Data Science
10 min	Quiz	Data Science Quiz
25 min	Introduction	Data Science Workflow
25 min	Guided Practice	Workflow Application
65 min	Demo	Data Science Dev Environment
5 min	Conclusion	Review

Welcome to GA! (20 mins)

Instructor Note: Use the "Day 1 deck" from your local production team

GA is a special learning environment

Introduce the instructors, TAs, Producers
GA is a global community of individuals empowered to pursue the work we love.
GA Resources- discounts, community events, hub, office hours
GA feedback loop- exit tickets, mid-course feedback, final feedback

Road to Success

Emotional cycle of change
Student learning responsibility
GA graduation requirements
After GA- build network, find opportunities, community, perks
Q/A

Introduction: What is Data Science (20 mins)

A set of tools and techniques used to extract useful information from data
A interdisciplinary, problem-solving oriented subject
Application of scientific techniques to practical problems

Who uses Data Science

Netflix - movie recommendations
Amazon's algorithm - "you might also like x"
Five Thirty Eight - election and sports coverage
Draft Kings - using data science to predict daily bets
Google - auto-translate and search results
- Ask students if they know of any other examples

What are the roles in Data Science?

Roles:

Data Developer
Data Researcher
Data Creative
Data Businessperson

Skills:

Business
ML/Big Data
Math
Programming
Stats

Break down of skills by role:

Quiz: Data Science Baseline (10 Min)

Instructor Note: This quiz is intended as a helpful gauge of your students' background knowledge in data science related topics. It asks them questions on topics they haven't learned yet to estimate their prior knowledge and give you a chance to tailor materials accordingly (and correct misconceptions, etc). You are welcome to substitute or modify this quiz further as you see fit.

Quiz

True or False: Gender (coded: male= 0 female= 1) is a continuous variable
Draw a normal distribution.
True or False: Linear regression is an unsupervised learning algorithm.
What is a hypothesis test?

Instructor Note: Discuss results

Introduction: The Data Science Work Flow (25 mins)

Overview of Steps:

Throughout the class and for the our projects we will be following a general workflow. This workflow will help you produce reliable and reproducible results.

Reliable: Accurate findings
Reproducible: Others can follow your steps and get the same results.

Data Science Workflow Steps:

Identify
Acquire
Parse
Mine
Refine
Build
Present

Project 1: Futurama Example

IDENTIFY: Understand the problem:

Using Planet Express customer data from January 3001-3005, determine how likely previous customers are to request a repeat delivery using demographic information (profession, company size, location) and previous delivery data (days since last delivery, number of total deliveries)

Identify business/product objectives:
- Are previous customers are to request a repeat delivery?
Identify and hypothesize goals and criteria for success:
- What factors are likely to influence a customer's decision to be reuse Planet Express for Delivery?
Create a set of questions to help you identify the correct data set.

ACQUIRE: Obtain the data

Ideal data vs. data that is available Often times we start by identifying the ideal data we would want for a project.

During the data acquisition phase, we'll learn about the limitations on the types of data that are available. We have to decide if these limitations will inhibit our ability to answer our question of interest or if we can work with what we have to find a reasonable and reliable answer.

Data for this example:

demographic information (profession, company size, location)
previous delivery data (days since last delivery, number of total deliveries)

Questions we may ask include:

Identifying the “right” data set(s)
Is there enough data?
Does it appropriately align with the question/problem statement?
Can the dataset be trusted? How was it collected?
Is this dataset aggregated? Can we use the aggregation or do we need to get it pre-aggregation?
Assess resources, requirements, assumptions, and constraints
Import data from the web (Google Analytics, HTML, XML)
Import data from a file (CSV, XML, TXT, JSON)
Import data from a preexisting database (SQL)
Set up local or remote data structure
Determine most appropriate tools to work with data
Tool follows the format, size of the dataset

PARSE: Understand the data

Many times we are given secondary data, or data that was collected previously. In these cases, we have to learn as much as possible about our data using tools like data dictionaries and source documentation to determine how the data was gathered.

Example data dictionary:

Variable	Description	Type of Variable
Profession	Title of the account owner	Categorical
Company Size	1- small, 2- medium, 3- large	Categorical
Location	Planet of the company	Categorical
Days Since Last Delivery	Integer	Continuous
Number of Deliveries	Integer	Continuous

Common questions include:

Read any documentation provided with the data (e.g. data dictionary above)
Perform exploratory surface analysis via filtering, sorting, and simple visualizations
Describe data structure and the information being collected
Explore variables, data types via select
Assess preliminary outliers, trends
Verify the quality of the data (feedback loop -> 1)

MINE: Prepare, structure, and clean the data

Often times, our data will need to be cleaned prior performing an analysis.

Common steps include:

Sample the data, determine sampling methodology
Iterate and explore outliers, null values via select
Intro qualitative vs quantitative data
Format and clean data in Python (dates, number signs, formatting)
Define how to appropriately address missing values (cleaning)
Categorization, manipulation, slicing, format, integrate data
Format and combining different data points, separate columns, etc.
Determine most appropriate aggregations, cleaning, etc. methods
Create necessary derived columns from the data (new data)

REFINE: Exploratory data analysis

As an example of basic statistics, you might check the Mean (STD) or specific frequency counts.

Variable	Mean (STD) or Frequency (%)
Number of Deliveries	50.0 (10)
Earth	50 (10%)
Amphibios 9	100 (20%)
Bogad	100 (20%)
Colgate 8	100 (20%)
Other	150 (30%)

These descriptive stats allow us to:

Identify trends and outliers
Decide how to deal with outliers - excluding, filtering, and communication
Apply descriptive and inferential statistics
Determine initial visualization techniques
Document and capture knowledge
Choose visualization techniques for different data types
Transform data

BUILD: Create a data model

We select a model based on the outcome we are interested in or the assumptions of the model we are using. An example of a model statement might look like this:

We completed a logistic regression using Statsmodels v. XX. We calculated the probability of a customer placing another order with Planet Express.

Here, we are using a logistic model because we are determine the probability that a customer may place a return order, which at its heart is a classification problem.

The steps for model building are:

Select appropriate model
Build model
Evaluate and refine model
Predict outcomes, action items

PRESENT: Communicate the results of your analysis

Presentations are a critical part of your analysis. It doesn't matter how brilliant your model is or how illuminating your findings are, if you are not able to effectively communicate your results then they will not be used.

The most basic form of a data science presentation should include a simple sentence that describes your results:

"Customers from large companies had twice (CI 1.9, 2.1) the odds of of placing another order with Planet Express compared to customers from small companies."

Data science presentations can also be far more complex and exciting, like some of the research presented by Nate Silver's 538 blog.

When creating a presentation, always consider your audience and make sure to practice your presentation beforehand. Consider the types of questions people might have or - better yet - test your presentation on a few people and pay attention to their response. Clarify and refine your presentation accordingly.

Make sure to consider your needs and goals as well as those of your audience. A presentation created for your fellow data scientists will be vastly different than a presentation intended for some executives who are trying to make a business decision.

Key factors of a good presentation include:

Summarize findings with narrative and storytelling techniques
Refine your visualizations for broader comprehension
Present both limitations and assumptions
Determine the integrity of your analysis
Consider the degree of disclosure for various stakeholders
Test and evaluate the effectiveness of your presentation beforehand

A Note About Iteration

Iteration is an important part of every step in the Data Science Workflow. At any given point in the process, you may find yourself repeating or going back and re-doing elements in order to better understand your data, clarify your model, and refine your presentation.

For example, after presenting your findings, you may want to:

Identify follow-up problems and questions for future analysis
Create a visually effective summary or report
Consider the needs of different stakeholders and how your report might be changed for them
Identify the limitations of your analysis
Identify relationships between visualizations

Practice: Data Science Work Flow (25 mins)

Use three of the steps from the Data Science Workflow (identify, acquire, present) to get to know your classmates!

IDENTIFY: Understand the problem

Have each group develop 1 research question that they would like to know about the class and make a hypothesis.

Examples:

What is your current favorite tool for working with data?
What are you most excited about learning?
What can you help your classmates with when it comes to data analysis?

ACQUIRE: Obtain the data

Rotate through the groups to "collect the data" and record the raw data on white boards.

PRESENT: Communicate the results of your analysis

Summarize findings in a narrative
Provide a basic visualization for broader comprehension on white board
Have one student present for the group

Demo: Dev Environment Setup (65 min)

Brief intro to the tools we will use as data scientists
Workshop to help with environment set up
IPython Notebook to test dataset and complete Python Review

Conclusion (5 mins)

By now, you should be able to answer the following questions with ease:

What is data science?
What is the data science workflow?
How can you have a successful learning experience at GA?

BEFORE NEXT CLASS


UPCOMING PROJECTS	Project 1 Instructions

ADDITIONAL RESOURCES

MIT 6.034

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lesson-01

lesson-01

README.md