Data Science at Scale
Overview
Increasingly, data scientists are expected to know the fundamentals of building web-scale, cloud-based applications. This unit teaches the fundamentals of Spark, the most popular tool used today to build distributed data science applications at scale. You’ll also learn advanced topics in SQL for data scientists. Please review the Unit Plan’s What Will Help section to ensure you’re set up for success in this unit.
What You’ll Learn: Learning Objectives
- Learn the fundamentals of using Spark in Python (via PySpark), including basic concepts such as Resilient Distributed Datasets (RDDs) and algorithms such as MapReduce.
- Use Spark’s MLlib library to scale machine learning applications.
Words to Know: Key Terms & Concepts
- Big Data: Algorithms and technology associated with storing and manipulating datasets that are too large to fit in the memory of a single machine
- Resilient Distributed Datasets (RDDs): Spark's core data structure: a collection of elements partitioned across the machines of a cluster. RDDs are resilient because a lost partition can be rebuilt from its lineage, and they can optionally be persisted to memory or disk for reuse
- Transformations: Operations such as map and filter that produce a new RDD from an existing one; Spark evaluates transformations lazily, only when a result is actually requested
- Commutative Functions: A mathematical function of two arguments whose result does not change when the arguments are swapped. For example, addition is commutative: 1 + 2 produces the same result as 2 + 1.
- Associative Functions: A mathematical function of two arguments for which the grouping of repeated applications does not matter. For example, multiplication is associative: 2 × 3 × 4 can be evaluated as (2 × 3) × 4 or as 2 × (3 × 4) with the same result (see the short sketch after this list).
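Here is a minimal PySpark sketch (assuming a local Spark installation; the numbers are made up) that ties these terms together: an RDD, a transformation, and a reduce with a function that is both commutative and associative, which Spark's reduce requires so that partial results from different partitions can be combined in any order.

```python
# Minimal sketch, assuming PySpark is installed locally; the values are illustrative.
from pyspark import SparkContext

sc = SparkContext("local[*]", "reduce-example")

# An RDD: a collection partitioned across workers (here, local cores)
numbers = sc.parallelize([1, 2, 3, 4, 5])

# A transformation: lazily defines a new RDD, nothing is computed yet
squares = numbers.map(lambda x: x * x)

# reduce() needs a commutative and associative function, because Spark
# combines per-partition results in no guaranteed order.
total = squares.reduce(lambda a, b: a + b)
print(total)  # 55

sc.stop()
```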
What Will Help
- Ensure that you have emailed [email protected] to request Lynda access. You will be taking a Lynda course at the start of this unit.
- If you feel you need a refresher in some of the advanced Python concepts such as lambda functions, list comprehensions and so on, this would be a good time to go back and repeat the appropriate lessons in the Programming Bootup and Data Wrangling chapters.
When you start working with large datasets, the techniques and tools you’ve used so far for data wrangling may not handle that scale. For example, it might be impossible to load your entire dataset into Pandas to look for missing values, because your computer may run out of memory. In this section, you’ll learn some advanced tools and techniques to help you wrangle big data. You may not need them in your second capstone project (depending on the size of your dataset), but they’ll certainly give you a leg up in your interviews and in the workplace.
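As a concrete (and hedged) illustration of the memory problem, one common workaround is to stream a file through Pandas in chunks rather than loading it all at once; the file name below is a placeholder, not part of the course materials.

```python
# Sketch: count missing values per column without loading the whole file.
import pandas as pd

missing = None
for chunk in pd.read_csv("data.csv", chunksize=100_000):  # placeholder file
    counts = chunk.isna().sum()                            # per-column NaN counts for this chunk
    missing = counts if missing is None else missing + counts

print(missing)  # missing values per column across the entire file
```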
It's time to review your SQL basics. Please feel free to go back to the Data Wrangling Unit to review SQL using the Mode Analytics tutorials. In case you'd like something different, you can also do the following DataCamp resources:
- Intro to SQL for Data Science: https://www.datacamp.com/courses/intro-to-sql-for-data-science
- Joining Data in PostgreSQL: https://www.datacamp.com/courses/joining-data-in-postgresql
Students typically spend 2 - 3 Hours
This course from LinkedIn Learning begins with a brief overview of SQL, and then covers the five major topics a data scientist should understand when working with relational databases: basic statistics in SQL, data preparation in SQL, advanced filtering and data aggregation, window functions, and preparing data for use with analytics tools.
Please email [email protected] to request access to this LinkedIn Learning course.
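If you want a quick feel for one of these topics before starting the course, here is a minimal, self-contained sketch of a window function using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer; the table and values are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("alice", 10), ("alice", 20), ("bob", 5)])

# A window function: a running total per customer, computed without collapsing rows
rows = con.execute("""
    SELECT customer, amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY amount) AS running_total
    FROM orders
    ORDER BY customer, amount
""").fetchall()
print(rows)  # [('alice', 10.0, 10.0), ('alice', 20.0, 30.0), ('bob', 5.0, 5.0)]
```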
Students typically spend 1.5 - 2 Hours
This tutorial looks inside Pandas to help you understand how DataFrames work when building, indexing, and grouping tables. You'll learn how to write fast, efficient code, and how to scale up to bigger problems with libraries like Dask.
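If Dask is new to you, the core idea can be sketched in a few lines (this assumes dask[dataframe] is installed; the file pattern and column names are placeholders, not from the tutorial):

```python
# Dask mirrors much of the Pandas API but evaluates lazily and in parallel.
import dask.dataframe as dd

df = dd.read_csv("flights-*.csv")                          # one logical frame over many files
delayed_mean = df.groupby("origin")["dep_delay"].mean()    # lazy: builds a task graph
print(delayed_mean.compute())                              # .compute() triggers execution
```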
Spark and PySpark
Spark, a computing framework developed at Berkeley’s AMPLab by Matei Zaharia, has become one of the most prominent frameworks for running large-scale data analyses on computing clusters. In this section, you’ll learn Spark and Python tools and libraries that will give you a running start.
1 Interactive Exercises: Introduction to PySpark
Students typically spend 4 - 6 Hours
PySpark is the Python package that makes the magic happen when using Spark from Python. In this resource, you'll use this package to work with data about flights from Portland and Seattle. You'll learn to wrangle data and build a whole machine learning pipeline to predict whether flights will be late. Get ready to put some Spark in your Python code and dive into the world of high-performance machine learning!
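To give a flavour of the kind of code you will write there, here is a hedged sketch of DataFrame wrangling in PySpark; the file name and column names (origin, dep_delay) are assumptions, not the course's actual dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flights-wrangling").getOrCreate()

# Placeholder file; the exercises provide their own flight data
flights = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Transformations are lazy; Spark builds a plan and runs it when .show() is called
(flights
    .filter(F.col("origin") == "PDX")
    .groupBy("origin")
    .agg(F.avg("dep_delay").alias("avg_dep_delay"))
    .show())
```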
2 Video: Introduction to Spark with Python - Orlando Karam
Students typically spend 3 - 5 Hours
In this PyCon 2015 tutorial, Orlando Karam covers the basics of writing Spark programs in Python (initially from the PySpark shell, later with independent applications). He also discusses some of the theory behind Spark, and some performance considerations when using Spark in a cluster.
Note: The code and slide deck for this tutorial are available in this GitHub repository.
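As a taste of the "independent application" part of the tutorial, here is a minimal standalone word-count script (the input path is a placeholder); this is a sketch in the spirit of the talk, not Karam's actual code.

```python
# wordcount.py -- run with: spark-submit wordcount.py
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="wordcount")
    counts = (sc.textFile("input.txt")                    # placeholder input path
                .flatMap(lambda line: line.split())        # split lines into words
                .map(lambda word: (word, 1))               # pair each word with a count of 1
                .reduceByKey(lambda a, b: a + b))          # sum counts per word
    for word, count in counts.take(10):
        print(word, count)
    sc.stop()
```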
3 Video: Introduction to Machine Learning on Apache Spark MLlib (Cloudera) Students typically spend 1 - 2 Hours
Juliet Hougland, a Senior Data Scientist at Cloudera, presents this tutorial on Spark MLlib, a library for performing machine learning and associated tasks on massive datasets. With MLlib, fitting a machine-learning model to a billion observations can take only a few lines of code and leverage hundreds of machines. This talk demonstrates how to use Spark MLlib to fit an ML model that can predict which customers of a telecommunications company are likely to stop using their service.
Note: The code for this talk is available in this GitHub repository.
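For a rough idea of what "a few lines of code" looks like, here is a hedged MLlib sketch of a churn classifier; the file name and column names (total_day_minutes, customer_service_calls, churned) are assumptions for illustration, not the dataset used in the talk.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn").getOrCreate()

# Placeholder file and schema; cast the label to double for the classifier
customers = spark.read.csv("churn.csv", header=True, inferSchema=True)
customers = customers.withColumn("churned", customers["churned"].cast("double"))

assembler = VectorAssembler(
    inputCols=["total_day_minutes", "customer_service_calls"],  # assumed feature columns
    outputCol="features")
lr = LogisticRegression(labelCol="churned", featuresCol="features")

model = Pipeline(stages=[assembler, lr]).fit(customers)
model.transform(customers).select("churned", "prediction").show(5)
```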