
05 Chapter: Data Wrangling

Mikiko Bazeley edited this page Dec 19, 2019 · 1 revision

Data Wrangling

Overview

Before performing analysis and running algorithms, there is upfront work to do in the form of data wrangling. Data scientists collect, clean, and transform messy data into usable data, and they combine datasets with other sources. These are critical skills to learn: your analysis, visualizations, and algorithms will only be as good and accurate as your raw data. Pandas, the standard data-wrangling tool in Python, will be your key tool in this unit.

Unit Plan (What you’ll learn, Words to know, What will help)

Work to Submit:

SQL case study

JSON based mini-project

API based mini-project

Report on data wrangling techniques used on Capstone Project


Unit Plan: Data Wrangling

What You’ll Learn: Learning Objectives

  • Become proficient at data manipulation using Pandas and other Python packages as needed

  • Work with missing or invalid values

  • Extract and manipulate data in formats such as XML and JSON

  • Work with SQL-based databases and write basic SQL queries, up to basic aggregations and joins

Words to Know: Key Terms & Concepts

  • Raw data: data from original or secondary sources that may be unstructured or corrupted and needs more work before it can be analyzed.

  • Data wrangling: the process of taking data in its ‘raw’ form and manipulating it in various ways into a ‘useful’ form.

  • ‘Messy’ or ‘dirty’ data: data containing values that are invalid, missing, corrupted, inconsistent, or non-uniform.

What Will Help

Data manipulation and the various activities in this unit are challenging. Be sure to work with your student advisor to discuss strategies for success as you work through this unit.


Ch 5.1 Data Wrangling with Pandas

Pandas is the standard tool for data scientists in Python to clean data. In this unit, we'll learn several Pandas techniques to manipulate and clean data, such as dealing with missing values and corrupt values.
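As a taste of what's ahead, here is a minimal sketch of handling missing and invalid values with Pandas. The data and column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with a missing age, a missing city,
# and -1 used as an invalid "unknown age" placeholder
df = pd.DataFrame({
    "age": [25, np.nan, 31, -1],
    "city": ["NYC", "LA", None, "SF"],
})

# Treat the invalid placeholder as missing, then handle the gaps
df["age"] = df["age"].replace(-1, np.nan)
df["age"] = df["age"].fillna(df["age"].median())  # impute with the median (28.0)
df = df.dropna(subset=["city"])                   # drop rows with no city
```

After these steps the DataFrame has three rows and no missing values; you'll learn when to impute versus drop later in the unit.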

Resources:

Relevant Course Notes:


Working with Data in Files

Data sources vary from unstructured or semi-structured text files (.txt) to delimited, structured, or nested format files (Excel, CSV, JSON, XML). When working with data stored in files, the basic operation is to read the file into a Pandas DataFrame. Most formats use standard row-and-column tables and are relatively easy to work with; JSON and XML are nested formats and need some more work.
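For example (with hypothetical inline data standing in for files), a flat CSV reads straight into a table, while nested JSON needs an extra flattening step:

```python
import json
from io import StringIO

import pandas as pd

# Flat, delimited data: one read call yields a tidy row/column table
flat = pd.read_csv(StringIO("name,score\nAda,90\nGrace,95\n"))

# Nested JSON: flatten the inner objects into dotted columns first
records = json.loads('[{"name": "Ada", "stats": {"score": 90, "rank": 2}}]')
nested = pd.json_normalize(records)  # columns: name, stats.score, stats.rank
```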

1 Interactive Exercises: Python Data Science Toolbox (Part 2)
Students typically spend 4 - 6 Hours

In this DataCamp resource, you'll continue to build your Python data science skills. First, you'll enter the wonderful world of iterators, objects that you’ve already encountered in the context of loops. You’ll also learn about list comprehensions, which are handy tools that form a basic component in the toolboxes of all modern data scientists working in Python. You'll end the course by working through a case study in which you'll apply all of the techniques you’ve learned.
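For reference, both ideas fit in a few lines:

```python
# An iterator steps through values one at a time via next()
nums = iter([1, 2, 3])
first = next(nums)   # 1
second = next(nums)  # 2

# A list comprehension builds a list in a single expression
squares = [x ** 2 for x in range(5)]
evens = [x for x in range(10) if x % 2 == 0]
```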

2 Interactive Exercises: Importing Data in Python (Part 1)
Students typically spend 3 - 5 Hours

As a data scientist, you’ll need to clean, wrangle and munge, visualize, build predictive models, and interpret data. Before doing any of these, however, you’ll need to know how to get data into Python.

In this DataCamp resource, you'll learn three ways to import data into Python:

  • from flat files such as .txts and .csvs

  • from files native to other software, such as Excel spreadsheets and Stata, SAS, and MATLAB files

  • from relational databases such as SQLite and PostgreSQL
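A sketch of the three routes using Pandas (inline CSV text and an in-memory SQLite database stand in for real files; the Excel/Stata/SAS readers are shown as comments since they need files on disk):

```python
import sqlite3
from io import StringIO

import pandas as pd

# 1) Flat files: pd.read_csv / pd.read_table
flat = pd.read_csv(StringIO("id,value\n1,10\n2,20\n"))

# 2) Files native to other software use analogous readers, e.g.:
#    pd.read_excel("data.xlsx"); pd.read_stata("data.dta"); pd.read_sas("data.sas7bdat")

# 3) Relational databases: send SQL through a DB-API connection
conn = sqlite3.connect(":memory:")
flat.to_sql("measurements", conn, index=False)
db = pd.read_sql_query("SELECT id, value FROM measurements WHERE value > 10", conn)
conn.close()
```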

3 Interactive Exercises: Importing Data in Python (Part 2)
Students typically spend 2 - 3 Hours

In this DataCamp resource, you'll extend your knowledge of data import in Python by learning to import data from the web and to pull JSON data from Application Programming Interfaces (APIs), such as the Twitter streaming API, which lets you stream real-time tweets.
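The pattern is always the same: request a URL, parse the JSON response, and pick out the fields you need. A minimal sketch with a canned payload standing in for the API response (the endpoint shown in the comment is hypothetical):

```python
import json

# A real call would use an HTTP client such as the third-party requests library:
#   response = requests.get("https://api.example.com/v1/users?limit=2")
#   payload = response.json()

# Canned payload standing in for the API's JSON response
payload = json.loads('{"users": [{"name": "Ada"}, {"name": "Grace"}]}')
names = [user["name"] for user in payload["users"]]
```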


Ch 5.3 Working with Data in Databases
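As a quick orientation for the kind of SQL this unit targets (basic aggregations and joins), here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 30.0), (3, 2, 15.0);
""")

# A basic join plus aggregation: total spend per customer
rows = cur.execute("""
    SELECT c.name, SUM(o.total) AS spend
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY spend DESC
""").fetchall()
conn.close()
```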


Ch 5.4 Collecting Data From the Internet

Interacting with APIs

Web scraping is a way to create datasets where none are easily available. In this tutorial, you’ll learn how to create a dataset from a crowdfunding website. We recommend that you go through this tutorial only if your capstone project involves web scraping. If it doesn’t, feel free to skip or return to it later.
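The core of any scraper is parsing fetched HTML and pulling out the elements you care about. A dependency-free sketch using the standard library's html.parser (real scrapers commonly use a third-party parser such as BeautifulSoup; the HTML snippet and class name here are hypothetical):

```python
from html.parser import HTMLParser

# A tiny HTML snippet standing in for a downloaded page (a real scraper
# would fetch it first, e.g. with requests or urllib)
html = '<ul><li class="project">Solar Lamp</li><li class="project">Tiny Press</li></ul>'

class ProjectExtractor(HTMLParser):
    """Collect the text of <li class="project"> elements."""

    def __init__(self):
        super().__init__()
        self.in_project = False
        self.projects = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "project") in attrs:
            self.in_project = True

    def handle_data(self, data):
        if self.in_project:
            self.projects.append(data)
            self.in_project = False

parser = ProjectExtractor()
parser.feed(html)
```

After `feed()`, `parser.projects` holds the scraped names, ready to load into a DataFrame.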

Additional Resources:

Additional resources for data wrangling

Relevant Course Notes
