
05 Chapter: Data Wrangling

Mikiko Bazeley edited this page Dec 19, 2019 · 1 revision

Data Wrangling

Overview

Before performing analysis and running algorithms, there is upfront work to do in the form of data wrangling. Data scientists collect, clean, and transform messy data into usable data, and they combine datasets with other sources. These are critical skills to learn: your analysis, visualizations, and algorithms will only be as good and accurate as your raw data. Pandas, the standard data-wrangling tool in Python, will be your key tool in this unit.

Unit Plan (What you’ll learn, Words to know, What will help)

Work to Submit:

SQL case study

JSON based mini-project

API based mini-project

Report on data wrangling techniques used on Capstone Project


Unit Plan: Data Wrangling

What You’ll Learn: Learning Objectives

  • Become proficient at data manipulation using Pandas and other Python packages as needed

  • Work with missing or invalid values

  • Extract and manipulate data in formats such as XML and JSON

  • Work with SQL-based databases and write basic SQL queries, up to basic aggregations and joins

Words to Know: Key Terms & Concepts

  • Raw data: data from original or secondary sources that may be unstructured or corrupted and needs more work before it can be analyzed.

  • Data wrangling: the process of taking data in its ‘raw’ form and manipulating it in various ways into a ‘useful’ form.

  • ‘Messy’ or ‘dirty’ data: data containing values that are invalid, missing, corrupted, inconsistent, or non-uniform.

What Will Help

Data manipulation and the various activities in this unit are challenging. Be sure to work with your student advisor to discuss strategies for success as you work through this unit.


Ch 5.1 Data Wrangling with Pandas

Pandas is the standard tool for data scientists in Python to clean data. In this unit, we'll learn several Pandas techniques to manipulate and clean data, such as dealing with missing values and corrupt values.
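As a taste of what's ahead, here is a minimal sketch of handling missing and invalid values with Pandas. The data and column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with a missing age, a missing city,
# and -1 used as an invalid "unknown age" placeholder
df = pd.DataFrame({
    "age": [25, np.nan, 31, -1],
    "city": ["NYC", "LA", None, "SF"],
})

# Treat the invalid placeholder as missing, then handle the gaps
df["age"] = df["age"].replace(-1, np.nan)
df["age"] = df["age"].fillna(df["age"].median())  # impute with the median (28.0)
df = df.dropna(subset=["city"])                   # drop rows with no city
```

After these steps the DataFrame has three rows and no missing values; you'll learn when to impute versus drop later in the unit.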

Resources:

Relevant Course Notes:


Working with Data in Files

Data sources vary from unstructured or semi-structured text files (.txt) to delimited, structured, or nested format files (Excel, CSV, JSON, XML). When working with data stored in files, the basic operation is to read the file into a Pandas DataFrame. Most formats use standard row-and-column tables and are relatively easy to work with; JSON and XML are nested formats and need some more work.
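For example (with hypothetical inline data standing in for files), a flat CSV reads straight into a table, while nested JSON needs an extra flattening step:

```python
import json
from io import StringIO

import pandas as pd

# Flat, delimited data: one read call yields a tidy row/column table
flat = pd.read_csv(StringIO("name,score\nAda,90\nGrace,95\n"))

# Nested JSON: flatten the inner objects into dotted columns first
records = json.loads('[{"name": "Ada", "stats": {"score": 90, "rank": 2}}]')
nested = pd.json_normalize(records)  # columns: name, stats.score, stats.rank
```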

1 Interactive Exercises: Python Data Science Toolbox (Part 2)
Students typically spend 4 - 6 Hours

In this DataCamp resource, you'll continue to build your Python data science skills. First, you'll enter the wonderful world of iterators, objects that you’ve already encountered in the context of loops. You’ll also learn about list comprehensions, which are handy tools that form a basic component in the toolboxes of all modern data scientists working in Python. You'll end the course by working through a case study in which you'll apply all of the techniques you’ve learned.
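For reference, both ideas fit in a few lines:

```python
# An iterator steps through values one at a time via next()
nums = iter([1, 2, 3])
first = next(nums)   # 1
second = next(nums)  # 2

# A list comprehension builds a list in a single expression
squares = [x ** 2 for x in range(5)]
evens = [x for x in range(10) if x % 2 == 0]
```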

2 Interactive Exercises: Importing Data in Python (Part 1)
Students typically spend 3 - 5 Hours

As a data scientist, you’ll need to clean, wrangle and munge, visualize, build predictive models, and interpret data. Before doing any of these, however, you’ll need to know how to get data into Python.

In this DataCamp resource, you'll learn three ways to import data into Python:

  • from flat files such as .txts and .csvs

  • from files native to other software, such as Excel spreadsheets and Stata, SAS, and MATLAB files

  • from relational databases such as SQLite and PostgreSQL
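A sketch of the three routes using Pandas (inline CSV text and an in-memory SQLite database stand in for real files; the Excel/Stata/SAS readers are shown as comments since they need files on disk):

```python
import sqlite3
from io import StringIO

import pandas as pd

# 1) Flat files: pd.read_csv / pd.read_table
flat = pd.read_csv(StringIO("id,value\n1,10\n2,20\n"))

# 2) Files native to other software use analogous readers, e.g.:
#    pd.read_excel("data.xlsx"); pd.read_stata("data.dta"); pd.read_sas("data.sas7bdat")

# 3) Relational databases: send SQL through a DB-API connection
conn = sqlite3.connect(":memory:")
flat.to_sql("measurements", conn, index=False)
db = pd.read_sql_query("SELECT id, value FROM measurements WHERE value > 10", conn)
conn.close()
```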

3 Interactive Exercises: Importing Data in Python (Part 2)
Students typically spend 2 - 3 Hours

In this DataCamp resource, you'll extend your knowledge of data import in Python by learning to import data from the web and to pull JSON data from Application Programming Interfaces (APIs), such as the Twitter streaming API, which lets you stream real-time tweets.
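The pattern is always the same: request a URL, parse the JSON response, and pick out the fields you need. A minimal sketch with a canned payload standing in for the API response (the endpoint shown in the comment is hypothetical):

```python
import json

# A real call would use an HTTP client such as the third-party requests library:
#   response = requests.get("https://api.example.com/v1/users?limit=2")
#   payload = response.json()

# Canned payload standing in for the API's JSON response
payload = json.loads('{"users": [{"name": "Ada"}, {"name": "Grace"}]}')
names = [user["name"] for user in payload["users"]]
```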


Ch 5.3 Working with Data in Databases
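As a quick orientation for the kind of SQL this unit targets (basic aggregations and joins), here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 30.0), (3, 2, 15.0);
""")

# A basic join plus aggregation: total spend per customer
rows = cur.execute("""
    SELECT c.name, SUM(o.total) AS spend
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY spend DESC
""").fetchall()
conn.close()
```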


Ch 5.4 Collecting Data From the Internet

Interacting with APIs

Web scraping is a way to create datasets where none are easily available. In this tutorial, you’ll learn how to create a dataset from a crowdfunding website. We recommend that you go through this tutorial only if your capstone project involves web scraping. If it doesn’t, feel free to skip or return to it later.
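The core of any scraper is parsing fetched HTML and pulling out the elements you care about. A dependency-free sketch using the standard library's html.parser (real scrapers commonly use a third-party parser such as BeautifulSoup; the HTML snippet and class name here are hypothetical):

```python
from html.parser import HTMLParser

# A tiny HTML snippet standing in for a downloaded page (a real scraper
# would fetch it first, e.g. with requests or urllib)
html = '<ul><li class="project">Solar Lamp</li><li class="project">Tiny Press</li></ul>'

class ProjectExtractor(HTMLParser):
    """Collect the text of <li class="project"> elements."""

    def __init__(self):
        super().__init__()
        self.in_project = False
        self.projects = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "project") in attrs:
            self.in_project = True

    def handle_data(self, data):
        if self.in_project:
            self.projects.append(data)
            self.in_project = False

parser = ProjectExtractor()
parser.feed(html)
```

After `feed()`, `parser.projects` holds the scraped names, ready to load into a DataFrame.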

Additional Resources:

Additional resources for data wrangling

Relevant Course Notes
