Open-ended assignments

In Homework 1, Homework 4, and the Final Project, you will pick your own dataset(s). For each:

Use at least one dataset that you aren't familiar with.
- Using data from a primary source is preferred.
It should have between one thousand and one million rows.
- If it's larger than that, you can make it smaller.
Finding a dataset available in CSV or JSON is recommended, though pandas can read other formats.
It's ok if you pick the same dataset as another student, as long as you're following the Academic Integrity rules. {%- if id == "columbia" %}
If you'd be interested in working with SIPA alumni employment data, reach out to the instructor. {%- endif %}
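
As a rough way to check the size guideline above once a dataset is loaded, a minimal sketch follows. It assumes the bounds are inclusive; the `row_count_ok` helper and the generated stand-in data are hypothetical, not part of the assignment.

```python
import io

import pandas as pd

# Bounds from the guideline above; treating them as inclusive is an assumption.
MIN_ROWS = 1_000
MAX_ROWS = 1_000_000


def row_count_ok(df: pd.DataFrame) -> bool:
    """Return True if the DataFrame has between MIN_ROWS and MAX_ROWS rows."""
    return MIN_ROWS <= len(df) <= MAX_ROWS


# Stand-in data so the sketch runs anywhere.
# With a real file, this would be something like pd.read_csv("MY_FILENAME.csv").
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1500))
df = pd.read_csv(io.StringIO(csv_text))

print(len(df), row_count_ok(df))
```

For a JSON file, `pd.read_json()` would take the place of `pd.read_csv()`.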

Open data portals

There are countless places to get data, notably:

{% if id == "columbia" -%}

Columbia Data Platform {% else -%}
NYU Libraries Data Sources {%- endif %}
Local:
- NYC Open Data
  - Scout can be used to find datasets with certain columns
- BetaNYC
U.S. Federal:
United Nations
World Bank
World Health Organization (WHO)
HealthData.gov
The Humanitarian Data Exchange
Economic Policy Institute
Kaggle
Google Dataset Search
Black Wealth Data
DataHub
Lists of open data portals:
- DataPortals
- Open Data Network

Inspiration

For starters, see the Final Project examples from past semesters.

It probably isn't realistic to make visualizations as fancy as these, which were made by professionals, but they may give you ideas. Some also include links to, or downloads of, the source data.

Climate & Economic Justice Screening Tool
FiveThirtyEight Interactives
The Guardian Visual Journalism
Information is Beautiful Awards
New York Times Graphics
Our World in Data
ProPublica News Apps
The Pudding
Statista
Visual Capitalist

Storing data

{% if id == "columbia" -%} To work with uploaded files in {{coding_env_name}}, you have two options.

Direct upload

Fewer steps, but your file(s) will disappear when your session ends.

$Steps to get data into {{coding_env_name}} directly$

In the {{coding_env_name}} sidebar, click the Files icon (A).
Click the upload button (B).
Select your file.
You'll use read_csv("MY_FILENAME.csv") in your code.

Google Drive

More steps, but your file(s) are preserved between sessions.

$Steps to get data into {{coding_env_name}} via Drive$

Upload the file(s) somewhere in Drive.
In the {{coding_env_name}} sidebar, click the Files icon (A).
Click the Mount Drive icon (B).
- You may need to run the code it injects to authorize it (C).
- Think of this as attaching your Drive to your {{coding_env_name}} instance, as if you were plugging in a USB flash drive.
Navigate to the file (D).
- You may need to click into content, then drive.
Next to the filename, click the three dots.
Click Copy path (E).
- The value should be something like /content/drive/My Drive/....
Use this path with read_csv() (F).

{% else -%}

Open the {{coding_env_name}} file browser.
Navigate to the folder your notebook is in.
Upload the data.
From Python, use read_csv("./<filename>.csv").

Note that the file path should be relative to the notebook within {{coding_env_name}}; ./ means "in the same directory". {% endif %}{{coding_env_name}} cannot access files on your local machine; in other words, the path shouldn't start with C:\\ or anything like that. More info about file paths.
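
One way to catch a path that points at your own computer before passing it to `read_csv()` is sketched below. The `has_drive_letter` helper and the example paths are hypothetical, and the check is only a heuristic for Windows-style paths.

```python
from pathlib import PurePosixPath, PureWindowsPath


def has_drive_letter(path_str: str) -> bool:
    """Heuristic: True if the path starts with a Windows drive such as C:."""
    return PureWindowsPath(path_str).drive != ""


# "./data.csv" and "data.csv" name the same file in the current directory.
same_file = PurePosixPath("./data.csv") == PurePosixPath("data.csv")
print("same file:", same_file)

# Hypothetical example paths; only the last one points at a local machine.
candidates = [
    "./data.csv",
    "/content/drive/My Drive/data.csv",
    r"C:\Users\me\Downloads\data.csv",
]
for candidate in candidates:
    print(candidate, "| local drive letter:", has_drive_letter(candidate))
```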

{% if id == "nyu" -%}

Limits

{{coding_env_name}} has a disk storage limit of 1GB (a.k.a. 1,024 MB or 1,048,576 KB) across all your files, and a memory limit of 3GB. {%- endif %}
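
To gauge how large a dataset is, one option is to compare the file's size on disk with the DataFrame's size in memory, since the two can differ. The sketch below uses generated stand-in data and a temporary file; the `to_mb` helper is hypothetical.

```python
import os
import tempfile

import pandas as pd


def to_mb(num_bytes: int) -> float:
    """Convert a byte count to megabytes (1 MB = 1,024 KB = 1,048,576 bytes)."""
    return num_bytes / (1024 * 1024)


# Stand-in data; with a real dataset, df would come from pd.read_csv(...).
df = pd.DataFrame({"id": range(10_000), "name": ["example"] * 10_000})

# Write the data to a temporary CSV to measure its size on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    df.to_csv(csv_path, index=False)
    disk_bytes = os.path.getsize(csv_path)

# deep=True also counts the contents of string columns.
memory_bytes = int(df.memory_usage(deep=True).sum())

print(f"On disk:   {to_mb(disk_bytes):.2f} MB")
print(f"In memory: {to_mb(memory_bytes):.2f} MB")
```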

Reducing data size

You can make data smaller before uploading by filtering it through:

The data portal, if it supports it
- This makes the download faster, because it includes only the data you need.
- Instructions for Socrata-based portals
The $limit parameter (or equivalent), if using an API
- Socrata documentation
In a spreadsheet program
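
For the API option, a minimal sketch of building a Socrata-style request URL with `$limit` follows. The `socrata_csv_url` helper and the dataset ID `abcd-1234` are hypothetical placeholders; check the portal's own API documentation for the actual endpoint of the dataset you pick.

```python
from urllib.parse import urlencode


def socrata_csv_url(domain: str, dataset_id: str, limit: int) -> str:
    """Build a CSV endpoint URL that asks for at most `limit` rows."""
    # safe="$" keeps the dollar sign in "$limit" from being percent-encoded.
    query = urlencode({"$limit": limit}, safe="$")
    return f"https://{domain}/resource/{dataset_id}.csv?{query}"


# "abcd-1234" is a placeholder, not a real dataset ID.
url = socrata_csv_url("data.cityofnewyork.us", "abcd-1234", 5000)
print(url)
```

A URL like this can be passed to `pd.read_csv()` in place of a filename. When reading a file you already have, the `nrows` parameter of `pd.read_csv()` limits the number of rows loaded in a similar way.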
