In Homework 1 and the Final Project, you will pick your own dataset(s).
- Use at least one dataset that you aren't familiar with.
- Using data from a primary source is preferred.
- It should have between one thousand and one million rows.
- If it's larger than that, you can make it smaller.
- Finding a dataset available in CSV or JSON is recommended, though pandas can read other formats.
- It's ok if you pick the same dataset as another student, as long as you're following the Academic Integrity rules. {%- if id == "columbia" %}
- If you'd be interested in working with SIPA alumni employment data, reach out to the instructor. {%- endif %}
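
As a rough illustration of the format and size guidelines above, the sketch below loads CSV and JSON data with pandas and checks the row count. The in-memory sample data, the column names, and the `size_verdict` helper are made up for this example; with a real dataset you would pass a filename or URL to `read_csv()` or `read_json()` instead.

```python
import io

import pandas as pd

MIN_ROWS = 1_000
MAX_ROWS = 1_000_000


def size_verdict(df):
    """Say whether a DataFrame's row count fits the suggested range."""
    n = len(df)
    if n < MIN_ROWS:
        return "too small"
    if n > MAX_ROWS:
        return "too large"
    return "ok"


# Tiny made-up stand-ins for a downloaded file.
csv_text = "id,value\n1,10\n2,20\n3,30\n"
json_text = '[{"id": 1, "value": 10}, {"id": 2, "value": 20}]'

csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(len(csv_df), size_verdict(csv_df))
print(len(json_df), size_verdict(json_df))
```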
There are countless places to get data, notably:
{% if id == "columbia" -%}
- Columbia Data Platform {% else -%}
- NYU Libraries Data Sources {%- endif %}
- Local:
- NYC Open Data
- Scout can be used to find datasets with certain columns
- BetaNYC
- U.S. Federal:
- United Nations
- World Bank
- World Health Organization (WHO)
- HealthData.gov
- Economic Policy Institute
- Kaggle
- Google Dataset Search
- Black Wealth Data
- DataHub
- Lists of open data portals:
For starters, see the Final Project examples from past semesters.
It's probably not realistic to make visualizations as fancy as the ones below, which were made by professionals, but they may give you ideas. Some also include links to or downloads of the source data.
- Climate & Economic Justice Screening Tool
- FiveThirtyEight Interactives
- The Guardian Visual Journalism
- Information is Beautiful Awards
- New York Times Graphics
- Our World in Data
- ProPublica News Apps
- The Pudding
- Statista
- Visual Capitalist
{% if id == "columbia" -%} To work with uploaded files in {{coding_env_name}}, you have two options.
Fewer steps, but your file(s) will disappear when your session ends.
- In the {{coding_env_name}} sidebar, click the Files icon (A).
- Click the upload button (B).
- Select your file.
- You'll use `read_csv("MY_FILENAME.csv")` in your code.
More steps, but your file(s) are preserved between sessions.
- Upload the file(s) somewhere in Drive.
- In the {{coding_env_name}} sidebar, click the Files icon (A).
- Click the Mount Drive icon (B).
  - You may need to run the code it injects to authorize it (C).
  - Think of this as attaching your Drive to your {{coding_env_name}} instance, as if you were plugging in a USB flash drive.
- Navigate to the file (D).
  - You may need to click into `content`, then `drive`.
- Next to the filename, click the three dots.
- Click Copy path (E).
  - The value should be something like `/content/drive/My Drive/...`.
- Use this path with `read_csv()` (F).
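
A small sketch of the last step, assuming Drive is mounted under `/content/drive`. The folder, the filename, and the `describe` helper are hypothetical; paste in the path you copied instead. Checking the path first can make it easier to tell a missing mount from a mistyped filename.

```python
from pathlib import Path

# Hypothetical path, as it might look after "Copy path"; replace with your own.
data_path = Path("/content/drive/My Drive/data/MY_FILENAME.csv")


def describe(path):
    """Explain what is wrong with a path, or confirm the file is there."""
    if path.is_file():
        return "found"
    if not path.parent.exists():
        return f"folder missing (is Drive mounted?): {path.parent}"
    return f"no file named {path.name} in {path.parent}"


print(describe(data_path))

# Once this prints "found", load the file with pandas:
#   import pandas as pd
#   df = pd.read_csv(data_path)
```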
{% else -%}
- Open the {{coding_env_name}} file browser.
- Navigate to the folder your notebook is in.
- Upload the data.
- From Python, use `read_csv("./<filename>.csv")`.
Note that the file path should be relative to the notebook within {{coding_env_name}}; `./` means "in the same directory". {% endif %}{{coding_env_name}} cannot access files on your local machine; in other words, the path shouldn't start with C:\\ or anything like that. More info about file paths.
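
To make the difference between relative and local-machine paths concrete, here is a short sketch using Python's standard `pathlib` module. The filenames are hypothetical, and it assumes the notebook's working directory is the folder the notebook lives in, which is typical but not guaranteed.

```python
from pathlib import Path, PureWindowsPath

# A relative path: looked up starting from the current working directory.
relative = Path("./MY_FILENAME.csv")
print(relative.is_absolute())  # False
print(relative.resolve())      # the full path this would actually point to

# A path copied from a Windows machine: absolute, and meaningless on a
# remote server, because that drive only exists on your own computer.
local = PureWindowsPath(r"C:\Users\me\Downloads\MY_FILENAME.csv")
print(local.is_absolute())     # True
```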
{% if id == "nyu" -%}
{{coding_env_name}} has a disk storage limit of 1GB (a.k.a. 1,024 MB or 1,048,576 KB) across all your files, and a memory limit of 3GB. {%- endif %}
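
To find out how much disk space and memory a dataset takes up, something like the following can help. It is only a sketch: the DataFrame and the temporary file are made up for the example, and it uses the same 1,024-based units as above.

```python
import os
import tempfile

import pandas as pd

# Made-up data, written to a temporary file to stand in for an upload.
df = pd.DataFrame({"id": range(5), "name": ["a", "b", "c", "d", "e"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    df.to_csv(path, index=False)
    disk_bytes = os.path.getsize(path)  # size of the file on disk

# deep=True also counts the contents of text columns.
memory_bytes = int(df.memory_usage(deep=True).sum())

print(f"on disk:   {disk_bytes / 1024**2:.4f} MB")
print(f"in memory: {memory_bytes / 1024**2:.4f} MB")
```

The in-memory size is often larger than the file size, so a file that fits on disk may still be too big to load.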
You can make data smaller before uploading by filtering it through:
- The data portal, if it supports it
  - This makes the download faster, since it includes only the data you need.
  - Instructions for Socrata-based portals
- The `$limit` parameter (or equivalent), if using an API
- In a spreadsheet program
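
Two of these approaches can also be sketched in Python. The URL below is a hypothetical placeholder, not a real endpoint, and the large DataFrame is generated for the example. The first part builds a request URL with a `$limit` parameter; the second takes a random sample of rows from data that is already loaded.

```python
from urllib.parse import urlencode

import pandas as pd

# Hypothetical API endpoint; substitute the one from your data portal.
BASE_URL = "https://data.example.org/resource/abcd-1234.csv"

# safe="$" keeps the dollar sign readable instead of encoding it as %24.
query = urlencode({"$limit": 5000}, safe="$")
url = BASE_URL + "?" + query
print(url)
# With a real endpoint, pd.read_csv(url) would fetch at most 5,000 rows.

# Alternatively, shrink data that is already loaded by sampling rows.
big = pd.DataFrame({"id": range(10_000), "value": range(10_000)})
small = big.sample(n=1_000, random_state=0)  # random_state makes it repeatable
print(len(small))
```

Keep in mind that `$limit` returns the first rows the API provides, while `sample()` picks rows at random, so the two can give quite different subsets.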