As stated in the course description:
Over the semester, students will build a complex end-to-end data system.
You'll be building a live dashboard, with all the infrastructure behind it:
- Automated data ingestion
- A database
- Web-based interactive data visualization
All of this will be in the cloud.
Examples of the kind of dashboard you'll build:
- Centers for Disease Control and Prevention (CDC) dashboards
- Chicago Region Transit Dashboard
- Chicago Transit Authority Historical Bus Crowding
- Colorado Behavioral Health Administration (BHA) Performance Hub
- Congestion Pricing Tracker
- Johns Hopkins COVID map
- New York Flu Tracker
- New York Traffic Data Viewer (TDV)
- NYPD TrafficStat
- TransitMatters
- United States of Health
Requirements:
- All code is peer-reviewed, through pair programming and/or pull requests.
- All team members contribute equal amounts of work.
- The Project leverages at least one dataset that's regularly updated.
- The site doesn't necessarily need to read like a blog post, but it should explain what's going on.
- More to come
Your group will pick an initial:
- Problem space
- Dataset
Part of this project is getting experience with automated data ingestion. Doing so is more interesting with data that changes regularly. You can incorporate additional datasets in the future.
Do the following as a group:
- Discuss what you'd like your project to focus on. You don't need to get too specific yet.
- Explore datasets that are updated weekly (the more often, the better) and pick one.
- Create a new notebook in Google Colab.
- Ensure you can load the data (see the sketch after this list).
- Narrow down to 1-3 research questions.
- In other words, at the end of this project, what do you want to be able to show?
- Draw an example visualization that you'd like to produce.
- You can do so digitally or on a piece of paper.
- Include a title, legend, and axes labels (where appropriate).
- This is just a sketch; don't worry about the specific values.
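For the loading check, here is a minimal sketch of what the first notebook cell might look like, assuming your dataset is available as a CSV at some URL. The URL below is a placeholder; substitute the endpoint for the dataset your group picked.

```python
# Minimal Colab check that the dataset loads.
import pandas as pd

DATA_URL = "https://example.com/your-dataset.csv"  # placeholder URL

df = pd.read_csv(DATA_URL)

# Quick sanity checks: shape, column names, and a preview of the rows.
print(df.shape)
print(df.columns.tolist())
df.head()
```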
You will then submit the following to the Discussion on Ed:
- What dataset are you going to use?
- Please include a link.
- What are your research question(s)?
- Each should be specific, and objectively answerable through the data available.
- What's the link to your notebook?
- Go to Share -> General access -> LionMail -> Commenter.
- What's your target visualization?
- Include a picture.
- What are your known unknowns?
- What challenges do you anticipate?
Only one person from your group needs to submit. None of this is set in stone; it's just a starting place, and it can all be changed later.
Goal: Get experience with an application development framework
- Using your dataset from Part 1:
- Create a Streamlit app.
- Deploy the app.
- Add a visualization.
- You can get fancy, but you don't have to at this stage; get something simple working first.
- Bring in a second relevant dataset. (This one doesn't need to be regularly updated.)
- This can be shown on a separate page of your Streamlit app, or combined in a single visualization.
- Add the names of the people on your team to your Streamlit app homepage.
- Turn in the link to your live app via CourseWorks.
- You can load data:
  - from a URL (preferred), either:
    - An API
    - A link to a CSV
  - from a file, checked into the repository
- Note the Streamlit app resource limits.
- At this stage, feel free to make the dataset small to get it working.
- If the app is slow to reload, experiment with caching (a minimal sketch follows this list).
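Putting these pieces together, here is a minimal sketch of what the app's main script might look like, assuming a CSV-at-a-URL dataset. The URL, file name, and chart choice are placeholders; adapt them to your data.

```python
# app.py — a minimal Streamlit app: load a CSV from a URL, cache the
# download, and render a simple chart.
import pandas as pd
import streamlit as st

DATA_URL = "https://example.com/your-dataset.csv"  # placeholder URL


@st.cache_data  # cache the result so reruns don't re-download the file
def load_data(url: str) -> pd.DataFrame:
    return pd.read_csv(url)


st.title("Project Dashboard")
st.write("Team: <your names here>")

df = load_data(DATA_URL)
st.dataframe(df.head())  # quick preview of the raw data
st.line_chart(df.select_dtypes("number"))  # naive chart over numeric columns
```

If the second dataset goes on its own page, note that Streamlit builds a multipage app automatically from files placed in a `pages/` directory next to the main script.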
Goal: Get experience with unit testing
Work on branches and submit pull requests for the chunks of work — you decide what the "chunks" are.
- Without writing any code:
- Review your existing code.
- What can be refactored into functions?
- Where can you make your code DRY (Don't Repeat Yourself)?
- Decide what function you're going to create.
- Come up with test cases (inputs) and expected outputs.
- This can be in a text file, doc, piece of paper, etc.
- Then, as code:
- Write tests (see the sketch after this list).
- Confirm they fail.
- Refactor your code into the function.
- Make the tests pass.
- Repeat until you feel your code is well-organized and well-tested.
- Submit the links to the pull requests via CourseWorks.
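As one illustration of the tests-first loop, suppose you extract a data-cleaning helper into a `utils.py` module. The function name, column names, and module layout below are hypothetical; use whatever you actually refactor out of your app. The test file is written first:

```python
# test_utils.py — written first, so it fails until the refactor lands.
import pandas as pd

from utils import clean_ridership  # hypothetical helper


def test_drops_rows_with_missing_counts():
    raw = pd.DataFrame({"stop": ["A", "B"], "riders": [10, None]})
    assert len(clean_ridership(raw)) == 1


def test_counts_become_integers():
    raw = pd.DataFrame({"stop": ["A"], "riders": [10.0]})
    assert clean_ridership(raw)["riders"].dtype == "int64"
```

Then the refactored function makes the tests pass:

```python
# utils.py — hypothetical helper extracted during the refactor.
import pandas as pd


def clean_ridership(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing rider counts and coerce counts to integers."""
    out = df.dropna(subset=["riders"]).copy()
    out["riders"] = out["riders"].astype("int64")
    return out
```

Running `pytest` before the refactor should show both tests failing; after extracting the function, both should pass.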
As a result, your:
- "main" scripts (for Streamlit pages or otherwise)
- Functions
should be relatively short and easy to read.
This isn't a one-time thing; keep testing and refactoring as you continue with the Project.
You will hold a team retrospective, with the goal of improving how your team works together. Since the groups are small, it can be fairly informal.
- Schedule 45 minutes for the retro.
- The retro needs to happen live (synchronously), not asynchronously.
- Read about retros.
- Decide who will be the Facilitator.
- Optional: Get someone from outside the team.
- Facilitator: Set up EasyRetro. Instructions.
- In the actual retro:
- Read the Agile Prime Directive out loud.
- 5 minutes: Individually write down "what went well" and "what could be better".
- 10-15 minutes: Discuss what has gone well.
- 20-25 minutes: Discuss what could be better.
- 5 minutes: Document takeaways / action items.
- Move your Proposal to the Streamlit app as is.
- Revisit the Proposal.
- Any new insights?
- Anything you want to adjust?
- Document any changes to the Proposal on the Streamlit page.
- Proceed with the analysis.
  - If the majority of your code (to call APIs, etc.) is in modules/functions, it can be `import`ed from a Jupyter notebook (a sketch follows this list). You can do exploratory analysis there, moving things to modules/Streamlit as you go.
  - You might not be able to fully answer the question(s) yet, but get as close as you can.
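For example, if the loading/cleaning logic lives in a module like the hypothetical `utils.py` from the testing step, a notebook cell can reuse it directly instead of copy/pasting code:

```python
# In a notebook cell: reuse the module code the Streamlit app also uses.
import pandas as pd

from utils import clean_ridership  # hypothetical module from the testing step

raw = pd.read_csv("https://example.com/your-dataset.csv")  # placeholder URL
df = clean_ridership(raw)
df.describe()  # exploratory analysis stays in the notebook
```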
At this point, your project should be looking more like one of the examples. Looking through the Streamlit data elements may be helpful.
Submit links to:
- The EasyRetro board
- Jupyter notebook(s), if any
- The (updated) Streamlit app
Goal: Understand how to work with a cloud-based database
- A service account has been created for you in your Google Cloud project. It has been given read-only access to BigQuery.
- There are various things that can go wrong in these steps. Don't wait until the last minute.
- Install `pandas-gbq`.
- Load data:
- Create a Python script that:
- Creates the table, if it doesn't exist
- Pulls data from your regularly-updated data source
- Loads it incrementally (see the sketch below).
- Since you'll be running the script locally, authenticate with a user account.
- How to write tables with pandas-gbq
- How will you know it worked as intended?
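Here is a minimal sketch of such a script, using `pandas-gbq`. The project ID, table name, and source URL are placeholders. Run locally with no credentials passed, `pandas-gbq` will prompt you to authenticate with your user account.

```python
# load_to_bigquery.py — sketch of the incremental load, run locally.
import pandas as pd
import pandas_gbq

PROJECT_ID = "your-gcp-project"       # placeholder
TABLE_ID = "dashboard.ridership_raw"  # placeholder dataset.table


def fetch_new_rows() -> pd.DataFrame:
    # Placeholder: pull the latest rows from your regularly-updated source.
    return pd.read_csv("https://example.com/your-dataset.csv")


df = fetch_new_rows()

# if_exists="append" creates the table on the first run and appends new
# rows on later runs, which gives you the incremental load. (If the source
# re-serves old rows, filter out ones you've already loaded first.)
pandas_gbq.to_gbq(df, TABLE_ID, project_id=PROJECT_ID, if_exists="append")

# One way to check it worked as intended: count the rows now in the table.
count = pandas_gbq.read_gbq(
    f"SELECT COUNT(*) AS n FROM `{TABLE_ID}`", project_id=PROJECT_ID
)
print(count)
```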
- Have your app use BigQuery.
  - Create a service account key as JSON. The service account is `streamlit@[project].iam.gserviceaccount.com`.
  - Set up secrets management locally.
    - Make sure to add `secrets.toml` to your `.gitignore` so that you don't accidentally commit it to Git.
  - Copy the key information to your `secrets.toml` file.
  - Modify your app to read data from BigQuery (a sketch follows this list).
  - Copy the secrets to your deployed app.
  - Re-deploy.
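A minimal sketch of the app-side changes, assuming the key was copied into `.streamlit/secrets.toml` under a `[gcp_service_account]` section; the section name, query, and table are placeholders.

```python
# Reading from BigQuery inside the Streamlit app, using the service
# account key stored in secrets. Assumed secrets.toml layout:
#
#   [gcp_service_account]
#   type = "service_account"
#   project_id = "your-gcp-project"
#   private_key = "..."
#   client_email = "streamlit@your-gcp-project.iam.gserviceaccount.com"
#   ...
import pandas_gbq
import streamlit as st
from google.oauth2 import service_account

# Build credentials from the secrets rather than a key file on disk.
credentials = service_account.Credentials.from_service_account_info(
    dict(st.secrets["gcp_service_account"])
)


@st.cache_data(ttl=600)  # cache query results for 10 minutes
def run_query(sql: str):
    return pandas_gbq.read_gbq(
        sql, project_id=credentials.project_id, credentials=credentials
    )


df = run_query("SELECT * FROM `dashboard.ridership_raw` LIMIT 1000")  # placeholder table
st.dataframe(df)
```

Copy the same `[gcp_service_account]` block into the deployed app's secrets settings before re-deploying.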
- Submit the links via CourseWorks for:
- The pull request(s)
- Your live Streamlit app