As stated in the course description:
Over the semester, students will build a complex end-to-end data system.
You'll be building a live dashboard, with all the infrastructure behind it:
- Automated data ingestion
- A database
- Web-based interactive data visualization
All of this will be in the cloud.
Examples of the kind of dashboard you'll build:
- Centers for Disease Control and Prevention (CDC) dashboards
- Chicago Region Transit Dashboard
- Chicago Transit Authority Historical Bus Crowding
- Colorado Behavioral Health Administration (BHA) Performance Hub
- Congestion Pricing Tracker
- Johns Hopkins COVID map
- New York Flu Tracker
- New York Traffic Data Viewer (TDV)
- NYPD TrafficStat
- TransitMatters
- United States of Health
Requirements:
- All code is peer-reviewed, through pair programming and/or pull requests.
- All team members contribute equal amounts of work.
- The Project leverages at least one dataset that's regularly updated.
- The site doesn't necessarily need to read like a blog post, but it should explain what's going on.
- More to come
Your group will pick an initial:
- Problem space
- Dataset
Part of this project is getting experience with automated data ingestion. Doing so is more interesting with data that changes regularly. You can incorporate additional datasets in the future.
Do the following as a group:
- Discuss what you'd like your project to focus on. You don't need to get too specific yet.
- Explore datasets that are updated weekly (the more often, the better) and pick one.
- Create a new notebook in Google Colab.
- Ensure you can load the data (see the sketch after this list).
- Narrow down to 1-3 research questions.
- In other words, at the end of this project, what do you want to be able to show?
- Draw an example visualization that you'd like to produce.
- You can do so digitally or on a piece of paper.
- Include a title, legend, and axes labels (where appropriate).
- This is just a sketch; don't worry about the specific values.
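For the loading check, here is a minimal sketch of what the first notebook cell might look like, assuming your dataset is available as a CSV at some URL. The URL below is a placeholder; substitute the endpoint for the dataset your group picked.

```python
# Minimal Colab check that the dataset loads.
import pandas as pd

DATA_URL = "https://example.com/your-dataset.csv"  # placeholder URL

df = pd.read_csv(DATA_URL)

# Quick sanity checks: shape, column names, and a preview of the rows.
print(df.shape)
print(df.columns.tolist())
df.head()
```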
You will then submit the following to the Discussion on Ed:
- What dataset are you going to use?
- Please include a link.
- What are your research question(s)?
- Each should be specific, and objectively answerable through the data available.
- What's the link to your notebook?
- Go to Share -> General access -> LionMail -> Commenter.
- What's your target visualization?
- Include a picture.
- What are your known unknowns?
- What challenges do you anticipate?
Only one person from your group needs to submit. None of this is set in stone; it's just a starting place, and it can all be changed later.
Goal: Get experience with an application development framework
- Using your dataset from Part 1:
- Create a Streamlit app.
- Deploy the app.
- Add a visualization.
- You can get fancy, but you don't have to at this stage; get something simple working first.
- Bring in a second relevant dataset. (This one doesn't need to be regularly updated.)
- This can be shown on a separate page of your Streamlit app, or combined in a single visualization.
- Add the names of the people on your team to your Streamlit app homepage.
- Turn in the link to your live app via CourseWorks.
- You can load data:
  - from a URL (preferred), either:
    - An API
    - A link to a CSV
  - from a file, checked into the repository
- Note the Streamlit app resource limits.
- At this stage, feel free to make the dataset small to get it working.
- If the app is slow to reload, experiment with caching (a minimal sketch follows this list).
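Putting these pieces together, here is a minimal sketch of what the app's main script might look like, assuming a CSV-at-a-URL dataset. The URL, file name, and chart choice are placeholders; adapt them to your data.

```python
# app.py — a minimal Streamlit app: load a CSV from a URL, cache the
# download, and render a simple chart.
import pandas as pd
import streamlit as st

DATA_URL = "https://example.com/your-dataset.csv"  # placeholder URL


@st.cache_data  # cache the result so reruns don't re-download the file
def load_data(url: str) -> pd.DataFrame:
    return pd.read_csv(url)


st.title("Project Dashboard")
st.write("Team: <your names here>")

df = load_data(DATA_URL)
st.dataframe(df.head())  # quick preview of the raw data
st.line_chart(df.select_dtypes("number"))  # naive chart over numeric columns
```

If the second dataset goes on its own page, note that Streamlit builds a multipage app automatically from files placed in a `pages/` directory next to the main script.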
Goal: Get experience with unit testing
Work on branches and submit pull requests for the chunks of work — you decide what the "chunks" are.
- Without writing any code:
- Review your existing code.
- What can be refactored into functions?
- Where can you make your code DRY (Don't Repeat Yourself)?
- Decide what function you're going to create.
- Come up with test cases (inputs) and expected outputs.
- This can be in a text file, doc, piece of paper, etc.
- Then, as code:
- Write tests (see the sketch after this list).
- Confirm they fail.
- Refactor your code into the function.
- Make the tests pass.
- Repeat until you feel your code is well-organized and well-tested.
- Submit the links to the pull requests via CourseWorks.
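As one illustration of the tests-first loop, suppose you extract a data-cleaning helper into a `utils.py` module. The function name, column names, and module layout below are hypothetical; use whatever you actually refactor out of your app. The test file is written first:

```python
# test_utils.py — written first, so it fails until the refactor lands.
import pandas as pd

from utils import clean_ridership  # hypothetical helper


def test_drops_rows_with_missing_counts():
    raw = pd.DataFrame({"stop": ["A", "B"], "riders": [10, None]})
    assert len(clean_ridership(raw)) == 1


def test_counts_become_integers():
    raw = pd.DataFrame({"stop": ["A"], "riders": [10.0]})
    assert clean_ridership(raw)["riders"].dtype == "int64"
```

Then the refactored function makes the tests pass:

```python
# utils.py — hypothetical helper extracted during the refactor.
import pandas as pd


def clean_ridership(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing rider counts and coerce counts to integers."""
    out = df.dropna(subset=["riders"]).copy()
    out["riders"] = out["riders"].astype("int64")
    return out
```

Running `pytest` before the refactor should show both tests failing; after extracting the function, both should pass.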
As a result, your:
- "main" scripts (for Streamlit pages or otherwise)
- Functions
should be relatively short and easy to read.
This isn't a one-time thing; keep testing and refactoring as you continue with the Project.
You will hold a team retrospective, with the goal of improving how your team works together. Since the groups are small, it can be fairly informal.
- Schedule 45 minutes for the retro.
- The retro needs to happen live (synchronously), not asynchronously.
- Read about retros.
- Decide who will be the Facilitator.
- Optional: Get someone from outside the team.
- Facilitator: Set up EasyRetro. Instructions.
- In the actual retro:
- Read the Agile Prime Directive out loud.
- 5 minutes: Individually write down "what went well" and "what could be better".
- 10-15 minutes: Discuss what has gone well.
- 20-25 minutes: Discuss what could be better.
- 5 minutes: Document takeaways / action items.
- Move your Proposal to the Streamlit app as is.
- Revisit the Proposal.
- Any new insights?
- Anything you want to adjust?
- Document any changes to the Proposal on the Streamlit page.
- Proceed with the analysis.
  - If the majority of your code (to call APIs, etc.) is in modules/functions, it can be `import`ed from a Jupyter notebook (a sketch follows this list). You can do exploratory analysis there, moving things to modules/Streamlit as you go.
  - You might not be able to fully answer the question(s) yet, but get as close as you can.
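For example, if the loading/cleaning logic lives in a module like the hypothetical `utils.py` from the testing step, a notebook cell can reuse it directly instead of copy/pasting code:

```python
# In a notebook cell: reuse the module code the Streamlit app also uses.
import pandas as pd

from utils import clean_ridership  # hypothetical module from the testing step

raw = pd.read_csv("https://example.com/your-dataset.csv")  # placeholder URL
df = clean_ridership(raw)
df.describe()  # exploratory analysis stays in the notebook
```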
At this point, your project should be looking more like one of the examples. Looking through the Streamlit data elements may be helpful.
Submit links to:
- The EasyRetro board
- Jupyter notebook(s), if any
- The (updated) Streamlit app
Goal: Understand how to work with a cloud-based database
- A service account has been created for you in your Google Cloud project. It has been given read-only access to BigQuery.
- There are various things that can go wrong in these steps. Don't wait until the last minute.
- Install `pandas-gbq`.
- Load data:
- Create a Python script that:
- Creates the table, if it doesn't exist
- Pulls data from your regularly-updated data source
- Loads it incrementally (see the sketch below).
- Since you'll be running the script locally, authenticate with a user account.
- How to write tables with pandas-gbq
- How will you know it worked as intended?
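Here is a minimal sketch of such a script, using `pandas-gbq`. The project ID, table name, and source URL are placeholders. Run locally with no credentials passed, `pandas-gbq` will prompt you to authenticate with your user account.

```python
# load_to_bigquery.py — sketch of the incremental load, run locally.
import pandas as pd
import pandas_gbq

PROJECT_ID = "your-gcp-project"       # placeholder
TABLE_ID = "dashboard.ridership_raw"  # placeholder dataset.table


def fetch_new_rows() -> pd.DataFrame:
    # Placeholder: pull the latest rows from your regularly-updated source.
    return pd.read_csv("https://example.com/your-dataset.csv")


df = fetch_new_rows()

# if_exists="append" creates the table on the first run and appends new
# rows on later runs, which gives you the incremental load. (If the source
# re-serves old rows, filter out ones you've already loaded first.)
pandas_gbq.to_gbq(df, TABLE_ID, project_id=PROJECT_ID, if_exists="append")

# One way to check it worked as intended: count the rows now in the table.
count = pandas_gbq.read_gbq(
    f"SELECT COUNT(*) AS n FROM `{TABLE_ID}`", project_id=PROJECT_ID
)
print(count)
```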
- Have your app use BigQuery.
  - Create a service account key as JSON. The service account is `streamlit@[project].iam.gserviceaccount.com`.
  - Set up secrets management locally.
    - Make sure to add `secrets.toml` to your `.gitignore` so that you don't accidentally commit it to Git.
  - Copy the key information to your `secrets.toml` file.
  - Modify your app to read data from BigQuery (a sketch follows this list).
  - Copy the secrets to your deployed app.
  - Re-deploy.
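A minimal sketch of the app-side changes, assuming the key was copied into `.streamlit/secrets.toml` under a `[gcp_service_account]` section; the section name, query, and table are placeholders.

```python
# Reading from BigQuery inside the Streamlit app, using the service
# account key stored in secrets. Assumed secrets.toml layout:
#
#   [gcp_service_account]
#   type = "service_account"
#   project_id = "your-gcp-project"
#   private_key = "..."
#   client_email = "streamlit@your-gcp-project.iam.gserviceaccount.com"
#   ...
import pandas_gbq
import streamlit as st
from google.oauth2 import service_account

# Build credentials from the secrets rather than a key file on disk.
credentials = service_account.Credentials.from_service_account_info(
    dict(st.secrets["gcp_service_account"])
)


@st.cache_data(ttl=600)  # cache query results for 10 minutes
def run_query(sql: str):
    return pandas_gbq.read_gbq(
        sql, project_id=credentials.project_id, credentials=credentials
    )


df = run_query("SELECT * FROM `dashboard.ridership_raw` LIMIT 1000")  # placeholder table
st.dataframe(df)
```

Copy the same `[gcp_service_account]` block into the deployed app's secrets settings before re-deploying.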
- Submit the links via CourseWorks for:
- The pull request(s)
- Your live Streamlit app