Get hands-on experience with:
- Unit testing
- Data profiling
- Data quality checks
All together
- Add tests for parsing the dollar amount
- Refactor the code
- Modules and
import
- Modules and
- Make them pass
Code needs to be testable. This encourages good habits, like:
- Ensuring your code can change without unexpected breakage
- Making small, reusable functions with well-defined behavior
- Organizing code into modules
- Allowing the loading of a module without running all the code
Seeing projects with <name>2.py
. Splitting code up into smaller files will help you work in parallel without stepping on each others' toes.
You'll pair in your Lab group. Work on branches and submit pull requests for the chunks of work — you decide what the "chunks" are.
Set up YData Profiling for your dataset. If it's slow, see their documentation on profiling large datasets.
At time of writing, YData Profiling is only compatible with Python 3.12 and below. If you're on Python 3.13+, the easiest thing will be to run the profiling in Colab and include a link to that notebook. Install the package there with:
!pip install ydata-profiling
- Unit tests for data
- Things to check for when cleaning data
- Can be flexible, like checking for:
- Standard deviation being in a certain range
- X% of values matching certain criteria
- There are commercial tools that help with this - we're going to write the code ourselves.
Look around your data profiling report. Using that information, write data quality checks for three of the things to check for with pytest. The pandas documentation around comparing if objects are equivalent will be helpful.