Commit e5bb2c4

committed
added lab_06 slides
1 parent 951ff6e commit e5bb2c4

3 files changed: +2880 -2 lines changed

.gitignore

-2
@@ -233,5 +233,3 @@ $RECYCLE.BIN/
 *.lnk
 
 # End of https://www.toptal.com/developers/gitignore/api/macos,python,windows
-
-lab_06_slides/

lab_06_slides/lab_06.html

+2,694
Large diffs are not rendered by default.

lab_06_slides/lab_06.qmd

+186
@@ -0,0 +1,186 @@
---
title: "Lab 06"
subtitle: "Advanced Computing for Policy"
format: revealjs
highlight-style: arrow
self-contained: true
---

```{python}
from ydata_profiling import ProfileReport
import pandas as pd
```

## Lab Overview

- Finishing [Lab 5](https://github.com/advanced-computing/course-materials/blob/main/labs/lab_05.md): Profiling and data quality checks
- Linting and formatting
- Continuous integration

Task:

- Set up continuous integration to run tests and linting on your code.
- You'll work in your [Project teams](../docs/project_teams.csv).


## Finishing [Lab 5](https://github.com/advanced-computing/course-materials/blob/main/labs/lab_05.md) {.scrollable}
### Profiling

::: {style="font-size: 80%"}
```{python}
#| echo: true
data = pd.read_csv('../lab_04/videos_data.csv')
data['Likes_numeric'] = data['Likes'].str.replace(',', '').astype(int)
profile = ProfileReport(data, title="Pandas Profiling Report")
profile.to_widgets()
```

- Some findings:
    - Variables: `Likes` is a string. The most-liked video has 44M likes. The least popular has 433 likes (?)
    - Interactions tab: Most of the top 200 videos were published after 2017.
    - Missing values: Almost half of the videos are missing a 'Dislikes' value.


- Did you find anything surprising/interesting/useful?
:::

## Finishing [Lab 5](https://github.com/advanced-computing/course-materials/blob/main/labs/lab_05.md)
### Data quality checks

::: {style="font-size: 80%"}
- Unit tests for data
    - Example 1: Checking variables' types

```{python}
#| echo: true
#| error: true
def check_numeric(data, column):
    assert data[column].dtype in ['int64', 'float64'], f"{column} is not numeric"

cols = ['Rank', 'Likes', 'Dislikes']
for col in cols:
    check_numeric(data, col)
```
:::

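The type check above stops at the first failing column. As a rough sketch, a variant that reports every non-numeric column at once could look like the following. It uses only the standard library; the `table` dictionary holds made-up toy values, and `parse_count` and `non_numeric_columns` are hypothetical helper names, not part of the lab code.

```python
from numbers import Real


def parse_count(text):
    """Turn a count written with thousands separators, e.g. '1,200', into an int."""
    return int(text.replace(",", ""))


def non_numeric_columns(table, columns):
    """Return the columns that contain at least one non-numeric value."""
    bad = []
    for column in columns:
        values = table[column]
        # bool is a subclass of int, so exclude it explicitly
        if not all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
            bad.append(column)
    return bad


# Toy data (made up): 'Likes' is stored as text, as in the lab's CSV
table = {
    "Rank": [1, 2, 3],
    "Likes": ["1,200", "950", "87"],
    "Dislikes": [10.0, 4.0, 1.0],
}

print(non_numeric_columns(table, ["Rank", "Likes", "Dislikes"]))  # ['Likes']

# Parsing the strings makes the column pass the check
table["Likes"] = [parse_count(v) for v in table["Likes"]]
print(non_numeric_columns(table, ["Rank", "Likes", "Dislikes"]))  # []
```

Collecting all failures before reporting can save a few round trips when several columns need cleaning.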
## Finishing [Lab 5](https://github.com/advanced-computing/course-materials/blob/main/labs/lab_05.md)
### Data quality checks (cont.)
::: {style="font-size: 80%"}
- Unit tests for data
    - Example 2: Checking outliers

```{python}
#| echo: true
#| error: true
def is_outlier(value, q1, q3):
    iqr = q3 - q1  # Interquartile range
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return value < lower_bound or value > upper_bound

def column_has_outliers(data, column):
    q1 = data[column].quantile(0.25)  # First quartile
    q3 = data[column].quantile(0.75)  # Third quartile
    return any(data[column].apply(lambda x: is_outlier(x, q1, q3)))

assert not column_has_outliers(data, 'Likes_numeric'), "Likes has outliers"
```
:::


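To see the 1.5 × IQR rule on numbers small enough to check by hand, here is a sketch that uses only the standard library. The values are made up, `find_outliers` is a hypothetical helper, and `statistics.quantiles` with `method="inclusive"` is used here as a stand-in for the pandas `quantile` calls above.

```python
import statistics


def find_outliers(values):
    """Return the values lying outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR]."""
    # n=4 gives the three quartile cut points; the middle one is the median
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return [v for v in values if v < lower_bound or v > upper_bound]


# Toy data (made up): one value is far larger than the rest
likes = [10, 12, 13, 15, 18, 20, 500]

# Here q1 = 12.5 and q3 = 19.0, so the bounds are 2.75 and 28.75
print(find_outliers(likes))  # [500]
```

Returning the offending values, rather than only `True` or `False`, makes a failing assertion message easier to act on.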
## Linting

- A type of [static analysis](https://en.wikipedia.org/wiki/Static_program_analysis)
    - Analyzing code without executing it
- Checks for code quality problems
- We'll be starting with [ruff](https://docs.astral.sh/ruff/).

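To make "analyzing code without executing it" concrete, here is a toy sketch of a single lint check built on Python's standard `ast` module. It is not how ruff is implemented; `unused_imports` is a hypothetical helper that only handles simple cases, and the `source` string is made up for illustration. Ruff reports this kind of problem as rule F401 (unused import).

```python
import ast


def unused_imports(source):
    """Return the names that `source` imports but never references."""
    tree = ast.parse(source)  # parses the text only; nothing is executed
    imported = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the name `a`
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)


source = """
import numpy as np
import pandas as pd

x = np.zeros(3)
"""

print(unused_imports(source))  # ['pd']
```

Because the source is only parsed, the check works even on a machine where numpy and pandas are not installed.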
## Example of Low Quality Code

::: {style="font-size: 80%"}
```{python}
#| echo: true
#| code-line-numbers: "|2|9|12-13"
#| output-location: column
import numpy as np
import pandas as pd

def simulate_data(n):
    x = np.random.uniform(0, 1, n)
    y = 2 + 3 * x + np.random.normal(0, 1, n)
    return x, y

from matplotlib import pyplot as plt

def plot_data(x, y):
    width = 100
    height = 100
    plt.scatter(x, y)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

plot_data(*simulate_data(100))
```
:::



## Continuous integration

- You're going to set up your tests and linting to run automatically every time you push code to GitHub.

- This is one of those times when you'll follow instructions without necessarily knowing what's going on.
- You'll learn more about it in [this week's reading](https://github.com/advanced-computing/course-materials/blob/main/readings/week_07.md#readings).

## Workflows

::: {style="font-size: 65%"}
- A workflow is an automated process made up of one or more jobs
- We use a YAML file to define our workflow configuration

```{.yaml code-line-numbers="|1|3|9-18|19-21|23-26"}
name: Run tests

on: push

jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - name: Clone repository
        uses: actions/checkout@v4
      # https://github.com/actions/setup-python
      - name: Install Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests
        # https://pytest-cov.readthedocs.io/en/latest/readme.html
        run: pytest --cov
      # https://github.com/astral-sh/ruff-action
      - name: Run ruff
        uses: astral-sh/ruff-action@v3
        with:
          version: latest
```

:::

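The "Run tests" step above runs `pytest`, which by default collects files named `test_*.py` and runs the functions in them whose names start with `test_`. A minimal sketch of such a file follows; the file name `test_quality.py`, the `share_missing` helper, and the values are all hypothetical and only illustrate the shape of a test.

```python
# test_quality.py (hypothetical file name)


def share_missing(values):
    """Return the fraction of entries that are None."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def test_share_missing_counts_none():
    # Two of the four made-up entries are missing
    assert share_missing([1, None, 3, None]) == 0.5


def test_share_missing_empty_list():
    # An empty column has nothing missing
    assert share_missing([]) == 0.0
```

If any assertion fails, `pytest` exits with a non-zero status, which is what marks the workflow run as failed.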
## Task
### Steps

::: {style="font-size: 80%"}
1. Install Ruff
    1. Install the [ruff VSCode extension](https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff).
    1. Open your Python files; you'll likely see some warnings.
        - Don't do anything with them yet.
1. Set up a GitHub Actions workflow
    1. In a branch, add a copy of [`.github/workflows/tests.yml`](https://github.com/advanced-computing/course-materials/blob/main/.github/workflows/tests.yml).
    1. Create a pull request.
    1. [View the results of the Actions run.](https://docs.github.com/en/actions/writing-workflows/quickstart#viewing-your-workflow-results)
    1. If the workflow is failing, review the errors and address them.
:::
