| 1 | +--- |
| 2 | +title: "Lab 06" |
| 3 | +subtitle: "Advanced Computing for Policy" |
| 4 | +format: revealjs |
| 5 | +highlight-style: arrow |
| 6 | +self-contained: true |
| 7 | +--- |
| 8 | + |
| 9 | +```{python} |
| 10 | +from ydata_profiling import ProfileReport |
| 11 | +import pandas as pd |
| 12 | +``` |
| 13 | + |
| 14 | +## Lab Overview |
| 15 | + |
| 16 | +- Finishing [Lab 5](https://github.com/advanced-computing/course-materials/blob/main/labs/lab_05.md): Profiling and data quality checks |
| 17 | +- Linting and formatting |
| 18 | +- Continuous integration |
| 19 | + |
| 20 | +Task: |
| 21 | + |
| 22 | +- Set up continuous integration to run tests and linting on your code. |
| 23 | +- You'll work in your [Project teams](../docs/project_teams.csv). |
| 24 | + |
| 25 | + |
| 26 | +## Finishing [Lab 5](https://github.com/advanced-computing/course-materials/blob/main/labs/lab_05.md) {.scrollable} |
| 27 | +### Profiling |
| 28 | + |
| 29 | +::: {style="font-size: 80%"} |
| 30 | +```{python} |
| 31 | +#| echo: true |
| 32 | +data = pd.read_csv('../lab_04/videos_data.csv') |
| 33 | +data['Likes_numeric'] = data['Likes'].str.replace(',', '').astype(int) |
| 34 | +profile = ProfileReport(data, title="Pandas Profiling Report") |
| 35 | +profile.to_widgets() |
| 36 | +``` |
| 37 | + |
| 38 | +- Some findings: |
| 39 | +  - Variables: Likes is a string. Most liked video has 44M likes. Least popular has 433 likes (?) |
| 40 | +  - Interactions tab: Most of the top 200 videos were published after 2017. |
| 41 | +  - Missing values: Almost half of the videos are missing a 'Dislikes' value. |
| 42 | + |
| 43 | + |
| 44 | +- Did you find anything surprising/interesting/useful? |
| 45 | +::: |
| 46 | + |
| 47 | +## Finishing [Lab 5](https://github.com/advanced-computing/course-materials/blob/main/labs/lab_05.md) |
| 48 | +### Data quality checks |
| 49 | + |
| 50 | +::: {style="font-size: 80%"} |
| 51 | +- Unit tests for data |
| 52 | +- Example 1: Checking variables' types |
| 53 | + |
| 54 | +```{python} |
| 55 | +#| echo: true |
| 56 | +#| error: true |
| 57 | +def check_numeric(data, column): |
| 58 | +    assert data[column].dtype in ['int64', 'float64'], f"{column} is not numeric" |
| 59 | + |
| 60 | +cols = ['Rank', 'Likes', 'Dislikes'] |
| 61 | +for col in cols: |
| 62 | +    check_numeric(data, col) |
| 63 | +``` |
| 64 | +::: |
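The dtype comparison above can also be written with pandas' own type helpers. A minimal sketch, assuming a small made-up frame in place of `videos_data.csv` (the values are illustrative, not taken from the real data); `is_numeric_dtype` comes from `pandas.api.types`:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_numeric(data, column):
    """Raise AssertionError if the column's dtype is not numeric."""
    assert is_numeric_dtype(data[column]), f"{column} is not numeric"


# Hypothetical stand-in for the lab data: 'Likes' holds comma-formatted strings.
sample = pd.DataFrame({"Rank": [1, 2], "Likes": ["1,200", "433"]})

# Collect every failure instead of stopping at the first one.
failures = []
for col in ["Rank", "Likes"]:
    try:
        check_numeric(sample, col)
    except AssertionError as err:
        failures.append(str(err))

print(failures)
```

Collecting the messages rather than stopping at the first failed assertion makes it easier to see every column that needs cleaning in one pass.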
| 65 | + |
| 66 | +## Finishing [Lab 5](https://github.com/advanced-computing/course-materials/blob/main/labs/lab_05.md) |
| 67 | +### Data quality checks (cont.) |
| 68 | +::: {style="font-size: 80%"} |
| 69 | +- Unit tests for data |
| 70 | +- Example 2: Checking outliers |
| 71 | + |
| 72 | +```{python} |
| 73 | +#| echo: true |
| 74 | +#| error: true |
| 75 | +def is_outlier(value, q1, q3): |
| 76 | +    iqr = q3 - q1 # Interquartile range |
| 77 | +    lower_bound = q1 - 1.5 * iqr |
| 78 | +    upper_bound = q3 + 1.5 * iqr |
| 79 | +    return value < lower_bound or value > upper_bound |
| 80 | + |
| 81 | +def column_has_outliers(data, column): |
| 82 | +    q1 = data[column].quantile(0.25) # First quartile |
| 83 | +    q3 = data[column].quantile(0.75) # Third quartile |
| 84 | +    return any(data[column].apply(lambda x: is_outlier(x, q1, q3))) |
| 85 | + |
| 86 | +assert not column_has_outliers(data, 'Likes_numeric'), "Likes has outliers" |
| 87 | +``` |
| 88 | +::: |
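The row-by-row `apply` can be replaced with a vectorized comparison. A sketch under the same 1.5 × IQR rule, using made-up numbers rather than the lab data; `outlier_mask` is a name chosen for this example:

```python
import pandas as pd


def outlier_mask(series):
    """Return a boolean Series marking values outside the 1.5 * IQR fences."""
    q1 = series.quantile(0.25)  # First quartile
    q3 = series.quantile(0.75)  # Third quartile
    iqr = q3 - q1               # Interquartile range
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)


likes = pd.Series([10, 12, 11, 13, 12, 500])  # illustrative values only
mask = outlier_mask(likes)
print(likes[mask].tolist())
```

Returning the mask instead of a single True/False lets you inspect which rows tripped the check before deciding whether they are errors or just very popular videos.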
| 89 | + |
| 90 | + |
| 91 | +## Linting |
| 92 | + |
| 93 | +- A type of [static analysis](https://en.wikipedia.org/wiki/Static_program_analysis) |
| 94 | + - Analyzing code without executing it |
| 95 | +- Checks for code-quality problems such as unused imports, unused variables, and misplaced import statements |
| 96 | +- We'll be starting with [ruff](https://docs.astral.sh/ruff/). |
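To make "analyzing code without executing it" concrete, here is a toy sketch using Python's standard `ast` module. It is not how ruff works internally; it only illustrates the idea of flagging an unused import by reading the syntax tree. `SOURCE` and `unused_imports` are names made up for this example:

```python
import ast

# The code under inspection is just a string; it is parsed, never run.
SOURCE = """
import numpy as np
import pandas as pd

def simulate(n):
    return np.random.uniform(0, 1, n)
"""


def unused_imports(source):
    """Return imported names that are never referenced in the source."""
    tree = ast.parse(source)
    imported = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import a.b` binds the name `a`; `import a as x` binds `x`.
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)


print(unused_imports(SOURCE))
```

Note that numpy and pandas do not need to be installed for this to run, because the inspected code is only parsed.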
| 97 | + |
| 98 | +## Example of Low Quality Code |
| 99 | + |
| 100 | +::: {style="font-size: 80%"} |
| 101 | +```{python} |
| 102 | +#| echo: true |
| 103 | +#| code-line-numbers: "|2|9|12-13" |
| 104 | +#| output-location: column |
| 105 | +import numpy as np |
| 106 | +import pandas as pd |
| 107 | + |
| 108 | +def simulate_data(n): |
| 109 | +    x = np.random.uniform(0, 1, n) |
| 110 | +    y = 2 + 3 * x + np.random.normal(0, 1, n) |
| 111 | +    return x, y |
| 112 | + |
| 113 | +from matplotlib import pyplot as plt |
| 114 | + |
| 115 | +def plot_data(x, y): |
| 116 | +    width = 100 |
| 117 | +    height = 100 |
| 118 | +    plt.scatter(x, y) |
| 119 | +    plt.xlabel('x') |
| 120 | +    plt.ylabel('y') |
| 121 | +    plt.show() |
| 122 | + |
| 123 | +plot_data(*simulate_data(100)) |
| 124 | +``` |
| 125 | +::: |
| 126 | + |
| 127 | + |
| 128 | + |
| 129 | +## Continuous integration |
| 130 | + |
| 131 | +- You're going to set up your tests and linting to run automatically every time you push code to GitHub. |
| 132 | + |
| 133 | +- This is one of those times when you'll follow instructions without necessarily knowing what's going on. |
| 134 | +  - You'll learn more about it in [this week's reading](https://github.com/advanced-computing/course-materials/blob/main/readings/week_07.md#readings). |
| 135 | + |
| 136 | +## Workflows |
| 137 | + |
| 138 | +::: {style="font-size: 65%"} |
| 139 | +- A workflow is an automated process made up of one or more jobs |
| 140 | +- We use a YAML file to define our workflow configuration |
| 141 | + |
| 142 | +```{.yaml code-line-numbers="|1|3|9-18|19-21|23-26"} |
| 143 | +name: Run tests |
| 144 | + |
| 145 | +on: push |
| 146 | + |
| 147 | +jobs: |
| 148 | +  tests: |
| 149 | +    runs-on: ubuntu-latest |
| 150 | +    steps: |
| 151 | +      - name: Clone repository |
| 152 | +        uses: actions/checkout@v4 |
| 153 | +      # https://github.com/actions/setup-python |
| 154 | +      - name: Install Python |
| 155 | +        uses: actions/setup-python@v5 |
| 156 | +        with: |
| 157 | +          python-version: "3.12" |
| 158 | +          cache: pip |
| 159 | +      - name: Install dependencies |
| 160 | +        run: pip install -r requirements.txt |
| 161 | +      - name: Run tests |
| 162 | +        # https://pytest-cov.readthedocs.io/en/latest/readme.html |
| 163 | +        run: pytest --cov |
| 164 | +      # https://github.com/astral-sh/ruff-action |
| 165 | +      - name: Run ruff |
| 166 | +        uses: astral-sh/ruff-action@v3 |
| 167 | +        with: |
| 168 | +          version: latest |
| 169 | +``` |
| 170 | + |
| 171 | +::: |
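The `pytest --cov` step needs tests to collect; with none, pytest exits with a non-zero code and the job fails. A minimal sketch of what a test file could look like, assuming a hypothetical helper `parse_likes` (not part of the course materials) that mirrors the `Likes` cleanup from the profiling slide. By default, pytest collects functions whose names start with `test_` from files named `test_*.py`:

```python
def parse_likes(text):
    """Convert a comma-formatted count such as '1,234' to an int."""
    return int(text.replace(",", ""))


def test_parse_likes_strips_commas():
    assert parse_likes("44,000,000") == 44_000_000


def test_parse_likes_plain_number():
    assert parse_likes("433") == 433
```

This assumes `pytest` and `pytest-cov` are both listed in `requirements.txt`, since the `--cov` flag is provided by the `pytest-cov` plugin.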
| 172 | + |
| 173 | +## Task |
| 174 | +### Steps |
| 175 | + |
| 176 | +::: {style="font-size: 80%"} |
| 177 | +1. Install Ruff |
| 178 | +   1. Install the [ruff VS Code extension](https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff). |
| 179 | +   1. Open your Python files; you'll likely see some warnings. |
| 180 | +      - Don't do anything with them yet. |
| 181 | +1. Set up a GitHub Actions workflow |
| 182 | +   1. In a branch, add a copy of [`.github/workflows/tests.yml`](https://github.com/advanced-computing/course-materials/blob/main/.github/workflows/tests.yml). |
| 183 | +   1. Create a pull request. |
| 184 | +   1. [View the results of the Actions run.](https://docs.github.com/en/actions/writing-workflows/quickstart#viewing-your-workflow-results) |
| 185 | +   1. If the workflow is failing, review the errors and address them. |
| 186 | +::: |