# Writing a Tutorial
A tutorial in this repository refers to a Jupyter notebook written for a dataset, either to do a deep analysis of it or to apply some ML technique to it. Some possibilities for tutorials are as follows:
- Walkthrough of a dataset: go over its main features and use data visualization to tell a story about the dataset
- Correlation/causation analysis
- Time series analysis
- Supervised learning, such as classification, regression, forecasting, etc.
- Unsupervised learning, such as clustering
Make sure the dataset you choose is tabular and onboarded by our team. There should be a directory available for that dataset here.
Your tutorial can be anything you want, as long as it shows something interesting about the dataset.
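For instance, a walkthrough-style visualization cell might look something like the sketch below. This is only an illustration; the `trips` DataFrame here is a synthetic stand-in, not a real table from this repository:

```python
# A minimal sketch of a walkthrough-style visualization cell.
# `trips` is synthetic stand-in data, not a real dataset table.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

trips = pd.DataFrame({"duration_minutes": np.random.exponential(20, 1000)})

# Plot the distribution of trip durations to tell a story about the data.
trips["duration_minutes"].plot(kind="hist", bins=50)
plt.title("Distribution of trip durations")
plt.xlabel("Duration (minutes)")
plt.show()
```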
You may start by downloading a copy of the template and uploading it to Colab.
- While Colab offers many cool macros and shortcuts, we ask you not to use them, since these tutorials should also be runnable in Workbench and locally.
- To help the reader understand your tutorial more easily, make sure to add enough description in markdown cells before your code cells.
- We encourage you to submit your notebook for code review via a Pull Request on GitHub. This helps us keep track of your progress and gives you access to the history of your work later on.
- When you are done with your code, download your notebook to your local machine and run it locally to make sure it still runs without any issues (see the sketch after this list).
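One way to sanity-check the local run is a generic sketch like the following, assuming `nbformat` and `nbclient` are installed; this is not a command from this repository, just one option for executing a notebook programmatically:

```python
# A minimal sketch for verifying that a downloaded notebook executes
# end to end. The notebook filename is hypothetical.
import nbformat
from nbclient import NotebookClient

nb = nbformat.read("my_notebook.ipynb", as_version=4)
NotebookClient(nb).execute()  # raises CellExecutionError if any cell fails
```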
Each tutorial requires some metadata, which should be stored in an `artifact.yaml` file. Here is an example of what the file should look like:
```yaml
artifact:
  title: The title of your tutorial
  description: A brief description of what the tutorial is about.
  tags:
    - libraries:sklearn,matplotlib
    - ml:classification
    - vertical:government
    - tier:free
```
The `vertical` variable is one of `healthcare`, `environment`, `finance`, `information`, `education`, `retail`, `government`, and `manufacturing`. `tier` is one of `free` or `paid`, and it is `paid` only when the tutorial requires some GCP services such as Vertex AI.
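As an illustration, a hypothetical helper (not part of the repo's tooling) could check these two tags against the allowed values:

```python
# Hypothetical check that artifact.yaml uses allowed vertical/tier values.
# Not part of the repository's tooling; shown only for illustration.
import yaml

ALLOWED_VERTICALS = {"healthcare", "environment", "finance", "information",
                     "education", "retail", "government", "manufacturing"}
ALLOWED_TIERS = {"free", "paid"}

with open("artifact.yaml") as f:
    tags = yaml.safe_load(f)["artifact"]["tags"]

# Tags are "key:value" strings, e.g. "vertical:government".
values = dict(tag.split(":", 1) for tag in tags)
assert values["vertical"] in ALLOWED_VERTICALS
assert values["tier"] in ALLOWED_TIERS
```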
We use black to format the Python code and flake8 to lint it. The following commands should be helpful for this purpose:
```bash
# Running black on Python files:
poetry run black .

# Running flake8 on Python files:
poetry run flake8 .

# Running flake8 on Jupyter Notebook files:
poetry run nbqa flake8 .
```
Each notebook tutorial requires a test file. We use testbook to write our unit tests. Let's assume we have a simple notebook called `my_notebook.ipynb` with the following four cells:
```python
# Cell 1
import pandas as pd
from google.cloud import bigquery

# Cell 2
QUERY = 'SELECT * FROM table LIMIT 1000'

# Cell 3
bqclient = bigquery.Client(project='my_project')
dataframe = bqclient.query(QUERY).result().to_dataframe()

# Cell 4
var = 3 + 4
```
In our test, we want to mock the BigQuery client and avoid making a real request. The trick is to inject a cell before the one that calls `bigquery.Client` and set up the mock there. With cells indexed from zero, `before=2` injects the mock right before Cell 3, so the patch is in place by the time the client is created. Here is how to do it using testbook:
```python
from testbook import testbook


@testbook('./my_notebook.ipynb')
def test_get_details(tb):
    tb.inject(
        """
        import mock
        mock_client = mock.MagicMock()
        mock_df = pd.DataFrame()
        mock_df['week'] = range(10)
        mock_df['count'] = 5
        p1 = mock.patch.object(bigquery, 'Client', return_value=mock_client)
        mock_client.query().result().to_dataframe.return_value = mock_df
        p1.start()
        """,
        before=2,
        run=False
    )
    tb.execute()

    dataframe = tb.get('dataframe')
    assert dataframe.shape == (10, 2)

    var = tb.get('var')
    assert var == 7
```
A full example can be found here.
In order to run the test file and see if your tests pass, you can use Poetry. Run the following command from the root directory of the repository:
```bash
poetry run python -m pytest -v datasets/DATASET_NAME/docs/tutorials/TUTORIAL_DIRECTORY/NOTEBOOK_test.py
```
If you want, you may provide an `overview.md` file with more details about your tutorial and include it alongside your other tutorial files.
Your code needs to be submitted for review via a Pull Request. Here is a guideline showing how to do it. We encourage you to submit your tutorial frequently for review on GitHub to get incremental feedback from the reviewers.
Let's assume you wrote a tutorial for the austin_bikeshare dataset that trains a model to predict the duration of a trip, and saved it in a file called `bike_trip_predict.ipynb`.
For each dataset, all tutorials should be placed in a new subdirectory under `.../docs/tutorials`. For this example, create a subdirectory called `bike_trip_prediction` and place all your new files inside it. The final tree structure should look something like this:
```
├── datasets
│   └── austin_bikeshare
│       └── docs
│           └── tutorials
│               └── bike_trip_prediction
│                   ├── bike_trip_predict.ipynb
│                   ├── bike_trip_predict_test.py
│                   └── artifact.yaml
```