
Commit 78dba7b

Add ML project checklist
1 parent afcac83 commit 78dba7b

1 file changed: ml-project-checklist.md (+129 -0)

This checklist can guide you through your Machine Learning projects. There are eight main steps:

1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

Obviously, you should feel free to adapt this checklist to your needs.

# Frame the problem and look at the big picture

1. Define the objective in business terms.
2. How will your solution be used?
3. What are the current solutions/workarounds (if any)?
4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
5. How should performance be measured? (See the sketch after this list.)
6. Is the performance measure aligned with the business objective?
7. What would be the minimum performance needed to reach the business objective?
8. What are comparable problems? Can you reuse experience or tools?
9. Is human expertise available?
10. How would you solve the problem manually?
11. List the assumptions you or others have made so far.
12. Verify assumptions if possible.
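
For step 5, a minimal sketch of turning the chosen performance measure into code, assuming a regression task scored with RMSE; the metric choice, the example numbers, and the `MINIMUM_ACCEPTABLE_RMSE` threshold are illustrative, not prescribed by the checklist:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def business_score(y_true, y_pred):
    """Performance measure for the project: root mean squared error,
    assuming lower prediction error maps directly to business value."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Compare against the minimum performance needed (step 7).
# The threshold below is a made-up number for illustration only.
MINIMUM_ACCEPTABLE_RMSE = 50_000
rmse = business_score([200_000, 310_000], [195_000, 330_000])
print(rmse, rmse <= MINIMUM_ACCEPTABLE_RMSE)
```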

# Get the data

Note: automate as much as possible so you can easily get fresh data.

1. List the data you need and how much you need.
2. Find and document where you can get that data.
3. Check how much space it will take.
4. Check legal obligations, and get the authorization if necessary.
5. Get access authorizations.
6. Create a workspace (with enough storage space).
7. Get the data.
8. Convert the data to a format you can easily manipulate (without changing the data itself).
9. Ensure sensitive information is deleted or protected (e.g., anonymized).
10. Check the size and type of data (time series, sample, geographical, etc.).
11. Sample a test set, put it aside, and never look at it (no data snooping!). See the sketch after this list.
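
For step 11, a minimal sketch of setting a test set aside, assuming the data fits in a pandas DataFrame loaded from a hypothetical `dataset.csv`; the fixed `random_state` keeps the split stable across runs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw data (file name is illustrative).
data = pd.read_csv("dataset.csv")

# Split off 20% as a test set and never look at it during exploration.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

# Persist both splits so the test set stays untouched (no data snooping).
train_set.to_csv("train_set.csv", index=False)
test_set.to_csv("test_set.csv", index=False)
```

If some categories are rare but important, a stratified split (e.g., scikit-learn's `StratifiedShuffleSplit`) may be preferable to a purely random one.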

# Explore the data

Note: try to get insights from a field expert for these steps.

1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
2. Create a Jupyter notebook to keep a record of your data exploration.
3. Study each attribute and its characteristics (see the sketch after this list):
    - Name
    - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
    - % of missing values
    - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
    - Possibly useful for the task?
    - Type of distribution (Gaussian, uniform, logarithmic, etc.)
4. For supervised learning tasks, identify the target attribute(s).
5. Visualize the data.
6. Study the correlations between attributes.
7. Study how you would solve the problem manually.
8. Identify the promising transformations you may want to apply.
9. Identify extra data that would be useful (go back to "Get the data").
10. Document what you have learned.
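
For steps 3, 5, and 6, a minimal sketch of this kind of attribute study, assuming the exploration copy comes from the `train_set.csv` written earlier and has a numeric target column named `target` (both names are placeholders):

```python
import pandas as pd

# Work on a copy of the training data, sampled down if needed.
explore = pd.read_csv("train_set.csv").copy()

# Attribute characteristics: type, missing values, basic statistics.
explore.info()
print(explore.isna().mean().sort_values(ascending=False))   # % of missing values
print(explore.describe())

# Correlations between the numeric attributes and the target.
corr = explore.select_dtypes("number").corr()
print(corr["target"].sort_values(ascending=False))

# Quick visualization of each attribute's distribution.
explore.hist(bins=50, figsize=(12, 8))
```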

# Prepare the data

Notes:
- Work on copies of the data (keep the original dataset intact).
- Write functions for all data transformations you apply, for five reasons:
    - So you can easily prepare the data the next time you get a fresh dataset
    - So you can apply these transformations in future projects
    - To clean and prepare the test set
    - To clean and prepare new data instances
    - To make it easy to treat your preparation choices as hyperparameters

1. Data cleaning:
    - Fix or remove outliers (optional).
    - Fill in missing values (e.g., with zero, mean, median...) or drop their rows (or columns).
2. Feature selection (optional):
    - Drop the attributes that provide no useful information for the task.
3. Feature engineering, where appropriate:
    - Discretize continuous features.
    - Decompose features (e.g., categorical, date/time, etc.).
    - Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
    - Aggregate features into promising new features.
4. Feature scaling: standardize or normalize features. (A pipeline sketch covering these steps follows the list.)
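
One way to keep these transformations as reusable functions is a scikit-learn Pipeline. The sketch below is a minimal example covering cleaning (imputation), one promising transformation (log), and feature scaling; the column names are purely illustrative placeholders:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

num_attribs = ["age", "income"]   # placeholder numeric columns
cat_attribs = ["category"]        # placeholder categorical column

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill in missing values
    ("log", FunctionTransformer(np.log1p)),         # promising transformation
    ("scale", StandardScaler()),                    # feature scaling
])

preprocess = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs),
])

# prepared = preprocess.fit_transform(train_set)   # reuse on fresh data via transform()
```

Keeping choices such as the imputation strategy as pipeline parameters is what later lets you treat your preparation choices as hyperparameters.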

# Short-list promising models

Notes:
- If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests).
- Once again, try to automate these steps as much as possible.

1. Train many quick-and-dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
2. Measure and compare their performance (see the sketch after this list).
    - For each model, use N-fold cross-validation and compute the mean and standard deviation of their performance.
3. Analyze the most significant variables for each algorithm.
4. Analyze the types of errors the models make.
    - What data would a human have used to avoid these errors?
5. Have a quick round of feature selection and engineering.
6. Have one or two more quick iterations of the five previous steps.
7. Short-list the top three to five most promising models, preferring models that make different types of errors.
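
For steps 1 and 2, a minimal sketch comparing a few quick models with default parameters using 5-fold cross-validation; the model selection is illustrative, and `X_train`/`y_train` are assumed to be the prepared training features and labels from the previous step:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def compare_models(X_train, y_train):
    """Train a few quick models with standard parameters and report the
    mean and standard deviation of their cross-validated RMSE."""
    models = {
        "linear": LinearRegression(),
        "svm": SVR(),
        "random_forest": RandomForestRegressor(random_state=42),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X_train, y_train,
                                 scoring="neg_root_mean_squared_error", cv=5)
        rmse = -scores
        print(f"{name}: mean RMSE {rmse.mean():.3f} (std {rmse.std():.3f})")

# compare_models(X_train, y_train)   # X_train, y_train: prepared training data (assumed)
```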

# Fine-Tune the System

Notes:
- You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning.
- As always, automate what you can.

1. Fine-tune the hyperparameters using cross-validation (see the sketch after this section).
    - Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or the median value? Or just drop the rows?).
    - Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams ([https://goo.gl/PEFfGr](https://goo.gl/PEFfGr))).
2. Try Ensemble methods. Combining your best models will often perform better than running them individually.
3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.

> Don't tweak your model after measuring the generalization error: you would just start overfitting the test set.
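
A minimal sketch of step 1, using randomized search over both model hyperparameters and a data-preparation choice (the imputation strategy); the column names, the model, and the search ranges are illustrative assumptions, and the final `fit` call is left commented because the raw training features and labels depend on your project:

```python
from scipy.stats import randint
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder preparation pipeline; in practice reuse the one built in "Prepare the data".
num_attribs = ["age", "income"]
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_attribs),
])

full_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(random_state=42)),
])

param_distributions = {
    # Treat a data-preparation choice as a hyperparameter.
    "preprocess__num__impute__strategy": ["median", "mean"],
    # Model hyperparameters, sampled at random rather than on a grid.
    "model__n_estimators": randint(50, 300),
    "model__max_depth": randint(3, 20),
}

search = RandomizedSearchCV(full_pipeline, param_distributions, n_iter=20,
                            scoring="neg_root_mean_squared_error", cv=5,
                            random_state=42)
# search.fit(train_features, train_labels)    # hypothetical raw features/labels
# print(search.best_params_, -search.best_score_)
```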

# Present your solution

1. Document what you have done.
2. Create a nice presentation.
    - Make sure you highlight the big picture first.
3. Explain why your solution achieves the business objective.
4. Don't forget to present interesting points you noticed along the way.
    - Describe what worked and what did not.
    - List your assumptions and your system's limitations.
5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., "the median income is the number-one predictor of housing prices"). See the sketch after this list.
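
For step 5, a minimal sketch of one such visualization, assuming a fitted tree-based model that exposes `feature_importances_`; the `best_model` and `feature_names` arguments are hypothetical placeholders:

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_feature_importances(model, feature_names, path="feature_importances.png"):
    """Horizontal bar chart of feature importances for a fitted tree-based model."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    importances.sort_values().plot.barh(figsize=(8, 6))
    plt.title("Which attributes drive the predictions?")
    plt.xlabel("Feature importance")
    plt.tight_layout()
    plt.savefig(path)

# plot_feature_importances(best_model, feature_names)   # both arguments are hypothetical
```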

# Launch!

1. Get your solution ready for production (plug into production data inputs, write unit tests, etc.).
2. Write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops (see the sketch after this list).
    - Beware of slow degradation too: models tend to "rot" as data evolves.
    - Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
    - Also monitor your inputs' quality (e.g., a malfunctioning sensor sending random values, or another team's output becoming stale). This is particularly important for online learning systems.
3. Retrain your models on a regular basis on fresh data (automate as much as possible).
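
For step 2, a minimal sketch of a periodic monitoring check, assuming RMSE is the performance measure; the alert threshold and the `fetch_recent_predictions` hook are hypothetical placeholders you would wire into your own scheduling and alerting setup:

```python
import logging

import numpy as np
from sklearn.metrics import mean_squared_error

RMSE_ALERT_THRESHOLD = 60_000   # made-up threshold; derive it from the business objective

def check_live_performance(y_true, y_pred):
    """Compute live RMSE and log a warning if performance has degraded."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    if rmse > RMSE_ALERT_THRESHOLD:
        logging.warning("Model performance degraded: live RMSE %.0f > %.0f",
                        rmse, RMSE_ALERT_THRESHOLD)
    return rmse

# Run from a scheduled job (cron, Airflow, etc.); fetch_recent_predictions is hypothetical.
# y_true, y_pred = fetch_recent_predictions()
# check_live_performance(y_true, y_pred)
```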
