This checklist can guide you through your Machine Learning projects. There are eight main steps:

1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

Obviously, you should feel free to adapt this checklist to your needs.

# Frame the problem and look at the big picture
1. Define the objective in business terms.
2. How will your solution be used?
3. What are the current solutions/workarounds (if any)?
4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
5. How should performance be measured?
6. Is the performance measure aligned with the business objective?
7. What would be the minimum performance needed to reach the business objective?
8. What are comparable problems? Can you reuse experience or tools?
9. Is human expertise available?
10. How would you solve the problem manually?
11. List the assumptions you or others have made so far.
12. Verify assumptions if possible.

# Get the data
Note: automate as much as possible so you can easily get fresh data.

1. List the data you need and how much you need.
2. Find and document where you can get that data.
3. Check how much space it will take.
4. Check legal obligations, and get authorization if necessary.
5. Get access authorizations.
6. Create a workspace (with enough storage space).
7. Get the data.
8. Convert the data to a format you can easily manipulate (without changing the data itself).
9. Ensure sensitive information is deleted or protected (e.g., anonymized).
10. Check the size and type of data (time series, sample, geographical, etc.).
11. Sample a test set, put it aside, and never look at it (no data snooping!).

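As an illustration of step 11, here is a minimal Python sketch of a stable, hash-based test-set split (the identifiers and the 20% ratio are hypothetical). Hashing each instance's identifier keeps the same instances in the test set even after you fetch fresh data:

```python
from zlib import crc32

def is_in_test_set(identifier, test_ratio):
    # An instance goes to the test set if the hash of its identifier falls
    # in the lowest `test_ratio` fraction of the 32-bit hash space.
    return crc32(str(identifier).encode()) < test_ratio * 2**32

def split_train_test_by_id(ids, test_ratio=0.2):
    test = [i for i in ids if is_in_test_set(i, test_ratio)]
    train = [i for i in ids if not is_in_test_set(i, test_ratio)]
    return train, test

train_ids, test_ids = split_train_test_by_id(range(10_000), test_ratio=0.2)
print(len(train_ids), len(test_ids))  # roughly an 80/20 split
```

Because the split depends only on each identifier's hash, an instance never migrates between sets across runs, which is exactly what "never look at it" requires.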
# Explore the data
Note: try to get insights from a field expert for these steps.

1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
2. Create a Jupyter notebook to keep a record of your data exploration.
3. Study each attribute and its characteristics:
   - Name
   - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
   - % of missing values
   - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
   - Possibly useful for the task?
   - Type of distribution (Gaussian, uniform, logarithmic, etc.)
4. For supervised learning tasks, identify the target attribute(s).
5. Visualize the data.
6. Study the correlations between attributes.
7. Study how you would solve the problem manually.
8. Identify the promising transformations you may want to apply.
9. Identify extra data that would be useful (go back to "Get the data").
10. Document what you have learned.

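Steps 3 and 6 above can be sketched with pandas. The toy dataset below is hypothetical, standing in for your exploration copy:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset standing in for your exploration copy.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "median_income": rng.normal(5, 2, 500),
    "rooms": rng.integers(1, 10, 500).astype(float),
})
df["house_value"] = 50_000 * df["median_income"] + rng.normal(0, 20_000, 500)
df.loc[df.sample(frac=0.05, random_state=0).index, "rooms"] = np.nan  # simulate missing values

print(df.dtypes)          # attribute types
print(df.isna().mean())   # fraction of missing values per attribute
print(df.describe())      # distribution summaries
print(df.corr()["house_value"].sort_values(ascending=False))  # correlations with the target
```

A few lines like these cover the "% of missing values", "type of distribution", and "correlations" items for every numeric attribute at once.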
# Prepare the data
Notes:
- Work on copies of the data (keep the original dataset intact).
- Write functions for all data transformations you apply, for five reasons:
   - So you can easily prepare the data the next time you get a fresh dataset
   - So you can apply these transformations in future projects
   - To clean and prepare the test set
   - To clean and prepare new data instances
   - To make it easy to treat your preparation choices as hyperparameters

1. Data cleaning:
   - Fix or remove outliers (optional).
   - Fill in missing values (e.g., with zero, mean, median...) or drop their rows (or columns).
2. Feature selection (optional):
   - Drop the attributes that provide no useful information for the task.
3. Feature engineering, where appropriate:
   - Discretize continuous features.
   - Decompose features (e.g., categorical, date/time, etc.).
   - Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
   - Aggregate features into promising new features.
4. Feature scaling: standardize or normalize features.

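The cleaning and scaling steps above can be wrapped into a single reusable transformation, as the notes recommend. A minimal sketch using scikit-learn (the toy matrix is hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric matrix with a missing value.
X = np.array([[1.0, 2.0], [np.nan, 4.0], [7.0, 6.0]])

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # step 1: fill missing values
    ("scale", StandardScaler()),                   # step 4: standardize features
])
X_prepared = num_pipeline.fit_transform(X)
print(X_prepared.mean(axis=0))  # ~0 per column after standardization
```

Because the pipeline is a single fitted object, the same transformations can later be applied unchanged to the test set and to new data instances, and its choices (e.g., `strategy="median"` vs `"mean"`) can be treated as hyperparameters.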
# Short-list promising models
Notes:
- If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests).
- Once again, try to automate these steps as much as possible.

1. Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
2. Measure and compare their performance.
   - For each model, use N-fold cross-validation and compute the mean and standard deviation of their performance.
3. Analyze the most significant variables for each algorithm.
4. Analyze the types of errors the models make.
   - What data would a human have used to avoid these errors?
5. Have a quick round of feature selection and engineering.
6. Have one or two more quick iterations of the five previous steps.
7. Short-list the top three to five most promising models, preferring models that make different types of errors.

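Steps 1 and 2 might look like the following scikit-learn sketch, with a synthetic dataset standing in for your own:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for your training data.
X, y = make_classification(n_samples=300, random_state=42)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=42),
}
scores = {}
for name, model in models.items():
    cv = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    scores[name] = (cv.mean(), cv.std())
    print(f"{name}: {cv.mean():.3f} +/- {cv.std():.3f}")
```

Reporting the standard deviation alongside the mean makes it clear whether two models' scores are actually distinguishable.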
# Fine-Tune the System
Notes:
- You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning.
- As always, automate what you can.

1. Fine-tune the hyperparameters using cross-validation.
   - Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or the median value? Or just drop the rows?).
   - Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams ([https://goo.gl/PEFfGr](https://goo.gl/PEFfGr))).
2. Try Ensemble methods. Combining your best models will often perform better than running them individually.
3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.

> Don't tweak your model after measuring the generalization error: you would just start overfitting the test set.

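A minimal sketch of step 1's random search, using scikit-learn's `RandomizedSearchCV` on a synthetic dataset (the model choice and parameter ranges are illustrative):

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for your training data.
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=42)

param_distributions = {
    "n_estimators": randint(20, 100),  # sampled uniformly from [20, 100)
    "max_depth": randint(3, 10),
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=5,   # evaluate 5 random hyperparameter combinations
    cv=3,       # with 3-fold cross-validation each
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Unlike a grid, the distributions let you explore as many combinations as your compute budget allows simply by raising `n_iter`.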
# Present your solution
1. Document what you have done.
2. Create a nice presentation.
   - Make sure you highlight the big picture first.
3. Explain why your solution achieves the business objective.
4. Don't forget to present interesting points you noticed along the way.
   - Describe what worked and what did not.
   - List your assumptions and your system's limitations.
5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., "the median income is the number-one predictor of housing prices").

# Launch!
1. Get your solution ready for production (plug into production data inputs, write unit tests, etc.).
2. Write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops.
   - Beware of slow degradation too: models tend to "rot" as data evolves.
   - Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
   - Also monitor your inputs' quality (e.g., a malfunctioning sensor sending random values, or another team's output becoming stale). This is particularly important for online learning systems.
3. Retrain your models on a regular basis on fresh data (automate as much as possible).
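A toy sketch of step 2's monitoring idea, assuming you can collect recent labels and predictions; the baseline accuracy, threshold, and alert wording are hypothetical placeholders (in production you would page or log instead of returning a string):

```python
# Hypothetical baseline measured at launch time, and a tolerated drop.
BASELINE_ACCURACY = 0.90
ALERT_THRESHOLD = 0.05

def check_live_performance(recent_labels, recent_predictions):
    # Compare live accuracy over a recent window against the baseline
    # and flag the system when it degrades beyond the threshold.
    correct = sum(y == p for y, p in zip(recent_labels, recent_predictions))
    accuracy = correct / len(recent_labels)
    if accuracy < BASELINE_ACCURACY - ALERT_THRESHOLD:
        return f"ALERT: live accuracy {accuracy:.2f} below tolerated minimum"
    return f"OK: live accuracy {accuracy:.2f}"

print(check_live_performance([1, 0, 1, 1], [1, 0, 1, 1]))  # OK
print(check_live_performance([1, 0, 1, 1], [0, 1, 0, 1]))  # ALERT
```

Running a check like this on a schedule catches sudden drops; comparing against a slowly updated baseline also helps surface the gradual "rot" mentioned above.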