"
+ ],
+ "text/latex": [
+ "\\begin{tabular}{r|cc}\n",
+ "\t& measure & measurement\\\\\n",
+ "\t\\hline\n",
+ "\t& Measure & Measurem…\\\\\n",
+ "\t\\hline\n",
+ "\t1 & BrierLoss() & $0.3131 \\pm 0.0071$ \\\\\n",
+ "\t2 & AreaUnderCurve() & $0.789 \\pm 0.011$ \\\\\n",
+ "\t3 & Accuracy() & $0.7809 \\pm 0.0072$ \\\\\n",
+ "\\end{tabular}\n"
+ ],
+ "text/plain": [
+ "\u001b[1m3×2 DataFrame\u001b[0m\n",
+ "\u001b[1m Row \u001b[0m│\u001b[1m measure \u001b[0m\u001b[1m measurement \u001b[0m\n",
+ "\u001b[1m \u001b[0m│\u001b[90m Measure \u001b[0m\u001b[90m Measuremen… \u001b[0m\n",
+ "─────┼─────────────────────────────────\n",
+ " 1 │ BrierLoss() 0.3131±0.0071\n",
+ " 2 │ AreaUnderCurve() 0.789±0.011\n",
+ " 3 │ Accuracy() 0.7809±0.0072"
+ ]
+ },
+ "execution_count": 63,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "confidence_intervals_basic_model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As each pair of intervals overlap, it's doubtful the small changes\n",
+ "here can be assigned statistical significance. Default `booster`\n",
+ "hyper-parameters do a pretty good job."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Testing the final model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We now determine the performance of our model on our\n",
+ "lock-and-throw-away-the-key holdout set. To demonstrate\n",
+ "deserialization, we'll pretend we're in a new Julia session (but\n",
+ "have called `import`/`using` on the same packages). Then the\n",
+ "following should suffice to recover our model trained under\n",
+ "\"Hyper-parameter optimization\" above:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 64,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Machine{ProbabilisticTunedModel{RandomSearch,…},…} trained 1 time; caches data\n",
+ " model: MLJTuning.ProbabilisticTunedModel{RandomSearch, MLJIteration.ProbabilisticIteratedModel{MLJBase.ProbabilisticPipeline{NamedTuple{(:continuous_encoder, :feature_selector, :evo_tree_classifier), Tuple{Unsupervised, Unsupervised, Probabilistic}}, MLJModelInterface.predict}}}\n",
+ " args: \n"
+ ]
+ },
+ "execution_count": 64,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "mach_restored = machine(\"tuned_iterated_pipe.jlso\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We compute predictions on the holdout set:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 65,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " \u001b[1mUnivariateFinite{Multiclass{2}}\u001b[22m \n",
+ " \u001b[90m┌ ┐\u001b[39m \n",
+ " \u001b[0mNo \u001b[90m┤\u001b[39m\u001b[38;5;2m■■■■■■■■■■■■■■■■■\u001b[39m\u001b[0m 0.45987161928543624 \u001b[90m \u001b[39m \n",
+ " \u001b[0mYes \u001b[90m┤\u001b[39m\u001b[38;5;2m■■■■■■■■■■■■■■■■■■■■\u001b[39m\u001b[0m 0.5401283807145638 \u001b[90m \u001b[39m \n",
+ " \u001b[90m└ ┘\u001b[39m "
+ ]
+ },
+ "execution_count": 65,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ŷ_tuned = predict(mach_restored, Xtest);\n",
+ "ŷ_tuned[1]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And can compute the final performance measures:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 66,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "┌ Info: Tuned model measurements on test:\n",
+ "│ brier_loss(ŷ_tuned, ytest) |> mean = 0.26876013030554674\n",
+ "│ auc(ŷ_tuned, ytest) = 0.8494239631336405\n",
+ "│ accuracy(mode.(ŷ_tuned), ytest) = 0.8104265402843602\n",
+ "└ @ Main In[66]:1\n"
+ ]
+ }
+ ],
+ "source": [
+ "@info(\"Tuned model measurements on test:\",\n",
+ " brier_loss(ŷ_tuned, ytest) |> mean,\n",
+ " auc(ŷ_tuned, ytest),\n",
+ " accuracy(mode.(ŷ_tuned), ytest)\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For comparison, here's the performance for the basic pipeline model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 67,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "┌ Info: Basic model measurements on test set:\n",
+ "│ brier_loss(ŷ_basic, ytest) |> mean = 0.2815220182102922\n",
+ "│ auc(ŷ_basic, ytest) = 0.8496543778801844\n",
+ "│ accuracy(mode.(ŷ_basic), ytest) = 0.8009478672985781\n",
+ "└ @ Main In[67]:6\n"
+ ]
+ }
+ ],
+ "source": [
+ "mach_basic = machine(pipe, X, y)\n",
+ "fit!(mach_basic, verbosity=0)\n",
+ "\n",
+ "ŷ_basic = predict(mach_basic, Xtest);\n",
+ "\n",
+ "@info(\"Basic model measurements on test set:\",\n",
+ " brier_loss(ŷ_basic, ytest) |> mean,\n",
+ " auc(ŷ_basic, ytest),\n",
+ " accuracy(mode.(ŷ_basic), ytest)\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Julia 1.6.5",
+ "language": "julia",
+ "name": "julia-1.6"
+ },
+ "language_info": {
+ "file_extension": ".jl",
+ "mimetype": "application/julia",
+ "name": "julia",
+ "version": "1.6.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 3
+}
diff --git a/examples/telco/notebook.jl b/examples/telco/notebook.jl
new file mode 100644
index 000000000..85f301ad5
--- /dev/null
+++ b/examples/telco/notebook.jl
@@ -0,0 +1,756 @@
+# # MLJ for Data Scientists in Two Hours
+
+# An application of the [MLJ
+# toolbox](https://alan-turing-institute.github.io/MLJ.jl/dev/) to the
+# Telco Customer Churn dataset, aimed at practicing data scientists
+# new to MLJ (Machine Learning in Julia). This tutorial does not
+# cover exploratory data analysis.
+
+# MLJ is a *multi-paradigm* machine learning toolbox (i.e., not just
+# deep-learning).
+
+# For other MLJ learning resources see the [Learning
+# MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/learning_mlj/)
+# section of the
+# [manual](https://alan-turing-institute.github.io/MLJ.jl/dev/).
+
+# **Topics covered**: Grabbing and preparing a dataset, basic
+# fit/predict workflow, constructing a pipeline to include data
+# pre-processing, estimating performance metrics, ROC curves, confusion
+# matrices, feature importance, basic feature selection, controlling iterative
+# models, hyper-parameter optimization (tuning).
+
+# **Prerequisites for this tutorial.** Previous experience building,
+# evaluating, and optimizing machine learning models using
+# scikit-learn, caret, MLR, weka, or similar tool. No previous
+# experience with MLJ. Only fairly basic familiarity with Julia is
+# required. Uses
+# [DataFrames.jl](https://dataframes.juliadata.org/stable/) but in a
+# minimal way ([this
+# cheatsheet](https://ahsmart.com/pub/data-wrangling-with-data-frames-jl-cheat-sheet/index.html)
+# may help).
+
+# **Time.** Between two and three hours, first time through.
+
+
+# ## Summary of methods and types introduced
+
+# |code | purpose|
+# |:-------|:-------------------------------------------------------|
+# |`OpenML.load(id)` | grab a dataset from [OpenML.org](https://www.openml.org)|
+# |`scitype(X)` | inspect the scientific type (scitype) of object `X`|
+# |`schema(X)` | inspect the column scitypes (scientific types) of a table `X`|
+# |`coerce(X, ...)` | fix column encodings to get appropriate scitypes|
+# |`partition(data, frac1, frac2, ...; rng=...)` | vertically split `data`, which can be a table, vector or matrix|
+# |`unpack(table, f1, f2, ...)` | horizontally split `table` based on conditions `f1`, `f2`, ..., applied to column names|
+# |`@load ModelType pkg=...` | load code defining a model type|
+# |`input_scitype(model)` | inspect the scitype that a model requires for features (inputs)|
+# |`target_scitype(model)`| inspect the scitype that a model requires for the target (labels)|
+# |`ContinuousEncoder` | built-in model type for re-encoding all features as `Continuous`|
+# | `model1 \|> model2 \|> ...` | combine multiple models into a pipeline
+# | `measures("under curve")` | list all measures (metrics) with string "under curve" in documentation
+# | `accuracy(yhat, y)` | compute accuracy of predictions `yhat` against ground truth observations `y`
+# | `auc(yhat, y)`, `brier_loss(yhat, y)` | evaluate two probabilistic measures (`yhat` a vector of probability distributions)
+# | `machine(model, X, y)` | bind `model` to training data `X` (features) and `y` (target)
+# | `fit!(mach, rows=...)` | train machine using specified rows (observation indices)
+# | `predict(mach, rows=...)`, | make in-sample model predictions given specified rows
+# | `predict(mach, Xnew)` | make predictions given new features `Xnew`
+# | `fitted_params(mach)` | inspect learned parameters
+# | `report(mach)` | inspect other outcomes of training
+# | `confmat(yhat, y)` | confusion matrix for predictions `yhat` and ground truth `y`
+# | `roc(yhat, y)` | compute points on the receiver-operator characteristic
+# | `StratifiedCV(nfolds=6)` | 6-fold stratified cross-validation resampling strategy
+# | `Holdout(fraction_train=0.7)` | holdout resampling strategy
+# | `evaluate(model, X, y; resampling=..., options...)` | estimate performance metrics for `model` using the data `X`, `y`
+# | `FeatureSelector()` | transformer for selecting features
+# | `Step(3)` | iteration control for stepping 3 iterations
+# | `NumberSinceBest(6)`, `TimeLimit(60/5)`, `InvalidValue()` | iteration control stopping criteria
+# | `IteratedModel(model=..., controls=..., options...)` | wrap an iterative `model` in control strategies
+# | `range(model, :some_hyperparam, lower=..., upper=...)` | define a numeric range
+# | `RandomSearch()` | random search tuning strategy
+# | `TunedModel(model=..., tuning=..., options...)` | wrap the supervised `model` in specified `tuning` strategy
+
+
+# ## Instantiate a Julia environment
+
+# The following code replicates precisely the set of Julia packages
+# used to develop this tutorial. If this is your first time running
+# the notebook, package instantiation and pre-compilation may take a
+# minute or so to complete. **This step will fail** if the [correct
+# Manifest.toml and Project.toml
+# files](https://github.com/alan-turing-institute/MLJ.jl/tree/dev/examples/telco)
+# are not in the same directory as this notebook.
+
+using Pkg
+Pkg.activate(@__DIR__) # get env from TOML files in same directory as this notebook
+Pkg.instantiate()
+
+
+# ## Warm up: Building a model for the iris dataset
+
+# Before turning to the Telco Customer Churn dataset, we very quickly
+# build a predictive model for Fisher's well-known iris data set, as a way of
+# introducing the main actors in any MLJ workflow. Details that you
+# don't fully grasp should become clearer in the Telco study.
+
+# This section is a condensed adaptation of the [Getting Started
+# example](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#Fit-and-predict)
+# in the MLJ documentation.
+
+# First, using the built-in iris dataset, we load and inspect the features
+# `X_iris` (a table) and target variable `y_iris` (a vector):
+
+using MLJ
+
+#-
+
+const X_iris, y_iris = @load_iris;
+schema(X_iris)
+
+#-
+
+y_iris[1:4]
+
+#
+
+levels(y_iris)
+
+# We load a decision tree model, from the package DecisionTree.jl:
+
+DecisionTree = @load DecisionTreeClassifier pkg=DecisionTree # model type
+model = DecisionTree(min_samples_split=5) # model instance
+
+# In MLJ, a *model* is just a container for hyper-parameters of
+# some learning algorithm. It does not store learned parameters.
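+
+# For example, we can read off the hyper-parameter we set a moment ago:
+
+model.min_samples_split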
+
+# Next, we bind the model together with the available data in what's
+# called a *machine*:
+
+mach = machine(model, X_iris, y_iris)
+
+# A machine is essentially just a model (ie, hyper-parameters) plus data, but
+# it additionally stores *learned parameters* (the tree) once it is
+# trained on some view of the data:
+
+train_rows = vcat(1:60, 91:150); # some row indices (observations are rows not columns)
+fit!(mach, rows=train_rows)
+fitted_params(mach)
+
+# A machine stores some other information enabling [warm
+# restart](https://alan-turing-institute.github.io/MLJ.jl/dev/machines/#Warm-restarts)
+# for some models, but we won't go into that here. You are allowed to
+# access and mutate the `model` parameter:
+
+mach.model.min_samples_split = 10
+fit!(mach, rows=train_rows) # re-train with new hyper-parameter
+
+# Now we can make predictions on some other view of the data, as in
+
+predict(mach, rows=71:73)
+
+# or on completely new data, as in
+
+Xnew = (sepal_length = [5.1, 6.3],
+ sepal_width = [3.0, 2.5],
+ petal_length = [1.4, 4.9],
+ petal_width = [0.3, 1.5])
+yhat = predict(mach, Xnew)
+
+# These are probabilistic predictions which can be manipulated using a
+# widely adopted interface defined in the Distributions.jl
+# package. For example, we can get raw probabilities like this:
+
+pdf.(yhat, "virginica")
+
+
+# We now turn to the Telco dataset.
+
+
+# ## Getting the Telco data
+
+import DataFrames
+
+#-
+
+data = OpenML.load(42178) # data set from OpenML.org
+df0 = DataFrames.DataFrame(data)
+first(df0, 4)
+
+# The object of this tutorial is to build and evaluate supervised
+# learning models to predict the `:Churn` variable, a binary variable
+# measuring customer retention, based on other variables that are
+# relevant.
+
+# In the table, observations correspond to rows, and features to
+# columns, which is the convention for representing all
+# two-dimensional data in MLJ.
+
+
+# ## Type coercion
+
+# > Introduces: `scitype`, `schema`, `coerce`
+
+# A ["scientific
+# type"](https://juliaai.github.io/ScientificTypes.jl/dev/) or
+# *scitype* indicates how MLJ will *interpret* data. For example,
+# `typeof(3.14) == Float64`, while `scitype(3.14) == Continuous` and
+# also `scitype(3.14f0) == Continuous`. In MLJ, model data
+# requirements are articulated using scitypes.
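+
+# We can check such assertions directly (with MLJ loaded):
+
+scitype(3.14), scitype(3.14f0), scitype("cat")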
+
+# Here are common "scalar" scitypes:
+
+# ![](assets/scitypes.png)
+
+# There are also container scitypes. For example, the scitype of any
+# `N`-dimensional array is `AbstractArray{S, N}`, where `S` is the scitype of the
+# elements:
+
+scitype(["cat", "mouse", "dog"])
+
+# The `schema` operator summarizes the column scitypes of a table:
+
+schema(df0) |> DataFrames.DataFrame # converted to DataFrame for better display
+
+# All of the fields being interpreted as `Textual` are really
+# something else, either `Multiclass` or, in the case of
+# `:TotalCharges`, `Continuous`. In fact, `:TotalCharges` is
+# mostly floats wrapped as strings. However, it needs special
+# treatment because some elements consist of a single space, " ",
+# which we'll treat as "0.0".
+
+fix_blanks(v) = map(v) do x
+ if x == " "
+ return "0.0"
+ else
+ return x
+ end
+end
+
+df0.TotalCharges = fix_blanks(df0.TotalCharges);
+
+# Coercing the `:TotalCharges` type to ensure a `Continuous` scitype:
+
+coerce!(df0, :TotalCharges => Continuous);
+
+# Coercing all remaining `Textual` data to `Multiclass`:
+
+coerce!(df0, Textual => Multiclass);
+
+# Finally, we'll coerce our target variable `:Churn` to be
+# `OrderedFactor`, rather than `Multiclass`, to enable a reliable
+# interpretation of metrics like "true positive rate". By convention,
+# the first class is the negative one:
+
+coerce!(df0, :Churn => OrderedFactor)
+levels(df0.Churn) # to check order
+
+# Re-inspecting the scitypes:
+
+schema(df0) |> DataFrames.DataFrame
+
+
+# ## Preparing a holdout set for final testing
+
+# > Introduces: `partition`
+
+# To reduce training times for the purposes of this tutorial, we're
+# going to dump 90% of observations (after shuffling) and split off
+# 30% of the remainder for use as a lock-and-throw-away-the-key
+# holdout set:
+
+df, df_test, df_dumped = partition(df0, 0.07, 0.03, # in ratios 7:3:90
+ stratify=df0.Churn,
+ rng=123);
+
+# The reader interested in including all data can instead do
+# `df, df_test = partition(df0, 0.7, rng=123)`.
+
+
+# ## Splitting data into target and features
+
+# > Introduces: `unpack`
+
+# In the following call, the column with name `:Churn` is copied over
+# to a vector `y`, and every remaining column, except `:customerID`
+# (which contains no useful information) goes into a table `X`. Here
+# `:Churn` is the target variable for which we seek predictions, given
+# new versions of the features `X`.
+
+const y, X = unpack(df, ==(:Churn), !=(:customerID));
+schema(X).names
+
+#-
+
+intersect([:Churn, :customerID], schema(X).names)
+
+# We'll do the same for the holdout data:
+
+const ytest, Xtest = unpack(df_test, ==(:Churn), !=(:customerID));
+
+# ## Loading a model and checking type requirements
+
+# > Introduces: `@load`, `input_scitype`, `target_scitype`
+
+# For tools helping us to identify suitable models, see the [Model
+# Search](https://alan-turing-institute.github.io/MLJ.jl/dev/model_search/#model_search)
+# section of the manual. We will build a gradient tree-boosting model,
+# a popular first choice for structured data like we have here. Model
+# code is contained in a third-party package called
+# [EvoTrees.jl](https://github.com/Evovest/EvoTrees.jl) which is
+# loaded as follows:
+
+Booster = @load EvoTreeClassifier pkg=EvoTrees
+
+# Recall that a *model* is just a container for some algorithm's
+# hyper-parameters. Let's create a `Booster` with default values for
+# the hyper-parameters:
+
+booster = Booster()
+
+# This model is appropriate for the kind of target variable we have because of
+# the following passing test:
+
+scitype(y) <: target_scitype(booster)
+
+# However, our features `X` cannot be directly used with `booster`:
+
+scitype(X) <: input_scitype(booster)
+
+# As it turns out, this is because `booster`, like the majority of MLJ
+# supervised models, expects the features to be `Continuous`. (With
+# some experience, this can be gleaned from `input_scitype(booster)`.)
+# So we need categorical feature encoding, discussed next.
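+
+# For the record, here is the requirement in question:
+
+input_scitype(booster)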
+
+
+# ## Building a model pipeline to incorporate feature encoding
+
+# > Introduces: `ContinuousEncoder`, pipeline operator `|>`
+
+# The built-in `ContinuousEncoder` model transforms an arbitrary table
+# to a table whose features are all `Continuous` (dropping any fields
+# it does not know how to encode). In particular, all `Multiclass`
+# features are one-hot encoded.
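+
+# To see the encoder in isolation, here's a minimal sketch applying it to a
+# made-up two-column table (`gender` and `charge` are hypothetical names):
+
+X_toy = (gender=coerce(["F", "M", "F"], Multiclass), charge=[1.0, 2.0, 3.0])
+mach_toy = machine(ContinuousEncoder(), X_toy)
+fit!(mach_toy, verbosity=0)
+transform(mach_toy, X_toy) # `gender` expands into one-hot columns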
+
+# A *pipeline* is a stand-alone model that internally combines one or
+# more models in a linear (non-branching) pipeline. Here's a pipeline
+# that adds the `ContinuousEncoder` as a pre-processor to the
+# gradient tree-boosting model above:
+
+pipe = ContinuousEncoder() |> booster
+
+# Note that the component models appear as hyper-parameters of
+# `pipe`. Pipelines are an implementation of a more general [model
+# composition](https://alan-turing-institute.github.io/MLJ.jl/dev/composing_models/#Composing-Models)
+# interface provided by MLJ that advanced users may want to learn about.
+
+# From the above display, we see that component model hyper-parameters
+# are now *nested*, but they are still accessible (important in hyper-parameter
+# optimization):
+
+pipe.evo_tree_classifier.max_depth
+
+
+# ## Evaluating the pipeline model's performance
+
+# > Introduces: `measures` (function), **measures:** `brier_loss`, `auc`, `accuracy`;
+# > `machine`, `fit!`, `predict`, `fitted_params`, `report`, `roc`, **resampling strategy** `StratifiedCV`, `evaluate`, `FeatureSelector`
+
+# Without touching our test set `Xtest`, `ytest`, we will estimate the
+# performance of our pipeline model, with default hyper-parameters, in
+# two different ways:
+
+# **Evaluating by hand.** First, we'll do this "by hand" using the `fit!` and `predict`
+# workflow illustrated for the iris data set above, using a
+# holdout resampling strategy. At the same time we'll see how to
+# generate a **confusion matrix**, **ROC curve**, and inspect
+# **feature importances**.
+
+# **Automated performance evaluation.** Next we'll apply the more
+# typical and convenient `evaluate` workflow, but using `StratifiedCV`
+# (stratified cross-validation) which is more informative.
+
+# In any case, we need to choose some measures (metrics) to quantify
+# the performance of our model. For a complete list of measures, one
+# does `measures()`. Or we also can do:
+
+measures("Brier")
+
+# We will be primarily using `brier_loss`, but also `auc` (area under
+# the ROC curve) and `accuracy`.
+
+
+# ### Evaluating by hand (with a holdout set)
+
+# Our pipeline model can be trained just like the decision tree model
+# we built for the iris data set. Binding all non-test data to the
+# pipeline model:
+
+mach_pipe = machine(pipe, X, y)
+
+# We already encountered the `partition` method above. Here we apply
+# it to row indices, instead of data containers, as `fit!` and
+# `predict` only need a *view* of the data to work.
+
+train, validation = partition(1:length(y), 0.7)
+fit!(mach_pipe, rows=train)
+
+# We note in passing that we can access two kinds of information from a trained machine:
+
+# - The **learned parameters** (eg, coefficients of a linear model): We use `fitted_params(mach_pipe)`
+# - Other **by-products of training** (eg, feature importances): We use `report(mach_pipe)`
+
+fp = fitted_params(mach_pipe);
+keys(fp)
+
+# For example, we can check that the encoder did not actually drop any features:
+
+Set(fp.continuous_encoder.features_to_keep) == Set(schema(X).names)
+
+# And, from the report, extract feature importances:
+
+rpt = report(mach_pipe)
+keys(rpt.evo_tree_classifier)
+
+#-
+
+fi = rpt.evo_tree_classifier.feature_importances
+feature_importance_table =
+ (feature=Symbol.(first.(fi)), importance=last.(fi)) |> DataFrames.DataFrame
+
+# For models not reporting feature importances, we recommend the
+# [Shapley.jl](https://expandingman.gitlab.io/Shapley.jl/) package.
+
+# Returning to predictions and evaluations of our measures:
+
+ŷ = predict(mach_pipe, rows=validation);
+@info("Measurements",
+ brier_loss(ŷ, y[validation]) |> mean,
+ auc(ŷ, y[validation]),
+ accuracy(mode.(ŷ), y[validation])
+ )
+
+# Note that we need `mode` in the last case because `accuracy` expects
+# point predictions, not probabilistic ones. (One can alternatively
+# use `predict_mode` to generate the predictions.)
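+
+# For example, this returns the same point predictions directly:
+
+predict_mode(mach_pipe, rows=validation)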
+
+# While we're here, let's also generate a **confusion matrix** and
+# [receiver-operator
+# characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
+# (ROC):
+
+confmat(mode.(ŷ), y[validation])
+
+# Note: Importing the plotting package and calling the plotting
+# functions for the first time can take a minute or so.
+
+using Plots
+
+#-
+
+roc_curve = roc(ŷ, y[validation])
+plt = scatter(roc_curve, legend=false)
+plot!(plt, xlab="false positive rate", ylab="true positive rate")
+plot!([0, 1], [0, 1], linewidth=2, linestyle=:dash, color=:black)
+
+
+# ### Automated performance evaluation (more typical workflow)
+
+# We can also get performance estimates with a single call to the
+# `evaluate` function, which also allows for more complicated
+# resampling - in this case stratified cross-validation. To make this
+# more comprehensive, we set `repeats=3` below to make our
+# cross-validation "Monte Carlo" (3 random size-6 partitions of the
+# observation space, for a total of 18 folds) and set
+# `acceleration=CPUThreads()` to parallelize the computation.
+
+# We choose a `StratifiedCV` resampling strategy; the complete list of options is
+# [here](https://alan-turing-institute.github.io/MLJ.jl/dev/evaluating_model_performance/#Built-in-resampling-strategies).
+
+e_pipe = evaluate(pipe, X, y,
+ resampling=StratifiedCV(nfolds=6, rng=123),
+ measures=[brier_loss, auc, accuracy],
+ repeats=3,
+ acceleration=CPUThreads())
+
+# (There is also a version of `evaluate` for machines. Query the
+# `evaluate` and `evaluate!` doc-strings to learn more about these
+# functions and what the `PerformanceEvaluation` object `e_pipe` records.)
+
+# While [less than ideal](https://arxiv.org/abs/2104.00673), let's
+# adopt the common practice of using the standard error of a
+# cross-validation score as an estimate of the uncertainty of a
+# performance measure's expected value. Here's a utility function to
+# calculate confidence intervals for our performance estimates based
+# on this practice, and its application to the current evaluation:
+
+using Measurements
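+
+# The `±` infix operator from Measurements.jl attaches an uncertainty to a
+# value, as in:
+
+0.5 ± 0.01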
+
+#-
+
+function confidence_intervals(e)
+ measure = e.measure
+ nfolds = length(e.per_fold[1])
+ measurement = [e.measurement[j] ± std(e.per_fold[j])/sqrt(nfolds - 1)
+ for j in eachindex(measure)]
+ table = (measure=measure, measurement=measurement)
+ return DataFrames.DataFrame(table)
+end
+
+const confidence_intervals_basic_model = confidence_intervals(e_pipe)
+
+
+# ## Filtering out unimportant features
+
+# > Introduces: `FeatureSelector`
+
+# Before continuing, we'll modify our pipeline to drop those features
+# with low feature importance, to speed up later optimization:
+
+unimportant_features = filter(:importance => <(0.005), feature_importance_table).feature
+
+pipe2 = ContinuousEncoder() |>
+ FeatureSelector(features=unimportant_features, ignore=true) |> booster
+
+
+# ## Wrapping our iterative model in control strategies
+
+# > Introduces: **control strategies:** `Step`, `NumberSinceBest`, `TimeLimit`, `InvalidValue`, **model wrapper** `IteratedModel`, **resampling strategy:** `Holdout`
+
+# We want to optimize the hyper-parameters of our model. Since our
+# model is iterative, these parameters include the (nested) iteration
+# parameter `pipe.evo_tree_classifier.nrounds`. Sometimes this
+# parameter is optimized first, fixed, and then maybe optimized again
+# after the other parameters. Here we take a more principled approach,
+# **wrapping our model in a control strategy** that makes it
+# "self-iterating". The strategy applies a stopping criterion to
+# *out-of-sample* estimates of the model performance, constructed
+# using an internally constructed holdout set. In this way, we avoid
+# some data hygiene issues, and, when we subsequently optimize other
+# parameters, we will always be using an optimal number of
+# iterations.
+
+# Note that this approach can be applied to any iterative MLJ model,
+# eg, the neural network models provided by
+# [MLJFlux.jl](https://github.com/FluxML/MLJFlux.jl).
+
+# First, we select appropriate controls from [this
+# list](https://alan-turing-institute.github.io/MLJ.jl/dev/controlling_iterative_models/#Controls-provided):
+
+controls = [
+ Step(1), # to increment iteration parameter (`pipe.nrounds`)
+ NumberSinceBest(4), # main stopping criterion
+ TimeLimit(2/3600), # never train more than 2 sec
+ InvalidValue() # stop if NaN or ±Inf encountered
+]
+
+# Now we wrap our pipeline model using the `IteratedModel` wrapper,
+# being sure to specify the `measure` on which internal estimates of
+# the out-of-sample performance will be based:
+
+iterated_pipe = IteratedModel(model=pipe2,
+ controls=controls,
+ measure=brier_loss,
+ resampling=Holdout(fraction_train=0.7))
+
+# We've set `resampling=Holdout(fraction_train=0.7)` to arrange that
+# data attached to our model should be internally split into a train
+# set (70%) and a holdout set (30%) for determining the out-of-sample
+# estimate of the Brier loss.
+
+# For demonstration purposes, let's bind `iterated_pipe` to all data
+# not in our don't-touch holdout set, and train on all of that data:
+
+mach_iterated_pipe = machine(iterated_pipe, X, y)
+fit!(mach_iterated_pipe);
+
+# To recap, internally this training is split into two separate steps:
+
+# - A controlled iteration step, fitting on the internal 70% train subset, with the total number of iterations determined by the specified stopping criteria (based on out-of-sample performance estimates computed on the internal 30% holdout set)
+# - A final step that trains the atomic model on *all* available
+# data using the number of iterations determined in the first step. Calling `predict` on `mach_iterated_pipe` means using the learned parameters of the second step.
+
+
+# ## Hyper-parameter optimization (model tuning)
+
+# > Introduces: `range`, **model wrapper** `TunedModel`, `RandomSearch`
+
+# We now turn to hyper-parameter optimization. A tool not discussed
+# here is the `learning_curve` function, which can be useful when
+# wanting to visualize the effect of changes to a *single*
+# hyper-parameter (which could be an iteration parameter). See, for
+# example, [this section of the
+# manual](https://alan-turing-institute.github.io/MLJ.jl/dev/learning_curves/)
+# or [this
+# tutorial](https://github.com/ablaom/MLJTutorial.jl/blob/dev/notebooks/04_tuning/notebook.ipynb).
+
+# Fine tuning the hyper-parameters of a gradient booster can be
+# somewhat involved. Here we settle for simultaneously optimizing two
+# key parameters: `max_depth` and `η` (the learning rate).
+
+# Like iteration control, **model optimization in MLJ is implemented as
+# a model wrapper**, called `TunedModel`. After wrapping a model in a
+# tuning strategy and binding the wrapped model to data in a machine
+# called `mach`, calling `fit!(mach)` instigates a search for optimal
+# model hyperparameters, within a specified range, and then uses all
+# supplied data to train the best model. To predict using that model,
+# one then calls `predict(mach, Xnew)`. In this way the wrapped model
+# may be viewed as a "self-tuning" version of the unwrapped
+# model. That is, wrapping the model simply transforms certain
+# hyper-parameters into learned parameters (just as `IteratedModel`
+# does for an iteration parameter).
+
+# To start with, we define ranges for the parameters of
+# interest. Since these parameters are nested, let's force a
+# display of our model to a larger depth:
+
+show(iterated_pipe, 2)
+
+#-
+
+p1 = :(model.evo_tree_classifier.η)
+p2 = :(model.evo_tree_classifier.max_depth)
+
+r1 = range(iterated_pipe, p1, lower=-2, upper=-0.5, scale=x->10^x)
+r2 = range(iterated_pipe, p2, lower=2, upper=6)
+
+# Nominal ranges are defined by specifying `values` instead of `lower`
+# and `upper`.
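+
+# For example, the numeric range `r2` above could equally be declared as an
+# explicit nominal range (`r2_nominal` is purely illustrative and not used
+# below):
+
+r2_nominal = range(iterated_pipe, p2, values=[2, 3, 4, 5, 6])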
+
+# Next, we choose an optimization strategy from [this
+# list](https://alan-turing-institute.github.io/MLJ.jl/dev/tuning_models/#Tuning-Models):
+
+tuning = RandomSearch(rng=123)
+
+# Then we wrap the model, specifying a `resampling` strategy and a
+# `measure`, as we did for `IteratedModel`. In fact, we can include a
+# battery of `measures`; by default, optimization is with respect to
+# performance estimates based on the first measure, but estimates for
+# all measures can be accessed from the model's `report`.
+
+# The keyword `n` specifies the total number of models (sets of
+# hyper-parameters) to evaluate.
+
+tuned_iterated_pipe = TunedModel(model=iterated_pipe,
+ range=[r1, r2],
+ tuning=tuning,
+ measures=[brier_loss, auc, accuracy],
+ resampling=StratifiedCV(nfolds=6, rng=123),
+ acceleration=CPUThreads(),
+ n=40)
+
+# To save time, we skip the `repeats` here.
+
+# Binding our final model to data and training:
+
+mach_tuned_iterated_pipe = machine(tuned_iterated_pipe, X, y)
+fit!(mach_tuned_iterated_pipe)
+
+# As explained above, the training we have just performed was split
+# internally into two separate steps:
+
+# - A step to determine the parameter values that optimize the aggregated cross-validation scores
+# - A final step that trains the optimal model on *all* available data. Future predictions `predict(mach_tuned_iterated_pipe, ...)` are based on this final training step.
+
+# From `report(mach_tuned_iterated_pipe)` we can extract details about
+# the optimization procedure. For example:
+
+rpt2 = report(mach_tuned_iterated_pipe);
+best_booster = rpt2.best_model.model.evo_tree_classifier
+
+#-
+
+@info "Optimal hyper-parameters:" best_booster.max_depth best_booster.η;
+
+# Using the `confidence_intervals` function we defined earlier:
+
+e_best = rpt2.best_history_entry
+confidence_intervals(e_best)
+
+# Digging a little deeper, we can learn what stopping criterion was
+# applied in the case of the optimal model, and how many iterations
+# were required:
+
+rpt2.best_report.controls |> collect
+
+# Finally, we can visualize the optimization results:
+
+plot(mach_tuned_iterated_pipe, size=(600,450))
+
+
+# ## Saving our model
+
+# > Introduces: `MLJ.save`
+
+# Here's how to serialize our final, trained self-iterating,
+# self-tuning pipeline machine:
+
+MLJ.save("tuned_iterated_pipe.jlso", mach_tuned_iterated_pipe)
+
+
+# We'll deserialize this in "Testing the final model" below.
+
+# ## Final performance estimate
+
+# Finally, to get an even more accurate estimate of performance, we
+# can evaluate our model using stratified cross-validation and all the
+# data attached to our machine. Because this evaluation implies
+# [nested
+# resampling](https://mlr.mlr-org.com/articles/tutorial/nested_resampling.html),
+# this computation takes quite a bit longer than the previous one
+# (which is being repeated six times, using 5/6th of the data each
+# time):
+
+e_tuned_iterated_pipe = evaluate(tuned_iterated_pipe, X, y,
+ resampling=StratifiedCV(nfolds=6, rng=123),
+ measures=[brier_loss, auc, accuracy])
+
+#-
+
+confidence_intervals(e_tuned_iterated_pipe)
+
+# For comparison, here are the confidence intervals for the basic
+# pipeline model (no feature selection and default hyperparameters):
+
+confidence_intervals_basic_model
+
+# As each pair of intervals overlaps, it's doubtful the small changes
+# here can be assigned statistical significance. Default `booster`
+# hyper-parameters do a pretty good job.
+
+
+# ## Testing the final model
+
+# We now determine the performance of our model on our
+# lock-and-throw-away-the-key holdout set. To demonstrate
+# deserialization, we'll pretend we're in a new Julia session (but
+# have called `import`/`using` on the same packages). Then the
+# following should suffice to recover our model trained under
+# "Hyper-parameter optimization" above:
+
+mach_restored = machine("tuned_iterated_pipe.jlso")
+
+# We compute predictions on the holdout set:
+
+ŷ_tuned = predict(mach_restored, Xtest);
+ŷ_tuned[1]
+
+# And we can compute the final performance measures:
+
+@info("Tuned model measurements on test:",
+ brier_loss(ŷ_tuned, ytest) |> mean,
+ auc(ŷ_tuned, ytest),
+ accuracy(mode.(ŷ_tuned), ytest)
+ )
+
+# For comparison, here's the performance of the basic pipeline model:
+
+mach_basic = machine(pipe, X, y)
+fit!(mach_basic, verbosity=0)
+
+ŷ_basic = predict(mach_basic, Xtest);
+
+@info("Basic model measurements on test set:",
+ brier_loss(ŷ_basic, ytest) |> mean,
+ auc(ŷ_basic, ytest),
+ accuracy(mode.(ŷ_basic), ytest)
+ )
+
diff --git a/examples/telco/notebook.pluto.jl b/examples/telco/notebook.pluto.jl
new file mode 100644
index 000000000..76c5e650e
--- /dev/null
+++ b/examples/telco/notebook.pluto.jl
@@ -0,0 +1,1319 @@
+### A Pluto.jl notebook ###
+# v0.16.0
+
+using Markdown
+using InteractiveUtils
+
+# ╔═╡ f0cc864c-8b26-441f-9bca-7c69b794f8ce
+md"# MLJ for Data Scientists in Two Hours"
+
+# ╔═╡ 8a6670b8-96a8-4a5d-b795-033f6f2a0674
+md"""
+An application of the [MLJ
+toolbox](https://alan-turing-institute.github.io/MLJ.jl/dev/) to the
+Telco Customer Churn dataset, aimed at practicing data scientists
+new to MLJ (Machine Learning in Julia). This tutorial does not
+cover exploratory data analysis.
+"""
+
+# ╔═╡ aa49e638-95dc-4249-935f-ddf6a6bfbbdd
+md"""
+MLJ is a *multi-paradigm* machine learning toolbox (i.e., not just
+deep-learning).
+"""
+
+# ╔═╡ b04c4790-59e0-42a3-af2a-25235e544a31
+md"""
+For other MLJ learning resources see the [Learning
+MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/learning_mlj/)
+section of the
+[manual](https://alan-turing-institute.github.io/MLJ.jl/dev/).
+"""
+
+# ╔═╡ 4eb8dff4-c23a-4b41-8af5-148d95ea2900
+md"""
+**Topics covered**: Grabbing and preparing a dataset, basic
+fit/predict workflow, constructing a pipeline to include data
+pre-processing, estimating performance metrics, ROC curves, confusion
+matrices, feature importance, basic feature selection, controlling iterative
+models, hyper-parameter optimization (tuning).
+"""
+
+# ╔═╡ a583d175-d623-4888-a6bf-47194d7e8e12
+md"""
+**Prerequisites for this tutorial.** Previous experience building,
+evaluating, and optimizing machine learning models using
+scikit-learn, caret, MLR, weka, or similar tool. No previous
+experience with MLJ. Only fairly basic familiarity with Julia is
+required. Uses
+[DataFrames.jl](https://dataframes.juliadata.org/stable/) but in a
+minimal way ([this
+cheatsheet](https://ahsmart.com/pub/data-wrangling-with-data-frames-jl-cheat-sheet/index.html)
+may help).
+"""
+
+# ╔═╡ 4830fc64-3d70-4869-828a-4cc485149963
+md"**Time.** Between two and three hours, first time through."
+
+# ╔═╡ 28197138-d6b7-433c-9e54-3f7b3ca87ecb
+md"## Summary of methods and types introduced"
+
+# ╔═╡ 39c599b2-3c1f-4292-ba1f-6363f43ec697
+md"""
+|code | purpose|
+|:-------|:-------------------------------------------------------|
+|`OpenML.load(id)` | grab a dataset from [OpenML.org](https://www.openml.org)|
+|`scitype(X)` | inspect the scientific type (scitype) of object `X`|
+|`schema(X)` | inspect the column scitypes (scientific types) of a table `X`|
+|`coerce(X, ...)` | fix column encodings to get appropriate scitypes|
+|`partition(data, frac1, frac2, ...; rng=...)` | vertically split `data`, which can be a table, vector or matrix|
+|`unpack(table, f1, f2, ...)` | horizontally split `table` based on conditions `f1`, `f2`, ..., applied to column names|
+|`@load ModelType pkg=...` | load code defining a model type|
+|`input_scitype(model)` | inspect the scitype that a model requires for features (inputs)|
+|`target_scitype(model)`| inspect the scitype that a model requires for the target (labels)|
+|`ContinuousEncoder` | built-in model type for re-encoding all features as `Continuous`|
+| `model1 \|> model2 \|> ...` | combine multiple models into a pipeline
+| `measures("under curve")` | list all measures (metrics) with string "under curve" in documentation
+| `accuracy(yhat, y)` | compute accuracy of predictions `yhat` against ground truth observations `y`
+| `auc(yhat, y)`, `brier_loss(yhat, y)` | evaluate two probabilistic measures (`yhat` a vector of probability distributions)
+| `machine(model, X, y)` | bind `model` to training data `X` (features) and `y` (target)
+| `fit!(mach, rows=...)` | train machine using specified rows (observation indices)
+| `predict(mach, rows=...)`, | make in-sample model predictions given specified rows
+| `predict(mach, Xnew)` | make predictions given new features `Xnew`
+| `fitted_params(mach)` | inspect learned parameters
+| `report(mach)` | inspect other outcomes of training
+| `confmat(yhat, y)` | confusion matrix for predictions `yhat` and ground truth `y`
+| `roc(yhat, y)` | compute points on the receiver-operator characteristic
+| `StratifiedCV(nfolds=6)` | 6-fold stratified cross-validation resampling strategy
+| `Holdout(fraction_train=0.7)` | holdout resampling strategy
+| `evaluate(model, X, y; resampling=..., options...)` | estimate performance metrics for `model` using the data `X`, `y`
+| `FeatureSelector()` | transformer for selecting features
+| `Step(3)` | iteration control for stepping 3 iterations
+| `NumberSinceBest(6)`, `TimeLimit(60/5)`, `InvalidValue()` | iteration control stopping criteria
+| `IteratedModel(model=..., controls=..., options...)` | wrap an iterative `model` in control strategies
+| `range(model, :some_hyperparam, lower=..., upper=...)` | define a numeric range
+| `RandomSearch()` | random search tuning strategy
+| `TunedModel(model=..., tuning=..., options...)` | wrap the supervised `model` in specified `tuning` strategy
+"""
+
+# ╔═╡ 7c0464a0-4114-46bf-95ea-2955abd45275
+md"## Instantiate a Julia environment"
+
+# ╔═╡ 256869d0-5e0c-42af-b1b3-dd47e367ba54
+md"""
+The following code replicates precisely the set of Julia packages
+used to develop this tutorial. If this is your first time running
+the notebook, package instantiation and pre-compilation may take a
+minute or so to complete. **This step will fail** if the [correct
+Manifest.toml and Project.toml
+files](https://github.com/alan-turing-institute/MLJ.jl/tree/dev/examples/telco)
+are not in the same directory as this notebook.
+"""
+
+# ╔═╡ 60fe49c1-3434-4a77-8d7e-8e449afd1c48
+begin
+ using Pkg
+ Pkg.activate(@__DIR__) # get env from TOML files in same directory as this notebook
+ Pkg.instantiate()
+end
+
+# ╔═╡ e8c13e9d-7910-4a0e-a949-7f3bd292fe31
+md"## Warm up: Building a model for the iris dataset"
+
+# ╔═╡ 6bf6ef98-302f-478c-8514-99938a2932db
+md"""
+Before turning to the Telco Customer Churn dataset, we very quickly
+build a predictive model for Fisher's well-known iris data set, as a way of
+introducing the main actors in any MLJ workflow. Details that you
+don't fully grasp should become clearer in the Telco study.
+"""
+
+# ╔═╡ 33ca287e-8cba-47d1-a0de-1721c1bc2df2
+md"""
+This section is a condensed adaptation of the [Getting Started
+example](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#Fit-and-predict)
+in the MLJ documentation.
+"""
+
+# ╔═╡ d5b8bf1d-6c9a-46c1-bca8-7eec7950fd82
+md"""
+First, using the built-in iris dataset, we load and inspect the features
+`X_iris` (a table) and target variable `y_iris` (a vector):
+"""
+
+# ╔═╡ 9b7b9ade-9318-4b95-9873-1b1430e635cc
+using MLJ
+
+# ╔═╡ 4f3f061d-259d-4479-b43e-d0ebe87da176
+begin
+ const X_iris, y_iris = @load_iris;
+ schema(X_iris)
+end
+
+# ╔═╡ a546f372-4ae0-48c2-9009-4cfb2012998f
+y_iris[1:4]
+
+# ╔═╡ f8f79dd2-b6de-48e8-abd3-cb77d7a79683
+levels(y_iris)
+
+# ╔═╡ 617d63f6-8a62-40db-879e-dc128f3db7b3
+md"We load a decision tree model, from the package DecisionTree.jl:"
+
+# ╔═╡ ceda7ed0-3b98-44b3-a367-9aa3c6cf34d0
+begin
+ DecisionTree = @load DecisionTreeClassifier pkg=DecisionTree # model type
+ model = DecisionTree(min_samples_split=5) # model instance
+end
+
+# ╔═╡ e69cd764-e4b8-4ff8-bf32-94657e65284e
+md"""
+In MLJ, a *model* is just a container for hyper-parameters of
+some learning algorithm. It does not store learned parameters.
+"""
+
+# ╔═╡ 4a175e0f-4b87-4b53-9afd-65a5b5facac9
+md"""
+Next, we bind the model together with the available data in what's
+called a *machine*:
+"""
+
+# ╔═╡ 3c23e9f8-49bd-4cde-b6c8-54f5ed90a964
+mach = machine(model, X_iris, y_iris)
+
+# ╔═╡ 6ad971c7-9fe5-45a9-9292-fe822525fc77
+md"""
+A machine is essentially just a model (ie, hyper-parameters) plus data, but
+it additionally stores *learned parameters* (the tree) once it is
+trained on some view of the data:
+"""
+
+# ╔═╡ 63151d31-8428-4ebd-ae5e-4b83dcbc9675
+begin
+ train_rows = vcat(1:60, 91:150); # some row indices (observations are rows not columns)
+ fit!(mach, rows=train_rows)
+ fitted_params(mach)
+end
+
+# ╔═╡ 0f978839-cc95-4c3a-8a29-32f11452654a
+md"""
+A machine stores some other information enabling [warm
+restart](https://alan-turing-institute.github.io/MLJ.jl/dev/machines/#Warm-restarts)
+for some models, but we won't go into that here. You are allowed to
+access and mutate the `model` parameter:
+"""
+
+# ╔═╡ 5edc98cd-df7d-428a-a5f1-151fcbe229a2
+begin
+ mach.model.min_samples_split = 10
+ fit!(mach, rows=train_rows) # re-train with new hyper-parameter
+end
+
+# ╔═╡ 1848109f-c94c-4cdd-81bc-2d5603785a09
+md"Now we can make predictions on some other view of the data, as in"
+
+# ╔═╡ eee948b7-2b42-439e-9d87-1d1fbb0e3997
+predict(mach, rows=71:73)
+
+# ╔═╡ 7c98e10b-1fa3-4ab6-b950-fddef2a1fb10
+md"or on completely new data, as in"
+
+# ╔═╡ 413d9c3c-a68e-4bf0-951b-64cb2a36c8e3
+begin
+ Xnew = (sepal_length = [5.1, 6.3],
+ sepal_width = [3.0, 2.5],
+ petal_length = [1.4, 4.9],
+ petal_width = [0.3, 1.5])
+ yhat = predict(mach, Xnew)
+end
+
+# ╔═╡ eb79663c-c671-4c4c-b0e6-441461cc8770
+md"""
+These are probabilistic predictions which can be manipulated using a
+widely adopted interface defined in the Distributions.jl
+package. For example, we can get raw probabilities like this:
+"""
+
+# ╔═╡ ae9e9377-552c-4186-8cb0-de601961bc02
+pdf.(yhat, "virginica")
+
+# ╔═╡ 5c70ee06-edb9-4789-a87d-950c50fb2955
+md"We now turn to the Telco dataset."
+
+# ╔═╡ bde7bc43-a4bb-4a42-8448-4bcb089096cd
+md"## Getting the Telco data"
+
+# ╔═╡ d5a9e9b8-c67e-4bdd-a012-a84240254fb6
+import DataFrames
+
+# ╔═╡ 8de2c7b4-1950-4f05-bbdd-9799f7bb2e60
+begin
+ data = OpenML.load(42178) # data set from OpenML.org
+ df0 = DataFrames.DataFrame(data)
+ first(df0, 4)
+end
+
+# ╔═╡ 768f369a-5f3d-4dcc-97a7-88e3af4f10ee
+md"""
+The object of this tutorial is to build and evaluate supervised
+learning models to predict the `:Churn` variable, a binary variable
+measuring customer retention, based on other variables that are
+relevant.
+"""
+
+# ╔═╡ bc4c0eb9-3b5f-49fc-b372-8c9866e51852
+md"""
+In the table, observations correspond to rows, and features to
+columns, which is the convention for representing all
+two-dimensional data in MLJ.
+"""
+
+# ╔═╡ f04becbb-42a6-409f-8f3d-b82f1e7b6f7a
+md"## Type coercion"
+
+# ╔═╡ 1d08ed75-5377-4971-ab04-579e5608ae53
+md"> Introduces: `scitype`, `schema`, `coerce`"
+
+# ╔═╡ eea585d4-d55b-4ccc-86cf-35420d9e6995
+md"""
+A ["scientific
+type"](https://juliaai.github.io/ScientificTypes.jl/dev/) or
+*scitype* indicates how MLJ will *interpret* data. For example,
+`typeof(3.14) == Float64`, while `scitype(3.14) == Continuous` and
+also `scitype(3.14f0) == Continuous`. In MLJ, model data
+requirements are articulated using scitypes.
+"""
+
+# ╔═╡ ca482134-299b-459a-a29a-8d0445351914
+md"Here are common \"scalar\" scitypes:"
+
+# ╔═╡ 5c7c97e4-fd8b-4ae5-be65-28d0fcca50a8
+md"![](assets/scitypes.png)"
+
+# ╔═╡ fd103be7-6fc9-43bc-9a2e-c468345d87d1
+md"""
+There are also container scitypes. For example, the scitype of any
+`N`-dimensional array is `AbstractArray{S, N}`, where `S` is the scitype of the
+elements:
+"""
+
+# ╔═╡ 9d2f0b19-2942-47ac-b5fa-cb79ebf595ef
+scitype(["cat", "mouse", "dog"])
+
+# ╔═╡ 1d18aca8-11f4-4104-91c5-524da38aa391
+md"The `schema` operator summarizes the column scitypes of a table:"
+
+# ╔═╡ accd3a09-5371-45ab-ad90-b4405b216771
+schema(df0) |> DataFrames.DataFrame # converted to DataFrame for better display
+
+# ╔═╡ bde0406f-3964-42b8-895b-997a92b731e0
+md"""
+All of the fields being interpreted as `Textual` are really
+something else, either `Multiclass` or, in the case of
+`:TotalCharges`, `Continuous`. In fact, `:TotalCharges` is
+mostly floats wrapped as strings. However, it needs special
+treatment because some elements consist of a single space, " ",
+which we'll treat as "0.0".
+"""
+
+# ╔═╡ 25e60566-fc64-4f35-a525-e9b04a4bd246
+begin
+ fix_blanks(v) = map(v) do x
+ if x == " "
+ return "0.0"
+ else
+ return x
+ end
+ end
+
+ df0.TotalCharges = fix_blanks(df0.TotalCharges);
+end
+
+# ╔═╡ 441317a8-3b72-4bf1-80f0-ba7781e173cf
+md"Coercing the `:TotalCharges` type to ensure a `Continuous` scitype:"
+
+# ╔═╡ d2f2b67e-d06e-4c0b-9cbc-9a2c39793333
+coerce!(df0, :TotalCharges => Continuous);
+
+# ╔═╡ 7c93b820-cbce-48be-b887-2b16710e5502
+md"Coercing all remaining `Textual` data to `Multiclass`:"
+
+# ╔═╡ 5ac4a13e-9983-41dd-9452-5298a8a4a401
+coerce!(df0, Textual => Multiclass);
+
+# ╔═╡ 9f28e6c3-5f18-4fb2-b017-96f1602f2cad
+md"""
+Finally, we'll coerce our target variable `:Churn` to be
+`OrderedFactor`, rather than `Multiclass`, to enable a reliable
+interpretation of metrics like "true positive rate". By convention,
+the first class is the negative one:
+"""
+
+# ╔═╡ c9506e5b-c111-442c-8be2-2a4017c45345
+begin
+ coerce!(df0, :Churn => OrderedFactor)
+ levels(df0.Churn) # to check order
+end
+
+# ╔═╡ 8c865df5-ac95-438f-a7ad-80d84f5b0070
+md"Re-inspecting the scitypes:"
+
+# ╔═╡ 22690828-0992-4ed0-8378-5aa686f0b407
+schema(df0) |> DataFrames.DataFrame
+
+# ╔═╡ 3bd753fc-163a-4b50-9f43-49763e8691a1
+md"## Preparing a holdout set for final testing"
+
+# ╔═╡ 7b12673f-cc63-4cd9-bb0e-3a45761c733a
+md"> Introduces: `partition`"
+
+# ╔═╡ 3c98ab4e-47cd-4247-9979-7044b2f2df33
+md"""
+To reduce training times for the purposes of this tutorial, we're
+going to dump 90% of observations (after shuffling) and split off
+30% of the remainder for use as a lock-and-throw-away-the-key
+holdout set:
+"""
+
+# ╔═╡ ea010966-bb7d-4c79-b2a1-fbbde543f620
+df, df_test, df_dumped = partition(df0, 0.07, 0.03, # in ratios 7:3:90
+ stratify=df0.Churn,
+ rng=123);
+
+# ╔═╡ 401c6f14-1128-4df1-910b-f8a9a217eff2
+md"""
+The reader interested in including all data can instead do
+`df, df_test = partition(df0, 0.7, rng=123)`.
+"""
+
+# ╔═╡ 81aa1aa9-27d9-423f-aa36-d2e8546da46a
+md"## Splitting data into target and features"
+
+# ╔═╡ fff47e76-d7ac-4377-88a1-a2bf91434413
+md"> Introduces: `unpack`"
+
+# ╔═╡ f1dbbc22-9844-4e1c-a1cb-54e34396a855
+md"""
+In the following call, the column with name `:Churn` is copied over
+to a vector `y`, and every remaining column, except `:customerID`
+(which contains no useful information) goes into a table `X`. Here
+`:Churn` is the target variable for which we seek predictions, given
+new versions of the features `X`.
+"""
+
+# ╔═╡ 20713fb3-3a3b-42e7-8037-0de5806e1a54
+begin
+ const y, X = unpack(df, ==(:Churn), !=(:customerID));
+ schema(X).names
+end
+
+# ╔═╡ 369cb359-8512-4f80-9961-a09632c33fb0
+intersect([:Churn, :customerID], schema(X).names)
+
+# ╔═╡ af484868-b29c-4808-b7cc-46f4ef988c68
+md"We'll do the same for the holdout data:"
+
+# ╔═╡ 8f3eda66-6183-4fe4-90fb-7f2721f6fcc7
+const ytest, Xtest = unpack(df_test, ==(:Churn), !=(:customerID));
+
+# ╔═╡ 598142b5-0828-49ac-af65-ccd25ecb9818
+md"## Loading a model and checking type requirements"
+
+# ╔═╡ c5309d64-7409-4148-8890-979691212d9b
+md"> Introduces: `@load`, `input_scitype`, `target_scitype`"
+
+# ╔═╡ f97969e2-c15c-42cf-a6fa-eaf14df5d44b
+md"""
+For tools helping us to identify suitable models, see the [Model
+Search](https://alan-turing-institute.github.io/MLJ.jl/dev/model_search/#model_search)
+section of the manual. We will build a gradient tree-boosting model,
+a popular first choice for structured data like we have here. Model
+code is contained in a third-party package called
+[EvoTrees.jl](https://github.com/Evovest/EvoTrees.jl) which is
+loaded as follows:
+"""
+
+# ╔═╡ edf1d726-542a-4fc1-8025-5084804a9f6c
+Booster = @load EvoTreeClassifier pkg=EvoTrees
+
+# ╔═╡ ee41a121-df0d-40ad-9e90-09023d201062
+md"""
+Recall that a *model* is just a container for some algorithm's
+hyper-parameters. Let's create a `Booster` with default values for
+the hyper-parameters:
+"""
+
+# ╔═╡ be6cef47-d585-45fe-b7bb-2f33ef765cc0
+booster = Booster()
+
+# ╔═╡ 452edd51-b878-46d0-9625-01172c4a0081
+md"""
+This model is appropriate for the kind of target variable we have because of
+the following passing test:
+"""
+
+# ╔═╡ fdd20843-c981-41ad-af4f-11c7de9e21dd
+scitype(y) <: target_scitype(booster)
+
+# ╔═╡ 2b25f9cb-d12d-4a6f-8dba-241d9b744683
+md"However, our features `X` cannot be directly used with `booster`:"
+
+# ╔═╡ b139c43b-382b-415a-a6e5-2f0b4dca5c59
+scitype(X) <: input_scitype(booster)
+
+# ╔═╡ 90a7fe9a-934f-445a-8550-20498aa03ed0
+md"""
+As it turns out, this is because `booster`, like the majority of MLJ
+supervised models, expects the features to be `Continuous`. (With
+some experience, this can be gleaned from `input_scitype(booster)`.)
+So we need categorical feature encoding, discussed next.
+"""
+
+# ╔═╡ fc97e937-66ca-4434-9e7b-5b943cf6b560
+md"## Building a model pipeline to incorporate feature encoding"
+
+# ╔═╡ b1d95557-51fc-4f1e-bce6-0a1379cc1259
+md"> Introduces: `ContinuousEncoder`, pipeline operator `|>`"
+
+# ╔═╡ 38961daa-dd5f-4f97-9608-b5032c116833
+md"""
+The built-in `ContinuousEncoder` model transforms an arbitrary table
+to a table whose features are all `Continuous` (dropping any fields
+it does not know how to encode). In particular, all `Multiclass`
+features are one-hot encoded.
+"""
+
+# ╔═╡ ccd3ffd0-ed41-497d-b473-4aa968e6937a
+md"""
+A *pipeline* is a stand-alone model that internally combines one or
+more models in a linear (non-branching) pipeline. Here's a pipeline
+that adds the `ContinuousEncoder` as a pre-processor to the
+gradient tree-boosting model above:
+"""
+
+# ╔═╡ 35cf2011-4fb6-4e7a-8d9e-644a1b3cc6b6
+pipe = ContinuousEncoder() |> booster
+
+# ╔═╡ 4b6e3239-fd47-495a-ac0a-23ded81445da
+md"""
+Note that the component models appear as hyper-parameters of
+`pipe`. Pipelines are an implementation of a more general [model
+composition](https://alan-turing-institute.github.io/MLJ.jl/dev/composing_models/#Composing-Models)
+interface provided by MLJ that advanced users may want to learn about.
+"""
+
+# ╔═╡ ce9a3479-601a-472a-8535-1a088a6a3228
+md"""
+From the above display, we see that component model hyper-parameters
+are now *nested*, but they are still accessible (important in hyper-parameter
+optimization):
+"""
+
+# ╔═╡ 45c2d8f9-cbf9-4cda-a39f-7893c73eef39
+pipe.evo_tree_classifier.max_depth
+
+# ╔═╡ 232b7775-1f79-43f7-bcca-51a27994a151
+md"## Evaluating the pipeline model's performance"
+
+# ╔═╡ ac84c13f-391f-405a-9b33-09b4b6661170
+md"""
+> Introduces: `measures` (function), **measures:** `brier_loss`, `auc`, `accuracy`;
+> `machine`, `fit!`, `predict`, `fitted_params`, `report`, `roc`, **resampling strategy** `StratifiedCV`, `evaluate`, `FeatureSelector`
+"""
+
+# ╔═╡ 442078e4-a695-471c-b45d-88d068bb0fa2
+md"""
+Without touching our test set `Xtest`, `ytest`, we will estimate the
+performance of our pipeline model, with default hyper-parameters, in
+two different ways:
+"""
+
+# ╔═╡ 23fb37b0-08c1-4688-92c8-6db3a590d963
+md"""
+**Evaluating by hand.** First, we'll do this "by hand" using the `fit!` and `predict`
+workflow illustrated for the iris data set above, using a
+holdout resampling strategy. At the same time we'll see how to
+generate a **confusion matrix**, **ROC curve**, and inspect
+**feature importances**.
+"""
+
+# ╔═╡ 1def55f5-71fc-4257-abf5-96f457eb2bdf
+md"""
+**Automated performance evaluation.** Next we'll apply the more
+typical and convenient `evaluate` workflow, but using `StratifiedCV`
+(stratified cross-validation) which is more informative.
+"""
+
+# ╔═╡ 2c4016c6-dc7e-44ec-8a60-2e0214c059f5
+md"""
+In any case, we need to choose some measures (metrics) to quantify
+the performance of our model. For a complete list of measures, one
+does `measures()`. Or we also can do:
+"""
+
+# ╔═╡ 236c098c-0189-4a96-a38a-beb5c7157b57
+measures("Brier")
+
+# ╔═╡ 9ffe1654-7e32-4c4b-81f5-507803ea9ed6
+md"""
+We will be primarily using `brier_loss`, but also `auc` (area under
+the ROC curve) and `accuracy`.
+"""
+
+# ╔═╡ 48413a7b-d01b-4005-9b21-6d4eb642d87e
+md"### Evaluating by hand (with a holdout set)"
+
+# ╔═╡ 84d7bcbc-df3c-4602-b98c-3df1f31879bf
+md"""
+Our pipeline model can be trained just like the decision tree model
+we built for the iris data set. Binding all non-test data to the
+pipeline model:
+"""
+
+# ╔═╡ f8552404-f95f-4484-92b7-4ceba56e97ad
+mach_pipe = machine(pipe, X, y)
+
+# ╔═╡ e0b1bffb-dbd5-4b0a-b122-214de244406c
+md"""
+We already encountered the `partition` method above. Here we apply
+it to row indices, instead of data containers, as `fit!` and
+`predict` only need a *view* of the data to work.
+"""
+
+# ╔═╡ d8ad2c21-fd27-44fb-8a4b-c83694978e38
+begin
+ train, validation = partition(1:length(y), 0.7)
+ fit!(mach_pipe, rows=train)
+end
+
+# ╔═╡ ceff13ba-a526-4ecb-a8b6-f7ff516dedc4
+md"We note in passing that we can access two kinds of information from a trained machine:"
+
+# ╔═╡ dcf0eb93-2368-4158-81e1-74ef03c2e79e
+md"""
+- The **learned parameters** (eg, coefficients of a linear model): We use `fitted_params(mach_pipe)`
+- Other **by-products of training** (eg, feature importances): We use `report(mach_pipe)`
+"""
+
+# ╔═╡ 442be3e3-b2e9-499d-a04e-0b76409c14a7
+begin
+ fp = fitted_params(mach_pipe);
+ keys(fp)
+end
+
+# ╔═╡ f03fdd56-6c30-4d73-b979-3458f2f26667
+md"For example, we can check that the encoder did not actually drop any features:"
+
+# ╔═╡ 3afbe65c-18bf-4576-97e3-6677afc6ca9f
+Set(fp.continuous_encoder.features_to_keep) == Set(schema(X).names)
+
+# ╔═╡ 527b193b-737d-4701-b10e-562ce21caa04
+md"And, from the report, extract feature importances:"
+
+# ╔═╡ b332842d-6397-45c6-8f78-ab549ef1544e
+begin
+ rpt = report(mach_pipe)
+ keys(rpt.evo_tree_classifier)
+end
+
+# ╔═╡ 3035e6d0-3801-424d-a8a4-a53f5149481e
+begin
+ fi = rpt.evo_tree_classifier.feature_importances
+ feature_importance_table =
+ (feature=Symbol.(first.(fi)), importance=last.(fi)) |> DataFrames.DataFrame
+end
+
+# ╔═╡ 8b67dcac-6a86-4502-8704-dbab8e09b4f1
+md"""
+For models not reporting feature importances, we recommend the
+[Shapley.jl](https://expandingman.gitlab.io/Shapley.jl/) package.
+"""
+
+# ╔═╡ 9cd42f39-942e-4011-a02f-27b6c05e4d02
+md"Returning to predictions and evaluations of our measures:"
+
+# ╔═╡ 0fecce89-102f-416f-bea4-3d467d48781c
+begin
+ ŷ = predict(mach_pipe, rows=validation);
+ @info("Measurements",
+ brier_loss(ŷ, y[validation]) |> mean,
+ auc(ŷ, y[validation]),
+ accuracy(mode.(ŷ), y[validation])
+ )
+end
+
+# ╔═╡ 0b11a81e-42a7-4519-97c5-2979af8a507d
+md"""
+Note that we need `mode` in the last case because `accuracy` expects
+point predictions, not probabilistic ones. (One can alternatively
+use `predict_mode` to generate the predictions.)
+"""
+
+# ╔═╡ e6f6fc92-f9e7-4472-b639-63c76c72c513
+md"""
+While we're here, let's also generate a **confusion matrix** and
+[receiver operating
+characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
+(ROC):
+"""
+
+# ╔═╡ f3933688-1ea4-4239-8f5a-e1ee9eb5c15c
+confmat(mode.(ŷ), y[validation])
+
+# ╔═╡ ce90835f-6709-4c34-adc5-e671db8bca5d
+md"""
+Note: Importing the plotting package and calling the plotting
+functions for the first time can take a minute or so.
+"""
+
+# ╔═╡ 15996832-b736-45f4-86f1-0ea38de21abb
+using Plots
+
+# ╔═╡ 3c4984d2-9e1a-450b-a55c-0cb44ab816d7
+begin
+ roc_curve = roc(ŷ, y[validation])
+ plt = scatter(roc_curve, legend=false)
+ plot!(plt, xlab="false positive rate", ylab="true positive rate")
+ plot!([0, 1], [0, 1], linewidth=2, linestyle=:dash, color=:black)
+end
+
+# ╔═╡ 27221b90-8506-4857-be86-92ecfd0d2343
+md"### Automated performance evaluation (more typical workflow)"
+
+# ╔═╡ 445146e5-e45e-450d-9cf1-facf39e3f302
+md"""
+We can also get performance estimates with a single call to the
+`evaluate` function, which also allows for more complicated
+resampling - in this case stratified cross-validation. To make this
+more comprehensive, we set `repeats=3` below to make our
+cross-validation "Monte Carlo" (3 random size-6 partitions of the
+observation space, for a total of 18 folds) and set
+`acceleration=CPUThreads()` to parallelize the computation.
+"""
+
+# ╔═╡ 562887bb-b7fb-430f-b61c-748aec38e674
+md"""
+We choose a `StratifiedCV` resampling strategy; the complete list of options is
+[here](https://alan-turing-institute.github.io/MLJ.jl/dev/evaluating_model_performance/#Built-in-resampling-strategies).
+"""
+
+# ╔═╡ f9be989e-2604-44c2-9727-ed822e4fd85d
+e_pipe = evaluate(pipe, X, y,
+ resampling=StratifiedCV(nfolds=6, rng=123),
+ measures=[brier_loss, auc, accuracy],
+ repeats=3,
+ acceleration=CPUThreads())
+
+# ╔═╡ ff7cfc36-b9fc-4570-b2f2-e08965e5be66
+md"""
+(There is also a version of `evaluate` for machines. Query the
+`evaluate` and `evaluate!` doc-strings to learn more about these
+functions and what the `PerformanceEvaluation` object `e_pipe` records.)
+"""
+
+# ╔═╡ 2468db48-0ffd-459e-897c-f79712f9ec7c
+md"""
+While [less than ideal](https://arxiv.org/abs/2104.00673), let's
+adopt the common practice of using the standard error of a
+cross-validation score as an estimate of the uncertainty of a
+performance measure's expected value. Here's a utility function to
+calculate confidence intervals for our performance estimates based
+on this practice, and its application to the current evaluation:
+"""
+
+# ╔═╡ 0f76a79f-8675-4ec1-a543-f9324a87efad
+using Measurements
+
+# ╔═╡ f8d9dea6-d5a8-4b27-8657-9fba0caf3cb7
+begin
+ function confidence_intervals(e)
+ measure = e.measure
+ nfolds = length(e.per_fold[1])
+ measurement = [e.measurement[j] ± std(e.per_fold[j])/sqrt(nfolds - 1)
+ for j in eachindex(measure)]
+ table = (measure=measure, measurement=measurement)
+ return DataFrames.DataFrame(table)
+ end
+
+ const confidence_intervals_basic_model = confidence_intervals(e_pipe)
+end
+
+# ╔═╡ c3f71e42-8bbe-47fe-a217-e58f442fc85c
+md"## Filtering out unimportant features"
+
+# ╔═╡ db354064-c2dd-4e6a-b8ad-0340f15a03ba
+md"> Introduces: `FeatureSelector`"
+
+# ╔═╡ 3bbb26ed-7d1e-46ac-946d-b124a8db5f7c
+md"""
+Before continuing, we'll modify our pipeline to drop those features
+with low feature importance, to speed up later optimization:
+"""
+
+# ╔═╡ cdfe840d-4e87-467f-b582-dfcbeb05bcc5
+begin
+ unimportant_features = filter(:importance => <(0.005), feature_importance_table).feature
+
+ pipe2 = ContinuousEncoder() |>
+ FeatureSelector(features=unimportant_features, ignore=true) |> booster
+end
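+
+# ╔═╡ 8b7a6c5d-4e3f-4a2b-8c1d-0e9f8a7b6c5d
+md"""
+Note the `ignore=true` flag above: `FeatureSelector` then *drops* the listed
+features. With the default `ignore=false` it instead keeps *only* the listed
+features, as in this sketch (the two feature names are from the Telco table):
+
+```julia
+keep_two = FeatureSelector(features=[:tenure, :MonthlyCharges])
+```
+"""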
+
+# ╔═╡ de589473-4c37-4f70-9143-546b2286a5fe
+md"## Wrapping our iterative model in control strategies"
+
+# ╔═╡ 0b80d69c-b60c-4cf4-a7d8-c3b94fb18495
+md"> Introduces: **control strategies:** `Step`, `NumberSinceBest`, `TimeLimit`, `InvalidValue`, **model wrapper** `IteratedModel`, **resampling strategy:** `Holdout`"
+
+# ╔═╡ 19e7e4c9-95c0-49d6-8396-93fa872d2512
+md"""
+We want to optimize the hyper-parameters of our model. Since our
+model is iterative, these parameters include the (nested) iteration
+parameter `pipe.evo_tree_classifier.nrounds`. Sometimes this
+parameter is optimized first, fixed, and then maybe optimized again
+after the other parameters. Here we take a more principled approach,
+**wrapping our model in a control strategy** that makes it
+"self-iterating". The strategy applies a stopping criterion to
+*out-of-sample* estimates of the model performance, computed
+using an internally constructed holdout set. In this way, we avoid
+some data hygiene issues, and, when we subsequently optimize other
+parameters, we will always be using an optimal number of
+iterations.
+"""
+
+# ╔═╡ 6ff08b40-906f-4154-a4ac-2ddb495858ce
+md"""
+Note that this approach can be applied to any iterative MLJ model,
+eg, the neural network models provided by
+[MLJFlux.jl](https://github.com/FluxML/MLJFlux.jl).
+"""
+
+# ╔═╡ 8fc99d35-d8cc-455f-806e-1bc580dc349d
+md"""
+First, we select appropriate controls from [this
+list](https://alan-turing-institute.github.io/MLJ.jl/dev/controlling_iterative_models/#Controls-provided):
+"""
+
+# ╔═╡ 29f33708-4a82-4acc-9703-288eae064e2a
+controls = [
+    Step(1),              # to increment iteration parameter (`pipe2.evo_tree_classifier.nrounds`)
+ NumberSinceBest(4), # main stopping criterion
+ TimeLimit(2/3600), # never train more than 2 sec
+ InvalidValue() # stop if NaN or ±Inf encountered
+]
+
+# ╔═╡ 9f80b5b5-95b9-4f01-b2c3-413ae5867f7d
+md"""
+Now we wrap our pipeline model using the `IteratedModel` wrapper,
+being sure to specify the `measure` on which internal estimates of
+the out-of-sample performance will be based:
+"""
+
+# ╔═╡ fd2e2ee4-c256-4fde-93d8-53c527b0a48c
+iterated_pipe = IteratedModel(model=pipe2,
+ controls=controls,
+ measure=brier_loss,
+ resampling=Holdout(fraction_train=0.7))
+
+# ╔═╡ bb7a34eb-4bf1-41d9-af98-8dffdf3118fc
+md"""
+We've set `resampling=Holdout(fraction_train=0.7)` to arrange that
+data attached to our model should be internally split into a train
+set (70%) and a holdout set (30%) for determining the out-of-sample
+estimate of the Brier loss.
+"""
+
+# ╔═╡ 71bead68-2d38-468c-8620-127a8c4021ec
+md"""
+For demonstration purposes, let's bind `iterated_pipe` to all data
+not in our don't-touch holdout set, and train on all of that data:
+"""
+
+# ╔═╡ f245005b-ca1e-4c38-a1f6-fec4c3edfa7b
+begin
+ mach_iterated_pipe = machine(iterated_pipe, X, y)
+ fit!(mach_iterated_pipe);
+end
+
+# ╔═╡ 16d8cd4e-ca20-475a-82f7-021485ee0115
+md"To recap, internally this training is split into two separate steps:"
+
+# ╔═╡ ab59d744-51f8-4364-9ecb-94593d972599
+md"""
+- A controlled iteration step, training on the internal 70% train split, with the total number of iterations determined by the specified stopping criteria (based on out-of-sample performance estimates computed on the internal 30% holdout set)
+- A final step that trains the atomic model on *all* available
+ data using the number of iterations determined in the first step. Calling `predict` on `mach_iterated_pipe` means using the learned parameters of the second step.
+"""
+
+# ╔═╡ ed9f284e-1e45-46e7-b54b-e44bea97c579
+md"## Hyper-parameter optimization (model tuning)"
+
+# ╔═╡ b8b0e5ee-8468-4e02-9121-4e392242994e
+md"> Introduces: `range`, **model wrapper** `TunedModel`, `RandomSearch`"
+
+# ╔═╡ 50372dc8-bb10-4f9e-b221-79886442efe7
+md"""
+We now turn to hyper-parameter optimization. A tool not discussed
+here is the `learning_curve` function, which can be useful when
+wanting to visualize the effect of changes to a *single*
+hyper-parameter (which could be an iteration parameter). See, for
+example, [this section of the
+manual](https://alan-turing-institute.github.io/MLJ.jl/dev/learning_curves/)
+or [this
+tutorial](https://github.com/ablaom/MLJTutorial.jl/blob/dev/notebooks/04_tuning/notebook.ipynb).
+"""
+
+# ╔═╡ 3e385eb4-0a44-40f6-8df5-b7371beb6b3f
+md"""
+Fine tuning the hyper-parameters of a gradient booster can be
+somewhat involved. Here we settle for simultaneously optimizing two
+key parameters: `max_depth` and `η` (the learning rate).
+"""
+
+# ╔═╡ caa5153f-6633-41f7-a475-d52dc8eba727
+md"""
+Like iteration control, **model optimization in MLJ is implemented as
+a model wrapper**, called `TunedModel`. After wrapping a model in a
+tuning strategy and binding the wrapped model to data in a machine
+called `mach`, calling `fit!(mach)` instigates a search for optimal
+model hyperparameters, within a specified range, and then uses all
+supplied data to train the best model. To predict using that model,
+one then calls `predict(mach, Xnew)`. In this way the wrapped model
+may be viewed as a "self-tuning" version of the unwrapped
+model. That is, wrapping the model simply transforms certain
+hyper-parameters into learned parameters (just as `IteratedModel`
+does for an iteration parameter).
+"""
+
+# ╔═╡ 0b74bfe8-bff6-469d-804b-89f5009710b0
+md"""
+To start with, we define ranges for the parameters of
+interest. Since these parameters are nested, let's force a
+display of our model to a larger depth:
+"""
+
+# ╔═╡ b17f6f89-2fd4-46de-a14a-b0adc2955e1c
+show(iterated_pipe, 2)
+
+# ╔═╡ db7a8b4d-611b-4e5e-bd20-120a7a4020d0
+begin
+ p1 = :(model.evo_tree_classifier.η)
+ p2 = :(model.evo_tree_classifier.max_depth)
+
+ r1 = range(iterated_pipe, p1, lower=-2, upper=-0.5, scale=x->10^x)
+ r2 = range(iterated_pipe, p2, lower=2, upper=6)
+end
+
+# ╔═╡ f46af08e-ddb9-4aec-93a0-9d99274137e8
+md"""
+Nominal ranges are defined by specifying `values` instead of `lower`
+and `upper`.
+"""
+
+# ╔═╡ af3023e6-920f-478d-af76-60dddeecbe6c
+md"""
+Next, we choose an optimization strategy from [this
+list](https://alan-turing-institute.github.io/MLJ.jl/dev/tuning_models/#Tuning-Models):
+"""
+
+# ╔═╡ 93c17a9b-b49c-4780-9074-c069a0e97d7e
+tuning = RandomSearch(rng=123)
+
+# ╔═╡ 21f14f18-cc5a-4ed9-ac4a-024c5894013e
+md"""
+Then we wrap the model, specifying a `resampling` strategy and a
+`measure`, as we did for `IteratedModel`. In fact, we can include a
+battery of `measures`; by default, optimization is with respect to
+performance estimates based on the first measure, but estimates for
+all measures can be accessed from the model's `report`.
+"""
+
+# ╔═╡ 147cc2c5-f3c2-4aec-82ca-0687059409ae
+md"""
+The keyword `n` specifies the total number of models (sets of
+hyper-parameters) to evaluate.
+"""
+
+# ╔═╡ 2996c4f0-567a-4c5a-9e9e-2379bd3c438e
+tuned_iterated_pipe = TunedModel(model=iterated_pipe,
+ range=[r1, r2],
+ tuning=tuning,
+ measures=[brier_loss, auc, accuracy],
+ resampling=StratifiedCV(nfolds=6, rng=123),
+ acceleration=CPUThreads(),
+ n=40)
+
+# ╔═╡ fbf15fc2-347a-432a-bf9e-f9077f3deea4
+md"To save time, we skip the `repeats` here."
+
+# ╔═╡ 3922d490-8018-4b8c-9b74-5a67b6e8b15f
+md"Binding our final model to data and training:"
+
+# ╔═╡ d59dc0af-8eb7-4ac4-b1f4-bf9663e97bb7
+begin
+ mach_tuned_iterated_pipe = machine(tuned_iterated_pipe, X, y)
+ fit!(mach_tuned_iterated_pipe)
+end
+
+# ╔═╡ ed1d8575-1c50-4bc7-8dca-7ecb1b94fa1b
+md"""
+As explained above, the training we have just performed was split
+internally into two separate steps:
+"""
+
+# ╔═╡ c392b949-7e64-4f3e-aeca-6e3d5d94d8fa
+md"""
+- A step to determine the parameter values that optimize the aggregated cross-validation scores
+- A final step that trains the optimal model on *all* available data. Future predictions `predict(mach_tuned_iterated_pipe, ...)` are based on this final training step.
+"""
+
+# ╔═╡ 954a354f-0a1d-4a7f-8aa0-20af95403dd9
+md"""
+From `report(mach_tuned_iterated_pipe)` we can extract details about
+the optimization procedure. For example:
+"""
+
+# ╔═╡ 0f9e72c1-fb6e-4ea7-a120-518242409f79
+begin
+ rpt2 = report(mach_tuned_iterated_pipe);
+ best_booster = rpt2.best_model.model.evo_tree_classifier
+end
+
+# ╔═╡ 242b8047-0508-4ce2-bcf7-0be279ee1434
+@info "Optimal hyper-parameters:" best_booster.max_depth best_booster.η;
+
+# ╔═╡ af6d8f76-f96e-4136-9df6-c78b3bed8b80
+md"Using the `confidence_intervals` function we defined earlier:"
+
+# ╔═╡ ec5e230c-397a-4e07-b9cc-1427739824b3
+begin
+ e_best = rpt2.best_history_entry
+ confidence_intervals(e_best)
+end
+
+# ╔═╡ a0a591d4-ae53-4db4-9051-24fa20a24653
+md"""
+Digging a little deeper, we can learn what stopping criterion was
+applied in the case of the optimal model, and how many iterations
+were required:
+"""
+
+# ╔═╡ 8f18ff72-39db-43f2-ac11-6a06d822d067
+rpt2.best_report.controls |> collect
+
+# ╔═╡ d28bdf7e-2ba5-4a97-8d27-497b9a4e8f4b
+md"Finally, we can visualize the optimization results:"
+
+# ╔═╡ e0dba28c-5f07-4305-a8e6-955351cd26f5
+plot(mach_tuned_iterated_pipe, size=(600,450))
+
+# ╔═╡ d92c7d6c-6eda-4083-bf7b-99c47ef72fd2
+md"## Saving our model"
+
+# ╔═╡ 7626383d-5212-4ee5-9b3c-c87e36798d40
+md"> Introduces: `MLJ.save`"
+
+# ╔═╡ f38bca90-a19e-4faa-bc52-7a03f8a4f046
+md"""
+Here's how to serialize our final, trained self-iterating,
+self-tuning pipeline machine:
+"""
+
+# ╔═╡ 36e8600e-f5ee-4e5f-9814-5dd23028b7dd
+MLJ.save("tuned_iterated_pipe.jlso", mach_tuned_iterated_pipe)
+
+# ╔═╡ 4de39bbb-e421-4e1b-aea9-6b095d52d246
+md"We'll deserialize this in \"Testing the final model\" below."
+
+# ╔═╡ cf4e89cd-998c-44d1-8a6a-3fad94d47b89
+md"## Final performance estimate"
+
+# ╔═╡ cd043584-6881-45df-ab7f-23dfd6fe43e8
+md"""
+Finally, to get an even more accurate estimate of performance, we
+can evaluate our model using stratified cross-validation and all the
+data attached to our machine. Because this evaluation implies
+[nested
+resampling](https://mlr.mlr-org.com/articles/tutorial/nested_resampling.html),
+this computation takes quite a bit longer than the previous one
+(the tuning computation alone is now repeated six times, training
+on 5/6th of the data each time):
+"""
+
+# ╔═╡ 9c90d7a7-1966-4d21-873e-e6fb8e7dca1a
+e_tuned_iterated_pipe = evaluate(tuned_iterated_pipe, X, y,
+ resampling=StratifiedCV(nfolds=6, rng=123),
+ measures=[brier_loss, auc, accuracy])
+
+# ╔═╡ f5058fe5-53de-43ad-9dd4-420f3ba8803c
+confidence_intervals(e_tuned_iterated_pipe)
+
+# ╔═╡ 669b492b-69a7-4d88-b994-ae59732958cb
+md"""
+For comparison, here are the confidence intervals for the basic
+pipeline model (no feature selection and default hyperparameters):
+"""
+
+# ╔═╡ 5c1754b1-21c2-4173-9aa5-a206354b401f
+confidence_intervals_basic_model
+
+# ╔═╡ 675f06bc-6488-45e2-b666-1307eccc221d
+md"""
+As the intervals in each pair overlap, it's doubtful the small changes
+here can be assigned statistical significance. Default `booster`
+hyper-parameters do a pretty good job.
+"""
+
+# ╔═╡ 19bb6a91-8c9d-4ed0-8f9a-f7439f35ea8f
+md"## Testing the final model"
+
+# ╔═╡ fe32c8ab-2f9e-4bb4-a8bb-11a6d1761f50
+md"""
+We now determine the performance of our model on our
+lock-and-throw-away-the-key holdout set. To demonstrate
+deserialization, we'll pretend we're in a new Julia session (but
+have called `import`/`using` on the same packages). Then the
+following should suffice to recover our model trained under
+"Hyper-parameter optimization" above:
+"""
+
+# ╔═╡ 695926dd-3faf-48a0-8c71-7d4e18e2f699
+mach_restored = machine("tuned_iterated_pipe.jlso")
+
+# ╔═╡ 24ca2761-e264-4cf9-a590-db6dcb21b2d3
+md"We compute predictions on the holdout set:"
+
+# ╔═╡ 9883bd87-d70f-4fea-bec6-3fa17d8c7b35
+begin
+ ŷ_tuned = predict(mach_restored, Xtest);
+ ŷ_tuned[1]
+end
+
+# ╔═╡ 9b7a12e7-2b3c-445a-97eb-2de8afd657be
+md"And can compute the final performance measures:"
+
+# ╔═╡ 7fa067b5-7e77-4dda-bba0-54e6f740a5b5
+@info("Tuned model measurements on test:",
+ brier_loss(ŷ_tuned, ytest) |> mean,
+ auc(ŷ_tuned, ytest),
+ accuracy(mode.(ŷ_tuned), ytest)
+ )
+
+# ╔═╡ f4dc71cf-64c8-4857-94c0-45caa9808777
+md"For comparison, here's the performance for the basic pipeline model"
+
+# ╔═╡ 00257755-7145-45e6-adf5-76545beae885
+begin
+ mach_basic = machine(pipe, X, y)
+ fit!(mach_basic, verbosity=0)
+
+ ŷ_basic = predict(mach_basic, Xtest);
+
+ @info("Basic model measurements on test set:",
+ brier_loss(ŷ_basic, ytest) |> mean,
+ auc(ŷ_basic, ytest),
+ accuracy(mode.(ŷ_basic), ytest)
+ )
+end
+
+# ╔═╡ 135dac9b-0bd9-4e1d-8715-73a20e2ae31b
+md"""
+---
+
+*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*
+"""
+
+# ╔═╡ Cell order:
+# ╟─f0cc864c-8b26-441f-9bca-7c69b794f8ce
+# ╟─8a6670b8-96a8-4a5d-b795-033f6f2a0674
+# ╟─aa49e638-95dc-4249-935f-ddf6a6bfbbdd
+# ╟─b04c4790-59e0-42a3-af2a-25235e544a31
+# ╟─4eb8dff4-c23a-4b41-8af5-148d95ea2900
+# ╟─a583d175-d623-4888-a6bf-47194d7e8e12
+# ╟─4830fc64-3d70-4869-828a-4cc485149963
+# ╟─28197138-d6b7-433c-9e54-3f7b3ca87ecb
+# ╟─39c599b2-3c1f-4292-ba1f-6363f43ec697
+# ╟─7c0464a0-4114-46bf-95ea-2955abd45275
+# ╟─256869d0-5e0c-42af-b1b3-dd47e367ba54
+# ╠═60fe49c1-3434-4a77-8d7e-8e449afd1c48
+# ╟─e8c13e9d-7910-4a0e-a949-7f3bd292fe31
+# ╟─6bf6ef98-302f-478c-8514-99938a2932db
+# ╟─33ca287e-8cba-47d1-a0de-1721c1bc2df2
+# ╟─d5b8bf1d-6c9a-46c1-bca8-7eec7950fd82
+# ╠═9b7b9ade-9318-4b95-9873-1b1430e635cc
+# ╠═4f3f061d-259d-4479-b43e-d0ebe87da176
+# ╠═a546f372-4ae0-48c2-9009-4cfb2012998f
+# ╠═f8f79dd2-b6de-48e8-abd3-cb77d7a79683
+# ╟─617d63f6-8a62-40db-879e-dc128f3db7b3
+# ╠═ceda7ed0-3b98-44b3-a367-9aa3c6cf34d0
+# ╟─e69cd764-e4b8-4ff8-bf32-94657e65284e
+# ╟─4a175e0f-4b87-4b53-9afd-65a5b5facac9
+# ╠═3c23e9f8-49bd-4cde-b6c8-54f5ed90a964
+# ╟─6ad971c7-9fe5-45a9-9292-fe822525fc77
+# ╠═63151d31-8428-4ebd-ae5e-4b83dcbc9675
+# ╟─0f978839-cc95-4c3a-8a29-32f11452654a
+# ╠═5edc98cd-df7d-428a-a5f1-151fcbe229a2
+# ╟─1848109f-c94c-4cdd-81bc-2d5603785a09
+# ╠═eee948b7-2b42-439e-9d87-1d1fbb0e3997
+# ╟─7c98e10b-1fa3-4ab6-b950-fddef2a1fb10
+# ╠═413d9c3c-a68e-4bf0-951b-64cb2a36c8e3
+# ╟─eb79663c-c671-4c4c-b0e6-441461cc8770
+# ╠═ae9e9377-552c-4186-8cb0-de601961bc02
+# ╟─5c70ee06-edb9-4789-a87d-950c50fb2955
+# ╟─bde7bc43-a4bb-4a42-8448-4bcb089096cd
+# ╠═d5a9e9b8-c67e-4bdd-a012-a84240254fb6
+# ╠═8de2c7b4-1950-4f05-bbdd-9799f7bb2e60
+# ╟─768f369a-5f3d-4dcc-97a7-88e3af4f10ee
+# ╟─bc4c0eb9-3b5f-49fc-b372-8c9866e51852
+# ╟─f04becbb-42a6-409f-8f3d-b82f1e7b6f7a
+# ╟─1d08ed75-5377-4971-ab04-579e5608ae53
+# ╟─eea585d4-d55b-4ccc-86cf-35420d9e6995
+# ╟─ca482134-299b-459a-a29a-8d0445351914
+# ╟─5c7c97e4-fd8b-4ae5-be65-28d0fcca50a8
+# ╟─fd103be7-6fc9-43bc-9a2e-c468345d87d1
+# ╠═9d2f0b19-2942-47ac-b5fa-cb79ebf595ef
+# ╟─1d18aca8-11f4-4104-91c5-524da38aa391
+# ╠═accd3a09-5371-45ab-ad90-b4405b216771
+# ╟─bde0406f-3964-42b8-895b-997a92b731e0
+# ╠═25e60566-fc64-4f35-a525-e9b04a4bd246
+# ╟─441317a8-3b72-4bf1-80f0-ba7781e173cf
+# ╠═d2f2b67e-d06e-4c0b-9cbc-9a2c39793333
+# ╟─7c93b820-cbce-48be-b887-2b16710e5502
+# ╠═5ac4a13e-9983-41dd-9452-5298a8a4a401
+# ╟─9f28e6c3-5f18-4fb2-b017-96f1602f2cad
+# ╠═c9506e5b-c111-442c-8be2-2a4017c45345
+# ╟─8c865df5-ac95-438f-a7ad-80d84f5b0070
+# ╠═22690828-0992-4ed0-8378-5aa686f0b407
+# ╟─3bd753fc-163a-4b50-9f43-49763e8691a1
+# ╟─7b12673f-cc63-4cd9-bb0e-3a45761c733a
+# ╟─3c98ab4e-47cd-4247-9979-7044b2f2df33
+# ╠═ea010966-bb7d-4c79-b2a1-fbbde543f620
+# ╟─401c6f14-1128-4df1-910b-f8a9a217eff2
+# ╟─81aa1aa9-27d9-423f-aa36-d2e8546da46a
+# ╟─fff47e76-d7ac-4377-88a1-a2bf91434413
+# ╟─f1dbbc22-9844-4e1c-a1cb-54e34396a855
+# ╠═20713fb3-3a3b-42e7-8037-0de5806e1a54
+# ╠═369cb359-8512-4f80-9961-a09632c33fb0
+# ╟─af484868-b29c-4808-b7cc-46f4ef988c68
+# ╠═8f3eda66-6183-4fe4-90fb-7f2721f6fcc7
+# ╟─598142b5-0828-49ac-af65-ccd25ecb9818
+# ╟─c5309d64-7409-4148-8890-979691212d9b
+# ╟─f97969e2-c15c-42cf-a6fa-eaf14df5d44b
+# ╠═edf1d726-542a-4fc1-8025-5084804a9f6c
+# ╟─ee41a121-df0d-40ad-9e90-09023d201062
+# ╠═be6cef47-d585-45fe-b7bb-2f33ef765cc0
+# ╟─452edd51-b878-46d0-9625-01172c4a0081
+# ╠═fdd20843-c981-41ad-af4f-11c7de9e21dd
+# ╟─2b25f9cb-d12d-4a6f-8dba-241d9b744683
+# ╠═b139c43b-382b-415a-a6e5-2f0b4dca5c59
+# ╟─90a7fe9a-934f-445a-8550-20498aa03ed0
+# ╟─fc97e937-66ca-4434-9e7b-5b943cf6b560
+# ╟─b1d95557-51fc-4f1e-bce6-0a1379cc1259
+# ╟─38961daa-dd5f-4f97-9608-b5032c116833
+# ╟─ccd3ffd0-ed41-497d-b473-4aa968e6937a
+# ╠═35cf2011-4fb6-4e7a-8d9e-644a1b3cc6b6
+# ╟─4b6e3239-fd47-495a-ac0a-23ded81445da
+# ╟─ce9a3479-601a-472a-8535-1a088a6a3228
+# ╠═45c2d8f9-cbf9-4cda-a39f-7893c73eef39
+# ╟─232b7775-1f79-43f7-bcca-51a27994a151
+# ╟─ac84c13f-391f-405a-9b33-09b4b6661170
+# ╟─442078e4-a695-471c-b45d-88d068bb0fa2
+# ╟─23fb37b0-08c1-4688-92c8-6db3a590d963
+# ╟─1def55f5-71fc-4257-abf5-96f457eb2bdf
+# ╟─2c4016c6-dc7e-44ec-8a60-2e0214c059f5
+# ╠═236c098c-0189-4a96-a38a-beb5c7157b57
+# ╟─9ffe1654-7e32-4c4b-81f5-507803ea9ed6
+# ╟─48413a7b-d01b-4005-9b21-6d4eb642d87e
+# ╟─84d7bcbc-df3c-4602-b98c-3df1f31879bf
+# ╠═f8552404-f95f-4484-92b7-4ceba56e97ad
+# ╟─e0b1bffb-dbd5-4b0a-b122-214de244406c
+# ╠═d8ad2c21-fd27-44fb-8a4b-c83694978e38
+# ╟─ceff13ba-a526-4ecb-a8b6-f7ff516dedc4
+# ╟─dcf0eb93-2368-4158-81e1-74ef03c2e79e
+# ╠═442be3e3-b2e9-499d-a04e-0b76409c14a7
+# ╟─f03fdd56-6c30-4d73-b979-3458f2f26667
+# ╠═3afbe65c-18bf-4576-97e3-6677afc6ca9f
+# ╟─527b193b-737d-4701-b10e-562ce21caa04
+# ╠═b332842d-6397-45c6-8f78-ab549ef1544e
+# ╠═3035e6d0-3801-424d-a8a4-a53f5149481e
+# ╟─8b67dcac-6a86-4502-8704-dbab8e09b4f1
+# ╟─9cd42f39-942e-4011-a02f-27b6c05e4d02
+# ╠═0fecce89-102f-416f-bea4-3d467d48781c
+# ╟─0b11a81e-42a7-4519-97c5-2979af8a507d
+# ╟─e6f6fc92-f9e7-4472-b639-63c76c72c513
+# ╠═f3933688-1ea4-4239-8f5a-e1ee9eb5c15c
+# ╟─ce90835f-6709-4c34-adc5-e671db8bca5d
+# ╠═15996832-b736-45f4-86f1-0ea38de21abb
+# ╠═3c4984d2-9e1a-450b-a55c-0cb44ab816d7
+# ╟─27221b90-8506-4857-be86-92ecfd0d2343
+# ╟─445146e5-e45e-450d-9cf1-facf39e3f302
+# ╟─562887bb-b7fb-430f-b61c-748aec38e674
+# ╠═f9be989e-2604-44c2-9727-ed822e4fd85d
+# ╟─ff7cfc36-b9fc-4570-b2f2-e08965e5be66
+# ╟─2468db48-0ffd-459e-897c-f79712f9ec7c
+# ╠═0f76a79f-8675-4ec1-a543-f9324a87efad
+# ╠═f8d9dea6-d5a8-4b27-8657-9fba0caf3cb7
+# ╟─c3f71e42-8bbe-47fe-a217-e58f442fc85c
+# ╟─db354064-c2dd-4e6a-b8ad-0340f15a03ba
+# ╟─3bbb26ed-7d1e-46ac-946d-b124a8db5f7c
+# ╠═cdfe840d-4e87-467f-b582-dfcbeb05bcc5
+# ╟─de589473-4c37-4f70-9143-546b2286a5fe
+# ╟─0b80d69c-b60c-4cf4-a7d8-c3b94fb18495
+# ╟─19e7e4c9-95c0-49d6-8396-93fa872d2512
+# ╟─6ff08b40-906f-4154-a4ac-2ddb495858ce
+# ╟─8fc99d35-d8cc-455f-806e-1bc580dc349d
+# ╠═29f33708-4a82-4acc-9703-288eae064e2a
+# ╟─9f80b5b5-95b9-4f01-b2c3-413ae5867f7d
+# ╠═fd2e2ee4-c256-4fde-93d8-53c527b0a48c
+# ╟─bb7a34eb-4bf1-41d9-af98-8dffdf3118fc
+# ╟─71bead68-2d38-468c-8620-127a8c4021ec
+# ╠═f245005b-ca1e-4c38-a1f6-fec4c3edfa7b
+# ╟─16d8cd4e-ca20-475a-82f7-021485ee0115
+# ╟─ab59d744-51f8-4364-9ecb-94593d972599
+# ╟─ed9f284e-1e45-46e7-b54b-e44bea97c579
+# ╟─b8b0e5ee-8468-4e02-9121-4e392242994e
+# ╟─50372dc8-bb10-4f9e-b221-79886442efe7
+# ╟─3e385eb4-0a44-40f6-8df5-b7371beb6b3f
+# ╟─caa5153f-6633-41f7-a475-d52dc8eba727
+# ╟─0b74bfe8-bff6-469d-804b-89f5009710b0
+# ╠═b17f6f89-2fd4-46de-a14a-b0adc2955e1c
+# ╠═db7a8b4d-611b-4e5e-bd20-120a7a4020d0
+# ╟─f46af08e-ddb9-4aec-93a0-9d99274137e8
+# ╟─af3023e6-920f-478d-af76-60dddeecbe6c
+# ╠═93c17a9b-b49c-4780-9074-c069a0e97d7e
+# ╟─21f14f18-cc5a-4ed9-ac4a-024c5894013e
+# ╟─147cc2c5-f3c2-4aec-82ca-0687059409ae
+# ╠═2996c4f0-567a-4c5a-9e9e-2379bd3c438e
+# ╟─fbf15fc2-347a-432a-bf9e-f9077f3deea4
+# ╟─3922d490-8018-4b8c-9b74-5a67b6e8b15f
+# ╠═d59dc0af-8eb7-4ac4-b1f4-bf9663e97bb7
+# ╟─ed1d8575-1c50-4bc7-8dca-7ecb1b94fa1b
+# ╟─c392b949-7e64-4f3e-aeca-6e3d5d94d8fa
+# ╟─954a354f-0a1d-4a7f-8aa0-20af95403dd9
+# ╠═0f9e72c1-fb6e-4ea7-a120-518242409f79
+# ╠═242b8047-0508-4ce2-bcf7-0be279ee1434
+# ╟─af6d8f76-f96e-4136-9df6-c78b3bed8b80
+# ╠═ec5e230c-397a-4e07-b9cc-1427739824b3
+# ╟─a0a591d4-ae53-4db4-9051-24fa20a24653
+# ╠═8f18ff72-39db-43f2-ac11-6a06d822d067
+# ╟─d28bdf7e-2ba5-4a97-8d27-497b9a4e8f4b
+# ╠═e0dba28c-5f07-4305-a8e6-955351cd26f5
+# ╟─d92c7d6c-6eda-4083-bf7b-99c47ef72fd2
+# ╟─7626383d-5212-4ee5-9b3c-c87e36798d40
+# ╟─f38bca90-a19e-4faa-bc52-7a03f8a4f046
+# ╠═36e8600e-f5ee-4e5f-9814-5dd23028b7dd
+# ╟─4de39bbb-e421-4e1b-aea9-6b095d52d246
+# ╟─cf4e89cd-998c-44d1-8a6a-3fad94d47b89
+# ╟─cd043584-6881-45df-ab7f-23dfd6fe43e8
+# ╠═9c90d7a7-1966-4d21-873e-e6fb8e7dca1a
+# ╠═f5058fe5-53de-43ad-9dd4-420f3ba8803c
+# ╟─669b492b-69a7-4d88-b994-ae59732958cb
+# ╠═5c1754b1-21c2-4173-9aa5-a206354b401f
+# ╟─675f06bc-6488-45e2-b666-1307eccc221d
+# ╟─19bb6a91-8c9d-4ed0-8f9a-f7439f35ea8f
+# ╟─fe32c8ab-2f9e-4bb4-a8bb-11a6d1761f50
+# ╠═695926dd-3faf-48a0-8c71-7d4e18e2f699
+# ╟─24ca2761-e264-4cf9-a590-db6dcb21b2d3
+# ╠═9883bd87-d70f-4fea-bec6-3fa17d8c7b35
+# ╟─9b7a12e7-2b3c-445a-97eb-2de8afd657be
+# ╠═7fa067b5-7e77-4dda-bba0-54e6f740a5b5
+# ╟─f4dc71cf-64c8-4857-94c0-45caa9808777
+# ╠═00257755-7145-45e6-adf5-76545beae885
+# ╟─135dac9b-0bd9-4e1d-8715-73a20e2ae31b
diff --git a/examples/telco/notebook.pluto.jl.html b/examples/telco/notebook.pluto.jl.html
new file mode 100644
index 000000000..2d39e7ed9
--- /dev/null
+++ b/examples/telco/notebook.pluto.jl.html
@@ -0,0 +1,65 @@
+ ⚡ Pluto.jl ⚡
diff --git a/examples/telco/notebook.unexecuted.ipynb b/examples/telco/notebook.unexecuted.ipynb
new file mode 100644
index 000000000..5e5fe26c0
--- /dev/null
+++ b/examples/telco/notebook.unexecuted.ipynb
@@ -0,0 +1,1853 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# MLJ for Data Scientists in Two Hours"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "An application of the [MLJ\n",
+ "toolbox](https://alan-turing-institute.github.io/MLJ.jl/dev/) to the\n",
+ "Telco Customer Churn dataset, aimed at practicing data scientists\n",
+ "new to MLJ (Machine Learning in Julia). This tutorial does not\n",
+ "cover exploratory data analysis."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "MLJ is a *multi-paradigm* machine learning toolbox (i.e., not just\n",
+ "deep-learning)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "For other MLJ learning resources see the [Learning\n",
+ "MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/learning_mlj/)\n",
+ "section of the\n",
+ "[manual](https://alan-turing-institute.github.io/MLJ.jl/dev/)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Topics covered**: Grabbing and preparing a dataset, basic\n",
+ "fit/predict workflow, constructing a pipeline to include data\n",
+ "pre-processing, estimating performance metrics, ROC curves, confusion\n",
+ "matrices, feature importance, basic feature selection, controlling iterative\n",
+ "models, hyper-parameter optimization (tuning)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Prerequisites for this tutorial.** Previous experience building,\n",
+ "evaluating, and optimizing machine learning models using\n",
+ "scikit-learn, caret, MLR, weka, or similar tool. No previous\n",
+ "experience with MLJ. Only fairly basic familiarity with Julia is\n",
+ "required. Uses\n",
+ "[DataFrames.jl](https://dataframes.juliadata.org/stable/) but in a\n",
+ "minimal way ([this\n",
+ "cheatsheet](https://ahsmart.com/pub/data-wrangling-with-data-frames-jl-cheat-sheet/index.html)\n",
+ "may help)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Time.** Between two and three hours, first time through."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Summary of methods and types introduced"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "|code | purpose|\n",
+ "|:-------|:-------------------------------------------------------|\n",
+ "|`OpenML.load(id)` | grab a dataset from [OpenML.org](https://www.openml.org)|\n",
+ "|`scitype(X)` | inspect the scientific type (scitype) of object `X`|\n",
+ "|`schema(X)` | inspect the column scitypes (scientific types) of a table `X`|\n",
+ "|`coerce(X, ...)` | fix column encodings to get appropriate scitypes|\n",
+ "|`partition(data, frac1, frac2, ...; rng=...)` | vertically split `data`, which can be a table, vector or matrix|\n",
+ "|`unpack(table, f1, f2, ...)` | horizontally split `table` based on conditions `f1`, `f2`, ..., applied to column names|\n",
+ "|`@load ModelType pkg=...` | load code defining a model type|\n",
+ "|`input_scitype(model)` | inspect the scitype that a model requires for features (inputs)|\n",
+ "|`target_scitype(model)`| inspect the scitype that a model requires for the target (labels)|\n",
+ "|`ContinuousEncoder` | built-in model type for re-encoding all features as `Continuous`|# |`model1 |> model2` |> ...` | combine multiple models into a pipeline\n",
+ "| `measures(\"under curve\")` | list all measures (metrics) with string \"under curve\" in documentation\n",
+ "| `accuracy(yhat, y)` | compute accuracy of predictions `yhat` against ground truth observations `y`\n",
+ "| `auc(yhat, y)`, `brier_loss(yhat, y)` | evaluate two probabilistic measures (`yhat` a vector of probability distributions)\n",
+ "| `machine(model, X, y)` | bind `model` to training data `X` (features) and `y` (target)\n",
+ "| `fit!(mach, rows=...)` | train machine using specified rows (observation indices)\n",
+ "| `predict(mach, rows=...)`, | make in-sample model predictions given specified rows\n",
+ "| `predict(mach, Xnew)` | make predictions given new features `Xnew`\n",
+ "| `fitted_params(mach)` | inspect learned parameters\n",
+ "| `report(mach)` | inspect other outcomes of training\n",
+ "| `confmat(yhat, y)` | confusion matrix for predictions `yhat` and ground truth `y`\n",
+ "| `roc(yhat, y)` | compute points on the receiver-operator Characteristic\n",
+ "| `StratifiedCV(nfolds=6)` | 6-fold stratified cross-validation resampling strategy\n",
+ "| `Holdout(fraction_train=0.7)` | holdout resampling strategy\n",
+ "| `evaluate(model, X, y; resampling=..., options...)` | estimate performance metrics `model` using the data `X`, `y`\n",
+ "| `FeatureSelector()` | transformer for selecting features\n",
+ "| `Step(3)` | iteration control for stepping 3 iterations\n",
+ "| `NumberSinceBest(6)`, `TimeLimit(60/5), InvalidValue()` | iteration control stopping criteria\n",
+ "| `IteratedModel(model=..., controls=..., options...)` | wrap an iterative `model` in control strategies\n",
+ "| `range(model, :some_hyperparam, lower=..., upper=...)` | define a numeric range\n",
+ "| `RandomSearch()` | random search tuning strategy\n",
+ "| `TunedModel(model=..., tuning=..., options...)` | wrap the supervised `model` in specified `tuning` strategy"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Instantiate a Julia environment"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The following code replicates precisely the set of Julia packages\n",
+ "used to develop this tutorial. If this is your first time running\n",
+ "the notebook, package instantiation and pre-compilation may take a\n",
+ "minute or so to complete. **This step will fail** if the [correct\n",
+ "Manifest.toml and Project.toml\n",
+ "files](https://github.com/alan-turing-institute/MLJ.jl/tree/dev/examples/telco)\n",
+ "are not in the same directory as this notebook."
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "using Pkg\n",
+ "Pkg.activate(@__DIR__) # get env from TOML files in same directory as this notebook\n",
+ "Pkg.instantiate()"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Warm up: Building a model for the iris dataset"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Before turning to the Telco Customer Churn dataset, we very quickly\n",
+ "build a predictive model for Fisher's well-known iris data set, as way of\n",
+ "introducing the main actors in any MLJ workflow. Details that you\n",
+ "don't fully grasp should become clearer in the Telco study."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This section is a condensed adaption of the [Getting Started\n",
+ "example](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#Fit-and-predict)\n",
+ "in the MLJ documentation."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "First, using the built-in iris dataset, we load and inspect the features\n",
+ "`X_iris` (a table) and target variable `y_iris` (a vector):"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "using MLJ"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "const X_iris, y_iris = @load_iris;\n",
+ "schema(X_iris)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "y_iris[1:4]"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "levels(y_iris)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We load a decision tree model, from the package DecisionTree.jl:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "DecisionTree = @load DecisionTreeClassifier pkg=DecisionTree # model type\n",
+ "model = DecisionTree(min_samples_split=5) # model instance"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "In MLJ, a *model* is just a container for hyper-parameters of\n",
+ "some learning algorithm. It does not store learned parameters."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Next, we bind the model together with the available data in what's\n",
+ "called a *machine*:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "mach = machine(model, X_iris, y_iris)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "A machine is essentially just a model (ie, hyper-parameters) plus data, but\n",
+ "it additionally stores *learned parameters* (the tree) once it is\n",
+ "trained on some view of the data:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "train_rows = vcat(1:60, 91:150); # some row indices (observations are rows not columns)\n",
+ "fit!(mach, rows=train_rows)\n",
+ "fitted_params(mach)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "A machine stores some other information enabling [warm\n",
+ "restart](https://alan-turing-institute.github.io/MLJ.jl/dev/machines/#Warm-restarts)\n",
+ "for some models, but we won't go into that here. You are allowed to\n",
+ "access and mutate the `model` parameter:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "mach.model.min_samples_split = 10\n",
+ "fit!(mach, rows=train_rows) # re-train with new hyper-parameter"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now we can make predictions on some other view of the data, as in"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "predict(mach, rows=71:73)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "or on completely new data, as in"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "Xnew = (sepal_length = [5.1, 6.3],\n",
+ " sepal_width = [3.0, 2.5],\n",
+ " petal_length = [1.4, 4.9],\n",
+ " petal_width = [0.3, 1.5])\n",
+ "yhat = predict(mach, Xnew)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "These are probabilistic predictions which can be manipulated using a\n",
+ "widely adopted interface defined in the Distributions.jl\n",
+ "package. For example, we can get raw probabilities like this:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "pdf.(yhat, \"virginica\")"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
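+ {
+ "cell_type": "markdown",
+ "source": [
+ "Point predictions can then be extracted with the mode, a sketch reusing\n",
+ "`yhat` from above:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "mode.(yhat)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },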
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We now turn to the Telco dataset."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Getting the Telco data"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "import DataFrames"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "data = OpenML.load(42178) # data set from OpenML.org\n",
+ "df0 = DataFrames.DataFrame(data)\n",
+ "first(df0, 4)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The object of this tutorial is to build and evaluate supervised\n",
+ "learning models to predict the `:Churn` variable, a binary variable\n",
+ "measuring customer retention, based on other variables that are\n",
+ "relevant."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "In the table, observations correspond to rows, and features to\n",
+ "columns, which is the convention for representing all\n",
+ "two-dimensional data in MLJ."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Type coercion"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "> Introduces: `scitype`, `schema`, `coerce`"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "A [\"scientific\n",
+ "type\"](https://juliaai.github.io/ScientificTypes.jl/dev/) or\n",
+ "*scitype* indicates how MLJ will *interpret* data. For example,\n",
+ "`typeof(3.14) == Float64`, while `scitype(3.14) == Continuous` and\n",
+ "also `scitype(3.14f0) == Continuous`. In MLJ, model data\n",
+ "requirements are articulated using scitypes."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Here are common \"scalar\" scitypes:"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "![](assets/scitypes.png)"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "There are also container scitypes. For example, the scitype of any\n",
+ "`N`-dimensional array is `AbstractArray{S, N}`, where `S` is the scitype of the\n",
+ "elements:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "scitype([\"cat\", \"mouse\", \"dog\"])"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The `schema` operator summarizes the column scitypes of a table:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "schema(df0) |> DataFrames.DataFrame # converted to DataFrame for better display"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "All of the fields being interpreted as `Textual` are really\n",
+ "something else, either `Multiclass` or, in the case of\n",
+ "`:TotalCharges`, `Continuous`. In fact, `:TotalCharges` is\n",
+ "mostly floats wrapped as strings. However, it needs special\n",
+ "treatment because some elements consist of a single space, \" \",\n",
+ "which we'll treat as \"0.0\"."
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "fix_blanks(v) = map(v) do x\n",
+ " if x == \" \"\n",
+ " return \"0.0\"\n",
+ " else\n",
+ " return x\n",
+ " end\n",
+ "end\n",
+ "\n",
+ "df0.TotalCharges = fix_blanks(df0.TotalCharges);"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
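+ {
+ "cell_type": "markdown",
+ "source": [
+ "(Equivalently, a one-line sketch using Base's `replace` on collections:\n",
+ "`df0.TotalCharges = replace(df0.TotalCharges, \" \" => \"0.0\")`.)"
+ ],
+ "metadata": {}
+ },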
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Coercing the `:TotalCharges` type to ensure a `Continuous` scitype:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "coerce!(df0, :TotalCharges => Continuous);"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Coercing all remaining `Textual` data to `Multiclass`:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "coerce!(df0, Textual => Multiclass);"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
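+ {
+ "cell_type": "markdown",
+ "source": [
+ "As an aside, `coerce` (without the exclamation mark) returns a modified\n",
+ "copy instead of mutating its argument, as in this sketch:\n",
+ "\n",
+ "```julia\n",
+ "df1 = coerce(df0, Textual => Multiclass)  # df0 itself is left unchanged\n",
+ "```"
+ ],
+ "metadata": {}
+ },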
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Finally, we'll coerce our target variable `:Churn` to be\n",
+ "`OrderedFactor`, rather than `Multiclass`, to enable a reliable\n",
+ "interpretation of metrics like \"true positive rate\". By convention,\n",
+ "the first class is the negative one:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "coerce!(df0, :Churn => OrderedFactor)\n",
+ "levels(df0.Churn) # to check order"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Re-inspecting the scitypes:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "schema(df0) |> DataFrames.DataFrame"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Preparing a holdout set for final testing"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "> Introduces: `partition`"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "To reduce training times for the purposes of this tutorial, we're\n",
+ "going to dump 90% of observations (after shuffling) and split off\n",
+ "30% of the remainder for use as a lock-and-throw-away-the-key\n",
+ "holdout set:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "df, df_test, df_dumped = partition(df0, 0.07, 0.03, # in ratios 7:3:90\n",
+ " stratify=df0.Churn,\n",
+ " rng=123);"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The reader interested in including all data can instead do\n",
+ "`df, df_test = partition(df0, 0.7, rng=123)`."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Splitting data into target and features"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "> Introduces: `unpack`"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "In the following call, the column with name `:Churn` is copied over\n",
+ "to a vector `y`, and every remaining column, except `:customerID`\n",
+ "(which contains no useful information) goes into a table `X`. Here\n",
+ "`:Churn` is the target variable for which we seek predictions, given\n",
+ "new versions of the features `X`."
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "const y, X = unpack(df, ==(:Churn), !=(:customerID));\n",
+ "schema(X).names"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "intersect([:Churn, :customerID], schema(X).names)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We'll do the same for the holdout data:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "const ytest, Xtest = unpack(df_test, ==(:Churn), !=(:customerID));"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Loading a model and checking type requirements"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "> Introduces: `@load`, `input_scitype`, `target_scitype`"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "For tools helping us to identify suitable models, see the [Model\n",
+ "Search](https://alan-turing-institute.github.io/MLJ.jl/dev/model_search/#model_search)\n",
+ "section of the manual. We will build a gradient tree-boosting model,\n",
+ "a popular first choice for structured data like we have here. Model\n",
+ "code is contained in a third-party package called\n",
+ "[EvoTrees.jl](https://github.com/Evovest/EvoTrees.jl) which is\n",
+ "loaded as follows:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "Booster = @load EvoTreeClassifier pkg=EvoTrees"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Recall that a *model* is just a container for some algorithm's\n",
+ "hyper-parameters. Let's create a `Booster` with default values for\n",
+ "the hyper-parameters:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "booster = Booster()"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This model is appropriate for the kind of target variable we have because of\n",
+ "the following passing test:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "scitype(y) <: target_scitype(booster)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "However, our features `X` cannot be directly used with `booster`:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "scitype(X) <: input_scitype(booster)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "As it turns out, this is because `booster`, like the majority of MLJ\n",
+ "supervised models, expects the features to be `Continuous`. (With\n",
+ "some experience, this can be gleaned from `input_scitype(booster)`.)\n",
+ "So we need categorical feature encoding, discussed next."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Building a model pipeline to incorporate feature encoding"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "> Introduces: `ContinuousEncoder`, pipeline operator `|>`"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The built-in `ContinuousEncoder` model transforms an arbitrary table\n",
+ "to a table whose features are all `Continuous` (dropping any fields\n",
+ "it does not know how to encode). In particular, all `Multiclass`\n",
+ "features are one-hot encoded."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "A *pipeline* is a stand-alone model that internally combines one or\n",
+ "more models in a linear (non-branching) pipeline. Here's a pipeline\n",
+ "that adds the `ContinuousEncoder` as a pre-processor to the\n",
+ "gradient tree-boosting model above:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "pipe = ContinuousEncoder() |> booster"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Note that the component models appear as hyper-parameters of\n",
+ "`pipe`. Pipelines are an implementation of a more general [model\n",
+ "composition](https://alan-turing-institute.github.io/MLJ.jl/dev/composing_models/#Composing-Models)\n",
+ "interface provided by MLJ that advanced users may want to learn about."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "From the above display, we see that component model hyper-parameters\n",
+ "are now *nested*, but they are still accessible (important in hyper-parameter\n",
+ "optimization):"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "pipe.evo_tree_classifier.max_depth"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Evaluating the pipeline model's performance"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "> Introduces: `measures` (function), **measures:** `brier_loss`, `auc`, `accuracy`;\n",
+ "> `machine`, `fit!`, `predict`, `fitted_params`, `report`, `roc`, **resampling strategy** `StratifiedCV`, `evaluate`, `FeatureSelector`"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Without touching our test set `Xtest`, `ytest`, we will estimate the\n",
+ "performance of our pipeline model, with default hyper-parameters, in\n",
+ "two different ways:"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Evaluating by hand.** First, we'll do this \"by hand\" using the `fit!` and `predict`\n",
+ "workflow illustrated for the iris data set above, using a\n",
+ "holdout resampling strategy. At the same time we'll see how to\n",
+ "generate a **confusion matrix**, **ROC curve**, and inspect\n",
+ "**feature importances**."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Automated performance evaluation.** Next we'll apply the more\n",
+ "typical and convenient `evaluate` workflow, but using `StratifiedCV`\n",
+ "(stratified cross-validation) which is more informative."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "In any case, we need to choose some measures (metrics) to quantify\n",
+ "the performance of our model. For a complete list of measures, one\n",
+ "does `measures()`. Or we also can do:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "measures(\"Brier\")"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We will be primarily using `brier_loss`, but also `auc` (area under\n",
+ "the ROC curve) and `accuracy`."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Evaluating by hand (with a holdout set)"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Our pipeline model can be trained just like the decision tree model\n",
+ "we built for the iris data set. Binding all non-test data to the\n",
+ "pipeline model:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "mach_pipe = machine(pipe, X, y)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We already encountered the `partition` method above. Here we apply\n",
+ "it to row indices, instead of data containers, as `fit!` and\n",
+ "`predict` only need a *view* of the data to work."
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "train, validation = partition(1:length(y), 0.7)\n",
+ "fit!(mach_pipe, rows=train)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We note in passing that we can access two kinds of information from a trained machine:"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "- The **learned parameters** (eg, coefficients of a linear model): We use `fitted_params(mach_pipe)`\n",
+ "- Other **by-products of training** (eg, feature importances): We use `report(mach_pipe)`"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "fp = fitted_params(mach_pipe);\n",
+ "keys(fp)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "For example, we can check that the encoder did not actually drop any features:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "Set(fp.continuous_encoder.features_to_keep) == Set(schema(X).names)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "And, from the report, extract feature importances:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "rpt = report(mach_pipe)\n",
+ "keys(rpt.evo_tree_classifier)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "fi = rpt.evo_tree_classifier.feature_importances\n",
+ "feature_importance_table =\n",
+ " (feature=Symbol.(first.(fi)), importance=last.(fi)) |> DataFrames.DataFrame"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "For models not reporting feature importances, we recommend the\n",
+ "[Shapley.jl](https://expandingman.gitlab.io/Shapley.jl/) package."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Returning to predictions and evaluations of our measures:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "ŷ = predict(mach_pipe, rows=validation);\n",
+ "@info(\"Measurements\",\n",
+ " brier_loss(ŷ, y[validation]) |> mean,\n",
+ " auc(ŷ, y[validation]),\n",
+ " accuracy(mode.(ŷ), y[validation])\n",
+ " )"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Note that we need `mode` in the last case because `accuracy` expects\n",
+ "point predictions, not probabilistic ones. (One can alternatively\n",
+ "use `predict_mode` to generate the predictions.)"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "While we're here, lets also generate a **confusion matrix** and\n",
+ "[receiver-operator\n",
+ "characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n",
+ "(ROC):"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "confmat(mode.(ŷ), y[validation])"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Note: Importing the plotting package and calling the plotting\n",
+ "functions for the first time can take a minute or so."
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "using Plots"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "roc_curve = roc(ŷ, y[validation])\n",
+ "plt = scatter(roc_curve, legend=false)\n",
+ "plot!(plt, xlab=\"false positive rate\", ylab=\"true positive rate\")\n",
+ "plot!([0, 1], [0, 1], linewidth=2, linestyle=:dash, color=:black)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Automated performance evaluation (more typical workflow)"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We can also get performance estimates with a single call to the\n",
+ "`evaluate` function, which also allows for more complicated\n",
+ "resampling - in this case stratified cross-validation. To make this\n",
+ "more comprehensive, we set `repeats=3` below to make our\n",
+ "cross-validation \"Monte Carlo\" (3 random size-6 partitions of the\n",
+ "observation space, for a total of 18 folds) and set\n",
+ "`acceleration=CPUThreads()` to parallelize the computation."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We choose a `StratifiedCV` resampling strategy; the complete list of options is\n",
+ "[here](https://alan-turing-institute.github.io/MLJ.jl/dev/evaluating_model_performance/#Built-in-resampling-strategies)."
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "e_pipe = evaluate(pipe, X, y,\n",
+ " resampling=StratifiedCV(nfolds=6, rng=123),\n",
+ " measures=[brier_loss, auc, accuracy],\n",
+ " repeats=3,\n",
+ " acceleration=CPUThreads())"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "(There is also a version of `evaluate` for machines. Query the\n",
+ "`evaluate` and `evaluate!` doc-strings to learn more about these\n",
+ "functions and what the `PerformanceEvaluation` object `e_pipe` records.)"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "While [less than ideal](https://arxiv.org/abs/2104.00673), let's\n",
+ "adopt the common practice of using the standard error of a\n",
+ "cross-validation score as an estimate of the uncertainty of a\n",
+ "performance measure's expected value. Here's a utility function to\n",
+ "calculate confidence intervals for our performance estimates based\n",
+ "on this practice, and it's application to the current evaluation:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "using Measurements"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "function confidence_intervals(e)\n",
+ " measure = e.measure\n",
+ " nfolds = length(e.per_fold[1])\n",
+ " measurement = [e.measurement[j] ± std(e.per_fold[j])/sqrt(nfolds - 1)\n",
+ " for j in eachindex(measure)]\n",
+ " table = (measure=measure, measurement=measurement)\n",
+ " return DataFrames.DataFrame(table)\n",
+ "end\n",
+ "\n",
+ "const confidence_intervals_basic_model = confidence_intervals(e_pipe)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Filtering out unimportant features"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "> Introduces: `FeatureSelector`"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Before continuing, we'll modify our pipeline to drop those features\n",
+ "with low feature importance, to speed up later optimization:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "unimportant_features = filter(:importance => <(0.005), feature_importance_table).feature\n",
+ "\n",
+ "pipe2 = ContinuousEncoder() |>\n",
+ " FeatureSelector(features=unimportant_features, ignore=true) |> booster"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Wrapping our iterative model in control strategies"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "> Introduces: **control strategies:** `Step`, `NumberSinceBest`, `TimeLimit`, `InvalidValue`, **model wrapper** `IteratedModel`, **resampling strategy:** `Holdout`"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We want to optimize the hyper-parameters of our model. Since our\n",
+ "model is iterative, these parameters include the (nested) iteration\n",
+ "parameter `pipe.evo_tree_classifier.nrounds`. Sometimes this\n",
+ "parameter is optimized first, fixed, and then maybe optimized again\n",
+ "after the other parameters. Here we take a more principled approach,\n",
+ "**wrapping our model in a control strategy** that makes it\n",
+ "\"self-iterating\". The strategy applies a stopping criterion to\n",
+ "*out-of-sample* estimates of the model performance, constructed\n",
+ "using an internally constructed holdout set. In this way, we avoid\n",
+ "some data hygiene issues, and, when we subsequently optimize other\n",
+ "parameters, we will always being using an optimal number of\n",
+ "iterations."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Note that this approach can be applied to any iterative MLJ model,\n",
+ "eg, the neural network models provided by\n",
+ "[MLJFlux.jl](https://github.com/FluxML/MLJFlux.jl)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "First, we select appropriate controls from [this\n",
+ "list](https://alan-turing-institute.github.io/MLJ.jl/dev/controlling_iterative_models/#Controls-provided):"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "controls = [\n",
+ " Step(1), # to increment iteration parameter (`pipe.nrounds`)\n",
+ " NumberSinceBest(4), # main stopping criterion\n",
+ " TimeLimit(2/3600), # never train more than 2 sec\n",
+ " InvalidValue() # stop if NaN or ±Inf encountered\n",
+ "]"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now we wrap our pipeline model using the `IteratedModel` wrapper,\n",
+ "being sure to specify the `measure` on which internal estimates of\n",
+ "the out-of-sample performance will be based:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "iterated_pipe = IteratedModel(model=pipe2,\n",
+ " controls=controls,\n",
+ " measure=brier_loss,\n",
+ " resampling=Holdout(fraction_train=0.7))"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We've set `resampling=Holdout(fraction_train=0.7)` to arrange that\n",
+ "data attached to our model should be internally split into a train\n",
+ "set (70%) and a holdout set (30%) for determining the out-of-sample\n",
+ "estimate of the Brier loss."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "For demonstration purposes, let's bind `iterated_model` to all data\n",
+ "not in our don't-touch holdout set, and train on all of that data:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "mach_iterated_pipe = machine(iterated_pipe, X, y)\n",
+ "fit!(mach_iterated_pipe);"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "To recap, internally this training is split into two separate steps:"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "- A controlled iteration step, training on the holdout set, with the total number of iterations determined by the specified stopping criteria (based on the out-of-sample performance estimates)\n",
+ "- A final step that trains the atomic model on *all* available\n",
+ " data using the number of iterations determined in the first step. Calling `predict` on `mach_iterated_pipe` means using the learned parameters of the second step."
+ ],
+ "metadata": {}
+ },
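+ {
+ "cell_type": "markdown",
+ "source": [
+ "As a quick sanity check, we can inspect the log of applied controls (assuming,\n",
+ "as for the `best_report` we query later, that the wrapper's report records\n",
+ "them) to see which stopping criterion fired:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "report(mach_iterated_pipe).controls |> collect"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },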
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Hyper-parameter optimization (model tuning)"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "> Introduces: `range`, **model wrapper** `TunedModel`, `RandomSearch`"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We now turn to hyper-parameter optimization. A tool not discussed\n",
+ "here is the `learning_curve` function, which can be useful when\n",
+ "wanting to visualize the effect of changes to a *single*\n",
+ "hyper-parameter (which could be an iteration parameter). See, for\n",
+ "example, [this section of the\n",
+ "manual](https://alan-turing-institute.github.io/MLJ.jl/dev/learning_curves/)\n",
+ "or [this\n",
+ "tutorial](https://github.com/ablaom/MLJTutorial.jl/blob/dev/notebooks/04_tuning/notebook.ipynb)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Fine tuning the hyper-parameters of a gradient booster can be\n",
+ "somewhat involved. Here we settle for simultaneously optimizing two\n",
+ "key parameters: `max_depth` and `η` (learning_rate)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Like iteration control, **model optimization in MLJ is implemented as\n",
+ "a model wrapper**, called `TunedModel`. After wrapping a model in a\n",
+ "tuning strategy and binding the wrapped model to data in a machine\n",
+ "called `mach`, calling `fit!(mach)` instigates a search for optimal\n",
+ "model hyperparameters, within a specified range, and then uses all\n",
+ "supplied data to train the best model. To predict using that model,\n",
+ "one then calls `predict(mach, Xnew)`. In this way the wrapped model\n",
+ "may be viewed as a \"self-tuning\" version of the unwrapped\n",
+ "model. That is, wrapping the model simply transforms certain\n",
+ "hyper-parameters into learned parameters (just as `IteratedModel`\n",
+ "does for an iteration parameter)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "To start with, we define ranges for the parameters of\n",
+ "interest. Since these parameters are nested, let's force a\n",
+ "display of our model to a larger depth:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "show(iterated_pipe, 2)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "p1 = :(model.evo_tree_classifier.η)\n",
+ "p2 = :(model.evo_tree_classifier.max_depth)\n",
+ "\n",
+ "r1 = range(iterated_pipe, p1, lower=-2, upper=-0.5, scale=x->10^x)\n",
+ "r2 = range(iterated_pipe, p2, lower=2, upper=6)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
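+ {
+ "cell_type": "markdown",
+ "source": [
+ "Note that `r1` is sampled on a log scale: with `lower=-2`, `upper=-0.5` and\n",
+ "`scale=x->10^x`, the learning rate `η` effectively ranges over\n",
+ "$[10^{-2}, 10^{-0.5}] \\approx [0.01, 0.32]$."
+ ],
+ "metadata": {}
+ },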
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Nominal ranges are defined by specifying `values` instead of `lower`\n",
+ "and `upper`."
+ ],
+ "metadata": {}
+ },
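+ {
+ "cell_type": "markdown",
+ "source": [
+ "For illustration only (we don't use this range below), here's a nominal range\n",
+ "over the Boolean `drop_last` flag of our pipeline's `ContinuousEncoder`:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "r_nominal = range(iterated_pipe, :(model.continuous_encoder.drop_last),\n",
+ " values=[true, false])"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },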
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Next, we choose an optimization strategy from [this\n",
+ "list](https://alan-turing-institute.github.io/MLJ.jl/dev/tuning_models/#Tuning-Models):"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "tuning = RandomSearch(rng=123)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Then we wrap the model, specifying a `resampling` strategy and a\n",
+ "`measure`, as we did for `IteratedModel`. In fact, we can include a\n",
+ "battery of `measures`; by default, optimization is with respect to\n",
+ "performance estimates based on the first measure, but estimates for\n",
+ "all measures can be accessed from the model's `report`."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The keyword `n` specifies the total number of models (sets of\n",
+ "hyper-parameters) to evaluate."
+ ],
+ "metadata": {}
+ },
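+ {
+ "cell_type": "markdown",
+ "source": [
+ "In our case the search below therefore fits the self-iterating pipeline\n",
+ "$40 \\times 6 = 240$ times (40 candidates, each evaluated by 6-fold stratified\n",
+ "cross-validation), before a final retrain of the best model on all supplied data."
+ ],
+ "metadata": {}
+ },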
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "tuned_iterated_pipe = TunedModel(model=iterated_pipe,\n",
+ " range=[r1, r2],\n",
+ " tuning=tuning,\n",
+ " measures=[brier_loss, auc, accuracy],\n",
+ " resampling=StratifiedCV(nfolds=6, rng=123),\n",
+ " acceleration=CPUThreads(),\n",
+ " n=40)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "To save time, we skip the `repeats` here."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Binding our final model to data and training:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "mach_tuned_iterated_pipe = machine(tuned_iterated_pipe, X, y)\n",
+ "fit!(mach_tuned_iterated_pipe)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "As explained above, the training we have just performed was split\n",
+ "internally into two separate steps:"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "- A step to determine the parameter values that optimize the aggregated cross-validation scores\n",
+ "- A final step that trains the optimal model on *all* available data. Future predictions `predict(mach_tuned_iterated_pipe, ...)` are based on this final training step."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "From `report(mach_tuned_iterated_pipe)` we can extract details about\n",
+ "the optimization procedure. For example:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "rpt2 = report(mach_tuned_iterated_pipe);\n",
+ "best_booster = rpt2.best_model.model.evo_tree_classifier"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "@info \"Optimal hyper-parameters:\" best_booster.max_depth best_booster.η;"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Using the `confidence_intervals` function we defined earlier:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "e_best = rpt2.best_history_entry\n",
+ "confidence_intervals(e_best)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Digging a little deeper, we can learn what stopping criterion was\n",
+ "applied in the case of the optimal model, and how many iterations\n",
+ "were required:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "rpt2.best_report.controls |> collect"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Finally, we can visualize the optimization results:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "plot(mach_tuned_iterated_pipe, size=(600,450))"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Saving our model"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "> Introduces: `MLJ.save`"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Here's how to serialize our final, trained self-iterating,\n",
+ "self-tuning pipeline machine:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "MLJ.save(\"tuned_iterated_pipe.jlso\", mach_tuned_iterated_pipe)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We'll deserialize this in \"Testing the final model\" below."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Final performance estimate"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Finally, to get an even more accurate estimate of performance, we\n",
+ "can evaluate our model using stratified cross-validation and all the\n",
+ "data attached to our machine. Because this evaluation implies\n",
+ "[nested\n",
+ "resampling](https://mlr.mlr-org.com/articles/tutorial/nested_resampling.html),\n",
+ "this computation takes quite a bit longer than the previous one\n",
+ "(which is being repeated six times, using 5/6th of the data each\n",
+ "time):"
+ ],
+ "metadata": {}
+ },
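+ {
+ "cell_type": "markdown",
+ "source": [
+ "A rough count of the cost: each of the 6 outer folds re-runs the random search\n",
+ "(`n=40` candidates, each evaluated by 6-fold stratified cross-validation), which\n",
+ "amounts to something like $6 \\times 40 \\times 6 = 1440$ trainings of the\n",
+ "iterated pipeline, plus the retraining of each outer fold's best model:"
+ ],
+ "metadata": {}
+ },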
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "e_tuned_iterated_pipe = evaluate(tuned_iterated_pipe, X, y,\n",
+ " resampling=StratifiedCV(nfolds=6, rng=123),\n",
+ " measures=[brier_loss, auc, accuracy])"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "confidence_intervals(e_tuned_iterated_pipe)"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "For comparison, here are the confidence intervals for the basic\n",
+ "pipeline model (no feature selection and default hyperparameters):"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "confidence_intervals_basic_model"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "As each pair of intervals overlap, it's doubtful the small changes\n",
+ "here can be assigned statistical significance. Default `booster`\n",
+ "hyper-parameters do a pretty good job."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Testing the final model"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We now determine the performance of our model on our\n",
+ "lock-and-throw-away-the-key holdout set. To demonstrate\n",
+ "deserialization, we'll pretend we're in a new Julia session (but\n",
+ "have called `import`/`using` on the same packages). Then the\n",
+ "following should suffice to recover our model trained under\n",
+ "\"Hyper-parameter optimization\" above:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "mach_restored = machine(\"tuned_iterated_pipe.jlso\")"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We compute predictions on the holdout set:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "ŷ_tuned = predict(mach_restored, Xtest);\n",
+ "ŷ_tuned[1]"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "And can compute the final performance measures:"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "@info(\"Tuned model measurements on test:\",\n",
+ " brier_loss(ŷ_tuned, ytest) |> mean,\n",
+ " auc(ŷ_tuned, ytest),\n",
+ " accuracy(mode.(ŷ_tuned), ytest)\n",
+ " )"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "For comparison, here's the performance for the basic pipeline model"
+ ],
+ "metadata": {}
+ },
+ {
+ "outputs": [],
+ "cell_type": "code",
+ "source": [
+ "mach_basic = machine(pipe, X, y)\n",
+ "fit!(mach_basic, verbosity=0)\n",
+ "\n",
+ "ŷ_basic = predict(mach_basic, Xtest);\n",
+ "\n",
+ "@info(\"Basic model measurements on test set:\",\n",
+ " brier_loss(ŷ_basic, ytest) |> mean,\n",
+ " auc(ŷ_basic, ytest),\n",
+ " accuracy(mode.(ŷ_basic), ytest)\n",
+ " )"
+ ],
+ "metadata": {},
+ "execution_count": null
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "---\n",
+ "\n",
+ "*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*"
+ ],
+ "metadata": {}
+ }
+ ],
+ "nbformat_minor": 3,
+ "metadata": {
+ "language_info": {
+ "file_extension": ".jl",
+ "mimetype": "application/julia",
+ "name": "julia",
+ "version": "1.6.5"
+ },
+ "kernelspec": {
+ "name": "julia-1.6",
+ "display_name": "Julia 1.6.5",
+ "language": "julia"
+ }
+ },
+ "nbformat": 4
+}
diff --git a/examples/telco/scitypes.png b/examples/telco/scitypes.png
new file mode 100644
index 000000000..8e04fadf1
Binary files /dev/null and b/examples/telco/scitypes.png differ
diff --git a/examples/telco/tuned_iterated_pipe.jlso b/examples/telco/tuned_iterated_pipe.jlso
new file mode 100644
index 000000000..f81f90df1
Binary files /dev/null and b/examples/telco/tuned_iterated_pipe.jlso differ