Model Selection, Evaluation and Prediction
When a learner is a subclass of `mltk.predictor.HoldoutValidatedLearner`, it allows specifying a validation set and a metric so that the best model on the validation set according to that metric is selected during training. Alternatively, MLTK also supports cross validation using `mltk.core.processor.InstancesSplitter`.
Most machine learning algorithms are iterative, and the best model is usually selected on a validation set based on some metric. By analyzing the series of metric values, one can often conclude that the series has converged and therefore stop the learning algorithm before it reaches the maximum number of iterations. This not only saves training time, but also makes the maximum number of iterations a less important parameter to tune.
MLTK uses the `mltk.predictor.evaluation.ConvergenceTester` class to keep track of the series of metric values and determine whether the series has converged. There are three main parameters for testing convergence: `minNumPoints`, `n` and `c`. The series has to be at least `minNumPoints` long to be eligible for the convergence test. Once it is, we find the index `idx` of the best metric value so far. We say a series has converged if `idx + n < size * c`, where `size` is the current number of points in the series. The intuition is that the best metric value should peak (or bottom) with a wide margin.
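To make the rule concrete, here is a minimal sketch of the convergence check (illustrative only, not MLTK's actual `ConvergenceTester` implementation; the class name, method name and `isLargerBetter` flag are assumptions):

```java
import java.util.List;

// Illustrative sketch of the convergence rule; not MLTK's actual implementation.
public class ConvergenceRuleSketch {

    // Returns true if the metric series has converged under the rule idx + n < size * c.
    public static boolean isConverged(List<Double> metrics, int minNumPoints,
            int n, double c, boolean isLargerBetter) {
        int size = metrics.size();
        if (size < minNumPoints) {
            return false; // not enough points yet to run the test
        }
        // Find the index of the best metric value so far
        int idx = 0;
        for (int i = 1; i < size; i++) {
            boolean better = isLargerBetter
                    ? metrics.get(i) > metrics.get(idx)
                    : metrics.get(i) < metrics.get(idx);
            if (better) {
                idx = i;
            }
        }
        // Converged when enough points follow the best value
        return idx + n < size * c;
    }
}
```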
With `n` and `c` we can implement complex convergence rules, but there are two common cases.
- `n = 0` and `c` in `[0, 1]`: for example, `n = 0` and `c = 0.8` means there should be at least 20% of the points after the best metric value. A smaller value of `c` leads to a more conservative convergence test. This setting is recommended when training boosted tree ensembles.
- `n > 0` and `c = 1.0`: sometimes we need to make sure there are at least `n` points after the peak (or bottom). For example, when training a GAM model, it is recommended to test at least `k` passes after the peak (or bottom), where a pass means iterating over all `p` features. This translates to setting `n = k * p`.
Most subclasses of `mltk.predictor.HoldoutValidatedLearner` have an `-S` option that can be used to specify the convergence criteria. Currently it only works on the validation set. The syntax is `minNumPoints[:n][:c]`. For example, to require at least 200 points and `c = 0.8`, use `-S 200:0:0.8`. To require 200 points and `n = 400`, use `-S 200:400`. The default values for `n` and `c` are 0 and 1.0, respectively. A negative `minNumPoints` turns off the convergence test.
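For illustration, the option string maps onto the `ConvergenceTester` constructor roughly as follows (a hypothetical helper, not MLTK's actual option parsing):

```java
import mltk.predictor.evaluation.ConvergenceTester;

// Hypothetical helper showing how the -S syntax minNumPoints[:n][:c]
// maps onto ConvergenceTester; MLTK's actual option handling may differ.
public class ConvergenceOptionSketch {

    static ConvergenceTester parseConvergenceOption(String s) {
        String[] parts = s.split(":");
        int minNumPoints = Integer.parseInt(parts[0]);
        int n = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;        // default n = 0
        double c = parts.length > 2 ? Double.parseDouble(parts[2]) : 1.0; // default c = 1.0
        return new ConvergenceTester(minNumPoints, n, c);
    }
}
```

For example, `-S 200:0:0.8` corresponds to `new ConvergenceTester(200, 0, 0.8)`, which is exactly what the next snippet constructs.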
The following code specifies `minNumPoints = 200`, `n = 0` and `c = 0.8` in `LogitBoostLearner`:
```java
// Convergence test: at least 200 points, n = 0, c = 0.8
ConvergenceTester ct = new ConvergenceTester(200, 0, 0.8);
LogitBoostLearner learner = new LogitBoostLearner();
learner.setConvergenceTester(ct);
```
MLTK uses the `mltk.predictor.evaluation.Evaluator` class for model evaluation. To evaluate models from the command line, use the following command:
```
$ java mltk.predictor.evaluation.Evaluator
```
It should output a message like this:
```
Usage: mltk.predictor.evaluation.Evaluator
-d   data set path
-m   model path
[-r] attribute file path
[-e] AUC (a), Error (c), Logistic Loss (l), MAE (m), RMSE (r) (default: r)
```
Currently MLTK supports area-under-curve (AUC), classification error (Error), logistic loss (Logistic Loss), log loss (Log Loss), mean absolute error (MAE) and root-mean-squared error (RMSE).
Some learners, such as `LogitBoostLearner`, support customized metrics. On the command line, `-e AUC` will use AUC as the metric, while `-e LogLoss:true` will use log loss. Note that `:` is used to separate parameters: the first part is the metric name, and what follows the `:` are optional parameters. Here `LogLoss:true` uses the raw score to compute log loss (the default is to use the probability).
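For example, to evaluate a model with AUC (the data set and model paths here are placeholders):

```
$ java mltk.predictor.evaluation.Evaluator -d test.txt -m model.txt -e AUC
```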
The following code builds an L1-regularized linear model and evaluates the classification error on a held-out test set:
```java
// Train an L1-regularized linear model (lasso) on the training set
LassoLearner learner = new LassoLearner();
GLM glm = learner.buildClassifier(trainSet, 100, 0.01);
// Evaluate classification error on the held-out test set
double error = Evaluator.evalError(glm, testSet);
```
MLTK uses the `mltk.predictor.evaluation.Predictor` class for making predictions. To make predictions from the command line, use the following command:
```
$ java mltk.predictor.evaluation.Predictor
```
It should output a message like this:
```
Usage: mltk.predictor.evaluation.Predictor
-d   data set path
-m   model path
[-r] attribute file path
[-p] prediction path
[-R] residual path
[-g] task between classification (c) and regression (r) (default: r)
[-P] output probability (default: false)
```
When `-P true` is used, it generates probabilities instead of predicted labels. When `-R` is used, it generates residuals (for classification problems these are pseudo residuals). The residuals are the input to `mltk.predictor.gam.interaction.FAST` when running GA2M.
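For example, to output class probabilities for a classification model (the file paths here are placeholders):

```
$ java mltk.predictor.evaluation.Predictor -d test.txt -m model.txt -p prediction.txt -g c -P true
```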