Katib is a Kubernetes-native system for hyperparameter tuning and neural architecture search. This short introduction illustrates how to use Katib to:
- Define a hyperparameter tuning experiment.
- Run it using Kubernetes resources.
- Get the best hyperparameter combination from all the trials.
Before you run the hyperparameter tuning experiment, you need to have:
- A Kubernetes cluster with Kubeflow 0.7 installed.
Katib supports multiple machine learning frameworks (e.g. TensorFlow, PyTorch, MXNet, and XGBoost).
In this quick start guide, we demonstrate how to use TensorFlow, one of the most popular machine learning frameworks, to run a hyperparameter tuning job on MNIST.
The first thing we need to do is package the training code into a Docker image. We use the example code, which builds a simple neural network, to train on MNIST. The code trains the network and outputs the TFEvents to /tmp by default.
You can use the prebuilt image gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0, so you can skip this step.
If you want Katib to tune hyperparameters automatically, you need to define an Experiment, a CRD that represents a single optimization run over a feasible space. Each Experiment contains:
- Parallelism configuration: How many trials run in parallel, and the maximum number of total and failed trials.
- Objective: The metric that we want to optimize.
- Search space: The name and distribution (discrete or continuous) of every hyperparameter to search.
- Search algorithm: The algorithm (e.g. Random Search, Grid Search, TPE, Bayesian Optimization) used to find the best hyperparameters.
- Trial Template: The template used to define the trial.
- Metrics collection: How to collect the metrics (e.g. accuracy, loss).
The Experiment is defined in the following YAML configuration:
```yaml
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: --learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: "kubeflow.org/v1"
        kind: TFJob
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "/var/tf_mnist/mnist_with_summaries.py"
                        - "--log_dir=/train/metrics"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}
```
The experiment defines two hyperparameters in parameters: --learning_rate and --batch_size. It uses the random search algorithm and collects metrics from the TensorFlow event files.
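Conceptually, the random search algorithm samples each trial's hyperparameters independently from the feasible space defined in the Experiment above. The following is an illustrative sketch of that sampling in Python, not Katib's actual implementation:

```python
import random

def sample_trial(seed=None):
    """Sample one hyperparameter assignment from the feasible space
    defined in the Experiment above (illustrative sketch only)."""
    rng = random.Random(seed)
    return {
        # double parameter: uniform over [0.01, 0.05]
        "--learning_rate": rng.uniform(0.01, 0.05),
        # int parameter: uniform over [100, 200]
        "--batch_size": rng.randint(100, 200),
    }

# Katib would launch up to maxTrialCount (12) such trials,
# running parallelTrialCount (3) of them at a time.
trials = [sample_trial(seed=i) for i in range(12)]
for t in trials[:3]:
    print(t)
```

Because trials are independent samples, random search parallelizes trivially, which is why parallelTrialCount can be raised without changing the search behavior.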
To deploy the experiment, apply the example from the Katib repository:

```shell
kubectl apply -f ./examples/v1alpha3/tfjob-example.yaml
```
You can get the trial results using the following command (jq must be installed to parse the JSON):

```shell
kubectl -n kubeflow get trials -o json | jq ".items[] | {assignments: .spec.parameterAssignments, observation: .status.observation}"
```
You should see output like the following for each trial:

```json
{
  "assignments": [
    {
      "name": "--learning_rate",
      "value": "0.02722446089467028"
    },
    {
      "name": "--batch_size",
      "value": "115"
    }
  ],
  "observation": {
    "metrics": [
      {
        "name": "accuracy_1",
        "value": "0.987"
      }
    ]
  }
}
```
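Since the objective is to maximize accuracy_1, you can also pick the best trial from this JSON programmatically. Below is a small Python sketch using only the standard library; the input is sample data shaped like the output above (the second trial's values are made up for illustration):

```python
import json

# Sample data shaped like the per-trial output above, collected into a list.
trials_json = """
[
  {"assignments": [{"name": "--learning_rate", "value": "0.0272"},
                   {"name": "--batch_size", "value": "115"}],
   "observation": {"metrics": [{"name": "accuracy_1", "value": "0.987"}]}},
  {"assignments": [{"name": "--learning_rate", "value": "0.0419"},
                   {"name": "--batch_size", "value": "163"}],
   "observation": {"metrics": [{"name": "accuracy_1", "value": "0.953"}]}}
]
"""

def metric(trial, name="accuracy_1"):
    # Find the named metric in a trial's observation.
    for m in trial["observation"]["metrics"]:
        if m["name"] == name:
            return float(m["value"])
    return float("-inf")

trials = json.loads(trials_json)
best = max(trials, key=metric)  # objective type is "maximize"
print(best["assignments"])
```

Note that jq's `.items[]` emits a stream of objects rather than a JSON array, so you would wrap the stream in a list (e.g. with `jq -s`) before parsing it like this.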
Or you can view the results in the Katib UI: <Katib-URL>/katib/#/katib/hp_monitor/kubeflow/tfjob-example.
When you click a trial's name, you will see the details of its metrics.