Katib is a Kubernetes-native system for hyperparameter tuning and neural architecture search. This short introduction illustrates how to use Katib to:
- Define a hyperparameter tuning experiment.
- Run it using Kubernetes resources.
- Get the best hyperparameter combination from all the trials.
Before you run the hyperparameter tuning experiment, you need to have:
- A Kubernetes cluster with Kubeflow 0.7 installed.
Katib supports multiple machine learning frameworks (e.g. TensorFlow, PyTorch, MXNet, and XGBoost).
In this quick start guide, we demonstrate how to use TensorFlow, one of the most popular machine learning frameworks, to run a hyperparameter tuning job on MNIST.
The first thing we need to do is package the training code into a Docker image. We use the example code, which builds a simple neural network, to train on MNIST. The code trains the network and outputs the TFEvents to /tmp by default.
You can use the prebuilt image gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0, so you can skip this step.
If you want Katib to tune hyperparameters automatically, you need to define an Experiment, a CRD that represents a single optimization run over a feasible space. Each Experiment contains:
- Parallelism configuration: How many trials run in parallel, and the maximum number of total and failed trials.
- Objective: The metric that we want to optimize.
- Search space: The name and distribution (discrete or continuous) of every hyperparameter to search.
- Search algorithm: The algorithm (e.g. Random Search, Grid Search, TPE, Bayesian Optimization) used to find the best hyperparameters.
- Trial Template: The template used to define the trial.
- Metrics collection: How to collect the metrics (e.g. accuracy, loss).
The Experiment is defined in the following YAML configuration:
```yaml
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: --learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: "kubeflow.org/v1"
        kind: TFJob
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "/var/tf_mnist/mnist_with_summaries.py"
                        - "--log_dir=/train/metrics"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}
```
The experiment defines two hyperparameters in parameters: --learning_rate and --batch_size. It uses the random search algorithm and collects metrics from the TensorFlow event files.
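Conceptually, the random search algorithm samples each trial's hyperparameters independently from the feasible space defined in the Experiment above. The following is an illustrative sketch of that sampling in Python, not Katib's actual implementation:

```python
import random

def sample_trial(seed=None):
    """Sample one hyperparameter assignment from the feasible space
    defined in the Experiment above (illustrative sketch only)."""
    rng = random.Random(seed)
    return {
        # double parameter: uniform over [0.01, 0.05]
        "--learning_rate": rng.uniform(0.01, 0.05),
        # int parameter: uniform over [100, 200]
        "--batch_size": rng.randint(100, 200),
    }

# Katib would launch up to maxTrialCount (12) such trials,
# running parallelTrialCount (3) of them at a time.
trials = [sample_trial(seed=i) for i in range(12)]
for t in trials[:3]:
    print(t)
```

Because trials are independent samples, random search parallelizes trivially, which is why parallelTrialCount can be raised without changing the search behavior.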
To deploy the experiment, apply the example from the Katib repository:

```shell
kubectl apply -f ./examples/v1alpha3/tfjob-example.yaml
```
You can get the trial results using the following command (jq must be installed to parse the JSON):

```shell
kubectl -n kubeflow get trials -o json | jq ".items[] | {assignments: .spec.parameterAssignments, observation: .status.observation}"
```
You should see output like the following for each trial:

```json
{
  "assignments": [
    {
      "name": "--learning_rate",
      "value": "0.02722446089467028"
    },
    {
      "name": "--batch_size",
      "value": "115"
    }
  ],
  "observation": {
    "metrics": [
      {
        "name": "accuracy_1",
        "value": "0.987"
      }
    ]
  }
}
```
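Since the objective is to maximize accuracy_1, you can also pick the best trial from this JSON programmatically. Below is a small Python sketch using only the standard library; the input is sample data shaped like the output above (the second trial's values are made up for illustration):

```python
import json

# Sample data shaped like the per-trial output above, collected into a list.
trials_json = """
[
  {"assignments": [{"name": "--learning_rate", "value": "0.0272"},
                   {"name": "--batch_size", "value": "115"}],
   "observation": {"metrics": [{"name": "accuracy_1", "value": "0.987"}]}},
  {"assignments": [{"name": "--learning_rate", "value": "0.0419"},
                   {"name": "--batch_size", "value": "163"}],
   "observation": {"metrics": [{"name": "accuracy_1", "value": "0.953"}]}}
]
"""

def metric(trial, name="accuracy_1"):
    # Find the named metric in a trial's observation.
    for m in trial["observation"]["metrics"]:
        if m["name"] == name:
            return float(m["value"])
    return float("-inf")

trials = json.loads(trials_json)
best = max(trials, key=metric)  # objective type is "maximize"
print(best["assignments"])
```

Note that jq's `.items[]` emits a stream of objects rather than a JSON array, so you would wrap the stream in a list (e.g. with `jq -s`) before parsing it like this.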
Or you can view the results in the Katib UI: <Katib-URL>/katib/#/katib/hp_monitor/kubeflow/tfjob-example.
When you click a trial's name, you will see the details of its metrics.