
Introduction

Azure CLI 2.0 allows you to create and manage Batch AI resources: create and delete Batch AI file servers and clusters, and submit and monitor training jobs.

This recipe shows how to create a GPU cluster, then run and monitor a training job using Horovod with PyTorch.

The training script pytorch_mnist.py is available on the official Horovod GitHub page. This script trains a convolutional neural network on the MNIST database of handwritten digits.

The Workflow

To train a model, you typically need to perform the following steps:

  • Create a GPU or CPU Batch AI cluster to run the job;
  • Make the training data and training scripts available on the cluster nodes;
  • Submit the training job and obtain its logs and/or generated models;
  • Delete the cluster or resize it to zero nodes so you do not pay for compute resources when you are not using them.

In this recipe, we will:

  • Create a two-node GPU cluster (with Standard_NC6 VM size) with name nc6;
  • Create a new storage account and two Azure File Shares, logs and scripts, to store the job's output and the training scripts;
  • Deploy the training script to the storage account before job submission (the MNIST training data is downloaded by the script itself at run time);
  • During the job submission we will instruct Batch AI to mount the Azure File Shares on the cluster's nodes and make them available as a regular file system at $AZ_BATCHAI_JOB_MOUNT_ROOT/logs and $AZ_BATCHAI_JOB_MOUNT_ROOT/scripts, where AZ_BATCHAI_JOB_MOUNT_ROOT is an environment variable set by Batch AI for the job;
  • We will monitor the job execution by streaming its standard output;
  • After the job completion, we will inspect its output and generated models;
  • At the end, we will clean up all allocated resources.

Prerequisites

  • Azure subscription - If you don't have an Azure subscription, create a free account before you begin.
  • Access to Azure CLI 2.0. You can either use Azure CLI 2.0 available in Cloud Shell or install and configure it locally using the following instructions.

Cloud Shell Only

If you are using Cloud Shell, please change the working directory to /usr/$USER/clouddrive because your home directory has very limited free space:

cd /usr/$USER/clouddrive

Create a Resource Group

An Azure resource group is a logical container for deploying and managing Azure resources. The following command will create a new resource group batchai.recipes in the East US location:

az group create -n batchai.recipes -l eastus

Create a Batch AI Workspace

The following command will create a new workspace recipe_workspace in the East US location:

az batchai workspace create -g batchai.recipes -n recipe_workspace -l eastus

Create a GPU Cluster

The following command will create a two-node GPU cluster (VM size is Standard_NC6) using Ubuntu as the operating system image.

az batchai cluster create -n nc6 -g batchai.recipes -w recipe_workspace -s Standard_NC6 -t 2 --generate-ssh-keys 

The --generate-ssh-keys option tells Azure CLI to generate private and public SSH keys if you do not have them already, so you can ssh to the cluster nodes using the SSH key and your current user name. Note: if you are using Cloud Shell, you need to back up the ~/.ssh folder to some permanent storage.

Example output:

{
  "allocationState": "steady",
  "allocationStateTransitionTime": "2018-06-12T21:25:07.039000+00:00",
  "creationTime": "2018-06-12T21:25:07.039000+00:00",
  "currentNodeCount": 2,
  "errors": null,
  "id": "/subscriptions/1cba1da6-5a83-45e1-a88e-8b397eb84356/resourceGroups/batchai.recipes/providers/Microsoft.BatchAI/workspaces/recipe_workspace/clusters/nc6",
  "name": "nc6",
  "nodeSetup": null,
  "nodeStateCounts": {
    "idleNodeCount": 2,
    "leavingNodeCount": 0,
    "preparingNodeCount": 0,
    "runningNodeCount": 0,
    "unusableNodeCount": 0
  },
  "provisioningState": "succeeded",
  "provisioningStateTransitionTime": "2018-06-12T21:25:23.591000+00:00",
  "resourceGroup": "batchai.recipes",
  "scaleSettings": {
    "autoScale": null,
    "manual": {
      "nodeDeallocationOption": "requeue",
      "targetNodeCount": 2
    }
  },
  "subnet": null,
  "type": "Microsoft.BatchAI/workspaces/clusters",
  "userAccountSettings": {
    "adminUserName": "recipeuser",
    "adminUserPassword": null,
    "adminUserSshPublicKey": "<YOUR SSH PUBLIC KEY HERE>"
  },
  "virtualMachineConfiguration": {
    "imageReference": {
      "offer": "UbuntuServer",
      "publisher": "Canonical",
      "sku": "16.04-LTS",
      "version": "latest",
      "virtualMachineImageId": null
    }
  },
  "vmPriority": "dedicated",
  "vmSize": "STANDARD_NC6"
}
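
The cluster may take several minutes to allocate and prepare the nodes. You can re-check the allocation progress with the standard --query option (the field names follow the example output above) and, if you want to ssh to a node, list the nodes' remote login details. The node list command is a sketch assuming the az batchai cluster node list subcommand; check its --help output if the columns differ in your CLI version:

az batchai cluster show -n nc6 -g batchai.recipes -w recipe_workspace --query nodeStateCounts
az batchai cluster node list -c nc6 -g batchai.recipes -w recipe_workspace -o table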

Create a Storage Account

Create a new storage account in the same region where you are going to create the Batch AI cluster and run the job. Note that each storage account must have a globally unique name.

az storage account create -n <storage account name> --sku Standard_LRS -g batchai.recipes

If the selected storage account name is not available, the above command will report a corresponding error. In this case, choose another name and retry.
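
If you prefer to check name availability before creating the account, you can use the generic az storage account check-name command (an optional step, not required by the recipe):

az storage account check-name -n <storage account name>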

Data Deployment

Download the Training Script

For GNU/Linux or Cloud Shell:

wget https://raw.githubusercontent.com/uber/horovod/master/examples/pytorch_mnist.py

Create Azure File Share and Deploy the Training Script

The following commands will create Azure File Shares scripts and logs and copy the training script into the horovod folder inside the scripts share:

az storage share create -n scripts --account-name <storage account name>
az storage share create -n logs --account-name <storage account name>
az storage directory create -n horovod -s scripts --account-name <storage account name>
az storage file upload -s scripts --source pytorch_mnist.py --path horovod --account-name <storage account name> 
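
To confirm that the script landed in the expected location, you can optionally list the contents of the horovod directory. A small sketch, assuming az storage file list supports the --path argument in your CLI version:

az storage file list -s scripts --path horovod --account-name <storage account name> -o table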

Submit Training Job

Prepare Job Configuration File

Create a training job configuration file job.json with the following content:

{
    "$schema": "https://raw.githubusercontent.com/Azure/BatchAI/master/schemas/2018-05-01/job.json",
    "properties": {
        "nodeCount": 2,
        "horovodSettings": {
            "pythonScriptFilePath": "$AZ_BATCHAI_JOB_MOUNT_ROOT/scripts/horovod/pytorch_mnist.py"
        },
        "stdOutErrPathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/logs",
        "mountVolumes": {
            "azureFileShares": [
                {
                    "azureFileUrl": "https://<AZURE_BATCHAI_STORAGE_ACCOUNT>.file.core.windows.net/logs",
                    "relativeMountPath": "logs"
                },
                {
                    "azureFileUrl": "https://<AZURE_BATCHAI_STORAGE_ACCOUNT>.file.core.windows.net/scripts",
                    "relativeMountPath": "scripts"
                }
            ]
        },
        "outputDirectories": [{
            "id": "MODEL",
            "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
            "pathSuffix": "Models"
        }],
        "jobPreparation": {
          "commandLine": "apt update; apt install mpi-default-dev mpi-default-bin -y; pip install horovod"
        },
        "containerSettings": {
            "imageSourceRegistry": {
                "image": "batchaitraining/pytorch:0.4.0-cp36-cuda9-cudnn7"
            }
        }
    }
}

This configuration file specifies:

  • nodeCount - number of nodes required by the job;
  • horovodSettings - specifies that the current job uses Horovod and gives the path to the training script;
  • jobPreparation - the Horovod framework and MPI will be installed by the job preparation command line;
  • stdOutErrPathPrefix - the path where Batch AI will create directories containing the job's logs;
  • mountVolumes - the list of filesystems to be mounted during the job execution. In this case, we are mounting two Azure File Shares, logs and scripts. The filesystems are mounted under $AZ_BATCHAI_JOB_MOUNT_ROOT/<relativeMountPath>;
  • <AZURE_BATCHAI_STORAGE_ACCOUNT> tells that the storage account name will be specified during the job submission via the --storage-account-name parameter or the AZURE_BATCHAI_STORAGE_ACCOUNT environment variable on your computer (see the sketch after this list);
  • containerSettings - the job will run in the pytorch Docker container batchaitraining/pytorch:0.4.0-cp36-cuda9-cudnn7, which is built based on this dockerfile.
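
As mentioned above, instead of passing --storage-account-name with every submission you can export the storage account name (and its key, if key-based access is required in your setup) as environment variables before running az batchai job create. This is a minimal sketch; the AZURE_BATCHAI_STORAGE_KEY variable is included on that assumption:

export AZURE_BATCHAI_STORAGE_ACCOUNT=<storage account name>
export AZURE_BATCHAI_STORAGE_KEY=$(az storage account keys list -n <storage account name> -g batchai.recipes --query "[0].value" -o tsv)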

Submit the Job in an Experiment

Use the following command to create a new experiment called horovod_experiment in the workspace:

az batchai experiment create -g batchai.recipes -w recipe_workspace -n horovod_experiment

Use the following command to submit the job on the cluster:

wget -O job.json https://raw.githubusercontent.com/Azure/BatchAI/master/recipes/Horovod/Horovod-PyTorch/job.json
az batchai job create -c nc6 -n horovod_pytorch -g batchai.recipes -w recipe_workspace -e horovod_experiment -f job.json --storage-account-name <storage account name>

Example output:

{
  "caffe2Settings": null,
  "caffeSettings": null,
  "chainerSettings": null,
  "cluster": {
    "id": "/subscriptions/1cba1da6-5a83-45e1-a88e-8b397eb84356/resourceGroups/batchai.recipes/providers/Microsoft.BatchAI/workspaces/recipe_workspace/clusters/nc6",
    "resourceGroup": "batchai.recipes"
  },
  "cntkSettings": null,
  "constraints": {
    "maxWallClockTime": "7 days, 0:00:00"
  },
  "containerSettings": {
    "imageSourceRegistry": {
      "credentials": null,
      "image": "batchaitraining/pytorch:0.4.0-cp36-cuda9-cudnn7",
      "serverUrl": null
    },
    "shmSize": null
  },
  "creationTime": "2018-06-15T22:02:05.587000+00:00",
  "customMpiSettings": null,
  "customToolkitSettings": null,
  "environmentVariables": null,
  "executionInfo": {
    "endTime": null,
    "errors": null,
    "exitCode": null,
    "startTime": "2018-06-15T22:02:06.725000+00:00"
  },
  "executionState": "running",
  "executionStateTransitionTime": "2018-06-15T22:02:06.724000+00:00",
  "horovodSettings": {
    "commandLineArgs": null,
    "processCount": 2,
    "pythonInterpreterPath": null,
    "pythonScriptFilePath": "$AZ_BATCHAI_JOB_MOUNT_ROOT/scripts/horovod/pytorch_mnist.py"
  },
  "id": "/subscriptions/1cba1da6-5a83-45e1-a88e-8b397eb84356/resourceGroups/batchai.recipes/providers/Microsoft.BatchAI/workspaces/recipe_workspace/experiments/horovod_experiment/jobs/horovod_pytorch",
  "inputDirectories": null,
  "jobOutputDirectoryPathSegment": "1cba1da6-5a83-45e1-a88e-8b397eb84356/batchai.recipes/workspaces/recipe_workspace/experiments/horovod_experiment/jobs/horovod_pytorch/145729b8-ca5a-4a7c-946e-eed1d3d3f16c",
  "jobPreparation": {
    "commandLine": "apt update; apt install mpi-default-dev mpi-default-bin -y; pip install horovod"
  },
  "mountVolumes": {
    "azureBlobFileSystems": null,
    "azureFileShares": [
      {
        "accountName": "batchairecipestorage",
        "azureFileUrl": "https://batchairecipestorage.file.core.windows.net/logs",
        "credentials": {
          "accountKey": null,
          "accountKeySecretReference": null
        },
        "directoryMode": "0777",
        "fileMode": "0777",
        "relativeMountPath": "logs"
      },
      {
        "accountName": "batchairecipestorage",
        "azureFileUrl": "https://batchairecipestorage.file.core.windows.net/scripts",
        "credentials": {
          "accountKey": null,
          "accountKeySecretReference": null
        },
        "directoryMode": "0777",
        "fileMode": "0777",
        "relativeMountPath": "scripts"
      }
    ],
    "fileServers": null,
    "unmanagedFileSystems": null
  },
  "name": "horovod_pytorch",
  "nodeCount": 2,
  "outputDirectories": [
    {
      "id": "MODEL",
      "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
      "pathSuffix": "Models"
    }
  ],
  "provisioningState": "succeeded",
  "provisioningStateTransitionTime": "2018-06-15T22:02:06.275000+00:00",
  "pyTorchSettings": null,
  "resourceGroup": "batchai.recipes",
  "schedulingPriority": "normal",
  "secrets": null,
  "stdOutErrPathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/logs",
  "tensorFlowSettings": null,
  "toolType": "horovod",
  "type": "Microsoft.BatchAI/workspaces/experiments/jobs"
}

Monitor Job Execution

The training script reports the training progress to the stdout.txt file inside the standard output directory. You can monitor the progress using the following command:

az batchai job file stream -j horovod_pytorch -g batchai.recipes -w recipe_workspace -e horovod_experiment -f stdout.txt

Example output:

The file "stdout.txt" not found. Waiting for the job to generate it.
File found with URL "https://batchairecipestorage.file.core.windows.net/logs/1cba1da6-5a83-45e1-a88e-8b397eb84356/batchai.recipes/workspaces/recipe_workspace/experiments/horovod_experiment/jobs/horovod_pytorch/145729b8-ca5a-4a7c-946e-eed1d3d3f16c/stdouterr/stdout.txt?sv=2016-05-31&sr=f&sig=dipkON7iud6T17jvoZl%2FCeQyMLeiUoRKWVAbjNtmtOQ%3D&se=2018-06-15T23%3A12%3A33Z&sp=rl". Start streaming

...

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Processing...
Done!
Train Epoch: 1 [0/30000 (0%)]	Loss: 2.343261
Train Epoch: 1 [0/30000 (0%)]	Loss: 2.298381
Train Epoch: 1 [640/30000 (2%)]	Loss: 2.321822
Train Epoch: 1 [640/30000 (2%)]	Loss: 2.289122

...

Train Epoch: 5 [28160/30000 (94%)]	Loss: 0.283757
Train Epoch: 5 [28160/30000 (94%)]	Loss: 0.242315
Train Epoch: 5 [28800/30000 (96%)]	Loss: 0.257195
Train Epoch: 5 [28800/30000 (96%)]	Loss: 0.300486
Train Epoch: 5 [29440/30000 (98%)]	Loss: 0.187123
Train Epoch: 5 [29440/30000 (98%)]	Loss: 0.169184

The streaming is stopped when the job is completed.
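
Once streaming stops, you can confirm the job's final state by querying the executionState attribute (for example, succeeded or failed); a simple sketch:

az batchai job show -n horovod_pytorch -g batchai.recipes -w recipe_workspace -e horovod_experiment --query executionState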

Instead of streaming, you can also use the Azure Portal or Azure Storage Explorer to inspect the generated files. To distinguish output from different jobs, Batch AI creates a unique folder structure for each of them. You can find the path to the folder containing the output using the jobOutputDirectoryPathSegment attribute of the submitted job:

az batchai job show -n horovod_pytorch -g batchai.recipes -w recipe_workspace -e horovod_experiment --query jobOutputDirectoryPathSegment

Example output:

"00000000-0000-0000-0000-000000000000/batchai.recipes/workspaces/recipe_workspace/experiments/horovod_experiment/jobs/horovod_pytorch/145729b8-ca5a-4a7c-946e-eed1d3d3f16c"

Cleanup Resources

Delete the resource group and all allocated resources with the following command:

az group delete -n batchai.recipes -y
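
If you prefer to keep the storage account and its shares but stop paying for compute, you can instead resize the cluster to zero nodes, as mentioned in the workflow above. A minimal sketch:

az batchai cluster resize -n nc6 -g batchai.recipes -w recipe_workspace -t 0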