Added R Pipeline example
csiebler committed Feb 3, 2021
1 parent 4a17ef9 commit 5506912
Showing 8 changed files with 10,396 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -23,6 +23,7 @@ A workshop for doing MLOps on Azure Machine Learning.
* :weight_lifting: Exercise - Multi-step pipeline with parameters - [`pipelines-multi-step-pipeline`](pipelines-multi-step-pipeline/)
* :weight_lifting: Exercise - ParallelRunStep pipeline for batch scoring - [`pipelines-parallel-run-step`](pipelines-parallel-run-step/)
* :weight_lifting: Exercise - Hyperparameter tuning pipeline - [`pipelines-hyperdrive-step`](pipelines-hyperdrive-step/)
* :weight_lifting: Exercise - Multi-step pipeline with R code - [`pipelines-with-r`](pipelines-with-r/)
* MLOps on Azure DevOps
* :weight_lifting_woman: Exercise - Deploy AML pipeline as Published Endpoint - [`devops-deploy-simple-pipeline`](devops-deploy-simple-pipeline/)
* :weight_lifting_woman: Exercise - Deploy AML pipeline as Published Endpoint, automatically test it and then add it to a Pipeline Endpoint - [`devops-deploy-pipeline-with-tests`](devops-deploy-pipeline-with-tests/)
9 changes: 9 additions & 0 deletions pipelines-with-r/.amlignore
@@ -0,0 +1,9 @@
data/
.ipynb_checkpoints
azureml-logs
.azureml
.git
outputs
azureml-setup
docs
Dockerfile
22 changes: 22 additions & 0 deletions pipelines-with-r/Dockerfile
@@ -0,0 +1,22 @@
FROM mcr.microsoft.com/azureml/base:openmpi3.1.2-ubuntu18.04

# Install R, R essentials and AzureML defaults
RUN conda install -c r -y pip=20.1.1 openssl=1.1.1c r-base r-essentials rpy2 && \
conda install -c conda-forge -y mscorefonts && \
conda clean -ay && \
pip install --no-cache-dir azureml-defaults

# Set up miniconda environment for successful reticulate configuration
RUN ln -s /opt/miniconda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
echo ". /opt/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc && \
echo "conda activate base" >> ~/.bashrc

# Install and configure R AzureML SDK
ENV TAR="/bin/tar"
RUN R -e "install.packages(c('remotes', 'reticulate', 'optparse'), repos = 'https://cloud.r-project.org/')" && \
R -e "remotes::install_github('https://github.com/Azure/azureml-sdk-for-r')" && \
R -e "library(azuremlsdk); install_azureml()" && \
echo "Sys.setenv(RETICULATE_PYTHON='/opt/miniconda/envs/r-reticulate/bin/python3')" >> ~/.Rprofile

# Install additional R packages
RUN R -e "install.packages(c('e1071'), repos = 'https://cloud.r-project.org/')"
12 changes: 12 additions & 0 deletions pipelines-with-r/README.md
@@ -0,0 +1,12 @@
# Exercise Instructions

On your laptop or a Compute Instance, build the Docker base image for R. Since R environments take a while to build, it makes sense to pre-build a Docker image with the required libraries. See the provided [`Dockerfile`](./Dockerfile) as an example and feel free to add more dependencies. Locate the name of the Azure Container Registry you want to use (preferably the one attached to your AML workspace) and run:

```console
cd pipelines-with-r
az login
az acr build --registry <your registry name> --image r-tutorial:v1 .
```
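
If you are unsure which registry name to use, one way to look it up (assuming the `az` CLI is installed and you are logged in) is to list the registries in your subscription:

```console
az acr list --query "[].name" --output table
```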

Once done, open [`r_pipeline.ipynb`](r_pipeline.ipynb) and follow the instructions in the notebook.

10,001 changes: 10,001 additions & 0 deletions pipelines-with-r/data/diabetes.csv

Large diffs are not rendered by default.

26 changes: 26 additions & 0 deletions pipelines-with-r/prepare.R
@@ -0,0 +1,26 @@
library(azuremlsdk)
library(optparse)
library(caret)

print("In prepare.R")

# Get reference to this AML run to enable logging to the experiment (not needed here)
run <- get_current_run()

options <- list(
make_option(c("--data_path_input")),
make_option(c("--data_path_output"))
)

opt_parser <- OptionParser(option_list = options)
opt <- parse_args(opt_parser)

data <- read.csv(file = file.path(opt$data_path_input, "diabetes.csv"))

# Do data preprocessing here

if (!dir.exists(opt$data_path_output)){
dir.create(opt$data_path_output)
}

# row.names = FALSE avoids writing a spurious index column that the train step would read back
write.csv(data, file = file.path(opt$data_path_output, "diabetes.csv"), row.names = FALSE)
264 changes: 264 additions & 0 deletions pipelines-with-r/r_pipeline.ipynb
@@ -0,0 +1,264 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multi-step pipeline example with R\n",
"\n",
"In this example, we'll be building a two step pipeline which passes data from the a first step (prepare) to the second step (train) in R.\n",
"\n",
"**Note:** This example requires that you've ran the notebook from the first tutorial, so that the compute cluster is set up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"import azureml.core\n",
"from azureml.core import Workspace, Experiment, Dataset, RunConfiguration\n",
"from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter\n",
"from azureml.pipeline.steps import RScriptStep\n",
"from azureml.core.environment import Environment, RSection\n",
" \n",
"from azureml.data.dataset_consumption_config import DatasetConsumptionConfig\n",
"\n",
"print(\"Azure ML SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we will connect to the workspace. The command `Workspace.from_config()` will either:\n",
"* Read the local `config.json` with the workspace reference (given it is there) or\n",
"* Use the `az` CLI to connect to the workspace and use the workspace attached to via `az ml folder attach -g <resource group> -w <workspace name>`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"print(f'WS name: {ws.name}\\nRegion: {ws.location}\\nSubscription id: {ws.subscription_id}\\nResource group: {ws.resource_group}')"
]
},
{
"source": [
"Furthermore, we'll create a new dataset for the R example and register it to the workspace."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Dataset\n",
"\n",
"datastore = ws.get_default_datastore()\n",
"datastore.upload(src_dir='data/', target_path='r-pipeline', overwrite=True)\n",
"ds = Dataset.File.from_files(path=[(datastore, 'r-pipeline')])\n",
"ds.register(ws, name='r-pipeline-tutorial', description='Dataset for R pipeline tutorials', create_new_version=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's get our dataset ready for input to the training job:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"training_dataset = Dataset.get_by_name(ws, \"r-pipeline-tutorial\")\n",
"training_dataset_consumption = DatasetConsumptionConfig(\"training_dataset\", training_dataset).as_download(path_on_compute=\"/data\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also define a `PipelineData` placeholder which will be used to persist and pipe data from the prepare step to the train step:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"default_datastore = ws.get_default_datastore()\n",
"prepared_data = PipelineData(\"prepared_data\", datastore=default_datastore)\n"
]
},
{
"source": [
"Since R environments take quite a while to build, we'll use a Docker image that has all our dependencies built-in. This ensure quick execution of the pipeline and avoids unnecessay, long-running image build processes:"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rc = RunConfiguration()\n",
"rc.framework='R'\n",
"rc.environment.r = RSection()\n",
"rc.environment.docker.enabled = True\n",
"\n",
"# Replace with the name of your container registry!!!\n",
"rc.environment.docker.base_image = 'xxxxxxx.azurecr.io/r-tutorial:v1'\n",
"\n",
"# Disable AML's automatic package installation, but rather rely on pre-built base image\n",
"rc.environment.r.user_managed = True\n",
"rc.environment.python.user_managed_dependencies = True "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we can create our two-stepped pipeline that runs some preprocessing on the data and then pipes the output to the training code. The dependency graph is automatically resolved through the data input/outputs, but we could also define it ourselves (if desired):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prepare_step = RScriptStep(\"prepare.R\",\n",
" name=\"prepare-step\",\n",
" arguments=['--data_path_input', '/data',\n",
" '--data_path_output', prepared_data],\n",
" compute_target='cpu-cluster',\n",
" runconfig=rc,\n",
" inputs=[training_dataset_consumption],\n",
" outputs=[prepared_data],\n",
" source_directory=\"./\",\n",
" custom_docker_image=None,\n",
" allow_reuse=False)\n",
"\n",
"train_step = RScriptStep(\"train.R\",\n",
" name=\"train-step\",\n",
" arguments=['--data_path', prepared_data],\n",
" compute_target='cpu-cluster',\n",
" runconfig=rc,\n",
" inputs=[prepared_data],\n",
" source_directory=\"./\",\n",
" custom_docker_image=None,\n",
" allow_reuse=False)\n",
"\n",
"train_step.run_after(prepare_step) # not really needed here, just for illustration\n",
"steps = [prepare_step, train_step]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can create our pipeline object and validate it. This will check the input and outputs are properly linked and that the pipeline graph is a non-cyclic graph:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"pipeline = Pipeline(workspace=ws, steps=steps)\n",
"pipeline.validate()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, we can submit the pipeline against an experiment:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"pipeline_run = Experiment(ws, 'prepare-training-pipeline-with-r').submit(pipeline)\n",
"pipeline_run.wait_for_completion()"
]
},
{
"source": [
"Alternatively, we can also publish the pipeline as a RESTful API Endpoint. In this case, you can specify the dataset upon invocation of the pipeline. This is nicely possible in the `Studio UI`, goto `Endpoints`, then `Pipeline Endpoints` and then select the pipeline. Once you hit the submit button, you can select the Dataset at the bottom of the window."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"published_pipeline = pipeline.publish('prepare-training-pipeline-with-r')\n",
"published_pipeline"
]
},
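{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch, the published endpoint can also be triggered programmatically via its REST URL (assuming an interactive login is available; the response is assumed to carry the new run id under the `Id` key):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from azureml.core.authentication import InteractiveLoginAuthentication\n",
"\n",
"# Acquire an AAD token to authenticate against the endpoint\n",
"auth = InteractiveLoginAuthentication()\n",
"aad_token = auth.get_authentication_header()\n",
"\n",
"# ExperimentName controls under which experiment the triggered run appears\n",
"response = requests.post(published_pipeline.endpoint,\n",
"                         headers=aad_token,\n",
"                         json={\"ExperimentName\": \"prepare-training-pipeline-with-r\"})\n",
"print(\"Submitted run id:\", response.json().get(\"Id\"))"
]
}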
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9-final"
},
"orig_nbformat": 2,
"kernelspec": {
"name": "python3",
"display_name": "Python 3.7.9 64-bit",
"metadata": {
"interpreter": {
"hash": "54b76a1167e0a2b6a6b8c7f2df323eb2ecfae9d2bbefe58fb0609bf9141d6860"
}
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
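
Note that the eighth changed file is not rendered in this view; given the notebook's train step, it is presumably `pipelines-with-r/train.R`. As a rough, hypothetical sketch only, assuming it mirrors the argument handling of `prepare.R`, reads the prepared CSV, and fits a model with the `e1071` package that the Dockerfile installs (the `Diabetic` label column is likewise an assumption):

```r
library(azuremlsdk)
library(optparse)
library(e1071)

print("In train.R")

# Get reference to this AML run to enable metric logging
run <- get_current_run()

options <- list(
  make_option(c("--data_path"))
)

opt_parser <- OptionParser(option_list = options)
opt <- parse_args(opt_parser)

data <- read.csv(file = file.path(opt$data_path, "diabetes.csv"))

# Hypothetical model fit; the `Diabetic` label column is an assumption
model <- svm(as.factor(Diabetic) ~ ., data = data)

# Log a simple metric and persist the model to outputs/,
# which AML uploads with the run automatically
log_metric_to_run("training_rows", nrow(data))

if (!dir.exists("outputs")) {
  dir.create("outputs")
}
saveRDS(model, file = "outputs/model.rds")
```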