Added R Pipeline example
csiebler committed Feb 3, 2021
1 parent 4a17ef9 commit 5506912
Showing 8 changed files with 10,396 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -23,6 +23,7 @@ A workshop for doing MLOps on Azure Machine Learning.
* :weight_lifting: Exercise - Multi-step pipeline with parameters - [`pipelines-multi-step-pipeline`](pipelines-multi-step-pipeline/)
* :weight_lifting: Exercise - ParallelRunStep pipeline for batch scoring - [`pipelines-parallel-run-step`](pipelines-parallel-run-step/)
* :weight_lifting: Exercise - Hyperparameter tuning pipeline - [`pipelines-hyperdrive-step`](pipelines-hyperdrive-step/)
* :weight_lifting: Exercise - Multi-step pipeline with R code - [`pipelines-with-r`](pipelines-with-r/)
* MLOps on Azure DevOps
* :weight_lifting_woman: Exercise - Deploy AML pipeline as Published Endpoint - [`devops-deploy-simple-pipeline`](devops-deploy-simple-pipeline/)
* :weight_lifting_woman: Exercise - Deploy AML pipeline as Published Endpoint, automatically test it and then add it to a Pipeline Endpoint - [`devops-deploy-pipeline-with-tests`](devops-deploy-pipeline-with-tests/)
9 changes: 9 additions & 0 deletions pipelines-with-r/.amlignore
@@ -0,0 +1,9 @@
data/
.ipynb_checkpoints
azureml-logs
.azureml
.git
outputs
azureml-setup
docs
Dockerfile
22 changes: 22 additions & 0 deletions pipelines-with-r/Dockerfile
@@ -0,0 +1,22 @@
FROM mcr.microsoft.com/azureml/base:openmpi3.1.2-ubuntu18.04

# Install R, R essentials and AzureML defaults
RUN conda install -c r -y pip=20.1.1 openssl=1.1.1c r-base r-essentials rpy2 && \
conda install -c conda-forge -y mscorefonts && \
conda clean -ay && \
pip install --no-cache-dir azureml-defaults

# Set up miniconda environment for successful reticulate configuration
RUN ln -s /opt/miniconda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
echo ". /opt/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc && \
echo "conda activate base" >> ~/.bashrc

# Install and configure R AzureML SDK
ENV TAR="/bin/tar"
RUN R -e "install.packages(c('remotes', 'reticulate', 'optparse'), repos = 'https://cloud.r-project.org/')" && \
R -e "remotes::install_github('https://github.com/Azure/azureml-sdk-for-r')" && \
R -e "library(azuremlsdk); install_azureml()" && \
echo "Sys.setenv(RETICULATE_PYTHON='/opt/miniconda/envs/r-reticulate/bin/python3')" >> ~/.Rprofile

# Install additional R packages
RUN R -e "install.packages(c('e1071'), repos = 'https://cloud.r-project.org/')"
12 changes: 12 additions & 0 deletions pipelines-with-r/README.md
@@ -0,0 +1,12 @@
# Exercise Instructions

On your laptop or a Compute Instance, build the Docker base image for R. Since R environments take a while to build, it makes sense to pre-build a Docker image with the required libraries. See the provided [`Dockerfile`](./Dockerfile) as an example and feel free to add more dependencies. Locate the name of the Azure Container Registry you want to use (preferably the one attached to your AML workspace) and run:

```console
cd pipelines-with-r
az login
az acr build --registry <your registry name> --image r-tutorial:v1 .
```
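
If you are unsure which registry name to use, one way to look it up (assuming the `az` CLI is installed and you are logged in) is to list the registries in your subscription:

```console
az acr list --query "[].name" --output table
```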

Once done, open [`r_pipeline.ipynb`](r_pipeline.ipynb) and follow the instructions in the notebook.

10,001 changes: 10,001 additions & 0 deletions pipelines-with-r/data/diabetes.csv

Large diffs are not rendered by default.

26 changes: 26 additions & 0 deletions pipelines-with-r/prepare.R
@@ -0,0 +1,26 @@
library(azuremlsdk)
library(optparse)
library(caret)

print("In prepare.R")

# Get reference to this AML run to enable logging to the experiment (not needed here)
run <- get_current_run()

options <- list(
make_option(c("--data_path_input")),
make_option(c("--data_path_output"))
)

opt_parser <- OptionParser(option_list = options)
opt <- parse_args(opt_parser)

data <- read.csv(file = file.path(opt$data_path_input, "diabetes.csv"))

# Do data preprocessing here

if (!dir.exists(opt$data_path_output)){
dir.create(opt$data_path_output)
}

# row.names = FALSE avoids writing a spurious index column that the train step would read back
write.csv(data, file = file.path(opt$data_path_output, "diabetes.csv"), row.names = FALSE)
264 changes: 264 additions & 0 deletions pipelines-with-r/r_pipeline.ipynb
@@ -0,0 +1,264 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multi-step pipeline example with R\n",
"\n",
"In this example, we'll be building a two step pipeline which passes data from the a first step (prepare) to the second step (train) in R.\n",
"\n",
"**Note:** This example requires that you've ran the notebook from the first tutorial, so that the compute cluster is set up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"import azureml.core\n",
"from azureml.core import Workspace, Experiment, Dataset, RunConfiguration\n",
"from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter\n",
"from azureml.pipeline.steps import RScriptStep\n",
"from azureml.core.environment import Environment, RSection\n",
" \n",
"from azureml.data.dataset_consumption_config import DatasetConsumptionConfig\n",
"\n",
"print(\"Azure ML SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we will connect to the workspace. The command `Workspace.from_config()` will either:\n",
"* Read the local `config.json` with the workspace reference (given it is there) or\n",
"* Use the `az` CLI to connect to the workspace and use the workspace attached to via `az ml folder attach -g <resource group> -w <workspace name>`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"print(f'WS name: {ws.name}\\nRegion: {ws.location}\\nSubscription id: {ws.subscription_id}\\nResource group: {ws.resource_group}')"
]
},
{
"source": [
"Furthermore, we'll create a new dataset for the R example and register it to the workspace."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Dataset\n",
"\n",
"datastore = ws.get_default_datastore()\n",
"datastore.upload(src_dir='data/', target_path='r-pipeline', overwrite=True)\n",
"ds = Dataset.File.from_files(path=[(datastore, 'r-pipeline')])\n",
"ds.register(ws, name='r-pipeline-tutorial', description='Dataset for R pipeline tutorials', create_new_version=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's get our dataset ready for input to the training job:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"training_dataset = Dataset.get_by_name(ws, \"r-pipeline-tutorial\")\n",
"training_dataset_consumption = DatasetConsumptionConfig(\"training_dataset\", training_dataset).as_download(path_on_compute=\"/data\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also define a `PipelineData` placeholder which will be used to persist and pipe data from the prepare step to the train step:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"default_datastore = ws.get_default_datastore()\n",
"prepared_data = PipelineData(\"prepared_data\", datastore=default_datastore)\n"
]
},
{
"source": [
"Since R environments take quite a while to build, we'll use a Docker image that has all our dependencies built-in. This ensure quick execution of the pipeline and avoids unnecessay, long-running image build processes:"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rc = RunConfiguration()\n",
"rc.framework='R'\n",
"rc.environment.r = RSection()\n",
"rc.environment.docker.enabled = True\n",
"\n",
"# Replace with the name of your container registry!!!\n",
"rc.environment.docker.base_image = 'xxxxxxx.azurecr.io/r-tutorial:v1'\n",
"\n",
"# Disable AML's automatic package installation, but rather rely on pre-built base image\n",
"rc.environment.r.user_managed = True\n",
"rc.environment.python.user_managed_dependencies = True "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we can create our two-stepped pipeline that runs some preprocessing on the data and then pipes the output to the training code. The dependency graph is automatically resolved through the data input/outputs, but we could also define it ourselves (if desired):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prepare_step = RScriptStep(\"prepare.R\",\n",
" name=\"prepare-step\",\n",
" arguments=['--data_path_input', '/data',\n",
" '--data_path_output', prepared_data],\n",
" compute_target='cpu-cluster',\n",
" runconfig=rc,\n",
" inputs=[training_dataset_consumption],\n",
" outputs=[prepared_data],\n",
" source_directory=\"./\",\n",
" custom_docker_image=None,\n",
" allow_reuse=False)\n",
"\n",
"train_step = RScriptStep(\"train.R\",\n",
" name=\"train-step\",\n",
" arguments=['--data_path', prepared_data],\n",
" compute_target='cpu-cluster',\n",
" runconfig=rc,\n",
" inputs=[prepared_data],\n",
" source_directory=\"./\",\n",
" custom_docker_image=None,\n",
" allow_reuse=False)\n",
"\n",
"train_step.run_after(prepare_step) # not really needed here, just for illustration\n",
"steps = [prepare_step, train_step]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can create our pipeline object and validate it. This will check the input and outputs are properly linked and that the pipeline graph is a non-cyclic graph:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"pipeline = Pipeline(workspace=ws, steps=steps)\n",
"pipeline.validate()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, we can submit the pipeline against an experiment:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"pipeline_run = Experiment(ws, 'prepare-training-pipeline-with-r').submit(pipeline)\n",
"pipeline_run.wait_for_completion()"
]
},
{
"source": [
"Alternatively, we can also publish the pipeline as a RESTful API Endpoint. In this case, you can specify the dataset upon invocation of the pipeline. This is nicely possible in the `Studio UI`, goto `Endpoints`, then `Pipeline Endpoints` and then select the pipeline. Once you hit the submit button, you can select the Dataset at the bottom of the window."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"published_pipeline = pipeline.publish('prepare-training-pipeline-with-r')\n",
"published_pipeline"
]
},
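{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch, the published endpoint can also be triggered programmatically via its REST URL (assuming an interactive login is available; the response is assumed to carry the new run id under the `Id` key):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from azureml.core.authentication import InteractiveLoginAuthentication\n",
"\n",
"# Acquire an AAD token to authenticate against the endpoint\n",
"auth = InteractiveLoginAuthentication()\n",
"aad_token = auth.get_authentication_header()\n",
"\n",
"# ExperimentName controls under which experiment the triggered run appears\n",
"response = requests.post(published_pipeline.endpoint,\n",
"                         headers=aad_token,\n",
"                         json={\"ExperimentName\": \"prepare-training-pipeline-with-r\"})\n",
"print(\"Submitted run id:\", response.json().get(\"Id\"))"
]
}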
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9-final"
},
"orig_nbformat": 2,
"kernelspec": {
"name": "python3",
"display_name": "Python 3.7.9 64-bit",
"metadata": {
"interpreter": {
"hash": "54b76a1167e0a2b6a6b8c7f2df323eb2ecfae9d2bbefe58fb0609bf9141d6860"
}
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
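
Note that the eighth changed file is not rendered in this view; given the notebook's train step, it is presumably `pipelines-with-r/train.R`. As a rough, hypothetical sketch only, assuming it mirrors the argument handling of `prepare.R`, reads the prepared CSV, and fits a model with the `e1071` package that the Dockerfile installs (the `Diabetic` label column is likewise an assumption):

```r
library(azuremlsdk)
library(optparse)
library(e1071)

print("In train.R")

# Get reference to this AML run to enable metric logging
run <- get_current_run()

options <- list(
  make_option(c("--data_path"))
)

opt_parser <- OptionParser(option_list = options)
opt <- parse_args(opt_parser)

data <- read.csv(file = file.path(opt$data_path, "diabetes.csv"))

# Hypothetical model fit; the `Diabetic` label column is an assumption
model <- svm(as.factor(Diabetic) ~ ., data = data)

# Log a simple metric and persist the model to outputs/,
# which AML uploads with the run automatically
log_metric_to_run("training_rows", nrow(data))

if (!dir.exists("outputs")) {
  dir.create("outputs")
}
saveRDS(model, file = "outputs/model.rds")
```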