Showing 8 changed files with 10,396 additions and 0 deletions.
@@ -0,0 +1,9 @@
data/
.ipynb_checkpoints
azureml-logs
.azureml
.git
outputs
azureml-setup
docs
Dockerfile
@@ -0,0 +1,22 @@
FROM mcr.microsoft.com/azureml/base:openmpi3.1.2-ubuntu18.04

# Install R+R Essentials and AzureML defaults
RUN conda install -c r -y pip=20.1.1 openssl=1.1.1c r-base r-essentials rpy2 && \
    conda install -c conda-forge -y mscorefonts && \
    conda clean -ay && \
    pip install --no-cache-dir azureml-defaults

# Set up miniconda environment for successful reticulate configuration
RUN ln -s /opt/miniconda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
    echo ". /opt/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    echo "conda activate base" >> ~/.bashrc

# Install and configure R AzureML SDK
ENV TAR="/bin/tar"
RUN R -e "install.packages(c('remotes', 'reticulate', 'optparse'), repos = 'https://cloud.r-project.org/')" && \
    R -e "remotes::install_github('https://github.com/Azure/azureml-sdk-for-r')" && \
    R -e "library(azuremlsdk); install_azureml()" && \
    echo "Sys.setenv(RETICULATE_PYTHON='/opt/miniconda/envs/r-reticulate/bin/python3')" >> ~/.Rprofile

# Install additional R packages
RUN R -e "install.packages(c('e1071'), repos = 'https://cloud.r-project.org/')"
@@ -0,0 +1,12 @@
# Exercise Instructions

On your laptop or Compute Instance, build the Docker base image for R. Since R environments take a while to build, it makes sense to pre-build a Docker image with the required libraries. See the provided [`Dockerfile`](./Dockerfile) as an example and feel free to add more dependencies. Locate the name of the Azure Container Registry you want to use (preferably the one attached to your AML workspace; see the lookup example below) and run:

```console
cd pipelines-with-r
az login
az acr build --registry <your registry name> --image r-tutorial:v1 .
```
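If you are unsure which registry name to use, the `az` CLI can list the registries you have access to, and the workspace details include the registry attached to the AML workspace. This is only a hedged illustration, assuming the v1 `azure-cli-ml` extension used elsewhere in this tutorial; resource group and workspace names are placeholders:

```console
az acr list --output table
az ml workspace show -g <resource group> -w <workspace name> --query containerRegistry
```

The last segment of the returned resource ID is the registry name to pass to `--registry`.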

Once done, open [`r_pipeline.ipynb`](r_pipeline.ipynb) and follow the instructions in the notebook.
@@ -0,0 +1,26 @@
library(azuremlsdk)
library(optparse)
library(caret)

print("In prepare.R")

# Get reference to this AML run to enable logging to the experiment (not needed here)
run <- get_current_run()

options <- list(
  make_option(c("--data_path_input")),
  make_option(c("--data_path_output"))
)

opt_parser <- OptionParser(option_list = options)
opt <- parse_args(opt_parser)

data <- read.csv(file = file.path(opt$data_path_input, "diabetes.csv"))

# Do data preprocessing here

if (!dir.exists(opt$data_path_output)) {
  dir.create(opt$data_path_output)
}

write.csv(data, file = file.path(opt$data_path_output, "diabetes.csv"))
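In the pipeline these paths are supplied automatically (`--data_path_output` points at the mounted `PipelineData` location), but the argument parsing can be smoke-tested locally. The call below is purely illustrative: it assumes the R packages from the Dockerfile (azuremlsdk, optparse, caret) are installed locally and a copy of `diabetes.csv` sits in `./data`:

```console
Rscript prepare.R --data_path_input ./data --data_path_output ./outputs/prepared
```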
@@ -0,0 +1,264 @@
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Multi-step pipeline example with R\n",
        "\n",
        "In this example, we'll build a two-step pipeline that passes data from the first step (prepare) to the second step (train) in R.\n",
        "\n",
        "**Note:** This example requires that you've run the notebook from the first tutorial, so that the compute cluster is set up."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "tags": []
      },
      "outputs": [],
      "source": [
        "import os\n",
        "import azureml.core\n",
        "from azureml.core import Workspace, Experiment, Dataset, RunConfiguration\n",
        "from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter\n",
        "from azureml.pipeline.steps import RScriptStep\n",
        "from azureml.core.environment import Environment, RSection\n",
        "\n",
        "from azureml.data.dataset_consumption_config import DatasetConsumptionConfig\n",
        "\n",
        "print(\"Azure ML SDK version:\", azureml.core.VERSION)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First, we will connect to the workspace. The command `Workspace.from_config()` will either:\n",
        "* Read the local `config.json` with the workspace reference (given it is there) or\n",
        "* Use the `az` CLI to connect to the workspace attached via `az ml folder attach -g <resource group> -w <workspace name>`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "tags": []
      },
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "print(f'WS name: {ws.name}\\nRegion: {ws.location}\\nSubscription id: {ws.subscription_id}\\nResource group: {ws.resource_group}')"
      ]
    },
    {
      "source": [
        "Furthermore, we'll create a new dataset for the R example and register it to the workspace."
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core import Dataset\n",
        "\n",
        "datastore = ws.get_default_datastore()\n",
        "datastore.upload(src_dir='data/', target_path='r-pipeline', overwrite=True)\n",
        "ds = Dataset.File.from_files(path=[(datastore, 'r-pipeline')])\n",
        "ds.register(ws, name='r-pipeline-tutorial', description='Dataset for R pipeline tutorials', create_new_version=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, let's get our dataset ready for input to the training job:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "training_dataset = Dataset.get_by_name(ws, \"r-pipeline-tutorial\")\n",
        "training_dataset_consumption = DatasetConsumptionConfig(\"training_dataset\", training_dataset).as_download(path_on_compute=\"/data\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's also define a `PipelineData` placeholder which will be used to persist and pipe data from the prepare step to the train step:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "default_datastore = ws.get_default_datastore()\n",
        "prepared_data = PipelineData(\"prepared_data\", datastore=default_datastore)\n"
      ]
    },
    {
      "source": [
        "Since R environments take quite a while to build, we'll use a Docker image that has all our dependencies built in. This ensures quick execution of the pipeline and avoids unnecessary, long-running image builds:"
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "rc = RunConfiguration()\n",
        "rc.framework = 'R'\n",
        "rc.environment.r = RSection()\n",
        "rc.environment.docker.enabled = True\n",
        "\n",
        "# Replace with the name of your container registry!\n",
        "rc.environment.docker.base_image = 'xxxxxxx.azurecr.io/r-tutorial:v1'\n",
        "\n",
        "# Disable AML's automatic package installation and rely on the pre-built base image instead\n",
        "rc.environment.r.user_managed = True\n",
        "rc.environment.python.user_managed_dependencies = True"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we can create our two-step pipeline that runs some preprocessing on the data and then pipes the output to the training code. The dependency graph is automatically resolved through the data inputs/outputs, but we could also define it ourselves (if desired):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "prepare_step = RScriptStep(\"prepare.R\",\n",
        "                           name=\"prepare-step\",\n",
        "                           arguments=['--data_path_input', '/data',\n",
        "                                      '--data_path_output', prepared_data],\n",
        "                           compute_target='cpu-cluster',\n",
        "                           runconfig=rc,\n",
        "                           inputs=[training_dataset_consumption],\n",
        "                           outputs=[prepared_data],\n",
        "                           source_directory=\"./\",\n",
        "                           custom_docker_image=None,\n",
        "                           allow_reuse=False)\n",
        "\n",
        "train_step = RScriptStep(\"train.R\",\n",
        "                         name=\"train-step\",\n",
        "                         arguments=['--data_path', prepared_data],\n",
        "                         compute_target='cpu-cluster',\n",
        "                         runconfig=rc,\n",
        "                         inputs=[prepared_data],\n",
        "                         source_directory=\"./\",\n",
        "                         custom_docker_image=None,\n",
        "                         allow_reuse=False)\n",
        "\n",
        "train_step.run_after(prepare_step)  # not really needed here, just for illustration\n",
        "steps = [prepare_step, train_step]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, we can create our pipeline object and validate it. This will check that the inputs and outputs are properly linked and that the pipeline graph is acyclic:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "tags": []
      },
      "outputs": [],
      "source": [
        "pipeline = Pipeline(workspace=ws, steps=steps)\n",
        "pipeline.validate()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Lastly, we can submit the pipeline against an experiment:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "tags": []
      },
      "outputs": [],
      "source": [
        "pipeline_run = Experiment(ws, 'prepare-training-pipeline-with-r').submit(pipeline)\n",
        "pipeline_run.wait_for_completion()"
      ]
    },
    {
      "source": [
        "Alternatively, we can also publish the pipeline as a RESTful API endpoint. In this case, you can specify the dataset upon invocation of the pipeline. This is easily done in the Studio UI: go to `Endpoints`, then `Pipeline Endpoints`, and select the pipeline. Once you hit the Submit button, you can select the dataset at the bottom of the window."
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "published_pipeline = pipeline.publish('prepare-training-pipeline-with-r')\n",
        "published_pipeline"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": []
    }
  ],
  "metadata": {
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.9-final"
    },
    "orig_nbformat": 2,
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3.7.9 64-bit",
      "metadata": {
        "interpreter": {
          "hash": "54b76a1167e0a2b6a6b8c7f2df323eb2ecfae9d2bbefe58fb0609bf9141d6860"
        }
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
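Beyond the Studio UI flow mentioned in the notebook's last markdown cell, the published pipeline can also be triggered over plain REST. The snippet below is a minimal sketch and not part of the committed files: it continues from the notebook session (so `published_pipeline` is assumed to exist), uses interactive login to obtain the AAD token, and reuses the experiment name from the submit cell.

```python
import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Acquire an AAD bearer token for the REST call (interactive login is one of several options)
auth_header = InteractiveLoginAuthentication().get_authentication_header()

# Trigger a new pipeline run against the published endpoint
response = requests.post(
    published_pipeline.endpoint,
    headers=auth_header,
    json={"ExperimentName": "prepare-training-pipeline-with-r"},
)
response.raise_for_status()
print("Submitted run:", response.json().get("Id"))
```

On a successful submission, the response body contains the ID of the newly created pipeline run, which can then be looked up in the experiment.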