Sample - ADF pipeline (microsoft#619)
balteravishay authored Mar 22, 2021
1 parent 3d05acb commit 2033d81
Showing 11 changed files with 1,231 additions and 12 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -16,7 +16,7 @@ repos:
'--docstring-convention=numpy',
# 'PEP8 Rules' to ignore in tests. Ignore documentation rules for all tests
# and ignore long lines / whitespaces for e2e-tests where we define jsons in-code.
'--per-file-ignores=**/tests/**.py:D docs/**.py:D e2e-tests/**.py:D,E501,W291,W293 docs/samples/deployments/spark/presidio_anonymize_blobs.py:F821,D103',
'--per-file-ignores=**/tests/**.py:D docs/**.py:D e2e-tests/**.py:D,E501,W291,W293 docs/samples/deployments/spark/presidio_anonymize_blobs.py:E501,F821,D103',
'--extend-ignore=
E203,
D100,


568 changes: 568 additions & 0 deletions docs/samples/deployments/data-factory/azure-deploy-adf-databricks.json

Large diffs are not rendered by default.

72 changes: 72 additions & 0 deletions docs/samples/deployments/data-factory/index.md
@@ -0,0 +1,72 @@
# Anonymize PII entities in an Azure Data Factory ETL Pipeline

You can build data anonymization ETL pipelines using Azure Data Factory (ADF) and Presidio.
The following samples showcase two scenarios that use ADF to move a set of JSON objects from an online location to Azure Storage while anonymizing their content.
The first sample leverages the code for using [Presidio on Azure App Service](../app-service/index.md) to call Presidio as an HTTP REST endpoint from the ADF pipeline while parsing and storing each file as a blob in Azure Storage.
The second sample leverages the code for using [Presidio on Spark](../spark/index.md) to run over a set of files in Azure Blob Storage and anonymize their content, for cases where a large data set requires the scale of Databricks.

The samples use the following Azure Services:

* Azure Data Factory - Hosts and orchestrates the transformation pipeline.
* Azure Key Vault - Holds the access keys for Azure Storage so that keys and secrets are kept out of the code.
* Azure Storage - The persistence layer of this sample.
* Azure Databricks / Azure App Service - Hosts Presidio to anonymize the data.

The input file used by the samples is hosted in the [presidio-research](https://github.com/microsoft/presidio-research/) repository. It is set up as a variable in the provided ARM template and used by Azure Data Factory as the input source.

## Option 1: Presidio as an HTTP REST endpoint

By using Presidio as an HTTP endpoint, the user can select which infrastructure best suits their requirements. In this sample, Presidio is deployed to an Azure App Service, but other deployment targets can be used, such as [Kubernetes](../k8s/index.md).
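
The exact requests issued by the pipeline activities are defined in the ARM template. As a rough sketch of what calling Presidio over HTTP looks like, the example below assumes the default Presidio analyzer and anonymizer REST endpoints (`/analyze` and `/anonymize`) and uses placeholder App Service hostnames; adjust both to match your deployment.

```bash
# Placeholder hostnames - replace with the App Service instances created by the ARM template.
ANALYZER_URL="https://<your-presidio-analyzer>.azurewebsites.net"
ANONYMIZER_URL="https://<your-presidio-anonymizer>.azurewebsites.net"

# Detect PII entities in a sample text.
curl -s -X POST "$ANALYZER_URL/analyze" \
  -H "Content-Type: application/json" \
  -d '{"text": "John Smith lives in Seattle", "language": "en"}'

# Anonymize the same text, passing in the entities found by the analyzer.
curl -s -X POST "$ANONYMIZER_URL/anonymize" \
  -H "Content-Type: application/json" \
  -d '{
        "text": "John Smith lives in Seattle",
        "analyzer_results": [
          {"entity_type": "PERSON", "start": 0, "end": 10, "score": 0.85},
          {"entity_type": "LOCATION", "start": 20, "end": 27, "score": 0.85}
        ]
      }'
```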

![ADF-App-Service](adf-app-service-screenshot.png)


### Deploy the ARM template

Create the Azure App Service and the ADF pipeline by clicking the Deploy-to-Azure button, or by running the following script to provision the [provided ARM template](./azure-deploy-adf-app-service.json).

[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fmicrosoft%2Fpresidio%2Fmain%2Fdocs%2Fsamples%2Fdeployments%2Fdata-factory%2Fazure-deploy-adf-app-service.json)


```bash
RESOURCE_GROUP=[Name of resource group]
LOCATION=[location of resources]

az group create --name $RESOURCE_GROUP --location $LOCATION
az deployment group create -g $RESOURCE_GROUP --template-file ./azure-deploy-adf-app-service.json
```

Note that:

* A SAS token is created for the Azure Storage account and imported into Azure Key Vault using the ARM template built-in [function](https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/template-functions) [listAccountSas](https://docs.microsoft.com/en-us/rest/api/storagerp/storageaccounts/listaccountsas). A sketch for verifying the imported secret follows this list.
* An access policy grants the Azure Data Factory managed identity access to the Azure Key Vault, using the ARM template [reference](https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/template-functions-resource?tabs=json#reference) function on the Data Factory resource to acquire its identity.principalId property. This is enabled by setting the Data Factory ARM resource's identity attribute to a system-assigned managed identity (SystemAssigned).
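
To sanity-check the deployment, one option is to read the imported secret back from the vault. This is only a sketch; the Key Vault and secret names below are placeholders rather than values defined by this sample.

```bash
# Placeholder names - substitute the Key Vault and secret names from your deployment.
KEY_VAULT_NAME=<your-key-vault-name>
SECRET_NAME=<your-sas-token-secret-name>

# Read the SAS token that the ARM template imported into Key Vault.
az keyvault secret show \
  --vault-name "$KEY_VAULT_NAME" \
  --name "$SECRET_NAME" \
  --query value \
  -o tsv
```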

## Option 2: Presidio on Azure Databricks

By using Presidio as a Notebook step in ADF, we allow Databricks to scale Presidio according to the cluster capabilities and the input dataset. Using Presidio as a native Python package in PySpark can unlock more analysis and de-identification scenarios.

![ADF-Databricks](adf-databricks-screenshot.png)

### Prerequisite - Deploy Azure Databricks

Provision and set up the Databricks cluster by following the steps in the [presidio-spark sample](../spark/index.md#Azure-Databricks).
**Note** that you should only create and configure the Databricks cluster, and not the storage account, which will be created in the next step.
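
If you need to look up the cluster ID for the deployment parameters used in the next step, one option is the Databricks CLI. This is only a sketch and assumes the CLI is already installed and configured against the workspace, which is outside the scope of this sample.

```bash
# Assumes the Databricks CLI is installed and configured against the workspace
# (for example with: databricks configure --token).
# Lists the clusters in the workspace; note the ID of the presidio-ready cluster.
databricks clusters list
```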

### Deploy the ARM template

Create the rest of the services by running the following script which uses the [provided ARM template](./azure-deploy-adf-databricks.json).

```bash
RESOURCE_GROUP=[Name of resource group]
LOCATION=[location of resources]
DATABRICKS_ACCESS_TOKEN=[Access token to databricks created in the presidio-spark sample]
DATABRICKS_WORKSPACE_URL=[Databricks workspace URL without the https:// prefix]
DATABRICKS_CLUSTER_ID=[Databricks presidio-ready cluster ID]
DATABRICKS_NOTEBOOK_LOCATION=[Location of presidio notebook from the presidio-spark sample]

az group create --name $RESOURCE_GROUP --location $LOCATION
az deployment group create -g $RESOURCE_GROUP --template-file ./azure-deploy-adf-databricks.json --parameters Databricks_accessToken=$DATABRICKS_ACCESS_TOKEN Databricks_clusterId=$DATABRICKS_CLUSTER_ID Databricks_notebookLocation=$DATABRICKS_NOTEBOOK_LOCATION Databricks_workSpaceUrl=$DATABRICKS_WORKSPACE_URL
```

Note that:

* Two secrets are read from Azure Storage and imported into Azure Key Vault, the account access key and a SAS token, using the ARM template built-in [functions](https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/template-functions): [listAccountSas](https://docs.microsoft.com/en-us/rest/api/storagerp/storageaccounts/listaccountsas) and [listKeys](https://docs.microsoft.com/en-us/rest/api/storagerp/storageaccounts/listkeys).
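
Once deployed, the pipeline can be triggered from the Azure Data Factory portal or, as a rough sketch, from the Azure CLI using the `datafactory` extension. The factory and pipeline names below are placeholders; use the names created by the ARM template.

```bash
# The Data Factory commands live in an Azure CLI extension.
az extension add --name datafactory

# Placeholder names - replace with the factory and pipeline created by the ARM template.
az datafactory pipeline create-run \
  --resource-group "$RESOURCE_GROUP" \
  --factory-name <your-data-factory-name> \
  --name <your-pipeline-name>
```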
1 change: 1 addition & 0 deletions docs/samples/deployments/index.md
@@ -3,3 +3,4 @@
- [Azure App Service](app-service/index.md)
- [Kubernetes](k8s/index.md)
- [Spark/Azure Databricks](spark/index.md)
- [Azure Data Factory](data-factory/index.md)
3 changes: 1 addition & 2 deletions docs/samples/deployments/spark/index.md
@@ -21,8 +21,7 @@ LOCATION=[location]

# Create the storage account
az group create --name $RESOURCE_GROUP --location $LOCATION
az storage account create --name $STORAGE_ACCOUNT_NAME --resource-group
$RESOURCE_GROUP
az storage account create --name $STORAGE_ACCOUNT_NAME --resource-group $RESOURCE_GROUP

# Get the storage account access key
STORAGE_ACCESS_KEY=$(az storage account keys list --account-name $STORAGE_ACCOUNT_NAME --resource-group $RESOURCE_GROUP --query '[0].value' -o tsv)
25 changes: 17 additions & 8 deletions docs/samples/deployments/spark/presidio_anonymize_blobs.py
@@ -6,7 +6,7 @@
# MAGIC
# MAGIC <br>The following code sample will:
# MAGIC <ol>
# MAGIC <li>Import the content of text files located in an Azure Storage blob folder</li> # noqa D501
# MAGIC <li>Import the content of text files located in an Azure Storage blob folder</li>
# MAGIC <li>Anonymize the content using Presidio</li>
# MAGIC <li>Write the anonymized content back to the Azure Storage blob account</li>
# MAGIC </ol>
@@ -37,6 +37,16 @@
storage_account_name = dbutils.widgets.get("storage_account_name")
storage_container_name = dbutils.widgets.get("storage_container_name")
storage_account_access_key = dbutils.widgets.get("storage_account_access_key")
storage_mount_name = "/mnt/files"


# unmount container if previously mounted
def sub_unmount(str_path):
if any(mount.mountPoint == str_path for mount in dbutils.fs.mounts()):
dbutils.fs.unmount(str_path)


sub_unmount(storage_mount_name)

# mount the container
dbutils.fs.mount(
@@ -45,7 +55,7 @@
+ "@"
+ storage_account_name
+ ".blob.core.windows.net",
mount_point="/mnt/files",
mount_point=storage_mount_name,
extra_configs={
"fs.azure.account.key."
+ storage_account_name
@@ -55,7 +65,7 @@

# load the files
input_df = spark.read.text(
"/mnt/files/" + dbutils.widgets.get("storage_input_folder") + "/*"
storage_mount_name + "/" + dbutils.widgets.get("storage_input_folder") + "/*"
).withColumn("filename", input_file_name())
display(input_df)

@@ -111,11 +121,10 @@ def anonymize_series(s: pd.Series) -> pd.Series:
# remove hdfs prefix from file name
anonymized_df = anonymized_df.withColumn(
"filename",
regexp_replace("filename", "^.*(/mnt/files/)", ""),
regexp_replace("filename", "^.*(" + storage_mount_name + "/)", ""),
)
anonymized_df = anonymized_df.drop("value")
display(anonymized_df)
anonymized_df.write.csv("/mnt/files/" + dbutils.widgets.get("storage_output_file"))

# unmount the blob container
dbutils.fs.unmount("/mnt/files")
anonymized_df.write.csv(
storage_mount_name + "/" + dbutils.widgets.get("storage_output_folder")
)
3 changes: 3 additions & 0 deletions docs/samples/index.md
@@ -10,3 +10,6 @@
1. [Azure App service](deployments/app-service/index.md)
2. [Kubernetes](deployments/k8s/index.md)
3. [Spark/Azure Databricks](deployments/spark/index.md)
4. [Azure Data Factory with App Service](deployments/data-factory/index.md#option-1-presidio-as-an-http-rest-endpoint)
5. [Azure Data Factory with Databricks](deployments/data-factory/index.md#option-2-presidio-on-azure-databricks)

2 changes: 1 addition & 1 deletion presidio-anonymizer/deploytoazure.json
@@ -56,7 +56,7 @@
{
"type": "Microsoft.Resources/deployments",
"apiVersion": "2019-10-01",
"name": "linkedTemplate",
"name": "presidio-anonymizer",
"properties": {
"mode": "Incremental",
"templateLink": {
