Sample - ADF pipeline (microsoft#619)
balteravishay authored Mar 22, 2021
1 parent 3d05acb commit 2033d81
Showing 11 changed files with 1,231 additions and 12 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -16,7 +16,7 @@ repos:
'--docstring-convention=numpy',
# 'PEP8 Rules' to ignore in tests. Ignore documentation rules for all tests
# and ignore long lines / whitespaces for e2e-tests where we define jsons in-code.
'--per-file-ignores=**/tests/**.py:D docs/**.py:D e2e-tests/**.py:D,E501,W291,W293 docs/samples/deployments/spark/presidio_anonymize_blobs.py:F821,D103',
'--per-file-ignores=**/tests/**.py:D docs/**.py:D e2e-tests/**.py:D,E501,W291,W293 docs/samples/deployments/spark/presidio_anonymize_blobs.py:E501,F821,D103',
'--extend-ignore=
E203,
D100,


568 changes: 568 additions & 0 deletions docs/samples/deployments/data-factory/azure-deploy-adf-databricks.json

Large diffs are not rendered by default.

72 changes: 72 additions & 0 deletions docs/samples/deployments/data-factory/index.md
@@ -0,0 +1,72 @@
# Anonymize PII entities in an Azure Data Factory ETL Pipeline

You can build data anonymization ETL pipelines using Azure Data Factory (ADF) and Presidio.
The following samples showcase two scenarios that use ADF to move a set of JSON objects from an online location to Azure Storage while anonymizing their content.
The first sample leverages the code for using [Presidio on Azure App Service](../app-service/index.md) to call Presidio as an HTTP REST endpoint from the ADF pipeline while parsing and storing each file as a blob in Azure Storage.
The second sample leverages the code for using [Presidio on Spark](../spark/index.md) to run over a set of files in Azure Blob Storage and anonymize their content, for cases where a large data set requires the scale of Databricks.

The samples use the following Azure Services:

* Azure Data Factory - Hosts and orchestrates the transformation pipeline.
* Azure Key Vault - Holds the access keys for Azure Storage so that keys and secrets are kept out of the code.
* Azure Storage - The persistence layer of this sample.
* Azure Databricks / Azure App Service - Hosts Presidio to anonymize the data.

The input file used by the samples is hosted in the [presidio-research](https://github.com/microsoft/presidio-research/) repository. It is set up as a variable in the provided ARM template and used by Azure Data Factory as the input source.

## Option 1: Presidio as an HTTP REST endpoint

By using Presidio as an HTTP endpoint, the user can select which infrastructure best suits their requirements. In this sample, Presidio is deployed to an Azure App Service, but other deployment targets can be used, such as [Kubernetes](../k8s/index.md).
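
The exact requests issued by the pipeline activities are defined in the ARM template. As a rough sketch of what calling Presidio over HTTP looks like, the example below assumes the default Presidio analyzer and anonymizer REST endpoints (`/analyze` and `/anonymize`) and uses placeholder App Service hostnames; adjust both to match your deployment.

```bash
# Placeholder hostnames - replace with the App Service instances created by the ARM template.
ANALYZER_URL="https://<your-presidio-analyzer>.azurewebsites.net"
ANONYMIZER_URL="https://<your-presidio-anonymizer>.azurewebsites.net"

# Detect PII entities in a sample text.
curl -s -X POST "$ANALYZER_URL/analyze" \
  -H "Content-Type: application/json" \
  -d '{"text": "John Smith lives in Seattle", "language": "en"}'

# Anonymize the same text, passing in the entities found by the analyzer.
curl -s -X POST "$ANONYMIZER_URL/anonymize" \
  -H "Content-Type: application/json" \
  -d '{
        "text": "John Smith lives in Seattle",
        "analyzer_results": [
          {"entity_type": "PERSON", "start": 0, "end": 10, "score": 0.85},
          {"entity_type": "LOCATION", "start": 20, "end": 27, "score": 0.85}
        ]
      }'
```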

![ADF-App-Service](adf-app-service-screenshot.png)


### Deploy the ARM template

Create the Azure App Service and the ADF pipeline by clicking the Deploy-to-Azure button, or by running the following script to provision the [provided ARM template](./azure-deploy-adf-app-service.json).

[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fmicrosoft%2Fpresidio%2Fmain%2Fdocs%2Fsamples%2Fdeployments%2Fdata-factory%2Fazure-deploy-adf-app-service.json)


```bash
RESOURCE_GROUP=[Name of resource group]
LOCATION=[location of resources]

az group create --name $RESOURCE_GROUP --location $LOCATION
az deployment group create -g $RESOURCE_GROUP --template-file ./azure-deploy-adf-app-service.json
```

Note that:

* A SAS token is created for the Azure Storage account and imported into Azure Key Vault using the ARM template built-in [function](https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/template-functions) [listAccountSas](https://docs.microsoft.com/en-us/rest/api/storagerp/storageaccounts/listaccountsas). A sketch for verifying the imported secret follows this list.
* An access policy grants the Azure Data Factory managed identity access to the Azure Key Vault, using the ARM template [reference](https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/template-functions-resource?tabs=json#reference) function on the Data Factory resource to acquire its identity.principalId property. This is enabled by setting the Data Factory ARM resource's identity attribute to a system-assigned managed identity (SystemAssigned).
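
To sanity-check the deployment, one option is to read the imported secret back from the vault. This is only a sketch; the Key Vault and secret names below are placeholders rather than values defined by this sample.

```bash
# Placeholder names - substitute the Key Vault and secret names from your deployment.
KEY_VAULT_NAME=<your-key-vault-name>
SECRET_NAME=<your-sas-token-secret-name>

# Read the SAS token that the ARM template imported into Key Vault.
az keyvault secret show \
  --vault-name "$KEY_VAULT_NAME" \
  --name "$SECRET_NAME" \
  --query value \
  -o tsv
```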

## Option 2: Presidio on Azure Databricks

By using Presidio as a Notebook step in ADF, we allow Databricks to scale Presidio according to the cluster capabilities and the input dataset. Using Presidio as a native Python package in PySpark can unlock more analysis and de-identification scenarios.

![ADF-Databricks](adf-databricks-screenshot.png)

### Prerequisite - Deploy Azure Databricks

Provision and set up the Databricks cluster by following the steps in the [presidio-spark sample](../spark/index.md#Azure-Databricks).
**Note** that you should only create and configure the Databricks cluster, and not the storage account, which will be created in the next step.
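
If you need to look up the cluster ID for the deployment parameters used in the next step, one option is the Databricks CLI. This is only a sketch and assumes the CLI is already installed and configured against the workspace, which is outside the scope of this sample.

```bash
# Assumes the Databricks CLI is installed and configured against the workspace
# (for example with: databricks configure --token).
# Lists the clusters in the workspace; note the ID of the presidio-ready cluster.
databricks clusters list
```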

### Deploy the ARM template

Create the rest of the services by running the following script which uses the [provided ARM template](./azure-deploy-adf-databricks.json).

```bash
RESOURCE_GROUP=[Name of resource group]
LOCATION=[location of resources]
DATABRICKS_ACCESS_TOKEN=[Access token to databricks created in the presidio-spark sample]
DATABRICKS_WORKSPACE_URL=[Databricks workspace URL without the https:// prefix]
DATABRICKS_CLUSTER_ID=[Databricks presidio-ready cluster ID]
DATABRICKS_NOTEBOOK_LOCATION=[Location of presidio notebook from the presidio-spark sample]

az group create --name $RESOURCE_GROUP --location $LOCATION
az deployment group create -g $RESOURCE_GROUP --template-file ./azure-deploy-adf-databricks.json --parameters Databricks_accessToken=$DATABRICKS_ACCESS_TOKEN Databricks_clusterId=$DATABRICKS_CLUSTER_ID Databricks_notebookLocation=$DATABRICKS_NOTEBOOK_LOCATION Databricks_workSpaceUrl=$DATABRICKS_WORKSPACE_URL
```

Note that:

* Two secrets are read from Azure Storage and imported into Azure Key Vault, the account access key and a SAS token, using the ARM template built-in [functions](https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/template-functions): [listAccountSas](https://docs.microsoft.com/en-us/rest/api/storagerp/storageaccounts/listaccountsas) and [listKeys](https://docs.microsoft.com/en-us/rest/api/storagerp/storageaccounts/listkeys).
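
Once deployed, the pipeline can be triggered from the Azure Data Factory portal or, as a rough sketch, from the Azure CLI using the `datafactory` extension. The factory and pipeline names below are placeholders; use the names created by the ARM template.

```bash
# The Data Factory commands live in an Azure CLI extension.
az extension add --name datafactory

# Placeholder names - replace with the factory and pipeline created by the ARM template.
az datafactory pipeline create-run \
  --resource-group "$RESOURCE_GROUP" \
  --factory-name <your-data-factory-name> \
  --name <your-pipeline-name>
```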
1 change: 1 addition & 0 deletions docs/samples/deployments/index.md
@@ -3,3 +3,4 @@
- [Azure App Service](app-service/index.md)
- [Kubernetes](k8s/index.md)
- [Spark/Azure Databricks](spark/index.md)
- [Azure Data Factory](data-factory/index.md)
3 changes: 1 addition & 2 deletions docs/samples/deployments/spark/index.md
@@ -21,8 +21,7 @@ LOCATION=[location]

# Create the storage account
az group create --name $RESOURCE_GROUP --location $LOCATION
az storage account create --name $STORAGE_ACCOUNT_NAME --resource-group
$RESOURCE_GROUP
az storage account create --name $STORAGE_ACCOUNT_NAME --resource-group $RESOURCE_GROUP

# Get the storage account access key
STORAGE_ACCESS_KEY=$(az storage account keys list --account-name $STORAGE_ACCOUNT_NAME --resource-group $RESOURCE_GROUP --query '[0].value' -o tsv)
25 changes: 17 additions & 8 deletions docs/samples/deployments/spark/presidio_anonymize_blobs.py
@@ -6,7 +6,7 @@
# MAGIC
# MAGIC <br>The following code sample will:
# MAGIC <ol>
# MAGIC <li>Import the content of text files located in an Azure Storage blob folder</li> # noqa D501
# MAGIC <li>Import the content of text files located in an Azure Storage blob folder</li>
# MAGIC <li>Anonymize the content using Presidio</li>
# MAGIC <li>Write the anonymized content back to the Azure Storage blob account</li>
# MAGIC </ol>
@@ -37,6 +37,16 @@
storage_account_name = dbutils.widgets.get("storage_account_name")
storage_container_name = dbutils.widgets.get("storage_container_name")
storage_account_access_key = dbutils.widgets.get("storage_account_access_key")
storage_mount_name = "/mnt/files"


# unmount container if previously mounted
def sub_unmount(str_path):
if any(mount.mountPoint == str_path for mount in dbutils.fs.mounts()):
dbutils.fs.unmount(str_path)


sub_unmount(storage_mount_name)

# mount the container
dbutils.fs.mount(
@@ -45,7 +55,7 @@
+ "@"
+ storage_account_name
+ ".blob.core.windows.net",
mount_point="/mnt/files",
mount_point=storage_mount_name,
extra_configs={
"fs.azure.account.key."
+ storage_account_name
@@ -55,7 +65,7 @@

# load the files
input_df = spark.read.text(
"/mnt/files/" + dbutils.widgets.get("storage_input_folder") + "/*"
storage_mount_name + "/" + dbutils.widgets.get("storage_input_folder") + "/*"
).withColumn("filename", input_file_name())
display(input_df)

@@ -111,11 +121,10 @@ def anonymize_series(s: pd.Series) -> pd.Series:
# remove hdfs prefix from file name
anonymized_df = anonymized_df.withColumn(
"filename",
regexp_replace("filename", "^.*(/mnt/files/)", ""),
regexp_replace("filename", "^.*(" + storage_mount_name + "/)", ""),
)
anonymized_df = anonymized_df.drop("value")
display(anonymized_df)
anonymized_df.write.csv("/mnt/files/" + dbutils.widgets.get("storage_output_file"))

# unmount the blob container
dbutils.fs.unmount("/mnt/files")
anonymized_df.write.csv(
storage_mount_name + "/" + dbutils.widgets.get("storage_output_folder")
)
3 changes: 3 additions & 0 deletions docs/samples/index.md
@@ -10,3 +10,6 @@
1. [Azure App service](deployments/app-service/index.md)
2. [Kubernetes](deployments/k8s/index.md)
3. [Spark/Azure Databricks](deployments/spark/index.md)
4. [Azure Data Factory with App Service](deployments/data-factory/index.md#option-1-presidio-as-an-http-rest-endpoint)
5. [Azure Data Factory with Databricks](deployments/data-factory/index.md#option-2-presidio-on-azure-databricks)

2 changes: 1 addition & 1 deletion presidio-anonymizer/deploytoazure.json
@@ -56,7 +56,7 @@
{
"type": "Microsoft.Resources/deployments",
"apiVersion": "2019-10-01",
"name": "linkedTemplate",
"name": "presidio-anonymizer",
"properties": {
"mode": "Incremental",
"templateLink": {
