From the Cloud Data Loss Prevention documentation, "Cloud DLP helps you better understand and manage sensitive data. It provides fast, scalable classification and redaction for sensitive data elements like credit card numbers, names, social security numbers, US and selected international identifier numbers, phone numbers, and GCP credentials. Cloud DLP classifies this data using more than 90 predefined detectors to identify patterns, formats, and checksums, and even understands contextual clues. You can optionally redact data as well, using techniques like masking, secure hashing, tokenization, bucketing, and format-preserving encryption."
In this project, the DLP API is configured as a logging filter for the paymentservice
microservice of the microservices-demo application. Since all application logs are
sent to Stackdriver Logging, this filter removes sensitive data from log
events before they reach Stackdriver.
This is accomplished by customizing the fluentd configuration so that the
paymentservice application logs are not sent directly to Stackdriver, but are
first submitted to the DLP API for redaction. The redacted result is then
submitted to Stackdriver Logging.
Specifically, after a purchase is completed in the microservices demo web application, a
log event such as this is generated by paymentservice. Note that the unredacted (demo)
credit card number is included in the log event:
{"severity":"info","time":1555345379891,"message":"PaymentService#Charge invoked with request {\"amount\":{\"currency_code\":\"USD\",\"units\":\"41\",\"nanos\":180000000},\"credit_card\":{\"credit_card_number\":\"4432-8015-6152-0454\",\"credit_card_cvv\":672,\"credit_card_expiration_year\":2020,\"credit_card_expiration_month\":1}}","pid":1,"hostname":"paymentservice-799fb9bdd-9sqdt","name":"paymentservice-server","v":1}
Once sent to the DLP API, this is what is returned and logged:
{
  ...
  severity: "INFO"
  textPayload: "{"severity":"info","time":1555345379891,"message":"PaymentService#Charge invoked with request {\"amount\":{\"currency_code\":\"USD\",\"units\":\"41\",\"nanos\":180000000},\"credit_card\":{\"credit_card_number\":\"[CREDIT_CARD_NUMBER]\",\"credit_card_cvv\":672,\"credit_card_expiration_year\":2020,\"credit_card_expiration_month\":1}}","pid":1,"hostname":"paymentservice-799fb9bdd-9sqdt","name":"paymentservice-server","v":1}"
  timestamp: "2019-04-15T16:22:59.891425283Z"
}
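To make the before/after concrete, the sketch below approximates the effect of this transformation locally in Python. It is not the DLP API, and it is not how Cloud DLP detects card numbers internally. It assumes a simple digit pattern plus a Luhn checksum, and the helper names `luhn_valid` and `redact_cards` are hypothetical. Matches are replaced with the infoType name in brackets, as in the returned log entry above.

```python
import re

# Assumption: a candidate card number is 13-19 digits, optionally separated
# by single spaces or hyphens. The real DLP detector is more sophisticated.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_cards(text: str) -> str:
    """Replace Luhn-valid card numbers with the bracketed infoType name."""

    def _replace(match: "re.Match[str]") -> str:
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            return "[CREDIT_CARD_NUMBER]"
        return match.group()  # digit run fails the checksum: leave it alone

    return CARD_PATTERN.sub(_replace, text)


if __name__ == "__main__":
    sample = '"time":1555345379891,"credit_card_number":"4432-8015-6152-0454"'
    print(redact_cards(sample))
```

The checksum step matters here: the 13-digit `time` value in the sample log event matches the digit pattern but fails the Luhn check, so only the card number is replaced.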
This is the DeidentifyTemplate created as described in the readme:
{
  "deidentifyConfig": {
    "infoTypeTransformations": {
      "transformations": [
        {
          "infoTypes": [
            {
              "name": "CREDIT_CARD_NUMBER"
            }
          ],
          "primitiveTransformation": {
            "replaceWithInfoTypeConfig": {}
          }
        }
      ]
    }
  }
}
The API takes in a DeidentifyTemplate, including a list of infoTypes.
From the InfoType documentation: "...name is either a name of your choosing when creating a CustomInfoType, or one of the names listed at https://cloud.google.com/dlp/docs/infotypes-reference when specifying a built-in type". The InfoTypes reference includes a predefined infoType, CREDIT_CARD_NUMBER,
that matches credit card numbers globally.
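As a rough illustration of how these pieces fit into a request, the following Python sketch builds (but does not send) a body for the DLP `content:deidentify` REST method. The project ID is a placeholder and the helper name `build_deidentify_request` is hypothetical. The configuration is inlined here to keep the example self-contained; a stored template would instead be referenced by its resource name through the `deidentifyTemplateName` field.

```python
import json
from typing import Any, Dict, Tuple

DLP_ENDPOINT = "https://dlp.googleapis.com/v2"


def build_deidentify_request(project_id: str, log_line: str) -> Tuple[str, Dict[str, Any]]:
    """Return the URL and JSON body for a content:deidentify call.

    Nothing is sent over the network; authentication and the HTTP call
    itself are left to the caller.
    """
    url = f"{DLP_ENDPOINT}/projects/{project_id}/content:deidentify"
    body = {
        # The text to inspect and transform.
        "item": {"value": log_line},
        # Which infoTypes to look for.
        "inspectConfig": {"infoTypes": [{"name": "CREDIT_CARD_NUMBER"}]},
        # What to do with each finding: replace it with the infoType name.
        "deidentifyConfig": {
            "infoTypeTransformations": {
                "transformations": [
                    {
                        "infoTypes": [{"name": "CREDIT_CARD_NUMBER"}],
                        "primitiveTransformation": {
                            "replaceWithInfoTypeConfig": {}
                        },
                    }
                ]
            }
        },
    }
    return url, body


if __name__ == "__main__":
    # "my-project" is a placeholder, not a value from this project.
    url, body = build_deidentify_request("my-project", "card 4432-8015-6152-0454")
    print(url)
    print(json.dumps(body, indent=2))
```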
It's important to note that, as configured, the in-scope cluster's fluentd will
make a DLP API call for every log event emitted by paymentservice, not only
those containing credit card data. This can be seen in the fluentd
configuration file. With this configuration, there are two limiting
considerations to take into account. First, there are DLP API limits, published
at https://cloud.google.com/dlp/limits, which default to 600 requests/minute.
Second, the cost scales with the number of requests (see DLP Pricing). Both
should be taken into account before deploying this style of architecture at scale.
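One possible way to reduce the request volume is sketched below in Python, under stated assumptions. A cheap local pre-filter skips log lines with no long digit run, and a sliding-window counter tracks calls against the 600 requests/minute default. Neither is part of this project's fluentd configuration, and the names `may_contain_card` and `MinuteBudget` are hypothetical. The pre-filter also trades coverage for cost: it only helps when card numbers are the sole infoType of interest.

```python
import re
import time
from collections import deque
from typing import Callable, Deque

# Assumption: only lines with a run of 13-19 digits (optionally separated by
# spaces or hyphens) are worth sending to DLP for CREDIT_CARD_NUMBER.
MAYBE_CARD = re.compile(r"\d(?:[ -]?\d){12,18}")


def may_contain_card(log_line: str) -> bool:
    """Cheap local check used to decide whether a DLP call is needed."""
    return MAYBE_CARD.search(log_line) is not None


class MinuteBudget:
    """Sliding-window counter allowing at most `limit` calls per window."""

    def __init__(self, limit: int = 600, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window
        self._clock = clock
        self._calls: Deque[float] = deque()

    def try_acquire(self) -> bool:
        """Record a call and return True if it fits in the budget."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self._window:
            self._calls.popleft()
        if len(self._calls) < self._limit:
            self._calls.append(now)
            return True
        return False
```

When `try_acquire` returns False, the caller should buffer or retry the event rather than forward it unredacted, since forwarding would defeat the purpose of the filter.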