Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimize telemetry surface area #2844

Merged
merged 33 commits into from
Jul 9, 2024

Conversation

davidmrdavid
Copy link
Contributor

@davidmrdavid davidmrdavid commented Jun 4, 2024

Under certain conditions (user opting in to tracing raw inputs and outputs), our telemetry may capture sensitive information. For example, through logs of exceptions, inputs, output, etc. T

This PR aims to remove the sensitive information by implementing these guidelines:

(1) User-provided inputs and outputs are never part of telemetry as-is. This includes the "reason" label for operations like terminate or rewind, as well as the simpler cases of inputs and outputs to orchestrators and entities.

(2) When application-level exceptions are logged, we only log a sanitized version of them that includes the exception type, and the exception stack trace. The exception message is absent, as it may contain sensitive information.

And that's it. To this end, I've refactored many methods in the EndToEndTracer class, as well as their callers. In most cases, the refactoring is simply to avoid logging sensitive parameters via ETW, but still logging them to the user's Application Insights.

The more complicated refactorings are all around logging exceptions. In the easy case - I simply refactored the string parameter containing the exception string into an actual Exception type, that I can use to create a sanitized exception string containing only the exception type and stack trace. Other cases where the exception object wasn't available in the immediate caller are handled on a case-by-case basis.

Finally - some diffs in this PR are minor improvements like adding nullable analysis and removing build warnings.

resolves N/A

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation PR is ready to merge and referenced in pending_docs.md
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • I have added all required tests (Unit tests, E2E tests)
  • My changes do not require any extra work to be leveraged by OutOfProc SDKs
    • Otherwise: That work is being tracked here: #issue_or_pr_in_each_sdk
  • My changes do not change the version of the WebJobs.Extensions.DurableTask package
    • Otherwise: major or minor version updates are reflected in /src/Worker.Extensions.DurableTask/AssemblyInfo.cs
  • My changes do not add EventIds to our EventSource logs
    • Otherwise: Ensure the EventIds are within the supported range in our existing Windows infrastructure. You may validate this with a deployed app's telemetry. You may also extend the range by completing a PR such as this one.
  • My changes should be added to v3.x branch.
    • Otherwise: This change only applies to Durable Functions v2.x and will not be merged to branch v3.x.

@davidmrdavid davidmrdavid changed the title [WIP] Removing potentially sensitive logs from telemetry Remove potentially sensitive logs from telemetry Jun 4, 2024
@davidmrdavid davidmrdavid changed the title Remove potentially sensitive logs from telemetry Minimize telemetry surface area Jun 4, 2024
@davidmrdavid davidmrdavid marked this pull request as ready for review June 4, 2024 01:11
@@ -239,26 +238,6 @@ public void ExtensionWarningEvent(string hubName, string functionName, string in
}
}

public void ProcessingOutOfProcPayload(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this wasn't used

@davidmrdavid
Copy link
Contributor Author

Just realized a bunch of logging unit tests broke, which is good, but it means I need to update them. Still, I'd appreciate a quick pass on the proposed refactoring in the meantime. Thanks

@davidmrdavid
Copy link
Contributor Author

Tests are passing. I'd appreciate a review here, @cgillum. Thanks!

@davidmrdavid davidmrdavid requested a review from cgillum June 13, 2024 01:41
return sanitizedPayload;
}

private string SanitizeException(Exception? exception, out string iloggerExceptionString, bool isReplay = false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design of this API is a bit confusing. It's not clear what the difference is supposed to be between the output (return value) and the out variable. Rather than having a named out parameter and an unnamed return value, it would be better from a code understandability perspective to have two named out parameters and just return void. Having this kind of usage clarity is especially important in this case because making a mistake in interpreting the outputs of this method could cause us to accidentally leak sensitive information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - I have refactored the implementation to use two out variables.

public void FunctionFailed(
string hubName,
string functionName,
string instanceId,
string reason,
string sanitizedReason,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that sanitizedReason is going to ETW while reason is going to ILogger, but don't we send both of these payloads to Kusto, ETW to DurableFunctionsEvents and ILogger to FunctionsLogs? I know that's not the case for the DTFx tracing code, but I thought the WebJobs ILogger logs went to FunctionsLogs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Throughout this PR, we only sanitize the logs to DurableFunctionsEvents and not the ILogger-powered logs to FunctionsLogs. This is because the ILogger logs are also sent to the user's Application Insights instance, and I was worried about changing the logging behavior to a user-facing component.

So I'm choosing to delay our decision to sanitize the FunctionsLogs for now so that we can unblock sanitizing the DF Kusto table. Does that seem reasonable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks for clarifying. It would be good to confirm what the plan is for sanitizing FunctionsLogs just to make sure that the work done in this PR isn't redundant or need to be reversed. Happy to chat about this offline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss this offline. But in general I think the clean up done here is unlikely to need to be reversed. If anything, I think it's more likely that it will need to be expanded to include FunctionsLogs as well. I'm just opting to merge a minimal improvement for now.

@@ -95,7 +95,7 @@ public async Task CallOrchestratorAsync(DispatchMiddlewareContext dispatchContex
this.Options.HubName,
functionName.Name,
instance.InstanceId,
isReplaying ? "(replay)" : this.extension.GetIntputOutputTrace(startEvent.Input),
startEvent.Input,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't tell by looking at the diffs, but are we using GetInputOutputTrace at all anymore in the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the time this of writing this comment, there was a single leftover use, but that was an accident. Starting from this commit (43f63fc) you can see the method has been removed.

So yes - this PR intends to remove this method altogether. Instead of sanitizing the logs at each calling site of our EndToEndTraceHelper, I opted to centralizing the sanitization in the EndtoEndTraceHelper itself. I think that should help minimize the chance of sanitization errors.

@davidmrdavid davidmrdavid requested a review from cgillum June 19, 2024 23:24
Copy link
Collaborator

@bachuv bachuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@davidmrdavid davidmrdavid merged commit c88b62c into dev Jul 9, 2024
12 checks passed
@davidmrdavid davidmrdavid deleted the dajusto/remove-potentially-sensitive-logs branch July 9, 2024 18:42
bachuv pushed a commit that referenced this pull request Jul 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants