Add easy support for Datadog and possibly other observability solutions #1433

Open
2 tasks done
petarlishov opened this issue Aug 10, 2022 · 14 comments · Fixed by #2194 or #2183
Labels: feature-request, logger, metrics, tracer

Comments

@petarlishov

petarlishov commented Aug 10, 2022

Use case

I have played around with the Datadog and the AWS Powertools Lambda layers and as one that needs to integrate with Datadog, the Datadog Lambda layer is a good choice for getting that integration set up flawlessly.

But I also really enjoy some of the features that AWS Powertools have incorporated into their Lambda layer and as a developer I find it to be a very useful tool. Last time I checked (a while ago), the AWS Powertools layer was more lightweight as well.

Sadly, because both layers use similar packages (boto3 for example among other things), I believe they are not exactly compatible with each other. And either way, adding both would make our Lambdas' start times much worse as both layers are not exactly light either.

I have one collection of Lambdas using the Datadog layer and another collection using Powertools. I have thus noticed some of the differences that make Powertools tricky to integrate with Datadog, despite it being the more useful tool purely from a developer's perspective:

  • the log format is slightly different and it does not allow Datadog to easily query logs by things like request ID. It would be nice to be able to specify a Datadog log format that we can easily use within Powertools without having to create a custom one ourselves. I believe compatibility can be easily achieved here and all we need is a special logger format that mimics the one used by Datadog
  • Datadog does not ingest AWS CloudWatch metrics. I believe the easiest way to do a custom metric that gets interpreted by Datadog is to send logs in this format {"m": "Metric name", "v": "Metric value", "e": "Unix timestamp (seconds)", "t": "Array of tags"}. I know this is different from the embedded metrics format which AWS already provides - https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-cloudwatch-launches-embedded-metric-format but maybe we can have a way to support both based on some configuration setting? Sadly the metrics that Powertools provides at the moment are not easily ingestible into Datadog
  • I have not tested it sufficiently, but if there is some incompatibility with the way Datadog handles traces, maybe someone else that has that knowledge can keep in mind that some adaptations would be useful there too? From what I can understand, there should be no problems there as traces are ingested straight from XRay but I have sadly under-utilised that functionality
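To make the metric format from the second bullet concrete, here is a minimal sketch that serializes one metric into the `{"m", "v", "e", "t"}` shape and prints it as a log line. The function name, the metric name, and the `key:value` tag style are assumptions for illustration only; nothing here is a Powertools or Datadog API.

```python
import json
import time
from typing import List, Optional


def datadog_metric_line(
    name: str,
    value: float,
    tags: Optional[List[str]] = None,
    timestamp: Optional[int] = None,
) -> str:
    """Serialize one metric in the {"m", "v", "e", "t"} shape quoted above."""
    payload = {
        "m": name,  # metric name
        "v": value,  # metric value
        "e": int(time.time()) if timestamp is None else timestamp,  # Unix seconds
        "t": tags or [],  # array of tags
    }
    return json.dumps(payload)


# In Lambda, anything printed to stdout ends up in the function's logs.
# The metric name and tag below are hypothetical.
print(datadog_metric_line("orders.created", 1, tags=["env:dev"], timestamp=1660089600))
```

Passing the timestamp explicitly keeps the output reproducible in tests; in a handler you would normally let it default to the current time.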

Solution/User Experience

Provide a configuration or an option to define the format when setting up loggers, metrics and traces, which would allow for better integration with observability solutions other than AWS' CloudWatch and XRay

Alternative solutions

It may be possible to do all these already. Maybe a separate package that adds this compatibility can be created that works well with Powertools, as long as Powertools already has easy methods to manipulate its behaviour in the required ways to allow for the requested integration.

Acknowledgment

@petarlishov petarlishov added feature-request feature request triage Pending triage from maintainers labels Aug 10, 2022
@boring-cyborg

boring-cyborg bot commented Aug 10, 2022

Thanks for opening your first issue here! We'll come back to you as soon as we can.

@heitorlessa
Contributor

It is so rare to receive such a high quality feature request like this that I want us to take time to reply to you accordingly - please bear with us for an answer next week.

Until then, we're laying the ground work in E2E and Integ test framework to give us confidence to offer what you're asking -- either native support or expose the mechanisms we already have for customers to build them.

For anyone else reading this, please please add your +1 to the author to help us prioritise it.

Thank you for taking the time to share such rich detail.

@heitorlessa
Contributor

Hey @petarlishov, our new (internal) E2E framework took a lot longer than I expected to refactor, so I'm replying tomorrow morning to address your questions and a few asks as we started v2 in parallel.

@heitorlessa
Contributor

I'll break down my answer in categories to make it easier to parse later.

Making Lambda Powertools more lightweight

Sadly, because both layers use similar packages (boto3 for example among other things)
And either way, adding both would make our Lambdas' start times much worse as both layers are not exactly light either.

I bet you'll be excited to hear that we've started working on v2 (minor breaking changes) to cut down the final package size to ~464K (compressed) 🎉

In v2, we are making all dependencies optional (e.g., boto3, AWS X-Ray SDK, and fastjsonschema), bringing a >90% reduction in package size. Today, the Powertools feature set works with the older boto3 version that Lambda provides at runtime (~7-8 months old), and if a feature requires a newer boto3, our docs will suggest how to upgrade it accordingly.

@rubenfonseca is leading V2. We'll create a RFC to discuss trade-offs of relying on Lambda runtime's packages, and how we're thinking of our Lambda Layer v2.

NOTE: Our current thinking for the Lambda Layer is to likely remove boto3 while bundling all optional dependencies to ease distribution for consumers - botocore alone is ~67M today (uncompressed).

Modularization is our medium-term game

This is an intermediate stage towards modularization in V3. This would need a major structural change but allow customers to pick and choose what they need, going as far as ~16K package size if one wants to. That however needs a ton of research and testing to make sure it's stable and maintainable - we plan to draft a RFC next year once we're comfortable with v2 outcomes.

Despite being a major version, we want to prevent disruptions as much as possible to our customers. We're working on our first upgrade guide, and for v3 we even have the ambition to create a linter plugin to help you upgrade faster.

Long-term, this will give us the structure we need to add support to non-Lambda runtimes like Fargate, Glue jobs, etc. We have some customers using it that way, but we haven't put a "it's supported" stamp on it yet. We could even expose our private integ/e2e testing utilities as a package for customers ;)

Datadog Log format

It would be nice to be able to specify a Datadog log format that we can easily use within Powertools without having to create a custom one ourselves. I believe compatibility can be easily achieved here and all we need is a special logger format that mimics the one used by Datadog

I'm not sure if you've tried, but Logger supports Bringing Your Own Logging Formatter without forgoing Logger features and UX. We recommend that option for customers looking to only change the final format without having to maintain a different Logger implementation altogether.
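As a rough illustration of the "change only the final format" idea, the sketch below uses nothing but the standard library `logging` module. It is not the Powertools formatter API, and the output key names (`status`, `logger.name`, `request_id`) are assumptions chosen for the example rather than confirmed Datadog requirements.

```python
import io
import json
import logging


class RenamedKeysFormatter(logging.Formatter):
    """Render each record as one JSON object with caller-chosen key names."""

    def __init__(self, key_map=None):
        super().__init__()
        # Maps LogRecord attribute -> output key (illustrative names only)
        self.key_map = key_map or {"levelname": "status", "name": "logger.name"}

    def format(self, record: logging.LogRecord) -> str:
        entry = {"message": record.getMessage()}
        for attr, out_key in self.key_map.items():
            entry[out_key] = getattr(record, attr)
        # Fields passed via `extra=` are set as attributes on the record
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            entry["request_id"] = request_id
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    """Attach the custom formatter to a dedicated, non-propagating logger."""
    logger = logging.getLogger("byo-formatter-demo")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(RenamedKeysFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
build_logger(buffer).info("order created", extra={"request_id": "abc-123"})
print(buffer.getvalue().strip())
```

The calling code keeps using the ordinary logger methods; only the formatter decides which keys appear in the emitted JSON.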

I'm not fully aware of what Datadog expects from structured logs or why it has difficulty querying a JSON field. That said, we're more than happy to investigate any non-breaking change we could make on our side if this spans more than Datadog - feature request please!

In V3, we'll be able to create a providers package where we could use community help to get these quirks addressed, without forcing everyone to receive a copy of a provider (e.g., Datadog) they don't use.

Datadog Metrics format

I believe the easiest way to do a custom metric that gets interpreted by Datadog is to send logs in this format {"m": "Metric name", "v": "Metric value", "e": "Unix timestamp (seconds)", "t": "Array of tags"}

We could make this more extensible quite easily - wanna create a RFC?

RFC will help us agree on a contract for exposing a Metric Provider (a simple sink pattern), so that customers can use the same UX but have different outputs and validation mechanisms. As of now, we didn't invest much and this part of the code could be easily rearranged until we have a proper Provider - https://github.com/awslabs/aws-lambda-powertools-python/blob/develop/aws_lambda_powertools/metrics/base.py#L139
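One way to picture the "simple sink pattern" is a small base class that collects metrics and delegates serialization to each provider. Everything below is a hypothetical sketch for discussion: `MetricProvider`, `DatadogLogProvider`, and `EmfProvider` are invented names, not the contract that exists in Powertools, and the EMF blob is simplified to one metric per line with a fixed `Count` unit.

```python
import json
import time
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class MetricProvider(ABC):
    """Hypothetical sink contract: collect metrics, serialize them on flush."""

    def __init__(self) -> None:
        self._metrics: List[Dict] = []

    def add_metric(self, name: str, value: float, **tags: str) -> None:
        self._metrics.append({"name": name, "value": value, "tags": tags})

    @abstractmethod
    def serialize(self, metric: Dict, timestamp: int) -> str:
        """Turn one collected metric into one output line."""

    def flush(self, timestamp: Optional[int] = None) -> List[str]:
        now = int(time.time()) if timestamp is None else timestamp
        lines = [self.serialize(metric, now) for metric in self._metrics]
        self._metrics.clear()
        return lines


class DatadogLogProvider(MetricProvider):
    """Emits the {"m", "v", "e", "t"} log format quoted in this issue."""

    def serialize(self, metric: Dict, timestamp: int) -> str:
        tags = [f"{key}:{value}" for key, value in metric["tags"].items()]
        return json.dumps({"m": metric["name"], "v": metric["value"], "e": timestamp, "t": tags})


class EmfProvider(MetricProvider):
    """Emits a simplified CloudWatch Embedded Metric Format blob per metric."""

    def __init__(self, namespace: str) -> None:
        super().__init__()
        self.namespace = namespace

    def serialize(self, metric: Dict, timestamp: int) -> str:
        directive = {
            "Namespace": self.namespace,
            "Dimensions": [list(metric["tags"])],
            "Metrics": [{"Name": metric["name"], "Unit": "Count"}],  # unit fixed for brevity
        }
        # EMF timestamps are in milliseconds since the epoch
        blob = {"_aws": {"Timestamp": timestamp * 1000, "CloudWatchMetrics": [directive]}}
        blob.update(metric["tags"])
        blob[metric["name"]] = metric["value"]
        return json.dumps(blob)


# Same calling code, different output depending on the provider
for provider in (DatadogLogProvider(), EmfProvider(namespace="DemoApp")):
    provider.add_metric("orders_created", 1, env="dev")
    print(provider.flush(timestamp=1660089600))
```

The point of the split is that validation and output format live in the provider, while `add_metric` and `flush` stay identical for the caller.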

Datadog Trace compatibility

I have not tested it sufficiently, but if there is some incompatibility with the way Datadog handles traces, maybe someone else that has that knowledge can keep in mind that some adaptations would be useful there too?

Tracing has an undocumented BaseProvider for that purpose, but we haven't been able to put more thought into it. Now that we're making the X-Ray SDK optional in v2, this becomes a more interesting conversation to have; otherwise, we'd be forcing customers to install the X-Ray SDK even when they use a Datadog provider.

The hardest part in Tracing is patching modules and nomenclatures (e.g., segment/span, patching only X but not Y lib) --- wanna create a RFC for what minimum feature set the BaseProvider should support?

I initially wanted to have a drop-in replacement Tracing Provider, but then digging into 3-4 tracing providers' lib I saw how much custom logic each provider does and I became less sure of it - a RFC can help us get there ;)

We also looked into Open Telemetry but the cold start was too significant, and it was a moving target in terms of changes too. I think exposing our BaseProvider is a good first step.
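To help frame what a minimum feature set could look like, here is a deliberately small, hypothetical tracing contract with an in-memory implementation. The names `TracingProvider`, `in_span`, and `put_annotation` are assumptions made for this sketch; it does not reproduce the existing BaseProvider and leaves out module patching, the part described above as hardest.

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


class TracingProvider(ABC):
    """Hypothetical minimum contract: open a span and annotate the active one."""

    @abstractmethod
    def in_span(self, name: str):
        """Return a context manager that opens and closes a span."""

    @abstractmethod
    def put_annotation(self, key: str, value: Any) -> None:
        """Attach a key/value pair to the currently active span."""


class InMemoryProvider(TracingProvider):
    """Records spans in a list; a vendor provider would call its SDK instead."""

    def __init__(self) -> None:
        self.finished: List[Dict[str, Any]] = []
        self._stack: List[Dict[str, Any]] = []

    @contextmanager
    def in_span(self, name: str) -> Iterator[Dict[str, Any]]:
        parent = self._stack[-1]["name"] if self._stack else None
        span = {"name": name, "parent": parent, "annotations": {}}
        self._stack.append(span)
        try:
            yield span
        finally:
            # Spans close innermost-first, even if the body raised
            self._stack.pop()
            self.finished.append(span)

    def put_annotation(self, key: str, value: Any) -> None:
        if not self._stack:
            raise RuntimeError("no active span")
        self._stack[-1]["annotations"][key] = value


tracer = InMemoryProvider()
with tracer.in_span("handler"):
    tracer.put_annotation("order_id", "42")
    with tracer.in_span("db_call"):
        pass
print([(span["name"], span["parent"]) for span in tracer.finished])
```

Using a neutral word like "span" in the contract sidesteps the segment/span naming difference, at the cost of each provider translating it to its own vocabulary.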

Overall

We're heading in that direction, but we'd love the community's help in defining a good contract for Providers (Tracing already has one).

Right now, our main focus is on operational excellence (E2E test) to ensure V2 can be smooth sailing, and pave the road for our future modularization story. We'll continue to respond to feature requests and greatly appreciate any help we can get - we can't wait to create new utilities and new extensibility mechanisms, but first we need confidence large changes can be made ;)

Once again, thank you for creating such a comprehensive issue. Issues like this make me personally happy: they show that we have a lot to do, but also that we've got a community who cares ;)

Hopefully that answers your questions and remarks, please let us know otherwise!

PS: Join us on Discord, we'd love to have you if you aren't there already.

cc @rubenfonseca @leandrodamascena @mploski @am29d @saragerion @sliedig

@heitorlessa heitorlessa added need-customer-feedback Requires more customers feedback before making or revisiting a decision help wanted Could use a second pair of eyes/hands and removed triage Pending triage from maintainers labels Sep 1, 2022
@heitorlessa heitorlessa pinned this issue Sep 9, 2022
@sthulb sthulb unpinned this issue Jan 5, 2023
@sthulb sthulb pinned this issue Jan 5, 2023
@heitorlessa
Contributor

Update: We estimate one RFC per feature (Tracer, Metrics, Logger) starting in mid-April/early May. This will help focus the discussion on a standard interface to help customers bring their own provider.

Since we launched V2, the only difference for Tracer is that we'd stick with AWS X-Ray SDK provider as the default, while providing a built-in provider for OpenTelemetry - other 3rd party providers (e.g., Datadog, Lumigo, NewRelic, etc.) would be owned by them where we'd be happy to collaborate/coordinate.

Thank you all!!

@leandrodamascena leandrodamascena added logger metrics tracer Tracer utility and removed help wanted Could use a second pair of eyes/hands need-customer-feedback Requires more customers feedback before making or revisiting a decision labels Apr 11, 2023
@leandrodamascena
Contributor

I'm removing the help wanted and need-customer-feedback labels because we already have work in progress, so anyone can comment on these RFCs.

@heitorlessa
Contributor

UPDATE: We're adding support in Logger for Datadog as a start. We're working on a POC for Metrics, and adding last refinements for Tracer Providers.

For Logger, Datadog was the only provider that required a custom timestamp, so we've added a Formatter and documented our recommendation to use Lambda Extensions to avoid impacting performance.
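The comment does not say which timestamp format was needed. Assuming it is an RFC 3339 style timestamp with an explicit UTC offset, a standard-library sketch of such a formatter could look like the following. The class name and output keys are hypothetical, and this is not the Formatter that was added to Powertools.

```python
import json
import logging
from datetime import datetime, timezone


class Rfc3339JsonFormatter(logging.Formatter):
    """JSON formatter whose timestamp carries an explicit UTC offset."""

    def formatTime(self, record: logging.LogRecord, datefmt=None) -> str:
        # record.created is seconds since the epoch as a float
        moment = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return moment.isoformat(timespec="milliseconds")

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "timestamp": self.formatTime(record),
                "level": record.levelname,
                "message": record.getMessage(),
            }
        )


# Build a record by hand and pin its creation time so the output is reproducible
record = logging.LogRecord("demo", logging.INFO, "demo.py", 0, "hello", None, None)
record.created = 0.0
print(Rfc3339JsonFormatter().format(record))
```

Overriding `formatTime` keeps the change local to the timestamp, so the rest of the formatting logic does not need to know about it.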

The only reason we're not adding OTel Log output now is that it's not Final yet - please feel free to open a feature request when that happens (whoever is reading and needs that)


@heitorlessa
Contributor

Quick update -- @roger-zhangg is working on the last feature: Observability Provider for Tracer. Once that's done, we'll close this issue, and start investigating an alternative solution for OTel as cold starts haven't significantly improved.

@leandrodamascena leandrodamascena unpinned this issue Sep 25, 2024