
[azuremonitorreceiver] Add the ability to pull Resource Metrics using AZ Query Batch API #29593

Conversation

cparkins (Contributor)

Add the ability to pull Resource Metrics using the AZ Query Batch API. This is currently a proof of concept: it does not allow Resources to exist in different regions, and it does not filter based on dimensions the way the previous implementation did (by Resource).

Description:

This is an enhancement that tries to avoid the limitations of querying resources directly by using the AZ Query Batch API that was recently released in the Go SDK.

Link to tracking Issue:

Testing:
Tested locally against a Resource Group in a single region. Some issues still need to be investigated.

Documentation:
This will be updated once the Draft is discussed.

@github-actions github-actions bot added the cmd/configschema, cmd/otelcontribcol, and receiver/azuremonitor labels Nov 30, 2023
```diff
@@ -26,6 +26,7 @@ var (
 	errMissingClientSecret = errors.New(`"ClientSecret" is not specified in config`)
 	errMissingFedTokenFile = errors.New(`"FederatedTokenFile" is not specified in config`)
 	errInvalidCloud        = errors.New(`"Cloud" is invalid`)
+	errInvalidRegion       = errors.New(`"Region" is not specifiec in config`)
```
cparkins (Contributor Author)

Oops! typo.

github-actions bot

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 15, 2023
@ahurtaud commented Dec 21, 2023

I have pulled the code and compiled a version, tested it on one subscription and it's a really good start!

As in the first comment, we are seeing a few errors, but I am most concerned about this one:
"error": "POST https://northeurope.metrics.monitor.azure.com/subscriptions/56ffb7d8-dd02-4217-9f63-11093b6176ba/metrics:getBatch\n--------------------------------------------------------------------------------\nRESPONSE 400: 400 Bad Request\nERROR CODE: BadRequest\n--------------------------------------------------------------------------------\n{\n \"error\": {\n \"additionalInfo\": [\n {\n \"type\": \"string\",\n \"info\": \"TraceId=ca886d1c7e36172863695ed22c30acf2\"\n },\n {\n \"type\": \"string\",\n \"info\": \"ExceptionType=Microsoft.Online.Metrics.MetricsMP.Utilities.RPRequestFormatException\"\n }\n ],\n \"code\": \"BadRequest\",\n \"message\": \"Number of requests in the batch 301 exceeded max allowed: 50\"\n }\n}

I think this corresponds to this Microsoft limitation:

> Up to 50 unique resource IDs may be specified in each call. Each resource must belong to the same subscription, region, and be of the same resource type. (source)

I would also like to start a discussion (maybe for a second PR / feature request) (ping @altuner @codeboten):
At Amadeus, we have tons of subscriptions (thousands); would it be possible to have auto-discovery of them as well?

  • there is a GET subscriptions endpoint available at https://management.azure.com/subscriptions?api-version=2020-01-01
  • and the resource group list also gives the location (needed for the new getBatch API endpoint), e.g.: https://management.azure.com/subscriptions/14b86a40-8d8f-4e69-abaf-42cbb0b8a331/resourcegroups?api-version=2019-10-01 (see the sketch after this list)
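
For illustration, a minimal sketch of what that auto-discovery could look like using the Go SDK's armsubscriptions and armresources clients (the equivalents of the two REST endpoints above). This is not part of the receiver today, just an assumption about how it could be wired:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resources/armresources"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resources/armsubscriptions"
)

func main() {
	ctx := context.Background()
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}

	// List every subscription visible to the credential (GET /subscriptions).
	subsClient, err := armsubscriptions.NewClient(cred, nil)
	if err != nil {
		log.Fatal(err)
	}
	subsPager := subsClient.NewListPager(nil)
	for subsPager.More() {
		page, err := subsPager.NextPage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for _, sub := range page.Value {
			// For each subscription, list resource groups; each group's
			// Location could drive the regional getBatch endpoint.
			rgClient, err := armresources.NewResourceGroupsClient(*sub.SubscriptionID, cred, nil)
			if err != nil {
				log.Fatal(err)
			}
			rgPager := rgClient.NewListPager(nil)
			for rgPager.More() {
				rgPage, err := rgPager.NextPage(ctx)
				if err != nil {
					log.Fatal(err)
				}
				for _, rg := range rgPage.Value {
					fmt.Printf("subscription=%s group=%s location=%s\n",
						*sub.SubscriptionID, *rg.Name, *rg.Location)
				}
			}
		}
	}
}
```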

WDYT? (Please tell me if you want me to raise a new issue for this feature request; this getBatch will be a requirement first anyway.)
Thank you,
Alban

@cparkins (Contributor Author)

@ahurtaud - I ran into a few errors, but not that one specifically. I didn't realize there was a limit, to be honest. The documentation was kind of hard to understand. I had some concerns about the number of Dimensions I was pulling, so I did some testing there; it would be really nice to be able to determine the number of dimensions a Metric can have instead of having to hard-code the values. I think it would be pretty easy to break the requests up into batches of 50 (a sketch follows below); I guess I didn't read all of the documentation :-D
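
To make that concrete, a minimal sketch of the batching, assuming a hypothetical chunkResourceIDs helper (not in the PR) and the 50-ID limit quoted above:

```go
package main

import "fmt"

// chunkResourceIDs splits resource IDs into batches no larger than
// maxBatchSize, matching the documented getBatch limit of 50 unique
// resource IDs per call. Hypothetical helper, not part of the PR.
func chunkResourceIDs(ids []string, maxBatchSize int) [][]string {
	var batches [][]string
	for len(ids) > maxBatchSize {
		batches = append(batches, ids[:maxBatchSize])
		ids = ids[maxBatchSize:]
	}
	if len(ids) > 0 {
		batches = append(batches, ids)
	}
	return batches
}

func main() {
	ids := make([]string, 301)
	for i := range ids {
		ids[i] = fmt.Sprintf("/subscriptions/.../resource-%d", i)
	}
	// 301 IDs -> 7 batches (6 of 50 and 1 of 1) instead of one oversized call.
	fmt.Println(len(chunkResourceIDs(ids, 50)))
}
```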

@github-actions github-actions bot removed the Stale label Dec 22, 2023
@nslaughter (Contributor) commented Dec 28, 2023

Want to ping those who worked on this previously, as I also think the component would benefit from switching to the azquery interfaces in the Azure SDK; that provides a lot of leverage for the purposes of this code. However, when experimenting with the changes I found it difficult to test all paths, and similarly difficult to make the result of changes clear to reviewers, so I thought we could benefit from all of the following.

  1. Separate the fetching responsibility, so the mechanism for each type of fetch (i.e. resources, metric definitions, metric values) can be specified in a simplified strategy (see the sketch after this list).
  2. Separate the caching/providing responsibility, so it is simple for reviewers to verify when state (e.g. azureScraper.resources) is updated.
  3. Potentially change the update trigger for each state, so a pull-through flow model isn't our only option. This would provide some optionality for controlling read consistency if that becomes a concern later.
  4. With more paths tested and verified in a straightforward way, start introducing changes to use higher-level APIs like azquery.
  5. It looks like this would simplify scraping logs with azquery as well, though I haven't investigated it extensively.
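
For point 1, a minimal sketch of what that separation might look like; the interface names are hypothetical, not code that exists in the receiver:

```go
package azuremonitorreceiver // hypothetical placement

import (
	"context"

	"github.com/Azure/azure-sdk-for-go/sdk/monitor/azquery"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resources/armresources"
)

// Each fetch type gets its own strategy, so tests can swap in fakes
// without standing up the whole scraper.
type resourceFetcher interface {
	FetchResources(ctx context.Context) ([]*armresources.GenericResourceExpanded, error)
}

type metricDefinitionFetcher interface {
	FetchDefinitions(ctx context.Context, resourceID string) ([]*azquery.MetricDefinition, error)
}

type metricValueFetcher interface {
	FetchValues(ctx context.Context, resourceID string, metricNames []string) (azquery.MetricsClientQueryResourceResponse, error)
}
```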

In the long run these changes could help with scalability, though it's unknown when/if that would be helpful. My purpose in proposing the changes now is to simplify testing on more paths.

Before I opened an issue for changes around this I wanted to get input from those who have looked at this receiver and might have informed opinions to share.

Please share any thoughts you have on this: @cparkins @ahurtaud

@nslaughter (Contributor)

> I didn't realize there was a limit, to be honest. The documentation was kind of hard to understand.

I agree. Maybe it's just easy to confuse me when there are multiple numbers involved. Most things in the SDK are straightforward once you get familiar with their patterns and the Azure object model, but I believe some calls are limited to 20, I recall reading that one was limited to 10, and @ahurtaud has errors that say the limit is 50 on something else. Currently the default config in factory.go says the maximum number of metrics in a call is 20, though it's configurable. The scraper uses that field in this ad hoc batching mechanism; that's what the start and end variables are for, but I find it difficult to understand, much less verify.

> I had some concerns about the number of Dimensions I was pulling, so I did some testing there; it would be really nice to be able to determine the number of dimensions a Metric can have instead of having to hard-code the values.

The current mechanism that deals with dimensions is the metricsCompositeKey, where the dimensions field is a string built in this stanza from the metric definitions we get from Azure. Dimensions are sorted before being joined into a comma-delimited string, so that metric values sharing the same key can be requested in batches (sketched below). (Tell me if the rationale doesn't make sense.)
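
Here is the shape of that idea, assuming illustrative names rather than the receiver's exact code:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// metricsCompositeKey groups metrics that can be requested together:
// dimensions are sorted, then joined into one comma-delimited string,
// so two metrics with the same dimension set map to the same key.
type metricsCompositeKey struct {
	timeGrain  string
	dimensions string
}

func newCompositeKey(timeGrain string, dims []string) metricsCompositeKey {
	sorted := append([]string(nil), dims...)
	sort.Strings(sorted)
	return metricsCompositeKey{timeGrain: timeGrain, dimensions: strings.Join(sorted, ",")}
}

func main() {
	a := newCompositeKey("PT1M", []string{"ApiName", "GeoType"})
	b := newCompositeKey("PT1M", []string{"GeoType", "ApiName"})
	fmt.Println(a == b) // true: dimension order doesn't split the batch
}
```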

> I think it would be pretty easy to break the requests up into batches of 50.

Configurable batch sizes would definitely help. We could depend on a batcher package like this one, for example, make/copy our own like it, or use methods that work almost as well if controlled in the context. The nice thing about the fancy one is that we don't have to manage keying on the metrics composite key: you specify how to aggregate batches via options, and calls can fire as soon as you have a full batch or a configured batch window expires.
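
For example, a minimal sketch (not the linked package) of a batcher that flushes on either a full batch or an expired window:

```go
package main

import (
	"fmt"
	"time"
)

// runBatcher collects items and calls flush when either the batch is
// full or the timeout elapses, whichever comes first.
func runBatcher[T any](in <-chan T, size int, timeout time.Duration, flush func([]T)) {
	var batch []T
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case item, ok := <-in:
			if !ok {
				if len(batch) > 0 {
					flush(batch) // drain what's left on shutdown
				}
				return
			}
			batch = append(batch, item)
			if len(batch) >= size {
				flush(batch)
				batch = nil
				if !timer.Stop() {
					select { // drain a pending fire so Reset starts clean
					case <-timer.C:
					default:
					}
				}
				timer.Reset(timeout)
			}
		case <-timer.C:
			if len(batch) > 0 {
				flush(batch)
				batch = nil
			}
			timer.Reset(timeout)
		}
	}
}

func main() {
	in := make(chan int)
	go func() {
		for i := 0; i < 120; i++ {
			in <- i
		}
		close(in)
	}()
	runBatcher(in, 50, 100*time.Millisecond, func(b []int) {
		fmt.Println("flushed", len(b))
	})
}
```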

Maybe this provides some useful information on the code's current state. Hopefully it isn't a complete waste of time once the PoC version is considered. I plan to have a closer look at that code in a couple of hours.

@nslaughter (Contributor)

I like where the PR is headed. I'm planning to open an issue along the lines of "migrate to the azquery package in the SDK", and then we can play with using the QueryResource path or the Batch path. I plan to cite this PR as some of the PoC work in the issue. Happy to see the transition to azquery completed in this PR or opened as something different. I think both of these paths could ease the burden as far as the layers of calls and cache go.

@cparkins (Contributor Author) commented Jan 2, 2024

> a batcher package like this for example

@nslaughter I like your thoughts on how to improve the receiver in general. While writing this code I thought some time should be taken to consider different ways to pull the various bits of data used when scraping. Unfortunately I didn't take much time to look at other ways we could pull the data that might be more efficient. I really like the idea of using the batcher, though. And I think that eventually this code will run into issues with getting consistent metrics because of the way it's written (multiple timers triggering at multiple intervals). I'm open to ideas based on this Draft PR and look forward to seeing what you think once you get more in depth.

@nslaughter (Contributor)

> I had some concerns about the number of Dimensions I was pulling, so I did some testing there; it would be really nice to be able to determine the number of dimensions a Metric can have instead of having to hard-code the values.

The docs say that custom metrics can have up to 10 dimensions. The metric definitions list the preset dimensions for predefined ones. Does this help, or were you getting at something else?

@cparkins (Contributor Author)

> I had some concerns about the number of Dimensions I was pulling, so I did some testing there; it would be really nice to be able to determine the number of dimensions a Metric can have instead of having to hard-code the values.

> The docs say that custom metrics can have up to 10 dimensions. The metric definitions list the preset dimensions for predefined ones. Does this help, or were you getting at something else?

That does help, actually. I was trying to determine the number of standard dimensions for a Storage Account and got to 30... I probably need to look at the REST responses a little more closely to solve this one properly.

github-actions bot

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 27, 2024
…calculating the nearest minute, then adding a check to ensure the last update was not within the time grain. In addition, add more logging around errors received from the Batch API by casting to the Azure Response Error type and logging the full error.
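
For reference, a minimal sketch of what this commit describes, with hypothetical function names and zap assumed for logging:

```go
package main

import (
	"errors"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"go.uber.org/zap"
)

// shouldScrape aligns "now" to the minute boundary and skips the fetch
// if the last update still falls inside the current time grain.
func shouldScrape(lastUpdate time.Time, timeGrain time.Duration, now time.Time) bool {
	nearestMinute := now.Truncate(time.Minute)
	return nearestMinute.Sub(lastUpdate) >= timeGrain
}

// logBatchError unwraps an *azcore.ResponseError so the Azure error code
// and status are logged rather than a truncated message.
func logBatchError(logger *zap.Logger, err error) {
	var respErr *azcore.ResponseError
	if errors.As(err, &respErr) {
		logger.Error("batch API request failed",
			zap.String("errorCode", respErr.ErrorCode),
			zap.Int("statusCode", respErr.StatusCode))
		return
	}
	logger.Error("batch API request failed", zap.Error(err))
}

func main() {
	logger, _ := zap.NewDevelopment()
	now := time.Now()
	if shouldScrape(now.Add(-2*time.Minute), time.Minute, now) {
		logger.Info("time grain elapsed; scraping")
	}
}
```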
github-actions bot

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Feb 27, 2024
github-actions bot

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Mar 12, 2024
@tesharp commented Nov 20, 2024

@cparkins Any reason this was not continued? With the current throttling limits it can be a hassle to get all the wanted metrics from Azure.

@celian-garcia commented Nov 20, 2024

Hello @tesharp. I have taken over this topic on the Amadeus side. The plan is to split the work into several PRs for readability.
