
[processor]: aggregate_on_attributes function in transform processor not working as expected in conjunction with keep_matching_keys #36517

Open
Shindek77 opened this issue Nov 25, 2024 · 7 comments
Labels
processor/transform Transform processor question Further information is requested

Comments

Shindek77 commented Nov 25, 2024

Component(s)

processor/transformprocessor

Describe the issue you're reporting

Hello,

Our goal: we want to reduce the number of metric labels that are not required in our time series. As output, the number of time series should shrink, and each new, reduced time series should carry a value aggregated from the original time series that existed before the labels were dropped.

Example:
Input Metrics Data

http_requests{region="us-east", service_name="order-service", method="GET", status="200"} 10
http_requests{region="us-east", service_name="order-service", method="GET", status="500"} 5
http_requests{region="us-east", service_name="billing-service", method="GET", status="200"} 8
http_requests{region="us-west", service_name="order-service", method="POST", status="200"} 7
http_requests{region="us-west", service_name="order-service", method="POST", status="500"} 3

Goal:

Resource attributes: region, service_name
Datapoint attributes: method, status

Keep:
region (resource attribute)
method (datapoint attribute)

Drop:
service_name (resource attribute)
status (datapoint attribute)

Final Results:

http_requests{region="us-east", method="GET"} 23
http_requests{region="us-west", method="POST"} 10
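The intended reduction can be sketched in plain Python. This only illustrates the arithmetic (group by the kept labels, sum the values); the helper name `reduce_series` is illustrative, not part of the collector:

```python
# Sketch of the intended label reduction: drop service_name and status,
# then sum data points that share the remaining labels (region, method).
# Series are modeled as (labels_dict, value) pairs.

def reduce_series(series, keep_labels):
    """Group series by the kept labels and sum their values."""
    aggregated = {}
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep_labels))
        aggregated[key] = aggregated.get(key, 0) + value
    return aggregated

series = [
    ({"region": "us-east", "service_name": "order-service",   "method": "GET",  "status": "200"}, 10),
    ({"region": "us-east", "service_name": "order-service",   "method": "GET",  "status": "500"}, 5),
    ({"region": "us-east", "service_name": "billing-service", "method": "GET",  "status": "200"}, 8),
    ({"region": "us-west", "service_name": "order-service",   "method": "POST", "status": "200"}, 7),
    ({"region": "us-west", "service_name": "order-service",   "method": "POST", "status": "500"}, 3),
]

result = reduce_series(series, keep_labels={"region", "method"})
# {region="us-east", method="GET"}  -> 23
# {region="us-west", method="POST"} -> 10
```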

Solution: after some analysis we learned that we can use the transform processor in conjunction with the aggregate_on_attributes function.
Per the docs, aggregate_on_attributes aggregates all datapoints in the metric based on the supplied attributes, and removes all datapoint attributes except the ones specified in the attributes parameter.

However, after testing we found that it works only on datapoint attributes, not on the resource attributes present in our metrics.

The docs also say that aggregate_on_attributes can be used in conjunction with keep_matching_keys or delete_matching_keys. So we tried that: keep_matching_keys to keep only the required resource attributes and drop the others, and aggregate_on_attributes with the list of datapoint attributes we want to keep, which should also perform the aggregation.

Configuration for the same:

data:
  relay: |
    exporters:
      debug:
        verbosity: detailed
      otlphttp/test-vm:
        compression: gzip
        encoding: proto
        endpoint: http://victoria-metrics-cluster-vminsert.metrics-ns.svc.cluster.local:8480/insert/2/opentelemetry
        timeout: 30s
        tls:
          insecure: true
    processors:
      batch:
        timeout: 10s
      groupbyattrs:
      transform/TruncateTime:
        metric_statements:
          - context: datapoint
            statements:
              - set(time, TruncateTime(time, Duration("10s")))
      transform:
        metric_statements:
          - context: metric
            statements:
              - keep_matching_keys(resource.attributes, "^(region).*")
              - aggregate_on_attributes("sum", ["method"])
    service:
      pipelines:
        metrics/-test-label-reduction-transform:
          exporters:
            - otlphttp/test-vm
          processors:
            - batch
            - groupbyattrs
            - transform/TruncateTime
            - transform
          receivers:
            - otlp

But when keep_matching_keys(resource.attributes, "^(region).*") runs, it keeps only region from the resource attributes and removes service_name. The intermediate result comes out as:

http_requests{region="us-east", method="GET", status="200"} (sometimes 10, sometimes 8)
http_requests{region="us-east", method="GET", status="500"} 5
http_requests{region="us-west", method="POST", status="200"} (sometimes 7, sometimes 3)

Then aggregate_on_attributes("sum", ["method"]) runs, which gives the final result:

http_requests{region="us-east", method="GET"} (sometimes 10, sometimes 8) + 5
http_requests{region="us-west", method="POST"} (sometimes 7, sometimes 3)

But per the docs, if both worked together it should give:

http_requests{region="us-east", method="GET"} 23
http_requests{region="us-west", method="POST"} 10

So please help with how we can get the results we want. We do get the expected output if we use only aggregate_on_attributes, but it works only on datapoint attributes, and we also have to run only one replica of the OpenTelemetry collector.

So how can we use aggregate_on_attributes with many OTel collector replicas? And why is it behaving like this?

@Shindek77 Shindek77 added the needs triage New item requiring triage label Nov 25, 2024
@Shindek77 Shindek77 changed the title aggregate_on_attributes function in transform processor not working as expected in conjunction with keep_matching_keys [processor]: aggregate_on_attributes function in transform processor not working as expected in conjunction with keep_matching_keys Nov 25, 2024
@bacherfl
Contributor

Hi @Shindek77 !

I just looked into this - if I understand correctly, the desired behavior might work with the following config, which first removes the service_name attribute and moves the region resource attribute to the datapoint attributes. After that, the region attribute can be used to regroup the data points into one resource per region. Finally, the aggregate_on_attributes function can be used to group the data points by the method attribute:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
    timeout: 10s
  groupbyattrs:
    keys:
      - region
  transform/RemoveServiceName:
    metric_statements:
      - context: resource
        statements:
          - keep_matching_keys(attributes, "^(region).*")
  transform/MoveRegionToDataPoint:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["region"], resource.attributes["region"])
  transform/TruncateTime:
    metric_statements:
      - context: datapoint
        statements:
          - set(time, TruncateTime(time, Duration("10s")))
  transform:
    metric_statements:
      - context: metric
        statements:
          - aggregate_on_attributes("sum", ["method"])


exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors:
        - batch
        - transform/RemoveServiceName
        - transform/MoveRegionToDataPoint
        - groupbyattrs
        - transform/TruncateTime
        - transform
      exporters: [debug]

Hope this helps - if not, please let me know and I will continue to look into this

@bacherfl bacherfl added processor/transform Transform processor question Further information is requested labels Nov 26, 2024

Pinging code owners for processor/transform: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.

@bacherfl bacherfl removed the needs triage New item requiring triage label Nov 26, 2024

Shindek77 commented Nov 26, 2024

> (quoting @bacherfl's suggested configuration above)

Hello @bacherfl,
Thanks for the quick reply. I tried the same, but the issue is that when it runs, the first processor, transform/RemoveServiceName, removes the service_name resource attribute, which makes series 1 and 3 (and series 4 and 5) identical, so we get the following intermediate time series:

http_requests{region="us-east", method="GET", status="200"} (sometimes 10, sometimes 8)
http_requests{region="us-east", method="GET", status="500"} 5
http_requests{region="us-west", method="POST", status="200"} (sometimes 7, sometimes 3)

After that, the remaining processors run, meaning aggregate_on_attributes ends up operating on the three time series above, so the final answer comes out as:

http_requests{region="us-east", method="GET"} (sometimes 10, sometimes 8) + 5
http_requests{region="us-west", method="POST"} (sometimes 7, sometimes 3)

Just an idea:
As you showed, we can move the region resource attribute to the datapoint attributes using the set function. Could we do this with a regex so that we first move all resource attributes to the datapoint attributes, regardless of their names, and then use aggregate_on_attributes on the datapoint attributes we need? It would then perform the aggregation over the given list and remove the other attributes.

For the example above: first move all resource attributes (both region and service_name) to the datapoint attributes, then use aggregate_on_attributes to group the data points by the method and region datapoint attributes. The final result would then be as expected:

http_requests{region="us-east", method="GET"} 23
http_requests{region="us-west", method="POST"} 10

@bacherfl
Contributor

Thanks for the feedback @Shindek77 - one possibility to move all resource attributes to the datapoints, making them available to aggregate_on_attributes, would be the config below:

  transform/ResourceAttributesToDataPoint:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["resource"], resource.attributes)
          - flatten(attributes)
  transform/TruncateTime:
    metric_statements:
      - context: datapoint
        statements:
          - set(time, TruncateTime(time, Duration("10s")))
  transform:
    metric_statements:
      - context: metric
        statements:
          - aggregate_on_attributes("sum", ["method", "resource.region"])

Note: the flatten function is used because the aggregate_on_attributes function does not seem to support access to nested properties.
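To illustrate what flatten does here (a Python sketch of the behavior described above, not the OTTL implementation): after the resource attributes map is copied into attributes["resource"], flattening turns the nested map into dotted top-level keys such as "resource.region", which aggregate_on_attributes can then reference directly.

```python
# Sketch of flatten() semantics: nested maps become dotted top-level keys.

def flatten(attrs, prefix=""):
    flat = {}
    for key, value in attrs.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))   # recurse into nested maps
        else:
            flat[name] = value
    return flat

datapoint_attrs = {
    "method": "GET",
    "status": "200",
    "resource": {"region": "us-east", "service_name": "order-service"},
}
print(flatten(datapoint_attrs))
# {'method': 'GET', 'status': '200',
#  'resource.region': 'us-east', 'resource.service_name': 'order-service'}
```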

@Shindek77
Author

Hello @bacherfl, I tried the approach below:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  transform/ResourceAttributesToDataPoint:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes[""], resource.attributes)
          - flatten(attributes)
  transform/ResourceAttributesDeletion:
    metric_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "(?i).*")
  transform:
    metric_statements:
      - context: metric
        statements:
          - aggregate_on_attributes("sum", ["region", "method"])

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      exporters:
        - debug
      processors:
        - filter
        - batch
        - transform/ResourceAttributesToDataPoint
        - transform/ResourceAttributesDeletion
        - groupbyattrs
        - transform/TruncateTime
        - transform
      receivers:
        - otlp

With this, it first turns all the resource attributes into datapoint attributes with the same names (region and service_name), as I can see in the otel logs. After that, another processor is needed to delete all of the actual resource attributes. But then, during aggregation, we get only the two time series as expected, yet the value is not aggregated properly.
Final results:

http_requests{region="us-east", method="GET"} value: 10, 5, or 8
http_requests{region="us-west", method="POST"} value: 7 or 3
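A possible explanation for these flapping values (an assumption, not confirmed from the collector source): aggregation only operates on the data points present in one batch, so points belonging to the same final series that arrive in different batches (or on different replicas) are summed separately and exported as competing values instead of one total. A minimal Python sketch of that failure mode:

```python
# Sketch of the suspected failure mode: per-batch aggregation.
# Points for the same final series that land in different batches
# are never summed together.

def aggregate_batch(batch, keep_labels):
    out = {}
    for labels, value in batch:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep_labels))
        out[key] = out.get(key, 0) + value
    return out

# The us-east series split across two batches (e.g. two scrapes or replicas).
batch_a = [({"region": "us-east", "method": "GET", "status": "200"}, 10),
           ({"region": "us-east", "method": "GET", "status": "500"}, 5)]
batch_b = [({"region": "us-east", "method": "GET", "status": "200"}, 8)]

key = (("method", "GET"), ("region", "us-east"))
print(aggregate_batch(batch_a, {"region", "method"})[key])  # 15
print(aggregate_batch(batch_b, {"region", "method"})[key])  # 8
# The backend receives 15 and 8 for the same series instead of a single 23,
# which would show up as alternating values rather than the expected total.
```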

@Shindek77
Author

Shindek77 commented Dec 11, 2024

Hello @bacherfl,

As you know, we receive a lot of metric data with many labels/attributes (resource and datapoint attributes) on our otel collector, which is deployed as a Deployment on our k8s cluster.

We have tested reducing the metric labels that are not required in our time series; the output time series should carry only the required labels and the aggregated value.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s
  groupbyattrs:
  transform/TruncateTime:
    metric_statements:
      - context: datapoint
        statements:
          - set(time, TruncateTime(time, Duration("10s")))
  transform/RemoveOtherResourceAttributes:
    metric_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^().*")
  transform/ResourceAttributesToDataPoint:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["resource"], resource.attributes)
          - flatten(attributes)
  transform:
    metric_statements:
      - context: metric
        statements:
          - aggregate_on_attributes("sum", ["l7_serviceName", "resource.service.name", "service.name", "resource.service_name", "service_name"])

exporters:
  debug:
    verbosity: detailed
  otlphttp/test-vm-test-label-reduction:
    compression: gzip
    encoding: proto
    endpoint: http://victoria-metrics-cluster-vminsert.metrics-ns.svc.cluster.local:8480/insert/8/opentelemetry
    timeout: 30s
    tls:
      insecure: true
  otlphttp/test-vm-test-without-label-reduction:
    compression: gzip
    encoding: proto
    endpoint: http://victoria-metrics-cluster-vminsert.metrics-ns.svc.cluster.local:8480/insert/9/opentelemetry
    timeout: 30s
    tls:
      insecure: true

service:
  pipelines:
    metrics/with-metrics-label-reduction:
      exporters:
        - otlphttp/test-vm-test-label-reduction
      processors:
        - batch
        - transform/ResourceAttributesToDataPoint
        - transform/ResourceAttributesDeletion
        - groupbyattrs
        - transform/TruncateTime
        - transform
      receivers:
        - otlp
    metrics/without-metrics-label-reduction:
      exporters:
        - otlphttp/test-vm-test-without-label-reduction
      processors:
        - batch
        - transform/ResourceAttributesToDataPoint
        - transform/ResourceAttributesDeletion
        - groupbyattrs
        - transform/TruncateTime
        - transform
      receivers:
        - otlp

With the above configuration we have tested two cases:

1. Keeping otel-collector replicas=1
With a single replica we get aggregated values, but sometimes we see spikes, as shown in the screenshots below.
Without processors:
[screenshot]
With processors:
[screenshot]

2. Keeping otel-collector replicas=6
With multiple replicas the output is quite erratic: for some metrics the aggregation happens sometimes, and for other metrics it is not done properly.
Without processors:
[screenshot]
With processors:
[screenshot]

Could you please help with what is going wrong here and how we should set this up?

@bacherfl
Contributor

bacherfl commented Dec 12, 2024

Hi @Shindek77 and thanks for the update - I will look into this today and try to see what is happening here. Do you also have some example payload that can be sent to the otlp receiver to reproduce this behavior?

Also, if you are using multiple replicas, keep in mind that metrics that should be grouped together need to be sent to the same instance of the collector - is there currently any mechanism in place that ensures that?
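One common way to get that guarantee is a two-tier layout: the first tier of collectors uses the loadbalancing exporter to route data to a second tier that performs the aggregation, keyed so that all data points belonging to the same series land on the same second-tier instance. A sketch of the first-tier exporter config follows; the routing_key values supported for metrics and the resolver details should be verified against the loadbalancing exporter README for your contrib version, and the hostname below is a placeholder:

```
exporters:
  loadbalancing:
    # Route by resource so all points for the same resource reach the
    # same backend collector; check the README for the metric routing_key
    # values available in your release.
    routing_key: resource
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # Placeholder: a headless Service fronting the aggregating tier.
        hostname: otel-aggregator-headless.metrics-ns.svc.cluster.local
        port: 4317
```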
