Use of sample with key_field appears to be causing events to be dropped silently. #19680

Closed · NeilJed opened this issue Jan 22, 2024 · 3 comments
Labels: type: bug (A code related bug.)

NeilJed commented Jan 22, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I'm currently using a fairly complex sampling setup because I need different sample rates for different services' logs. You can find an outline of this in issue #19332.

My pipeline route is basically source -> remap -> route -> sampler -> remap -> sink

After my sources, I pass the incoming events to a remap transform that uses some VRL logic to add a sampler field to the message, containing the desired sample rate and the value to use as the key_field, in this case the CloudFront distribution ID that the logs came from. For example:

"sampler": {
    "rate": 3,
    "key_field": "ED4332JKNG"
}
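
For illustration, the relevant part of sampler_logic.vrl looks roughly like the sketch below. The .distribution_id field and the rate rule here are placeholders, not the real per-service logic:

# Simplified sketch of sampler_logic.vrl; the field name and the
# rate rule are placeholders for the real per-service logic.
rate = 1
if exists(.distribution_id) {
    rate = 3
}
.sampler = {
    "rate": rate,
    "key_field": .distribution_id
}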

After adding the field, I use a route transform to route the event to a specific sampler for the given rate (as we can't currently set the rate dynamically). There is a final remap after that which does some final cleanup, like removing the sampler field, before sending the event to the sink.

What I noticed is that when logs pass through the samplers, they are dropped with no warnings or errors.

If I use vector tap on each stage, I can see that as the events leave the router, the sampler field exists in my messages. However, when I tap one of the samplers, there is no output. In contrast, when I tap the final remap, which collects all logs, I only see the logs that are unsampled and thus bypass the samplers.

After a lot of head scratching, what I discovered is that if I disable the use of key_field in the sampler, the events pass through correctly: I can see them with tap and they appear in my final output.
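
For illustration, here is one of the samplers with key_field removed, which behaves as expected (otherwise identical to the configuration below):

  sample_set_3:
    type: sample
    inputs:
      - sample_router.s3
    rate: 3    # key_field removed; events now pass through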

I'm not sure what the actual issue is here, but after several rounds of testing, events are mysteriously dropped whenever any of the sample transforms has key_field configured.

Configuration

transforms:
  # Set sample rate for events
  sampler_setrate:
    type: remap
    inputs:
      - "*_source"
    file: ${VECTOR_CONFIG_DIR}/vrl/sampler_logic.vrl

  # Routes events to the correct sampler
  sample_router:
    type: route
    inputs:
      - sampler_setrate
    reroute_unmatched: true

    # Define routes to use based on the desired sample rate
    route:
      none: .sampler.rate == 1       # no sampling
      s2: .sampler.rate == 2         # 50%
      s3: .sampler.rate == 3         # 33%
      s4: .sampler.rate == 4         # 25%
      s5: .sampler.rate == 5         # 20%
      s10: .sampler.rate == 10       # 10%
      s100: .sampler.rate == 100     # 1%
      s1000: .sampler.rate == 1000   # 0.1%
  
  sample_set_2:
    type: sample
    inputs:
      - sample_router.s2
    key_field: sampler.key_field
    rate: 2
  
  sample_set_3:
    type: sample
    inputs:
      - sample_router.s3
    key_field: sampler.key_field
    rate: 3
  
  sample_set_4:
    type: sample
    inputs:
      - sample_router.s4
    key_field: sampler.key_field
    rate: 4
  
  sample_set_5:
    type: sample
    inputs:
      - sample_router.s5
    key_field: sampler.key_field
    rate: 5
  
  sample_set_10:
    type: sample
    inputs:
      - sample_router.s10
    key_field: sampler.key_field
    rate: 10
  
  sample_set_100:
    type: sample
    inputs:
      - sample_router.s100
    key_field: sampler.key_field
    rate: 100
  
  sample_set_1000:
    type: sample
    inputs:
      - sample_router.s1000
    key_field: sampler.key_field
    rate: 1000
  
  # Gather all sampled events and perform any final tasks before sending onwards
  sampler_final:
    type: remap
    inputs:
      - sample_set_*
      - sample_router.none
      - sample_router._unmatched
    source: |
      del(.sampler)

Version

vector 0.35.0 (x86_64-apple-darwin e57c0c0 2024-01-08 14:42:10.103908779)

Debug Output

No response

Example Data

No response

Additional Context

This occurs with the same configuration whether running locally or as part of our ECS Fargate cluster.

References

#19332

NeilJed added the type: bug label on Jan 22, 2024
jszwedko (Member) commented:

Hi @NeilJed!

I think there might be a misunderstanding of how key_field works. It samples "groups" at the configured rate, so it could be the case that none of the "groups" are being selected. In your case, for example, it will sample 1 in 3 CloudFront distribution IDs (rather than, say, 1 in 3 messages within each distribution ID).

https://vector.dev/docs/reference/configuration/transforms/sample/#key_field attempts to explain this a bit more. It is, admittedly, a confusing feature to communicate. The intention is for key_field to be used with, for example, transaction IDs, to make sure that you get all logs for a given transaction ID if that ID is sampled.
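
As a sketch of that intended usage (the input name and the transaction_id field are just placeholders):

transforms:
  sample_transactions:
    type: sample
    inputs:
      - app_logs              # placeholder input
    # Keeps every event for roughly 1 in 10 transaction IDs,
    # so sampled transactions stay complete.
    rate: 10
    key_field: transaction_id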

Let me know if that helps to clear things up.

NeilJed (Author) commented Jan 22, 2024

OK, so I've read the documentation a few times, but to me it still reads like it's describing stratified sampling, which is what I thought I was getting.

"Each unique value for the key creates a bucket of related events to be sampled together and the rate is applied to the buckets themselves to sample 1/N buckets."

As I am including the value of the CloudFront distribution ID, my assumption was that sampling is applied per ID, i.e. for 1/3 sampling, if one ID has 3 million events I get 1 million, and for 300 I get 100. My goal was to get even sampling so that high-volume distributions don't crowd out low-volume ones.

So I'm assuming there's no way to achieve that?

I think I understand what you're saying: it samples 1/N IDs rather than 1/N events per ID. The description is very unclear; maybe a better explanation is needed?

jszwedko (Member) commented:

> As I am including the value of the CloudFront distribution ID, my assumption was that sampling is applied per ID, i.e. for 1/3 sampling, if one ID has 3 million events I get 1 million, and for 300 I get 100. My goal was to get even sampling so that high-volume distributions don't crowd out low-volume ones.
>
> So I'm assuming there's no way to achieve that?
>
> I think I understand what you're saying: it samples 1/N IDs rather than 1/N events per ID. The description is very unclear; maybe a better explanation is needed?

Yeah, that's correct: it samples 1/N IDs rather than 1/N events per ID. I agree it is a bit confusing. Currently there isn't a way to configure the sample transform to sample 1/N events per key_field bucket.

It's not sampling, but you could consider using the throttle transform (https://vector.dev/docs/reference/configuration/transforms/throttle) to enforce "quotas" for each key_field value. This would result in the transform dropping events past the quota, though, rather than sampling.
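
A rough sketch of what that could look like in your pipeline (the threshold and window values are placeholders you would need to tune):

transforms:
  throttle_per_distribution:
    type: throttle
    inputs:
      - sampler_setrate
    # Allow at most `threshold` events per distribution ID per window;
    # anything over the quota is dropped.
    threshold: 1000
    window_secs: 60
    key_field: "{{ sampler.key_field }}"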
