Use of sample with key_field appears to be causing events to be dropped silently. #19680

Closed · NeilJed opened this issue Jan 22, 2024 · 3 comments
Labels: type: bug (A code related bug.)

NeilJed commented Jan 22, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I'm currently using a fairly complex sampling setup because I need different sample rates for different services' logs. You can find an outline of this in issue #19332.

My pipeline route is basically source -> remap -> route -> sampler -> remap -> sink

After my sources, I pass the incoming events to a remap transform that uses some VRL logic to add a sampler field to the message, containing the desired sample rate and the value to use as the key_field, in this case the CloudFront distribution ID that the logs came from. For example:

"sampler": {
    "rate": 3,
    "key_field": "ED4332JKNG"
}
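
For illustration, the relevant part of sampler_logic.vrl looks roughly like the sketch below. The .distribution_id field and the rate rule here are placeholders, not the real per-service logic:

# Simplified sketch of sampler_logic.vrl; the field name and the
# rate rule are placeholders for the real per-service logic.
rate = 1
if exists(.distribution_id) {
    rate = 3
}
.sampler = {
    "rate": rate,
    "key_field": .distribution_id
}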

After adding the field, I use a route transform to route the event to a specific sampler for the given rate (as we can't currently set the rate dynamically). There is a final remap after that which does some final cleanup, like removing the sampler field, before sending the event to the sink.

What I noticed is that when logs pass through the samplers, they are dropped with no warnings or errors.

If I use vector tap on each stage, I can see that as the events leave the router, the sampler field exists in my messages. However, when I tap one of the samplers, there is no output. In contrast, when I tap the final remap, which collects all logs, I only see the logs that are unsampled and thus bypass the samplers.

After a lot of head scratching, what I discovered is that if I disable the use of key_field in the sampler, the events pass through correctly: I can see them with tap and they appear in my final output.
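
For illustration, here is one of the samplers with key_field removed, which behaves as expected (otherwise identical to the configuration below):

  sample_set_3:
    type: sample
    inputs:
      - sample_router.s3
    rate: 3    # key_field removed; events now pass through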

I'm not sure what the actual issue is here, but after several rounds of testing, events are mysteriously dropped whenever any of the sample transforms has key_field configured.

Configuration

transforms:
  # Set sample rate for events
  sampler_setrate:
    type: remap
    inputs:
      - "*_source"
    file: ${VECTOR_CONFIG_DIR}/vrl/sampler_logic.vrl

  # Routes events to the correct sampler
  sample_router:
    type: route
    inputs:
      - sampler_setrate
    reroute_unmatched: true

    # Define routes to use based on the desired sample rate
    route:
      none: .sampler.rate == 1       # no sampling
      s2: .sampler.rate == 2         # 50%
      s3: .sampler.rate == 3         # 33%
      s4: .sampler.rate == 4         # 25%
      s5: .sampler.rate == 5         # 20%
      s10: .sampler.rate == 10       # 10%
      s100: .sampler.rate == 100     # 1%
      s1000: .sampler.rate == 1000   # 0.1%
  
  sample_set_2:
    type: sample
    inputs:
      - sample_router.s2
    key_field: sampler.key_field
    rate: 2
  
  sample_set_3:
    type: sample
    inputs:
      - sample_router.s3
    key_field: sampler.key_field
    rate: 3
  
  sample_set_4:
    type: sample
    inputs:
      - sample_router.s4
    key_field: sampler.key_field
    rate: 4
  
  sample_set_5:
    type: sample
    inputs:
      - sample_router.s5
    key_field: sampler.key_field
    rate: 5
  
  sample_set_10:
    type: sample
    inputs:
      - sample_router.s10
    key_field: sampler.key_field
    rate: 10
  
  sample_set_100:
    type: sample
    inputs:
      - sample_router.s100
    key_field: sampler.key_field
    rate: 100
  
  sample_set_1000:
    type: sample
    inputs:
      - sample_router.s1000
    key_field: sampler.key_field
    rate: 1000
  
  # Gather all sampled events and perform any final tasks before sending onwards
  sampler_final:
    type: remap
    inputs:
      - sample_set_*
      - sample_router.none
      - sample_router._unmatched
    source: |
      del(.sampler)

Version

vector 0.35.0 (x86_64-apple-darwin e57c0c0 2024-01-08 14:42:10.103908779)

Debug Output

No response

Example Data

No response

Additional Context

This occurs with the same configuration whether running locally or as part of our ECS Fargate cluster.

References

#19332

NeilJed added the type: bug label on Jan 22, 2024
jszwedko (Member) commented:

Hi @NeilJed!

I think there might be a misunderstanding of how key_field works. It samples "groups" at the configured rate, so it could be the case that none of the "groups" are being selected. In your case, for example, it will sample 1 in 3 CloudFront distribution IDs (rather than, say, 1 in 3 messages within each distribution ID).

https://vector.dev/docs/reference/configuration/transforms/sample/#key_field attempts to explain this a bit more. It is, admittedly, a confusing feature to communicate. The intention is for key_field to be used with, for example, transaction IDs, to make sure that you get all logs for a given transaction ID if that ID is sampled.
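
As a sketch of that intended usage (the input name and the transaction_id field are just placeholders):

transforms:
  sample_transactions:
    type: sample
    inputs:
      - app_logs              # placeholder input
    # Keeps every event for roughly 1 in 10 transaction IDs,
    # so sampled transactions stay complete.
    rate: 10
    key_field: transaction_id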

Let me know if that helps to clear things up.

NeilJed (Author) commented Jan 22, 2024

OK, so I've read the documentation a few times, but to me it still reads like it's describing stratified sampling, which is what I thought I was getting.

"Each unique value for the key creates a bucket of related events to be sampled together and the rate is applied to the buckets themselves to sample 1/N buckets."

As I am including the value of the CloudFront distribution ID, my assumption was that sampling is applied per ID, i.e. for 1/3 sampling, if one ID has 3 million events I get 1 million, and for 300 I get 100. My goal was to get even sampling so that high-volume distributions don't crowd out low-volume ones.

So I'm assuming there's no way to achieve that?

I think I understand what you're saying: it samples 1/N IDs rather than 1/N events per ID. The description is very unclear; maybe a better explanation is needed?

jszwedko (Member) commented:

> As I am including the value of the CloudFront distribution ID, my assumption was that sampling is applied per ID, i.e. for 1/3 sampling, if one ID has 3 million events I get 1 million, and for 300 I get 100. My goal was to get even sampling so that high-volume distributions don't crowd out low-volume ones.
>
> So I'm assuming there's no way to achieve that?
>
> I think I understand what you're saying: it samples 1/N IDs rather than 1/N events per ID. The description is very unclear; maybe a better explanation is needed?

Yeah, that's correct: it samples 1/N IDs rather than 1/N events per ID. I agree it is a bit confusing. Currently there isn't a way to configure the sample transform to sample 1/N events per key_field bucket.

It's not sampling, but you could consider using the throttle transform (https://vector.dev/docs/reference/configuration/transforms/throttle) to enforce "quotas" for each key_field value. This would result in the transform dropping events past the quota, though, rather than sampling.
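
A rough sketch of what that could look like in your pipeline (the threshold and window values are placeholders you would need to tune):

transforms:
  throttle_per_distribution:
    type: throttle
    inputs:
      - sampler_setrate
    # Allow at most `threshold` events per distribution ID per window;
    # anything over the quota is dropped.
    threshold: 1000
    window_secs: 60
    key_field: "{{ sampler.key_field }}"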
