Use of `sample` with `key_field` appears to be causing events to be dropped silently. #19680
Comments
Hi @NeilJed! I think there might be a misunderstanding of how `key_field` works. https://vector.dev/docs/reference/configuration/transforms/sample/#key_field attempts to explain this a bit more. It is, admittedly, a bit of a confusing feature to communicate. The intention is for `key_field` to group related events into buckets by the key's value, with the rate applied to the buckets themselves rather than to the events within each bucket. Let me know if that helps to clear things up.
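For illustration, a minimal sketch of a `sample` transform using `key_field`, assuming the bucketing behavior described above (component and field names are hypothetical):

```toml
# Hypothetical sketch: with key_field set, events are bucketed by the
# value of "distribution_id", and roughly 1 out of every 10 buckets is
# kept in its entirety; events in the other buckets are dropped.
[transforms.sample_distributions]
type = "sample"
inputs = ["cloudfront_logs"]   # assumed upstream component
rate = 10                      # keeps 1/10 of the buckets, not 1/10 per bucket
key_field = "distribution_id"  # assumed field carrying the CloudFront ID
```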
OK, so I've read the documentation a few times, but to me it still reads like it's describing stratified sampling, which is what I thought I was getting: "Each unique value for the key creates a bucket of related events to be sampled together and the rate is applied to the buckets themselves to sample 1/N buckets." As I am including the value of the CloudFront distribution IDs, my assumption was that sampling is applied per ID, i.e. for 1/3 sampling, if one ID has 3 million events, I get 1 million, and for 300, I get 100. My goal was to get even sampling so that high-volume distributions don't crowd out low-volume ones. So I'm assuming there's no way to achieve that? I think I understand what you're saying: it samples 1/N IDs rather than 1/N events per ID. The description is very unclear; maybe a better explanation is needed?
Yeah, that's correct, it samples 1/N IDs rather than 1/N per ID. I agree it is a bit confusing. Currently there isn't a way to configure the `sample` transform to sample 1/N events within each key. It's not sampling, but you could consider using the `throttle` transform, which can cap the number of events per key over a time window.
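A rough sketch of that alternative (component names, field names, and thresholds assumed):

```toml
# Hypothetical sketch: caps each distribution at 1000 events per minute,
# so high-volume distributions cannot crowd out low-volume ones.
[transforms.throttle_per_distribution]
type = "throttle"
inputs = ["cloudfront_logs"]         # assumed upstream component
threshold = 1000                     # max events per key per window (illustrative)
window_secs = 60
key_field = "{{ distribution_id }}"  # per-key limiting; field name assumed
```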
Problem
I'm currently using a fairly complex sampling setup because I need different sample rates for different service logs. You can find an outline of this in issue #19332.
My pipeline route is basically:

```
source -> remap -> route -> sampler -> remap -> sink
```
After my sources, I pass the incoming events to a `remap` transform that uses some VRL logic to add a `sampler` field to the message, containing the desired sample rate and the value to use as the `key_field`, in this case the CloudFront distribution ID that the logs came from.
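For example, a remap along these lines (field names and rates assumed):

```toml
# Hypothetical sketch of the remap stage: attach the sample rate and
# sampling key so downstream transforms can route on them.
[transforms.add_sampler_field]
type = "remap"
inputs = ["cloudfront_logs"]   # assumed upstream component
source = '''
  # Field names and rates are illustrative.
  if .service == "cloudfront" {
    .sampler.rate = 3
  } else {
    .sampler.rate = 10
  }
  .sampler.key = .distribution_id
'''
```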
After adding the field, I use a `route` transform to route events to a specific sampler for the given rate (as we can't currently set the rate dynamically). There is a final `remap` after that which does some final clean-up, like removing the `sampler` field, before sending events to the sink.
What I noticed is that logs passing through the samplers were being dropped with no warnings or errors.

If I use `vector tap` on each stage, I can see that the `sampler` field exists on my messages as the events leave the `route` transform. However, when I tap one of the samplers, there is no output. In contrast, when I tap the final `remap`, which collects all logs, I only see the logs that are unsampled and thus bypass the samplers.

After a lot of head scratching, what I discovered is that if I disable the use of `key_field` in the sampler, the events pass through correctly: I can see them with `tap` and they appear in my final output.

I'm not sure what the actual issue is here, but after several rounds of testing, events are mysteriously dropped whenever any of the `sample` transforms have `key_field` configured.

Configuration
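A minimal sketch of the failing combination described above (all names assumed; not the full configuration):

```toml
# Hypothetical sketch: a sampler with key_field set, which in this
# report drops all events; removing key_field lets events through.
[transforms.sample_rate_10]
type = "sample"
inputs = ["route_by_rate.rate_10"]   # assumed route output
rate = 10
key_field = "sampler.key"            # removing this line restores output
```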
Version
vector 0.35.0 (x86_64-apple-darwin e57c0c0 2024-01-08 14:42:10.103908779)
Debug Output
No response
Example Data
No response
Additional Context
This occurs with the same configuration whether running locally or as part of our ECS Fargate cluster.
References
#19332