sample to optionally sample logs randomly #21393

Closed
nzxwang opened this issue Oct 1, 2024 · 2 comments
Labels
transform: sample (Anything `sample` transform related)
type: feature (A value-adding code addition that introduce new functionality.)

Comments

nzxwang commented Oct 1, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

With reference to this code block:

// Excerpt only: `num` is computed earlier in the enclosing function (not shown here).
self.count = (self.count + 1) % self.rate;
if num % self.rate == 0 {
    // Event is kept: tag it with the sample rate and forward it.
    match event {
        Event::Log(ref mut event) => {
            event.namespace().insert_source_metadata(
                self.name.as_str(),
                event,
                Some(LegacyKey::Overwrite(vrl::path!("sample_rate"))),
                vrl::path!("sample_rate"),
                self.rate.to_string(),
            );
        }
        Event::Trace(ref mut event) => {
            event.insert(event_path!("sample_rate"), self.rate.to_string());
        }
        Event::Metric(_) => panic!("component can never receive metric events"),
    };
    output.push(event);
} else {
    // Event is discarded.
    emit!(SampleEventDiscarded);
}

sample currently uses a deterministic counter, incremented across all input events, to decide whether to discard an individual event. As a result, a single sample component cannot handle several streams of events fairly, especially if their volumes differ greatly, because the largest input stream overwhelms the others.
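To make the problem concrete, here is a minimal, self-contained sketch of the counter logic from the excerpt above. It assumes `num` is simply the counter value before the increment; `CounterSampler` and `kept_per_service` are hypothetical names used only for illustration, not Vector types. Two services share one sampler, and the low-volume one happens to always land on a counter value that is discarded.

```rust
/// Minimal stand-in for the counter logic in the excerpt above.
/// `CounterSampler` is a hypothetical name, not Vector's type.
struct CounterSampler {
    rate: u64,
    count: u64,
}

impl CounterSampler {
    fn new(rate: u64) -> Self {
        Self { rate, count: 0 }
    }

    /// Returns true when the event should be kept.
    fn keep(&mut self) -> bool {
        // Assumption: no key field is configured, so the counter alone decides.
        let num = self.count;
        self.count = (self.count + 1) % self.rate;
        num % self.rate == 0
    }
}

/// Feeds an interleaved stream (9 events from "big", then 1 from "small",
/// repeated `rounds` times) through one shared sampler and returns how many
/// events of each service were kept, as (big, small).
fn kept_per_service(rate: u64, rounds: usize) -> (usize, usize) {
    let mut sampler = CounterSampler::new(rate);
    let (mut big, mut small) = (0, 0);
    for _ in 0..rounds {
        for i in 0..10 {
            let is_small = i == 9;
            if sampler.keep() {
                if is_small { small += 1 } else { big += 1 }
            }
        }
    }
    (big, small)
}

fn main() {
    let (big, small) = kept_per_service(10, 100);
    println!("kept from big: {big}, kept from small: {small}");
}
```

With a rate of 10 and this particular interleaving, every kept event comes from the high-volume service. Real traffic is less regular, but the low-volume stream still has no guaranteed share of the sampled output.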

We would like to use a single sample component for every service's logs to keep startup times low. For that to work, sample would have to sample each service's logs randomly, independently of the others.

Attempted Solutions

We have essentially implemented this random sampling with a remap component that assigns each log a to_be_dropped attribute based on whether random_float(0.0, 1.0) > (1.0/sample_rate), followed by a filter with the condition to_bool!(to_be_dropped) == false. Each log is therefore kept with probability 1/sample_rate.
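The same decision rule can be sketched in Rust. This is an illustration of the workaround's logic, not the remap/filter configuration itself. `SplitMix64`, `keep_randomly`, and `kept_count` are hypothetical names, and the hand-rolled generator is only there to keep the sketch free of external crates; a real implementation would more likely use the `rand` crate.

```rust
/// Tiny SplitMix64 generator so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Mirrors the workaround: an event is dropped when the random draw
/// exceeds 1 / sample_rate, so it is kept with probability 1 / sample_rate.
fn keep_randomly(rng: &mut SplitMix64, sample_rate: u64) -> bool {
    rng.next_f64() <= 1.0 / sample_rate as f64
}

/// Counts how many of `events` events survive sampling for a given seed.
fn kept_count(seed: u64, sample_rate: u64, events: usize) -> usize {
    let mut rng = SplitMix64(seed);
    (0..events)
        .filter(|_| keep_randomly(&mut rng, sample_rate))
        .count()
}

fn main() {
    let kept = kept_count(42, 10, 100_000);
    println!("kept {kept} of 100000 events at sample_rate 10");
}
```

Because each event is decided independently, every stream passing through the component keeps roughly 1/sample_rate of its own events, regardless of how the streams are interleaved.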

Proposal

Add a mode option: an enumeration that defaults to incremental (the current behavior), with random as an alternative for the behavior described above.
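One possible shape for such an option, as a rough sketch: the names `SampleMode`, `Incremental`, `Random`, and `resolve_mode` are hypothetical and not part of Vector's configuration, and the string values are assumed spellings.

```rust
use std::str::FromStr;

/// Hypothetical option; these names are illustrative and not part of
/// Vector's configuration.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum SampleMode {
    /// Current behavior: keep events based on a shared counter.
    #[default]
    Incremental,
    /// Proposed behavior: keep each event independently with probability 1/rate.
    Random,
}

impl FromStr for SampleMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "incremental" => Ok(SampleMode::Incremental),
            "random" => Ok(SampleMode::Random),
            other => Err(format!("unknown sample mode: {other}")),
        }
    }
}

/// Resolves an optional config value, falling back to the default mode
/// so that existing configurations keep their current behavior.
fn resolve_mode(configured: Option<&str>) -> Result<SampleMode, String> {
    match configured {
        Some(s) => s.parse(),
        None => Ok(SampleMode::default()),
    }
}

fn main() {
    println!("{:?}", resolve_mode(None));
    println!("{:?}", resolve_mode(Some("random")));
}
```

Defaulting to the incremental variant means a configuration that omits the option behaves exactly as it does today.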

References

No response

Version

vector 0.40.0 (x86_64-apple-darwin 1167aa9 2024-07-29 15:08:44.028365803)

@nzxwang nzxwang added the type: feature A value-adding code addition that introduce new functionality. label Oct 1, 2024
@nzxwang nzxwang changed the title sample to optionally sample logs *randomly* sample to optionally sample logs randomly Oct 1, 2024

jszwedko commented Oct 1, 2024

Hi @nzxwang! Thanks for filing this. I think it is the same as #20921. Would you agree? That issue already has a PR that is attempting to implement it.

@jszwedko jszwedko added the transform: sample Anything `sample` transform related label Oct 1, 2024

nzxwang commented Oct 2, 2024

Hi @jszwedko, thanks for the quick response. The problem statement is the same as #20921, and they proposed an even better "stratified sampling" solution that deviates less from the current behavior. I can't think of any relevant downsides to it.
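For readers unfamiliar with the term, here is a rough sketch of what stratified sampling could look like, assuming it means keeping one counter per group key rather than one shared counter. The details of the proposal in #20921 are not reproduced here; `GroupedSampler` and `kept_per_group` are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical sampler that keeps a separate counter for each group key.
struct GroupedSampler {
    rate: u64,
    counts: HashMap<String, u64>,
}

impl GroupedSampler {
    fn new(rate: u64) -> Self {
        Self { rate, counts: HashMap::new() }
    }

    /// Returns true when the event from `group` should be kept.
    fn keep(&mut self, group: &str) -> bool {
        let count = self.counts.entry(group.to_string()).or_insert(0);
        let num = *count;
        *count = (*count + 1) % self.rate;
        num % self.rate == 0
    }
}

/// Feeds an interleaved stream (9 events from "big", then 1 from "small",
/// repeated `rounds` times) through one sampler and returns how many events
/// of each group were kept, as (big, small).
fn kept_per_group(rate: u64, rounds: usize) -> (usize, usize) {
    let mut sampler = GroupedSampler::new(rate);
    let (mut big, mut small) = (0, 0);
    for _ in 0..rounds {
        for i in 0..10 {
            let group = if i == 9 { "small" } else { "big" };
            if sampler.keep(group) {
                if group == "small" { small += 1 } else { big += 1 }
            }
        }
    }
    (big, small)
}

fn main() {
    let (big, small) = kept_per_group(10, 100);
    println!("kept from big: {big}, kept from small: {small}");
}
```

Under this assumption each group is sampled at the configured rate on its own, and the behavior stays deterministic, which is why it deviates less from the current counter-based approach than random sampling does.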

@nzxwang nzxwang closed this as not planned Oct 2, 2024