This repository was archived by the owner on Apr 1, 2025. It is now read-only.

Fix reservoir sampling #229

Conversation

markwongsk

-- when the sample size has reached the reservoir size,
the original implementation popped off the min
element of the priority queue before inserting
the updated element. Priority Sampling should retain the
"k items with highest priority". The existence of the
uniformly selected u_i from [0..1] makes it possible
that the updated value actually has a lower priority
than the existing k values, as demonstrated by the
seeded test
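A minimal sketch of the corrected update rule, assuming a plain container/heap min-heap keyed on priority w/u (the names and types here are illustrative, not the actual go-metrics ExpDecaySampleHeap):

// Illustrative only; the real ExpDecaySample uses its own heap type and
// decay factor. The point is the order of operations: push first, then pop.
package main

import (
	"container/heap"
	"math"
	"math/rand"
)

// item holds a sampled value and its priority w/u.
type item struct {
	value    int64
	priority float64
}

// minHeap keeps the lowest-priority item at the root.
type minHeap []item

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].priority < h[j].priority }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(item)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	it := old[n-1]
	*h = old[:n-1]
	return it
}

// update inserts a new value with weight w. When the reservoir is full it
// pushes first and only then pops the minimum, so the new item is dropped
// if its own priority w/u happens to be the lowest of the k+1 candidates.
func update(h *minHeap, reservoirSize int, value int64, w float64, rng *rand.Rand) {
	u := rng.Float64()
	if u == 0 {
		u = math.SmallestNonzeroFloat64 // avoid division by zero
	}
	heap.Push(h, item{value: value, priority: w / u})
	if h.Len() > reservoirSize {
		heap.Pop(h) // may remove the item just added
	}
}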

@markwongsk
Author

Fixes #230

@mihasya
Collaborator

mihasya commented Mar 26, 2018

I'm a little unclear on the motivation for these changes. I see the statement that Priority Sampling should retain the "k items with highest priority" - is this coming from an academic definition of Priority Sampling?

The objective of the implementation is to weigh more recent data significantly more heavily than older data. Barring a very compelling correctness argument ("these numbers are completely wrong"), I'd be reluctant to change the formula. Anyone upgrading would likely see at least a subtle change in their reported results, which would make time-period comparisons more difficult.

@markwongsk
Author

@mihasya Hey, thanks for the comments and I apologize for missing this notification.

For the definition of Priority Sampling, I'm using the one from the paper linked in the code comments. For clarity, here is the link: http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf

On page 7, bottom left: "... and the algorithm retains the k items with highest priorities".

Note in particular that the current implementation specifically does not adhere to this definition, since it always includes the latest element in the stream. The last element isn't guaranteed a place in the sample because it may not have one of the k highest priorities: the u_i are drawn uniformly from [0..1], so there is a nonzero probability that w_i/u_i >= w_n/u_n for every item i currently retained, where n is the latest element in the stream. Intuitively, a weighted sample should be biased towards recent values, but should not always include the latest element.
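To make the probability argument concrete with made-up numbers: suppose k = 2 and the reservoir currently holds priorities w_1/u_1 = 5/0.1 = 50 and w_2/u_2 = 6/0.2 = 30. If the newest element has weight w_3 = 8 but happens to draw u_3 = 0.9, its priority is 8/0.9 ≈ 8.9, lower than both retained priorities, so under the paper's rule it would not be kept even though it is the most recent item.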

Practically, the test I wrote, "TestExpDecayDoesNotAlwaysIncludeLastItem", demonstrates that the new implementation doesn't always include the latest element in the sample when a new element arrives in the stream. Note that without my changes, this test actually fails. At the end of the day I'm not introducing anything new, just trying to fix an implementation flaw in the algorithm. Happy to go into more math if I need to.
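For reference, a rough sketch of what such a seeded check could look like, assuming the rcrowley/go-metrics API (NewExpDecaySample, Update, Values) and that the sample draws its u_i from math/rand's global source; the PR's actual test differs in its details:

// Illustrative sketch only, not the test from this PR.
package main

import (
	"fmt"
	"math/rand"

	metrics "github.com/rcrowley/go-metrics"
)

func main() {
	rand.Seed(1) // fixed seed so the run is reproducible

	const reservoirSize = 10
	s := metrics.NewExpDecaySample(reservoirSize, 0.015)

	// Fill past the reservoir size, then check whether the most recent
	// value is always present in the sample. With the fix it sometimes
	// should not be.
	missing := 0
	for i := int64(0); i < 1000; i++ {
		s.Update(i)
		if int(i) < reservoirSize {
			continue // reservoir not yet full
		}
		found := false
		for _, v := range s.Values() {
			if v == i {
				found = true
				break
			}
		}
		if !found {
			missing++
		}
	}
	fmt.Printf("latest value missing from the sample %d times\n", missing)
}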

@seanjoe

seanjoe commented Jul 4, 2018

Does this affect the server's memory when the library is used to monitor a service? There is a memory leak in my service; the go tool pprof output is as follows:

(pprof) top
Showing nodes accounting for 292.13MB, 91.35% of 319.79MB total
Dropped 112 nodes (cum <= 1.60MB)
Showing top 10 nodes out of 97
      flat  flat%   sum%        cum   cum%
  238.64MB 74.62% 74.62%   238.64MB 74.62%  go-metrics.newExpDecaySampleHeap

Its memory usage grows little by little, up to 1 GB and more.
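For what it's worth, a heap profile like the one above can be captured by exposing the standard pprof handlers in the service (a minimal sketch; the port is arbitrary):

// Exposes /debug/pprof/* so heap growth (e.g. in newExpDecaySampleHeap
// allocations) can be inspected with `go tool pprof`.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// Then run: go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}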

@markwongsk markwongsk closed this Mar 23, 2022