-
-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider Fast Loaded Dice Roller for WeightedIndex
#1014
Comments
I played a bit with the algorithm, but I could not make it perform as well as our alias-method implementation. The algorithm consumes the random numbers bit-by-bit. On the one hand, this avoids throwing away random data, on the other hand, the approach does not buy much, because generating random bits is cheap with modern generators. I suspect the results in the paper are with a slow RNG like the Mersenne Twister. There is also a similar Fast Dice Roller algorithm for uniform sampling, which would also save a lot of random bits for small ranges. However, in practice it seems to be much slower with a fast RNG. |
Vks wrote:
@fsaad, can you please comment? |
Thanks for trying out FLDR.
@vks Does your alias implementation use exact sampling or floating-point approximations of real numbers? The runtime comparison, both theoretical and practical, between FLDR and alias sampling is with respect to error-free sampling. If the alias implementation uses floating-point approximations of real numbers then there is not much that can analyzed in terms of performance with FLDR, since an inexact alias implementation typically trade-offs error with runtime, whereas FLDR is designed to always be error free. If you have an error-free alias sampler, I would be really interested to explore its performance as compared to FLDR, and I would be quite surprised if FLDR is slower (especially on low to medium entropy distributions), both for preprocessing and sampling.
While the FLDR paper uses a simple (naive) serial implementation of the sampler (Alg 5), there are many other ways to implement that leverage alternate data structures, which include linear encodings of the DDG sampling tree (Algs 4--7 Saad et al. 2020), bit-slicing for constant time generation (Sec 3, Karmakar et al. 2018), and batch generating with bit recycling (Alg 3, Devroye & Gravel 2018), among many others. The choice of implementation depends on the constraints of the application. For example, lazy DDG sampling is typically far more efficient as compared to floating-point arithmetic when implementing the sampler directly on hardware (e.g., an FPGA). While FLDR is characterized by its theoretical near-optimal entropy use, many of these techniques can be used in practice to push down the constant factors of runtime (or memory).
The FDR of Lumbroso 2013 is an exact uniform sampler. Unlike FLDR, the FDR is theoretically optimal (for uniform sampling), so it should be superior to any other exact uniform sampler, though I agree that non-exact samplers can be faster in practice. Overall, I think it is very valuable to compare the performance of exact and approximate sampling, as your ticket may point toward. The status quo of most existing software libraries is inexact sampling with floating-point. FLDR is an initial step toward showing that exact sampling can be practical, both theoretically and for applications, for general discrete distributions. There is of course a lot more work to do to explore the trade-offs with inexact samplers. |
The code for our weighted alias implementation is here. IIRC it uses exact sampling, with the requirement that the sum of weights can be represented by the weight type. Weight type may be an integer or floating-point or a user-defined type; the latter implies that rounding can occur, depending on the user's choice of types. In many benchmarks however I found the alias method a worse choice than the binary search method due to the set-up time, hence the latter is our default (and also exact if used with integer types). |
I don't think we use floating-point when the weights are integers. I did not look into the details, but I doubt we use the alias method documented in the paper:
In particular, we are not using the FDR for uniform sampling, which would be twice as slow. Here are the benchmark results:
I also implemented the FDR uniform sampling. It's a bit awkward to implement with our API, so I'm not sure it's optimized as well as it could be.
The code is here. |
For your information there are numerous data structures and algorithms for weighted sampling with replacement, besides the Fast Loaded Dice Roller, binary search, and the alias method, and I list a great number of them. Perhaps it would be a good idea to compare their performance. |
@vks Thanks a lot for the code. I'm not an expert in Rust, but it seems that there are many ways that this FLDR implementation can be optimized. One immediate thought is that, in your current implementation, the PRNG is always called once per sample, i.e., rand/rand_distr/src/weighted_fldr.rs Line 85 in e2b4b12
In our reference implementation, we separately store a "global" buffer of 31 random bits (in C, rand() produces a positive 32 bit signed integer, so only 31 of the bits are random), which is the function "flip()" in the following file: https://github.com/probcomp/fast-loaded-dice-roller/blob/master/src/c/flip.c The main FLDR sampler then calls "flip()" to obtain a fresh bit: which may or not may not involve a call to the PRNG, depending on whether the buffer is empty. This approach ensures that we do not spuriously generate random bits. If FLDR is implemented to use a separate buffer that is shared across calls to "sample", instead of calling the PRNG once per sample, then I do expect the runtime should improve in many regimes. |
@fsaad Thanks for having a look!
Fair enough, that is a very good point! Unfortunately, implementing this is a bit tricky, because our API requires
Going with 2. seems to have an adverse effects on the performance for large sets, while improving it for small sets:
|
Personally I don't like option 2. Can we implement |
It's going to take quite a good reason to convince me that we should change that. Especially since the benchmarks so far have the alias-method winning everywhere (especially in set-up, where I already thought the alias method was slow). |
I agree. Does your API For
Do the benchmarks use test distributions with varying entropy, i.e., for a distribution with n outcomes, entropies from 0 to log_2(n)? The average runtime of sampling in FLDR is a direct function of the entropy. The runtime of preprocessing also varies with the values of probabilities in the test distribution. The point about set-up is surprising. Do you have a pointer to the benchmarks and the measurements of setup runtime? |
Good question. No. Theoretically it could, even using different approaches depending on the RNG, but we chose not to do that. For simpler PRNGs we use no buffer at all and for block-based PRNGs we use a block which we read off one word (32-bit) at a time. Another factor to consider (for both mutable distribution and buffered RNGs) is reproducible behaviour. If we allowed out-of-order usage of random bits/bytes it becomes much harder to reason about and especially document the behaviour, thus making reproducibility hard. If we don't allow this, then every read has to either be compatible with misaligned buffers (slow) or check and clear up to the next alignment point (also slow). At this point we're no longer optimising one RNG+distribution pair but expected usage of the RNG across a number of distributions. I suppose we could side-step that problem by having a special RNG optimised for bit- or byte-sized reads, though that's another step in complexity. @vks which RNG did you use for these benchmarks? |
Aren't we having this problem already when mixing
In principle, we could have a wrapper type implementing the buffering. It can then be used as needed. For example, we could just add a
It's just our default benchmarks, they use
See here. |
I think we can use a single counter for used bits, which would be incremented by 8 for read byte, 32 for As an advantage this approach will be less wasteful (and thus more performant) for users using primarily |
Re compromising performance on pure u64: I think you can reduce that cost to a single branch.
First, we discard this as a design constraint when the application is making RNG dependent range requests. I do not believe any application should expect seeded RNG stability under a reordering of range request sizes. Deterministic behavior for a seeded RNG under a deterministic request stream should be preserved. (Note: it is theoretically possible for this to weaken the stream somehow, when considered as a complete system with the application. Probably a fun research paper or game speedrun investigation for someone.) The "one branch" algorithm is then simple: for any "large enough" requests, bypass or discard the buffer. Any requests for over half the source RNG bit length (>32.0 bits) should bypass the buffer, as they are unlikely to be followed by small enough requests. Note that 63 and 53 bits are common request lengths; there is little to no profit to be had in attempting to merge leftover bits there. Any requests for more bits than are contained in the buffer (e.g. buffer has 2 bits remaining, request wants 6) should discard the remaining buffer instead of attempting to merge the leftovers. |
Hang on, why would we need to compromise performance at all for existing use cases? Requests for a full u64 would usually know at compile time that they want a full u64, so they could just make a method call that gets a random u64 from the backing RNG rather than asking for bits from the "spare bits buffer" at all. I'm currently fiddling with some code that keeps just such a "spare bits buffer", and is built entirely as a wrapper around a source Rng, without having to change the Rng interface at all. For generating small uniform ranges, my wrapper around ChaChaRng is able to outperform using ChaChaRng directly. (On the other hand, it's hard to beat Pcg64Mcg. On the third hand, I haven't put a lot of work into optimizing it.) |
@elidupree see notes about reproducibility in this comment. |
I saw that comment, but I don't understand it. Let me ask some questions to clarify my understanding: If an Rng implementor is deterministic and platform-independent, and my code that keeps a bit buffer is a deterministic, platform-independent wrapper around the Rng, then how could the combination possibly be less reproducible than the Rng we started with? From the wording "out-of-order usage of random bits/bytes", my only guess about what you mean is that you're talking about getting different results if you switch the order between a request for bits and a request for a full u64, but that can't be right, because an Rng always gives different results if you switch the order between any two requests for numbers... |
Besides non-determinicity there are a few possible sources of non-portable results:
But this is also about documenting behaviour...
Is this actually the case (for two different requests) given an RNG using a buffer for partially-consumed generated values? There is nothing in the |
Those are possible sources of non-portable results in general, but I still don't see how any of them would cause difficulties for implementing a gen_bits API. Endianness could be an issue if you implement it using casts, but my quick-and-dirty implementation isn't endianness-dependent (it just uses bit ops to extract the desired number of bits); I don't see how this is different from any other situation where we accept a small performance cost to guarantee platform-independence.
Sorry, my wording was imprecise. Of course it's not guaranteed that the values will be different (for one thing, they could be the same due to luck!) What I'm saying is that the Rng traits already do not provide any guarantee that results will be identical in any situation other than an identical sequence of requests (with identical order). I suppose implementors are permitted to provide additional guarantees (I've never heard of a trait that forbids implementors from providing additional guarantees), but is there any specific implementor that does? |
Sorry, what I wrote was besides the point. We have discussed in the past whether RNGs may discard unused bits/bytes (to which the conclusion was yes, though some attempts have been made not to do this, most recently #1031). I believe we also discussed whether an RNG may return bits out-of-order from when they are generated (using a buffer for some outputs but not others), but I don't recall and can't find the exact conclusion here. I thought we had decided against this (i.e. forcing clearing of buffered bits whenever generating fresh output), but do not see that documentation. (It's also not exactly "portability" and more "how to reproduce output given an RNG algorithm and sequence of requests".) |
Note that code in that PR also discards bits for performance reasons (we do not want to copy |
We still do not support single-bit random value generation (see #1261 for related proposals), and I don't think it's very likely that we will. We also do not really have an interest in supporting mutable "distribution" objects. Therefore I think it's unlikely we will pursue the suggested FLDR algorithm. |
It seems to outperform the alias method (and others), being close to the theoretical optimum of entropy use. According to the paper, it is faster than the alias method (if the latter is modified to produce exact samples), while using several orders of magnitudes less preprocessing time. Maybe it can replace our current algorithms in Rand?
There is a reference implementation in C and Python.
The text was updated successfully, but these errors were encountered: