Future of distributions in rand #290
Very good question. I don't know; potentially I'm open to big changes (post 0.5). Statrs focusses on The The other distributions we have (exponential, normal, gamma) do feel out-of-place in One thing not clear about the statrs distributions is how precise the sampling is — e.g. is the Bernoulli distribution accurate for I think we should also ask @boxtown what he thinks? |
I see the |
I'm open to the idea of depending on |
I think the main reason for this is simplicity. AFAIK, the sampling of
I agree, it makes sense to move them to
Yes, I think it makes sense to avoid duplicated work implementing the distributions, which is the main reason I brought up the issue. |
I labelled this low-priority mostly because it needs more planning and would probably be better to discuss after more distribution code has been implemented / optimised / reviewed, and also because it's beyond the scope of the next release, but this is an important topic to finalise before the 1.0 release. |
Just my personal opinion. There are a few distributions I really want to keep/have in Rand:
What I am not sure about is generating random numbers that fit some probability distribution. Because I have not used them myself, I can't care strongly whether they are part of Rand or not. Two not-so-strong reasons to have them in Rand (or a clear companion crate to point to):
On the other hand, I think it is best to keep generating values according to some distribution in Rand. For simple use cases people can just use Rand. Then it would be nice if |
It's sometimes useful if two projects do the same things from different perspectives, especially if those things involve tricky things like presenting mathematics well. |
It would be good to make some progress on this now. From @vks's suggestions, I like 1 and 3 the best.
1. Only implement sampling in rand. Port the missing distributions from statrs to rand.
This means all the distribution objects (e.g. the Traits like
3. Remove the distributions from rand and suggest to use statrs instead.
We should keep This could be significantly simpler to maintain, and perhaps makes the scope of Rand clearer. Note that several of the distribution implementations in Rand may be faster than those in
4. New sub-crate
This is just a variant of 1. We move the distribution objects (and their implementations of Functionally, this is not much different than option 3. The main difference is probably that users do not have to import the full
There is some overlap with #494 in that if we use option 3 or 4 then we do not need to worry about most of the distributions should we implement high and low precision sampling variants for e.g.
This is also related to #431 in design concepts; do we prefer few large crates or many small crates? E.g. the |
@vks don't you think we should resolve this issue before trying to duplicate all functionality in There are several issues currently:
I'm not sure where we should go from here...
|
@dhardy The status quo is that Rand only implements sampling. Most of the work in statrs is calculating the statistics (PDF, CDF, moments etc.). In principle statrs could use the sampling implementations from Rand without introducing a breaking change, so moving them to Rand could actually reduce duplication. However, as you noted this will result in duplication: A distribution struct has to be created for sampling and another for calculating statistics, unless the internal parameters are made accessible. I think the main decision is whether we want the code for calculating statistics in Rand or not:
Personally, I would prefer the first option, but the disadvantage is that this would add a lot of code that most users will not use. It would also make Rand a bit of a misnomer. Maybe we should get some community feedback on Reddit/the Rust user forum.
This could be done without breaking changes by introducing
That should be easy to fix. |
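The trade-off described above (one struct for sampling and another for statistics, unless the parameters are accessible) can be sketched as follows. This is only an illustration under stated assumptions: the modules `sampling` and `stats` stand in for two separate crates, and `Lcg`, `Exp::lambda` and the `Statistics` trait are hypothetical names, not the actual API of Rand or statrs.

```rust
/// Minimal stand-in for an RNG so the sketch has no dependencies.
/// (Hypothetical; a real crate would be generic over an RNG trait.)
pub struct Lcg(u64);

impl Lcg {
    pub fn new(seed: u64) -> Self {
        Lcg(seed)
    }

    /// Returns a float in [0, 1) built from the top 53 bits of the state.
    pub fn next_f64(&mut self) -> f64 {
        // Multiplier and increment of Knuth's MMIX linear congruential generator.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// "Sampling crate" side: owns the parameters and the sampling code.
pub mod sampling {
    use super::Lcg;

    #[derive(Clone, Copy, Debug)]
    pub struct Exp {
        lambda: f64,
    }

    impl Exp {
        /// Rejects non-positive or non-finite rates.
        pub fn new(lambda: f64) -> Option<Exp> {
            if lambda > 0.0 && lambda.is_finite() {
                Some(Exp { lambda })
            } else {
                None
            }
        }

        /// Read-only accessor: this is what lets another crate reuse the struct.
        pub fn lambda(&self) -> f64 {
            self.lambda
        }

        /// Inverse-transform sampling: -ln(1 - u) / lambda for u in [0, 1).
        pub fn sample(&self, rng: &mut Lcg) -> f64 {
            -(1.0 - rng.next_f64()).ln() / self.lambda
        }
    }
}

/// "Statistics crate" side: adds PDF, CDF and moments through its own trait,
/// using only the public accessor of the sampling struct.
pub mod stats {
    use super::sampling::Exp;

    pub trait Statistics {
        fn pdf(&self, x: f64) -> f64;
        fn cdf(&self, x: f64) -> f64;
        fn mean(&self) -> f64;
    }

    impl Statistics for Exp {
        fn pdf(&self, x: f64) -> f64 {
            if x < 0.0 { 0.0 } else { self.lambda() * (-self.lambda() * x).exp() }
        }

        fn cdf(&self, x: f64) -> f64 {
            if x < 0.0 { 0.0 } else { 1.0 - (-self.lambda() * x).exp() }
        }

        fn mean(&self) -> f64 {
            1.0 / self.lambda()
        }
    }
}
```

With an accessor like `lambda()`, the statistics side needs no second struct; without it, the parameters would have to be stored twice.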
I think having a |
Sure. If @boxtown would prefer a community-maintained crate over I guess we may as well take the opportunity to re-design as we see fit (potentially combine the @vks do you want to make a Reddit post about this then? It sounds like there is quite a bit of scope for input, though probably only a small set of interested users. |
Maybe we should be careful not to take too much on our plate. @dhardy Don't you think this will add quite a lot of extra work? Is it something to start right now? |
I think it is important to know where we are headed before blindly adding every distribution we can. As for being too much for us to handle, sure, it is more than we can get to right now — except there are currently five people making significant contributions to Rand, and as @boxtown says, it probably makes sense to move to a community-maintained project for an important stats library at some point (I doubt any of us will continue working on Rand for the lifetime of the Rust project). In the mean-time though, it would probably make more sense to port @boxtown, as the developer of
|
Sure, I can do that. What is our current consensus? We could do something like that:
I think it can be done incrementally, as mentioned above. |
Rand 0.6 is basically done, and resolving this issue should be next on the agenda. We could remove all distributions other than
Is this a good goal? It doesn't exactly make Rand a small lib, but does move us in that direction (though we couldn't go much further without either removing a lot of features or modularisation pushing The other perspective is de-duplication with |
My plan was to move all distributions (including the ones you mentioned) to a In principle I would like to have everything about distributions in one place. This is the approach taken by R and Julia, which I consider best-in-class for statistics. However, for now it should be enough to focus on sampling.
I'm also not sure whether we can get away with one internal representation for everything. If a different internal representation is needed, it can probably be implemented on top of the representation used for sampling. |
But this is what You mean you would like to remove even |
Yes, basically. In any case, this is something to consider after moving the sampling to
|
So you are proposing modularisation... I don't really see the point though; it doesn't help us and won't help most users (who will still depend on all the same code, just in two crates instead of one). Part of the point of removing many distributions (from my POV) is that most users do not use those distributions, therefore will depend on less code. (There is probably also a significant subset who only want a randomness source like |
It helps us to iterate on the distributions without having to worry about
This is not possible, because of the convenience methods in |
No. You cannot iterate on the distributions upon which
If I understand, you want nonsensical things:
|
Yes, this is what I meant — without having |
You could release newer versions of
I'm not sure it's worth it to keep them in two different places.
I don't see how that would be a problem. |
@vks yes, this would partially decouple
Is this your real rationale? I wasn't intending on doing much redesign anyway, myself. I don't see any good rationale for your idea so far. |
Even if we keep the distributions used by (Ideally, I would like to see
No, my real rationale is that I think it makes sense to modularize that functionality into a different crate. The decoupling is one of the advantages. |
Well, potentially we could push:
then re-export
To fully realise this, we'd also need at least Further, we currently use Edit: until we've finished dealing with the fall-out from 0.6.2 we should hold off on big PRs. |
I think we should start on this now — the goal being to reduce Eventually this could still merge with To do items:
|
We've already had complaints about Rand using too many crates. Adding another crate doesn't really gain anything, since if So let's do the following:
Thus from a user's point of view, they can just use This change lets us do two things:
|
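The re-export idea discussed in the comments above can be sketched in a single file. This is an assumption-laden illustration: the modules `distr_crate` and `main_crate` stand in for two separate crates, and every type name here is a placeholder rather than the layout Rand actually adopted.

```rust
/// Stand-in for a separate distributions crate (hypothetical).
pub mod distr_crate {
    pub struct Exp1;

    impl Exp1 {
        pub fn name(&self) -> &'static str {
            "Exp1"
        }
    }
}

/// Stand-in for the main crate (hypothetical).
pub mod main_crate {
    pub mod distributions {
        /// A distribution that stays in the main crate.
        pub struct Uniform;

        impl Uniform {
            pub fn name(&self) -> &'static str {
                "Uniform"
            }
        }

        // Re-export everything from the distributions crate, so users can
        // reach both sets of types through one path.
        pub use super::super::distr_crate::*;
    }
}
```

Because a re-exported item is the same type under either path, a value named through `main_crate::distributions::Exp1` can be passed where `distr_crate::Exp1` is expected. The cost is that the re-exporting crate's public API then includes the other crate's types, which is why the version coupling comes up in the discussion.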
Sounds good to me! I'll presumably try to implement this in a week or so. |
Keeping the versions in lock-step does have disadvantages:
On the other hand, without this, re-exporting all of However, I think having all distributions available through This means that FYI I've started working on this |
|
Does this duplication need to be addressed? @boxtown Statrs claims to be a port from the C# Math.NET library. It implements a bunch of things (e.g. error function, CDFs) that are quite specialised (not widely used). Its distributions are implemented such that properties like
Because of this I think there is scope for the ongoing existence of both libraries, and I have documented this here and here. (Note: another possibility would be including the extra functionality in |
We will need some of |
I don't really see a need to de-dupe. Like @burdges mentioned above I don't see a harm in having two approaches to sampling and like you mentioned, the bulk of the work Statrs is doing is in calculating attributes of various distributions |
🤣 Last time I tried histogram testing, I think even with 10 million samples I was struggling to get useful results. I think it's probably a dead-end, at least for standard integration testing (though it may be somewhat usable for offline analysis). OTOH "black box testing" (aka testing a few samples produced the same result as last time) is easy to do and somewhat useful. |
I was able to get errors smaller than 0.001 with a million samples, but this is for (kind of) uniform distributions: https://github.com/rust-random/rand/blob/master/rand_distr/tests/uniformity.rs For non-uniform distributions it will be more tricky. |
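The two testing styles mentioned above can be sketched roughly as follows. This is a minimal illustration, not the linked `uniformity.rs` test: the generator is a plain SplitMix64 used only to keep the sketch dependency-free, and `max_bin_error` and `first_outputs` are hypothetical helper names.

```rust
/// Small deterministic generator (SplitMix64), used here only as a stand-in RNG.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Float in [0, 1) from the top 53 bits.
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Histogram check for a uniform sampler: returns the largest absolute
/// deviation of any bin's observed frequency from the expected 1 / bins.
pub fn max_bin_error(rng: &mut SplitMix64, samples: usize, bins: usize) -> f64 {
    assert!(bins > 0 && samples > 0);
    let mut counts = vec![0usize; bins];
    for _ in 0..samples {
        let i = (rng.next_f64() * bins as f64) as usize;
        counts[i.min(bins - 1)] += 1;
    }
    let expected = 1.0 / bins as f64;
    counts
        .iter()
        .map(|&c| (c as f64 / samples as f64 - expected).abs())
        .fold(0.0, f64::max)
}

/// Black-box check: the first `n` outputs for a fixed seed. A test would
/// compare these against values recorded from an earlier run.
pub fn first_outputs(seed: u64, n: usize) -> Vec<u64> {
    let mut rng = SplitMix64::new(seed);
    (0..n).map(|_| rng.next_u64()).collect()
}
```

The histogram check needs many samples because the per-bin error only shrinks with the square root of the sample count. The black-box check is cheap, but it only detects that output changed, not whether the distribution is correct.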
When working with distributions it is very useful to have access to their density, their distribution function and their quantile function in addition to being able to sample from them. Furthermore, it is nice to be able to calculate their (theoretical, exact) moments. I think it makes sense to have all this functionality in a common interface.
This is implemented in the statrs crate. There is some overlap with rand, sampling is implemented there as well, but for a lot more distributions.
I see the following options:
1. Only implement sampling in rand. Port the missing distributions from statrs to rand.
2. … rand, essentially duplicating statrs.
3. Remove the distributions from rand and suggest to use statrs instead.
What do you think?
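For illustration, the kind of common interface described above (density, distribution function, quantile function, exact moments, plus sampling) might look like the sketch below. The trait `Continuous` and the type `UniformReal` are hypothetical names chosen for this example; they are not taken from statrs or rand.

```rust
/// Hypothetical common interface for a continuous distribution.
pub trait Continuous {
    /// Probability density at `x`.
    fn pdf(&self, x: f64) -> f64;
    /// Cumulative distribution function at `x`.
    fn cdf(&self, x: f64) -> f64;
    /// Quantile function (inverse of `cdf`) for `p` in [0, 1].
    fn quantile(&self, p: f64) -> f64;
    /// Exact theoretical mean.
    fn mean(&self) -> f64;
    /// Exact theoretical variance.
    fn variance(&self) -> f64;

    /// Sampling by inverse transform: maps a uniform `u` in [0, 1)
    /// through the quantile function.
    fn sample_from_uniform(&self, u: f64) -> f64 {
        self.quantile(u)
    }
}

/// Continuous uniform distribution on [low, high].
pub struct UniformReal {
    low: f64,
    high: f64,
}

impl UniformReal {
    /// Requires finite bounds with low < high.
    pub fn new(low: f64, high: f64) -> Option<Self> {
        if low < high && low.is_finite() && high.is_finite() {
            Some(UniformReal { low, high })
        } else {
            None
        }
    }
}

impl Continuous for UniformReal {
    fn pdf(&self, x: f64) -> f64 {
        if x < self.low || x > self.high {
            0.0
        } else {
            1.0 / (self.high - self.low)
        }
    }

    fn cdf(&self, x: f64) -> f64 {
        if x <= self.low {
            0.0
        } else if x >= self.high {
            1.0
        } else {
            (x - self.low) / (self.high - self.low)
        }
    }

    fn quantile(&self, p: f64) -> f64 {
        self.low + p * (self.high - self.low)
    }

    fn mean(&self) -> f64 {
        0.5 * (self.low + self.high)
    }

    fn variance(&self) -> f64 {
        let w = self.high - self.low;
        w * w / 12.0
    }
}
```

Deriving sampling from the quantile function keeps everything behind one trait, but dedicated sampling algorithms are often faster than inverting the CDF, which is one reason a sampling-only implementation can differ from a statistics-oriented one.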