Proposal for statistical distributions #234
Comments
Prior art.
|
I agree with @Jim-215-Fisher's comment. Thank you for the suggestion. To add to the prior art mentioned by @epagone:
|
Should stdlib be based on GSL through fgsl? |
@Jim-215-Fisher Unfortunately |
Don't forget the Software from Alan J. Miller: https://wp.csiro.au/alanmiller/random.html |
I agree with that: it is an extensive collection provided by a well-known
professional in the field. It may require some reorganisation, as it is not
organised in modules and the like, but it is certainly worth our while.
On Tue 22 Sep 2020 at 12:59, Ivan wrote:
… Don't forget the Software from Alan J. Miller:
https://wp.csiro.au/alanmiller/random.html
(a mirror exists at https://jblevins.org/mirror/amiller/)
|
Indeed, I agree. However, it is unclear to me what the license is for these files. |
It says that the code written by Alan Miller is in the public domain. I
will ask Jason what this means exactly.
|
If it is public domain, then there is no copyright. |
In terms of licensing, as long as these are mathematical formulae, I think the license should not be a concern. Based on the comments so far, it looks like the majority agree to have statistical distributions in stdlib. So the next question is what the API should look like. We can either define a derived type for each distribution with various type-bound procedures, or define various procedures for each distribution directly in the stdlib module. For example, for the normal distribution one could have a type norm_dist_t (components x, loc and scale) with type-bound pdf and cdf procedures, or a plain function such as norm_dist_pdf(x, loc, scale) defined directly in the stdlib module. The first is object oriented: data and procedures are encapsulated. The second is more traditional, like the intrinsic function calls familiar to users. Which one is better? |
In these matters it is not likely that there is a "best" solution. It is a
matter of taste, I would say. But I am leaning towards a procedural style
in this. Typical use of this functionality:
- I have a time series and I want to see if it can be fitted to a normal
or log-normal distribution. In that case, I would pass the time series to
some function that uses a relevant statistical test (say Lilliefors) to
determine whether the fit is good enough. Rather than setting up an object
that takes the confidence level and perhaps a few other data, why not use a
simple function?
- I have determined the mean and standard deviation of my data set and
new data come in. Do they follow the same distribution? Again a function
seems more appropriate.
I do see the attractiveness of an object-oriented interface, especially if
you want to examine several different data sets against the same
distribution, but it feels indirect in many cases. So I would prefer a
functional/procedural interface, at least for the moment. An OO interface
can be added later.
On Wed 23 Sep 2020 at 00:56, Jing wrote:
… We can either define a data type for each distribution and
various type-bound procedures, or define various procedures for each type
of distribution directly in the stdlib module. For example, for the normal
distribution one could have:
module stdlib
    implicit none
    type :: norm_dist_t
        real :: x
        real :: loc
        real :: scale
    contains
        ! specific bindings, implemented below the module's "contains"
        procedure :: norm_dist_pdf
        procedure :: norm_dist_cdf
        generic :: pdf => norm_dist_pdf
        generic :: cdf => norm_dist_cdf
        ! ...
    end type norm_dist_t
contains
    ! {procedure definitions}
    ! ...
end module stdlib
or directly in the stdlib module:
module stdlib
    implicit none
contains
    function norm_dist_pdf(x, loc, scale)
        ! ...
    end function norm_dist_pdf
    ! ...
end module stdlib
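To make the procedural option concrete, here is a minimal sketch of what such functions could look like for the normal distribution. The module name norm_dist_sketch, the double-precision kind and the elemental attribute are assumptions made for illustration, not the API that was adopted in stdlib. The cdf relies on the Fortran 2008 erf intrinsic.

```fortran
module norm_dist_sketch
    ! Hypothetical module name; illustrative only, not the stdlib API.
    implicit none
    private
    public :: norm_dist_pdf, norm_dist_cdf

    integer, parameter :: dp = kind(1.0d0)
    real(dp), parameter :: pi = 3.141592653589793_dp

contains

    ! Density of a normal distribution with mean loc and standard
    ! deviation scale, evaluated at x. No check is made that scale > 0.
    elemental function norm_dist_pdf(x, loc, scale) result(res)
        real(dp), intent(in) :: x, loc, scale
        real(dp) :: res
        real(dp) :: z
        z = (x - loc) / scale
        res = exp(-0.5_dp * z * z) / (scale * sqrt(2.0_dp * pi))
    end function norm_dist_pdf

    ! Cumulative distribution function, written in terms of the
    ! Fortran 2008 intrinsic erf.
    elemental function norm_dist_cdf(x, loc, scale) result(res)
        real(dp), intent(in) :: x, loc, scale
        real(dp) :: res
        res = 0.5_dp * (1.0_dp + erf((x - loc) / (scale * sqrt(2.0_dp))))
    end function norm_dist_cdf

end module norm_dist_sketch
```

Because both functions are elemental, a call such as norm_dist_pdf(x, 0.0d0, 1.0d0) works the same way for a scalar x and for an array x, which is close to how intrinsic functions behave.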
|
My preference is a procedural style. |
Yes, I have checked the style guide and can't find any recommendation. Anyway, the procedural style is the Fortran way. I will go ahead and implement a small module. |
@Jim-215-Fisher we should put it into the style guide, great idea. Here are some links where it was discussed in the past: |
@certik Thanks for the links. BTW, is there a table/list/link showing status of each stdlib module/proposal? |
The latest status is in the open issue and PR for a given proposal. We don't have a nice table summarizing it.
|
Arjen, thanks for writing to ask about the licensing. When I took over hosting Alan Miller's files no license was stated. So I asked him about that and he told me he intended his code to be public domain. So, his work can be incorporated into libraries, such as this one, with other licenses. |
Hi Jason,
great, this work deserves widespread use. And having it available via the
Fortran Wiki and hopefully at some point via the standard library (or
something similar) will make it much easier.
|
Thank you @jrblevin and @arjenmarkus for these explanations. These codes would be a good start for this proposal IMO. |
In terms of Alan Miller's code, should we use his code/module directly, or reorganize it according to current style? |
I would say we need to reorganize it - make sure things are consistent
within the standard library. That will help people to understand how to use
it and to avoid certain types of mistakes.
|
I agree with @arjenmarkus. We will need to reorganize the code (and probably rewrite some parts) to be consistent with the stdlib style guide. It is a great start anyway. It seems that the people involved in this thread agree to use Alan Miller's code as a starting point, and that a procedural approach should be used (in agreement with several other discussions). |
Yes, I am working on it and hope to send a PR soon. |
Hi. |
The initial implementation of this proposal is ready for a PR. Anyone interested is welcome to review it. So far I have implemented three distributions (uniform, normal and binomial), each with a random number generator and pdf and cdf functions. I will implement more distributions and functions after collecting comments and suggestions. |
Thank you @Jim-215-Fisher . Can you open a PR? It will be easier for reviewing and discussing the API and code. |
I'd like to call out this RNG library, which I use regularly. There is a Fortran wrapper online for the version producing 32-bit random values, but I think it's worth implementing the whole thing. I use this library with C++. There the API is that one first instantiates an RNG object (say pcg) and a distribution object (say normal_dist), and calls like normal_dist(pcg) return a normal variate using this particular RNG. I find this API very smooth and flexible. It allows switching out RNGs. But I have no problem with a strictly procedural implementation either. Also, looking at some of the implementations in the links, I see that a few of them generate 2 uniform random numbers to return only 1 variate. See, for example, the Box-Muller implementation in GSL, where they explain their choice. This may be a bit wasteful, as two variates could be returned instead of one when the algorithm produces two independent variates. Saving one of the computed variates in a variable with the SAVE attribute, or simply filling entire arrays at once, may be a better implementation. |
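As an illustration of the "fill entire arrays at once" idea, here is a minimal sketch of a Box-Muller transform that keeps both variates from every pair of uniforms. The module name box_muller_sketch and the routine name fill_normal are hypothetical, and the intrinsic random_number stands in for whichever uniform generator would actually be used.

```fortran
module box_muller_sketch
    ! Hypothetical names; a sketch only, not the GSL or stdlib code.
    implicit none
    private
    public :: fill_normal

    integer, parameter :: dp = kind(1.0d0)
    real(dp), parameter :: two_pi = 6.283185307179586_dp

contains

    ! Fill x with standard normal variates. Each pair of uniforms
    ! yields two independent variates, and both are stored.
    subroutine fill_normal(x)
        real(dp), intent(out) :: x(:)
        real(dp) :: u(2), r, theta
        integer :: i, n

        n = size(x)
        do i = 1, n, 2
            call random_number(u)
            u(1) = 1.0_dp - u(1)          ! map [0,1) to (0,1] so log is finite
            r = sqrt(-2.0_dp * log(u(1)))
            theta = two_pi * u(2)
            x(i) = r * cos(theta)
            ! For odd n the last sine variate has no slot and is dropped.
            if (i < n) x(i + 1) = r * sin(theta)
        end do
    end subroutine fill_normal

end module box_muller_sketch
```

A caller would declare an array, for example real(kind(1.0d0)) :: x(1000), and call fill_normal(x). Only an odd-sized array wastes a variate, and then only one.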
I was intrigued by the simplicity of this generator and attempted a quick Fortran version. I haven't verified its correctness yet, but the timings are promising (roughly the same order as the intrinsic […]). I would suggest preserving the discussion of a random number generator object under issue #135. |
Just for reference here is libstdc++'s implementation of the gaussian variate generator, which is a very standard implementation of Marsaglia's algorithm with each call saving one of the two variates generated for the next call: https://gcc.gnu.org/onlinedocs/libstdc++/latest-doxygen/a15735_source.html#l01783 |
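The save-one-variate idea can be sketched in Fortran as follows. This is a plain version of Marsaglia's polar method written for illustration, not a port of the libstdc++ code. The names polar_sketch and normal_variate are hypothetical, and random_number stands in for the underlying uniform generator.

```fortran
module polar_sketch
    ! Hypothetical names; illustrative sketch of Marsaglia's polar method.
    implicit none
    private
    public :: normal_variate

    integer, parameter :: dp = kind(1.0d0)

contains

    ! Return one standard normal variate per call. Each accepted pair of
    ! uniforms yields two variates; the second is kept for the next call.
    function normal_variate() result(z)
        real(dp) :: z
        real(dp), save :: spare = 0.0_dp
        logical, save :: have_spare = .false.
        real(dp) :: u(2), s, f

        if (have_spare) then
            z = spare
            have_spare = .false.
            return
        end if

        ! Rejection step: draw points in the square [-1,1)^2 until one
        ! falls strictly inside the unit circle (and is not the origin).
        do
            call random_number(u)
            u = 2.0_dp * u - 1.0_dp
            s = u(1)**2 + u(2)**2
            if (s > 0.0_dp .and. s < 1.0_dp) exit
        end do

        f = sqrt(-2.0_dp * log(s) / s)
        z = u(1) * f
        spare = u(2) * f
        have_spare = .true.
    end function normal_variate

end module polar_sketch
```

The saved state makes the function impure and not thread-safe as written, which is one argument for the array-filling alternative or for carrying the state in an explicit generator object.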
Just looking at In my code, I use the ratio of uniforms (2 uniforms per single gaussian) algo, but this review suggests a Ziggurat algorithm as |
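For readers unfamiliar with the ratio-of-uniforms method mentioned here, a minimal sketch for the standard normal follows. It is the basic textbook form without the usual quick-accept and quick-reject bounds, and it is not the commenter's code. The names rou_sketch and normal_rou are hypothetical. Each attempt consumes two uniforms and is accepted with probability sqrt(pi*e)/4, roughly 0.73, so the average cost is somewhat more than two uniforms per variate.

```fortran
module rou_sketch
    ! Hypothetical names; basic ratio-of-uniforms sketch for N(0,1).
    implicit none
    private
    public :: normal_rou

    integer, parameter :: dp = kind(1.0d0)
    real(dp), parameter :: vmax = 0.857763884960707_dp   ! sqrt(2/e)

contains

    ! Draw (u, v) uniformly from the rectangle (0,1] x [-vmax, vmax) and
    ! accept x = v/u when u**2 <= exp(-x**2/2), i.e. x**2 <= -4*log(u).
    function normal_rou() result(x)
        real(dp) :: x
        real(dp) :: r(2), u, v

        do
            call random_number(r)
            u = 1.0_dp - r(1)                    ! u in (0, 1], avoids u = 0
            v = vmax * (2.0_dp * r(2) - 1.0_dp)  ! v in [-vmax, vmax)
            x = v / u
            if (x * x <= -4.0_dp * log(u)) exit
        end do
    end function normal_rou

end module rou_sketch
```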
There is a useful discussion of implementing the ziggurat gaussian RNG for numpy here: |
Maybe this one? |
@David-Duffy you're absolutely right. It appears Ziggurat is the fastest option for gaussians, and doesn't seem to be worse statistically from what I've been reading. There's a paper on a version generalized to unimodal distributions with unbounded support. According to the authors, performance is better than, for example, what is used by GCC's implementation of random.h or in Boost's random.h. The paper gives thorough pseudo-code too, which I appreciate. EDIT: However, on GPU the reverse seems to hold, because "...in a GPU, the performance of the slow route will apply to all threads in a warp, even if only one thread uses the route. If the warp size is 32 and the probability of taking the slow route is 2 percent, then the probability of any warp taking the slow route is 1 - (1 - 0.02)^32, which is 47 percent! So, because of thread batching, the assumptions designed into the ziggurat are violated, and the performance advantage is destroyed." |
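The warp arithmetic in that quotation can be checked with a few lines. If each of w threads independently takes the slow route with probability p, the chance that at least one thread in the warp does so is 1 - (1 - p)^w. The module and function names below are hypothetical and exist only for this check.

```fortran
module warp_divergence_sketch
    ! Hypothetical names; a small check of the probability quoted above.
    implicit none
    private
    public :: prob_warp_slow

    integer, parameter :: dp = kind(1.0d0)

contains

    ! Probability that at least one of w independent threads takes the
    ! slow route, when each thread does so with probability p.
    pure function prob_warp_slow(p, w) result(q)
        real(dp), intent(in) :: p
        integer, intent(in) :: w
        real(dp) :: q
        q = 1.0_dp - (1.0_dp - p)**w
    end function prob_warp_slow

end module warp_divergence_sketch
```

With p = 0.02 and w = 32 this evaluates to roughly 0.476, consistent with the "47 percent" in the quotation.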
"a generalized version" - yes, I had been looking at their C++ code, but dreading how long it would take to fully test a Fortran port. One reason Boost etc are using time-tested algorithms rather than bleeding edge,
The R documents provide a listing of the available probability distributions. Considering another, independent request for better interop with R, it might save some effort to follow the R conventions as much as possible. Here is the list: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Probability-distributions Is there another discussion on what should be in the statistical methods portion of the Fortran Standard Library? The GNU Scientific Library stats section is limited. I also think the R stats package provides a good model of what should be included, but that might be too ambitious. If you have R installed, open up the REPL and type:
|
I want to draw attention in this issue thread to algorithms for geometric and binomial variates published from 2013 through 2015, by Bringmann and colleagues as well as Farach-Colton/Tsai, which were designed especially for small p parameters (geometric) or large numbers of trials (binomial). See my notes on samplers for both distributions. REFERENCES:
|
Besides common descriptive statistics, we need standard modules for various continuous statistical distributions (e.g., the gamma distribution) and discrete distributions (e.g., the Bernoulli distribution). These statistical distributions will be very useful for various computer simulation techniques.
Even though these functions are available in the SciPy package for Python, I think it is worthwhile to have them in stdlib in pure Fortran. There is plenty of source code on the net; we just need to convert it.