Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for statistical distributions #234

Open
Jim-215-Fisher opened this issue Sep 21, 2020 · 37 comments
Open

Proposal for statistical distributions #234

Jim-215-Fisher opened this issue Sep 21, 2020 · 37 comments
Labels
idea Proposition of an idea and opening an issue to discuss it in progress This proposal is being worked on topic: mathematics linear algebra, sparse matrices, special functions, FFT, random numbers, statistics, ...

Comments

@Jim-215-Fisher
Copy link
Contributor

Besides common descriptive statistics, we need standard modules for various continuous statistical distribution (e.g., gamma distribution) and discrete distribution (e.g., bernoulli distribution). These statistical distributions will be very useful to various computer simulation techniques.
Even though these functions are available in Scipy package for python, I think it is worthwhile to have in stdlib with pure Fortran. There are plenty of source codes on the net, we just need to convert them.

@epagone
Copy link

epagone commented Sep 21, 2020

Prior art.

  • GSL has a comprehensive section of random number distributions that should be available in Fortran though the fgsl module. This should be stable and well maintained.
  • Richard Chandler and Paul Northrop from UCL have developed a nice library in ancient FORTRAN77. Latest update has been in 2003. License is very "liberal" (they only ask to be cited) but the interface is not simple. Documentation is minimal but seems sufficient. No examples or tutorials are provided.
  • A library for normal distributions only, one more generic with a large number of distributions and another one focussed on cumulative density function are provided on Jacob Burkardt site. This is Fortran 90 compliant code. The license is GNU LGPL and documentation is adequate with examples.
  • This post in a blog is the translation in Modern Fortran of a public domain, simple, Julia library about different distributions. It's simple but documentation is almost non-existent.
  • The site of Jean-Pierre Moreau has a number of files available (apparently in a somewhat modern Fortran). However, some files appear to be taken from the book "Numerical Recipes" (which has a peculiar license) and it is not clear if the distributions are included.

@jvdp1
Copy link
Member

jvdp1 commented Sep 21, 2020

I agree with @Jim-215-Fisher 's comment. Thank you for the suggestion.

To add to the prior art mentioned by @epagone:

  • RANLIB: Fortran90 (license: LGPL) (and its F77 version). I believe this library is also used in Octave.

@Jim-215-Fisher
Copy link
Contributor Author

Should stdlib be based on GSL through fgsl?

@certik
Copy link
Member

certik commented Sep 22, 2020

@Jim-215-Fisher Unfortunately GSL is GPL licensed, which prohibits its inclusion into stdlib which is MIT licensed.

@ivan-pi
Copy link
Member

ivan-pi commented Sep 22, 2020

Don't forget the Software from Alan J. Miller: https://wp.csiro.au/alanmiller/random.html
(a mirror exists at https://jblevins.org/mirror/amiller/)

@arjenmarkus
Copy link
Member

arjenmarkus commented Sep 22, 2020 via email

@jvdp1
Copy link
Member

jvdp1 commented Sep 22, 2020

i agree with that: it is an extensive collection provided by a well-known professional in the field. It may require some reorganisation, as it is not organised in modules and the like, but it is certainly worth our while. Op di 22 sep. 2020 om 12:59 schreef Ivan [email protected]:

Don't forget the Software from Alan J. Miller: https://wp.csiro.au/alanmiller/random.html (a mirror exists at https://jblevins.org/mirror/amiller/) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#234 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YRZSJTTC675SELMLNWLSHB7Q3ANCNFSM4RUXIQFQ .

Indeed, I agree. However, it is unclear to me what the license is for these files.

@arjenmarkus
Copy link
Member

arjenmarkus commented Sep 22, 2020 via email

@Jim-215-Fisher
Copy link
Contributor Author

If it is public domain, then there is no copyrights.

@Jim-215-Fisher
Copy link
Contributor Author

In terms of license, as long as there are mathematic formulae, I think license issue should not be a concern.

Based on comments so far, it looks like majority agreed to have statistical distributions in stdlib. So the next question is what kind of API should be. We can either define a data type for each distribution and various type-bound procedures, or define various procedures for each type of distribution directly in the stdlib module. For example, normal distribution one could have:

module stdlib
type :: norm_dist_t
real :: x
real :: loc
real :: scale

contains
generic :: pdf => norm_dist_pdf
generic :: cdf => norm_dist_cdf
.
.
.
end type norm_dist_t

contains
{procedure definition}
.
.
.
end module stdlib

or directly in stdlib module:

module stdlib
function norm_dist_pdf(x, loc, scale)
end function norm_dist_pdf
.
.
.
end module stdlib

The first one is object oriented, data and procedure are encapsulated. The second one is more traditional like intrinsic function call familiar to users. Which one is better?

@arjenmarkus
Copy link
Member

arjenmarkus commented Sep 23, 2020 via email

@jvdp1
Copy link
Member

jvdp1 commented Sep 23, 2020

My preference is a procedural style.
This choice "procedural" vs "OOP" style has been already discussed in previous issues (I can't find them back), and, if I remember wel, procedural style should be prefered over OOP. Of course similar functionality in a OOP style could be always proposed on top of the procedural ones.

@Jim-215-Fisher
Copy link
Contributor Author

Yeh, I have checked the style guide and can't find any recommendation. Anyway, procedure style is the Fortran way. I will go ahead to implement a small module.

@certik
Copy link
Member

certik commented Sep 23, 2020

@Jim-215-Fisher we should put it into the style guide, great idea. Here are some links where it was discussed in the past:

@Jim-215-Fisher
Copy link
Contributor Author

@certik Thanks for the links. BTW, is there a table/list/link showing status of each stdlib module/proposal?

@certik
Copy link
Member

certik commented Sep 24, 2020 via email

@jrblevin
Copy link

It says that the code written by Alan Miller is in the public domain, I will ask Jason what this means exactly.

Arjen, thanks for writing to ask about the licensing. When I took over hosting Alan Miller's files no license was stated. So I asked him about that and he told me he intended his code to be public domain. So, his work can be incorporated into libraries, such as this one, with other licenses.

@arjenmarkus
Copy link
Member

arjenmarkus commented Sep 25, 2020 via email

@jvdp1
Copy link
Member

jvdp1 commented Sep 25, 2020

Thank oyu @jrblevin and @arjenmarkus for these explanations. These codes would be a good start for this proposal IMO.

@Jim-215-Fisher
Copy link
Contributor Author

In terms of Alan Miller's code, should we use his code/module directly, or reorganize it according to current style?

@arjenmarkus
Copy link
Member

arjenmarkus commented Sep 27, 2020 via email

@jvdp1
Copy link
Member

jvdp1 commented Sep 28, 2020

I agree with @arjenmarkus . We will need to re-organize the code (and probably re-write some parts) to be consistent with stdlib style guide. It is anyway a great start.

It seems that the people involved in this thread would agree to use Alan Miller 's code as a starting point, and that a procedural approach should be used (in agreement with several other discussions).
If I may propose, @Jim-215-Fisher would you be interested to start a proposal for API?

@jvdp1 jvdp1 added topic: mathematics linear algebra, sparse matrices, special functions, FFT, random numbers, statistics, ... idea Proposition of an idea and opening an issue to discuss it labels Sep 28, 2020
@Jim-215-Fisher
Copy link
Contributor Author

Yes, I am working on it, hopefully send PR soon.

@David-Duffy
Copy link

Hi.
A certain number of the routines from Alan Miller's site are from other sources - for example, the Applied Statistics routines have a funny licence (no commercial use), which tripped me up when I used them for an R package. The Rmath library might be a good model for what working statisticians want (esp in terms of numerics).

@Jim-215-Fisher
Copy link
Contributor Author

Jim-215-Fisher commented Oct 10, 2020

The initial implementation of this proposal is ready for PR. Anyone interested is welcome to review it. I have implemented three distributions at the moment, uniform, normal and binomial distribution random number generators and pdf, cdf functions. Will implement more distributions and functions after collecting comments and suggestions.
The code is in my branch https://github.com/Jim-215-Fisher/stdlib/tree/stats_distribution

@jvdp1
Copy link
Member

jvdp1 commented Oct 10, 2020

The initial implementation of this proposal is ready for PR. Anyone interested is welcome to review it. I have implemented three distributions at the moment, uniform, normal and binomial distribution random number generators and pdf, cdf functions. Will implement more distributions and functions after collecting comments and suggestions.
The code is in my branch https://github.com/Jim-215-Fisher/stdlib/tree/stats_distribution

Thank you @Jim-215-Fisher . Can you open a PR? It will be easier for reviewing and discussing the API and code.

@esterjo
Copy link

esterjo commented Dec 17, 2020

I'd like to call out this RNG library, which I use regularly. There is a Fortran wrapper to the version producing 32bit random values online but I think it's worth implementing the whole thing. I use this library with C++. There the API is that one first instantiates an RNG object (say pcg), and a distribution object (say normal_dist) and calls like normal_dist(pcg) return a normal variate using this particular RNG. I find this API very smooth and flexible. Allows switching out RNGs. But I have no problem with a strictly procedural implementation either.

Also, looking at some of the implementations in the links, I see that a few of them generate 2 uniform random numbers to return only 1 variate. See for example, the Box-Muller implementation in GSL, where they explain their choice. This is may be a bit wasteful as two covariates might be returned instead of 1, when the algorithm produces 2 independent covariates. Saving one of the covariates computed using a SAVE attribute or simply filling entire arrays at once may be a better implementation.

@ivan-pi
Copy link
Member

ivan-pi commented Dec 17, 2020

I'd like to call out this RNG library, which I use regularly. There is a Fortran wrapper to the version producing 32bit random values online but I think it's worth implementing the whole thing. I use this library with C++. There the API is that one first instantiates an RNG object (say pcg), and a distribution object (say normal_dist) and calls like normal_dist(pcg) return a normal variate using this particular RNG. I find this API very smooth and flexible. Allows switching out RNGs. But I have no problem with a strictly procedural implementation either.

I was intrigued by the simplicity of this generator and attempted a quick Fortran version. I haven't verified it's correctness yet, but the timings are promising (roughly the same order as the intrinsic random_number subroutines in gfortran and Intel fortran). I have compared the bit sequences of the random integers from this version, and the Fortran wrapper of the C version, and they are equal. The conversion from signed integers to double is still problematic. Portability might also be an issue.

I would suggest preserving the discussion of a random number generator object under issue #135.

@esterjo
Copy link

esterjo commented Dec 20, 2020

Just for reference here is libstdc++'s implementation of the gaussian variate generator, which is a very standard implementation of Marsaglia's algorithm with each call saving one of the two variates generated for the next call: https://gcc.gnu.org/onlinedocs/libstdc++/latest-doxygen/a15735_source.html#l01783

@David-Duffy
Copy link

Just looking at
Thomas, D. B., Luk. W., Leong, P. H. W., and Villasenor, J. D. 2007. Gaussian random number generators. ACM Comput. Surv. 39, 4, Article 11 (October 2007)
http://doi.acm.org/10.1145/1287620.1287622

In my code, I use the ratio of uniforms (2 uniforms per single gaussian) algo, but this review suggests a Ziggurat algorithm as
accurate (ie especially producing the correct number of extreme deviates), uncorrelated, and 5 times faster than Box-Muller). I have a vague memory of Fortran code for this somewhere.

@David-Duffy
Copy link

There is a useful discussion of implementing the ziggurat gaussian RNG for numpy here:
numpy/numpy#2047

@epagone
Copy link

epagone commented Dec 21, 2020

In my code, I use the ratio of uniforms (2 uniforms per single gaussian) algo, but this review suggests a Ziggurat algorithm as
accurate (ie especially producing the correct number of extreme deviates), uncorrelated, and 5 times faster than Box-Muller). I have a vague memory of Fortran code for this somewhere.

Maybe this one?

@esterjo
Copy link

esterjo commented Dec 22, 2020

@David-Duffy you're absolutely right. It appears Ziggurat is the fastest option for gaussians, and doesn't seem to be worst statistically from what I've been reading.

There's a paper on a generalized version to unimodal distributions with unbounded support. According to the authors performance is better than, for example, what is used by GCC's implementation of random.h or in Boost's random.h. The paper gives thorough pseudo-code too, which I appreciate.

EDIT: However, on GPU the reverse seems to hold, because "...in a GPU, the performance of the slow route will apply to all threads in a warp, even if only one thread uses the route. If the warp size is 32 and the probability of taking the slow route is 2 percent, then the probability of any warp taking the slow route is (1 - 0.02)^32, which is 47 percent! So, because of thread batching, the assumptions designed into the ziggurat are violated, and the performance advantage is destroyed."

@David-Duffy
Copy link

"a generalized version" - yes, I had been looking at their C++ code, but dreading how how long it would take to fully test a Fortran port. One reason Boost etc are using time-tested algorithms rather than bleeding edge,

@rryley
Copy link

rryley commented Jan 15, 2021

The R documents provide a listing of the available probability distributions. Considering another, independent request for better interop with R, it might save some effort to follow the R conventions as much as possible.

Here is the list: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Probability-distributions

Is there another discussion on what should be in the statistical methods portion of the Fortran Standard Library? The GNU Scientific Library stats section is limited. I also think the R stats package provides a good model of what should be included, but that might be too ambitious. If you have R installed, open up the REPL and type:

library(help="stats") for a list of what is available.

@jvdp1
Copy link
Member

jvdp1 commented Jan 15, 2021

Is there another discussion on what should be in the statistical methods portion of the Fortran Standard Library?

@rryley Different PRs were already submitted regarding probability distributions (#271, #272, #273, #26, #278, #286, ...). It seems some of them cover the R probability distributions.

@peteroupc
Copy link

peteroupc commented Jan 22, 2021

I want to draw attention in this issue thread to algorithms for geometric and binomial variates published from 2013 through 2015, by Bringmann and colleagues as well as Farach-Colton/Tsai, which were designed especially for small p parameters (geometric) or large numbers of trials (binomial).

See my notes on samplers for both distributions.

REFERENCES:

  • Binomial: K. Bringmann, F. Kuhn, et al., “Internal DLA: Efficient Simulation of a Physical Growth Model.” In: Proc. 41st International Colloquium on Automata, Languages, and Programming (ICALP'14), 2014.
  • Binomial: Farach-Colton, M. and Tsai, M.T., 2015. Exact sublinear binomial sampling. Algorithmica 73(4), pp. 637-651.
  • Geometric: Bringmann, K., and Friedrich, T., 2013, July. Exact and efficient generation of geometric random variates and random graphs, in International Colloquium on Automata, Languages, and Programming (pp. 267-278).

@14NGiestas 14NGiestas added the in progress This proposal is being worked on label May 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
idea Proposition of an idea and opening an issue to discuss it in progress This proposal is being worked on topic: mathematics linear algebra, sparse matrices, special functions, FFT, random numbers, statistics, ...
Projects
None yet
Development

No branches or pull requests