Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added optional support for using stats library instead of GSL #222

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

adamfowleruk
Copy link

PR for discussion and visibility in the group. Some formatting and style changes may be required. Please do let me know.

(Note: Although the feature name is 'armadillo', this change does not introduce that as a dependency at this time as it wasn't required for the functionality used by the simulation.)

I have abstracted the statistical library used into its own header and implementation files controlled by the Makefile options. This allows selectively choosing stats libraries as needed. This could be for licensing needs, or the performance characteristics of certain statistical functions, including support for multi-core processors in future.

On licensing

GSL is GNU GPL v2 licensed, restricting who can use OpenABM-Covid19. The changes in this PR allow it to be dual licensed as GPLv2 or Apache-2.0 depending on the compiler options chosen. This is accomplished by optionally choosing the stats library instead of the GSL library as compilation options. They also potentially allow other statistical libraries to be used easily in future too. (I have not changed the license wording to a dual license in this PR, as this is a secondary benefit).

On the stats library

The stats library can be found here: https://github.com/kthohr/stats
This has a dependency upon the GCEM library also found here: https://github.com/kthohr/gcem
Once these are extracted into the OpenABM-Covid19 folder, they can be used.

README.md has been updated to reflect this new compilation option.

A GSL_COMPAT option is also provided so that the marsenne twister used is a 32bit variety with the same seed as GSL, to allow for direct comparison of results. Note: Not every function in the stats library works exactly the same as their GSL equivalent even with the GSL_COMPAT flag. These differences are noted in comments in random_stats.cpp.

Results

Performance results on my test machine using the test_performance.sh script showed:-

  • Existing GSL build with previous options (-O0) took 263 seconds
  • Stats library inlined using 32bit RNG and GSL_COMPAT settings (-O2) took 202 seconds
  • Stats library inlined using 64 bit RNG, no GSL COMPAT, -O2 with -fopenmp with a single change to model.c (commented out in this commit) took 178 seconds (A 32% saving)

Future benefits this commit would also allow

With further changes a greater degree of parallelisation will be possible using this PR as a basis. The stats library has built in support for OpenMP allowing multi threaded execution over vectors and matrices with minimal work as its a compiler feature.

The stats library allows the use of a 64 bit marsenne twister rather than 32 bit. This is potentially an issue given its extensive invocation in a large network relationship simulation like OpenABM-Covid19. (The performance test example uses the random number generator 2.17 billion times for a population of 1 million. 32 bits provides a space of 4.23 billion.)

The stats library also allows the use of other vector and matrix libraries such as Armadillo, Blaze, or Eigen. This vector approach may also provide performance to R and Jupyter notebook integrations in future too.

Separate to the use of the stats library, I have provided implementations of ran_discrete and ran_shuffle that are at least as fair as those in GSL, but more performant given how OpenABM-Covid19 uses GSL. My shuffle implementation executes in 46% less time, and my discrete implementation executes in 90% less time. I have not replaced the GSL calls with these, but we may wish to do so in future.

Abstracted GSL out to allow choice of scientific math libraries
- GSL is GPL-V3, Armadillo is Apache-2.0 - this change gives choice to downstream OpenABM-Covid19 library users with regards to licensing their final applications
- Created general random.h header file with non-library specific structs and functions
- Created random_gsl.c/h to link to existing use of GSL library
- Altered random.h and Makefile to check for USE_ARMA being set, but defaults to GSL
- Added in dummy random_arma.c/h files that currently don't call Armadillo
- Checked that compilation works with and without USE_ARMA set
- Known issue: Need to flesh out actual Armadillo functions to replace GSL
Signed-off-by: Adam Fowler <[email protected]>
Stats library as replacement for GSL compiling
- Only some functions implemented so far
- Successfully building combined C++ and C binary
- Added stricter void* casts to satisfy g++ compiler
- Still some other functions to implement
Signed-off-by: Adam Fowler <[email protected]>
Stats library performs equivalent to GSL by default
- performance test shell script sample results are almost the same
- stats library is a little slower currently (not tuned yet)
- examples provided to generate test output for both gsl and stats libraries for the same input data for comparison
- R script provided to compare their test completion times
- gitignore updated for additional performance and compilation temporary files
- Current stats implementation (GSL compatibility mode) requires 1 line source change to the stats library options hpp file (to use mt19937 and not mt19937_64)
Signed-off-by: Adam Fowler <[email protected]>
- Typos fixed
- Optimisation level O2 specified for GSL examples too
- GSL_COMPAT option added (not used by default) for simpler compilation when using the stats library
- README.md updated for compilation options and additional examples
Signed-off-by: Adam Fowler <[email protected]>
Default to 64 bit RNG in stats library
Maintain GSL compatibility flag
Added ran_shuffle test which I had missed
Added valgrind profiling option to performance shell script
Signed-off-by: Adam Fowler <[email protected]>
Defaults to 64 bit marsenne twister for ease of integration when using stats library
Additional example showing 'classic' struct linked list versus value vectors performance with OpenMP multi thread processing
Code added to model.c, but commented out, showing direct use of stats library's OpenMP support to save 32% of the test simulations processing time.
Signed-off-by: Adam Fowler <[email protected]>
Fixed python tests compilation
Cannot reproduce tests on my machine due to missing data_tests folder
Signed-off-by: Adam Fowler <[email protected]>
@adamfowleruk
Copy link
Author

@roberthinch I've ensured the python swig wrapper compiles now, but the laptop I have here doesn't have Docker and so for some reason the data_tests files seem to be missing. Please can you test again and let me know the output. I also think my renaming of network.network to network.next (to remove C++ compiler warnings when used with the stats library) may have broken the R bindings, but I cannot find the R wrapper functions at fault. Any hints welcome.
I've limited the R and python bindings to GSL only for now so the tests at least pass.
Thanks.

Various small changes post merge
- Added explicit casts required by c++ compiler
   - Fixed two locations where wrong calloc type used (double* not double, and int* not long*)
- Added Makefile nice exit when trying to build GSL version on Mac (as it fails)
- Replaced subprocess in python tests with direct use of Model.run
- Modified statistical allclose calls to not divide by sqrt(N) for absolute difference value
- Some tests still failing: 
   - disease_dynamics: test_total_infectious_rate_zero
   - disease_dynamics: test_multi_strain_disease_outcome_proportions
   - disease_dynamics: test_multi_strain_infectious_factor
   - concordance: all fail except test_zero_output
   - segfault in infection_dynamics: test_transmission_pairs
- Dockerfile for GSL has stopped linking since merging strain code changes
Signed-off-by: Adam Fowler <[email protected]>
Dockerfile was missing required autoconf and ./configure calls for GSL
Still some python test failures:-
- disease_dynamics: test_total_infectious_rate_zero
- disease_dynamics: test_multi_strain_infectious_factor
- segfault with concordance: test_incidence_timeseries_individual
Signed-off-by: Adam Fowler <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant