[joss] Paper Edits (#80)
* updated text as recommended by @sneakers-the-rat in #78

* fixed an awkward wording

* updated plot

* update markersize, some text

* add dask scipy ref

* whoops, some license leftovers

* updated fig 1

* strong scaling comment

* another figure iteration

* Update LICENSE (#81)

* Update LICENSE

* bump version

* date, and fix dask double cite

* n_i -> n_u

* another update
timothyas authored Nov 1, 2024
1 parent 2b0e8f6 commit 71c280c
Showing 7 changed files with 173 additions and 53 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -74,6 +74,6 @@ Report bugs, suggest features, or view the source code

## License and Copyright

xesn is licensed under version 3 of the GNU Lesser General Public License.
xesn is licensed under the Apache-2.0 License.

Development occurs on GitHub at <https://github.com/timothyas/xesn>.
10 changes: 5 additions & 5 deletions codemeta.json
@@ -6,12 +6,12 @@
],
"identifier": "",
"codeRepository": "https://github.com/timothyas/xesn",
"datePublished": "2023-12-15",
"dateModified": "2023-12-15",
"datePublished": "2024-11-01",
"dateModified": "2024-11-01",
"dateCreated": "2023-12-15",
"description": "Echo State Networks powered by xarray and dask.",
"keywords": "echo state networks, machine learning, weather, climate, forecasting",
"license": "LGPL v3",
"license": "Apache-2.0",
"title": "xesn",
"version": "v0.1.2"
}
"version": "v0.2.0"
}
27 changes: 27 additions & 0 deletions docs/references.bib
@@ -1795,3 +1795,30 @@ @software{jax2018github
version = {0.3.13},
year = {2018},
}

@inproceedings{amdahl_1967,
author = {Amdahl, Gene M.},
title = {Validity of the single processor approach to achieving large scale computing capabilities},
year = {1967},
isbn = {9781450378956},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1465482.1465560},
doi = {10.1145/1465482.1465560},
abstract = {For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.},
booktitle = {Proceedings of the April 18-20, 1967, Spring Joint Computer Conference},
pages = {483–485},
numpages = {3},
location = {Atlantic City, New Jersey},
series = {AFIPS '67 (Spring)}
}

@InProceedings{rocklin_scipy_2015,
author = { {M}atthew {R}ocklin },
title = { {D}ask: {P}arallel {C}omputation with {B}locked algorithms and {T}ask {S}cheduling },
booktitle = { {P}roceedings of the 14th {P}ython in {S}cience {C}onference },
pages = { 126 - 132 },
year = { 2015 },
editor = { {K}athryn {H}uff and {J}ames {B}ergstra },
doi = { 10.25080/Majora-7b98e3ed-013 }
}
71 changes: 46 additions & 25 deletions paper.md
@@ -32,7 +32,7 @@ affiliations:
index: 4
- name: Pacific Northwest National Laboratory, Richland, WA, USA
index: 5
date: 15 December 2023
date: 1 November 2024
bibliography: docs/references.bib

---
@@ -67,7 +67,7 @@ signal processing [@jaeger_harnessing_2004].
# Statement of Need

ESNs are a conceptually simple Recurrent Neural Network architecture,
leading many scientists who use ESNs implement them from scratch.
leading many scientists who use ESNs to implement them from scratch.
While this approach can work well for low dimensional problems, the situation
quickly becomes more complicated when:

@@ -78,7 +78,7 @@ quickly becomes more complicated when:

`Xesn` is designed to address all of these points.
Additionally, while there are some design flexibilities for the ESN
architectures, the overall interface is streamlined, based on the parameter and
architectures, the overall interface is streamlined based on the parameter and
design impact study shown by @platt_systematic_2022.

## GPU Deployment
@@ -101,11 +101,11 @@ adding invariant metrics to the loss function, like the leading Lyapunov
exponent or the Kinetic Energy spectrum, improved generalizability.
As a generic implementation of these metrics,
`xesn` offers the capability to constrain the
system's Power Spectral Density during parameter optimization, in addition to a
system's Power Spectral Density during parameter optimization in addition to a
more traditional mean squared error loss function.
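
To make the idea concrete, below is a minimal, hypothetical sketch of a loss that combines a mean squared error term with a penalty on the (log) power spectral density. The function names, the log-space comparison, and the weighting factor `beta` are illustrative choices for this sketch, not `xesn`'s actual implementation.

```python
import numpy as np

def psd(x):
    """One-sided power spectral density of a 1D signal (unnormalized, for illustration)."""
    return np.abs(np.fft.rfft(x)) ** 2

def combined_loss(prediction, target, beta=1.0):
    """Mean squared error plus a spectral penalty, weighted by beta."""
    mse = np.mean((prediction - target) ** 2)
    spectral_error = np.mean(
        (np.log(psd(prediction) + 1e-12) - np.log(psd(target) + 1e-12)) ** 2
    )
    return mse + beta * spectral_error

# Tiny usage example with synthetic data
rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 20 * np.pi, 1000))
noisy = truth + 0.1 * rng.standard_normal(truth.size)
print(combined_loss(noisy, truth))
```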

`Xesn` enables parameter optimization by integrating with the Surrogate Modeling
Toolbox [@bouhlel_scalable_2020], which has a Bayesian Optimization
Toolbox [@bouhlel_scalable_2020], which has a Bayesian optimization
implementation.
`Xesn` provides a simple interface so that the user can specify all of the
settings for training, parameter optimization, and testing with a single YAML
@@ -128,10 +128,10 @@ as the basis for `xesn`.
`Xesn` enables prediction for multi-dimensional systems by integrating its high
level operations with `xarray` [@hoyer_xarray_2017].
As with `xarray`, users refer to dimensions based on their named axes.
`Xesn` parallelizes the core array based operations by using `dask` [@dask_2016]
to map them across available resources, which can include a multi-threaded
environment on a laptop or single node, or a distributed computing resource
such as traditional on-premises HPC or in the cloud.
`Xesn` parallelizes the core array based operations by using `dask`
[@dask_2016; @rocklin_scipy_2015]
to map them across available resources, from a laptop to a distributed HPC or
cloud cluster.
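
As an illustration of the `xarray`/`dask` pattern described here, the sketch below uses plain `xarray` and `dask`, not `xesn`'s own API; the dimension names and array sizes are made up for the example.

```python
import numpy as np
import xarray as xr

# Toy multi-dimensional data with named axes; sizes are illustrative only.
data = xr.DataArray(
    np.random.standard_normal((256, 10_000)),
    dims=("x", "time"),
)

# Chunking with dask splits the array into blocks along the named "x" axis;
# subsequent operations are mapped lazily over those blocks.
chunked = data.chunk({"x": 64})
climatology = chunked.mean(dim="time")   # still lazy
print(climatology.compute().shape)       # (256,)
```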


## Existing Reservoir Computing Software
@@ -145,14 +145,13 @@ underlying the reservoir, and allowing for delayed connections.
On the other hand, `xesn` is focused specifically on implementing ESN
architectures that can scale to multi-dimensional forecasting tasks.
Additionally, while `ReservoirPy` enables hyperparameter grid search capabilities
via Hyperopt [@hyperopt], `xesn` enables Bayesian Optimization as noted above.
via Hyperopt [@hyperopt], `xesn` enables Bayesian optimization as noted above.

Finally, we note the code base used by [@arcomano_machine_2020;@arcomano_hybrid_2022;@arcomano_hybrid_2023],
Another ESN implementation is that of [@arcomano_machine_2020;@arcomano_hybrid_2022;@arcomano_hybrid_2023],
available at [@arcomano_code].
The code implements ESNs in Fortran, and focuses on using ESNs for hybrid physics-ML modeling.



# Computational Performance

Here we present brief scaling results in order to show
@@ -162,29 +161,40 @@ scales with increasing hidden and input dimensions.
Additionally, we provide some baseline results to serve as guidance when
configuring `dask` to use the parallelized
[`xesn.LazyESN`](https://xesn.readthedocs.io/en/latest/generated/xesn.LazyESN.html) architecture.
The scripts used to set up, execute, and visualize these scaling tests can be
found
[here](https://github.com/timothyas/xesn/tree/1524713149efa38a0fd52ecdeb32ca5aacb62693/scaling).
For methodological details on these two architectures, please refer to
[this section](https://xesn.readthedocs.io/en/latest/methods.html) of the documentation.
[the methods section of the documentation](https://xesn.readthedocs.io/en/latest/methods.html).

## Standard (Eager) ESN Performance

![Walltime and memory usage for the standard ESN architecture for two different
![Wall time and peak memory usage for the standard ESN architecture for two different
system sizes ($N_u$) and a variety of reservoir sizes ($N_r$).
Walltime is captured with Python's `time` module, and memory is captured with
Wall time is captured with Python's `time` module, and peak memory usage is captured with
[memory-profiler](https://pypi.org/project/memory-profiler/)
for the CPU runs and with
[NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems)
for the GPU runs.
The dotted lines indicate theoretical scaling of memory, where
$a=250,000$ and $b=20,000$ are empirically derived constants, and
Note that the peak memory usage for the GPU runs indicates GPU memory usage
only, since this is a typical bottleneck.
The gray and black lines indicate the general trend in
memory usage during the CPU and GPU simulations, respectively.
The empirically derived gray and black curves are a function of the problem size, and
are provided so users can estimate how much memory might be
required for their applications.
The constants are as follows:
$a=250,000$ is ~3 times the total number of samples used,
$b=20,000$ is the batch size, and
$c=8\cdot10^9$ is a conversion to GB.
\label{fig:eager}
](scaling/eager-scaling.pdf){ width=100% }

For reference, in \autoref{fig:eager} we show the walltime and memory usage involved for
training the
For reference, in \autoref{fig:eager} we show the wall time and peak memory usage required to
train the
standard (eager) `ESN` architecture as a function of the input dimension $N_u$ and
reservoir size $N_r$.
We ran the scaling tests in the `us-central-1c` zone on Google Cloud Platform, using
We ran the scaling tests in the `us-central-1c` zone on Google Cloud Platform (GCP), using
a single `c2-standard-60` instance to test the CPU (NumPy) implementation
and a single `a2-highgpu-8g` (i.e., with 8 A100 cards) instance to test the GPU
(CuPy) implementation.
@@ -193,7 +203,7 @@ The training data was generated from the Lorenz96 model
$N_u=\{16,256\}$,
and we generated 80,000 total samples in the training dataset.
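
For readers unfamiliar with the Lorenz96 system, a minimal sketch of generating a trajectory is given below; the forcing, time step, and trajectory length are illustrative assumptions, not the exact configuration used for these scaling tests (see the linked scaling scripts for that).

```python
import numpy as np

def lorenz96_tendency(x, forcing=8.0):
    """dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, with cyclic indices."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

def integrate(n_u=16, n_steps=1000, dt=0.01, forcing=8.0, seed=0):
    """Integrate the Lorenz96 system with a simple 4th-order Runge-Kutta scheme."""
    rng = np.random.default_rng(seed)
    x = forcing + 0.01 * rng.standard_normal(n_u)  # small perturbation of the fixed point
    out = np.empty((n_steps, n_u))
    for n in range(n_steps):
        k1 = lorenz96_tendency(x, forcing)
        k2 = lorenz96_tendency(x + 0.5 * dt * k1, forcing)
        k3 = lorenz96_tendency(x + 0.5 * dt * k2, forcing)
        k4 = lorenz96_tendency(x + dt * k3, forcing)
        x = x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        out[n] = x
    return out

trajectory = integrate(n_u=16)   # shape (1000, 16)
```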

In the CPU tests, walltime scales quadratically with the reservoir size, while
In the CPU tests, wall time scales quadratically with the reservoir size, while
it is mostly constant on a GPU.
For this problem, it becomes advantageous to use GPUs once the reservoir size is
approximately $N_r=8,000$ or greater.
@@ -207,14 +217,22 @@ This result serves as a motivation for our parallelized architecture.
In order to evaluate the performance of the parallelized architecture, we take
the Lorenz96 system with dimension $N_u=256$ and subdivide the domain into
$N_g = \{2, 4, 8, 16, 32\}$ groups.
We then fix the reservoir size so that $N_r*N_g = 16,000$, so that the problem
size is more or less fixed and the timing results reflect strong scaling.
We then fix the problem size such that $N_r*N_g = 16,000$, so that
the timing results reflect strong scaling.
That is, the results show how the code performs with increasing resources on a fixed problem
size, which in theory should follow Amdahl's Law [@amdahl_1967].
The training task and resources used are otherwise the same as for the standard
ESN results shown in \autoref{fig:eager}.
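
For reference, Amdahl's Law bounds the strong-scaling speedup $S$ attainable with $N_w$ workers when a fraction $p$ of the work can be parallelized; the symbols $p$ and $N_w$ are introduced here only for this reminder and do not appear in the paper itself:

$$
S(N_w) = \frac{1}{(1 - p) + p / N_w},
\qquad
\lim_{N_w \to \infty} S(N_w) = \frac{1}{1 - p}.
$$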
We then create 3 different `dask.distributed` Clusters, testing:

1. Purely threaded mode (CPU only).
2. The relevant default "LocalCluster" (i.e., single node) configuration for our resources.
On the CPU resource, a GCP `c2-standard-60` instance,
the default
`dask.distributed.LocalCluster` has 6 workers, each with 5 threads.
On the GPU resource, a GCP `a2-highgpu-8g` instance, the default
`dask_cuda.LocalCUDACluster` has 8 workers, each
with 1 thread.
3. A `LocalCluster` with 1 `dask` worker per group. On GPUs, this assumes 1 GPU per worker
and we are able to use a maximum of 8 workers due to our available resources.
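
A minimal sketch of how these three cluster configurations might be constructed is shown below; the worker and thread counts mirror the defaults quoted above, but the exact construction is an assumption rather than a copy of the scaling scripts, and in practice only one configuration would be created per run.

```python
import dask
from dask.distributed import Client, LocalCluster

# 1. Purely threaded scheduler (CPU only): no distributed cluster at all.
dask.config.set(scheduler="threads")

# 2. Default single-node cluster on the CPU instance (6 workers x 5 threads each).
cpu_cluster = LocalCluster(n_workers=6, threads_per_worker=5)
client = Client(cpu_cluster)

# On the GPU instance the analogous default comes from the dask-cuda package:
# from dask_cuda import LocalCUDACluster
# gpu_cluster = LocalCUDACluster()   # one worker per visible GPU, 1 thread each

# 3. One dask worker per ESN group, e.g. N_g = 8 groups:
# group_cluster = LocalCluster(n_workers=8, threads_per_worker=1)
```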

Expand All @@ -230,7 +248,7 @@ See text for a description of the different schedulers used.

\autoref{fig:lazy} shows the strong scaling results of `xesn.LazyESN` for each of these
cluster configurations, where each point shows the ratio of the
walltime with the standard (serial) architecture to the lazy (parallel)
wall time with the standard (serial) architecture to the lazy (parallel)
architecture with $N_g$ groups.
On CPUs, using 1 `dask` worker process per ESN group generally scales well,
which makes sense because each group is trained entirely independently.
@@ -241,7 +259,7 @@ When the number of workers is less than the number of groups, performance is
detrimental.
However, when there is at least one worker per group, the timing is almost the
same as for the single worker case, only improving performance by 10-20%.
While the strong scaling is somewhat muted, the invariance of walltime to
While the strong scaling is somewhat muted, the invariance of wall time to
reservoir size in \autoref{fig:eager} and number of groups in
\autoref{fig:lazy} means that the distributed GPU
implementation is able to tackle larger problems at roughly the same
@@ -256,5 +274,8 @@ N00014-19-1-2522 and N00014-20-1-2580.
T.A. Smith acknowledges support from
the Cooperative Institute for Research in Environmental Sciences (CIRES) at the
University of Colorado Boulder.
The authors thank the editor Jonny Saunders for comments that
significantly improved the manuscript, and the reviewers Troy Arcomano and
William Nicholas.

# References
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -5,7 +5,7 @@ description = "Echo State Networks powered by xarray"
authors = [
{name="Timothy Smith", email="[email protected]"},
]
license = {file = "LICENSE.txt"}
license = {file = "LICENSE"}
readme = "README.md"
requires-python = ">=3.9"
dependencies = [
Binary file modified scaling/eager-scaling.pdf
Binary file not shown.
114 changes: 93 additions & 21 deletions scaling/plot_scaling.ipynb

Large diffs are not rendered by default.
