[joss] Paper Edits (#80)
* updated text as recommended by @sneakers-the-rat in #78

* fixed an awkward wording

* updated plot

* update markersize, some text

* add dask scipy ref

* whoops, some license leftovers

* updated fig 1

* strong scaling comment

* another figure iteration

* Update LICENSE (#81)

* Update LICENSE

* bump version

* date, and fix dask double cite

* n_i -> n_u

* another update
timothyas authored Nov 1, 2024
1 parent 2b0e8f6 commit 71c280c
Showing 7 changed files with 173 additions and 53 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -74,6 +74,6 @@ Report bugs, suggest features, or view the source code

## License and Copyright

xesn is licensed under version 3 of the GNU Lesser General Public License.
xesn is licensed under the Apache-2.0 License.

Development occurs on GitHub at <https://github.com/timothyas/xesn>.
10 changes: 5 additions & 5 deletions codemeta.json
@@ -6,12 +6,12 @@
],
"identifier": "",
"codeRepository": "https://github.com/timothyas/xesn",
"datePublished": "2023-12-15",
"dateModified": "2023-12-15",
"datePublished": "2024-11-01",
"dateModified": "2024-11-01",
"dateCreated": "2023-12-15",
"description": "Echo State Networks powered by xarray and dask.",
"keywords": "echo state networks, machine learning, weather, climate, forecasting",
"license": "LGPL v3",
"license": "Apache-2.0",
"title": "xesn",
"version": "v0.1.2"
}
"version": "v0.2.0"
}
27 changes: 27 additions & 0 deletions docs/references.bib
@@ -1795,3 +1795,30 @@ @software{jax2018github
version = {0.3.13},
year = {2018},
}

@inproceedings{amdahl_1967,
author = {Amdahl, Gene M.},
title = {Validity of the single processor approach to achieving large scale computing capabilities},
year = {1967},
isbn = {9781450378956},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1465482.1465560},
doi = {10.1145/1465482.1465560},
abstract = {For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.},
booktitle = {Proceedings of the April 18-20, 1967, Spring Joint Computer Conference},
pages = {483–485},
numpages = {3},
location = {Atlantic City, New Jersey},
series = {AFIPS '67 (Spring)}
}

@InProceedings{rocklin_scipy_2015,
author = { {M}atthew {R}ocklin },
title = { {D}ask: {P}arallel {C}omputation with {B}locked algorithms and {T}ask {S}cheduling },
booktitle = { {P}roceedings of the 14th {P}ython in {S}cience {C}onference },
pages = { 126 - 132 },
year = { 2015 },
editor = { {K}athryn {H}uff and {J}ames {B}ergstra },
doi = { 10.25080/Majora-7b98e3ed-013 }
}
71 changes: 46 additions & 25 deletions paper.md
@@ -32,7 +32,7 @@ affiliations:
index: 4
- name: Pacific Northwest National Laboratory, Richland, WA, USA
index: 5
date: 15 December 2023
date: 1 November 2024
bibliography: docs/references.bib

---
@@ -67,7 +67,7 @@ signal processing [@jaeger_harnessing_2004].
# Statement of Need

ESNs are a conceptually simple Recurrent Neural Network architecture,
leading many scientists who use ESNs implement them from scratch.
leading many scientists who use ESNs to implement them from scratch.
While this approach can work well for low dimensional problems, the situation
quickly becomes more complicated when:

@@ -78,7 +78,7 @@ quickly becomes more complicated when:

`Xesn` is designed to address all of these points.
Additionally, while there are some design flexibilities for the ESN
architectures, the overall interface is streamlined, based on the parameter and
architectures, the overall interface is streamlined based on the parameter and
design impact study shown by @platt_systematic_2022.

## GPU Deployment
@@ -101,11 +101,11 @@ adding invariant metrics to the loss function, like the leading Lyapunov
exponent or the Kinetic Energy spectrum, improved generalizability.
As a generic implementation of these metrics,
`xesn` offers the capability to constrain the
system's Power Spectral Density during parameter optimization, in addition to a
system's Power Spectral Density during parameter optimization in addition to a
more traditional mean squared error loss function.
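
To make the idea concrete, below is a minimal, hypothetical sketch of a loss that combines a mean squared error term with a penalty on the (log) power spectral density. The function names, the log-space comparison, and the weighting factor `beta` are illustrative choices for this sketch, not `xesn`'s actual implementation.

```python
import numpy as np

def psd(x):
    """One-sided power spectral density of a 1D signal (unnormalized, for illustration)."""
    return np.abs(np.fft.rfft(x)) ** 2

def combined_loss(prediction, target, beta=1.0):
    """Mean squared error plus a spectral penalty, weighted by beta."""
    mse = np.mean((prediction - target) ** 2)
    spectral_error = np.mean(
        (np.log(psd(prediction) + 1e-12) - np.log(psd(target) + 1e-12)) ** 2
    )
    return mse + beta * spectral_error

# Tiny usage example with synthetic data
rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 20 * np.pi, 1000))
noisy = truth + 0.1 * rng.standard_normal(truth.size)
print(combined_loss(noisy, truth))
```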

`Xesn` enables parameter optimization by integrating with the Surrogate Modeling
Toolbox [@bouhlel_scalable_2020], which has a Bayesian Optimization
Toolbox [@bouhlel_scalable_2020], which has a Bayesian optimization
implementation.
`Xesn` provides a simple interface so that the user can specify all of the
settings for training, parameter optimization, and testing with a single YAML
@@ -128,10 +128,10 @@ as the basis for `xesn`.
`Xesn` enables prediction for multi-dimensional systems by integrating its high
level operations with `xarray` [@hoyer_xarray_2017].
As with `xarray`, users refer to dimensions based on their named axes.
`Xesn` parallelizes the core array based operations by using `dask` [@dask_2016]
to map them across available resources, which can include a multi-threaded
environment on a laptop or single node, or a distributed computing resource
such as traditional on-premises HPC or in the cloud.
`Xesn` parallelizes the core array based operations by using `dask`
[@dask_2016; @rocklin_scipy_2015]
to map them across available resources, from a laptop to a distributed HPC or
cloud cluster.
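
As an illustration of the `xarray`/`dask` pattern described here, the sketch below uses plain `xarray` and `dask`, not `xesn`'s own API; the dimension names and array sizes are made up for the example.

```python
import numpy as np
import xarray as xr

# Toy multi-dimensional data with named axes; sizes are illustrative only.
data = xr.DataArray(
    np.random.standard_normal((256, 10_000)),
    dims=("x", "time"),
)

# Chunking with dask splits the array into blocks along the named "x" axis;
# subsequent operations are mapped lazily over those blocks.
chunked = data.chunk({"x": 64})
climatology = chunked.mean(dim="time")   # still lazy
print(climatology.compute().shape)       # (256,)
```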


## Existing Reservoir Computing Software
@@ -145,14 +145,13 @@ underlying the reservoir, and allowing for delayed connections.
On the other hand, `xesn` is focused specifically on implementing ESN
architectures that can scale to multi-dimensional forecasting tasks.
Additionally, while `ReservoirPy` enables hyperparameter grid search capabilities
via Hyperopt [@hyperopt], `xesn` enables Bayesian Optimization as noted above.
via Hyperopt [@hyperopt], `xesn` enables Bayesian optimization as noted above.

Finally, we note the code base used by [@arcomano_machine_2020;@arcomano_hybrid_2022;@arcomano_hybrid_2023],
Another ESN implementation is that of [@arcomano_machine_2020;@arcomano_hybrid_2022;@arcomano_hybrid_2023],
available at [@arcomano_code].
The code implements ESNs in Fortran, and focuses on using ESNs for hybrid physics-ML modeling.



# Computational Performance

Here we present brief scaling results in order to show
@@ -162,29 +161,40 @@ scales with increasing hidden and input dimensions.
Additionally, we provide some baseline results to serve as guidance when
configuring `dask` to use the parallelized
[`xesn.LazyESN`](https://xesn.readthedocs.io/en/latest/generated/xesn.LazyESN.html) architecture.
The scripts used to set up, execute, and visualize these scaling tests can be
found
[here](https://github.com/timothyas/xesn/tree/1524713149efa38a0fd52ecdeb32ca5aacb62693/scaling).
For methodological details on these two architectures, please refer to
[this section](https://xesn.readthedocs.io/en/latest/methods.html) of the documentation.
[the methods section of the documentation](https://xesn.readthedocs.io/en/latest/methods.html).

## Standard (Eager) ESN Performance

![Walltime and memory usage for the standard ESN architecture for two different
![Wall time and peak memory usage for the standard ESN architecture for two different
system sizes ($N_u$) and a variety of reservoir sizes ($N_r$).
Walltime is captured with Python's `time` module, and memory is captured with
Wall time is captured with Python's `time` module, and peak memory usage is captured with
[memory-profiler](https://pypi.org/project/memory-profiler/)
for the CPU runs and with
[NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems)
for the GPU runs.
The dotted lines indicate theoretical scaling of memory, where
$a=250,000$ and $b=20,000$ are empirically derived constants, and
Note that the peak memory usage for the GPU runs indicates GPU memory usage
only, since this is a typical bottleneck.
The gray and black lines indicate the general trend in
memory usage during the CPU and GPU simulations, respectively.
The empirically derived gray and black curves are a function of the problem size, and
are provided so users can estimate how much memory might be
required for their applications.
The constants are as follows:
$a=250,000$ is ~3 times the total number of samples used,
$b=20,000$ is the batch size, and
$c=8\cdot10^9$ is a conversion to GB.
\label{fig:eager}
](scaling/eager-scaling.pdf){ width=100% }

For reference, in \autoref{fig:eager} we show the walltime and memory usage involved for
training the
For reference, in \autoref{fig:eager} we show the wall time and peak memory usage required to
train the
standard (eager) `ESN` architecture as a function of the input dimension $N_u$ and
reservoir size $N_r$.
We ran the scaling tests in the `us-central-1c` zone on Google Cloud Platform, using
We ran the scaling tests in the `us-central-1c` zone on Google Cloud Platform (GCP), using
a single `c2-standard-60` instance to test the CPU (NumPy) implementation
and a single `a2-highgpu-8g` (i.e., with 8 A100 cards) instance to test the GPU
(CuPy) implementation.
@@ -193,7 +203,7 @@ The training data was generated from the Lorenz96 model
$N_u=\{16,256\}$,
and we generated 80,000 total samples in the training dataset.
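
For readers unfamiliar with the Lorenz96 system, a minimal sketch of generating a trajectory is given below; the forcing, time step, and trajectory length are illustrative assumptions, not the exact configuration used for these scaling tests (see the linked scaling scripts for that).

```python
import numpy as np

def lorenz96_tendency(x, forcing=8.0):
    """dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, with cyclic indices."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

def integrate(n_u=16, n_steps=1000, dt=0.01, forcing=8.0, seed=0):
    """Integrate the Lorenz96 system with a simple 4th-order Runge-Kutta scheme."""
    rng = np.random.default_rng(seed)
    x = forcing + 0.01 * rng.standard_normal(n_u)  # small perturbation of the fixed point
    out = np.empty((n_steps, n_u))
    for n in range(n_steps):
        k1 = lorenz96_tendency(x, forcing)
        k2 = lorenz96_tendency(x + 0.5 * dt * k1, forcing)
        k3 = lorenz96_tendency(x + 0.5 * dt * k2, forcing)
        k4 = lorenz96_tendency(x + dt * k3, forcing)
        x = x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        out[n] = x
    return out

trajectory = integrate(n_u=16)   # shape (1000, 16)
```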

In the CPU tests, walltime scales quadratically with the reservoir size, while
In the CPU tests, wall time scales quadratically with the reservoir size, while
it is mostly constant on a GPU.
For this problem, it becomes advantageous to use GPUs once the reservoir size is
approximately $N_r=8,000$ or greater.
@@ -207,14 +217,22 @@ This result serves as a motivation for our parallelized architecture.
In order to evaluate the performance of the parallelized architecture, we take
the Lorenz96 system with dimension $N_u=256$ and subdivide the domain into
$N_g = \{2, 4, 8, 16, 32\}$ groups.
We then fix the reservoir size so that $N_r*N_g = 16,000$, so that the problem
size is more or less fixed and the timing results reflect strong scaling.
We then fix the problem size such that $N_r*N_g = 16,000$, so that
the timing results reflect strong scaling.
That is, the results show how the code performs with increasing resources on a fixed problem
size, which in theory should follow Amdahl's Law [@amdahl_1967].
The training task and resources used are otherwise the same as for the standard
ESN results shown in \autoref{fig:eager}.
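
For reference, Amdahl's Law bounds the strong-scaling speedup $S$ attainable with $N_w$ workers when a fraction $p$ of the work can be parallelized; the symbols $p$ and $N_w$ are introduced here only for this reminder and do not appear in the paper itself:

$$
S(N_w) = \frac{1}{(1 - p) + p / N_w},
\qquad
\lim_{N_w \to \infty} S(N_w) = \frac{1}{1 - p}.
$$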
We then create 3 different `dask.distributed` Clusters, testing:

1. Purely threaded mode (CPU only).
2. The relevant default "LocalCluster" (i.e., single node) configuration for our resources.
On the CPU resource, a GCP `c2-standard-60` instance,
the default
`dask.distributed.LocalCluster` has 6 workers, each with 5 threads.
On the GPU resource, a GCP `a2-highgpu-8g` instance, the default
`dask_cuda.LocalCUDACluster` has 8 workers, each
with 1 thread.
3. A `LocalCluster` with 1 `dask` worker per group. On GPUs, this assumes 1 GPU per worker
and we are able to use a maximum of 8 workers due to our available resources.
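
A minimal sketch of how these three cluster configurations might be constructed is shown below; the worker and thread counts mirror the defaults quoted above, but the exact construction is an assumption rather than a copy of the scaling scripts, and in practice only one configuration would be created per run.

```python
import dask
from dask.distributed import Client, LocalCluster

# 1. Purely threaded scheduler (CPU only): no distributed cluster at all.
dask.config.set(scheduler="threads")

# 2. Default single-node cluster on the CPU instance (6 workers x 5 threads each).
cpu_cluster = LocalCluster(n_workers=6, threads_per_worker=5)
client = Client(cpu_cluster)

# On the GPU instance the analogous default comes from the dask-cuda package:
# from dask_cuda import LocalCUDACluster
# gpu_cluster = LocalCUDACluster()   # one worker per visible GPU, 1 thread each

# 3. One dask worker per ESN group, e.g. N_g = 8 groups:
# group_cluster = LocalCluster(n_workers=8, threads_per_worker=1)
```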

Expand All @@ -230,7 +248,7 @@ See text for a description of the different schedulers used.

\autoref{fig:lazy} shows the strong scaling results of `xesn.LazyESN` for each of these
cluster configurations, where each point shows the ratio of the
walltime with the standard (serial) architecture to the lazy (parallel)
wall time with the standard (serial) architecture to the lazy (parallel)
architecture with $N_g$ groups.
On CPUs, using 1 `dask` worker process per ESN group generally scales well,
which makes sense because each group is trained entirely independently.
@@ -241,7 +259,7 @@ When the number of workers is less than the number of groups, performance is
detrimental.
However, when there is at least one worker per group, the timing is almost the
same as for the single worker case, only improving performance by 10-20%.
While the strong scaling is somewhat muted, the invariance of walltime to
While the strong scaling is somewhat muted, the invariance of wall time to
reservoir size in \autoref{fig:eager} and number of groups in
\autoref{fig:lazy} means that the distributed GPU
implementation is able to tackle larger problems at roughly the same
@@ -256,5 +274,8 @@ N00014-19-1-2522 and N00014-20-1-2580.
T.A. Smith acknowledges support from
the Cooperative Institute for Research in Environmental Sciences (CIRES) at the
University of Colorado Boulder.
The authors thank the editor Jonny Saunders for comments that
significantly improved the manuscript, and the reviewers Troy Arcomano and
William Nicholas.

# References
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -5,7 +5,7 @@ description = "Echo State Networks powered by xarray"
authors = [
{name="Timothy Smith", email="[email protected]"},
]
license = {file = "LICENSE.txt"}
license = {file = "LICENSE"}
readme = "README.md"
requires-python = ">=3.9"
dependencies = [
Binary file modified scaling/eager-scaling.pdf
Binary file not shown.
114 changes: 93 additions & 21 deletions scaling/plot_scaling.ipynb

Large diffs are not rendered by default.
