Description
Thanks to @dcherian and others who have been obsessively trying to make very common tasks like calculating climatological aggregations on climate data faster and easier.
I have some use cases on HPC where we are running climatological aggregations (and climatological aggregations composited over ENSO phases) on very large 4D dask arrays (11 TB), and I have been digging into how best to employ `flox`.
But today our HPC centre is down for regular maintenance, so I tried to run some dummy examples on my laptop (Apple M2 silicon, 32 GB RAM) using a 4-worker `LocalCluster` and this example from the documentation: "How about other climatologies?"
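For reference, a minimal sketch of the cluster setup (everything beyond the worker count is left at dask's defaults, so treat any other parameter as an assumption):

```python
from dask.distributed import Client, LocalCluster

# 4-worker local cluster on the laptop; memory limits etc. are dask defaults
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
```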
The one change I made was to replace `ones` with `random` (sketched below), as this seemed a more realistic test. I have no evidence, but I wonder if `ones` would be something "easier" for xarray?
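Roughly, the dummy array was built like the docs example but with `random` swapped in. The time span and chunking below are assumptions based on that example; the sizes reproduce the ~120.5 GB figure reported further down:

```python
import dask.array
import pandas as pd
import xarray as xr

# OISST-like dummy array: daily SST on a 720x1440 grid, ~120.5 GB of float64,
# chunked along time (chunk sizes here are an assumption)
oisst = xr.DataArray(
    dask.array.random.random((14532, 720, 1440), chunks=(20, -1, -1)),
    dims=("time", "lat", "lon"),
    coords={"time": pd.date_range("1981-09-01 12:00", "2021-06-14 12:00", freq="D")},
    name="sst",
)
```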
The dummy array ended up being:

```
oisst object is 120.5342208 GB
3.7666944 times bigger than total memory.
```
To my surprise, *not* using `flox` was always much faster. I forced `flox` to try both `map-reduce` and `cohorts`.
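For concreteness, the three timed runs looked roughly like this (I'm assuming here that the `method` kwarg is forwarded through xarray's groupby reductions to flox, as described in the flox docs):

```python
import xarray as xr

# With flox, forcing each grouping strategy via the `method` kwarg
with xr.set_options(use_flox=True):
    clim_mr = oisst.groupby("time.month").mean(method="map-reduce").compute()
    clim_co = oisst.groupby("time.month").mean(method="cohorts").compute()

# Without flox: xarray's default groupby machinery
with xr.set_options(use_flox=False):
    clim_plain = oisst.groupby("time.month").mean().compute()
```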
RESULTS (repeatable, run after clearing memory and restarting the cluster):

- with flox `map-reduce`: CPU times: user 7.07 s, sys: 1.44 s, total: 8.51 s; Wall time: 2min 9s
- with flox `cohorts`: CPU times: user 5.82 s, sys: 1.16 s, total: 6.98 s; Wall time: 1min 20s
- without flox: CPU times: user 3.37 s, sys: 1.39 s, total: 4.77 s; Wall time: 29.5 s
My goal was to generate an easy-to-run notebook where I could demonstrate the power of `flox` to my colleagues. Instead, I'm a bit less confident I understand how this works.
Questions:
- Is this expected?
- Am I doing something silly or just misunderstanding something fundamental?
- Or is this all down to differences in system architecture between a modern laptop and HPC or Cloud?
Thanks!