@@ -7,15 +7,15 @@ Xarray-Beam is a Python library for building
The project aims to facilitate data transformations and analysis on large-scale
multi-dimensional labeled arrays, such as:

- - Ad-hoc computation on Xarray data, by dividing a `xarray.Dataset` into many
-   smaller pieces ("chunks").
- - Adjusting array chunks, using the
-   [Rechunker algorithm](https://rechunker.readthedocs.io/en/latest/algorithm.html)
- - Ingesting large multi-dimensional array datasets into an analysis-ready,
-   cloud-optimized format, namely [Zarr](https://zarr.readthedocs.io/) (see also
-   [Pangeo Forge](https://github.com/pangeo-forge/pangeo-forge-recipes))
- - Calculating statistics (e.g., "climatology") across distributed datasets with
-   arbitrary groups.
+ - Ad-hoc computation on Xarray data, by dividing a `xarray.Dataset` into many
+   smaller pieces ("chunks").
+ - Adjusting array chunks, using the
+   [Rechunker algorithm](https://rechunker.readthedocs.io/en/latest/algorithm.html)
+ - Ingesting large multi-dimensional array datasets into an analysis-ready,
+   cloud-optimized format, namely [Zarr](https://zarr.readthedocs.io/) (see
+   also [Pangeo Forge](https://github.com/pangeo-forge/pangeo-forge-recipes))
+ - Calculating statistics (e.g., "climatology") across distributed datasets
+   with arbitrary groups.

Xarray-Beam is implemented as a _thin layer_ on top of existing libraries for
working with large-scale Xarray datasets. For example, it leverages
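
To make the chunking and Zarr-ingestion workflow in the list above concrete, here is a minimal sketch (not part of the diff) of an Xarray-Beam pipeline. It assumes the `xarray_beam` transforms `DatasetToChunks` and `ChunksToZarr` accept roughly the arguments shown (check the current API); the input store, chunk size, and output path are invented for illustration.

```python
import apache_beam as beam
import xarray
import xarray_beam as xbeam

# Hypothetical input: any xarray.Dataset works here.
ds = xarray.open_zarr('gs://my-bucket/input.zarr', chunks=None)

with beam.Pipeline() as p:
    (
        p
        # Split the dataset into keyed chunks along the 'time' dimension
        # (chunk size chosen arbitrarily for this sketch).
        | xbeam.DatasetToChunks(ds, chunks={'time': 100})
        # ... per-chunk transformations (e.g., rechunking or grouped
        # statistics) would be inserted here ...
        # Write the chunks out as an analysis-ready Zarr store.
        | xbeam.ChunksToZarr('gs://my-bucket/output.zarr')
    )
```

In this shape of pipeline, rechunking and statistics from the list above would slot in as additional Beam steps between the read and write transforms.
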
@@ -35,29 +35,29 @@ from early adopters, and hope to have it ready for wider audience soon.
We love Dask! Xarray-Beam explores a different part of the design space for
distributed data pipelines than Xarray's built-in Dask integration:

- - Xarray-Beam is built around explicit manipulation of
-   `(ChunkKey, xarray.Dataset)` pairs to perform operations on distributed
-   datasets, where `ChunkKey` is an immutable dict keeping track of the offsets
-   from the origin for a small contiguous "chunk" of a larger distributed
-   dataset. This requires more boilerplate but is also more robust than
-   generating distributed computation graphs in Dask using Xarray's built-in API.
-   The user is expected to have a mental model for how their data pipeline is
-   distributed across many machines.
- - Xarray-Beam distributes datasets by splitting them into many `xarray.Dataset`
-   chunks, rather than the chunks of NumPy arrays typically used by Xarray with
-   Dask (unless using
-   [xarray.map_blocks](http://xarray.pydata.org/en/stable/user-guide/dask.html#automatic-parallelization-with-apply-ufunc-and-map-blocks)).
-   Chunks of datasets is a more convenient data-model for writing ad-hoc
-   whole dataset transformations, but is potentially a bit less efficient.
- - Beam ([like Spark](https://docs.dask.org/en/latest/spark.html)) was designed
-   around a higher-level model for distributed computation than Dask (although
-   Dask has been making
-   [progress in this direction](https://coiled.io/blog/dask-under-the-hood-scheduler-refactor/)).
-   Roughly speaking, this trade-off favors scalability over flexibility.
- - Beam allows for executing distributed computation using multiple runners,
-   notably including Google Cloud Dataflow and Apache Spark. These runners are
-   more mature than Dask, and in many cases are supported as a service by major
-   commercial cloud providers.
+ - Xarray-Beam is built around explicit manipulation of `(xarray_beam.Key,
+   xarray.Dataset)` pairs to perform operations on distributed datasets, where
+   `Key` is an immutable dict keeping track of the offsets from the origin for
+   a small contiguous "chunk" of a larger distributed dataset. This requires
+   more boilerplate but is also more robust than generating distributed
+   computation graphs in Dask using Xarray's built-in API. The user is expected
+   to have a mental model for how their data pipeline is distributed across
+   many machines.
+ - Xarray-Beam distributes datasets by splitting them into many
+   `xarray.Dataset` chunks, rather than the chunks of NumPy arrays typically
+   used by Xarray with Dask (unless using
+   [xarray.map_blocks](http://xarray.pydata.org/en/stable/user-guide/dask.html#automatic-parallelization-with-apply-ufunc-and-map-blocks)).
+   Chunks of datasets is a more convenient data-model for writing ad-hoc whole
+   dataset transformations, but is potentially a bit less efficient.
+ - Beam ([like Spark](https://docs.dask.org/en/latest/spark.html)) was designed
+   around a higher-level model for distributed computation than Dask (although
+   Dask has been making
+   [progress in this direction](https://coiled.io/blog/dask-under-the-hood-scheduler-refactor/)).
+   Roughly speaking, this trade-off favors scalability over flexibility.
+ - Beam allows for executing distributed computation using multiple runners,
+   notably including Google Cloud Dataflow and Apache Spark. These runners are
+   more mature than Dask, and in many cases are supported as a service by major
+   commercial cloud providers.

![Xarray-Beam datamodel vs Xarray-Dask](./static/xarray-beam-vs-xarray-dask.png)

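To make the `(xarray_beam.Key, xarray.Dataset)` data model described above concrete, here is a minimal sketch (not part of the diff) that builds keyed chunks by hand and applies a per-chunk function with plain Beam transforms. The dimension name, offsets, and the `double` helper are invented for illustration; the only Xarray-Beam assumption is that `Key` can be constructed from a dict of offsets, as the text above describes.

```python
import apache_beam as beam
import numpy as np
import xarray
import xarray_beam as xbeam

def make_chunk(offset):
    # Each pipeline element is a (Key, Dataset) pair; the Key records the
    # chunk's integer offset from the origin along each dimension.
    key = xbeam.Key({'x': offset})
    chunk = xarray.Dataset({'foo': ('x', np.arange(offset, offset + 10))})
    return key, chunk

def double(key, chunk):
    # Per-chunk logic is ordinary Xarray code on an in-memory Dataset.
    return key, 2 * chunk

with beam.Pipeline() as p:
    (
        p
        | beam.Create([0, 10, 20])   # chunk offsets along 'x'
        | beam.Map(make_chunk)
        | beam.MapTuple(double)
    )
```

Because every element carries its own offsets, downstream transforms can tell where a chunk sits in the overall dataset without consulting a centralized task graph, which reflects the trade-off described in the list above.
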
@@ -70,9 +70,9 @@ representation similar to that used by dask.array.

## Getting started

- Xarray-Beam requires recent versions of xarray, dask, rechunker and zarr. It
- needs the latest release of Apache Beam (2.31.0 or later). For good performance
- when writing Zarr files, we strongly recommend patching Xarray with
+ Xarray-Beam requires recent versions of immutabledict, xarray, dask, rechunker
+ and zarr. It needs the latest release of Apache Beam (2.31.0 or later). For good
+ performance when writing Zarr files, we strongly recommend patching Xarray with
[this pull request](https://github.com/pydata/xarray/pull/5252).

TODO(shoyer): write a tutorial here! For now, see the test suite for examples.
@@ -90,6 +90,6 @@ See the "Contribution guidelines" for more.

Contributors:

- - Stephan Hoyer
- - Jason Hickey
- - Cenk Gazen
+ - Stephan Hoyer
+ - Jason Hickey
+ - Cenk Gazen