@@ -17,65 +17,23 @@ multi-dimensional labeled arrays, such as:
17
17
- Calculating statistics (e.g., "climatology") across distributed datasets
18
18
with arbitrary groups.
19
19
20
- Xarray-Beam is implemented as a _ thin layer_ on top of existing libraries for
21
- working with large-scale Xarray datasets. For example, it leverages
22
- [ Dask] ( https://dask.org/ ) for describing lazy arrays and for executing
23
- multi-threaded computation on a single machine.
20
+ For more about our approach and how to get started,
21
+ ** [ read the documentation] ( https://xarray-beam.readthedocs.io/ ) ** !
24
22
25
23
** 🚨 Warning: Xarray-Beam is new and unpolished 🚨**
26
24
27
25
Expect sharp edges 🔪 and performance cliffs 🧗, particularly related to the
28
26
management of lazy data with Dask and reading/writing data with Zarr. We have
29
- used it to efficiently process 5 TB datasets. We _ expect_ it to scale to PB size
30
- datasets but that's easier said than done. We welcome feedback and contributions
31
- from early adopters, and hope to have it ready for wider audience soon.
27
+ used it to efficiently process ~ 25 TB datasets. We _ expect_ it to scale to PB
28
+ size datasets but that's easier said than done. We welcome feedback and
29
+ contributions from early adopters, and hope to have it ready for wider audience
30
+ soon.
32
31
33
- ## How does Xarray-Beam compare to Dask?
34
-
35
- We love Dask! Xarray-Beam explores a different part of the design space for
36
- distributed data pipelines than Xarray's built-in Dask integration:
37
-
38
- - Xarray-Beam is built around explicit manipulation of `(xarray_beam.Key,
39
- xarray.Dataset)` pairs to perform operations on distributed datasets, where
40
- ` Key ` is an immutable dict keeping track of the offsets from the origin for
41
- a small contiguous "chunk" of a larger distributed dataset. This requires
42
- more boilerplate but is also more robust than generating distributed
43
- computation graphs in Dask using Xarray's built-in API. The user is expected
44
- to have a mental model for how their data pipeline is distributed across
45
- many machines.
46
- - Xarray-Beam distributes datasets by splitting them into many
47
- ` xarray.Dataset ` chunks, rather than the chunks of NumPy arrays typically
48
- used by Xarray with Dask (unless using
49
- [ xarray.map_blocks] ( http://xarray.pydata.org/en/stable/user-guide/dask.html#automatic-parallelization-with-apply-ufunc-and-map-blocks ) ).
50
- Chunks of datasets is a more convenient data-model for writing ad-hoc whole
51
- dataset transformations, but is potentially a bit less efficient.
52
- - Beam ([ like Spark] ( https://docs.dask.org/en/latest/spark.html ) ) was designed
53
- around a higher-level model for distributed computation than Dask (although
54
- Dask has been making
55
- [ progress in this direction] ( https://coiled.io/blog/dask-under-the-hood-scheduler-refactor/ ) ).
56
- Roughly speaking, this trade-off favors scalability over flexibility.
57
- - Beam allows for executing distributed computation using multiple runners,
58
- notably including Google Cloud Dataflow and Apache Spark. These runners are
59
- more mature than Dask, and in many cases are supported as a service by major
60
- commercial cloud providers.
61
-
62
- ![ Xarray-Beam datamodel vs Xarray-Dask] ( ./static/xarray-beam-vs-xarray-dask.png )
63
-
64
- These design choices are not set in stone. In particular, in the future we
65
- _ could_ imagine writing a high-level ` xarray_beam.Dataset ` that emulates the
66
- ` xarray.Dataset ` API, similar to the popular high-level DataFrame APIs in Beam,
67
- Spark and Dask. This could be built on top of the lower-level transformations
68
- currently in Xarray-Beam, or alternatively could use a "chunks of NumPy arrays"
69
- representation similar to that used by dask.array.
70
-
71
- ## Getting started
32
+ ## Installation
72
33
73
34
Xarray-Beam requires recent versions of immutabledict, xarray, dask, rechunker
74
- and zarr. It needs the latest release of Apache Beam (2.31.0 or later). For good
75
- performance when writing Zarr files, we strongly recommend patching Xarray with
76
- [ this pull request] ( https://github.com/pydata/xarray/pull/5252 ) .
77
-
78
- TODO(shoyer): write a tutorial here! For now, see the test suite for examples.
35
+ and zarr, and the * latest* release of Apache Beam (2.31.0 or later). For best
36
+ performance when writing Zarr files, use Xarray 0.19.0 or later.
79
37
80
38
## Disclaimer
81
39
@@ -93,3 +51,4 @@ Contributors:
93
51
- Stephan Hoyer
94
52
- Jason Hickey
95
53
- Cenk Gazen
54
+ - Alex Merose
0 commit comments