Commit f191fd3

Add doc page on 'Why xarray-beam'
1 parent e66a38e commit f191fd3

9 files changed, +279 -252 lines changed

Diff for: .gitignore

+1
@@ -1,4 +1,5 @@
 *.egg-info
+.DS_Store
 build
 dist
 docs/.ipynb_checkpoints

Diff for: README.md

+10-51
@@ -17,65 +17,23 @@ multi-dimensional labeled arrays, such as:
 - Calculating statistics (e.g., "climatology") across distributed datasets
   with arbitrary groups.
 
-Xarray-Beam is implemented as a _thin layer_ on top of existing libraries for
-working with large-scale Xarray datasets. For example, it leverages
-[Dask](https://dask.org/) for describing lazy arrays and for executing
-multi-threaded computation on a single machine.
+For more about our approach and how to get started,
+**[read the documentation](https://xarray-beam.readthedocs.io/)**!
 
 **🚨 Warning: Xarray-Beam is new and unpolished 🚨**
 
 Expect sharp edges 🔪 and performance cliffs 🧗, particularly related to the
 management of lazy data with Dask and reading/writing data with Zarr. We have
-used it to efficiently process 5 TB datasets. We _expect_ it to scale to PB size
-datasets but that's easier said than done. We welcome feedback and contributions
-from early adopters, and hope to have it ready for wider audience soon.
+used it to efficiently process ~25 TB datasets. We _expect_ it to scale to PB
+size datasets but that's easier said than done. We welcome feedback and
+contributions from early adopters, and hope to have it ready for wider audience
+soon.
 
-## How does Xarray-Beam compare to Dask?
-
-We love Dask! Xarray-Beam explores a different part of the design space for
-distributed data pipelines than Xarray's built-in Dask integration:
-
-- Xarray-Beam is built around explicit manipulation of `(xarray_beam.Key,
-  xarray.Dataset)` pairs to perform operations on distributed datasets, where
-  `Key` is an immutable dict keeping track of the offsets from the origin for
-  a small contiguous "chunk" of a larger distributed dataset. This requires
-  more boilerplate but is also more robust than generating distributed
-  computation graphs in Dask using Xarray's built-in API. The user is expected
-  to have a mental model for how their data pipeline is distributed across
-  many machines.
-- Xarray-Beam distributes datasets by splitting them into many
-  `xarray.Dataset` chunks, rather than the chunks of NumPy arrays typically
-  used by Xarray with Dask (unless using
-  [xarray.map_blocks](http://xarray.pydata.org/en/stable/user-guide/dask.html#automatic-parallelization-with-apply-ufunc-and-map-blocks)).
-  Chunks of datasets is a more convenient data-model for writing ad-hoc whole
-  dataset transformations, but is potentially a bit less efficient.
-- Beam ([like Spark](https://docs.dask.org/en/latest/spark.html)) was designed
-  around a higher-level model for distributed computation than Dask (although
-  Dask has been making
-  [progress in this direction](https://coiled.io/blog/dask-under-the-hood-scheduler-refactor/)).
-  Roughly speaking, this trade-off favors scalability over flexibility.
-- Beam allows for executing distributed computation using multiple runners,
-  notably including Google Cloud Dataflow and Apache Spark. These runners are
-  more mature than Dask, and in many cases are supported as a service by major
-  commercial cloud providers.
-
-![Xarray-Beam datamodel vs Xarray-Dask](./static/xarray-beam-vs-xarray-dask.png)
-
-These design choices are not set in stone. In particular, in the future we
-_could_ imagine writing a high-level `xarray_beam.Dataset` that emulates the
-`xarray.Dataset` API, similar to the popular high-level DataFrame APIs in Beam,
-Spark and Dask. This could be built on top of the lower-level transformations
-currently in Xarray-Beam, or alternatively could use a "chunks of NumPy arrays"
-representation similar to that used by dask.array.
-
-## Getting started
+## Installation
 
 Xarray-Beam requires recent versions of immutabledict, xarray, dask, rechunker
-and zarr. It needs the latest release of Apache Beam (2.31.0 or later). For good
-performance when writing Zarr files, we strongly recommend patching Xarray with
-[this pull request](https://github.com/pydata/xarray/pull/5252).
-
-TODO(shoyer): write a tutorial here! For now, see the test suite for examples.
+and zarr, and the *latest* release of Apache Beam (2.31.0 or later). For best
+performance when writing Zarr files, use Xarray 0.19.0 or later.
 
 ## Disclaimer
 

@@ -93,3 +51,4 @@ Contributors:
 - Stephan Hoyer
 - Jason Hickey
 - Cenk Gazen
+- Alex Merose
File renamed without changes.
File renamed without changes.
File renamed without changes.
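
The new Installation text in the README diff gives two explicit version floors: Apache Beam 2.31.0 or later, and Xarray 0.19.0 or later for best Zarr write performance. As a rough illustration only (this script is not part of the commit), the sketch below checks an environment against those two floors. The helper names `parse_version`, `meets_minimum` and `check_environment` are hypothetical, and the version parser is deliberately simplified: it reads only the leading numeric components, so a pre-release such as `0.19.0rc1` is treated like `0.19.0`.

```python
from importlib import metadata  # standard library, Python 3.8+

# Floors quoted in the README diff. The other listed dependencies
# (immutabledict, dask, rechunker, zarr) only say "recent", so no
# numbers are assumed for them here.
MINIMUM_VERSIONS = {
    "apache-beam": (2, 31, 0),
    "xarray": (0, 19, 0),
}


def parse_version(text):
    """Turn '2.31.0' into (2, 31, 0), keeping only leading digits per part."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '2.31' compares equal to '2.31.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version string is at or above the floor."""
    return parse_version(installed) >= minimum


def check_environment():
    """Return {distribution name: status string} for each floor above."""
    report = {}
    for name, minimum in MINIMUM_VERSIONS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = "too old: " + installed
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(name, status)
```

Comparing tuples of integers rather than raw strings matters here: as strings, `"2.9.1"` would sort after `"2.31.0"`, which would wrongly accept an older Beam release.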
