Sketch support for writing, reading sliced AwkwardArrays. (#549)

* Sketch support for writing, reading sliced AwkwardArrays. * Use py38-compatibility type hints. * Add awkward to some dependency sets. * Vendor zipfile from Python 3.9.17. * Extract form keys with attributes from form. * Pack before upload. * Remove redundant default setting; param may soon be deprecated Co-authored-by: Angus Hollands <[email protected]> * Use public API added in awkward v2.4.0. Co-authored-by: Angus Hollands <[email protected]> * Handle UnionForm For now, it will only work if len(step2) < 2. Work on the awkward side, tracked in scikit-hep/awkward#2666, is needed to enable the rest. Co-authored-by: Jim Pivarski <[email protected]> * Raise clear error for unsupported UnionForms. * Add minimum version pin for awkward. * Refactor to use Form.expected_form_buffers. * Move buffers to dedicated route. * Support /awkward/full * Add support for JSON, Feather, Parquet. * Test use of returned client and client obtained via lookup. * Include 'buffers' link. * Remove slicing options from export, not yet supported. * Add file ext aliases for parquet, feather, arrow. * Remove outdated comment. Co-authored-by: Angus Hollands <[email protected]> * Fix URLs. * Add Awkward to structure lists in docs. * Refactor AwkwardBuffersAdapter into AwkwardAdapter. * Add awkward example. * Document all writing methods, including new awkward ones. * Rename AwkwardArrayClient -> AwkwardClient. * Remove outdated plans. * Fix bugs that failed tests * Use allow_noncanonical_form. --------- Co-authored-by: Angus Hollands <[email protected]> Co-authored-by: Jim Pivarski <[email protected]>
bluesky · Sep 28, 2023 · 952d66e · 952d66e
1 parent ad8bccd
commit 952d66e
Show file tree

Hide file tree

Showing 27 changed files with 3,517 additions and 19 deletions.
diff --git a/.flake8 b/.flake8
@@ -6,7 +6,8 @@ exclude =
     dist,
     versioneer.py,
     tiled/_version.py,
-    docs/source/conf.py
-    share
-    web-frontend
+    tiled/serialization/_zipfile_py39.py,
+    docs/source/conf.py,
+    share,
+    web-frontend,
 max-line-length = 115
diff --git a/README.md b/README.md
@@ -20,6 +20,7 @@ Tiled puts an emphasis on **structures** rather than formats, including:
 * N-dimensional strided arrays (i.e. numpy-like arrays)
 * Sparse arrays
 * Tabular data (e.g. pandas-like "dataframes")
+* Nested, variable-sized data (as implemented by [AwkwardArray](https://awkward-array.org/))
 * Hierarchical structures thereof (e.g. xarrays, HDF5-compatible structures like NeXus)
 
 Tiled implements extensible **access control enforcement** based on web security

diff --git a/docs/source/explanations/structures.md b/docs/source/explanations/structures.md
@@ -10,13 +10,11 @@ potentially any language.
 The structure families are:
 
 * array --- a strided array, like a [numpy](https://numpy.org) array
+* awkward --- nested, variable-sized data (as implemented by [AwkwardArray](https://awkward-array.org/))
+* container --- a of other structures, akin to a dictionary or a directory
 * sparse --- a sparse array (i.e. an array which is mostly zeros)
 * table --- tabular data, as in [Apache Arrow](https://arrow.apache.org) or
   [pandas](https://pandas.pydata.org/)
-* container --- a of other structures, akin to a dictionary or a directory
-
-Support for sparse arrays and [Awkward Array](https://awkward-array.org/) are
-planned.
 
 ## How structure is encoded
 
@@ -57,7 +55,7 @@ a string label for each dimension.
 This `(10, 10)`-shaped array fits in a single `(10, 10)`-shaped chunk.
 
 ```
-$ http :8000/metadata/small_image | jq .data.attributes.structure
+$ http :8000/api/v1/metadata/small_image | jq .data.attributes.structure
 ```
 
 ```json
@@ -91,7 +89,7 @@ This `(10000, 10000)`-shaped array is subdivided into 4 × 4 = 16 chunks,
 which is why the size of each chunk is given explicitly.
 
 ```
-$ http :8000/metadata/big_image | jq .data.attributes.structure
+$ http :8000/api/v1/metadata/big_image | jq .data.attributes.structure
 ```
 
 ```json
@@ -130,7 +128,7 @@ This is a 1D array where each item has internal structure,
 as in numpy's [strucuted data types](https://numpy.org/doc/stable/user/basics.rec.html)
 
 ```
-$ http :8000/metadata/structured_data/pets | jq .data.attributes.structure
+$ http :8000/api/v1/metadata/structured_data/pets | jq .data.attributes.structure
 ```
 
 ```json
@@ -180,6 +178,70 @@ $ http :8000/metadata/structured_data/pets | jq .data.attributes.structure
 }
 ```
 
+### Awkward
+
+[AwkwardArrays](https://awkward-array.org/) express nested, variable-sized
+data, including arbitrary-length lists, records, mixed types, and missing data.
+This often comes up in the context of event-based data, such as is used in
+high-energy physics, neutron experiments, quantum computing, and very high-rate
+detectors.
+
+AwkwardArrays are specified by:
+
+* An outer `length` (always an integer)
+* A JSON `form` (specified by AwkwardArray, giving the internal layout)
+* Named buffers of bytes, whose names match information in the `form`
+
+The first two are included in the structure.
+
+```
+$ http :8000/api/v1/metadata/awkward_array | jq .data.attributes.structure
+```
+
+```json
+{
+  "length": 3,
+  "form": {
+    "class": "ListOffsetArray",
+    "offsets": "i64",
+    "content": {
+      "class": "RecordArray",
+      "fields": [
+        "x",
+        "y"
+      ],
+      "contents": [
+        {
+          "class": "NumpyArray",
+          "primitive": "float64",
+          "inner_shape": [],
+          "parameters": {},
+          "form_key": "node2"
+        },
+        {
+          "class": "ListOffsetArray",
+          "offsets": "i64",
+          "content": {
+            "class": "NumpyArray",
+            "primitive": "int64",
+            "inner_shape": [],
+            "parameters": {},
+            "form_key": "node4"
+          },
+          "parameters": {},
+          "form_key": "node3"
+        }
+      ],
+      "parameters": {},
+      "form_key": "node1"
+    },
+    "parameters": {},
+    "form_key": "node0"
+  }
+}
+
+```
+
 ### Sparse Array
 
 There are a variety of ways to represent
@@ -227,7 +289,7 @@ order, but we cannot make requests like "rows 100-200". (Dask has the same
 limitation, for the same reason.)
 
 ```
-$ http :8000/metadata/long_table | jq .data.attributes.structure
+$ http :8000/api/v1/metadata/long_table | jq .data.attributes.structure
 ```
 
 ```json

diff --git a/docs/source/reference/python-client.md b/docs/source/reference/python-client.md
@@ -89,6 +89,30 @@ structure_families, and specs of its children along with their counts.
    tiled.client.container.Container.distinct
 ```
 
+And, finally, there are convenience methods for writing:
+
+
+```{eval-rst}
+.. autosummary::
+   :toctree: generated
+
+   tiled.client.container.Container.create_container
+   tiled.client.container.Container.write_array
+   tiled.client.container.Container.write_awkward
+   tiled.client.container.Container.write_dataframe
+   tiled.client.container.Container.write_sparse
+```
+
+and a low-level method for creating a new node to write into:
+
+
+```{eval-rst}
+.. autosummary::
+   :toctree: generated
+
+   tiled.client.container.Container.new
+```
+
 ## Structure Clients
 
 For each *structure family* ("array", "table", etc.) there is a client
@@ -133,6 +157,32 @@ Tiled currently includes two clients for each structure family:
    tiled.client.array.DaskArrayClient.read_block
    tiled.client.array.DaskArrayClient.read
    tiled.client.array.DaskArrayClient.export
+   tiled.client.array.DaskArrayClient.write
+   tiled.client.array.DaskArrayClient.write_block
+```
+
+```{eval-rst}
+.. autosummary::
+   :toctree: generated
+
+   tiled.client.array.ArrayClient
+   tiled.client.array.ArrayClient.read_block
+   tiled.client.array.ArrayClient.read
+   tiled.client.array.ArrayClient.export
+   tiled.client.array.ArrayClient.write
+   tiled.client.array.ArrayClient.write_block
+```
+
+### Awkward
+
+```{eval-rst}
+.. autosummary::
+   :toctree: generated
+
+   tiled.client.awkward.AwkwardClient
+   tiled.client.awkward.AwkwardClient.read
+   tiled.client.awkward.AwkwardClient.write
+   tiled.client.awkward.AwkwardClient.export
 ```
 
 ```{eval-rst}
@@ -154,6 +204,8 @@ Tiled currently includes two clients for each structure family:
    tiled.client.sparse.SparseClient
    tiled.client.sparse.SparseClient.read
    tiled.client.sparse.SparseClient.export
+   tiled.client.sparse.SparseClient.write
+   tiled.client.sparse.SparseClient.write_block
 ```
 
 ### DataFrame
@@ -166,6 +218,8 @@ Tiled currently includes two clients for each structure family:
    tiled.client.dataframe.DaskDataFrameClient.read_partition
    tiled.client.dataframe.DaskDataFrameClient.read
    tiled.client.dataframe.DaskDataFrameClient.export
+   tiled.client.dataframe.DaskDataFrameClient.write
+   tiled.client.dataframe.DaskDataFrameClient.write_partition
 ```
 
 ```{eval-rst}
@@ -175,7 +229,9 @@ Tiled currently includes two clients for each structure family:
    tiled.client.dataframe.DataFrameClient
    tiled.client.dataframe.DataFrameClient.read_partition
    tiled.client.dataframe.DataFrameClient.read
-   tiled.client.dataframe.DaskDataFrameClient.export
+   tiled.client.dataframe.DataFrameClient.export
+   tiled.client.dataframe.DataFrameClient.write
+   tiled.client.dataframe.DataFrameClient.write_partition
 ```
 
 ### Xarray Dataset

diff --git a/pyproject.toml b/pyproject.toml
@@ -49,6 +49,7 @@ all = [
     "anyio",
     "asyncpg",
     "appdirs",
+    "awkward >=2.4.3",
     "blosc",
     "cachetools",
     "cachey",
@@ -102,6 +103,7 @@ array = [
 # This is the "kichen sink" fully-featured client dependency set.
 client = [
     "appdirs",
+    "awkward >=2.4.3",
     "blosc",
     "click !=8.1.0",
     "dask[array]",
@@ -215,8 +217,9 @@ server = [
     "aiosqlite",
     "alembic",
     "anyio",
-    "asyncpg",
     "appdirs",
+    "asyncpg",
+    "awkward >=2.4.3",
     "blosc",
     "cachetools",
     "cachey",
@@ -310,6 +313,7 @@ exclude = '''
   )/
   | versioneer.py
   | tiled/_version.py
+  | tiled/serialization/_zipfile_py39.py
   | docs/source/conf.py
   | share
   | web-frontend

diff --git a/tiled/_tests/test_awkward.py b/tiled/_tests/test_awkward.py
@@ -0,0 +1,118 @@
+import io
+
+import awkward
+import pyarrow.feather
+import pyarrow.parquet
+
+from ..catalog import in_memory
+from ..client import Context, from_context, record_history
+from ..server.app import build_app
+from ..utils import APACHE_ARROW_FILE_MIME_TYPE
+
+
+def test_slicing(tmpdir):
+    catalog = in_memory(writable_storage=tmpdir)
+    app = build_app(catalog)
+    with Context.from_app(app) as context:
+        client = from_context(context)
+
+        # Write data into catalog. It will be stored as directory of buffers
+        # named like 'node0-offsets' and 'node2-data'.
+        array = awkward.Array(
+            [
+                [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
+                [],
+                [{"x": 3.3, "y": [1, 2, 3]}],
+            ]
+        )
+        returned = client.write_awkward(array, key="test")
+        # Test with client returned, and with client from lookup.
+        for aac in [returned, client["test"]]:
+            # Read the data back out from the AwkwardArrrayClient, progressively sliced.
+            assert awkward.almost_equal(aac.read(), array)
+            assert awkward.almost_equal(aac[:], array)
+            assert awkward.almost_equal(aac[0], array[0])
+            assert awkward.almost_equal(aac[0, "y"], array[0, "y"])
+            assert awkward.almost_equal(aac[0, "y", :1], array[0, "y", :1])
+
+            # When sliced, the serer sends less data.
+            with record_history() as h:
+                aac[:]
+            assert len(h.responses) == 1  # sanity check
+            full_response_size = len(h.responses[0].content)
+            with record_history() as h:
+                aac[0, "y"]
+            assert len(h.responses) == 1  # sanity check
+            sliced_response_size = len(h.responses[0].content)
+            assert sliced_response_size < full_response_size
+
+
+def test_export_json(tmpdir):
+    catalog = in_memory(writable_storage=tmpdir)
+    app = build_app(catalog)
+    with Context.from_app(app) as context:
+        client = from_context(context)
+
+        # Write data into catalog. It will be stored as directory of buffers
+        # named like 'node0-offsets' and 'node2-data'.
+        array = awkward.Array(
+            [
+                [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
+                [],
+                [{"x": 3.3, "y": [1, 2, 3]}],
+            ]
+        )
+        aac = client.write_awkward(array, key="test")
+
+        file = io.BytesIO()
+        aac.export(file, format="application/json")
+        actual = bytes(file.getbuffer()).decode()
+        assert actual == awkward.to_json(array)
+
+
+def test_export_arrow(tmpdir):
+    catalog = in_memory(writable_storage=tmpdir)
+    app = build_app(catalog)
+    with Context.from_app(app) as context:
+        client = from_context(context)
+
+        # Write data into catalog. It will be stored as directory of buffers
+        # named like 'node0-offsets' and 'node2-data'.
+        array = awkward.Array(
+            [
+                [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
+                [],
+                [{"x": 3.3, "y": [1, 2, 3]}],
+            ]
+        )
+        aac = client.write_awkward(array, key="test")
+
+        filepath = tmpdir / "actual.arrow"
+        aac.export(str(filepath), format=APACHE_ARROW_FILE_MIME_TYPE)
+        actual = pyarrow.feather.read_table(filepath)
+        expected = awkward.to_arrow_table(array)
+        assert actual == expected
+
+
+def test_export_parquet(tmpdir):
+    catalog = in_memory(writable_storage=tmpdir)
+    app = build_app(catalog)
+    with Context.from_app(app) as context:
+        client = from_context(context)
+
+        # Write data into catalog. It will be stored as directory of buffers
+        # named like 'node0-offsets' and 'node2-data'.
+        array = awkward.Array(
+            [
+                [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
+                [],
+                [{"x": 3.3, "y": [1, 2, 3]}],
+            ]
+        )
+        aac = client.write_awkward(array, key="test")
+
+        filepath = tmpdir / "actual.parquet"
+        aac.export(str(filepath), format="application/x-parquet")
+        actual = pyarrow.parquet.read_table(filepath)
+        expected = awkward.to_arrow_table(array)
+        assert actual == expected