Add CosemDataset object-oriented api #42

Draft: wants to merge 5 commits into main

Conversation

tlambert03
Contributor

Hey @d-v-b ... curious what you think about adding an object like this. I use it in a little pet project as a slightly more interactive way to work with COSEM datasets.

In [1]: from fibsem_tools._dataset import CosemDataset

In [2]: CosemDataset.all_names()
Out[2]:
['jrc_hela-h89-1',
 'jrc_hela-1',
 ...,
]

In [3]: ds = CosemDataset('jrc_hela-3')

In [4]: list(ds.sources)
Out[4]:
['er-mem_pred',
 'endo_pred',
  ...,
]

In [5]: ds.read_source('nucleolus_seg')
Out[5]:
<xarray.DataArray 'array-c0a700bda7e9803a4720b383f9e7adfa' (z: 12000, y: 1000,
                                                            x: 12400)>
dask.array<array, shape=(12000, 1000, 12400), dtype=uint16, chunksize=(256, 256, 256), chunktype=numpy.ndarray>
Coordinates:
  * z        (z) float64 0.0 3.24 6.48 9.72 ... 3.887e+04 3.887e+04 3.888e+04
  * y        (y) float64 0.0 4.0 8.0 12.0 ... 3.988e+03 3.992e+03 3.996e+03
  * x        (x) float64 0.0 4.0 8.0 12.0 ... 4.959e+04 4.959e+04 4.96e+04

Let me know what you think. If it's something you'd want/consider including here, I can complete the docstrings and clean it up a bit.

@tlambert03
Contributor Author

btw ... looks like tests are only running against master, which doesn't exist?

https://github.com/janelia-cosem/fibsem-tools/blob/d21f095d8ebb4bdac7fb77b38caf6d8e40729d04/.github/workflows/pytest.yml#L2-L7

@d-v-b
Contributor

d-v-b commented Apr 16, 2023

@tlambert03 I love the idea! Thanks for putting that together. If you are willing to iterate on this a bit, it might be nice to consider using datatrees to represent collections. I will definitely look through the PR more thoroughly in the next few days

As for CI... thanks for reminding me. Getting that cleaned up and working has been on my to-do list for a while, and docs too. PRs are welcome, but I can also just sort it out myself this week.

@tlambert03
Contributor Author

tlambert03 commented Apr 16, 2023

If you are willing to iterate on this a bit, it might be nice to consider using datatrees to represent collections.

yep, more than happy to play around with it. After taking a quick look, I think my first question is probably: what exactly should I be representing as DataArray, Dataset, and Datatree?

  • root s3 bucket
  • dataset ('jrc_hela-3')
  • sources ('er-mem_pred')
  • scales (0, 1, etc...)

do you want to sketch out roughly what structures you imagine for each? Would there be a single root Datatree representing the full bucket, Datasets for each cosem dataset, and arrays for each source/scale combo? (Or would each cosem dataset itself be a Datatree?)

def images(self) -> dict[str, ImageDict]:
    return self.metadata["images"]

def read_image(self, image_id: str, level: int = 0) -> xr.DataArray:
Contributor


I think this could be a place to return a datatree, e.g. via something like this:

import os

from datatree import DataTree  # xarray-datatree
from fibsem_tools import read, read_xarray  # assuming these are exported at the package top level

def multiscale_datatree(group_url: str) -> DataTree:
    # open the group and list its child arrays
    group = read(group_url)
    # this sorts on the key name (assuming names like 's0', 's1', ...),
    # but a better approach is to sort by descending array size
    array_keys = sorted(
        (name for name, _ in group.arrays()), key=lambda v: int(v[1:])
    )
    # read each scale level as an xarray.DataArray and assemble a DataTree
    arrays = {
        arr_name: read_xarray(os.path.join(group_url, arr_name))
        for arr_name in array_keys
    }
    return DataTree.from_dict(arrays, name=group.basename)

Then you don't need to worry about validating the 'level' kwarg
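For what it's worth, a quick usage sketch of that helper; the URL is a made-up placeholder, and the s0/s1/... naming of the scale arrays is an assumption:

tree = multiscale_datatree("s3://some-bucket/jrc_hela-3/some_image")  # hypothetical group URL
tree["s0"]  # child node holding the highest-resolution array
tree["s1"]  # next scale level down

Picking a scale then becomes a key lookup on the tree rather than an integer level that has to be validated.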

Contributor


and maybe rename the method that returns the arrays to .images, and use the name image_meta for the method that just returns image metadata?

return dataset_names()

@staticmethod
def all_thumbnails(ncols: int = 4, **kwargs) -> plt.Figure:
Contributor


possible alternative name: thumbnail_gallery, or just gallery

)

@staticmethod
def all_names() -> list[str]:
Contributor


I think if we use a top-level Mapping-like object where the keys are the names of datasets and the values are instances of CosemDataset, we can replace this with the .keys method on that object
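A minimal sketch of what that could look like, reusing CosemDataset from this PR; the class name and details are placeholders, not settled API:

from collections.abc import Mapping
from typing import Iterator

# CosemDataset comes from this PR; dataset_names() is assumed to be importable
# from the same module (it appears in the diff above)
from fibsem_tools._dataset import CosemDataset, dataset_names

class CosemDatasets(Mapping):
    # hypothetical top-level mapping: dataset name -> CosemDataset
    def __getitem__(self, name: str) -> CosemDataset:
        return CosemDataset(name)

    def __iter__(self) -> Iterator[str]:
        return iter(dataset_names())

    def __len__(self) -> int:
        return len(dataset_names())

Then CosemDatasets().keys() gives the same information as all_names(), and iteration and membership tests come for free.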

@d-v-b
Contributor

d-v-b commented Apr 18, 2023

yep, more than happy to play around with it. After taking a quick look, I think my first question is probably: what exactly should I be representing as DataArray, Dataset, and Datatree?

  • root s3 bucket
  • dataset ('jrc_hela-3')
  • sources ('er-mem_pred')
  • scales (0, 1, etc...)

do you want to sketch out roughly what structures you imagine for each? Would there be a single root Datatree representing the full bucket, Datasets for each cosem dataset, and arrays for each source/scale combo? (Or would each cosem dataset itself be a Datatree?)

top level: some kind of key-value interface, where the keys are dataset names and the values are ~CosemDataset objects. Besides an instance of a supabase.Client and references to API keys, I can't think of anything else I would want from this object (at the moment). If you're OK with me committing to your branch, I could put this in.

dataset: I think the overall structure of the CosemDataset as you've put it together is good.

sources: after playing around with a quick hack, I think a Datatree of DataArrays is definitely the right fit for expressing the multiscaleness of an image. So one option would be to represent the images that belong to a dataset as ~Mapping[str, Datatree[DataArray]] (I have no idea how to represent meshes in the xarray model, so that's something important to think about).

A similar interface could be achieved by using Datatree[Datatree[DataArray]]. There might be some advantage to this, but I'm not really sure what that would be at the moment, especially considering that eagerly constructing a Datatree[DataArray] representation for each image associated with a dataset is super slow when a dataset has a large number of images (e.g., jrc_hela-2 has 70+). So unless we can implement a performant asynchronous API for datatree and zarr (big doubt), we should probably not use this approach.
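As a rough sketch of the Mapping[str, Datatree[DataArray]] shape with construction deferred, reusing the multiscale_datatree helper from the review comment above (the image_meta structure and its "url" field are assumptions about what the dataset metadata provides):

from collections.abc import Mapping
from typing import Iterator

from datatree import DataTree  # xarray-datatree

class LazyImages(Mapping):
    # hypothetical Mapping[str, DataTree] that only touches the store on lookup
    def __init__(self, image_meta: dict[str, dict]) -> None:
        # image name -> metadata dict, assumed to carry a "url" key
        self._meta = image_meta

    def __getitem__(self, name: str) -> DataTree:
        # build the multiscale DataTree only when this image is requested,
        # using multiscale_datatree as sketched earlier in this review
        return multiscale_datatree(self._meta[name]["url"])

    def __iter__(self) -> Iterator[str]:
        return iter(self._meta)

    def __len__(self) -> int:
        return len(self._meta)

Nothing is read from storage until a specific image is indexed, which avoids eagerly building 70+ DataTrees for a dataset like jrc_hela-2.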

One sticky issue that will bite us is that the current design of the supabase / postgres setup is bad -- I am exposing the structure of the raw database tables, but I need to introduce a layer of indirection so that I can make changes to the table structure without wrecking clients like this one here. I'm still mulling over how best to do that...

@tlambert03
Contributor Author

Your data tree branch is fine too... would you prefer to just go with that and close this?

@d-v-b
Contributor

d-v-b commented Apr 18, 2023

I think my branch sucks! Much better if we build off yours :)

@tlambert03
Contributor Author

well, most of what we've got here is "extra" API ... but you raise a good point that it builds on the supabase schema, and perhaps we should minimize the extent to which we depend on that for now? It could be incrementally expanded in the future, and a simple key -> Datatree map (dataset name to Datatree object) might be a good place to start here. It would be very nice to be able to load views too in some sense... but I'm not sure what the best entry point for that is.

my "dream" application here is to be using the neuroglancer cosem app to explore the data at various scales... maybe come across a field of view that you really like (that you want to do something with), copy the URL or some json state, pass it to a python function in this library and have an xarray object to work with, that includes all of the sources/channels that had loaded in the view, perhaps cropped to a specific region (etc...)

but much of that detail could be hidden, keeping a relatively minimal user API until the database schema stabilizes a bit more?

@tlambert03
Contributor Author

If you're OK with me committing to your branch, I could put this in.

btw, more than happy for you to commit directly!
