Add CosemDataset object-oriented api #42

Draft: wants to merge 5 commits into main

Conversation

tlambert03
Contributor

Hey @d-v-b ... curious what you think about adding an object like this. I use it in a little pet project as a slightly more interactive way to work with COSEM datasets.

In [1]: from fibsem_tools._dataset import CosemDataset

In [2]: CosemDataset.all_names()
Out[2]:
['jrc_hela-h89-1',
 'jrc_hela-1',
 ...,
]

In [3]: ds = CosemDataset('jrc_hela-3')

In [4]: list(ds.sources)
Out[4]:
['er-mem_pred',
 'endo_pred',
  ...,
]

In [5]: ds.read_source('nucleolus_seg')
Out[5]:
<xarray.DataArray 'array-c0a700bda7e9803a4720b383f9e7adfa' (z: 12000, y: 1000,
                                                            x: 12400)>
dask.array<array, shape=(12000, 1000, 12400), dtype=uint16, chunksize=(256, 256, 256), chunktype=numpy.ndarray>
Coordinates:
  * z        (z) float64 0.0 3.24 6.48 9.72 ... 3.887e+04 3.887e+04 3.888e+04
  * y        (y) float64 0.0 4.0 8.0 12.0 ... 3.988e+03 3.992e+03 3.996e+03
  * x        (x) float64 0.0 4.0 8.0 12.0 ... 4.959e+04 4.959e+04 4.96e+04

Let me know what you think. If it's something you'd want/consider including here, I can complete the docstrings and clean it up a bit.

@tlambert03
Contributor Author

btw ... looks like tests are only running against master, which doesn't exist?

https://github.com/janelia-cosem/fibsem-tools/blob/d21f095d8ebb4bdac7fb77b38caf6d8e40729d04/.github/workflows/pytest.yml#L2-L7

@d-v-b
Contributor

d-v-b commented Apr 16, 2023

@tlambert03 I love the idea! Thanks for putting that together. If you are willing to iterate on this a bit, it might be nice to consider using datatrees to represent collections. I will definitely look through the PR more thoroughly in the next few days

As for CI... thanks for reminding me. Getting that cleaned up and working has been on my to-do list for a while, and docs too. PRs are welcome, but I can also just sort it out myself this week.

@tlambert03
Contributor Author

tlambert03 commented Apr 16, 2023

If you are willing to iterate on this a bit, it might be nice to consider using datatrees to represent collections.

yep, more than happy to play around with it. After taking a quick look, I think my first question is probably: what exactly should I be representing as DataArray, Dataset, and Datatree?

  • root s3 bucket
  • dataset ('jrc_hela-3')
  • sources ('er-mem_pred')
  • scales (0, 1, etc...)

do you want to sketch out roughly what structures you imagine for each? Would there be a single root Datatree representing the full bucket, Datasets for each cosem dataset, and arrays for each source/scale combo? (Or would each cosem dataset itself be a Datatree?)

def images(self) -> dict[str, ImageDict]:
    return self.metadata["images"]

def read_image(self, image_id: str, level: int = 0) -> xr.DataArray:
Contributor


I think this could be a place to return a datatree, e.g. via something like this:

import os

from datatree import DataTree  # xarray-datatree
from fibsem_tools import read, read_xarray  # assuming these are exported at the package top level

def multiscale_datatree(group_url: str) -> DataTree:
    # open the group and list its child arrays
    group = read(group_url)
    # this sorts on the key name (assuming names like 's0', 's1', ...),
    # but a better approach is to sort by descending array size
    array_keys = sorted(
        (name for name, _ in group.arrays()), key=lambda v: int(v[1:])
    )
    # read each scale level as an xarray.DataArray and assemble a DataTree
    arrays = {
        arr_name: read_xarray(os.path.join(group_url, arr_name))
        for arr_name in array_keys
    }
    return DataTree.from_dict(arrays, name=group.basename)

Then you don't need to worry about validating the 'level' kwarg
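For what it's worth, a quick usage sketch of that helper; the URL is a made-up placeholder, and the s0/s1/... naming of the scale arrays is an assumption:

tree = multiscale_datatree("s3://some-bucket/jrc_hela-3/some_image")  # hypothetical group URL
tree["s0"]  # child node holding the highest-resolution array
tree["s1"]  # next scale level down

Picking a scale then becomes a key lookup on the tree rather than an integer level that has to be validated.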

Contributor


and maybe rename the method that returns the arrays to .images, and use the name image_meta for the method that just returns image metadata?

return dataset_names()

@staticmethod
def all_thumbnails(ncols: int = 4, **kwargs) -> plt.Figure:
Contributor


possible alternative name: thumbnail_gallery, or just gallery

)

@staticmethod
def all_names() -> list[str]:
Contributor


I think if we use a top-level Mapping-like object where the keys are the names of datasets and the values are instances of CosemDataset, we can replace this with the .keys method on that object
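A minimal sketch of what that could look like, reusing CosemDataset from this PR; the class name and details are placeholders, not settled API:

from collections.abc import Mapping
from typing import Iterator

# CosemDataset comes from this PR; dataset_names() is assumed to be importable
# from the same module (it appears in the diff above)
from fibsem_tools._dataset import CosemDataset, dataset_names

class CosemDatasets(Mapping):
    # hypothetical top-level mapping: dataset name -> CosemDataset
    def __getitem__(self, name: str) -> CosemDataset:
        return CosemDataset(name)

    def __iter__(self) -> Iterator[str]:
        return iter(dataset_names())

    def __len__(self) -> int:
        return len(dataset_names())

Then CosemDatasets().keys() gives the same information as all_names(), and iteration and membership tests come for free.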

@d-v-b
Contributor

d-v-b commented Apr 18, 2023

yep, more than happy to play around with it. After taking a quick look, I think my first question is probably: what exactly should I be representing as DataArray, Dataset, and Datatree?

  • root s3 bucket
  • dataset ('jrc_hela-3')
  • sources ('er-mem_pred')
  • scales (0, 1, etc...)

do you want to sketch out roughly what structures you imagine for each? Would there be a single root Datatree representing the full bucket, Datasets for each cosem dataset, and arrays for each source/scale combo? (Or would each cosem dataset itself be a Datatree?)

top level: some kind of key-value interface, where the keys are dataset names and the values are ~CosemDataset objects. Besides an instance of a supabase.Client and references to API keys, I can't think of anything else I would want from this object (at the moment). If you're OK with me committing to your branch, I could put this in.

dataset: I think the overall structure of the CosemDataset as you've put it together is good.

sources: after playing around with a quick hack, I think a Datatree of DataArrays is definitely the right fit for expressing the multiscaleness of an image. So one option would be to represent the images that belong to a dataset as ~Mapping[str, Datatree[DataArray]] (I have no idea how to represent meshes in the xarray model, so that's something important to think about).

A similar interface could be achieved by using Datatree[Datatree[DataArray]]. There might be some advantage to this, but I'm not really sure what that would be at the moment, especially considering that eagerly constructing a Datatree[DataArray] representation for each image associated with a dataset is super slow when a dataset has a large number of images (e.g., jrc_hela-2 has 70+). So unless we can implement a performant asynchronous API for datatree and zarr (big doubt), we should probably not use this approach.
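As a rough sketch of the Mapping[str, Datatree[DataArray]] shape with construction deferred, reusing the multiscale_datatree helper from the review comment above (the image_meta structure and its "url" field are assumptions about what the dataset metadata provides):

from collections.abc import Mapping
from typing import Iterator

from datatree import DataTree  # xarray-datatree

class LazyImages(Mapping):
    # hypothetical Mapping[str, DataTree] that only touches the store on lookup
    def __init__(self, image_meta: dict[str, dict]) -> None:
        # image name -> metadata dict, assumed to carry a "url" key
        self._meta = image_meta

    def __getitem__(self, name: str) -> DataTree:
        # build the multiscale DataTree only when this image is requested,
        # using multiscale_datatree as sketched earlier in this review
        return multiscale_datatree(self._meta[name]["url"])

    def __iter__(self) -> Iterator[str]:
        return iter(self._meta)

    def __len__(self) -> int:
        return len(self._meta)

Nothing is read from storage until a specific image is indexed, which avoids eagerly building 70+ DataTrees for a dataset like jrc_hela-2.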

One sticky issue that will bite us is that the current design of the supabase / postgres setup is bad -- I am exposing the structure of the raw database tables, but I need to introduce a layer of indirection so that I can make changes to the table structure without wrecking clients like this one here. I'm still mulling over how best to do that...

@tlambert03
Contributor Author

Your data tree branch is fine too... would you prefer to just go with that and close this?

@d-v-b
Contributor

d-v-b commented Apr 18, 2023

I think my branch sucks! Much better if we build off yours :)

@tlambert03
Contributor Author

well, most of what we've got here is "extra" API ... but you raise a good point that it builds on the supabase schema, and perhaps we should minimize the extent to which we depend on that for now? It could be incrementally expanded in the future, and a simple key -> Datatree map (dataset name to Datatree object) might be a good place to start here. It would be very nice to be able to load views too in some sense... but I'm not sure what the best entry point for that is.

my "dream" application here is to be using the neuroglancer cosem app to explore the data at various scales... maybe come across a field of view that you really like (that you want to do something with), copy the URL or some json state, pass it to a python function in this library and have an xarray object to work with, that includes all of the sources/channels that had loaded in the view, perhaps cropped to a specific region (etc...)

but much of that detail could be hidden, keeping a relatively minimal user API until the database schema stabilizes a bit more?

@tlambert03
Contributor Author

If you're OK with me committing to your branch, I could put this in.

btw, more than happy for you to commit directly!
