Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE REQUEST] .CXG (TileDb compliant) format support from single-cell-explorer to cellxgene Annotate #2655

Open
Neah-Ko opened this issue Jan 15, 2024 · 5 comments

Comments

@Neah-Ko
Copy link

Neah-Ko commented Jan 15, 2024

Hello and happy new year cellxgene team.

I am working on cellxgene integration to our software stack for my structure @bag-cnag.

By browsing the repositories here and there, I noticed that single-cell-explorer which is the

Hosted version of cellxgene

Is working with your .cxg/ file structure format which is TileDb based. Annotate however, only takes as input .h5ad files backed by Anndata.

We are considering multiple options to host our files, and are interested by the possible performances offered by TileDb.

Question

Can we expect this feature to be available on Annotate ?

From my understanding the codebase on single-cell-explorer is essentially a fork of annotate, thus it should be in principle integrable.

I will most likely have to work on a PoC on this feature. If you are interested, I could extend a pull request.

@MaximilianLombardo
Copy link

Hi @Neah-Ko !

Thanks for your question! Responding inline:

Can we expect this feature (support ingestion of .cxg) to be available on Annotate ?

We currently do not have plans to incorporate this feature in Annotate in the near future, that being said our roadmap has not been fully established for the latter half of 2024, so its not totally out of the realm of possibilities (if it becomes enabling or required for some other higher priority features, this would incentivize us to commit some resources towards development).

I will most likely have to work on a PoC on this feature. If you are interested, I could extend a pull request.

We really appreciate this offer and if this is a high priority/urgent feature for you, I think it makes sense for you to work on it. If you do end up implementing the feature, we'd definitely like to hear about your experience and technical discovery work on it. We can't promise that the PR would be merged into the main branch since our engineering resources are already committed to higher priority work, and we would likely not have time to review. That being said, hearing about your experience and learnings while implementing it would be invaluable for us and we definitely encourage you to report back with any information that might inform a future direction/implementation of the feature in Annotate.

We sincerely value your understanding and enthusiasm for contributing to the CELLxGENE community. Please do continue to share your ideas and feedback with us.

@Neah-Ko
Copy link
Author

Neah-Ko commented Jan 19, 2024

Hello @MaximilianLombardo

So I went on with frontporting the feature, with way more success than I initially expected to be honest :)

You may find the code at #2656
It comes with a Dockerfile in order to test it

docker build . -f Dockerfile_editable -t cellxgene:<tag>
docker run -it  -p 5005:5005 -v /path/to/datasets:/data cellxgene:<tag> launch /data/dataset.[h5ad | cxg/] --host 0.0.0.0 --verbose

Note: The Dockerfile is made to be rebuilt to test changes on sources. For me, installing cellxgene in editable mode didn't worked because then react doesn't find the templates anymore. I had a look on stackoverflow and it may or may not be arranged by tweaking pyproject.toml

Behavior

So far, it seems that both file formats can be read in, with no concerning error in the logs ⭐
However, I cannot guarantee yet that it is full-proof. I mostly used example-datasets/pbmc3k.h5ad and single-cell-explorer->example-datasets/pbmc3k.cxg/ for my tests.

In principle, I merged the different parts into an up to date version of CXG, so if something breaks it is probably going to be on the CXG side.

I've made some modest attempts at factoring, but it probably deserves another pass.

I am going to detail some of my choices and findings:

FBS schema

As mentioned above, I merged into Annotate. I decided to try without regenerating the schema.

To support cxg dataset, I had to propagate some num_bins argument value around server/common/fbs/matrix.py#encode_matrix_fbs() to eventually not use it.
In the single-cell-explorer repo, it contributes deciding of a "lossy" array_type when serializing, so we might be loosing something in the process.

To dig more into it, I would need access to at least the compilation chains sent to flatc

Conversion

cellxgene reads the dataset when converted with the primitives in server/converters/h5ad_data_file.py coming from v0.16.3 (pr features the hash of the commit)

However, I have to load it setting the flag
Else, I get this error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/server/converters/h5ad_data_file.py", line 48, in __init__
    self.extract_metadata_about_dataset()
  File "/usr/local/lib/python3.11/site-packages/server/converters/h5ad_data_file.py", line 204, in extract_metadata_about_dataset
    raise ValueError("Unknown source file schema version is unsupported.")
ValueError: Unknown source file schema version is unsupported.

It doesn't reads when conversion is done using cxgtool

Configuration

The configuration scheme evolved from single-cell-explorer to here. At the moment I am not fully certain that all parts are going to be consumed and not taken from the default.

I'm almost duplicating the dataset entry with a default_dataset copied from single-cell-explorer. We may rethink about naming and factoring.

CxgDataset parameters

To enable annotation support and user genesets creation I had to tweak the function _validate_and_initialize to recover them from the schema, whereas it is being set in _alias_annotation_names on anndata_adaptor side, and I am not 100% sure that the two are yielding consistent results. Especially since there's some renaming happening on the second one.

In general I could have overlooked something from the many configuration files.

Let me know what you think of it, I am going to scan and bring minor improvements where I can next week and try to merge and run both data_adaptors unit tests.

@Neah-Ko
Copy link
Author

Neah-Ko commented Jan 24, 2024

Hello @MaximilianLombardo ,

Update

As I mentioned above, the pull request now also features test/unit/test_cxg_dataset.py, which runs fine but fails on creating a lossy array (i.e. test_tdb_bug_lossy).

I can see that on the single-cell-explorer side, the primitives located at server/common/fbs/fbs_coders.py are doing some encoding optimization on serialization (E.g. the DenseNumericIntCoder that stores float{16,32,64} in an int16 array).

At the moment, I don't think I can properly support them in the pull request as they are calling some NetEncodings that don't belong to the FBS sources for Annotate, and in particular this Int16EncodedXFBArray to perform the mentioned optimization. (Also, no dict type, only a JSON one).

If I could have access to some documentation on how the schema is re-generated, that would be really helpful.

Suggestion

While browsing for TileDb <-> Anndata conversion tools, I stumbled onto TileDb-SOMA from a collaboration between you CZI and TileDb.

In my opinion that would be a great backend for cellxgene, benefiting from TileDb optimizations and able to output an Anndata object.

That could also contribute towards reducing cellxgene RAM footprint, which is the top challenge at the moment.

We are pretty committed on using cellxgene for visualization here, we would benefit from improvements. So let me know if I may contribute.

Best,

@jbownzino
Copy link

+1 to this request!

1 similar comment
@mohammed-hussain1259
Copy link

+1 to this request!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants