Custom dataloader for scvi-tools #1826
Comments
@jkobject Concerning scDataloader: it is currently set up for BERT-like loading using a max_len argument. I assume that, for every cell, it loads the 1000 most expressed genes. This structure is not supported by scvi-tools (we expect the same genes to be represented for all cells). Can you provide some help with the how argument? |
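To make the mismatch concrete, here is a minimal NumPy sketch contrasting the two loading styles on a toy matrix. It does not use scDataLoader or scvi-tools; the function names `top_k_per_cell` and `fixed_gene_subset`, the gene names, and the counts are all hypothetical and only illustrate why per-cell top-k selection yields columns that differ between cells, while a fixed gene list does not.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy count matrix: 4 cells x 6 genes (values are made up for illustration).
counts = rng.poisson(lam=2.0, size=(4, 6))
gene_names = np.array(["g0", "g1", "g2", "g3", "g4", "g5"])


def top_k_per_cell(x, k):
    """BERT-like loading: each cell keeps its own k most expressed genes."""
    # Sort genes by descending expression within each cell; a stable sort
    # keeps ties in their original column order.
    idx = np.argsort(-x, axis=1, kind="stable")[:, :k]
    values = np.take_along_axis(x, idx, axis=1)
    return idx, values


def fixed_gene_subset(x, names, genelist):
    """Fixed-gene loading: the same columns, in the same order, for every cell."""
    lookup = {name: i for i, name in enumerate(names)}
    cols = np.array([lookup[g] for g in genelist])
    return x[:, cols]


# Column j of `values` can refer to a different gene in each row (see `idx`).
idx, values = top_k_per_cell(counts, k=3)
# Column j of `fixed` always refers to the same gene.
fixed = fixed_gene_subset(counts, gene_names, ["g1", "g4", "g5"])
```

With the fixed subset, a model can treat each column as one gene; with the top-k output it would also need `idx` to know which gene each value belongs to.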
Thanks for opening the issue, here! Can you confirm the objective? I understand that by leveraging If so, can you help clarify what exactly is needed from
|
@canergen if we integrate the underlying |
Hello all, I will get back to you in a week about this and about potential updates to scDataLoader to bring it fully in line with scvi-tools. I have already opened a PR on the scverse community packages, scverse/ecosystem-packages#195, so maybe this discussion will find a better place in that PR?
scDataloader is set up to work in multiple ways, to accommodate the new Geneformer, scGPT and scPRINT models, which don't always receive the same set of genes. However, it also works when always choosing the same set of genes. how is a parameter of the Collator and is defined in its documentation here:
If you know which genes you want to use, you can use "some" and pass a list of genes, e.g.

from scdataloader import DataModule
import tqdm

# NAME and most_variable_genes_in_data must be defined by the user beforehand.
datamodule = DataModule(
    collection_name=NAME,
    organisms=["NCBITaxon:9606"],  # organism that we will work on
    how="some",  # for the collator
    genelist=most_variable_genes_in_data,  # not recommended (as it will highlight batch effects)
    batch_size=64,
    num_workers=1,
    validation_split=0.1,
    test_split=0,
)
# Print the first training minibatch, then stop.
for i in tqdm.tqdm(datamodule.train_dataloader()):
    print(i)
    break

What I would like to know, @canergen, is whether scdataloader is already acceptable in its current state, or which specific parameters, default values and functions it is missing to be suitable for scvi-tools. |
The current comparison is to a huge h5ad file, which doesn't scale to more than 100 million cells (and is already more expensive below these numbers). While we have support for CELLXGENE census, we don't support all tiledbsoma databases, and CxG census is restrictive when it comes to extending it to the whole scvi-tools library. I assume both dataloaders are in line with our requirements (providing a pipeline of dictionary values). My take is that MappedCollection can provide a dictionary with categorical/integer/float values for every obs column in the underlying AnnData object. However, there are some details. |
I am not sure I understand everything, but you seem to say that scdataloader is already in line with scvi-tools, except that we need to enforce that the other elements in the output dictionary (e.g., class, batch, ...) are categoricals encoded as integers. That functionality is already there: in my example and current use case, everything is transformed so that the encoding is available via the "decoders" property of the datamodule or the "encoder" property of the PyTorch datasets. Given how the mapped collection is implemented, multimodal data is quite straightforward: it just amounts to accessing another field of the underlying h5ad format. The user would receive 'X_rna', 'X_atac', ... in the output dictionary of tensors instead of just 'X'. |
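As an illustration of the integer encoding discussed above, here is a minimal sketch in plain NumPy. It is not the scDataLoader implementation: `encode_categorical` is a hypothetical helper, the donor labels are invented, and the `minibatch` dictionary only mimics the 'X_rna' / 'X_atac' layout mentioned in the comment, with NumPy arrays standing in for tensors.

```python
import numpy as np


def encode_categorical(values):
    """Map string labels to integer codes and return a code -> label decoder."""
    # np.unique returns the sorted categories and, for each input value,
    # the index of its category.
    categories, codes = np.unique(np.asarray(values), return_inverse=True)
    decoder = {int(i): str(c) for i, c in enumerate(categories)}
    return codes.astype(np.int64), decoder


# Hypothetical obs column for a minibatch of 4 cells.
obs_batch = ["donor_b", "donor_a", "donor_b", "donor_c"]
codes, decoder = encode_categorical(obs_batch)

# Hypothetical multimodal minibatch: one entry per modality plus the
# integer-encoded covariate (array shapes are arbitrary placeholders).
minibatch = {
    "X_rna": np.zeros((4, 5), dtype=np.float32),
    "X_atac": np.zeros((4, 8), dtype=np.float32),
    "batch": codes,
}
```

Keeping the decoder next to the codes lets the original labels be recovered after training, which is the role the "decoders" property plays in the description above.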
Hi, as discussed in our recent meeting, we can work with custom dataloaders within scvi-tools. This requires a torch DataModule that loads a dictionary of keyword arguments for each minibatch. We have a working example for CELLXGENE census: https://github.com/chanzuckerberg/cellxgene-census/blob/ebezzi/census-scvi-datamodule/api/python/notebooks/experimental/pytorch_loader_scvi.ipynb (see the imports to find the actual datamodule code). You will find that the notebook requires custom code to load and save models. We have built on top of this solution and changed the scvi-tools code to work without a registered AnnData object. I'm currently setting up a Colab notebook with our code and will update here.
Next steps would be feedback on our setup_registry function, which takes a datamodule and populates all fields within an scvi-tools model. We currently only support scvi and would like to support more models. I expect that this will be a joint effort, as things might break outside of scvi-tools (things will certainly not work for the multimodal totalVI or multiVI models).
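For readers unfamiliar with the "dictionary per minibatch" contract, the following is a rough, framework-agnostic sketch. It uses NumPy arrays as a stand-in for torch tensors and a plain generator as a stand-in for a DataModule's dataloader; the key names "X", "batch" and "labels" and the (n, 1) shape of the covariates are assumptions made for illustration, and nothing here reproduces setup_registry or any scvi-tools API.

```python
import numpy as np


def iter_minibatches(x, batch_codes, label_codes, batch_size):
    """Yield one dictionary per minibatch, with the same keys every time."""
    n_cells = x.shape[0]
    for start in range(0, n_cells, batch_size):
        stop = min(start + batch_size, n_cells)
        yield {
            "X": x[start:stop],  # same genes (columns) for all cells
            "batch": batch_codes[start:stop].reshape(-1, 1),  # integer codes
            "labels": label_codes[start:stop].reshape(-1, 1),  # integer codes
        }


# Toy data: 10 cells x 3 genes, two batches, a single label.
x = np.arange(10 * 3, dtype=np.float32).reshape(10, 3)
batch_codes = np.array([0, 1] * 5, dtype=np.int64)
label_codes = np.zeros(10, dtype=np.int64)

batches = list(iter_minibatches(x, batch_codes, label_codes, batch_size=4))
```

The point of the sketch is only that every minibatch exposes the same keys with consistent dtypes and a fixed gene dimension, so a model's forward pass can consume it as keyword arguments.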