[bat_int] hcla dataset cannot be processed #359

Open
KaiWaldrant opened this issue Feb 5, 2024 · 1 comment
Labels
batch_integration (relates to task batch_integration), bug (something isn't working), cellxgene_census (relates to cellxgene_census dataset)

Comments

@KaiWaldrant
Contributor

Describe the bug
Processing the cellxgene_census dataset hcla with the batch integration dataset processor raises an error:

Traceback (most recent call last):
  File "/tmp/nxf.DNtRDk7Ba7/.viash_script.sh", line 60, in <module>
    adata_with_hvg = compute_batched_hvg(input, n_hvgs=par['hvgs'])
  File "/tmp/nxf.DNtRDk7Ba7/.viash_script.sh", line 49, in compute_batched_hvg
    hvg_list = scib.pp.hvg_batch(
  File "/usr/local/lib/python3.10/site-packages/scib/preprocessing.py", line 504, in hvg_batch
    sc.pp.highly_variable_genes(
  File "/usr/local/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 469, in highly_variable_genes
    hvg = _highly_variable_genes_single_batch(
  File "/usr/local/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 248, in _highly_variable_genes_single_batch
    df['mean_bin'] = pd.cut(
  File "/usr/local/lib/python3.10/site-packages/pandas/core/reshape/tile.py", line 293, in cut
    fac, bins = _bins_to_cuts(
  File "/usr/local/lib/python3.10/site-packages/pandas/core/reshape/tile.py", line 421, in _bins_to_cuts
    raise ValueError(
ValueError: Bin edges must be unique: array([      -inf, 0.00013226, 0.00014937, 0.00014937, 0.00016693,
       0.00016958, 0.00016958, 0.00018138, 0.0001983 , 0.0001983 ,
       0.00020949, 0.00020964, 0.00021124, 0.00030184, 0.00034767,
       0.00037922, 0.00048056, 0.00062685, 0.00096363, 0.007547  ,
              inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg
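
The failing call at the bottom of the traceback can be reproduced in isolation. The following is a minimal sketch, not the scanpy code itself: the `means` and `edges` values are made up for illustration and only mimic the repeated edges in the array above. It shows that `pd.cut` rejects repeated bin edges by default and accepts them with `duplicates="drop"`, as the error message suggests.

```python
import numpy as np
import pandas as pd

# Hypothetical per-gene means (illustrative values, not taken from hcla).
means = pd.Series([0.0001, 0.0002, 0.0002, 0.0003, 0.0005])

# Bin edges containing a repeated value, mimicking the array in the error.
edges = np.array([-np.inf, 0.00015, 0.0002, 0.0002, 0.0004, np.inf])

# Default behaviour (duplicates="raise"): repeated edges raise a ValueError.
try:
    pd.cut(means, edges)
    error_message = None
except ValueError as err:
    error_message = str(err)

# Dropping the repeated edge lets the same call succeed with one bin fewer.
binned = pd.cut(means, edges, duplicates="drop")
print(error_message)
print(binned.cat.categories)
```

Whether dropping edges is an acceptable fix inside the HVG computation is a separate question, since it changes the number of bins used for normalising dispersions.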

https://tower.nf/orgs/openproblems-bio/workspaces/openproblems-bio/watch/4L4a4swu0PnnpT

To Reproduce
Steps to reproduce the behavior:

bash src/tasks/batch_integration/resources_scripts/process_datasets.sh

KaiWaldrant added the bug, batch_integration, and cellxgene_census labels on Feb 5, 2024.
@rcannood
Member

rcannood commented Mar 15, 2024

Another way of reproducing the issue:

aws s3 sync "s3://openproblems-data/resources/datasets/cellxgene_census/hcla/log_cp10k/" "resources/datasets/cellxgene_census/hcla/log_cp10k/"

viash run src/tasks/batch_integration/process_dataset/config.vsh.yaml -- --input resources/datasets/cellxgene_census/hcla/log_cp10k/dataset.h5ad --output_dataset dataset.h5ad --output_solution solution.h5ad

Or debug with:

## VIASH START
par = {
    'input': 'resources/datasets/cellxgene_census/hcla/log_cp10k/dataset.h5ad',
    'hvgs': 2000,
    'obs_label': 'cell_type',
    'obs_batch': 'batch',
    'subset_hvg': False,
    'output_dataset': 'dataset.h5ad',
    'output_solution': 'solution.h5ad'
}
meta = {
    "config": "target/nextflow/batch_integration/process_dataset/.config.vsh.yaml",
    "resources_dir": "src/common/helper_functions"
}
## VIASH END

  Traceback (most recent call last):
    File "/viash_automount/tmp//viash-run-process_dataset-HETEzD.py", line 60, in <module>
      adata_with_hvg = compute_batched_hvg(input, n_hvgs=par['hvgs'])
    File "/viash_automount/tmp//viash-run-process_dataset-HETEzD.py", line 49, in compute_batched_hvg
      hvg_list = scib.pp.hvg_batch(
    File "/usr/local/lib/python3.10/site-packages/scib/preprocessing.py", line 504, in hvg_batch
      sc.pp.highly_variable_genes(
    File "/usr/local/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 469, in highly_variable_genes
      hvg = _highly_variable_genes_single_batch(
    File "/usr/local/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 248, in _highly_variable_genes_single_batch
      df['mean_bin'] = pd.cut(
    File "/usr/local/lib/python3.10/site-packages/pandas/core/reshape/tile.py", line 293, in cut
      fac, bins = _bins_to_cuts(
    File "/usr/local/lib/python3.10/site-packages/pandas/core/reshape/tile.py", line 421, in _bins_to_cuts
      raise ValueError(
  ValueError: Bin edges must be unique: array([      -inf, 0.27518785, 0.27518785, 0.27518785, 0.27662799,
         0.2801334 , 0.2801334 , 0.30019245, 0.31369418, 0.31369418,
         0.32693949, 0.33407092, 0.35471782, 0.55181584, 0.59382758,
         0.62713194, 0.74299753, 0.9963228 , 1.50465243, 5.59722102,
                inf]).
  You can drop duplicate edges by setting the 'duplicates' kwarg
  Files and logs are stored at '/tmp/viash_process_dataset4742220387422529871'
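
One possible explanation for the repeated edges, sketched below under stated assumptions: the 21-element edge array (`-inf`, 19 values, `inf`) is consistent with bin edges taken from the 10th to 100th percentiles of the per-gene means in steps of 5, which is my understanding of what the `cell_ranger` flavor of `sc.pp.highly_variable_genes` does. If many genes in a batch share exactly the same mean, neighbouring percentiles coincide. The `means` array here is synthetic and only illustrates the mechanism; it says nothing about which batch of hcla triggers the error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-gene means with heavy ties: half of the genes share one value.
# (Hypothetical data, chosen only to illustrate the mechanism.)
means = np.concatenate([
    np.full(500, 0.275),
    rng.uniform(0.28, 5.0, size=500),
])

# Assumed binning scheme: percentile-based edges padded with -inf and inf.
percentiles = np.percentile(means, np.arange(10, 105, 5))
edges = np.r_[-np.inf, percentiles, np.inf]

# Count how many edges are repeats of another edge.
n_duplicate_edges = len(edges) - len(np.unique(edges))
print(len(edges), n_duplicate_edges)
```

Running the same percentile check on the per-batch gene means of the real dataset would show which batches produce tied edges.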
