export_pseudobulk never generates pseudobulk files #117
Comments
Hi @SITRR, what you are doing looks all right. Could you run the following piece of code and send me the output?
sample = "75G"
variable = "75G"
_sample_cell_data = cell_data.loc[cell_data[sample_id_col] == sample]
_cell_type_to_cell_barcodes = _sample_cell_data \
    .groupby(variable, group_keys=False)["barcode"] \
    .apply(list) \
    .to_dict()
print(_cell_type_to_cell_barcodes)
Best, Seppe |
Hi @SeppeDeWinter, Disclaimer: I'm an absolute beginner with Python. The line _sample_cell_data = cell_data.loc[cell_data[sample_id_col] == sample] gives the error: Traceback (most recent call last): I tried changing sample_id_col to 'sample_id' (please refer to the disclaimer for my reasoning), but then _cell_type_to_cell_barcodes gives the error: Traceback (most recent call last): Best, |
Hi @SITRR Can you show what print(cell_data) gives? All the best, Seppe |
Hi @SeppeDeWinter, Thanks (seriously, I will be forever indebted), |
Hi Susana, my bad, the code should be like this:
sample = "75G"
variable = "donor"
sample_id_col = "sample_id"
_sample_cell_data = cell_data.loc[cell_data[sample_id_col] == sample]
_cell_type_to_cell_barcodes = _sample_cell_data \
.groupby(variable, group_keys=False)["barcode"] \
.apply(list) \
.to_dict()
print(_cell_type_to_cell_barcodes) |
Hi, @SeppeDeWinter! It printed a huge list of barcodes: Is that to be expected? Should the output of len(_cell_type_to_cell_barcodes) be 1? I'm presuming that it shouldn't. Best, |
Hi @SITRR That looks all right, I don't immediately see anything that is off. Are you running it in a Jupyter notebook or via the command line environment? All the best, Seppe |
Hi, @SeppeDeWinter! I’m running it via the command line environment, as an LSF job. It runs successfully but only outputs the bed_paths.pkl and bw_paths.pkl files, nothing else. The output also doesn't say "Done!" as shown in the tutorial. I was really hoping to get this to work since there is a huge batch effect when I don't use a common peak set. I wouldn't want to go forward with this cistopic object. Best, |
I tried following this user's pipeline to verify each input (#59), but I get an error when trying to read the fragments file (unless I'm misunderstanding their code or their input files have a different format).
fragments = pd.read_table("/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/fragments.tsv.gz", header = None)
Does this give you any hints? Cheers, |
Hi @SITRR Could you try running the command line version of this code? This will involve generating:
After this you can run
scatac_fragment_tools split \
-f <PATH_TO_SAMPLE_TO_FRAGMENT_DEFINITION> \
-b <PATH_TO_CELL_TYPE_TO_CELL_BARCODE_DEFINITION> \
-c <CHROM_SIZES_FILENAME> \
-o <PATH_TO_OUTPUT_FOLDER>
And this will do the same splitting as the function is trying to do. All the best, Seppe |
Thank you, @SeppeDeWinter! How do I proceed after running this command? I'd like to extract the list of barcodes from the fragments file it generated to compare to the list I've been using to run the export_pseudobulk command as follows, but I get a large 5.2 GB file, which doesn't seem right to me.
df = pd.read_table(work_dir + "Testing/75G.fragments.tsv.gz", header=None)
I tried using the new fragments file (output from scatac_fragment_tools split) but I'm getting the same results. Best, |
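For comparing barcode lists without holding a whole fragments file in memory, one possible approach is to read only the barcode column in chunks. This is a sketch under assumptions, not something taken from pycisTopic: it assumes the usual tab-separated fragments layout with the barcode in the fourth column, and it assumes that any header lines start with '#' (skipping those may also help if a plain read_table call with header=None fails on such lines). The function name unique_barcodes and the demo barcodes are made up.

```python
import gzip
import os
import tempfile

import pandas as pd


def unique_barcodes(fragments_path, chunksize=1_000_000):
    """Return the set of distinct barcodes in a fragments file.

    Assumes a tab-separated file whose 4th column is the barcode;
    lines starting with '#' are skipped as header comments.
    """
    barcodes = set()
    reader = pd.read_table(
        fragments_path,
        header=None,
        comment="#",          # ignore '#' header lines, if any
        usecols=[3],          # read the barcode column only
        dtype=str,
        chunksize=chunksize,  # stream the file instead of loading it whole
    )
    for chunk in reader:
        barcodes.update(chunk[3])
    return barcodes


# Tiny made-up fragments file so the sketch runs on its own.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.fragments.tsv.gz")
with gzip.open(demo_path, "wt") as fh:
    fh.write("# id=demo\n")
    fh.write("chr1\t100\t250\tAAAC-1\t1\n")
    fh.write("chr1\t300\t480\tAAAG-1\t2\n")
    fh.write("chr2\t50\t190\tAAAC-1\t1\n")

in_fragments = unique_barcodes(demo_path)
in_cell_data = {"AAAC-1", "TTTG-1"}  # stand-in for set(cell_data["barcode"])

print(sorted(in_fragments & in_cell_data))  # barcodes present in both
print(sorted(in_cell_data - in_fragments))  # barcodes missing from the fragments
```

If the second printout is large for a real sample, the barcode format in cell_data probably does not match the fragments file, which would be one thing to rule out before rerunning export_pseudobulk.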
Hi @SITRR I'm not sure whether I completely understand your question... To get the bed_paths variable (the one that is used in the tutorial), run:
import os
from scatac_fragment_tools.library.split.split_fragments_by_cell_type import (
    _santize_string_for_filename
)
bed_path = "<PATH_TO_OUTPUT_FOLDER>"  # folder passed to -o in the split command
variable = "donor"  # variable used to split the fragments
bed_paths = {}
# Collect the per-group fragments files that the split step actually wrote
for cell_type in cell_data[variable].unique():
    _bed_fname = os.path.join(
        bed_path,
        f"{_santize_string_for_filename(cell_type)}.fragments.tsv.gz")
    if os.path.exists(_bed_fname):
        bed_paths[cell_type] = _bed_fname
    else:
        print(f"Missing fragments for {cell_type}!")
Best, S |
Hi @SeppeDeWinter! I finally got the export_pseudobulk function to run by switching to Python v3.11 and the dev branch of SCENIC+. Oddly though, it only runs with two samples. The function immediately outputs the following message: cisTopic INFO Reading fragments from /sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/fragments.tsv.gz However, when I add two more samples to the fragment directory, the function outputs the following instead and gets stuck on this step for several hours: INFO Splitting fragments by cell type. Any idea what might be happening? Best, |
Crisis averted, @SeppeDeWinter! Patience was all that was necessary. It's running smoothly now, but I'd like to keep the issue open until I get to the end of the pipeline as I might encounter other issues and would appreciate your guidance. If you'd like me to open a separate issue instead, I can do so. Thank you for everything so far! |
Hi!
I'm having an issue with the export_pseudobulk function, which has been reported previously (#41). The function runs but doesn't output any files, and the text output does not state "Done!" after "Creating pseudobulk for 75G" (75G being the name of one of my samples).
I've generated a cell_data data frame with three columns: 'barcode', which has the same barcode format as the fragment files (e.g. 'TTTGTGTTCTTGTCGC-1'; I got these from the Cell Ranger output file singlecell.csv, let me know if that's not advisable); 'sample_id', with the name of the sample and fragments_dict key (e.g. 75G, repeated for the length of the 'barcode' column); and 'donor', identical to the 'sample_id' column. I use the 'donor' column as the input for 'variable' in the export_pseudobulk function (see below).
Should I be extracting the barcodes from the fragments file directly? I'm new to this and don't know how to do so in Python.
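For readers following along, here is a minimal sketch of the three-column cell_data layout described above, built for a single sample. The helper name make_cell_data is made up, and the second barcode is invented for illustration; only the column names and the first barcode come from the post.

```python
import pandas as pd


def make_cell_data(barcodes, sample_id):
    """Build a table with 'barcode', 'sample_id' and 'donor' columns.

    'donor' simply repeats the sample name, as described in the post.
    """
    return pd.DataFrame({
        "barcode": list(barcodes),
        "sample_id": sample_id,  # scalar is repeated for every barcode
        "donor": sample_id,
    })


cell_data = make_cell_data(
    ["TTTGTGTTCTTGTCGC-1", "AAACGAAAGAAACGCC-1"],  # second barcode is made up
    "75G",
)
print(cell_data)
```

Tables for several samples could be stacked with pd.concat, as long as each 'sample_id' value matches the corresponding key in fragments_dict.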
This is the cell_data dataframe:
This is the fragment file unzipped:
This is the code I'm running: