diff --git a/LICENSE b/LICENSE index ab3aa59..7c78c2d 100644 --- a/LICENSE +++ b/LICENSE @@ -1,6 +1,6 @@ MIT License -Copyright (c) 2022 Benedetto Polimeni +Copyright (c) 2022-2024 Benedetto Polimeni Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/README.md b/README.md index 892ad98..c10d0b0 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ # IRescue - Interspersed Repeats single-cell quantifier -IRescue is a software for quantifying the expression of transposable elements (TEs) subfamilies in single cell RNA sequencing (scRNA-seq) data. The core feature of IRescue is to consider all multiple alignments (i.e. non-primary alignments) of reads/UMIs mapping on multiple TEs in a BAM file, to accurately infer the TE subfamily of origin. IRescue implements a UMI error-correction, deduplication and quantification strategy that includes such alignment events. IRescue's output is compatible with most scRNA-seq analysis toolkits, such as Seurat or Scanpy. +IRescue quantifies the expression of transposable element (TE) subfamilies in single cell RNA sequencing (scRNA-seq) data. It performs UMI deduplication with sequencing error correction and probabilistic assignment of multi-mapping reads via an expectation-maximization (EM) procedure. TE counts are written to a sparse matrix (similar to Cell Ranger's output) compatible with Seurat, Scanpy and other toolkits. ## Content @@ -34,7 +34,7 @@ conda create -n irescue -c conda-forge -c bioconda irescue ### Using pip -If for any reason it's not possible or desiderable to use conda, it can be installed with pip and the following requirements must be installed manually: `python>=3.7`, `samtools>=1.12`, `bedtools>=2.30.0`, and fairly recent versions of the GNU utilities are required, specifically `coreutils>=8.30` and `gzip>=1.10` (older versions are untested). +If for any reason it's not possible or desirable to use conda, IRescue can be installed with pip; in that case, the following requirements must be installed manually: `python>=3.7`, `samtools>=1.12`, `bedtools>=2.30.0`, and fairly recent versions of the GNU utilities, specifically `gawk>=5.0.1`, `coreutils>=8.30` and `gzip>=1.10` (older versions are untested). ```bash pip install irescue @@ -57,29 +57,36 @@ singularity exec https://depot.galaxyproject.org/singularity/irescue:$TAG irescu ## Usage -### Quick start +```sh +irescue --help +``` + +The only required input is a BAM file annotated with cell barcode and UMI sequences as tags (by default, `CB` tag for cell barcode and `UR` tag for UMI; override with `--cb-tag` and `--umi-tag`). -The only required input is a BAM file annotated with cell barcode and UMI sequences as tags (by default, `CB` tag for cell barcode and `UR` tag for UMI; override with `--CBtag` and `--UMItag`). You can obtain it by aligning your reads using [STARsolo](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md). +You can obtain it by aligning your reads using [STARsolo](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md). It is advised to keep secondary alignments in the BAM file, as they will be used in the EM procedure to assign multi-mapping reads (e.g. `--outFilterMultimapNmax 100 --winAnchorMultimapNmax 100` or more); remember to output all the needed SAM attributes (e.g. `--outSAMattributes NH HI AS nM NM MD jM jI XS MC ch cN CR CY UR UY GX GN CB UB sM sS sQ`). 
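+As a reference, a minimal STARsolo invocation following these suggestions might look like the sketch below (genome index, FASTQ and whitelist paths are placeholders; add the barcode geometry options appropriate for your library chemistry):
+
+```sh
+STAR --genomeDir star_index/ \
+    --readFilesIn cDNA_reads.fastq.gz barcode_reads.fastq.gz \
+    --readFilesCommand zcat \
+    --soloType CB_UMI_Simple \
+    --soloCBwhitelist whitelist.txt \
+    --outFilterMultimapNmax 100 \
+    --winAnchorMultimapNmax 100 \
+    --outSAMattributes NH HI AS nM NM MD jM jI XS MC ch cN CR CY UR UY GX GN CB UB sM sS sQ \
+    --outSAMtype BAM SortedByCoordinate
+```
+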
RepeatMasker annotation will be automatically downloaded for the chosen genome assembly (e.g. `-g hg38`), or provide your own annotation in bed format (e.g. `-r TE.bed`). -```bash +```sh irescue -b genome_alignments.bam -g hg38 ``` -If you already obtained gene-level counts (using STARsolo, Cell Ranger, Alevin, Kallisto or other tools), it is advised to provide the whitelisted cell barcodes list as a text file, e.g.: `-w barcodes.tsv`. This will significantly improve performance. +If you already obtained gene-level counts (using STARsolo, Cell Ranger, Alevin, Kallisto or other tools), it is advised to provide the whitelisted cell barcode list as a text file (`-w barcodes.tsv`). This will significantly improve performance by processing viable cells only. -IRescue performs best using at least 4 threads, e.g.: `-p 8`. +For optimal run time, use at least 4 threads, e.g. `-p 8`. ### Output files -IRescue generates TE counts in a sparse matrix format, readable by [Seurat](https://github.com/satijalab/seurat) or [Scanpy](https://github.com/scverse/scanpy): +IRescue writes TE counts in a sparse matrix format, readable by [Seurat](https://github.com/satijalab/seurat) or [Scanpy](https://github.com/scverse/scanpy), into a `counts/` subdirectory. Optional outputs include a description of equivalence classes with UMI deduplication stats (`ec_dump.tsv.gz`) and a subdirectory of temporary files (`tmp/`) for debugging purposes. Detailed logging is enabled by `--verbose` and written to standard error. ``` -IRescue_out/ -├── barcodes.tsv.gz -├── features.tsv.gz -└── matrix.mtx.gz +irescue_out/ +├── counts/ +│   ├── barcodes.tsv.gz +│   ├── features.tsv.gz +│   └── matrix.mtx.gz +├── ec_dump.tsv.gz +└── tmp/ ``` ### Load IRescue data with Seurat @@ -108,8 +115,10 @@ Active assay: RNA (31078 features, 0 variable features) 1 other assay present: TE ``` +From here, TE expression can be normalized, and dimensionality reductions can be computed using either TE or gene expression. + ## Cite Polimeni B, Marasca F, Ranzani V, Bodega B. -IRescue: single cell uncertainty-aware quantification of transposable elements expression. 
+*IRescue: uncertainty-aware quantification of transposable elements expression at single cell level.* bioRxiv 2022.09.16.508229; doi: https://doi.org/10.1101/2022.09.16.508229 diff --git a/irescue/_version.py b/irescue/_version.py index 4980bee..236fd8c 100644 --- a/irescue/_version.py +++ b/irescue/_version.py @@ -1 +1 @@ -__version__ = '1.1.0-beta.1' +__version__ = '1.1.0-beta.2' diff --git a/irescue/count.py b/irescue/count.py index 3a10b0f..6d055e8 100644 --- a/irescue/count.py +++ b/irescue/count.py @@ -1,372 +1,308 @@ #!/usr/bin/env python +from collections import Counter +from itertools import combinations import numpy as np -from irescue.misc import getlen, writerr, flatten, run_shell_cmd, iupac_nt_code +import networkx as nx +from irescue.misc import get_ranges, getlen, writerr, run_shell_cmd +from irescue.network import build_substr_idx, gen_ec_pairs +from irescue.em import run_em import gzip import os -def find_mm(x, y): +class EquivalenceClass: +    def __init__( +        self, +        index: int, +        umi: bytes, +        features: set, +        count: int +    ) -> None: +        self.index = index +        self.umi = umi +        self.features = features +        self.count = count +    def to_tuple(self): +        return (self.umi, self.features, self.count) +    def hdist(self, umi): +        return sum(1 for i, j in zip(self.umi, umi) if i != j) +    def connect(self, eqc, threshold): +        return (self.count >= (2 * eqc.count) - 1 +                and self.features.intersection(eqc.features) +                and self.hdist(eqc.umi) <= threshold) + +def pathfinder(graph, node, path=[], features=None): """ - Calculate number of mismatches between sequences of the same length + Finds the first valid path of UMIs with compatible equivalence classes, + given a starting node. Can be used iteratively to find all possible paths. """ - if len(x) != len(y): - return -1 - mm = 0 - for i in range(len(x)): - if x[i] != y[i]: - mm += 1 - return mm + if not features: + features = graph.nodes[node]['ft'] + path += [node] + for next_node in graph.successors(node): + if (features.intersection(graph.nodes[next_node]['ft']) + and next_node not in path): + path = pathfinder(graph, next_node, path, features) + return path + +def index_features(features_file): + idx = {} + with gzip.open(features_file, 'rb') as f: + for i, line in enumerate(f, start=1): + ft = line.strip().split(b'\t')[0] + idx[ft] = i + return idx -def collapse_networks(graph): +def parse_maps(maps_file, feature_index): """ - Collapse a UMI graph to a graph of the smallest number of hubs. - - Parameters - ---------- - graph: dict - A dictionary with nodes as keys and the set of adjacent nodes - (including the node itself) as values. - e.g.: {0: {0,1,2}, 1: {0,1}, 2: {0,2,3}} + Parse a mappings file and yield equivalence classes one cell at a time. + maps_file : str + Tab-separated file with rows of "CB UMI FEATs count". + out : (bytes, list) + Cell barcode, and its list of EquivalenceClass objects. 
""" - out = dict() - for key, value in graph.items(): - out[key] = [] - if len(value) == 1: - # check if it's a single node, then add to the output and go to the next node - out[key].append(value) - continue - for val in graph.values(): - # check if there are other nodes that contains all the values of the current one - if all(i in value for i in val): - out[key].append(val) - if len(out[key]) <= 1: - out.popitem() - return out + with gzip.open(maps_file, 'rb') as f: + cb, umi, feat, count = f.readline().strip().split(b'\t') + i = 0 + it = cb + count = int(count) + feat = {feature_index[ft] for ft in feat.split(b',')} + eqcl = [EquivalenceClass(i, umi, feat, count)] + for line in f: + cb, umi, feat, count = line.strip().split(b'\t') + count = int(count) + feat = {feature_index[ft] for ft in feat.split(b',')} + if cb == it: + i += 1 + eqcl.append(EquivalenceClass(i, umi, feat, count)) + else: + yield it, eqcl + it = cb + i = 0 + eqcl = [EquivalenceClass(i, umi, feat, count)] + yield it, eqcl -# calculate counts of a cell from mappings dictionary -def cellCount(maps, intcount=False, dumpec=False): +def compute_cell_counts(equivalence_classes, features_index, dumpEC): """ - Deduplicate UMI counts of a cell. + Calculate TE counts of a single cell, given a list of equivalence classes. Parameters ---------- - maps: dict - Dictionary of all UMI-TE mappings of the cell. - e.g.: {UMI: {TE_1, TE_2}} - intcount: bool - Convert all counts to integer. - dumpec: - Make a list of rows for the Equivalence Classes dump (to use with - --dumpEC on) + equivalence_classes : list + (UMI_sequence, {TE_index}, read_count) : (str, set, int) + Tuples containing UMI-TEs equivalence class infos. + + Returns + ------- + out : dict + feature : count dictionary. """ - - # get and index equivalence classes from maps - eclist = list() - for v in maps.values(): - eclist.append(tuple(sorted(v.keys()))) - eclist = sorted(list(set(eclist))) - - # make a simple mapping dict (index number in place of families) and its reverse - smaps = dict([(i,eclist.index(tuple(sorted(j.keys())))) for i,j in maps.items()]) - rsmaps = dict() - for key, value in smaps.items(): - rsmaps.setdefault(value, list()).append(key) - - # compute the count of each equivalence class in the cell barcode - counts = dict() - ec_log = [] - for ec in rsmaps: - # list of UMIs associated to EC - umis = rsmaps[ec] - ### compute the total count of the equivalence class - if len(umis) > 1: - ### Find and collapse duplicated UMIs ### - # Make an NxN array of number of mismatches between N UMIs - mm_arr = np.array([[find_mm(ux,i) for i in umis] for ux in umis]) - # Find UMI pairs with up to 1 mismatch, where UMIs are representad - # by integers: [[i, j], [i, k], [k, m]] - mm_check = np.argwhere(mm_arr <= 1) - # Make a graph that connects UMIs with <=1 mismatches - # {NODE: EDGES} or {UMI: [CONNECTED_UMIS]} - graph = dict() - for key, value in mm_check: - graph.setdefault(key, set()).add(value) - # Check if all nodes are connected (i.e. 
complete graph) - if all([x == set(graph.keys()) for x in graph.values()]): - # Set EC final count to 1 - ec_count = 1 - if dumpec: - mm = [ - (i, j) for i, j in enumerate( - [set(x) for x in zip(*umis)] - ) - if len(j) > 1 - ] - if len(mm) == 1: - mm = mm[0] - if mm: - iupac = iupac_nt_code(mm[1]) - umis_dedup = list(umis[0]) - umis_dedup[mm[0]] = iupac - umis_dedup = [''.join(umis_dedup)] - else: - umis_dedup = [''.join(umis_dedup)] - else: - # Collapse networks based on UMI similarity: {HUB: [UMI_GRAPHS]} - coll_nets = collapse_networks(graph) - # Get EC final count after collapsing - ec_count = len(coll_nets) - # - if dumpec: - umis_dedup = [umis[x] for x in coll_nets] - + # initialize TE counts and dedup log + counts = Counter() + dump = None + number_of_features = len(features_index) + # build cell-wide UMI deduplication graph + graph = nx.DiGraph() + # add nodes with annotated features + graph.add_nodes_from( + [(x.index, {'ft': x.features, 'c': x.count}) + for x in equivalence_classes] + ) + # make an iterator of umi pairs + if len(equivalence_classes) > 25: + umi_length = len(equivalence_classes[0].umi) + substr_idx = build_substr_idx(equivalence_classes, umi_length, 1) + iter_ec_pairs = gen_ec_pairs(equivalence_classes, substr_idx) + else: + iter_ec_pairs = combinations(equivalence_classes, 2) + for x, y in iter_ec_pairs: + # add edges to graph + if x.connect(y, 1): + graph.add_edge(x.index, y.index) + if y.connect(x, 1): + graph.add_edge(y.index, x.index) + if dumpEC: + # collect graph metadata in a dictionary + dump = {i: equivalence_classes[i].to_tuple() for i in graph.nodes} + # split cell-wide graph into subgraphs of connected nodes + subgraphs = [graph.subgraph(x) for x in + nx.connected_components(graph.to_undirected())] + # put aside networks that will be solved with EM + em_array = [] + # solve UMI deduplication for each subgraph of connected nodes + for subg in subgraphs: + # find all parent nodes in graph + parents = [x for x in subg if not list(subg.predecessors(x))] + if not parents: + # if no parents are found due to bidirected edges, take all nodes + # and the union of all features (i.e. all nodes are parents). 
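+            # (bidirected edges can only link equivalence classes counted once each, so no node qualifies as a parent)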
+ parents = list(subg.nodes) + features = [list(set.union(*[subg.nodes[x]['ft'] for x in subg]))] else: - # If only one umi, skip collapsing and assign 1 to the final count - ec_count = 1 - if dumpec: - umis_dedup = umis - - ### find the predominant TE family in the equivalence class - # make count matrix from mappings (row = UMI, column = TE) - ec_counts = np.array([list(j.values()) for i,j in maps.items() if i in rsmaps[ec]]) - # sum counts by TE - ec_sum = ec_counts.sum(axis = 0) - # find the index of the highest count - ec_max = np.argwhere(ec_sum == ec_sum.max()).flatten() - - # retrieve the TEs with highest count - te_max = list() - for i in ec_max: - te_max.append(eclist[ec][i]) - - # add count - for te in te_max: - # initialize the feature in the cell barcode dictionary - if te not in counts: - counts[te] = 0 - # get the normalized count by dividing the raw count by the number of predominant TEs - norm_count = ec_count / len(te_max) - # if integers are needed, round the normalized count - if intcount: - norm_count = round(norm_count) - # add count to dictionary - counts[te] += norm_count - - # dump EC - if dumpec: - if umis == umis_dedup: - umis_dedup = ['-'] - ec_log.append('\t'.join([ - str(ec), # EC index - ','.join(eclist[ec]), # EC name - ','.join(umis), # Raw UMIs - str(len(umis)), # Raw count - ','.join(umis_dedup), # Deduplicated UMIs - str(ec_count), # Deduplicated count - ','.join(te_max) # Filtered TEs - ]) + '\n') - - return ec_log, counts - -def parse_features(features_file): - """ - Parses the features.tsv file, assigns an index (int) for each feature and - yields (index, feature) tuples. - """ - with gzip.open(features_file, 'rb') as f: - for i, line in enumerate(f): - l = line.decode('utf-8').strip().split('\t') - yield (l[0], i+1) - -def split_int(num, div): - """ - Splits an integer X into N integers whose sum is equal to X. - """ - split = int(num/div) - for i in range(0, num, split): - j = i + split - if j > num-split: - j = num - yield range(i, j) - break - yield range(i, j) - -def split_bc(barcode_file, n): - """ - Yields barcodes (index,sequence) tuples in n chunks. - """ - bclen = getlen(barcode_file) - #split = round(bclen/n) - with gzip.open(barcode_file, 'rb') as f: - c=0 - for chunk in split_int(bclen, n): - yield (c,[(next(f).decode('utf-8').strip(),x+1) for x in chunk]) - c+=1 - -def count( - mappings_file, outdir, tmpdir, features, intcount, dumpec, verbose, - bc_split -): + # if parent nodes are found, features will be determined below. + features = None + # initialize dict of possible paths configurations, starting from + # each parent node. 
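+            # structure: {parent_node: [[path nodes, ...], ...]}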
+ paths = {x: [] for x in parents} + # find paths starting from each parent node + for parent in parents: + # populate this list with nodes utilized in paths + blacklist = [] + # find paths in list of nodes starting from parent + path = [] + subg_copy = subg.copy() + nodes = [parent] + [x for x in subg_copy if x != parent] + for node in nodes: + # skip nodes already used in a path; nodes of each found + # path are pruned from the subgraph copy + if node not in blacklist: + path = pathfinder(subg_copy, node, path=[], features=None) + for x in path: + blacklist.append(x) + subg_copy.remove_node(x) + paths[parent].append(path) + # find the path configuration leading to the minimum number of + # deduplicated UMIs -> list of lists of nodes + path_config = [ + paths[k] for k, v in paths.items() + if len(v) == min([len(x) for x in paths.values()]) + ][0] + if not features: + # take features from parent node of selected path configuration + features = [list(subg.nodes[x[0]]['ft']) for x in path_config] + else: + # if features was already determined (i.e. no parent nodes), + # multiply the features list by the number of paths + # in path_config to avoid going out of list range + features *= len(path_config) + # assign UMI count to features + for feats in features: + if len(feats) == 1: + counts[feats[0]] += 1.0 + elif len(feats) > 1: + row = [1 if x in feats else 0 + for x in range(1, number_of_features+1)] + em_array.append(row) + else: + writerr(nx.to_dict_of_lists(subg)) + writerr([subg.nodes[x]['ft'] for x in subg.nodes]) + writerr([subg.nodes[x]['c'] for x in subg.nodes]) + writerr(path_config) + writerr(path) + writerr(features) + writerr(feats) + writerr("Error: no common features detected in subgraph's" + " path.", error=True) + # add EC log to dump + if dumpEC: + for i, path_ in enumerate(path_config): + # add empty fields to parent node + parent_ = path_[0] + path_.pop(0) + dump[parent_] += (b'', b'') + # if child nodes are present, add parent node information + for x in path_: + # add parent's UMI sequence and dedup features + dump[x] += (dump[parent_][0], features[i]) + if em_array: + # optimize the assignment of UMI from multimapping reads + em_array = np.array(em_array) + # save an array with features > 0, as in em_array order + tokeep = np.argwhere(np.any(em_array[..., :] > 0, axis=0))[:,0] + 1 + # remove unmapped features from em_array + todel = np.argwhere(np.all(em_array[..., :] == 0, axis=0)) + em_array = np.delete(em_array, todel, axis=1) + # run EM + em_counts = run_em(em_array, cycles=100) + em_counts = [x*em_array.shape[0] for x in em_counts] + for i, c in zip(tokeep, em_counts): + if c > 0: + counts[i] += c + return dict(counts), dump + +def split_barcodes(barcodes_file, n): """ - Run cellCount() for a set of barcodes. - - Parameters + Yields barcodes in n chunks. + barcodes_file : str + n : int ---------- - mappings_file: str - File containing UMI-TE mappings (3-columns text of CB-UMI-TE) - outdir: str - Output dir to write into. - tmpdir: str - Directory to write temporary files into. - features: list - List of (index, feature) tuples, generated with parse_features(). - intcount: bool - Convert all counts to integer. - dumpec: bool - Write a report of equivalence classes and UMI deduplication. - verbose: bool - Be verbose. - bc_split: list - List of barcodes to process, generated with split_bc(). 
+ out : int, dict """ - '''Runs cellCount for a set of barcodes''' - os.makedirs(outdir, exist_ok=True) - os.makedirs(tmpdir, exist_ok=True) - - # set temporary matrix name prefix as chunk number - chunkn = bc_split[0] - matrix_file = os.path.join(tmpdir, f'{chunkn}_matrix.mtx.gz') - - # parse barcodes in a SEQUENCE:INDEX dictionary - barcodes = dict(bc_split[1]) - writerr( - f'Processing {len(barcodes)} barcodes from chunk {chunkn}', - send=verbose - ) - - # get number of lines in mappings_file - nlines = getlen(mappings_file) - - # initialize mappings dictionary {UMI: {FEATURE: COUNT}} - maps = dict() - - # cell barcode placeholder - cell = None - - with gzip.open(mappings_file, 'rb') as data, \ - gzip.open(matrix_file, 'wb') as mtxFile: - - if dumpec: - ec_dump_file = os.path.join(tmpdir, f'{chunkn}_ec_dump.tsv.gz') - ecdump = gzip.open(ec_dump_file, 'wb') - else: - ec_dump_file = None - - for line in enumerate(data, start=1): - # gather barcode, umi and feature from mappings file - cx, ux, te = line[1].decode('utf-8').strip().split('\t') - if '~' in te: - te = te[:te.index('~')] - - if len(barcodes)==0: - # interrupt loop when reaching the end of the barcodes chunk - break - - if not cell: - # skip to the first cell barcode contained in the current - # barcodes chunk - if cx not in barcodes: - continue - else: - cell = cx - - # if cell barcode changes, compute counts from previous cell's - # mappings - if cx != cell and cell in barcodes: - cellidx = barcodes.pop(cell) - writerr( - f'[{chunkn}] Computing counts for cell barcode {cellidx} ' - '({cell})', - send=verbose - ) - # compute final counts of the cell - ec_log, counts = cellCount( - maps, - intcount=intcount, - dumpec=dumpec - ) - # arrange counts in a data frame and write to text file - lines = [f'{features[k]} {str(cellidx)} {str(v)}\n'.encode() \ - for k, v in counts.items()] - mtxFile.writelines(lines) - if dumpec: - ec_log = [f'{str(cellidx)}\t{cell}\t{x}'.encode() \ - for x in ec_log] - ecdump.writelines(ec_log) - # re-initialize mappings dict - maps = dict() - - # reassign cell to current barcode - cell = cx - - # add features count to mappings dict - if cx in barcodes: - #teidx = features[te] - if ux not in maps: - # initialize UMI if not in mappings dict - maps[ux] = dict() - if te in maps[ux]: - # initialize feature count for UMI - maps[ux][te]+=1 - else: - # add count to existing feature in UMI - maps[ux][te]=1 - - # if end of file is reached, compute counts from current cell's - # mappings - if line[0] == nlines and cell in barcodes: - cellidx = barcodes.pop(cell) - writerr( - f'[{chunkn}] [file_end] Computing counts for cell ' - f'barcode {cellidx} ({cell})', - send=verbose - ) - # compute final counts of the cell - ec_log, counts = cellCount( - maps, - intcount=intcount, - dumpec=dumpec - ) - # arrange counts in a data frame and write to text file - lines = [f'{features[k]} {str(cellidx)} {str(v)}\n'.encode() \ - for k, v in counts.items()] - mtxFile.writelines(lines) - if dumpec: - ec_log = [f'{str(cellidx)}\t{cell}\t{x}'.encode() \ - for x in ec_log] - ecdump.writelines(ec_log) - if dumpec: - ecdump.close() + nBarcodes = getlen(barcodes_file) + with gzip.open(barcodes_file, 'rb') as f: + for i, chunk in enumerate(get_ranges(nBarcodes, n)): + yield i, {next(f).strip(): x+1 for x in chunk} + +def run_count(maps_file, features_index, tmpdir, dumpEC, verbose, + barcodes_set): + # NB: keep args order consistent with main.countFun + taskn, barcodes = barcodes_set + matrix_file = os.path.join(tmpdir, 
f'{taskn}_matrix.mtx.gz') + dump_file = os.path.join(tmpdir, f'{taskn}_EqCdump.tsv.gz') + # (avoid parenthesized multi-item "with", which requires Python >= 3.10) + with gzip.open(matrix_file, 'wb') as f, \ + (gzip.open(dump_file, 'wb') if dumpEC + else gzip.open(os.devnull, 'wb')) as df: + for cellbarcode, cellmaps in parse_maps(maps_file, features_index): + if cellbarcode not in barcodes: + continue + cellidx = barcodes[cellbarcode] writerr( - f'Equivalence Classes dump file written to {ec_dump_file}', + f'[{taskn}] Run count for cell ' + f'{cellidx} ({cellbarcode.decode()})', send=verbose ) - writerr(f'Barcodes chunk {chunkn} written to {matrix_file}', send=verbose) - return matrix_file, ec_dump_file - -# Concatenate matrices in a single MatrixMarket file with proper header -def formatMM(matrix_files, outdir, features, barcodes): + cellcounts, dump = compute_cell_counts( + equivalence_classes=cellmaps, + features_index=features_index, + dumpEC=dumpEC + ) + writerr( + f'[{taskn}] Write count for cell ' + f'{cellidx} ({cellbarcode.decode()})', + send=verbose + ) + # round counts to 3rd decimal point and write to matrix file + # only if count is at least 0.001 + lines = [f'{feature} {cellidx} {round(count, 3)}\n'.encode() + for feature, count in cellcounts.items() + if count >= 0.001] + f.writelines(lines) + if dumpEC: + writerr( + f'[{taskn}] Write ECdump for cell ' + f'{cellidx} ({cellbarcode.decode()})', + send=verbose + ) + # reverse features index to get names back + findex = dict(zip(features_index.values(), + features_index.keys())) + dumplines = [ + b'\t'.join( + [str(cellidx).encode(), + cellbarcode, + str(i).encode(), + umi, + b','.join([findex[f] for f in feats]), + str(count).encode(), + pumi, + b','.join([findex[f] for f in pfeats])] + ) + b'\n' + for i, (umi, feats, count, pumi, pfeats) in dump.items() + ] + df.writelines(dumplines) + return matrix_file, dump_file + +def formatMM(matrix_files, feature_index, barcodes_chunks, outdir): if type(matrix_files) is str: matrix_files = [matrix_files] matrix_out = os.path.join(outdir, 'matrix.mtx.gz') - features_count = len(features) - barcodes_count = len(flatten([j for i,j in barcodes])) + features_count = len(feature_index) + barcodes_count = sum(len(x) for _, x in barcodes_chunks) mmsize = sum(getlen(f) for f in matrix_files) - mmheader = '%%MatrixMarket matrix coordinate real general\n' - mmtotal = f'{features_count} {barcodes_count} {mmsize}\n' + mmheader = b'%%MatrixMarket matrix coordinate real general\n' + mmtotal = f'{features_count} {barcodes_count} {mmsize}\n'.encode() with gzip.GzipFile(matrix_out, 'wb', mtime=0) as mmout: - mmout.write(mmheader.encode()) - mmout.write(mmtotal.encode()) + mmout.write(mmheader) + mmout.write(mmtotal) mtxstr = ' '.join(matrix_files) cmd = f'zcat {mtxstr} | LC_ALL=C sort -k2,2n -k1,1n | gzip >> {matrix_out}' run_shell_cmd(cmd) @@ -378,18 +314,17 @@ def writeEC(ecdump_files, outdir): ecdump_out = os.path.join(outdir, 'ec_dump.tsv.gz') ecdumpstr = ' '.join(ecdump_files) header = '\t'.join([ - 'BC_index', + 'Barcode_id', 'Barcode', - 'EC_index', - 'EC_name', - 'Raw_UMIs', - 'Raw_count', - 'Dedup_UMIs', - 'Dedup_count', - 'Filtered_TE' + 'EqClass', + 'UMI', + 'Features', + 'Read_count', + 'Dedup_UMI', + 'Dedup_feature' ]) + '\n' with gzip.GzipFile(ecdump_out, 'wb', mtime=0) as f: f.write(header.encode()) - cmd = f'zcat {ecdumpstr} | LC_ALL=C sort -k1,1n -k2 | gzip >> {ecdump_out}' + cmd = f'zcat {ecdumpstr} | LC_ALL=C sort -k1,1n -k3,3n | gzip >> {ecdump_out}' run_shell_cmd(cmd) - return ecdump_out + return ecdump_out \ No newline at end of file diff --git a/irescue/em.py 
b/irescue/em.py new file mode 100644 index 0000000..39443df --- /dev/null +++ b/irescue/em.py @@ -0,0 +1,49 @@ +import numpy as np + +def e_step(matrix, counts): + """ + Performs E-step of EM algorithm: proportionally assigns reads to features + based on relative feature abundances. + """ + colsums = (matrix * counts).sum(axis=1)[:, np.newaxis] + out = matrix / colsums * counts + return(out) + +def m_step(matrix): + """ + Performs M-step of EM algorithm: calculates feature abundances from read + counts proportionally distributed to features. + """ + counts = matrix.sum(axis=0) / matrix.sum() + return(counts) + +def run_em(matrix, cycles=100): + """ + Run Expectation-Maximization (EM) algorithm to redistribute read counts + across a set of features. + + Parameters + ---------- + matrix : array + Reads-features compatibility matrix. + cycles : int, optional + Number of EM cycles. + + Returns + ------- + out : list + Optimized relative feature abundances. + """ + + # calculate initial estimation of relative abundance. + # (let the sum of counts of features be 1, + # will be multiplied by the real UMI count later) + nFeatures = matrix.shape[1] + counts = np.array([1 / nFeatures] * nFeatures) + + # run EM for n cycles + for _ in range(cycles): + e_matrix = e_step(matrix=matrix, counts=counts) + counts = m_step(matrix=e_matrix) + + return(counts) \ No newline at end of file diff --git a/irescue/main.py b/irescue/main.py index 179353d..13f1089 100644 --- a/irescue/main.py +++ b/irescue/main.py @@ -3,11 +3,11 @@ from irescue._version import __version__ from irescue._genomes import __genomes__ from irescue.misc import writerr, versiontuple, run_shell_cmd -from irescue.misc import check_requirement, check_arguments, check_tags +from irescue.misc import check_requirement, check_tags from irescue.map import makeRmsk, getRefs, prepare_whitelist, isec, chrcat from irescue.map import checkIndex -from irescue.count import split_bc, parse_features, count, formatMM, writeEC -import argparse, os +from irescue.count import split_barcodes, index_features, run_count, formatMM, writeEC +import argparse, os, sys from multiprocessing import Pool from functools import partial from shutil import rmtree @@ -22,101 +22,90 @@ def parseArguments(): " in scRNA-seq.", epilog="Home page: https://github.com/bodegalab/irescue" ) - parser.add_argument('-b', '--bam', - required=True, - metavar='FILE', + parser.add_argument('-b', '--bam', required=True, metavar='FILE', help="scRNA-seq reads aligned to a reference genome " "(required).") - parser.add_argument('-r', '--regions', - metavar='FILE', + parser.add_argument('-r', '--regions', metavar='FILE', help="Genomic TE coordinates in bed format. " "Takes priority over --genome (default: %(default)s).") - parser.add_argument('-g', '--genome', - metavar='STR', + parser.add_argument('-g', '--genome', metavar='STR', + choices=__genomes__.keys(), help="Genome assembly symbol. One of: {} (default: " "%(default)s).".format(', '.join(__genomes__))) - parser.add_argument('-w', '--whitelist', - metavar='FILE', + parser.add_argument('-w', '--whitelist', metavar='FILE', help="Text file of filtered cell barcodes by e.g. " "Cell Ranger, STARSolo or your gene expression " "quantifier of choice (Recommended. 
" "default: %(default)s).") - parser.add_argument('-cb', '--CBtag', - default='CB', - metavar='STR', + parser.add_argument('-c', '--cb-tag', default='CB', metavar='STR', help="BAM tag containing the cell barcode sequence " "(default: %(default)s).") - parser.add_argument('-umi', '--UMItag', - default='UR', - metavar='STR', + parser.add_argument('-u', '--umi-tag', default='UR', metavar='STR', help="BAM tag containing the UMI sequence " "(default: %(default)s).") - parser.add_argument('-p', '--threads', - type=int, - default=1, - metavar='CPUS ', + parser.add_argument('-p', '--threads', type=int, default=1, metavar='CPUS', help="Number of cpus to use (default: %(default)s).") - parser.add_argument('-o', '--outdir', - default='IRescue_out', - metavar='DIR', + parser.add_argument('-o', '--outdir', default='irescue_out', metavar='DIR', help="Output directory name (default: %(default)s).") - parser.add_argument('--min-bp-overlap', - type=int, - metavar='INT', + parser.add_argument('--min-bp-overlap', type=int, metavar='INT', help="Minimum overlap between read and TE as number " "of nucleotides (Default: disabled).") - parser.add_argument('--min-fraction-overlap', - type=float, - metavar='FLOAT', + parser.add_argument('--min-fraction-overlap', type=float, metavar='FLOAT', + choices=[x/100 for x in range(101)], help="Minimum overlap between read and TE" " as a fraction of read's alignment" " (i.e. 0.00 <= NUM <= 1.00) (Default: disabled).") - parser.add_argument('--dumpEC', - action='store_true', + parser.add_argument('--dump-ec', action='store_true', help="Write a description log file of Equivalence " "Classes.") - parser.add_argument('--integers', - action='store_true', + parser.add_argument('--integers', action='store_true', help="Use if integers count are needed for " "downstream analysis.") - parser.add_argument('--samtools', - default='samtools', - metavar='PATH', + parser.add_argument('--samtools', default='samtools', metavar='PATH', help="Path to samtools binary, in case it's not in " "PATH (Default: %(default)s).") - parser.add_argument('--bedtools', - default='bedtools', - metavar='PATH', + parser.add_argument('--bedtools', default='bedtools', metavar='PATH', help="Path to bedtools binary, in case it's not in " "PATH (Default: %(default)s).") - parser.add_argument('--no-tags-check', - action='store_true', + parser.add_argument('--no-tags-check', action='store_true', help="Suppress checking for CBtag and UMItag " "presence in bam file.") - parser.add_argument('--keeptmp', - action='store_true', - help="Keep temporary files.") - parser.add_argument('--tmpdir', - default='IRescue_tmp', - metavar='DIR', - help="Directory to store temporary files " - "(default: %(default)s).") - parser.add_argument('-v', '--verbose', - action='store_true', + parser.add_argument('--keeptmp', action='store_true', + help="Keep temporary files under /tmp.") + #parser.add_argument('--tmpdir', default='irescue_out/tmp', metavar='DIR', + # help="Directory to store temporary files " + # "(default: %(default)s).") + parser.add_argument('-v', '--verbose', action='store_true', help="Writes a lot of stuff to stderr, such as " "chromosomes as they are mapped and cell barcodes " "as they are processed.") - parser.add_argument('-V', '--version', - action='version', + parser.add_argument('-V', '--version', action='version', version='%(prog)s {}'.format(__version__), help="Print software's version and exit.") return parser def main(): + + # Parse and print arguments parser = parseArguments() - args = parser.parse_args() - 
args = check_arguments(args) + args = parser.parse_args(args=None if sys.argv[1:] else ['--help']) + argstr = '\n'.join(f' {k}: {v}' for k, v in args.__dict__.items()) + sys.stderr.write(f" IRescue version {__version__}\n{argstr}\n") + + #__tmpdir__ = os.path.join(args.outdir, 'tmp') + dirs = { + 'out': args.outdir, + 'tmp': os.path.join(args.outdir, 'tmp'), + 'mex': os.path.join(args.outdir, 'counts') + } + + + ##################### + # Preliminary steps # + ##################### + + writerr("Running preliminary checks.") # Check requirements check_requirement( @@ -134,28 +123,33 @@ def main(): # Check if the selected cell barcode and UMI tags are present in bam file. if not args.no_tags_check: - check_tags(bamFile=args.bam, CBtag=args.CBtag, UMItag=args.UMItag, + check_tags(bamFile=args.bam, CBtag=args.cb_tag, UMItag=args.umi_tag, nLines=999999, exit_with_error=True, verbose=args.verbose) # Check for bam index file. If not present, will build an index. checkIndex(args.bam, verbose=args.verbose) - - writerr('IRescue job starts') - + # create directories - os.makedirs(args.tmpdir, exist_ok=True) - os.makedirs(args.outdir, exist_ok=True) + for v in dirs.values(): + os.makedirs(v, exist_ok=True) + + + ########### + # Mapping # + ########### + + writerr("Running mapping step.") # set regions object (provided or downloaded bed file) regions = makeRmsk(regions=args.regions, genome=args.genome, - genomes=__genomes__, tmpdir=args.tmpdir, + genomes=__genomes__, tmpdir=dirs['tmp'], outname='rmsk.bed') # get list of reference names from bam chrNames = getRefs(args.bam, regions) # decompress whitelist if compressed - whitelist = prepare_whitelist(args.whitelist, args.tmpdir) + whitelist = prepare_whitelist(args.whitelist, dirs['tmp']) # Allocate threads if args.threads > 1: @@ -168,8 +162,8 @@ def main(): send=args.verbose ) isecFun = partial( - isec, args.bam, regions, whitelist, args.CBtag, args.UMItag, - args.min_bp_overlap, args.min_fraction_overlap, args.tmpdir, + isec, args.bam, regions, whitelist, args.cb_tag, args.umi_tag, + args.min_bp_overlap, args.min_fraction_overlap, dirs['tmp'], args.samtools, args.bedtools, args.verbose ) if args.threads > 1: @@ -179,20 +173,27 @@ def main(): # concatenate intersection results mappings_file, barcodes_file, features_file = chrcat( - isecFiles, threads=args.threads, outdir=args.outdir, - tmpdir=args.tmpdir, verbose=args.verbose + isecFiles, threads=args.threads, outdir=dirs['mex'], + tmpdir=dirs['tmp'], bedtools=args.bedtools, verbose=args.verbose ) + + ######### + # Count # + ######### + + writerr("Running count step.") + # calculate number of mappings per process - bc_per_thread = list(split_bc(barcodes_file, args.threads)) + bc_per_thread = list(split_barcodes(barcodes_file, args.threads)) # parse features - ftlist = dict(parse_features(features_file)) + feature_index = index_features(features_file) # calculate TE counts countFun = partial( - count, mappings_file, args.outdir, args.tmpdir, ftlist, args.integers, - args.dumpEC, args.verbose + run_count, mappings_file, feature_index, dirs['tmp'], + args.dump_ec, args.verbose ) if args.threads > 1: mtxFiles = pool.map(countFun, bc_per_thread) @@ -208,16 +209,15 @@ def main(): matrix_files = [ i for i, j in mtxFiles] ecdump_files = [ j for i, j in mtxFiles] matrix_file = formatMM( - matrix_files, outdir=args.outdir, features=ftlist, - barcodes=bc_per_thread + matrix_files, feature_index, bc_per_thread, dirs['mex'] ) writerr(f'Writing sparse matrix to {matrix_file}') - if args.dumpEC: - ecdump_file = 
writeEC(ecdump_files, outdir=args.outdir) + if args.dump_ec: + ecdump_file = writeEC(ecdump_files, outdir=dirs['out']) writerr(f'Writing Equivalence Classes to {ecdump_file}') if not args.keeptmp: writerr(f'Cleaning up temporary files.', send=args.verbose) - rmtree(args.tmpdir) + rmtree(dirs['tmp']) writerr('Done.') diff --git a/irescue/map.py b/irescue/map.py index f82f9ea..9777af9 100644 --- a/irescue/map.py +++ b/irescue/map.py @@ -57,12 +57,6 @@ def makeRmsk(regions, genome, genomes, tmpdir, outname): # if no repeatmasker file is provided, and a genome assembly name is # provided, download and prepare a rmsk.bed file elif genome: - if not genome in genomes: - writerr( - "ERROR: Genome assembly name shouldbe one of: " - f"{', '.join(genomes.keys())}", - error=True - ) url, header_lines = genomes[genome] writerr( "Downloading and parsing RepeatMasker annotation for " @@ -101,7 +95,7 @@ def makeRmsk(regions, genome, genomes, tmpdir, outname): if famclass.split('/')[0] in fams_to_skip: continue # concatenate family and class with subfamily - subfamily += '~' + famclass + subfamily += '#' + famclass score = lst[0] chr, start, end = lst[4:7] # make coordinates 0-based @@ -172,7 +166,7 @@ def isec(bamFile, bedFile, whitelist, CBtag, UMItag, bpOverlap, fracOverlap, os.makedirs(isecdir, exist_ok=True) refFile = os.path.join(refdir, chrom + '.bed.gz') - isecFile = os.path.join(isecdir, chrom + '.isec.bed.gz') + isecFile = os.path.join(isecdir, chrom + '.isec.txt.gz') # split bed file by chromosome sort = 'LC_ALL=C sort -k1,1 -k2,2n --buffer-size=1G' @@ -210,8 +204,8 @@ def isec(bamFile, bedFile, whitelist, CBtag, UMItag, bpOverlap, fracOverlap, # remove mate information from read name cmd += ' { sub(/\/[12]$/,"",$4); ' # concatenate CB and UMI with feature name - cmd += ' n=split($4,qname,/\//); $4=qname[n-1]"\\t"qname[n]"\\t"$16; ' - cmd += ' print $4 }\' ' + cmd += ' n=split($4,qname,/\//); ' + cmd += ' print qname[n-1]"\\t"qname[n]"\\t"qname[1]"\\t"$16 }\' ' cmd += f' | gzip > {isecFile}' writerr(f'Extracting {chrom} reference', send=verbose) @@ -223,24 +217,34 @@ def isec(bamFile, bedFile, whitelist, CBtag, UMItag, bpOverlap, fracOverlap, return isecFile # Concatenate and sort data obtained from isec() -def chrcat(filesList, threads, outdir, tmpdir, verbose): +def chrcat(filesList, threads, outdir, tmpdir, bedtools, verbose): os.makedirs(outdir, exist_ok=True) - mappings_file = os.path.join(tmpdir, 'cb_umi_te.bed.gz') + mappings_file = os.path.join(tmpdir, 'mappings.tsv.gz') barcodes_file = os.path.join(outdir, 'barcodes.tsv.gz') features_file = os.path.join(outdir, 'features.tsv.gz') bedFiles = ' '.join(filesList) - cmd0 = f'zcat {bedFiles} ' - cmd0 += f' | LC_ALL=C sort --parallel {threads} --buffer-size 2G ' - cmd0 += f' | gzip > {mappings_file} ' + sort_threads = int(threads / 2 - 1) + sort_threads = sort_threads if sort_threads>0 else 1 + + # sort and summarize UMI-READ-TE mappings + sort_res = f'--parallel {sort_threads} --buffer-size 2G' + cmd0 = f'zcat {bedFiles}' + # input: "CB UMI READ FEAT" + cmd0 += f' | LC_ALL=C sort -u {sort_res}' + cmd0 += f' | {bedtools} groupby -g 1,2,3 -c 4 -o distinct' + # result: "CB UMI READ FEATs" + cmd0 += f' | LC_ALL=C sort -k1,2 -k4,4 {sort_res}' + cmd0 += f' | {bedtools} groupby -g 1,2,4 -c 3 -o count_distinct' + # result: "CB UMI FEATs count" + cmd0 += f' | gzip > {mappings_file}' + + # write barcodes.tsv.gz file cmd1 = f'zcat {mappings_file} | cut -f1 | uniq | gzip > {barcodes_file} ' + + # write features.tsv.gz file cmd2 = f'zcat {mappings_file} 
' - cmd2 += ' | gawk \'!x[$3]++ { ' - cmd2 += ' split($3,a,"~"); ' - # avoid subfamilies with the same name - cmd2 += ' if(a[1] in sf) { sf[a[1]]+=1 } else { sf[a[1]] }; ' - cmd2 += ' if(length(a)<2) { a[2]=a[1] }; ' - cmd2 += ' print a[1] sf[a[1]] "\\t" a[2] "\\tGene Expression" ' - cmd2 += ' }\' ' + cmd2 += ' | cut -f3 | sed \'s/,/\\n/g\' | gawk \'!x[$1]++ { ' + cmd2 += ' print $1"\\t"gensub(/#.+/,"",1,$1)"\\tGene Expression" }\' ' cmd2 += f' | LC_ALL=C sort -u | gzip > {features_file} ' writerr('Concatenating mappings', send=verbose) diff --git a/irescue/misc.py b/irescue/misc.py index 7a49285..aea6b03 100644 --- a/irescue/misc.py +++ b/irescue/misc.py @@ -37,18 +37,6 @@ def versiontuple(version): """ return tuple(map(int, version.split('.'))) -def check_arguments(args): - """ - Check validity of arguments. - """ - if isinstance(args.min_fraction_overlap, (int, float)): - if 0 <= args.min_fraction_overlap <= 1: - pass - else: - writerr("ERROR: --min-fraction-overlap must be a floating point " - "number between 0 and 1.", error=True) - return args - def check_requirement(cmd, required_version, parser, verbose): """ Check if the required version for a software has been installed. @@ -94,7 +82,7 @@ def writerr(msg, error=False, send=True): Decides if the message should be sent (useful for verbose messages). """ if send: - timelog = datetime.now().strftime("%m/%d/%Y - %H:%M:%S") + timelog = datetime.now().strftime("%Y/%m/%d - %H:%M:%S") message = f'[{timelog}] ' if not msg[-1]=='\n': msg += '\n' @@ -143,12 +131,6 @@ def getlen(file): f.close() return out -def flatten(x): - """ - Flatten a list of sublists. - """ - return [item for sublist in x for item in sublist] - def check_tags( bamFile, CBtag, UMItag, nLines=None, exit_with_error=True, verbose=False @@ -216,22 +198,15 @@ def check_tags( else: return(False) -def iupac_nt_code(nts): - """ - Return the IUPAC code correspondent to a set of input nucleotides. - """ - codes = { - 'R': {'A', 'G'}, - 'Y': {'C', 'T'}, - 'S': {'G', 'C'}, - 'W': {'A', 'T'}, - 'K': {'G', 'T'}, - 'M': {'A', 'C'}, - 'B': {'C', 'G', 'T'}, - 'D': {'A', 'G', 'T'}, - 'H': {'A', 'C', 'T'}, - 'V': {'A', 'C', 'G'}, - 'N': {'A', 'C', 'G', 'T'} - } - out = [k for k, v in codes.items() if v == set(nts)][0] - return out +def get_ranges(num, div): + """ + Splits an integer X into N integers whose sum is equal to X. + """ + split = int(num/div) + for i in range(0, num, split): + j = i + split + if j > num-split: + j = num + yield range(i, j) + break + yield range(i, j) diff --git a/irescue/network.py b/irescue/network.py new file mode 100644 index 0000000..f274e3a --- /dev/null +++ b/irescue/network.py @@ -0,0 +1,69 @@ +#!/usr/bin/env python + +# NB: This module include partly modified third-party code distributed under the +# license below. + +############################################################################## +# The MIT License (MIT) + +# Copyright (c) 2015 CGAT + +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: + +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. 
+ +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +############################################################################## + +from collections import defaultdict + +def get_substr_slices(umi_length, idx_size): + ''' + Create slices to split a UMI into approximately equal size substrings + Returns a list of tuples that can be passed to slice function + ''' + cs, r = divmod(umi_length, idx_size) + sub_sizes = [cs + 1] * r + [cs] * (idx_size - r) + offset = 0 + slices = [] + for s in sub_sizes: + slices.append((offset, offset + s)) + offset += s + return slices + +def build_substr_idx(equivalence_classes, length, threshold): + ''' + Group equivalence classes into subgroups having a common substring + ''' + slices = get_substr_slices(length, threshold+1) + substr_idx = {k: defaultdict(set) for k in slices} + for idx in slices: + for ec in equivalence_classes: + sub = ec.umi[slice(*idx)] + substr_idx[idx][sub].add(ec) + return substr_idx + +def gen_ec_pairs(equivalence_classes, substr_idx): + ''' + Yields equivalence classes pairs from build_substr_idx() + ''' + for i, ec in enumerate(equivalence_classes, start=1): + neighbours = set() + for idx, substr_map in substr_idx.items(): + sub = ec.umi[slice(*idx)] + neighbours = neighbours.union(substr_map[sub]) + neighbours.difference_update(equivalence_classes[:i]) + for nbr in neighbours: + yield ec, nbr \ No newline at end of file diff --git a/pyproject.toml b/pyproject.toml index 123ca38..9f2a6c8 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -32,6 +32,7 @@ dependencies = [ "numpy >= 1.20.2", "pysam >= 0.16.0.1", "requests >= 2.27.1", + "networkx >= 3.1", ] dynamic = ["version"] diff --git a/tests/data/rmsk.bed.gz b/tests/data/rmsk.bed.gz index d032937..422df4a 100644 Binary files a/tests/data/rmsk.bed.gz and b/tests/data/rmsk.bed.gz differ diff --git a/tests/test.yml b/tests/test.yml index 291c46d..4e55120 100644 --- a/tests/test.yml +++ b/tests/test.yml @@ -1,48 +1,48 @@ - name: base command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v files: - - path: "IRescue_out/barcodes.tsv.gz" + - path: "irescue_out/counts/barcodes.tsv.gz" md5sum: 1a74fa12e65ac1703bbe61282854f151 - - path: "IRescue_out/features.tsv.gz" - md5sum: e8bf21611afd1f40d722ed985f4e3392 - - path: "IRescue_out/matrix.mtx.gz" - md5sum: 04ddbd538c796f019f37d3048b159a2f + - path: "irescue_out/counts/features.tsv.gz" + md5sum: ae84bc368a289e070b754030a65d69b4 + - path: "irescue_out/counts/matrix.mtx.gz" + md5sum: ca147b42af250be7c47c4a748693ca97 - name: genome tags: - genome command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -g test --keeptmp -v files: - - path: "IRescue_out/barcodes.tsv.gz" + - path: "irescue_out/counts/barcodes.tsv.gz" md5sum: 1a74fa12e65ac1703bbe61282854f151 - - path: "IRescue_out/features.tsv.gz" - md5sum: e8bf21611afd1f40d722ed985f4e3392 - - path: "IRescue_out/matrix.mtx.gz" - md5sum: 04ddbd538c796f019f37d3048b159a2f + - path: "irescue_out/counts/features.tsv.gz" + md5sum: ae84bc368a289e070b754030a65d69b4 + - path: "irescue_out/counts/matrix.mtx.gz" + md5sum: 
ca147b42af250be7c47c4a748693ca97 - name: multi tags: - multi command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz -p 2 --keeptmp -v files: - - path: "IRescue_out/barcodes.tsv.gz" + - path: "irescue_out/counts/barcodes.tsv.gz" md5sum: 1a74fa12e65ac1703bbe61282854f151 - - path: "IRescue_out/features.tsv.gz" - md5sum: e8bf21611afd1f40d722ed985f4e3392 - - path: "IRescue_out/matrix.mtx.gz" - md5sum: 04ddbd538c796f019f37d3048b159a2f + - path: "irescue_out/counts/features.tsv.gz" + md5sum: ae84bc368a289e070b754030a65d69b4 + - path: "irescue_out/counts/matrix.mtx.gz" + md5sum: ca147b42af250be7c47c4a748693ca97 - name: whitelist tags: - whitelist command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz -w ./tests/data/whitelist.txt --keeptmp -v files: - - path: "IRescue_out/barcodes.tsv.gz" + - path: "irescue_out/counts/barcodes.tsv.gz" md5sum: 95dccc15cbee4feeeae2fbce4d7b41ad - - path: "IRescue_out/features.tsv.gz" - md5sum: 2dcec6f4aead5faba9c1af44b0129b55 - - path: "IRescue_out/matrix.mtx.gz" - md5sum: 85c61d1df6ccadf83eafc6bc36a21c89 + - path: "irescue_out/counts/features.tsv.gz" + md5sum: 65fb8381a658a4eb4e5d0a575c67818d + - path: "irescue_out/counts/matrix.mtx.gz" + md5sum: d4f60bc056ea189c7473a3624f3c2970 - name: multi whitelist tags: @@ -50,65 +50,65 @@ - whitelist command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz -w ./tests/data/whitelist.txt --keeptmp -v -p 2 files: - - path: "IRescue_out/barcodes.tsv.gz" + - path: "irescue_out/counts/barcodes.tsv.gz" md5sum: 95dccc15cbee4feeeae2fbce4d7b41ad - - path: "IRescue_out/features.tsv.gz" - md5sum: 2dcec6f4aead5faba9c1af44b0129b55 - - path: "IRescue_out/matrix.mtx.gz" - md5sum: 85c61d1df6ccadf83eafc6bc36a21c89 + - path: "irescue_out/counts/features.tsv.gz" + md5sum: 65fb8381a658a4eb4e5d0a575c67818d + - path: "irescue_out/counts/matrix.mtx.gz" + md5sum: d4f60bc056ea189c7473a3624f3c2970 - name: ecdump tags: - ecdump - command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v --dumpEC + command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v --dump-ec files: - - path: "IRescue_out/barcodes.tsv.gz" + - path: "irescue_out/counts/barcodes.tsv.gz" md5sum: 1a74fa12e65ac1703bbe61282854f151 - - path: "IRescue_out/features.tsv.gz" - md5sum: e8bf21611afd1f40d722ed985f4e3392 - - path: "IRescue_out/matrix.mtx.gz" - md5sum: 04ddbd538c796f019f37d3048b159a2f - - path: "IRescue_out/ec_dump.tsv.gz" - md5sum: 2fbcb954fb48065c6b67a84001b6bc34 + - path: "irescue_out/counts/features.tsv.gz" + md5sum: ae84bc368a289e070b754030a65d69b4 + - path: "irescue_out/counts/matrix.mtx.gz" + md5sum: ca147b42af250be7c47c4a748693ca97 + - path: "irescue_out/ec_dump.tsv.gz" + md5sum: d71ee82b25107d4e104d313efb4be134 - name: multi ecdump tags: - multi - ecdump - command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v -p 2 --dumpEC + command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v -p 2 --dump-ec files: - - path: "IRescue_out/barcodes.tsv.gz" + - path: "irescue_out/counts/barcodes.tsv.gz" md5sum: 1a74fa12e65ac1703bbe61282854f151 - - path: "IRescue_out/features.tsv.gz" - md5sum: e8bf21611afd1f40d722ed985f4e3392 - - path: "IRescue_out/matrix.mtx.gz" - md5sum: 04ddbd538c796f019f37d3048b159a2f - - path: "IRescue_out/ec_dump.tsv.gz" - md5sum: 2fbcb954fb48065c6b67a84001b6bc34 + - path: 
"irescue_out/counts/features.tsv.gz" + md5sum: ae84bc368a289e070b754030a65d69b4 + - path: "irescue_out/counts/matrix.mtx.gz" + md5sum: ca147b42af250be7c47c4a748693ca97 + - path: "irescue_out/ec_dump.tsv.gz" + md5sum: d71ee82b25107d4e104d313efb4be134 - name: bp tags: - bp command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v --min-bp-overlap 10 files: - - path: "IRescue_out/barcodes.tsv.gz" + - path: "irescue_out/counts/barcodes.tsv.gz" md5sum: 7433e88e94aec2f16a20459275188f1f - - path: "IRescue_out/features.tsv.gz" - md5sum: 12ff16aee1a5e9847ed96534b3764d13 - - path: "IRescue_out/matrix.mtx.gz" - md5sum: 39b3ee6dbffd61a68569b3b30dcaf972 + - path: "irescue_out/counts/features.tsv.gz" + md5sum: 434ff68c92d1b8dd718269a1cd974f99 + - path: "irescue_out/counts/matrix.mtx.gz" + md5sum: 30fe31ed8976bd002d86bcd956d25855 - name: fraction tags: - fraction command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v --min-fraction-overlap 0.5 files: - - path: "IRescue_out/barcodes.tsv.gz" + - path: "irescue_out/counts/barcodes.tsv.gz" md5sum: 4de44d3e4a851392a48ccabfee5bb6fc - - path: "IRescue_out/features.tsv.gz" - md5sum: 6ee6ded0563e8e138fb1d5c958cedeee - - path: "IRescue_out/matrix.mtx.gz" - md5sum: 345b8aff9c00f607ea5a305bed569653 + - path: "irescue_out/counts/features.tsv.gz" + md5sum: 927d00f20e4e65b8d46e761d406b69ff + - path: "irescue_out/counts/matrix.mtx.gz" + md5sum: 85b8e8ae7696ff12ffa7e5ac86600fa1 - name: bp fraction tags: @@ -116,9 +116,9 @@ - fraction command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v --min-bp-overlap 10 --min-fraction-overlap 0.5 files: - - path: "IRescue_out/barcodes.tsv.gz" + - path: "irescue_out/counts/barcodes.tsv.gz" md5sum: 4de44d3e4a851392a48ccabfee5bb6fc - - path: "IRescue_out/features.tsv.gz" - md5sum: c95db95604d1731d2908f08eeaf8ded1 - - path: "IRescue_out/matrix.mtx.gz" - md5sum: 40fd2a331d7328d4a4a5428307f8adaf + - path: "irescue_out/counts/features.tsv.gz" + md5sum: f304e63657f73eeec0edffed68490b6c + - path: "irescue_out/counts/matrix.mtx.gz" + md5sum: 336e5a5edfad998bc7d64cf0e68cc897