diff --git a/LICENSE b/LICENSE
index ab3aa59..7c78c2d 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,6 @@
MIT License
-Copyright (c) 2022 Benedetto Polimeni
+Copyright (c) 2022-2024 Benedetto Polimeni
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
diff --git a/README.md b/README.md
index 892ad98..c10d0b0 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
# IRescue - Interspersed Repeats single-cell quantifier
-IRescue is a software for quantifying the expression of transposable elements (TEs) subfamilies in single cell RNA sequencing (scRNA-seq) data. The core feature of IRescue is to consider all multiple alignments (i.e. non-primary alignments) of reads/UMIs mapping on multiple TEs in a BAM file, to accurately infer the TE subfamily of origin. IRescue implements a UMI error-correction, deduplication and quantification strategy that includes such alignment events. IRescue's output is compatible with most scRNA-seq analysis toolkits, such as Seurat or Scanpy.
+IRescue quantifies the expression of transposable element (TE) subfamilies in single cell RNA sequencing (scRNA-seq) data. It performs UMI deduplication with sequencing error correction and probabilistic assignment of multi-mapping reads via an expectation-maximization (EM) procedure. TE counts are written to a sparse matrix (similar to Cell Ranger's output) compatible with Seurat, Scanpy and other toolkits.
## Content
@@ -34,7 +34,7 @@ conda create -n irescue -c conda-forge -c bioconda irescue
### Using pip
-If for any reason it's not possible or desiderable to use conda, it can be installed with pip and the following requirements must be installed manually: `python>=3.7`, `samtools>=1.12`, `bedtools>=2.30.0`, and fairly recent versions of the GNU utilities are required, specifically `coreutils>=8.30` and `gzip>=1.10` (older versions are untested).
+If for any reason it's not possible or desirable to use conda, IRescue can be installed with pip, provided the following requirements are installed manually: `python>=3.7`, `samtools>=1.12`, `bedtools>=2.30.0`, and fairly recent versions of the GNU utilities, specifically `gawk>=5.0.1`, `coreutils>=8.30` and `gzip>=1.10` (older versions are untested).
```bash
pip install irescue
@@ -57,29 +57,36 @@ singularity exec https://depot.galaxyproject.org/singularity/irescue:$TAG irescu
## Usage
-### Quick start
+```sh
+irescue --help
+```
+
+The only required input is a BAM file annotated with cell barcode and UMI sequences as tags (by default, `CB` tag for cell barcode and `UR` tag for UMI; override with `--cb-tag` and `--umi-tag`).
-The only required input is a BAM file annotated with cell barcode and UMI sequences as tags (by default, `CB` tag for cell barcode and `UR` tag for UMI; override with `--CBtag` and `--UMItag`). You can obtain it by aligning your reads using [STARsolo](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md).
+You can obtain it by aligning your reads using [STARsolo](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md). It is advised to keep secondary alignments in the BAM file, as they will be used in the EM procedure to assign multi-mapping reads (e.g. `--outFilterMultimapNmax 100 --winAnchorMultimapNmax 100` or more), and remember to output all the needed SAM attributes (e.g. `--outSAMattributes NH HI AS nM NM MD jM jI XS MC ch cN CR CY UR UY GX GN CB UB sM sS sQ`).
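+
+As a quick sanity check, you can verify that reads carry the expected tags. Below is a minimal sketch using `pysam` (already an IRescue dependency); the BAM file name is a placeholder:
+
+```python
+import pysam
+
+# scan the first reads for cell barcode (CB) and UMI (UR) tags
+with pysam.AlignmentFile("genome_alignments.bam", "rb") as bam:
+    for i, read in enumerate(bam.fetch(until_eof=True)):
+        if read.has_tag("CB") and read.has_tag("UR"):
+            print("CB and UR tags found")
+            break
+        if i >= 1000:
+            print("no CB/UR tags in the first 1000 reads")
+            break
+```
+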
RepeatMasker annotation will be automatically downloaded for the chosen genome assembly (e.g. `-g hg38`), or provide your own annotation in bed format (e.g. `-r TE.bed`).
-```bash
+```sh
irescue -b genome_alignments.bam -g hg38
```
-If you already obtained gene-level counts (using STARsolo, Cell Ranger, Alevin, Kallisto or other tools), it is advised to provide the whitelisted cell barcodes list as a text file, e.g.: `-w barcodes.tsv`. This will significantly improve performance.
+If you already obtained gene-level counts (using STARsolo, Cell Ranger, Alevin, Kallisto or other tools), it is advised to provide the list of whitelisted cell barcodes as a text file (`-w barcodes.tsv`). This will significantly improve performance by processing viable cells only.
-IRescue performs best using at least 4 threads, e.g.: `-p 8`.
+For optimal run time, use at least 4 threads, e.g. `-p 8`.
### Output files
-IRescue generates TE counts in a sparse matrix format, readable by [Seurat](https://github.com/satijalab/seurat) or [Scanpy](https://github.com/scverse/scanpy):
+IRescue writes TE counts in a sparse matrix format, readable by [Seurat](https://github.com/satijalab/seurat) or [Scanpy](https://github.com/scverse/scanpy), into a `counts/` subdirectory. Optional outputs include a description of equivalence classes with UMI deduplication stats (`ec_dump.tsv.gz`) and a subdirectory of temporary files (`tmp/`) for debugging purposes. Detailed logging is enabled by `--verbose` and written to standard error.
```
-IRescue_out/
-├── barcodes.tsv.gz
-├── features.tsv.gz
-└── matrix.mtx.gz
+irescue_out/
+├── counts/
+│ ├── barcodes.tsv.gz
+│ ├── features.tsv.gz
+│ └── matrix.mtx.gz
+├── ec_dump.tsv.gz
+└── tmp/
```
### Load IRescue data with Seurat
@@ -108,8 +115,10 @@ Active assay: RNA (31078 features, 0 variable features)
1 other assay present: TE
```
+From here, TE expression can be normalized, and dimensionality reduction can be performed on either TE or gene expression.
+
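+In Scanpy, a minimal sketch for loading the matrix as an AnnData object (assuming the default `irescue_out` output directory) would be:
+
+```python
+import scanpy as sc
+
+# reads matrix.mtx.gz, barcodes.tsv.gz and features.tsv.gz from the directory
+adata = sc.read_10x_mtx("irescue_out/counts")
+```
+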
## Cite
Polimeni B, Marasca F, Ranzani V, Bodega B.
-IRescue: single cell uncertainty-aware quantification of transposable elements expression.
+*IRescue: uncertainty-aware quantification of transposable elements expression at single cell level.*
bioRxiv 2022.09.16.508229; doi: https://doi.org/10.1101/2022.09.16.508229
diff --git a/irescue/_version.py b/irescue/_version.py
index 4980bee..236fd8c 100644
--- a/irescue/_version.py
+++ b/irescue/_version.py
@@ -1 +1 @@
-__version__ = '1.1.0-beta.1'
+__version__ = '1.1.0-beta.2'
diff --git a/irescue/count.py b/irescue/count.py
index 3a10b0f..6d055e8 100644
--- a/irescue/count.py
+++ b/irescue/count.py
@@ -1,372 +1,308 @@
#!/usr/bin/env python
+from collections import Counter
+from itertools import combinations
import numpy as np
-from irescue.misc import getlen, writerr, flatten, run_shell_cmd, iupac_nt_code
+import networkx as nx
+from irescue.misc import get_ranges, getlen, writerr, run_shell_cmd
+from irescue.network import build_substr_idx, gen_ec_pairs
+from irescue.em import run_em
import gzip
import os
-def find_mm(x, y):
+class EquivalenceClass:
+ def __init__(
+ self,
+ index: int,
+ umi: bytes,
+ features: set,
+ count: int
+ ) -> None:
+ self.index = index
+ self.umi = umi
+ self.features = features
+ self.count = count
+ def to_tuple(self):
+ return (self.umi, self.features, self.count)
+ def hdist(self, umi):
+ return sum(1 for i, j in zip(self.umi, umi) if i != j)
+ def connect(self, eqc, threshold):
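+        # Directional adjacency criterion (cf. UMI-tools): connect ECs whose
+        # counts satisfy count >= 2 * other - 1, that share at least one
+        # feature, and whose UMIs are within the hamming distance threshold.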
+ return (self.count >= (2 * eqc.count) - 1
+ and self.features.intersection(eqc.features)
+ and self.hdist(eqc.umi) <= threshold)
+
+def pathfinder(graph, node, path=None, features=None):
"""
- Calculate number of mismatches between sequences of the same length
+    Finds the first valid path of equivalence classes with compatible
+    features, given a starting node. Can be used iteratively to find all
+    possible paths.
"""
- if len(x) != len(y):
- return -1
- mm = 0
- for i in range(len(x)):
- if x[i] != y[i]:
- mm += 1
- return mm
+    if path is None:
+        path = []
+    if not features:
+        features = graph.nodes[node]['ft']
+    path += [node]
+ for next_node in graph.successors(node):
+ if (features.intersection(graph.nodes[next_node]['ft'])
+ and next_node not in path):
+ path = pathfinder(graph, next_node, path, features)
+ return path
+
+def index_features(features_file):
+ idx = {}
+ with gzip.open(features_file, 'rb') as f:
+ for i, line in enumerate(f, start=1):
+ ft = line.strip().split(b'\t')[0]
+ idx[ft] = i
+ return idx
-def collapse_networks(graph):
+def parse_maps(maps_file, feature_index):
"""
- Collapse a UMI graph to a graph of the smallest number of hubs.
-
- Parameters
- ----------
- graph: dict
- A dictionary with nodes as keys and the set of adjacent nodes
- (including the node itself) as values.
- e.g.: {0: {0,1,2}, 1: {0,1}, 2: {0,2,3}}
+    Parse a mappings file and yield equivalence classes grouped by cell.
+
+    maps_file : str
+        Gzipped mappings file with lines of "CB UMI FEATs count".
+    yields : (bytes, list)
+        Cell barcode and its equivalence classes:
+        (CB, [(UMI, {feature_index, ...}, count), ...])
"""
- out = dict()
- for key, value in graph.items():
- out[key] = []
- if len(value) == 1:
- # check if it's a single node, then add to the output and go to the next node
- out[key].append(value)
- continue
- for val in graph.values():
- # check if there are other nodes that contains all the values of the current one
- if all(i in value for i in val):
- out[key].append(val)
- if len(out[key]) <= 1:
- out.popitem()
- return out
+ with gzip.open(maps_file, 'rb') as f:
+ cb, umi, feat, count = f.readline().strip().split(b'\t')
+ i = 0
+ it = cb
+ count = int(count)
+ feat = {feature_index[ft] for ft in feat.split(b',')}
+ eqcl = [EquivalenceClass(i, umi, feat, count)]
+ for line in f:
+ cb, umi, feat, count = line.strip().split(b'\t')
+ count = int(count)
+ feat = {feature_index[ft] for ft in feat.split(b',')}
+ if cb == it:
+ i += 1
+ eqcl.append(EquivalenceClass(i, umi, feat, count))
+ else:
+ yield it, eqcl
+ it = cb
+ i = 0
+ eqcl = [EquivalenceClass(i, umi, feat, count)]
+ yield it, eqcl
-# calculate counts of a cell from mappings dictionary
-def cellCount(maps, intcount=False, dumpec=False):
+def compute_cell_counts(equivalence_classes, features_index, dumpEC):
"""
- Deduplicate UMI counts of a cell.
+ Calculate TE counts of a single cell, given a list of equivalence classes.
Parameters
----------
- maps: dict
- Dictionary of all UMI-TE mappings of the cell.
- e.g.: {UMI: {TE_1, TE_2}}
- intcount: bool
- Convert all counts to integer.
- dumpec:
- Make a list of rows for the Equivalence Classes dump (to use with
- --dumpEC on)
+    equivalence_classes : list
+        List of EquivalenceClass objects, each holding a UMI sequence,
+        a set of TE indices and a read count.
+
+ Returns
+ -------
+    out : dict, dict
+        Feature-count dictionary and equivalence class dump metadata
+        (None unless dumpEC is set).
"""
-
- # get and index equivalence classes from maps
- eclist = list()
- for v in maps.values():
- eclist.append(tuple(sorted(v.keys())))
- eclist = sorted(list(set(eclist)))
-
- # make a simple mapping dict (index number in place of families) and its reverse
- smaps = dict([(i,eclist.index(tuple(sorted(j.keys())))) for i,j in maps.items()])
- rsmaps = dict()
- for key, value in smaps.items():
- rsmaps.setdefault(value, list()).append(key)
-
- # compute the count of each equivalence class in the cell barcode
- counts = dict()
- ec_log = []
- for ec in rsmaps:
- # list of UMIs associated to EC
- umis = rsmaps[ec]
- ### compute the total count of the equivalence class
- if len(umis) > 1:
- ### Find and collapse duplicated UMIs ###
- # Make an NxN array of number of mismatches between N UMIs
- mm_arr = np.array([[find_mm(ux,i) for i in umis] for ux in umis])
- # Find UMI pairs with up to 1 mismatch, where UMIs are representad
- # by integers: [[i, j], [i, k], [k, m]]
- mm_check = np.argwhere(mm_arr <= 1)
- # Make a graph that connects UMIs with <=1 mismatches
- # {NODE: EDGES} or {UMI: [CONNECTED_UMIS]}
- graph = dict()
- for key, value in mm_check:
- graph.setdefault(key, set()).add(value)
- # Check if all nodes are connected (i.e. complete graph)
- if all([x == set(graph.keys()) for x in graph.values()]):
- # Set EC final count to 1
- ec_count = 1
- if dumpec:
- mm = [
- (i, j) for i, j in enumerate(
- [set(x) for x in zip(*umis)]
- )
- if len(j) > 1
- ]
- if len(mm) == 1:
- mm = mm[0]
- if mm:
- iupac = iupac_nt_code(mm[1])
- umis_dedup = list(umis[0])
- umis_dedup[mm[0]] = iupac
- umis_dedup = [''.join(umis_dedup)]
- else:
- umis_dedup = [''.join(umis_dedup)]
- else:
- # Collapse networks based on UMI similarity: {HUB: [UMI_GRAPHS]}
- coll_nets = collapse_networks(graph)
- # Get EC final count after collapsing
- ec_count = len(coll_nets)
- #
- if dumpec:
- umis_dedup = [umis[x] for x in coll_nets]
-
+ # initialize TE counts and dedup log
+ counts = Counter()
+ dump = None
+ number_of_features = len(features_index)
+ # build cell-wide UMI deduplication graph
+ graph = nx.DiGraph()
+ # add nodes with annotated features
+ graph.add_nodes_from(
+ [(x.index, {'ft': x.features, 'c': x.count})
+ for x in equivalence_classes]
+ )
+ # make an iterator of umi pairs
+ if len(equivalence_classes) > 25:
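+        # with many ECs, avoid the quadratic all-vs-all comparison by
+        # indexing UMIs on substrings and only pairing ECs that share one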
+ umi_length = len(equivalence_classes[0].umi)
+ substr_idx = build_substr_idx(equivalence_classes, umi_length, 1)
+ iter_ec_pairs = gen_ec_pairs(equivalence_classes, substr_idx)
+ else:
+ iter_ec_pairs = combinations(equivalence_classes, 2)
+ for x, y in iter_ec_pairs:
+ # add edges to graph
+ if x.connect(y, 1):
+ graph.add_edge(x.index, y.index)
+ if y.connect(x, 1):
+ graph.add_edge(y.index, x.index)
+ if dumpEC:
+ # collect graph metadata in a dictionary
+ dump = {i: equivalence_classes[i].to_tuple() for i in graph.nodes}
+ # split cell-wide graph into subgraphs of connected nodes
+ subgraphs = [graph.subgraph(x) for x in
+ nx.connected_components(graph.to_undirected())]
+ # put aside networks that will be solved with EM
+ em_array = []
+ # solve UMI deduplication for each subgraph of connected nodes
+ for subg in subgraphs:
+ # find all parent nodes in graph
+ parents = [x for x in subg if not list(subg.predecessors(x))]
+ if not parents:
+ # if no parents are found due to bidirected edges, take all nodes
+ # and the union of all features (i.e. all nodes are parents).
+ parents = list(subg.nodes)
+ features = [list(set.union(*[subg.nodes[x]['ft'] for x in subg]))]
else:
- # If only one umi, skip collapsing and assign 1 to the final count
- ec_count = 1
- if dumpec:
- umis_dedup = umis
-
- ### find the predominant TE family in the equivalence class
- # make count matrix from mappings (row = UMI, column = TE)
- ec_counts = np.array([list(j.values()) for i,j in maps.items() if i in rsmaps[ec]])
- # sum counts by TE
- ec_sum = ec_counts.sum(axis = 0)
- # find the index of the highest count
- ec_max = np.argwhere(ec_sum == ec_sum.max()).flatten()
-
- # retrieve the TEs with highest count
- te_max = list()
- for i in ec_max:
- te_max.append(eclist[ec][i])
-
- # add count
- for te in te_max:
- # initialize the feature in the cell barcode dictionary
- if te not in counts:
- counts[te] = 0
- # get the normalized count by dividing the raw count by the number of predominant TEs
- norm_count = ec_count / len(te_max)
- # if integers are needed, round the normalized count
- if intcount:
- norm_count = round(norm_count)
- # add count to dictionary
- counts[te] += norm_count
-
- # dump EC
- if dumpec:
- if umis == umis_dedup:
- umis_dedup = ['-']
- ec_log.append('\t'.join([
- str(ec), # EC index
- ','.join(eclist[ec]), # EC name
- ','.join(umis), # Raw UMIs
- str(len(umis)), # Raw count
- ','.join(umis_dedup), # Deduplicated UMIs
- str(ec_count), # Deduplicated count
- ','.join(te_max) # Filtered TEs
- ]) + '\n')
-
- return ec_log, counts
-
-def parse_features(features_file):
- """
- Parses the features.tsv file, assigns an index (int) for each feature and
- yields (index, feature) tuples.
- """
- with gzip.open(features_file, 'rb') as f:
- for i, line in enumerate(f):
- l = line.decode('utf-8').strip().split('\t')
- yield (l[0], i+1)
-
-def split_int(num, div):
- """
- Splits an integer X into N integers whose sum is equal to X.
- """
- split = int(num/div)
- for i in range(0, num, split):
- j = i + split
- if j > num-split:
- j = num
- yield range(i, j)
- break
- yield range(i, j)
-
-def split_bc(barcode_file, n):
- """
- Yields barcodes (index,sequence) tuples in n chunks.
- """
- bclen = getlen(barcode_file)
- #split = round(bclen/n)
- with gzip.open(barcode_file, 'rb') as f:
- c=0
- for chunk in split_int(bclen, n):
- yield (c,[(next(f).decode('utf-8').strip(),x+1) for x in chunk])
- c+=1
-
-def count(
- mappings_file, outdir, tmpdir, features, intcount, dumpec, verbose,
- bc_split
-):
+            # if parent nodes are found, features will be determined below.
+ features = None
+ # initialize dict of possible paths configurations, starting from
+ # each parent node.
+ paths = {x: [] for x in parents}
+ # find paths starting from each parent node
+ for parent in parents:
+ # populate this list with nodes utilized in paths
+ blacklist = []
+ # find paths in list of nodes starting from parent
+ path = []
+ subg_copy = subg.copy()
+ nodes = [parent] + [x for x in subg_copy if x != parent]
+ for node in nodes:
+ # make a copy of subgraph and remove nodes already used
+ # in a path
+ if node not in blacklist:
+ path = pathfinder(subg_copy, node, path=[], features=None)
+ for x in path:
+ blacklist.append(x)
+ subg_copy.remove_node(x)
+ paths[parent].append(path)
+ # find the path configuration leading to the minimum number of
+ # deduplicated UMIs -> list of lists of nodes
+ path_config = [
+ paths[k] for k, v in paths.items()
+ if len(v) == min([len(x) for x in paths.values()])
+ ][0]
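+        # (ties between equally sized configurations are broken by taking
+        # the first one)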
+ if not features:
+ # take features from parent node of selected path configuration
+ features = [list(subg.nodes[x[0]]['ft']) for x in path_config]
+ else:
+            # if features was already determined (i.e. no parent nodes),
+            # repeat the features list once per path in path_config to
+            # avoid going out of range
+ features *= len(path_config)
+ # assign UMI count to features
+ for feats in features:
+ if len(feats) == 1:
+ counts[feats[0]] += 1.0
+ elif len(feats) > 1:
+ row = [1 if x in feats else 0
+ for x in range(1, number_of_features+1)]
+ em_array.append(row)
+ else:
+                # log debug info on the unexpected graph state
+                writerr(str(nx.to_dict_of_lists(subg)))
+                writerr(str([subg.nodes[x]['ft'] for x in subg.nodes]))
+                writerr(str([subg.nodes[x]['c'] for x in subg.nodes]))
+                writerr(str(path_config))
+                writerr(str(path))
+                writerr(str(features))
+                writerr(str(feats))
+ writerr("Error: no common features detected in subgraph's"
+ " path.", error=True)
+ # add EC log to dump
+ if dumpEC:
+ for i, path_ in enumerate(path_config):
+ # add empty fields to parent node
+ parent_ = path_[0]
+ path_.pop(0)
+ dump[parent_] += (b'', b'')
+            # if child nodes are present, add parent node information
+ for x in path_:
+ # add parent's UMI sequence and dedup features
+ dump[x] += (dump[parent_][0], features[i])
+ if em_array:
+ # optimize the assignment of UMI from multimapping reads
+ em_array = np.array(em_array)
+ # save an array with features > 0, as in em_array order
+ tokeep = np.argwhere(np.any(em_array[..., :] > 0, axis=0))[:,0] + 1
+ # remove unmapped features from em_array
+ todel = np.argwhere(np.all(em_array[..., :] == 0, axis=0))
+ em_array = np.delete(em_array, todel, axis=1)
+ # run EM
+ em_counts = run_em(em_array, cycles=100)
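+        # run_em returns relative abundances summing to one; scale by the
+        # number of multi-mapping UMIs to obtain counts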
+ em_counts = [x*em_array.shape[0] for x in em_counts]
+ for i, c in zip(tokeep, em_counts):
+ if c > 0:
+ counts[i] += c
+ return dict(counts), dump
+
+def split_barcodes(barcodes_file, n):
"""
- Run cellCount() for a set of barcodes.
-
- Parameters
+    Yields barcodes in n chunks as (chunk_index, {barcode: column_index}).
+
+    Parameters
----------
- mappings_file: str
- File containing UMI-TE mappings (3-columns text of CB-UMI-TE)
- outdir: str
- Output dir to write into.
- tmpdir: str
- Directory to write temporary files into.
- features: list
- List of (index, feature) tuples, generated with parse_features().
- intcount: bool
- Convert all counts to integer.
- dumpec: bool
- Write a report of equivalence classes and UMI deduplication.
- verbose: bool
- Be verbose.
- bc_split: list
- List of barcodes to process, generated with split_bc().
+    barcodes_file : str
+        Gzipped barcodes file.
+    n : int
+        Number of chunks.
"""
- '''Runs cellCount for a set of barcodes'''
- os.makedirs(outdir, exist_ok=True)
- os.makedirs(tmpdir, exist_ok=True)
-
- # set temporary matrix name prefix as chunk number
- chunkn = bc_split[0]
- matrix_file = os.path.join(tmpdir, f'{chunkn}_matrix.mtx.gz')
-
- # parse barcodes in a SEQUENCE:INDEX dictionary
- barcodes = dict(bc_split[1])
- writerr(
- f'Processing {len(barcodes)} barcodes from chunk {chunkn}',
- send=verbose
- )
-
- # get number of lines in mappings_file
- nlines = getlen(mappings_file)
-
- # initialize mappings dictionary {UMI: {FEATURE: COUNT}}
- maps = dict()
-
- # cell barcode placeholder
- cell = None
-
- with gzip.open(mappings_file, 'rb') as data, \
- gzip.open(matrix_file, 'wb') as mtxFile:
-
- if dumpec:
- ec_dump_file = os.path.join(tmpdir, f'{chunkn}_ec_dump.tsv.gz')
- ecdump = gzip.open(ec_dump_file, 'wb')
- else:
- ec_dump_file = None
-
- for line in enumerate(data, start=1):
- # gather barcode, umi and feature from mappings file
- cx, ux, te = line[1].decode('utf-8').strip().split('\t')
- if '~' in te:
- te = te[:te.index('~')]
-
- if len(barcodes)==0:
- # interrupt loop when reaching the end of the barcodes chunk
- break
-
- if not cell:
- # skip to the first cell barcode contained in the current
- # barcodes chunk
- if cx not in barcodes:
- continue
- else:
- cell = cx
-
- # if cell barcode changes, compute counts from previous cell's
- # mappings
- if cx != cell and cell in barcodes:
- cellidx = barcodes.pop(cell)
- writerr(
- f'[{chunkn}] Computing counts for cell barcode {cellidx} '
- '({cell})',
- send=verbose
- )
- # compute final counts of the cell
- ec_log, counts = cellCount(
- maps,
- intcount=intcount,
- dumpec=dumpec
- )
- # arrange counts in a data frame and write to text file
- lines = [f'{features[k]} {str(cellidx)} {str(v)}\n'.encode() \
- for k, v in counts.items()]
- mtxFile.writelines(lines)
- if dumpec:
- ec_log = [f'{str(cellidx)}\t{cell}\t{x}'.encode() \
- for x in ec_log]
- ecdump.writelines(ec_log)
- # re-initialize mappings dict
- maps = dict()
-
- # reassign cell to current barcode
- cell = cx
-
- # add features count to mappings dict
- if cx in barcodes:
- #teidx = features[te]
- if ux not in maps:
- # initialize UMI if not in mappings dict
- maps[ux] = dict()
- if te in maps[ux]:
- # initialize feature count for UMI
- maps[ux][te]+=1
- else:
- # add count to existing feature in UMI
- maps[ux][te]=1
-
- # if end of file is reached, compute counts from current cell's
- # mappings
- if line[0] == nlines and cell in barcodes:
- cellidx = barcodes.pop(cell)
- writerr(
- f'[{chunkn}] [file_end] Computing counts for cell '
- f'barcode {cellidx} ({cell})',
- send=verbose
- )
- # compute final counts of the cell
- ec_log, counts = cellCount(
- maps,
- intcount=intcount,
- dumpec=dumpec
- )
- # arrange counts in a data frame and write to text file
- lines = [f'{features[k]} {str(cellidx)} {str(v)}\n'.encode() \
- for k, v in counts.items()]
- mtxFile.writelines(lines)
- if dumpec:
- ec_log = [f'{str(cellidx)}\t{cell}\t{x}'.encode() \
- for x in ec_log]
- ecdump.writelines(ec_log)
- if dumpec:
- ecdump.close()
+ nBarcodes = getlen(barcodes_file)
+ with gzip.open(barcodes_file, 'rb') as f:
+ for i, chunk in enumerate(get_ranges(nBarcodes, n)):
+ yield i, {next(f).strip(): x+1 for x in chunk}
+
+def run_count(maps_file, features_index, tmpdir, dumpEC, verbose,
+ barcodes_set):
+ # NB: keep args order consistent with main.countFun
+ taskn, barcodes = barcodes_set
+ matrix_file = os.path.join(tmpdir, f'{taskn}_matrix.mtx.gz')
+ dump_file = os.path.join(tmpdir, f'{taskn}_EqCdump.tsv.gz')
+ with (gzip.open(matrix_file, 'wb') as f,
+ gzip.open(dump_file, 'wb') if dumpEC
+          else gzip.open(os.devnull, 'wb') as df):
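+        # when --dump-ec is off, df is a throwaway handle on os.devnull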
+ for cellbarcode, cellmaps in parse_maps(maps_file, features_index):
+ if cellbarcode not in barcodes:
+ continue
+ cellidx = barcodes[cellbarcode]
writerr(
- f'Equivalence Classes dump file written to {ec_dump_file}',
+ f'[{taskn}] Run count for cell '
+ f'{cellidx} ({cellbarcode.decode()})',
send=verbose
)
- writerr(f'Barcodes chunk {chunkn} written to {matrix_file}', send=verbose)
- return matrix_file, ec_dump_file
-
-# Concatenate matrices in a single MatrixMarket file with proper header
-def formatMM(matrix_files, outdir, features, barcodes):
+ cellcounts, dump = compute_cell_counts(
+ equivalence_classes=cellmaps,
+ features_index=features_index,
+ dumpEC=dumpEC
+ )
+ writerr(
+ f'[{taskn}] Write count for cell '
+ f'{cellidx} ({cellbarcode.decode()})',
+ send=verbose
+ )
+ # round counts to 3rd decimal point and write to matrix file
+ # only if count is at least 0.001
+ lines = [f'{feature} {cellidx} {round(count, 3)}\n'.encode()
+ for feature, count in cellcounts.items()
+ if count >= 0.001]
+ f.writelines(lines)
+ if dumpEC:
+ writerr(
+ f'[{taskn}] Write ECdump for cell '
+ f'{cellidx} ({cellbarcode.decode()})',
+ send=verbose
+ )
+ # reverse features index to get names back
+ findex = dict(zip(features_index.values(),
+ features_index.keys()))
+ dumplines = [
+ b'\t'.join(
+ [str(cellidx).encode(),
+ cellbarcode,
+ str(i).encode(),
+ umi,
+ b','.join([findex[f] for f in feats]),
+ str(count).encode(),
+ pumi,
+ b','.join([findex[f] for f in pfeats])]
+ ) + b'\n'
+ for i, (umi, feats, count, pumi, pfeats) in dump.items()
+ ]
+ df.writelines(dumplines)
+ return matrix_file, dump_file
+
+def formatMM(matrix_files, feature_index, barcodes_chunks, outdir):
if type(matrix_files) is str:
matrix_files = [matrix_files]
matrix_out = os.path.join(outdir, 'matrix.mtx.gz')
- features_count = len(features)
- barcodes_count = len(flatten([j for i,j in barcodes]))
+ features_count = len(feature_index)
+ barcodes_count = sum(len(x) for _, x in barcodes_chunks)
mmsize = sum(getlen(f) for f in matrix_files)
- mmheader = '%%MatrixMarket matrix coordinate real general\n'
- mmtotal = f'{features_count} {barcodes_count} {mmsize}\n'
+ mmheader = b'%%MatrixMarket matrix coordinate real general\n'
+ mmtotal = f'{features_count} {barcodes_count} {mmsize}\n'.encode()
with gzip.GzipFile(matrix_out, 'wb', mtime=0) as mmout:
- mmout.write(mmheader.encode())
- mmout.write(mmtotal.encode())
+ mmout.write(mmheader)
+ mmout.write(mmtotal)
mtxstr = ' '.join(matrix_files)
cmd = f'zcat {mtxstr} | LC_ALL=C sort -k2,2n -k1,1n | gzip >> {matrix_out}'
run_shell_cmd(cmd)
@@ -378,18 +314,17 @@ def writeEC(ecdump_files, outdir):
ecdump_out = os.path.join(outdir, 'ec_dump.tsv.gz')
ecdumpstr = ' '.join(ecdump_files)
header = '\t'.join([
- 'BC_index',
+ 'Barcode_id',
'Barcode',
- 'EC_index',
- 'EC_name',
- 'Raw_UMIs',
- 'Raw_count',
- 'Dedup_UMIs',
- 'Dedup_count',
- 'Filtered_TE'
+ 'EqClass',
+ 'UMI',
+ 'Features',
+ 'Read_count',
+ 'Dedup_UMI',
+ 'Dedup_feature'
]) + '\n'
with gzip.GzipFile(ecdump_out, 'wb', mtime=0) as f:
f.write(header.encode())
- cmd = f'zcat {ecdumpstr} | LC_ALL=C sort -k1,1n -k2 | gzip >> {ecdump_out}'
+ cmd = f'zcat {ecdumpstr} | LC_ALL=C sort -k1,1n -k3,3n | gzip >> {ecdump_out}'
run_shell_cmd(cmd)
- return ecdump_out
+    return ecdump_out
diff --git a/irescue/em.py b/irescue/em.py
new file mode 100644
index 0000000..39443df
--- /dev/null
+++ b/irescue/em.py
@@ -0,0 +1,49 @@
+import numpy as np
+
+def e_step(matrix, counts):
+ """
+ Performs E-step of EM algorithm: proportionally assigns reads to features
+ based on relative feature abundances.
+ """
+ colsums = (matrix * counts).sum(axis=1)[:, np.newaxis]
+ out = matrix / colsums * counts
+    return out
+
+def m_step(matrix):
+ """
+ Performs M-step of EM algorithm: calculates feature abundances from read
+ counts proportionally distributed to features.
+ """
+ counts = matrix.sum(axis=0) / matrix.sum()
+    return counts
+
+def run_em(matrix, cycles=100):
+ """
+ Run Expectation-Maximization (EM) algorithm to redistribute read counts
+ across a set of features.
+
+ Parameters
+ ----------
+ matrix : array
+ Reads-features compatibility matrix.
+ cycles : int, optional
+ Number of EM cycles.
+
+ Returns
+ -------
+ out : list
+ Optimized relative feature abundances.
+ """
+
+ # calculate initial estimation of relative abundance.
+ # (let the sum of counts of features be 1,
+ # will be multiplied by the real UMI count later)
+ nFeatures = matrix.shape[1]
+ counts = np.array([1 / nFeatures] * nFeatures)
+
+ # run EM for n cycles
+ for _ in range(cycles):
+ e_matrix = e_step(matrix=matrix, counts=counts)
+ counts = m_step(matrix=e_matrix)
+
+    return counts
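+
+# Hypothetical usage sketch: with one UMI compatible with both features and
+# one unique to the second, abundances converge toward the unique feature:
+#   run_em(np.array([[1, 1], [0, 1]]))  # -> approximately [0., 1.]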
diff --git a/irescue/main.py b/irescue/main.py
index 179353d..13f1089 100644
--- a/irescue/main.py
+++ b/irescue/main.py
@@ -3,11 +3,11 @@
from irescue._version import __version__
from irescue._genomes import __genomes__
from irescue.misc import writerr, versiontuple, run_shell_cmd
-from irescue.misc import check_requirement, check_arguments, check_tags
+from irescue.misc import check_requirement, check_tags
from irescue.map import makeRmsk, getRefs, prepare_whitelist, isec, chrcat
from irescue.map import checkIndex
-from irescue.count import split_bc, parse_features, count, formatMM, writeEC
-import argparse, os
+from irescue.count import split_barcodes, index_features, run_count, formatMM, writeEC
+import argparse, os, sys
from multiprocessing import Pool
from functools import partial
from shutil import rmtree
@@ -22,101 +22,90 @@ def parseArguments():
" in scRNA-seq.",
epilog="Home page: https://github.com/bodegalab/irescue"
)
- parser.add_argument('-b', '--bam',
- required=True,
- metavar='FILE',
+ parser.add_argument('-b', '--bam', required=True, metavar='FILE',
help="scRNA-seq reads aligned to a reference genome "
"(required).")
- parser.add_argument('-r', '--regions',
- metavar='FILE',
+ parser.add_argument('-r', '--regions', metavar='FILE',
help="Genomic TE coordinates in bed format. "
"Takes priority over --genome (default: %(default)s).")
- parser.add_argument('-g', '--genome',
- metavar='STR',
+ parser.add_argument('-g', '--genome', metavar='STR',
+ choices=__genomes__.keys(),
help="Genome assembly symbol. One of: {} (default: "
"%(default)s).".format(', '.join(__genomes__)))
- parser.add_argument('-w', '--whitelist',
- metavar='FILE',
+ parser.add_argument('-w', '--whitelist', metavar='FILE',
help="Text file of filtered cell barcodes by e.g. "
"Cell Ranger, STARSolo or your gene expression "
"quantifier of choice (Recommended. "
"default: %(default)s).")
- parser.add_argument('-cb', '--CBtag',
- default='CB',
- metavar='STR',
+ parser.add_argument('-c', '--cb-tag', default='CB', metavar='STR',
help="BAM tag containing the cell barcode sequence "
"(default: %(default)s).")
- parser.add_argument('-umi', '--UMItag',
- default='UR',
- metavar='STR',
+ parser.add_argument('-u', '--umi-tag', default='UR', metavar='STR',
help="BAM tag containing the UMI sequence "
"(default: %(default)s).")
- parser.add_argument('-p', '--threads',
- type=int,
- default=1,
- metavar='CPUS ',
+ parser.add_argument('-p', '--threads', type=int, default=1, metavar='CPUS',
help="Number of cpus to use (default: %(default)s).")
- parser.add_argument('-o', '--outdir',
- default='IRescue_out',
- metavar='DIR',
+ parser.add_argument('-o', '--outdir', default='irescue_out', metavar='DIR',
help="Output directory name (default: %(default)s).")
- parser.add_argument('--min-bp-overlap',
- type=int,
- metavar='INT',
+ parser.add_argument('--min-bp-overlap', type=int, metavar='INT',
help="Minimum overlap between read and TE as number "
"of nucleotides (Default: disabled).")
- parser.add_argument('--min-fraction-overlap',
- type=float,
- metavar='FLOAT',
+ parser.add_argument('--min-fraction-overlap', type=float, metavar='FLOAT',
+ choices=[x/100 for x in range(101)],
help="Minimum overlap between read and TE"
" as a fraction of read's alignment"
" (i.e. 0.00 <= NUM <= 1.00) (Default: disabled).")
- parser.add_argument('--dumpEC',
- action='store_true',
+ parser.add_argument('--dump-ec', action='store_true',
help="Write a description log file of Equivalence "
"Classes.")
- parser.add_argument('--integers',
- action='store_true',
+ parser.add_argument('--integers', action='store_true',
help="Use if integers count are needed for "
"downstream analysis.")
- parser.add_argument('--samtools',
- default='samtools',
- metavar='PATH',
+ parser.add_argument('--samtools', default='samtools', metavar='PATH',
help="Path to samtools binary, in case it's not in "
"PATH (Default: %(default)s).")
- parser.add_argument('--bedtools',
- default='bedtools',
- metavar='PATH',
+ parser.add_argument('--bedtools', default='bedtools', metavar='PATH',
help="Path to bedtools binary, in case it's not in "
"PATH (Default: %(default)s).")
- parser.add_argument('--no-tags-check',
- action='store_true',
+ parser.add_argument('--no-tags-check', action='store_true',
help="Suppress checking for CBtag and UMItag "
"presence in bam file.")
- parser.add_argument('--keeptmp',
- action='store_true',
- help="Keep temporary files.")
- parser.add_argument('--tmpdir',
- default='IRescue_tmp',
- metavar='DIR',
- help="Directory to store temporary files "
- "(default: %(default)s).")
- parser.add_argument('-v', '--verbose',
- action='store_true',
+ parser.add_argument('--keeptmp', action='store_true',
+ help="Keep temporary files under /tmp.")
+ parser.add_argument('-v', '--verbose', action='store_true',
help="Writes a lot of stuff to stderr, such as "
"chromosomes as they are mapped and cell barcodes "
"as they are processed.")
- parser.add_argument('-V', '--version',
- action='version',
+ parser.add_argument('-V', '--version', action='version',
version='%(prog)s {}'.format(__version__),
help="Print software's version and exit.")
return parser
def main():
+
+ # Parse and print arguments
parser = parseArguments()
- args = parser.parse_args()
- args = check_arguments(args)
+ args = parser.parse_args(args=None if sys.argv[1:] else ['--help'])
+ argstr = '\n'.join(f' {k}: {v}' for k, v in args.__dict__.items())
+ sys.stderr.write(f" IRescue version {__version__}\n{argstr}\n")
+
+ dirs = {
+ 'out': args.outdir,
+ 'tmp': os.path.join(args.outdir, 'tmp'),
+ 'mex': os.path.join(args.outdir, 'counts')
+ }
+
+    #####################
+    # Preliminary steps #
+    #####################
+
+ writerr("Running preliminary checks.")
# Check requirements
check_requirement(
@@ -134,28 +123,33 @@ def main():
# Check if the selected cell barcode and UMI tags are present in bam file.
if not args.no_tags_check:
- check_tags(bamFile=args.bam, CBtag=args.CBtag, UMItag=args.UMItag,
+ check_tags(bamFile=args.bam, CBtag=args.cb_tag, UMItag=args.umi_tag,
nLines=999999, exit_with_error=True, verbose=args.verbose)
# Check for bam index file. If not present, will build an index.
checkIndex(args.bam, verbose=args.verbose)
-
- writerr('IRescue job starts')
-
+
# create directories
- os.makedirs(args.tmpdir, exist_ok=True)
- os.makedirs(args.outdir, exist_ok=True)
+ for v in dirs.values():
+ os.makedirs(v, exist_ok=True)
+
+
+ ###########
+ # Mapping #
+ ###########
+
+ writerr("Running mapping step.")
# set regions object (provided or downloaded bed file)
regions = makeRmsk(regions=args.regions, genome=args.genome,
- genomes=__genomes__, tmpdir=args.tmpdir,
+ genomes=__genomes__, tmpdir=dirs['tmp'],
outname='rmsk.bed')
# get list of reference names from bam
chrNames = getRefs(args.bam, regions)
# decompress whitelist if compressed
- whitelist = prepare_whitelist(args.whitelist, args.tmpdir)
+ whitelist = prepare_whitelist(args.whitelist, dirs['tmp'])
# Allocate threads
if args.threads > 1:
@@ -168,8 +162,8 @@ def main():
send=args.verbose
)
isecFun = partial(
- isec, args.bam, regions, whitelist, args.CBtag, args.UMItag,
- args.min_bp_overlap, args.min_fraction_overlap, args.tmpdir,
+ isec, args.bam, regions, whitelist, args.cb_tag, args.umi_tag,
+ args.min_bp_overlap, args.min_fraction_overlap, dirs['tmp'],
args.samtools, args.bedtools, args.verbose
)
if args.threads > 1:
@@ -179,20 +173,27 @@ def main():
# concatenate intersection results
mappings_file, barcodes_file, features_file = chrcat(
- isecFiles, threads=args.threads, outdir=args.outdir,
- tmpdir=args.tmpdir, verbose=args.verbose
+ isecFiles, threads=args.threads, outdir=dirs['mex'],
+ tmpdir=dirs['tmp'], bedtools=args.bedtools, verbose=args.verbose
)
+
+ #########
+ # Count #
+ #########
+
+ writerr("Running count step.")
+
# calculate number of mappings per process
- bc_per_thread = list(split_bc(barcodes_file, args.threads))
+ bc_per_thread = list(split_barcodes(barcodes_file, args.threads))
# parse features
- ftlist = dict(parse_features(features_file))
+ feature_index = index_features(features_file)
# calculate TE counts
countFun = partial(
- count, mappings_file, args.outdir, args.tmpdir, ftlist, args.integers,
- args.dumpEC, args.verbose
+ run_count, mappings_file, feature_index, dirs['tmp'],
+ args.dump_ec, args.verbose
)
if args.threads > 1:
mtxFiles = pool.map(countFun, bc_per_thread)
@@ -208,16 +209,15 @@ def main():
matrix_files = [ i for i, j in mtxFiles]
ecdump_files = [ j for i, j in mtxFiles]
matrix_file = formatMM(
- matrix_files, outdir=args.outdir, features=ftlist,
- barcodes=bc_per_thread
+ matrix_files, feature_index, bc_per_thread, dirs['mex']
)
writerr(f'Writing sparse matrix to {matrix_file}')
- if args.dumpEC:
- ecdump_file = writeEC(ecdump_files, outdir=args.outdir)
+ if args.dump_ec:
+ ecdump_file = writeEC(ecdump_files, outdir=dirs['out'])
writerr(f'Writing Equivalence Classes to {ecdump_file}')
if not args.keeptmp:
writerr(f'Cleaning up temporary files.', send=args.verbose)
- rmtree(args.tmpdir)
+ rmtree(dirs['tmp'])
writerr('Done.')
diff --git a/irescue/map.py b/irescue/map.py
index f82f9ea..9777af9 100644
--- a/irescue/map.py
+++ b/irescue/map.py
@@ -57,12 +57,6 @@ def makeRmsk(regions, genome, genomes, tmpdir, outname):
# if no repeatmasker file is provided, and a genome assembly name is
# provided, download and prepare a rmsk.bed file
elif genome:
- if not genome in genomes:
- writerr(
- "ERROR: Genome assembly name shouldbe one of: "
- f"{', '.join(genomes.keys())}",
- error=True
- )
url, header_lines = genomes[genome]
writerr(
"Downloading and parsing RepeatMasker annotation for "
@@ -101,7 +95,7 @@ def makeRmsk(regions, genome, genomes, tmpdir, outname):
if famclass.split('/')[0] in fams_to_skip:
continue
# concatenate family and class with subfamily
- subfamily += '~' + famclass
+ subfamily += '#' + famclass
score = lst[0]
chr, start, end = lst[4:7]
# make coordinates 0-based
@@ -172,7 +166,7 @@ def isec(bamFile, bedFile, whitelist, CBtag, UMItag, bpOverlap, fracOverlap,
os.makedirs(isecdir, exist_ok=True)
refFile = os.path.join(refdir, chrom + '.bed.gz')
- isecFile = os.path.join(isecdir, chrom + '.isec.bed.gz')
+ isecFile = os.path.join(isecdir, chrom + '.isec.txt.gz')
# split bed file by chromosome
sort = 'LC_ALL=C sort -k1,1 -k2,2n --buffer-size=1G'
@@ -210,8 +204,8 @@ def isec(bamFile, bedFile, whitelist, CBtag, UMItag, bpOverlap, fracOverlap,
# remove mate information from read name
cmd += ' { sub(/\/[12]$/,"",$4); '
# concatenate CB and UMI with feature name
- cmd += ' n=split($4,qname,/\//); $4=qname[n-1]"\\t"qname[n]"\\t"$16; '
- cmd += ' print $4 }\' '
+ cmd += ' n=split($4,qname,/\//); '
+ cmd += ' print qname[n-1]"\\t"qname[n]"\\t"qname[1]"\\t"$16 }\' '
cmd += f' | gzip > {isecFile}'
writerr(f'Extracting {chrom} reference', send=verbose)
@@ -223,24 +217,34 @@ def isec(bamFile, bedFile, whitelist, CBtag, UMItag, bpOverlap, fracOverlap,
return isecFile
# Concatenate and sort data obtained from isec()
-def chrcat(filesList, threads, outdir, tmpdir, verbose):
+def chrcat(filesList, threads, outdir, tmpdir, bedtools, verbose):
os.makedirs(outdir, exist_ok=True)
- mappings_file = os.path.join(tmpdir, 'cb_umi_te.bed.gz')
+ mappings_file = os.path.join(tmpdir, 'mappings.tsv.gz')
barcodes_file = os.path.join(outdir, 'barcodes.tsv.gz')
features_file = os.path.join(outdir, 'features.tsv.gz')
bedFiles = ' '.join(filesList)
- cmd0 = f'zcat {bedFiles} '
- cmd0 += f' | LC_ALL=C sort --parallel {threads} --buffer-size 2G '
- cmd0 += f' | gzip > {mappings_file} '
+ sort_threads = int(threads / 2 - 1)
+ sort_threads = sort_threads if sort_threads>0 else 1
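+    # split the available threads between the two sort invocations below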
+
+ # sort and summarize UMI-READ-TE mappings
+ sort_res = f'--parallel {sort_threads} --buffer-size 2G'
+ cmd0 = f'zcat {bedFiles}'
+ # input: "CB UMI READ FEAT"
+ cmd0 += f' | LC_ALL=C sort -u {sort_res}'
+ cmd0 += f' | {bedtools} groupby -g 1,2,3 -c 4 -o distinct'
+ # result: "CB UMI READ FEATs"
+ cmd0 += f' | LC_ALL=C sort -k1,2 -k4,4 {sort_res}'
+ cmd0 += f' | {bedtools} groupby -g 1,2,4 -c 3 -o count_distinct'
+ # result: "CB UMI FEATs count"
+ cmd0 += f' | gzip > {mappings_file}'
+
+ # write barcodes.tsv.gz file
cmd1 = f'zcat {mappings_file} | cut -f1 | uniq | gzip > {barcodes_file} '
+
+ # write features.tsv.gz file
cmd2 = f'zcat {mappings_file} '
- cmd2 += ' | gawk \'!x[$3]++ { '
- cmd2 += ' split($3,a,"~"); '
- # avoid subfamilies with the same name
- cmd2 += ' if(a[1] in sf) { sf[a[1]]+=1 } else { sf[a[1]] }; '
- cmd2 += ' if(length(a)<2) { a[2]=a[1] }; '
- cmd2 += ' print a[1] sf[a[1]] "\\t" a[2] "\\tGene Expression" '
- cmd2 += ' }\' '
+ cmd2 += ' | cut -f3 | sed \'s/,/\\n/g\' | gawk \'!x[$1]++ { '
+ cmd2 += ' print $1"\\t"gensub(/#.+/,"",1,$1)"\\tGene Expression" }\' '
cmd2 += f' | LC_ALL=C sort -u | gzip > {features_file} '
writerr('Concatenating mappings', send=verbose)
diff --git a/irescue/misc.py b/irescue/misc.py
index 7a49285..aea6b03 100644
--- a/irescue/misc.py
+++ b/irescue/misc.py
@@ -37,18 +37,6 @@ def versiontuple(version):
"""
return tuple(map(int, version.split('.')))
-def check_arguments(args):
- """
- Check validity of arguments.
- """
- if isinstance(args.min_fraction_overlap, (int, float)):
- if 0 <= args.min_fraction_overlap <= 1:
- pass
- else:
- writerr("ERROR: --min-fraction-overlap must be a floating point "
- "number between 0 and 1.", error=True)
- return args
-
def check_requirement(cmd, required_version, parser, verbose):
"""
Check if the required version for a software has been installed.
@@ -94,7 +82,7 @@ def writerr(msg, error=False, send=True):
Decides if the message should be sent (useful for verbose messages).
"""
if send:
- timelog = datetime.now().strftime("%m/%d/%Y - %H:%M:%S")
+ timelog = datetime.now().strftime("%Y/%m/%d - %H:%M:%S")
message = f'[{timelog}] '
if not msg[-1]=='\n':
msg += '\n'
@@ -143,12 +131,6 @@ def getlen(file):
f.close()
return out
-def flatten(x):
- """
- Flatten a list of sublists.
- """
- return [item for sublist in x for item in sublist]
-
def check_tags(
bamFile, CBtag, UMItag,
nLines=None, exit_with_error=True, verbose=False
@@ -216,22 +198,15 @@ def check_tags(
else:
return(False)
-def iupac_nt_code(nts):
- """
- Return the IUPAC code correspondent to a set of input nucleotides.
- """
- codes = {
- 'R': {'A', 'G'},
- 'Y': {'C', 'T'},
- 'S': {'G', 'C'},
- 'W': {'A', 'T'},
- 'K': {'G', 'T'},
- 'M': {'A', 'C'},
- 'B': {'C', 'G', 'T'},
- 'D': {'A', 'G', 'T'},
- 'H': {'A', 'C', 'T'},
- 'V': {'A', 'C', 'G'},
- 'N': {'A', 'C', 'G', 'T'}
- }
- out = [k for k, v in codes.items() if v == set(nts)][0]
- return out
+def get_ranges(num, div):
+    """
+    Yields div contiguous ranges that together cover range(num),
+    e.g. get_ranges(10, 3) -> range(0, 3), range(3, 6), range(6, 10).
+    """
+    # guard against num < div, which would make the step size zero
+    split = max(1, int(num / div))
+    for i in range(0, num, split):
+        j = i + split
+        if j > num - split:
+            j = num
+            yield range(i, j)
+            break
+        yield range(i, j)
diff --git a/irescue/network.py b/irescue/network.py
new file mode 100644
index 0000000..f274e3a
--- /dev/null
+++ b/irescue/network.py
@@ -0,0 +1,69 @@
+#!/usr/bin/env python
+
+# NB: This module includes partly modified third-party code distributed
+# under the license below.
+
+##############################################################################
+# The MIT License (MIT)
+
+# Copyright (c) 2015 CGAT
+
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+##############################################################################
+
+from collections import defaultdict
+
+def get_substr_slices(umi_length, idx_size):
+ '''
+ Create slices to split a UMI into approximately equal size substrings
+    Returns a list of tuples that can be passed to the slice function
+ '''
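+    # e.g. umi_length=10, idx_size=2 -> [(0, 5), (5, 10)]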
+ cs, r = divmod(umi_length, idx_size)
+ sub_sizes = [cs + 1] * r + [cs] * (idx_size - r)
+ offset = 0
+ slices = []
+ for s in sub_sizes:
+ slices.append((offset, offset + s))
+ offset += s
+ return slices
+
+def build_substr_idx(equivalence_classes, length, threshold):
+ '''
+ Group equivalence classes into subgroups having a common substring
+ '''
+ slices = get_substr_slices(length, threshold+1)
+ substr_idx = {k: defaultdict(set) for k in slices}
+ for idx in slices:
+ for ec in equivalence_classes:
+ sub = ec.umi[slice(*idx)]
+ substr_idx[idx][sub].add(ec)
+ return substr_idx
+
+def gen_ec_pairs(equivalence_classes, substr_idx):
+ '''
+    Yields pairs of equivalence classes using the index from build_substr_idx()
+ '''
+ for i, ec in enumerate(equivalence_classes, start=1):
+ neighbours = set()
+ for idx, substr_map in substr_idx.items():
+ sub = ec.umi[slice(*idx)]
+ neighbours = neighbours.union(substr_map[sub])
+ neighbours.difference_update(equivalence_classes[:i])
+ for nbr in neighbours:
+            yield ec, nbr
diff --git a/pyproject.toml b/pyproject.toml
index 123ca38..9f2a6c8 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -32,6 +32,7 @@ dependencies = [
"numpy >= 1.20.2",
"pysam >= 0.16.0.1",
"requests >= 2.27.1",
+ "networkx >= 3.1",
]
dynamic = ["version"]
diff --git a/tests/data/rmsk.bed.gz b/tests/data/rmsk.bed.gz
index d032937..422df4a 100644
Binary files a/tests/data/rmsk.bed.gz and b/tests/data/rmsk.bed.gz differ
diff --git a/tests/test.yml b/tests/test.yml
index 291c46d..4e55120 100644
--- a/tests/test.yml
+++ b/tests/test.yml
@@ -1,48 +1,48 @@
- name: base
command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v
files:
- - path: "IRescue_out/barcodes.tsv.gz"
+ - path: "irescue_out/counts/barcodes.tsv.gz"
md5sum: 1a74fa12e65ac1703bbe61282854f151
- - path: "IRescue_out/features.tsv.gz"
- md5sum: e8bf21611afd1f40d722ed985f4e3392
- - path: "IRescue_out/matrix.mtx.gz"
- md5sum: 04ddbd538c796f019f37d3048b159a2f
+ - path: "irescue_out/counts/features.tsv.gz"
+ md5sum: ae84bc368a289e070b754030a65d69b4
+ - path: "irescue_out/counts/matrix.mtx.gz"
+ md5sum: ca147b42af250be7c47c4a748693ca97
- name: genome
tags:
- genome
command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -g test --keeptmp -v
files:
- - path: "IRescue_out/barcodes.tsv.gz"
+ - path: "irescue_out/counts/barcodes.tsv.gz"
md5sum: 1a74fa12e65ac1703bbe61282854f151
- - path: "IRescue_out/features.tsv.gz"
- md5sum: e8bf21611afd1f40d722ed985f4e3392
- - path: "IRescue_out/matrix.mtx.gz"
- md5sum: 04ddbd538c796f019f37d3048b159a2f
+ - path: "irescue_out/counts/features.tsv.gz"
+ md5sum: ae84bc368a289e070b754030a65d69b4
+ - path: "irescue_out/counts/matrix.mtx.gz"
+ md5sum: ca147b42af250be7c47c4a748693ca97
- name: multi
tags:
- multi
command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz -p 2 --keeptmp -v
files:
- - path: "IRescue_out/barcodes.tsv.gz"
+ - path: "irescue_out/counts/barcodes.tsv.gz"
md5sum: 1a74fa12e65ac1703bbe61282854f151
- - path: "IRescue_out/features.tsv.gz"
- md5sum: e8bf21611afd1f40d722ed985f4e3392
- - path: "IRescue_out/matrix.mtx.gz"
- md5sum: 04ddbd538c796f019f37d3048b159a2f
+ - path: "irescue_out/counts/features.tsv.gz"
+ md5sum: ae84bc368a289e070b754030a65d69b4
+ - path: "irescue_out/counts/matrix.mtx.gz"
+ md5sum: ca147b42af250be7c47c4a748693ca97
- name: whitelist
tags:
- whitelist
command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz -w ./tests/data/whitelist.txt --keeptmp -v
files:
- - path: "IRescue_out/barcodes.tsv.gz"
+ - path: "irescue_out/counts/barcodes.tsv.gz"
md5sum: 95dccc15cbee4feeeae2fbce4d7b41ad
- - path: "IRescue_out/features.tsv.gz"
- md5sum: 2dcec6f4aead5faba9c1af44b0129b55
- - path: "IRescue_out/matrix.mtx.gz"
- md5sum: 85c61d1df6ccadf83eafc6bc36a21c89
+ - path: "irescue_out/counts/features.tsv.gz"
+ md5sum: 65fb8381a658a4eb4e5d0a575c67818d
+ - path: "irescue_out/counts/matrix.mtx.gz"
+ md5sum: d4f60bc056ea189c7473a3624f3c2970
- name: multi whitelist
tags:
@@ -50,65 +50,65 @@
- whitelist
command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz -w ./tests/data/whitelist.txt --keeptmp -v -p 2
files:
- - path: "IRescue_out/barcodes.tsv.gz"
+ - path: "irescue_out/counts/barcodes.tsv.gz"
md5sum: 95dccc15cbee4feeeae2fbce4d7b41ad
- - path: "IRescue_out/features.tsv.gz"
- md5sum: 2dcec6f4aead5faba9c1af44b0129b55
- - path: "IRescue_out/matrix.mtx.gz"
- md5sum: 85c61d1df6ccadf83eafc6bc36a21c89
+ - path: "irescue_out/counts/features.tsv.gz"
+ md5sum: 65fb8381a658a4eb4e5d0a575c67818d
+ - path: "irescue_out/counts/matrix.mtx.gz"
+ md5sum: d4f60bc056ea189c7473a3624f3c2970
- name: ecdump
tags:
- ecdump
- command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v --dumpEC
+ command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v --dump-ec
files:
- - path: "IRescue_out/barcodes.tsv.gz"
+ - path: "irescue_out/counts/barcodes.tsv.gz"
md5sum: 1a74fa12e65ac1703bbe61282854f151
- - path: "IRescue_out/features.tsv.gz"
- md5sum: e8bf21611afd1f40d722ed985f4e3392
- - path: "IRescue_out/matrix.mtx.gz"
- md5sum: 04ddbd538c796f019f37d3048b159a2f
- - path: "IRescue_out/ec_dump.tsv.gz"
- md5sum: 2fbcb954fb48065c6b67a84001b6bc34
+ - path: "irescue_out/counts/features.tsv.gz"
+ md5sum: ae84bc368a289e070b754030a65d69b4
+ - path: "irescue_out/counts/matrix.mtx.gz"
+ md5sum: ca147b42af250be7c47c4a748693ca97
+ - path: "irescue_out/ec_dump.tsv.gz"
+ md5sum: d71ee82b25107d4e104d313efb4be134
- name: multi ecdump
tags:
- multi
- ecdump
- command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v -p 2 --dumpEC
+ command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v -p 2 --dump-ec
files:
- - path: "IRescue_out/barcodes.tsv.gz"
+ - path: "irescue_out/counts/barcodes.tsv.gz"
md5sum: 1a74fa12e65ac1703bbe61282854f151
- - path: "IRescue_out/features.tsv.gz"
- md5sum: e8bf21611afd1f40d722ed985f4e3392
- - path: "IRescue_out/matrix.mtx.gz"
- md5sum: 04ddbd538c796f019f37d3048b159a2f
- - path: "IRescue_out/ec_dump.tsv.gz"
- md5sum: 2fbcb954fb48065c6b67a84001b6bc34
+ - path: "irescue_out/counts/features.tsv.gz"
+ md5sum: ae84bc368a289e070b754030a65d69b4
+ - path: "irescue_out/counts/matrix.mtx.gz"
+ md5sum: ca147b42af250be7c47c4a748693ca97
+ - path: "irescue_out/ec_dump.tsv.gz"
+ md5sum: d71ee82b25107d4e104d313efb4be134
- name: bp
tags:
- bp
command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v --min-bp-overlap 10
files:
- - path: "IRescue_out/barcodes.tsv.gz"
+ - path: "irescue_out/counts/barcodes.tsv.gz"
md5sum: 7433e88e94aec2f16a20459275188f1f
- - path: "IRescue_out/features.tsv.gz"
- md5sum: 12ff16aee1a5e9847ed96534b3764d13
- - path: "IRescue_out/matrix.mtx.gz"
- md5sum: 39b3ee6dbffd61a68569b3b30dcaf972
+ - path: "irescue_out/counts/features.tsv.gz"
+ md5sum: 434ff68c92d1b8dd718269a1cd974f99
+ - path: "irescue_out/counts/matrix.mtx.gz"
+ md5sum: 30fe31ed8976bd002d86bcd956d25855
- name: fraction
tags:
- fraction
command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v --min-fraction-overlap 0.5
files:
- - path: "IRescue_out/barcodes.tsv.gz"
+ - path: "irescue_out/counts/barcodes.tsv.gz"
md5sum: 4de44d3e4a851392a48ccabfee5bb6fc
- - path: "IRescue_out/features.tsv.gz"
- md5sum: 6ee6ded0563e8e138fb1d5c958cedeee
- - path: "IRescue_out/matrix.mtx.gz"
- md5sum: 345b8aff9c00f607ea5a305bed569653
+ - path: "irescue_out/counts/features.tsv.gz"
+ md5sum: 927d00f20e4e65b8d46e761d406b69ff
+ - path: "irescue_out/counts/matrix.mtx.gz"
+ md5sum: 85b8e8ae7696ff12ffa7e5ac86600fa1
- name: bp fraction
tags:
@@ -116,9 +116,9 @@
- fraction
command: irescue -b ./tests/data/Aligned.sortedByCoord.out.bam -r ./tests/data/rmsk.bed.gz --keeptmp -v --min-bp-overlap 10 --min-fraction-overlap 0.5
files:
- - path: "IRescue_out/barcodes.tsv.gz"
+ - path: "irescue_out/counts/barcodes.tsv.gz"
md5sum: 4de44d3e4a851392a48ccabfee5bb6fc
- - path: "IRescue_out/features.tsv.gz"
- md5sum: c95db95604d1731d2908f08eeaf8ded1
- - path: "IRescue_out/matrix.mtx.gz"
- md5sum: 40fd2a331d7328d4a4a5428307f8adaf
+ - path: "irescue_out/counts/features.tsv.gz"
+ md5sum: f304e63657f73eeec0edffed68490b6c
+ - path: "irescue_out/counts/matrix.mtx.gz"
+ md5sum: 336e5a5edfad998bc7d64cf0e68cc897