BarcodEX

BarcodEX is a tool for extracting Unique Molecular Identifiers (UMIs) from single or paired-end read sequences. It can handle UMIs inline with reads or located in separate FASTQ files. This is a reimplementation of the main package in Rust for improved performance when handling large datasets.

Installation

Obtain Rust for your system. Then invoke:

cargo build --release
sudo cp target/release/barcodex-rs /usr/local/bin

You can copy the binary to any installation directory you choose.

Running

BarcodEX requires gzipped FASTQ files as input. It can operate in two modes: inline, where the UMIs and sequence are mixed in the same file, or separate, where the UMIs and sequence are in separate files.

| Scenario | Arguments |
| --- | --- |
| 1 read with an embedded UMI | `inline --pattern1` pattern `--r1_in` fastq |
| 2 reads each with an embedded UMI | `inline --pattern1` pattern `--r1_in` fastq `--pattern2` pattern `--r2_in` fastq |
| 1 data read and 1 UMI read | `separate --r1_in` fastq `--ru_in` fastq |
| 2 data reads and 1 UMI read | `separate --r1_in` fastq `--r2_in` fastq `--ru_in` fastq |

In every case, --prefix must be specified; it sets the start of the output file names. For example, --prefix x will result in x_R1.fastq.gz, x_R1.discarded.fastq.gz, x_R1.extracted.fastq.gz, x_UMI_counts.json, and x_extraction_metrics.json.
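As a small illustration of the naming rule, the sketch below builds the file names listed in the example for a given prefix. The helper `expected_outputs` is hypothetical and not part of BarcodEX. It covers only the names shown above for a single read, since the names produced for a second read are not listed here.

```rust
/// Build the output file names listed in the README example for `--prefix`.
/// Hypothetical helper: it only covers the single-read names shown above.
fn expected_outputs(prefix: &str) -> Vec<String> {
    [
        "R1.fastq.gz",
        "R1.discarded.fastq.gz",
        "R1.extracted.fastq.gz",
        "UMI_counts.json",
        "extraction_metrics.json",
    ]
    .iter()
    // The prefix and the fixed suffix are joined with an underscore.
    .map(|suffix| format!("{prefix}_{suffix}"))
    .collect()
}
```

A helper like this could be used in a wrapper script or test to check that a run produced every expected file.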

UMIs are only accepted if they match an allow list provided with --umilist. The list is a text file with one UMI per line. In the case of 2 reads with embedded UMIs, the two parts of the UMI must be on separate lines, each optionally followed by the read number it applies to. So, AAA allows AAA on either read 1 or read 2, while CCC 2 allows CCC only on read 2. It is also possible to write AAA 1 2 on one line, or AAA 1 and AAA 2 on separate lines, if desired.
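The allow-list format can be sketched as a small parser. This is an illustration only, not BarcodEX's actual code: the names `AllowedReads` and `parse_umi_list` are hypothetical, and skipping blank lines and rejecting read numbers other than 1 and 2 are assumptions.

```rust
use std::collections::HashMap;

/// Which reads an allow-listed UMI may appear on.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
struct AllowedReads {
    read1: bool,
    read2: bool,
}

/// Parse allow-list text: one UMI per line, optionally followed by read numbers.
/// Hypothetical helper; the real parser may treat errors differently.
fn parse_umi_list(text: &str) -> Result<HashMap<String, AllowedReads>, String> {
    let mut allowed: HashMap<String, AllowedReads> = HashMap::new();
    for (idx, line) in text.lines().enumerate() {
        let mut fields = line.split_whitespace();
        // Assumption: blank lines are skipped.
        let Some(umi) = fields.next() else { continue };
        let entry = allowed.entry(umi.to_string()).or_default();
        let mut saw_read_number = false;
        for field in fields {
            saw_read_number = true;
            match field {
                "1" => entry.read1 = true,
                "2" => entry.read2 = true,
                other => {
                    return Err(format!("line {}: unexpected read number {:?}", idx + 1, other))
                }
            }
        }
        // A UMI with no read number is allowed on either read.
        if !saw_read_number {
            entry.read1 = true;
            entry.read2 = true;
        }
    }
    Ok(allowed)
}
```

In this sketch, repeated lines for the same UMI are merged, so AAA 1 followed by AAA 2 gives the same result as AAA 1 2.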

The UMI is placed in the read headers of the output files, separated by the value given with --separator, or by _ if unspecified.

UMI Patterns

There are two ways to specify a pattern for the UMI: a nucleotide sequence or a regular expression.

A nucleotide sequence is two or more Ns, optionally followed by a spacer sequence. For instance, the pattern NNNNN extracts the first 5 nucleotides from the read as the UMI, whereas the pattern NNNNNATCG extracts the first 9 nucleotides, using the first 5 as the UMI and checking that the next 4 match the spacer ATCG.
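The nucleotide pattern rule can be sketched in a few lines of Rust. This is a minimal illustration, not the tool's implementation: `NucleotidePattern`, `parse_pattern`, and `extract` are hypothetical names, and restricting the spacer to A, C, G, and T is an assumption.

```rust
/// A nucleotide pattern split into its UMI length and spacer.
#[derive(Debug, PartialEq, Eq)]
struct NucleotidePattern {
    umi_len: usize,
    spacer: String,
}

/// Split a pattern such as "NNNNNATCG" into 5 UMI bases and the spacer "ATCG".
fn parse_pattern(pattern: &str) -> Option<NucleotidePattern> {
    // Count the leading Ns; these positions form the UMI.
    let umi_len = pattern.bytes().take_while(|&b| b == b'N').count();
    let spacer = &pattern[umi_len..];
    // Assumption: the spacer may only contain A, C, G, or T.
    let spacer_ok = spacer.bytes().all(|b| matches!(b, b'A' | b'C' | b'G' | b'T'));
    if umi_len < 2 || !spacer_ok {
        return None;
    }
    Some(NucleotidePattern { umi_len, spacer: spacer.to_string() })
}

/// Apply the pattern to the start of a read.
/// Returns (UMI, remaining sequence), or None if the read is too short
/// or the bases after the UMI do not match the spacer.
fn extract<'a>(pattern: &NucleotidePattern, read: &'a str) -> Option<(&'a str, &'a str)> {
    let umi = read.get(..pattern.umi_len)?;
    let rest = read
        .get(pattern.umi_len..)?
        .strip_prefix(pattern.spacer.as_str())?;
    Some((umi, rest))
}
```

With the pattern NNNNNATCG, the read ACGTAATCGGGCC would yield the UMI ACGTA and the remaining sequence GGCC in this sketch, while a read whose bases 6 to 9 are not ATCG would be rejected.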

Regular expressions allow more flexibility for extracting UMIs, in particular UMIs with a complex design and UMIs that do not start at the beginning of the read. A good introduction to regular expressions can be found in the Regular Expression HOWTO.

Sequences are extracted from the read using named groups within the regex. Groups whose names start with umi are used for the UMI, and groups whose names start with discard are matched but not included in the output. Group names must be unique, so suffixing them with _ and a number (for example, umi_1) is recommended.

For example, this expression extracts a 3 bp UMI followed by a TT spacer that is removed from the read and discarded:

 (?<umi>.{3})(?<discard>T{2})

Any sequence not contained in umi and discard groups will remain in the read. Thus, it is important to construct the regular expression such that the beginning of the read is captured in groups.

Normally, the regular expression is matched at the beginning of the read and any unmatched bases at the end are assumed to be sequence. If the whole read must be matched, use --full_match.
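To make the matching behaviour concrete, here is a hand-written Rust equivalent of the example expression, using only the standard library instead of a regex engine. It is a sketch under assumptions: `Extraction` and `extract_example` are hypothetical names, reads are assumed to be ASCII, and the `full_match` flag models one reading of --full_match (the pattern must consume the entire read).

```rust
/// Result of applying the example pattern to one read.
#[derive(Debug, PartialEq, Eq)]
struct Extraction<'a> {
    umi: &'a str,       // bases captured by the `umi` group
    discarded: &'a str, // bases captured by the `discard` group
    sequence: &'a str,  // unmatched bases that remain in the read
}

/// Hand-written equivalent of `(?<umi>.{3})(?<discard>T{2})`,
/// matched at the beginning of the read.
fn extract_example<'a>(read: &'a str, full_match: bool) -> Option<Extraction<'a>> {
    // `.{3}`: any three bases form the UMI.
    let umi = read.get(..3)?;
    // `T{2}`: the next two bases must be the TT spacer.
    let discarded = read.get(3..5)?;
    if discarded != "TT" {
        return None;
    }
    // Whatever follows the match is kept as sequence.
    let sequence = &read[5..];
    // Assumed --full_match behaviour: nothing may be left unmatched.
    if full_match && !sequence.is_empty() {
        return None;
    }
    Some(Extraction { umi, discarded, sequence })
}
```

For the read ACGTTGGCA, this sketch gives the UMI ACG, discards TT, and keeps GGCA as sequence. With `full_match` set, the same read is rejected because GGCA is not covered by any group.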
