CRISPR, faster, better – The Crackling method for whole-genome target detection

debugst1ck/Split-ISSL

 
 


Crackling

Rapid Whole-Genome Identification of High Quality CRISPR Guide RNAs with the Crackling Method

Jacob Bradford, Timothy Chappell, and Dimitri Perrin. The CRISPR Journal. Jun 2022.410-421. http://doi.org/10.1089/crispr.2021.0102

Preamble

The design of CRISPR-Cas9 guide RNAs is not trivial and is a computationally demanding task. Design tools need to identify target sequences that will maximize the likelihood of obtaining the desired cut, while minimizing off-target risk. There is a need for a tool that can meet both objectives while remaining practical to use on large genomes.

In this study, we present Crackling, a new method that is more suitable for meeting these objectives. We test its performance on 12 genomes and on data from validation studies. Crackling maximizes guide efficiency by combining multiple scoring approaches. On experimental data, the guides it selects are better than those selected by others. It also incorporates Inverted Signature Slice Lists (ISSL) for faster off-target scoring. ISSL provides a gain of an order of magnitude in speed compared with other popular tools, such as Cas-OFFinder, Crisflash, and FlashFry, while preserving the same level of accuracy. Overall, this makes Crackling a faster and better method to design guide RNAs at scale.

Crackling is available at https://github.com/bmds-lab/Crackling under the Berkeley Software Distribution (BSD) 3-Clause license.

Introduction

CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) is a technology that research scientists use to selectively modify the DNA of living organisms.

Off-target effects in CRISPR/Cas9 gene editing are unexpected, unwanted, or even adverse alterations to the genome. They occur when Cas9 acts on unintended genomic sites and creates cleavages, which may cause genomic instability, disrupt the function of genes, and lead to adverse outcomes.

To prevent this, several software tools are available that can emulate or score off-target risk in gene editing. However, these have proved time-consuming and computationally expensive, especially for simple projects.

Therefore, to ensure portability and efficiency across different platforms and operating systems, and to allow deployment on an AWS server so that researchers across the world can use it in the cloud, a new method for off-target scoring was developed: Inverted Signature Slice Lists (ISSL).

ISSL provides a gain of an order of magnitude in speed compared with other popular tools, such as Cas-OFFinder, Crisflash, and FlashFry, while preserving the same level of accuracy.

However, ISSL has a limitation: it requires a relatively large amount of memory. This is where this project comes in. It changes the structure of ISSL to reduce that requirement while retaining both its behavioral and informational semantics.

Structure

Structure Old Difference

Think of ISSL as a dictionary in which each word is an off-target. Previously, the whole dictionary was memorized, that is, loaded directly into memory, which proved demanding. The new method splits the dictionary into multiple booklets and reads them sequentially, so only part of the dictionary needs to be held in memory at any one time.
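The booklet analogy can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `loadBooklet`, `scanBooklets`, and `ScanResult` are hypothetical names, and each booklet is treated as an opaque binary file rather than a real .issl file. The point it shows is that peak memory follows the largest booklet, not the sum of all of them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: read one whole booklet (file) into memory.
// A missing file simply yields an empty buffer.
std::vector<char> loadBooklet(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

struct ScanResult {
    std::size_t totalBytes;  // bytes read over the whole run
    std::size_t peakBytes;   // largest amount held in memory at once
};

// Visit the booklets one after another; each buffer is released
// before the next one is loaded.
ScanResult scanBooklets(const std::vector<std::string>& paths) {
    assert(!paths.empty());
    ScanResult result{0, 0};
    for (const std::string& path : paths) {
        std::vector<char> booklet = loadBooklet(path);
        result.totalBytes += booklet.size();
        result.peakBytes = std::max(result.peakBytes, booklet.size());
        // ... score queries against this booklet here ...
    }  // booklet goes out of scope and its memory is freed
    return result;
}
```

Calling `scanBooklets` with several paths reports a `peakBytes` equal to the size of the largest file, which is the property the split index relies on.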

Dependencies

  • The original Crackling project, to isolate off-targets. Other than that, no dependencies. Yay!

Compilation

g++.exe -g ./src/CreateSplitIndex.cpp -o ./bin/CreateSplitIndex.exe -O3
g++.exe -g ./src/ReadSplitIndex.cpp -o ./bin/ReadSplitIndex.exe -I ./include -O3
  • Requires the POSIX threads library in Linux environments. If compilation fails, add the -pthread argument to g++.
  • Make sure you set the include path to ./include.

Usage

To convert the off-targets to binary-encoded .issl slice-list files:

[EXECUTABLE_PATH] [offtargetSites_PATH] [SEQUENCE_LENGTH] [SLICE_WIDTH(BITS)] [ISSL_TABLE]

Example usage:

./bin/CreateSplitIndex.exe ./test/issl/guides/GCA_000008365.offtargets 20 8 ./test/issl/GCA_000008365.0.issl ./test/issl/GCA_000008365.1.issl ./test/issl/GCA_000008365.2.issl ./test/issl/GCA_000008365.3.issl ./test/issl/GCA_000008365.4.issl
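To make the SEQUENCE_LENGTH and SLICE_WIDTH(BITS) arguments more concrete, here is a minimal sketch of how a sequence could be binary encoded and cut into slices. It assumes a two-bits-per-nucleotide encoding (A=0, C=1, G=2, T=3, first base in the lowest bits); the function names are hypothetical and the actual layout used by CreateSplitIndex may differ. Under that assumption, a 20-nucleotide sequence occupies 40 bits, and a slice width of 8 bits gives 5 slices, which matches the five .issl paths in the example, although that correspondence is an inference rather than something the project states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumed encoding: two bits per nucleotide, first base in the lowest bits.
std::uint64_t encodeSequence(const std::string& seq) {
    assert(seq.size() <= 32);  // 32 bases * 2 bits fit in 64 bits
    std::uint64_t signature = 0;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        std::uint64_t code = 0;
        switch (seq[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default: assert(false && "unexpected nucleotide");
        }
        signature |= code << (2 * i);
    }
    return signature;
}

// Cut the encoded sequence into consecutive slices of sliceWidthBits bits.
std::vector<std::uint64_t> sliceSignature(std::uint64_t signature,
                                          unsigned seqLength,
                                          unsigned sliceWidthBits) {
    const unsigned totalBits = 2 * seqLength;
    assert(seqLength <= 32);
    assert(sliceWidthBits > 0 && sliceWidthBits < 64);
    assert(totalBits % sliceWidthBits == 0);
    const std::uint64_t mask = (std::uint64_t{1} << sliceWidthBits) - 1;
    std::vector<std::uint64_t> slices;
    for (unsigned shift = 0; shift < totalBits; shift += sliceWidthBits) {
        slices.push_back((signature >> shift) & mask);
    }
    return slices;
}
```

With these definitions, `sliceSignature(encodeSequence(seq), 20, 8)` returns five 8-bit values for any 20-nucleotide `seq`, each covering four consecutive bases.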

To read and score .issl files:

[EXECUTABLE_PATH] [CANDIDATE_GUIDE_PATH] [MAXIMUM_DISTANCE] [SCORE_THRESHOLD] [ISSL_TABLE]

Example usage:

./bin/ReadSplitIndex.exe ./test/issl/guides/CG20-100.txt 4 0 and ./test/issl/GCA_000008365.0.issl ./test/issl/GCA_000008365.1.issl ./test/issl/GCA_000008365.2.issl ./test/issl/GCA_000008365.3.issl ./test/issl/GCA_000008365.4.issl
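The MAXIMUM_DISTANCE argument (4 in the example) can be read as the largest number of mismatches allowed between a candidate guide and an off-target. The sketch below shows one common way to count mismatches between two encoded sequences with XOR and a population count. It assumes a two-bits-per-nucleotide encoding; `countMismatches` and `withinDistance` are hypothetical names, not functions taken from ReadSplitIndex.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Count positions where two encoded sequences differ.
// Assumes each nucleotide occupies two adjacent bits.
unsigned countMismatches(std::uint64_t a, std::uint64_t b, unsigned seqLength) {
    assert(seqLength >= 1 && seqLength <= 32);
    // Keep only the bits that belong to the sequence.
    const std::uint64_t used = (seqLength == 32)
        ? ~std::uint64_t{0}
        : ((std::uint64_t{1} << (2 * seqLength)) - 1);
    const std::uint64_t diff = (a ^ b) & used;
    // Collapse each 2-bit pair to a single bit that is set
    // when either bit of the pair differs.
    const std::uint64_t collapsed = (diff | (diff >> 1)) & 0x5555555555555555ULL;
    return static_cast<unsigned>(std::bitset<64>(collapsed).count());
}

// True when the two sequences are within the allowed number of mismatches.
bool withinDistance(std::uint64_t a, std::uint64_t b,
                    unsigned seqLength, unsigned maxDistance) {
    return countMismatches(a, b, seqLength) <= maxDistance;
}
```

The collapse step matters: a plain population count of `a ^ b` would count a single changed base as two mismatches whenever both of its bits differ.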

Performance

We ran 38 tests on various genomic sequences from NCBI, across multiple operating systems (Windows and Ubuntu). Runtime and memory usage are summarized in the following charts.

Runtime difference chart

Because this solution depends heavily on file I/O, its runtime depends on the number of queries and the number of off-targets, and shows a linear trend. The graph shows the trend against off-target count for a fixed query size. The solution is "blazingly fast", given the O(1) lookup time for each query.

Memory difference chart

Memory usage is, as expected, significantly reduced, making it "economically nimble".

High-performance Bioinformatic analysis of gene editing targets

The new method stores the offset of each slice in the slice list and loads slices into memory as needed for each query. In theory, this would reduce the maximum memory used when loading slice lists by 98%, and should increase performance by reducing I/O-bound operations.

However, this must be tested. With this process, I/O requests are issued at different points in time, which leaves little room for compiler and operating-system optimizations. This may be somewhat hard on HDDs (but who uses HDDs anyway? Cloud providers do).

We first tried multiprocessing with every query running in parallel and multiple threads within each query. This proved extremely inefficient in memory usage and time-consuming for large sets of candidate guides.

However, running each query in parallel with its sub-queries in serial proved extremely efficient.
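The arrangement that worked (queries in parallel, sub-queries in serial) can be sketched with `std::thread`. This is an illustration only: it reads "sub-queries" as the per-slice lookups of a single guide, which is an interpretation, and `scoreAgainstSlice` is a hypothetical stand-in that just counts exact matches rather than computing a real off-target score.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for scoring one query against one slice list.
unsigned scoreAgainstSlice(std::uint64_t query,
                           const std::vector<std::uint64_t>& slice) {
    unsigned hits = 0;
    for (std::uint64_t entry : slice) {
        if (entry == query) ++hits;
    }
    return hits;
}

// Queries are spread across worker threads; within one query the
// slices are visited one after another on the same thread.
std::vector<unsigned> scoreAll(
        const std::vector<std::uint64_t>& queries,
        const std::vector<std::vector<std::uint64_t>>& slices,
        unsigned workerCount) {
    assert(workerCount > 0);
    std::vector<unsigned> totals(queries.size(), 0);
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        workers.emplace_back([&queries, &slices, &totals, w, workerCount] {
            // Worker w handles queries w, w + workerCount, ...
            for (std::size_t q = w; q < queries.size(); q += workerCount) {
                for (const auto& slice : slices) {  // serial sub-queries
                    totals[q] += scoreAgainstSlice(queries[q], slice);
                }
            }
        });
    }
    for (std::thread& t : workers) t.join();
    return totals;
}
```

Each element of `totals` is written by exactly one thread, so no locking is needed, and no per-slice threads are created, which avoids the overhead described for the first attempt.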

New structure

This uses a precalculated table that stores the location of each slice index in storage. The table is kept in memory, giving a constant O(1) lookup time. Although the slice index itself is then read serially, skipping the preceding indices greatly improves speed for large ISSL files while greatly reducing memory usage.
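A minimal sketch of such an offset table follows. The on-disk layout here is invented for illustration (slice lists of 64-bit values written back to back), and `SliceEntry`, `writeIndex`, and `readSliceList` are hypothetical names; the real .issl format is not described in this document. What the sketch shows is the idea itself: the table stays in memory, and a read seeks straight to one list instead of loading everything before it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// One table row: where a slice list starts and how many entries it has.
struct SliceEntry {
    std::streamoff offset;
    std::uint32_t count;
};

// Write the lists back to back and record where each one starts.
std::vector<SliceEntry> writeIndex(
        const std::string& path,
        const std::vector<std::vector<std::uint64_t>>& lists) {
    std::ofstream out(path, std::ios::binary);
    std::vector<SliceEntry> table;
    std::streamoff offset = 0;
    for (const auto& list : lists) {
        const std::size_t bytes = list.size() * sizeof(std::uint64_t);
        table.push_back({offset, static_cast<std::uint32_t>(list.size())});
        out.write(reinterpret_cast<const char*>(list.data()),
                  static_cast<std::streamsize>(bytes));
        offset += static_cast<std::streamoff>(bytes);
    }
    return table;
}

// O(1) table lookup, then a seek and a serial read of just that list.
std::vector<std::uint64_t> readSliceList(const std::string& path,
                                         const std::vector<SliceEntry>& table,
                                         std::size_t sliceValue) {
    assert(sliceValue < table.size());
    const SliceEntry& entry = table[sliceValue];
    std::vector<std::uint64_t> list(entry.count);
    std::ifstream in(path, std::ios::binary);
    in.seekg(entry.offset, std::ios::beg);
    in.read(reinterpret_cast<char*>(list.data()),
            static_cast<std::streamsize>(entry.count * sizeof(std::uint64_t)));
    return list;
}
```

Only the table and the one requested list are in memory at any time, so memory use is bounded by the longest list rather than by the size of the file.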

Speedy Runs

As you can see here, the runtime scales blazingly fast for large queries, as a result of (minimal) parallel processing.

Efficiency

Memory scales extremely efficiently as well. The plateau for the original off-target scoring is due to it reaching the memory limit: the local machine instance had a maximum of 4 GB of RAM, so it could not scale any further.

The memory overhead of multiprocessing is effectively non-existent for the new program, since everything is loaded from storage, processed, and discarded immediately.

And no, no memory leaks for now.
