Spacer2PAM

An R package for guiding experimental determination of functional PAM sequences from CRISPR array spacers

Overview

The recent discovery and in-depth characterization of CRISPR-Cas9 and other CRISPR-Cas systems have led to a variety of technologies, including genome editing, genome modification, nucleic acid sensing, and next-generation antimicrobials. Although CRISPR-Cas systems are powerful tools to alter biology, they are often toxic when heterologously expressed in bacteria. Fortunately, about half of all sequenced bacteria encode at least one CRISPR-Cas system in their own genome, which provides an alternative to heterologous CRISPR-Cas systems for genome manipulation.

Endogenous CRISPR-Cas systems have been used to successfully edit the genomes of a few bacteria and archaea, but expansion of this method is hindered by the unique protospacer adjacent motif (PAM) sequence that each CRISPR-Cas system requires to target a DNA sequence. That is to say, the PAM must be known in order to target an endogenous CRISPR-Cas system toward the genome that encodes it. However, the PAM is often also recognized during the spacer acquisition process, which adds new spacers to the endogenous CRISPR array. During this process, foreign nucleic acid that is invading an organism is surveyed for the presence of a PAM, and the adjacent DNA is then excised and inserted into the CRISPR array. As such, reversing this process in silico would allow determination of the PAM sequence. Past efforts to do so have primarily consisted of individual researchers generating nucleotide alignments between CRISPR array spacers and sequences within a variable database, and then manually curating the alignments to hypothesize a few potential PAMs. Other, more sophisticated approaches have built tools to find and present the nucleotide alignments to the user, but leave the user to generate PAM predictions from the alignment data.
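The idea of reversing spacer acquisition can be pictured with a minimal base R sketch. It assumes a spacer matches a target sequence exactly and simply reports the nucleotides on either side of the match, which is where a PAM would be expected. The sequences, the `flank_length` parameter, and the `get_flanks` helper are hypothetical and are not part of Spacer2PAM.

```r
# Hypothetical helper (not a Spacer2PAM function): return the nucleotides
# that flank the first exact occurrence of a spacer within a target sequence.
get_flanks <- function(target, spacer, flank_length = 5) {
  start <- as.integer(regexpr(spacer, target, fixed = TRUE))
  if (start == -1) {
    return(NULL)  # the spacer does not occur in the target
  }
  end <- start + nchar(spacer) - 1
  list(
    upstream = substr(target, max(1, start - flank_length), start - 1),
    downstream = substr(target, end + 1, min(nchar(target), end + flank_length))
  )
}

# Made-up sequences for illustration only.
target <- "TTACGGATCGATCGTTAGCAGGTTTCA"
spacer <- "ATCGATCGTTAGCA"
flanks <- get_flanks(target, spacer)
print(flanks)
```

Real alignments are rarely exact and may fall on either strand, so a practical pipeline works from alignment coordinates rather than from a literal string search as done here.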

Here we present Spacer2PAM, a standardized in silico pipeline to predict PAM sequences for a given CRISPR-Cas system from annotated CRISPR array spacers. The tools in Spacer2PAM allow the user to manipulate and reformat CRISPR array spacer data and then predict PAM sequences from that data. Users may start with a FASTA file containing the CRISPR array spacers they wish to analyze (such as those from CRISPRCasdb) or with an annotated CSV file of CRISPR array spacers. Once the Spacer2PAM pipeline has run, the user is presented with a dataframe containing the statistics of their PAM prediction and a PDF file of a sequence logo annotated with the PAM prediction and score. Spacer2PAM is an easy-to-use pipeline for PAM prediction from CRISPR array spacers and is a key step toward enabling the use of endogenous CRISPR-Cas systems for genome engineering and other applications.

Workflow

Overview of functions in Spacer2PAM. Blue arrows and boxes indicate the initial data and functions needed to start the workflow, respectively.

The user starts by passing the name of the CRISPR-Cas system’s host organism and a user-defined identifier to setCRISPRInfo, which sets the name of the CRISPR-Cas system and defines the output file names. The user then chooses one of two options for supplying the CRISPR array spacer sequences. If starting with a FASTA file containing each spacer as an individual sequence, the user may call FASTA2DF to arrange the spacer sequences and other user-supplied information about the spacers into a dataframe suitable for downstream analysis with Spacer2PAM. We highly recommend that the user then call DF2FASTA to generate a FASTA file containing all the spacers. Although the user already has a FASTA file, doing so ensures that the title of each sequence is compatible with downstream Spacer2PAM functions. Alternatively, the user may start with a formatted dataframe containing the headers “Strain”, “Spacers”, “Array.Orientation”, “Repeat”, “Array”, and “Spacer” and pass it to DF2FASTA to generate a FASTA file containing the spacer sequences with the appropriate labels. The user then submits the sequences in the FASTA file to BLAST for alignment. This can be done programmatically through FASTA2Alignment, which returns a dataframe summarizing the results that can be passed to joinSpacerDFandAlignmentDF to continue. FASTA2Alignment does not work for all CRISPR arrays because of the data size restrictions on the Entrez API. For arrays that exceed this size limit and generate an error message from FASTA2Alignment, we recommend using the BLAST web server. The user should select the BLASTn algorithm and exclude Eukaryotes (taxid:2759), which limits the alignment to relevant organisms and decreases both BLAST and Spacer2PAM computational time. Once the alignment is completed through the BLAST web server, the resulting hit table should be downloaded in CSV format. The hit table file should then be passed to alignmentCSV2DF to convert it to a dataframe.
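To make the FASTA-to-dataframe step described above more concrete, here is a minimal base R sketch of how FASTA records can be collected into a dataframe with one row per spacer. It is not the implementation of FASTA2DF; the `fasta_to_df` helper, its `Title` and `Sequence` column names, and the example records are all assumptions made for illustration.

```r
# Hypothetical helper (not FASTA2DF): collect FASTA records into a dataframe.
fasta_to_df <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]          # drop empty lines
  is_header <- startsWith(lines, ">")
  # Each header starts a new record; sequence lines inherit its index.
  record <- cumsum(is_header)
  record_id <- factor(record[!is_header], levels = seq_len(sum(is_header)))
  # Concatenate multi-line sequences that belong to the same record.
  sequences <- tapply(lines[!is_header], record_id, paste, collapse = "")
  data.frame(
    Title = sub("^>", "", lines[is_header]),
    Sequence = as.character(sequences),
    stringsAsFactors = FALSE
  )
}

# Made-up records; in practice the lines could come from readLines() on a file.
example_fasta <- c(
  ">spacer_1",
  "ATCGATCGTTAG",
  "CAGGT",
  ">spacer_2",
  "GGCTTAACGT"
)
spacer_df <- fasta_to_df(example_fasta)
print(spacer_df)
```

A header with no sequence lines yields `NA` in the `Sequence` column, which makes malformed records easy to spot before they are submitted for alignment.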
The dataframe produced by FASTA2Alignment or alignmentCSV2DF can then be passed to joinSpacerDFandAlignmentDF. This function joins the spacer and alignment dataframes, assigning spacer information to each alignment in the hit table. It also uses the taxonomizr package to convert the accession number of each alignment to the genus and species name of the organism that encodes the aligned sequence. The taxonomizr package requires a locally downloaded and configured SQL database; the user should be prepared to store the 65 GB database (size at the time of writing) in a location that remains accessible while joinSpacerDFandAlignmentDF is in use. The resulting dataframe is sufficient for PAM prediction by join2PAM, but we recommend calling Submit2Phaster if the user plans to select the prophage prediction option in join2PAM. Submit2Phaster interacts with the PHASTER prophage prediction web server to submit a nonredundant list of accession numbers from the joined dataframe for prophage detection. Depending on the volume of traffic on the PHASTER server, prediction can take minutes to months to complete. Lastly, the joined dataframe is passed to join2PAM. This function is the core of Spacer2PAM and predicts a PAM sequence from the alignments generated by BLAST. Multiple combinations of filter sets can be run sequentially with a single call of join2PAM. The output of join2PAM is a dataframe named collectionFrame that summarizes the filtering process and records the upstream and downstream predicted PAMs as well as their associated PAM scores. Details on PAM identification and scoring can be found in the vignette associated with Spacer2PAM.
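As a rough intuition for how a motif can emerge from many alignments, the base R sketch below computes per-position nucleotide frequencies and information content for a handful of flanking sequences, then reports a consensus. This is a generic sequence-logo style calculation and should not be read as the algorithm or scoring used by join2PAM, which is described in the vignette. The flanking sequences and the `threshold` cutoff are made up for illustration.

```r
# Made-up downstream flanking sequences, one per alignment.
flanks <- c("TGGAA", "AGGTC", "CGGAT", "GGGCA", "TGGTT")

# One row per flank, one column per position.
flank_matrix <- do.call(rbind, strsplit(flanks, ""))

# Count each nucleotide at each position, then convert counts to frequencies.
bases <- c("A", "C", "G", "T")
counts <- apply(flank_matrix, 2, function(column) {
  table(factor(column, levels = bases))
})
frequencies <- sweep(counts, 2, colSums(counts), "/")

# Information content per position in bits (2 bits = fully conserved).
# No small-sample correction is applied in this sketch.
information <- apply(frequencies, 2, function(p) {
  p <- p[p > 0]
  2 + sum(p * log2(p))
})

# Report the most frequent base where conservation is high, otherwise "N".
threshold <- 1  # arbitrary cutoff chosen for this illustration
top_base <- bases[apply(frequencies, 2, which.max)]
consensus <- ifelse(information >= threshold, top_base, "N")
consensus_string <- paste(consensus, collapse = "")
print(consensus_string)
```

With only five sequences the estimate is crude; the point is only that conserved positions next to the aligned protospacer stand out against variable ones.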

Getting Started

Using the devtools package, run the following command in R:

devtools::install_github("grybnicky/Spacer2PAM")

Once all dependencies are installed, follow the instructions to prepare the taxonomizr SQL database at https://cran.r-project.org/web/packages/taxonomizr/vignettes/usage.html.

Dependencies

Spacer2PAM has the following dependencies:
dplyr
ggplot2
ggseqlogo
taxonomizr
HelpersMG
httr
jsonlite
spatstat.utils
seqinr
readr
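
Before running the pipeline, it can be convenient to confirm which of the packages listed above are already available. The following base R sketch does that without installing anything; the variable names are arbitrary, and the package names are copied from the list above.

```r
# Package names copied from the dependency list above.
dependencies <- c("dplyr", "ggplot2", "ggseqlogo", "taxonomizr", "HelpersMG",
                  "httr", "jsonlite", "spatstat.utils", "seqinr", "readr")

# requireNamespace() returns FALSE, rather than raising an error,
# when a package is not installed.
is_installed <- vapply(dependencies, requireNamespace, logical(1), quietly = TRUE)
missing_dependencies <- dependencies[!is_installed]

if (length(missing_dependencies) > 0) {
  message("Missing packages: ", paste(missing_dependencies, collapse = ", "))
} else {
  message("All Spacer2PAM dependencies are installed.")
}
```

Any packages reported as missing can then be installed in whichever way the user prefers.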