This code is designed to take fastq files that are produced by the MiSeq and return integration sites, multihits, and chimeras. RData and fasta files are used for data persistance in place of a database, thus allowing massive parallelization and the LSF job submission system on the PMACS HPC
Analysis is started by having the user create the following directory structure:
primaryAnalysisDirectory
├── Data
│ ├── Undetermined_S0_L001_I1_001.fastq.gz
│ ├── Undetermined_S0_L001_R1_001.fastq.gz
│ └── Undetermined_S0_L001_R2_001.fastq.gz
├── processingParams.csv
└── sampleInfo.csv
-
Data/Undetermined_S0_L001_*_001.fastq.gz
are the fastq files returned by the MiSeq (R1, R2, and I1) -
processingParams.csv
contains 'dryside' sample metadata:alias
is the human-readable sample descriptionqualityThreshold, badQualityBases, qualitySlidingWindow
- After error-correcting and demultiplexing, intSiteCaller trims raw MiSeq reads based on Illumina-issued quality scores.
badQualityBases
is the number of bases belowqualityThreshold
, using standard Illumina ASCII Q-score encoding (p.41-2), that can be observed in a window ofqualitySlidingWindow
before the read is trimmed.
- After error-correcting and demultiplexing, intSiteCaller trims raw MiSeq reads based on Illumina-issued quality scores.
primer
is the primer sequence as seen in MiSeq read 2ltrBit
is the LTR sequence as seen in MiSeq read 2largeLTRFrag
is 43nt of the LTR sequence as seen from MiSeq read 1linkerCommon
is 15nt of the linker sequence as seen from MiSeq read 2mingDNA
is the minimum length of genomic DNA (in nt) that is allowed to be passed onto the alignment stepvectorSeq
is a filepath (either absolute or relative to the primary analysis directory) to the vector sequence in fasta format -- it is encouraged to place the vector sequence directly in the primary analysis directory, although that is not a requirementminPctIdent
is the minimum percent identity that a query sequence has to a putative alignment on the target sequence in order to be considered 'valid'maxAlignStart
is the maximum number of nucleotides of the query sequence that do not match the putative target sequence before a matched nucleotide is seenmaxFragLength
is the maximum length of a properly-paired alignment (in nt)refGenome
is the reference genome to be used for alignment - this is passed in as a standard text string (ex. 'hg18', 'hg19', 'mm8', 'mm9')
-
sampleInfo.csv
contains 'wetside' sample metadata:alias
is the human-readable sample descriptionlinkerSequence
is the linker sequence as seen in MiSeq read 1. N's indicate the presence of a primerIDbcSeq
is the barcode used during sample preparationgender
is either 'M' of 'F' for male/female, respectively
After creating the above directory structure, the following command is issued:
Rscript intSiteCaller.R
The rest of the processing is fully automated and shouldn't take more than 4 hours to process 1.5e7 raw reads.
intSiteCaller.R
can handle the following options
-j
,--jobID
- Unique name by which to identify this intance of intSiteCaller [default: intSiteCallerJob]-c
,--codeDir
- Directory where intSiteCaller code is stored, can be relative or absolute [default: .]-p
,--primaryAnalysisDir
- Location of primary analysis directory, can be relative or absolute [default: .]-C
,--cleanup
- Remove temporary files upon successful execution of intSiteCaller-h
,--help
- Show the help message and exit
This code returns integration sites in two formats. allSites.RData
is a GRanges
object that contains a single record for each Illumina read. sites.final.RData
is a GRanges
object of dereplicated integration sites along with a tally of how many reads were seen for each site (irregardless of sonic breakpoint). The revmap
column in sites.final.RData
links unique reads from allSites
to dereplicated sites in sites.final
.
Multihits are stored in multihitData.RData
which is a list
of three items:
unclusteredMultihits
is aGRanges
object of raw multihits, directly fromproperlyPairedAlignments
clusteredMultihitPositions
is aGRangesList
object of clustered solostart multihits with each includedGRanges
object representing one multihit clusterclusteredMultihitLengths
is alist
object containing dataframes of multihit 'fragment lengths' for use in sonicAbundance. Each list record, n, cooresponds to the nth cluster as defined inclusteredMultihitPositions
. TheVar1
column cooresponds to the 'length' and theFreq
column cooresponds to the number of times that fragment length was observed.
Chimeras are stored in chimeraData.RData
which is a list that contains some basic chimera frequency statistics and a GRangesList
object. Each GRanges
object contains two records, one for the read1 alignment and another for the read2 alignment
PrimerIDs (if present in the linker sequence) are stored in primerIDData.RData
. This file is a base R list
containing a DNAStringSet
and a BStringSet
containing the sequences and quality scores of the primerIDs.
Processing statistics are returned in the stats.RData
file. This file contains a single data.frame
. Detail of the specific columns provided in this dataframe will be added later and can inferred from intSiteLogic.R
.
This code is highly dependent on Bioconductor packages for processing DNA data and collapsing/expanding alignments.
The following R packages and their subsesequent dependencies are required for proper operation of intSiteCaller
:
ShortRead
hiReadsProcessor
GenomicRanges
rtracklayer
BSgenome
argparse
igraph
- Any
BSgenome.*.UCSC.*
package cooresponding to reference genomes specified inprocessingParams.csv
Specific versioning analysis has not yet been performed.
Additionally, BLAT code requires the availability of the blat
and python
command. blat
is available on PMACS at /opt/software/blatSrc/v35/bin/x86_64
, however is not included in the default PATH
. This directory will need to be added to the user's default PATH
in order to successfullly run BLAT alignments.
intSiteCaller
confirms the presence of all dependancies and will throw an error if a dependancy is not met.
- Primary read trimming and integration site calling logic is contained in
intSiteLogic.R
. - Branching and condensing of PMACS jobs is performed in
programFlow.R
- Barcode error correcting logic is performed in
errorCorrectIndices/golay.py
as wrapped byerrorCorrectIndices/processGolay.py
. - All code files are checked into the repository.
- Flowcharts will be added to graphically describe the flow of the overall program as well as the flow/logic of individual functions
A sample dataset is included for verification of integration site calling accuracy. The testCases
directory contains a subdirectory, intSiteValidation
, and a compressed folder, intSiteValidationOUTPUT.tar.gz
. To analyze the test data, move intSiteValidation/*
to an LSF envrionment and execute the code as described in the 'Usage' section. Upon successful execution, the directory should resemble intSiteValidationOUTPUT.tar.gz
. Note that this subset of data contains samples with some borderline cases. For example, clone7 samples should all fail, and many of the clone1-clone4 samples should return no multihits or chimeras. The current implementation of the code handles these gracefully.