forked from ginolhac/mapDamage
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
26 changed files
with
1,050 additions
and
1,574 deletions.
There are no files selected for viewing
Binary file not shown.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# Permission is hereby granted, free of charge, to any person obtaining a copy | ||
# of this software and associated documentation files (the "Software"), to deal | ||
# in the Software without restriction, including without limitation the rights | ||
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
# copies of the Software, and to permit persons to whom the Software is | ||
# furnished to do so, subject to the following conditions: | ||
# | ||
# The above copyright notice and this permission notice shall be included in all | ||
# copies or substantial portions of the Software. | ||
# | ||
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
# SOFTWARE. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# file GENERATED by distutils, do NOT edit | ||
README.txt | ||
setup.py | ||
bin/mapDamage | ||
mapdamage/__init__.py | ||
mapdamage/align.py | ||
mapdamage/composition.py | ||
mapdamage/parseoptions.py | ||
mapdamage/rscript.py | ||
mapdamage/seq.py | ||
mapdamage/tables.py | ||
mapdamage/version.py | ||
mapdamage/Rscripts/mapDamage.R |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
include mapdamage/Rscripts/mapDamage.R |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
Introduction | ||
------------ | ||
|
||
mapDamage is a Perl script that tracks DNA damage patterns among ancient DNA sequencing reads generated by Next-Generation Sequencing platforms. | ||
In addition to the mapDamage Perl script, SAMtools, BEDtools and R are mandatory and must be present in your $PATH. | ||
|
||
http://sourceforge.net/projects/samtools/files/samtools/ | ||
|
||
http://code.google.com/p/bedtools/downloads/list | ||
|
||
http://www.r-project.org/ | ||
|
||
The original page with examples, datasets and result files is there: | ||
|
||
http://geogenetics.ku.dk/publications/mapdamage/ | ||
|
||
News in version 0.3.6 | ||
--------------------- | ||
|
||
* exchange colors for adenine/cytosine in fragmentation patterns, could be confusing with G>A in blue | ||
* bug fix with BAM files obtained from tmap and bowtie | ||
* hard clipping support | ||
|
||
Usage | ||
----- | ||
|
||
Three main commands can be used: map, merge and plot. The -c option in map allows to chain all the three steps. A detailed description is presented below: | ||
|
||
* 1. map | ||
|
||
./mapDamage.pl map -i input_SAM -d directory -r reference_fasta -c -t my_title | ||
|
||
REQUIRED | ||
-i : input SAM-file without header. SAM format, version 1.4 is described in this pdf file | ||
-r : input reference fasta file, with headers identical to the ones in the input_SAM file | ||
|
||
OPTIONAL | ||
-l : maximal read length, in nucleotides to consider [default: 70] | ||
-a : size, in nucleotides, of the genomic region to be retrieved before and after reads [default: 10] | ||
-d : folder for writing output-files, if not already present, this folder will be automatically created [default name: results_input_SAM.filtered] | ||
-j : second input SAM-file without header; reads present in both SAM files will not be processed | ||
-u : removes all non-unique reads based on the X1 and XT tags from the BWA mapper | ||
-k : keep bed file and fasta file containing all the genomic regions [default: delete these files] | ||
-f : outputs all read aligned against the reference genome, in a fasta file format | ||
-c : complete analysis, map will be automatically followed by both merge and plot steps [default: not active] | ||
|
||
when -c is active, the following are available to control the plot step: | ||
-b : the number of reference nucleotides to consider for ploting base composition in the region located upstream and downstream of every read [default: 10] | ||
-y : graphical y-axis limit for nucleotide misincorporation frequencies [default: 0.30] | ||
-m : read length, in nucleotides, considered for plotting nucleotide misincorporations [default: 25] | ||
|
||
* 2. merge | ||
|
||
./mapDamage.pl merge -d directory | ||
|
||
REQUIRED | ||
-d : folder with output-files generated by the map command | ||
|
||
OPTIONAL | ||
-c : complete analysis, merge will be automatically followed by the plot step [default: not active] | ||
|
||
* 3. plot | ||
|
||
./mapDamage.pl plot -d directory | ||
|
||
REQUIRED | ||
-d : folder with output-files generated by the merge command | ||
|
||
OPTIONAL | ||
-l : read length, in nucleotides, considered for plotting nucleotide misincorporations [default: 25] | ||
-m : graphical y-axis limit for nucleotide misincorporation frequencies [default: 0.30] | ||
-a : the number of reference nucleotides to consider for ploting base composition in the region located upstream and downstream of every read [default: 10] | ||
-t : title used for both graph and filename [default: folder name without extension] | ||
|
||
An example of FragMisincorporation_plot.pdf can be downloaded [here](https://github.com/ginolhac/mapDamage/blob/master/FragMisincorporation_plot.pdf?raw=true) | ||
|
||
|
||
Citation | ||
-------- | ||
If you use this script, please cite the following publication: Ginolhac A, Rasmussen M, Gilbert MT, Willerslev E, Orlando L. | ||
mapDamage: testing for damage patterns in ancient DNA sequences. Bioinformatics. 2011 27(15):2153-5 | ||
http://bioinformatics.oxfordjournals.org/content/27/15/2153 | ||
|
||
Limitations | ||
----------- | ||
The positions used in fragmentation files are not compatible with read length > 99999 nucleotides. This is compatible with all sequencing technologies currently available. | ||
Contact | ||
------- | ||
Please report bugs and suggest possible improvements to Aurelien Ginolhac by email. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,212 @@ | ||
#!/usr/bin/env python | ||
# -*- coding: ASCII -*- | ||
from __future__ import print_function | ||
|
||
import sys | ||
import os | ||
|
||
# check if pysam if available | ||
module = "pysam" | ||
url = "http://code.google.com/p/pysam/" | ||
try: | ||
__import__(module) | ||
|
||
except ImportError, e: | ||
sys.stderr.write("Error: Could not import required module '%s':\n\t- %s\n" % (module,e)) | ||
sys.stderr.write(" If module is not installed, please download from '%s'.\n" % (url,)) | ||
sys.stderr.write(" A local install may be performed using the following command:\n") | ||
sys.stderr.write(" $ python setup.py install --user\n\n") | ||
sys.exit(1) | ||
|
||
import pysam | ||
|
||
import mapdamage | ||
from mapdamage.version import __version__ | ||
|
||
|
||
""" | ||
plot damage patterns from a SAM/BAM file | ||
:Author: Aurelien Ginolhac | ||
:Contact: [email protected] | ||
:Date: November 2012 | ||
:Type: tool | ||
:Input: SAM/BAM | ||
:Output: tabulated tables, pdf | ||
""" | ||
|
||
|
||
def getCoordinates(read): | ||
|
||
if read.is_reverse: | ||
fivep = read.aend | ||
threep = read.pos | ||
else: | ||
fivep = read.pos | ||
threep = read.aend | ||
return(fivep, threep) | ||
|
||
|
||
def getAround(coord, chrom, reflengths, lg, ref): | ||
|
||
i = j = 0 | ||
before = after = "" | ||
if min(coord) - lg > 0: | ||
i = min(coord) - lg | ||
if max(coord) + lg > reflengths[chrom]: | ||
j = reflengths[chrom] | ||
else: | ||
j = max(coord) + lg | ||
|
||
before = ref.fetch(chrom, i, min(coord)) | ||
after = ref.fetch(chrom, max(coord), j) | ||
|
||
return(before, after) | ||
|
||
|
||
def writeFasta(read,ref, seq, refseq, start, end, before, after, fout): | ||
if read.is_reverse: | ||
std = '-' | ||
else: | ||
std = '+' | ||
# output coordinate in 1-based offset | ||
fout.write(">%s:%d-%d\n%s\n" % (ref, start-len(before), start, before)) | ||
fout.write(">%s:%d-%d\n%s\n>%s_%s\n%s\n" % (ref, start+1, end+1, refseq, read.qname, std, seq)) | ||
fout.write(">%s:%d-%d\n%s\n" % (ref, end+2, end+2+len(after), after)) | ||
|
||
|
||
|
||
""" main """ | ||
|
||
def main(argv): | ||
|
||
options = mapdamage.parseoptions.options(argv) | ||
if not options: | ||
return 1 | ||
|
||
# plot using R if results folder already done | ||
if options.plotonly: | ||
mapdamage.rscript.plot(options) | ||
return 0 | ||
|
||
# open mapped reads and corresponding reference | ||
in_bam = pysam.Samfile(options.filename) | ||
ref = pysam.Fastafile(options.ref) | ||
|
||
# for misincorporation patterns, to record mismaches, from 5'-ends | ||
misincorp = {} | ||
# for fragmentation patterns, record base compositions in the reference around | ||
dnacomp = {} | ||
|
||
# fetch all references and associated lengths in nucleotides | ||
reflengths = dict(zip(in_bam.references, in_bam.lengths)) | ||
# initialize table of misincorporations with zeros | ||
misincorp = mapdamage.tables.initializeMut(misincorp, reflengths, options.length) | ||
# initialise base composition tables with zeros | ||
dnacomp = mapdamage.tables.initializeComp(dnacomp, reflengths, options.around, options.length) | ||
|
||
|
||
if not options.quiet: | ||
print("\tReading from '%s'" % options.filename) | ||
if options.verbose: | ||
print("\t%d references are assumed in SAM/BAM file, for a total of %d nucleotides" % (len(reflengths), sum(reflengths.values()))) | ||
print("\tWriting results to '%s/'" % options.folder) | ||
|
||
# reference in BAM must be in the fasta reference file | ||
#if not referenceCheck(): | ||
#sys.stderr.write("Error: %s reference was not found in the reference file provided\n" % (chrom)) | ||
# return 1 | ||
|
||
|
||
# open file handler to write alignements in fasta format | ||
if options.fasta: | ||
# use name of the sam/bam filename without extension | ||
ffasta = os.path.splitext(os.path.basename(options.filename))[0]+'.fasta' | ||
if not options.quiet: | ||
print("\tWriting alignments in '%s'" % ffasta) | ||
ff = open(options.folder+"/"+ffasta,"w") | ||
|
||
counter = filtered = 0 | ||
|
||
# main loop | ||
for read in in_bam: | ||
counter += 1 | ||
|
||
if read.is_unmapped: | ||
filtered +=1 | ||
continue | ||
|
||
# external coordinates 5' and 3' , 0-based offset | ||
coordinate = getCoordinates(read) | ||
# fetch reference name, chromosome or contig names | ||
chrom = in_bam.getrname(read.tid) | ||
|
||
(before, after) = getAround(coordinate, chrom, reflengths, options.around, ref) | ||
refseq = ref.fetch(chrom, min(coordinate), max(coordinate)) | ||
# read.query contains aligned sequences while read.seq is the read itself | ||
seq = read.query | ||
|
||
# add gaps according to the cigar string | ||
(seq, refseq) = mapdamage.align.align(read.cigar, seq, refseq) | ||
|
||
# reverse complement read and reference when mapped reverse strand | ||
if read.is_reverse: | ||
refseq = refseq.translate(mapdamage.seq.table)[::-1] | ||
seq = seq.translate(mapdamage.seq.table)[::-1] | ||
beforerev = after.translate(mapdamage.seq.table)[::-1] | ||
after = before.translate(mapdamage.seq.table)[::-1] | ||
before = beforerev | ||
|
||
# record soft clippinp when present | ||
if len(mapdamage.align.parseCigar(read.cigar, 4)) > 0: | ||
mapdamage.align.recordSoftClipping(mapdamage.align.parseCigar(read.cigar, 4), read, chrom, misincorp, options.length) | ||
|
||
# count misincorparations by comparing read and reference base by base | ||
mapdamage.align.getMis(read, seq, refseq, chrom, options.length, misincorp, '5p') | ||
# do the same with sequences align to 3'-ends | ||
mapdamage.align.getMis(read, seq[::-1], refseq[::-1], chrom, options.length, misincorp, '3p') | ||
|
||
# compute base composition for genomic regions | ||
mapdamage.composition.countRefComp(read, chrom, before, after, dnacomp) | ||
# compute base composition for reads | ||
mapdamage.composition.countReadComp(read, chrom, options.length, dnacomp) | ||
|
||
|
||
if options.fasta: | ||
writeFasta(read, chrom, seq, refseq, min(coordinate), max(coordinate), before, after, ff) | ||
|
||
if options.verbose: | ||
if counter % 50000 == 0: | ||
print("\t%10d reads processed (%d unmapped)" % (counter, filtered)) | ||
|
||
if not options.quiet: | ||
print("\tDone. %d reads read (%d unmapped)" % (counter, filtered)) | ||
|
||
# close file handlers | ||
in_bam.close() | ||
ref.close() | ||
if options.fasta: | ||
ff.close() | ||
|
||
# open file handlers to output results | ||
fmut = open(options.folder+"/"+"misincorporation.txt", 'w') | ||
fcomp = open(options.folder+"/"+"dnacomp.txt", 'w') | ||
|
||
# write summary tables to disk | ||
mapdamage.tables.printMut(misincorp, options, fmut) | ||
mapdamage.tables.printComp(dnacomp, options, fcomp) | ||
|
||
# close file handlers | ||
fmut.close() | ||
fcomp.close() | ||
|
||
# plot using R | ||
if not options.nor: | ||
mapdamage.rscript.plot(options) | ||
|
||
return 0 | ||
|
||
if __name__ == '__main__': | ||
sys.exit(main(sys.argv[1:])) |
Binary file not shown.
Oops, something went wrong.