A Python script to calculate the relative coverage of the X and Y chromosomes, and the associated error bars, from the depth of coverage at specified SNPs.
Mathematical equations added to README using this tool.
The Python script takes a modified samtools depth output as input, read from stdin by default or from a file given with -I. The samtools depth file should be manually modified to include a header line that begins with a # and lists the sample names (generic or specific) as column headers, like below:
#Chr Pos Sample1 Sample2 Sample3 Sample4 Sample5
1 752566 1 0 1 0 1
1 776546 0 0 0 0 0
1 832918 0 1 0 0 0
1 842013 0 1 0 3 1
...
Alternatively, a sample/BAM list can be provided using the -f option. This list should contain one name per line, in the same order as the columns of the samtools depth output, and can be the same list used for the samtools depth command.
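If the header route is preferred, the header line can be generated from the same sample list rather than typed by hand. The snippet below is a minimal sketch, not part of Sex.DetERRmine itself: the helper names `read_sample_list` and `make_header` are hypothetical, and it assumes the depth table is tab-separated, as samtools depth output normally is.

```python
def read_sample_list(lines):
    """Return one sample name per non-empty line, preserving order."""
    return [line.strip() for line in lines if line.strip()]


def make_header(sample_names):
    """Build a '#'-prefixed header line (tab-separated is an assumption)."""
    return "#Chr\tPos\t" + "\t".join(sample_names)


# Example with an in-memory list standing in for the lines of a sample list file.
names = read_sample_list(["Sample1\n", "Sample2\n", "\n", "Sample3\n"])
header = make_header(names)
print(header)
```

The resulting line would then be placed above the first row of the depth table before passing it to the script.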
For instructions on the available options, run the script with the -h flag:
$ Sex.DetERRmine.py -h
usage: Sex.DetERRmine.py [-h] [-I <INPUT FILE>] [-f SAMPLELIST]

Calculate the relative X- and Y-chromosome coverage of data, as well as the
associated error bars for each.

optional arguments:
  -h, --help            show this help message and exit
  -I <INPUT FILE>, --Input <INPUT FILE>
                        The input samtools depth file. Omit to read from
                        stdin.
  -f SAMPLELIST, --SampleList SAMPLELIST
                        A list of samples/bams that were in the depth file.
                        One per line. Should be in the order of the samtools
                        depth output.
The script prints the number of SNPs and the number of reads found on the autosomes, the X chromosome, and the Y chromosome, as well as the relative X and Y coverage and their associated errors.
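The counting step described above can be illustrated with a short sketch. This is not the script's actual code: the function names `classify` and `tally` are hypothetical, and the chromosome naming (`X`/`Y`, optionally with a `chr` prefix or coded as `23`/`24`) is an assumption. Every chromosome that is not X or Y is treated as autosomal here, which is a simplification.

```python
from collections import defaultdict


def classify(chrom):
    """Map a chromosome name to 'Aut', 'X' or 'Y' (naming is an assumption)."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name in ("X", "23"):
        return "X"
    if name in ("Y", "24"):
        return "Y"
    return "Aut"


def tally(lines):
    """Count SNP rows per class, and reads per class for each sample column."""
    samples = []
    snps = defaultdict(int)                        # class -> number of SNPs
    reads = defaultdict(lambda: defaultdict(int))  # sample -> class -> reads
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("#"):
            samples = fields[2:]                   # names follow #Chr and Pos
            continue
        cls = classify(fields[0])
        snps[cls] += 1
        for sample, depth in zip(samples, fields[2:]):
            reads[sample][cls] += int(depth)
    return dict(snps), {s: dict(c) for s, c in reads.items()}


example = [
    "#Chr Pos Sample1 Sample2",
    "1 752566 1 0",
    "1 776546 2 3",
    "X 100 4 1",
    "Y 200 0 2",
]
snp_counts, read_counts = tally(example)
print(snp_counts)
print(read_counts)
```

The sketch relies on the header line to name the sample columns; input without a header would need the names supplied separately, as the -f option does.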
It is possible to pipe the samtools depth output directly to this script:
samtools depth -a -q30 -Q30 -b <BED File> -f <BAM file list> | Sex.DetERRmine.py -f <BAM file list>
If you use Sex.DetERRmine in your analysis, please cite:
Lamnidis, T.C. et al., 2018. Ancient Fennoscandian genomes reveal origin and spread of Siberian ancestry in Europe. Nature Communications, 9(1), p.5018. Available at: http://dx.doi.org/10.1038/s41467-018-07483-5.
We assume that sequenced reads are distributed along the genome randomly and independently from each other. The "genome" here is made up only of positions in the input depth file.
N_i is the number of sequenced reads in a chunk i of the genome; the N_i sum to the total number of reads on target, N.
We can then calculate:
Where p_i is the proportion of all sequenced reads that map to SNPs in i, estimated from the input depths. The error around N_i is the error of the binomial distribution. Then:
Where d_i is the average depth on SNPs within i, and S_i is the number of SNPs in i.
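Read together, the definitions above suggest the following form for the two relations. This is a reconstruction from the surrounding text rather than a copy of the original equations; it assumes the binomial standard deviation is used as the error on N_i and that S_i is treated as an exact constant, and the symbol sigma for the errors is introduced here for illustration.

```latex
\begin{align}
  N_i &= N p_i \pm \sqrt{N p_i (1 - p_i)}, & \sum_i N_i &= N \\
  d_i &= \frac{N_i}{S_i}, & \sigma_{d_i} &= \frac{\sqrt{N p_i (1 - p_i)}}{S_i}
\end{align}
```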
The relative coverage on the X and Y chromosomes can then be calculated as:
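One plausible way to write this, assuming the coverage is expressed relative to the average autosomal depth (the symbols r_X, r_Y and the subscript A for the autosomes are introduced here for illustration):

```latex
\begin{equation}
  r_X = \frac{d_X}{d_A}, \qquad r_Y = \frac{d_Y}{d_A}
\end{equation}
```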
We can then use error propagation to calculate the errors around the relative X and Y coverages:
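The whole calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script's actual implementation: the function name `relative_coverage` is hypothetical, coverage is taken relative to the autosomal depth, and the errors on the two depths are combined with standard first-order propagation for a ratio, treating them as independent (any covariance between the read counts is ignored).

```python
from math import sqrt


def relative_coverage(reads, snps):
    """Return {'X': (ratio, error), 'Y': (ratio, error)}.

    reads and snps are dicts keyed by 'Aut', 'X' and 'Y', holding the
    number of reads (N_i) and the number of SNPs (S_i) in each part.
    """
    total = sum(reads[part] for part in ("Aut", "X", "Y"))  # N
    depth, depth_err = {}, {}
    for part in ("Aut", "X", "Y"):
        p = reads[part] / total                   # p_i
        n_err = sqrt(total * p * (1 - p))         # binomial error on N_i
        depth[part] = reads[part] / snps[part]    # d_i = N_i / S_i
        depth_err[part] = n_err / snps[part]      # error on d_i
    result = {}
    for part in ("X", "Y"):
        ratio = depth[part] / depth["Aut"]
        # First-order propagation for r = d_part / d_Aut (assumed independent).
        err = sqrt(
            (depth_err[part] / depth["Aut"]) ** 2
            + (depth[part] * depth_err["Aut"] / depth["Aut"] ** 2) ** 2
        )
        result[part] = (ratio, err)
    return result


# Made-up counts, chosen only to exercise the function.
res = relative_coverage(
    reads={"Aut": 900, "X": 100, "Y": 0},
    snps={"Aut": 900, "X": 100, "Y": 50},
)
print(res)
```

The sketch does not guard against empty input: a total read count of zero, or a part with no SNPs or no autosomal reads, would raise a ZeroDivisionError and needs handling in real use.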