**Note:** some options can be combined, but they are applied in a fixed priority order. For example, extract and sort can be run in a single step, while sort followed by extract will not work; in such cases, break the task into two or more steps.
### convert_fastq_quality.pl
> Convert fastq encodings
@@ -49,6 +99,92 @@ Simply type "perl certain_script.pl" or "perl certain_script.pl -h" for details
This script does quite a lot of things, including filtering, combining, clustering, etc.; it seems I put too many functions in here ...
However, since the VCF format varies between callers, this script was mainly tailored for VCF files generated by GATK (UnifiedGenotyper or HaplotypeCaller, http://www.broadinstitute.org/gatk/). Some functions require the AD (allele depth) field, so it may not perform well on VCF files generated by other callers.
#### Filtering variants
* Filtering by depth: this only marks samples whose depth fails the criterion as missing; it does not filter out the whole locus
* Specify some samples as naturally homozygous (e.g. inbred lines), with the others treated as heterozygous, then filter heterozygous sites in homozygous samples (denoted "pseudo-heterozygosity" here, mostly arising from mapping errors due to duplications)

**Note:** some filtering criteria are applied in a priority order, so do check the results after filtering!
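
The per-sample behaviour of depth filtering can be illustrated with a minimal Python sketch. This is not the code used by vcf_process.pl; the function name `mask_low_depth`, the `min_depth` parameter, and the choice to treat a missing DP value as a failure are all assumptions made for illustration.

```python
def mask_low_depth(vcf_line, min_depth):
    """Set GT to missing for samples whose DP is below min_depth.

    The locus itself is always kept; only individual genotypes change.
    Hypothetical helper, not part of vcf_process.pl.
    """
    cols = vcf_line.rstrip("\n").split("\t")
    keys = cols[8].split(":")  # FORMAT column, e.g. GT:AD:DP
    if "GT" not in keys or "DP" not in keys:
        return "\t".join(cols)  # nothing to check, leave the line as-is
    gt_i, dp_i = keys.index("GT"), keys.index("DP")
    for i in range(9, len(cols)):
        fields = cols[i].split(":")
        # Trailing fields may be dropped in VCF, so guard the index.
        dp = fields[dp_i] if dp_i < len(fields) else "."
        # Assumption: a missing DP is treated like a failed depth check.
        if dp == "." or int(dp) < min_depth:
            fields[gt_i] = "./."
        cols[i] = ":".join(fields)
    return "\t".join(cols)


line = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP\t0/1:5,5:10\t1/1:0,2:2"
masked = mask_low_depth(line, 5)
```

Here the second sample (DP of 2) has its genotype replaced by `./.`, while the record and the first sample are left unchanged.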
#### Genotype manipulation
vcf_process.pl uses the non-reference allele depth ratio (NRADR, reads supporting the non-reference allele / all reads covering the site) to test whether the initial genotype call is accurate. Genotypes that fail these criteria can be re-genotyped or set as missing. This requires the AD field (support for the NR,NV tags generated by callers like Platypus has also been added, but is less tested).
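
As a rough illustration of the ratio itself, the following Python sketch derives NRADR from an AD string. The function name `nradr` and the handling of missing or zero-depth values are assumptions for this example, not the script's actual implementation.

```python
def nradr(ad):
    """Non-reference allele depth ratio from a VCF AD string such as "18,2".

    AD lists the reference depth first, followed by each alternative allele.
    Returns None when the ratio cannot be computed.
    Hypothetical helper, not part of vcf_process.pl.
    """
    parts = ad.split(",")
    if len(parts) < 2 or "." in parts:
        return None  # AD missing or incomplete
    depths = [int(x) for x in parts]
    total = sum(depths)
    if total == 0:
        return None  # no reads cover the site
    # Reads supporting any non-reference allele over all reads.
    return (total - depths[0]) / total


ratio = nradr("18,2")  # 2 non-reference reads out of 20
```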
* For homozygous samples, no heterozygous genotypes are expected, so NRADR should be near zero (reference homozygous) or near 100% (alternative homozygous). Allowing for sequencing errors, conservative cutoffs could be 5% and 95% for high-coverage data (above 20x)

        vcf_process.pl --vcf hc.vcf.gz --default-sample-type hom --regenotype-hom 0.05 \
            --gt-diff-as-missing > genotype.flt.vcf

* For heterozygous samples, two values are needed: one for homozygous genotypes (the same as used for homozygous samples) and another for heterozygous genotypes (usually expected near 50%, though newly arisen mutations could vary); 30%~70% may be acceptable for a reliable heterozygous call
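
The two sets of cutoffs described above can be sketched as a small decision function. This is only an illustration for biallelic sites under the example thresholds (5% and 30%~70%); the function name `expected_genotype` and its parameters are hypothetical, and the actual re-genotyping logic in vcf_process.pl may differ.

```python
def expected_genotype(ratio, sample_type, hom_cut=0.05, het_low=0.30, het_high=0.70):
    """Suggest a biallelic genotype from an NRADR value (0.0 to 1.0).

    sample_type is "hom" or "het". Returns None when the ratio falls
    outside every accepted range, i.e. the call looks unreliable.
    Hypothetical helper, not part of vcf_process.pl.
    """
    if ratio <= hom_cut:
        return "0/0"  # almost no non-reference reads
    if ratio >= 1 - hom_cut:
        return "1/1"  # almost all reads are non-reference
    # Heterozygous calls are only accepted for heterozygous samples.
    if sample_type == "het" and het_low <= ratio <= het_high:
        return "0/1"
    return None  # ambiguous: candidate for setting as missing


call_hom = expected_genotype(0.5, "hom")  # None: not expected in an inbred line
call_het = expected_genotype(0.5, "het")  # "0/1"
```

A genotype that disagrees with the suggestion, or for which the function returns `None`, corresponds to the kind of call the text describes as being re-genotyped or set as missing.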