Commit 6bce2d3

Update README.md
1 parent 576a5f2 commit 6bce2d3

1 file changed

+138
-13
lines changed

README.md

+138-13
@@ -33,8 +33,58 @@ Simply type "perl certain_script.pl" or "perl certain_script.pl -h" for details
3333
* Extract multiple sequences from the genome
3434

3535
fasta_process.pl --rows 0 1 2 --query regions.list --fasta genome.fasta --subset 1 2 > query_regions.fas
36-
37-
36+
37+
* Extract flanking sequence of each variant site and replace the nucleotide at the variant site with the alternative allele
38+
39+
perl -ne 'next if (/\#/); my @line = split /\t/; my $ex_start = $line[1]-75; my $ex_end = $line[1]+75;
40+
print "$line[0]\t$ex_start\t$ex_end\t76\t$line[3]\t$line[4]\n";' example.vcf | \
41+
fasta_process.pl --query - --fasta genome.fasta --rows 0 1 2 3 4 5 --subset 1 2 \
42+
--replace 3,4,5 > vars.ex75.replaced.fas
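The coordinate arithmetic in the perl one-liner above can be sketched on its own. This is a minimal awk equivalent, assuming tab-separated VCF lines on stdin; `vcf_to_flank_regions` is a hypothetical helper, not part of fasta_process.pl. The fourth column is the 1-based offset of the variant inside the extracted window (flank + 1, so 76 for a 75 bp flank), which is presumably the position `--replace` acts on.

```shell
# Hypothetical helper (not part of fasta_process.pl): awk equivalent of the
# perl one-liner above. Reads VCF lines on stdin and prints
# chrom, window start, window end, variant offset in window, REF, ALT.
# Unlike the perl version, only lines starting with "#" are skipped.
vcf_to_flank_regions() {
    awk -v flank="${1:-75}" 'BEGIN { FS = OFS = "\t" }
        /^#/ { next }                      # skip header lines
        { print $1, $2 - flank, $2 + flank, flank + 1, $4, $5 }'
}

# A variant at chr1:1000 gives the window 925..1075 with the variant at offset 76
printf '##fileformat=VCFv4.2\nchr1\t1000\t.\tA\tG\t50\tPASS\t.\n' | vcf_to_flank_regions
```

Note that a variant closer than the flank size to the start of a sequence yields a start coordinate below 1; neither the one-liner nor this sketch guards against that.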
43+
44+
* Count triplet contents
45+
46+
fasta_process.pl --fasta genome.fasta --count-nucl triplet > triplets.csv
47+
48+
* Count triplet contents in query sequences
49+
50+
fasta_process.pl --fasta genome.fasta --query query.regions \
51+
--rows 1 3 4 0 2 --subset 3 4 --count-nucl triplet > query.triplets.csv
52+
53+
54+
* Extract di/tri-nucleotide contents in tabular format
55+
56+
awk 'BEGIN{OFS="\t"} !/\#/ {print $1,$2-1,$2;}' example.vcf | \
57+
fasta_process.pl --query - --fasta genome.fasta \
58+
--rows 0 1 2 --subset 1 2 --out-format tabular | sed 's/\_/\t/g' > nt2-1.csv
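The awk step above turns each variant position into a two-base window (the variant site plus the base before it). A small sketch of that step with adjustable window edges follows; `vcf_to_context_regions` and its `left`/`right` arguments are hypothetical names used only for illustration, and the tri-nucleotide setting (one base on each side) is an assumption about how the same idea would extend.

```shell
# Hypothetical helper: print chrom, POS - left, POS + right for each VCF record.
# Defaults (left=1, right=0) reproduce the awk command above.
vcf_to_context_regions() {
    awk -v left="${1:-1}" -v right="${2:-0}" 'BEGIN { OFS = "\t" }
        !/^#/ { print $1, $2 - left, $2 + right }'
}

# Di-nucleotide window ending at the variant site: chr1 999 1000
printf 'chr1\t1000\t.\tA\tG\n' | vcf_to_context_regions
# Tri-nucleotide window centred on the variant site: chr1 999 1001
printf 'chr1\t1000\t.\tA\tG\n' | vcf_to_context_regions 1 1
```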
59+
60+
61+
* Translate nucleotides to proteins and remove final "*"
62+
63+
fasta_process.pl --fasta cds.fasta --translate --wordwrap 60 | sed 's/\*$//' > protein.fasta
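The `sed` step above only removes a `*` at the end of a line, which can be checked on a toy input. The sketch below uses made-up sequences; `strip_terminal_stop` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper around the sed expression used above:
# delete a "*" only when it is the last character of a line.
strip_terminal_stop() {
    sed 's/\*$//'
}

# p1 loses its terminal "*"; the internal "*" in p2 is left alone
printf '>p1\nMKT*\n>p2\nMK*TL\n' | strip_terminal_stop
```

Because the match is per line, an internal stop codon that happens to fall at the end of a wrapped line would be removed as well, so it may be worth checking sequences with internal stops separately.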
64+
65+
66+
* Split multiple-sequences file into multiple single-sequence files
67+
68+
fasta_process.pl --fasta multiple.fasta --split
69+
70+
71+
* Sort a fasta file by a user-defined order; the fasta file can also be read from a pipe
72+
73+
cat *.fasta | fasta_process.pl --fasta - --sort-by-list orders.list > sorted.fasta
74+
75+
* Filtering fasta file by length
76+
77+
fasta_process.pl --fasta example.fasta --lower 100 --upper 2000 > len100_2000.fasta
78+
79+
* Filtering fasta file by id
80+
81+
fasta_process.pl --fasta example.fasta --match "scaffold|contig" > chromosome.fasta
82+
83+
84+
**Note:** some options can be combined, but they are applied in a fixed priority order. For example, extracting and then sorting works in a single step, whereas sorting and then extracting does not; in such cases, break the task into two or more steps.
85+
86+
87+
3888
### convert_fastq_quality.pl
3989
> Convert fastq encodings
4090
@@ -49,6 +99,92 @@ Simply type "perl certain_script.pl" or "perl certain_script.pl -h" for details
4999
This script does quite a lot of things, including filtering, combining, clustering, etc.; it seems I put too many functions here ...
50100
However, since the VCF format varies between callers, this script was mainly tailored to VCF files generated by GATK (UnifiedGenotyper or HaplotypeCaller, http://www.broadinstitute.org/gatk/). Some functions require the AD (allele depth) field, so it may not perform very well on VCF files generated by other callers.
51101

102+
#### Filtering variants
103+
104+
* Filter by depth; this only marks samples whose depth fails the criteria as missing, and does not filter out the whole locus
105+
106+
vcf_process.pl --vcf example.vcf.gz --min-sample-depth 10 --max-sample-depth 80 > depth_flt.vcf
107+
108+
* Filter by depth and by the number of missing allele calls; depth is checked first, then all missing calls are counted, including those that failed the depth criteria
109+
110+
vcf_process.pl --vcf example.vcf.gz --min-sample-depth 10 --max-sample-depth 80 --max-missing 8 > flt.vcf
111+
112+
* Specify the depth for each sample in a file; the overall criteria still apply to any samples not listed in the file
113+
114+
vcf_process.pl --vcf example.vcf.gz --min-sample-depth 10 --max-sample-depth 80 \
115+
--depth-file sample_depth.txt > depth_flt.vcf
116+
117+
118+
* Specify some samples as naturally homozygous (e.g. inbred lines); all others are treated as heterozygous. This filters heterozygous sites in homozygous samples (denoted as "pseudo-heterozygosity" here, mostly arising from mapping errors due to duplications)
119+
120+
vcf_process.pl --vcf example.vcf.gz --homo-samples sample1 sample2 --max-pseudo-het 0 > flt.vcf
121+
122+
* Filter by reference/non-reference sample counts, distinguishing homozygous and heterozygous samples
123+
124+
vcf_process.pl --vcf example.vcf.gz --homo-samples sample1 sample2 \
125+
--min-hom-ref 5 --min-het-ref 4 --max-hom-missing 5 > flt.vcf
126+
127+
128+
**Note:** some filtering criteria are applied in a priority order, so do check the results after filtering!
129+
130+
131+
#### Genotype manipulation
132+
133+
vcf_process.pl uses the non-reference allele depth ratio (NRADR, reads supporting non-reference alleles / all reads covering the site) to test whether the initial genotyping was really accurate. Genotypes that fail these criteria can be re-genotyped or set as missing. This requires the AD field (support for the NR and NV tags generated by callers like Platypus was also added, but is less tested)
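As a rough illustration of the ratio itself, the sketch below computes NRADR from a single AD value, taking NRADR as non-reference reads over all reads (the reading under which a reference homozygous call gives a value near zero). The helper name `nradr` is hypothetical, and summing all alternate depths as "non-reference" is an assumption; the exact formula inside vcf_process.pl is not shown in this README.

```shell
# Hypothetical helper: NRADR from one AD value such as "18,2"
# (reference depth first, then one or more alternate depths).
# Assumption: every alternate read counts as non-reference.
nradr() {
    echo "$1" | awk -F',' '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i            # all reads covering the site
        if (total == 0) { print "NA"; next }             # no coverage, ratio undefined
        printf "%.3f\n", (total - $1) / total            # non-reference reads / all reads
    }'
}

nradr 18,2    # 2 of 20 reads are non-reference
nradr 30,0    # no non-reference reads
```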
134+
135+
* For homozygous samples, no heterozygous genotypes are expected, so NRADR should be near zero (reference homozygous) or near 100% (alternative homozygous). Allowing for sequencing errors, conservative cutoffs could be 5% and 95% for high coverage data (above 20x)
136+
137+
vcf_process.pl --vcf hc.vcf.gz --default-sample-type hom --regenotype-hom 0.05 \
138+
--gt-diff-as-missing > genotype.flt.vcf
139+
140+
* For heterozygous samples, we need two values: one for homozygous genotypes (the same as used for homozygous samples) and another for heterozygous genotypes (usually expected to be 50%, although newly arisen mutations could vary); 30%~70% may be OK for a reliable heterozygous call
141+
142+
vcf_process.pl --vcf hc.vcf.gz --regenotype-het 0.05,0.3 --gt-diff-as-missing > genotype.flt.vcf
143+
144+
* VCF files containing both homozygous and heterozygous samples
145+
146+
vcf_process.pl --vcf hc.vcf.gz --homo-samples sample1 sample2 --regenotype-hom 0.05 \
147+
--regenotype-het 0.05,0.3 --gt-diff-as-missing > genotype.flt.vcf
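One way to read the two thresholds given to `--regenotype-het` (0.05 and 0.3 above) is as cutoffs on the non-reference read fraction. The sketch below classifies a bi-allelic AD value on that reading. It is only an interpretation of the description above, not the logic of vcf_process.pl: the helper name `classify_ad`, the category labels, and the inclusive comparisons are all assumptions.

```shell
# Hypothetical helper: classify a bi-allelic AD value ("ref,alt").
# $2: homozygous tolerance (default 0.05); $3: heterozygous lower bound (default 0.3).
# Assumed reading: ratio <= hom -> hom-ref, ratio >= 1-hom -> hom-alt,
# het <= ratio <= 1-het -> het, anything else is ambiguous.
classify_ad() {
    echo "$1" | awk -F',' -v hom="${2:-0.05}" -v het="${3:-0.3}" '{
        total = $1 + $2
        if (total == 0) { print "missing"; next }        # no reads at this site
        r = $2 / total                                   # non-reference read fraction
        if (r <= hom)                      print "hom-ref"
        else if (r >= 1 - hom)             print "hom-alt"
        else if (r >= het && r <= 1 - het) print "het"
        else                               print "ambiguous"
    }'
}

classify_ad 30,0     # hom-ref
classify_ad 12,10    # het
classify_ad 17,3     # ambiguous (15% non-reference reads)
```

Calls in the "ambiguous" band are the ones that would presumably be re-genotyped or set as missing.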
148+
149+
150+
151+
#### Collect statistics and metrics
152+
153+
* Collect variant metrics, mainly designed for GATK callers
154+
155+
vcf_process.pl --vcf hc.vcf.gz --out-metrics \
156+
--metrics DP MQ MQ0 BaseQRankSum ClippingRankSum MQRankSum ReadPosRankSum InbreedingCoeff FS SOR \
157+
> hc.metrics.csv
158+
159+
* Metrics after filtering
160+
161+
vcf_process.pl --vcf hc.vcf.gz \
162+
--quality 50 --min-alleles 2 --max-alleles 2 --min-sample-depth 10 --max-missing 14 | \
163+
vcf_process.pl --vcf - --out-metrics \
164+
--metrics DP MQ MQ0 BaseQRankSum ClippingRankSum MQRankSum ReadPosRankSum InbreedingCoeff FS SOR \
165+
> hc.flt.metrics.csv
166+
167+
* Collect genotype information; requires the AD field
168+
169+
vcf_process.pl --vcf hc.vcf.gz \
170+
--quality 50 --stats-only --out-genotype-stats --ref-depth --var-type snp > hc.snp.gts.csv
171+
172+
* Generate statistics for each locus
173+
174+
vcf_process.pl --vcf snp.vcf.gz --stats-only --out-locus-stats > snp.locus_stats.csv
175+
176+
* Count base changes for all bi-allelic heterozygous sites
177+
178+
vcf_process.pl --vcf example.vcf.gz \
179+
--min-alleles 2 --max-alleles 2 --base-changes --GT-types "0/1" > het.changes.csv
180+
181+
* Generate statistics for distances between adjacent variants
182+
183+
vcf_process.pl --vcf snp.vcf.gz --stat-var-dist --source-tag GT > snp.dist.csv
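For intuition, the distance between adjacent variants can be sketched with plain awk on a position-sorted VCF. This is not the output format of `--stat-var-dist`, and it does not model `--source-tag`; the helper name `adjacent_distances` and the four-column output are assumptions made for illustration.

```shell
# Hypothetical helper: for a position-sorted VCF on stdin, print
# chrom, previous position, current position, distance
# for each pair of consecutive variants on the same chromosome.
adjacent_distances() {
    awk 'BEGIN { OFS = "\t" }
        /^#/ { next }                                    # skip header lines
        $1 == chrom { print $1, prev, $2, $2 - prev }    # same chromosome as previous record
        { chrom = $1; prev = $2 }                        # remember this record
    '
}

printf 'chr1\t100\nchr1\t250\nchr2\t50\nchr2\t60\n' | adjacent_distances
```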
184+
185+
186+
187+
52188

53189
#### Use vcf_process.pl to cluster markers (genetically linked regions)
54190

@@ -113,10 +249,6 @@ The "seeding-and-extension" algorithm was borrowed from "Wijnker, E. et al. The
113249
> Convert gff to tabular format
114250
115251

116-
### fasta2tabular.pl
117-
> Convert fasta to tabular format
118-
119-
120252
### intervals2bed
121253
> Convert intervals to bed format, e.g. chr01:1-1000 -> chr01 0 1000
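The conversion in the example (1-based inclusive interval to 0-based half-open bed) only shifts the start by one. A minimal awk sketch of that arithmetic follows; `interval_to_bed` is a hypothetical name and this is not the script's actual implementation.

```shell
# Hypothetical helper: convert "chrom:start-end" (1-based, inclusive)
# to bed columns "chrom  start-1  end" (0-based, half-open).
# Assumes sequence names contain neither ":" nor "-".
interval_to_bed() {
    awk -F'[:-]' 'BEGIN { OFS = "\t" } { print $1, $2 - 1, $3 }'
}

echo 'chr01:1-1000' | interval_to_bed    # chr01  0  1000
```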
122254
@@ -163,13 +295,6 @@ The "seeding-and-extension" algorithm was borrowed from "Wijnker, E. et al. The
163295
They could actually be integrated, so why 3 scripts? Because I forgot the previous one when I started writing a new one, and finally I got three ...
164296

165297

166-
### transNt2AA.pl
167-
> Simply translate nucleotide to proteins, require BioPerl
168-
169-
* Usage
170-
171-
transNt2AA.pl cds.fasta protein.fasta
172-
173298

174299
## Calculation
175300
