Commit b5c070a

Update README.md
1 parent 6bce2d3 commit b5c070a

1 file changed

+53
-4
lines changed

README.md

+53-4
@@ -17,6 +17,12 @@ Simply type "perl certain_script.pl" or "perl certain_script.pl -h" for details
1717
### fasta_process.pl
1818
> Query, extract, and process fasta sequences.
1919
20+
**Note:** some options can be combined, but they are applied in a fixed priority order. For example, extract and sort can be run in a single step, whereas sort followed by extract will not work; in such cases, break the task into two or more steps.
21+
22+
23+
24+
#### Extract sequences
25+
2026
* Query a single gene
2127

2228
echo "${gene_id}" | fasta_process.pl --fasta all.seq --query - --rows 0 > gene.fa
@@ -41,6 +47,8 @@ Simply type "perl certain_script.pl" or "perl certain_script.pl -h" for details
4147
fasta_process.pl --query - --fasta genome.fasta --rows 0 1 2 3 4 5 --subset 1 2 \
4248
--replace 3,4,5 > vars.ex75.replaced.fas
4349

50+
#### Compute sequence statistics
51+
4452
* Count triplet contents
4553

4654
fasta_process.pl --fasta genome.fasta --count-nucl triplet > triplets.csv
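As an illustration of what a triplet count involves, here is a minimal Python sketch. It is not the implementation used by fasta_process.pl: the function name `count_triplets` is hypothetical, and it assumes overlapping 3-base windows over an uppercased sequence, which may differ from how `--count-nucl triplet` defines a triplet.

```python
from collections import Counter


def count_triplets(seq):
    """Count overlapping 3-base windows in a nucleotide sequence.

    Assumption: windows overlap (step of 1) and case is ignored.
    """
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))


# Hypothetical toy sequence, not taken from genome.fasta.
counts = count_triplets("ATGATGC")
for triplet, n in sorted(counts.items()):
    print(f"{triplet},{n}")
```

If codon-style (non-overlapping, in-frame) counting is wanted instead, the range step would be 3 rather than 1.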
@@ -58,6 +66,8 @@ Simply type "perl certain_script.pl" or "perl certain_script.pl -h" for details
5866
--rows 0 1 2 --subset 1 2 --out-format tabular | sed 's/\_/\t/g' > nt2-1.csv
5967

6068

69+
#### Sequence manipulation
70+
6171
* Translate nucleotides to proteins and remove final "*"
6272

6373
fasta_process.pl --fasta cds.fasta --translate --wordwrap 60 | sed 's/\*$//' > protein.fasta
@@ -72,6 +82,12 @@ Simply type "perl certain_script.pl" or "perl certain_script.pl -h" for details
7282

7383
cat *.fasta | fasta_process.pl --fasta - --sort-by-list orders.list > sorted.fasta
7484

85+
* Reverse complement sequences
86+
87+
fasta_process.pl --fasta example.fasta --reverse --complement > rc.fasta
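For readers unfamiliar with the operation, reverse complementing can be sketched in a few lines of Python. This is only an illustration of the concept, not the code behind `--reverse --complement`; the helper name `reverse_complement` is hypothetical, and ambiguity codes other than A, C, G and T are left unchanged here.

```python
# Translation table for the four standard bases, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the whole sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```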
88+
89+
#### Filtering sequences
90+
7591
* Filter a fasta file by length
7692

7793
fasta_process.pl --fasta example.fasta --lower 100 --upper 2000 > len100_2000.fasta
@@ -81,8 +97,6 @@ Simply type "perl certain_script.pl" or "perl certain_script.pl -h" for details
8197
fasta_process.pl --fasta example.fasta --match "scaffold|contig" > chromosome.fasta
8298

8399

84-
**Note:** some options could be combined but have priority orders, for example extract and sort could be run in a single step, while sort and extract will not work; break it into two or more steps under these situations.
85-
86100

87101

88102
### convert_fastq_quality.pl
@@ -125,6 +139,11 @@ However, since the VCF format generated from different caller varies, this scrip
125139
--min-hom-ref 5 --min-het-ref 4 --max-hom-missing 5 > flt.vcf
126140

127141

142+
* Screen out rare alleles (alleles with a sample frequency below the specified value)
143+
144+
vcf_process.pl --vcf example.vcf.gz --rare-only 3 > rare.vcf
145+
146+
128147
**Note:** some filtering criteria are applied in a priority order, so do check the results after filtering!
129148

130149

@@ -148,7 +167,7 @@ vcf_process.pl use the non-reference allele depth ratio (NRADR, reads support re
148167

149168

150169

151-
#### Collect statistics and metrics
170+
#### Collecting statistics and metrics of variants
152171

153172
* Collect variant metrics (mainly designed for GATK callers)
154173

@@ -182,12 +201,25 @@ vcf_process.pl use the non-reference allele depth ratio (NRADR, reads support re
182201

183202
vcf_process.pl --vcf snp.vcf.gz --stat-var-dist --source-tag GT > snp.dist.csv
184203

204+
* Summarize results generated by GATK DiagnoseTargets (https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_diagnostics_diagnosetargets_DiagnoseTargets.php)
205+
206+
vcf_process.pl --vcf diagnose.vcf --sum-diagnose > diagnose.stats.csv
185207

208+
* Get variant sequence context (experimental)
186209

210+
vcf_process.pl --vcf snp.vcf.gz --check-context --fasta genome.fasta > snp.context.vcf
187211

212+
**Notes for context checking:**
188213

189-
#### Use vcf_process.pl to clustering markers (genetically linked regions)
214+
1) Only bi-allelic loci are supported when analyzing sequence context; multi-allelic loci need to be
215+
split first;
216+
2) The extension differs between SNPs and INDELs, e.g. 5 bp upstream and 5 bp downstream for SNPs,
217+
but only 10 bp downstream for INDELs, so the INDELs are assumed to be already left-aligned
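The extension rule in note 2 can be sketched as follows. This is a hedged illustration, not the logic of `--check-context`: the function `sequence_context` and its parameters are hypothetical, and it assumes that POS is 1-based, that the flanks are taken immediately around the REF allele, and that any record whose REF or ALT is longer than one base is an INDEL.

```python
def sequence_context(ref_seq, pos, ref, alt, snp_flank=5, indel_flank=10):
    """Return (upstream, downstream) reference bases around a variant.

    Assumptions: ref_seq is the chromosome sequence, pos is the 1-based
    VCF POS, and the record is bi-allelic.
    """
    start = pos - 1          # 0-based index of the first REF base
    end = start + len(ref)   # 0-based index just past the REF allele
    if len(ref) == 1 and len(alt) == 1:
        # SNP: take bases on both sides.
        upstream = ref_seq[max(0, start - snp_flank):start]
        downstream = ref_seq[end:end + snp_flank]
    else:
        # INDEL: only downstream bases, relying on left alignment.
        upstream = ""
        downstream = ref_seq[end:end + indel_flank]
    return upstream, downstream


# Hypothetical toy reference sequence.
chrom = "ACGTACGTACGTACGTACGT"
print(sequence_context(chrom, 6, "C", "T"))
print(sequence_context(chrom, 6, "CG", "C"))
```

Taking only downstream bases for INDELs makes sense under left alignment, because the ambiguity in where an INDEL can be placed then lies entirely to its right.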
190218

219+
220+
#### Clustering variants
221+
222+
Use vcf_process.pl to cluster markers (genetically linked regions).
191223
The clustering function identifies genome blocks from markers of a given type. It first searches for reliable seeds (segments of consecutive markers of the same type that pass the criteria; the "seeding" stage), then merges adjacent seeds of the same type to form blocks (the "extension" stage). The boundary between blocks of different types is determined from the markers present between the two blocks, or set to the midpoint when no markers are present.
192224
The "seeding-and-extension" algorithm was borrowed from "Wijnker, E. et al. The genomic landscape of meiotic crossovers and gene conversions in Arabidopsis thaliana. eLife 2, e01426 (2013)", where it was used to identify recombinant blocks.
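The two stages described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation in vcf_process.pl: the function names, the `min_markers` threshold and the marker tuples are hypothetical, a seed is simply a run of at least `min_markers` consecutive same-type markers, and boundaries always use the midpoint rather than inspecting markers between blocks.

```python
from itertools import groupby


def find_seeds(markers, min_markers=3):
    """Seeding: keep runs of consecutive same-type markers that are long enough.

    markers is a list of (position, type) tuples sorted by position.
    """
    seeds = []
    for mtype, run in groupby(markers, key=lambda m: m[1]):
        run = list(run)
        if len(run) >= min_markers:
            seeds.append((run[0][0], run[-1][0], mtype))
    return seeds


def extend_seeds(seeds):
    """Extension: merge adjacent seeds of the same type into blocks."""
    blocks = []
    for start, end, mtype in seeds:
        if blocks and blocks[-1][2] == mtype:
            blocks[-1] = (blocks[-1][0], end, mtype)
        else:
            blocks.append((start, end, mtype))
    return blocks


def midpoint_boundaries(blocks):
    """Place each boundary halfway between neighbouring blocks."""
    return [(left[1] + right[0]) // 2 for left, right in zip(blocks, blocks[1:])]


# Hypothetical markers: a lone "B" at 400 is too short to form a seed.
markers = [(100, "A"), (200, "A"), (300, "A"), (400, "B"),
           (500, "A"), (600, "A"), (700, "A"),
           (800, "B"), (900, "B"), (1000, "B")]
blocks = extend_seeds(find_seeds(markers))
print(blocks, midpoint_boundaries(blocks))
```

In this toy input the isolated "B" marker does not interrupt the surrounding "A" block, which is the point of requiring seeds before extending.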
193225

@@ -219,6 +251,23 @@ The "seeding-and-extension" algorithm was borrowed from "Wijnker, E. et al. The
219251
--colors "type1:strong_red2;B:strong_blue2" --sort-blocks sample-original sample --format png
220252

221253

254+
#### Combining vcf files
255+
256+
257+
* Combine two vcf files according to the "CHROM" and "POS" fields
258+
259+
vcf_process.pl --vcf hc.vcf --secondary-vcf ug.vcf --combine-rows 0 1 \
260+
--primary-tag HC --secondary-tag UG --intersect-tag "UG+HC" > combined.vcf
261+
262+
* Combine two vcf files according to the "CHROM", "POS" and "ALT" fields; if the "ALT" fields differ, the combined vcf file will contain two records
263+
264+
vcf_process.pl --vcf hc.vcf --secondary-vcf ug.vcf --combine-rows 0 1 4 \
265+
--primary-tag HC --secondary-tag UG --intersect-tag "UG+HC" > combined.vcf
266+
267+
* Combine two vcf files according to the "CHROM" and "POS" fields, but if the "ALT" fields differ, write the "ALT" info of the secondary file into the "SDIFF" field
268+
269+
vcf_process.pl --vcf hc.vcf --secondary-vcf ug.vcf --combine-rows 0 1 --compare-row 4 \
270+
--primary-tag HC --secondary-tag UG --intersect-tag "UG+HC" > combined.vcf
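The effect of choosing different key columns can be illustrated with a small Python sketch. It is not the code behind `--combine-rows`: the function `combine_records` is hypothetical, records are plain lists of VCF columns, output order is primary records first rather than sorted by position, and the "SDIFF" handling of `--compare-row` is not modelled.

```python
def combine_records(primary, secondary, key_cols=(0, 1),
                    primary_tag="HC", secondary_tag="UG",
                    intersect_tag="UG+HC"):
    """Tag each record by which input(s) contain its key columns."""
    def key(fields):
        return tuple(fields[i] for i in key_cols)

    primary_keys = {key(r) for r in primary}
    secondary_keys = {key(r) for r in secondary}

    combined = []
    for rec in primary:
        tag = intersect_tag if key(rec) in secondary_keys else primary_tag
        combined.append((rec, tag))
    for rec in secondary:
        # Records shared with the primary file were already emitted above.
        if key(rec) not in primary_keys:
            combined.append((rec, secondary_tag))
    return combined


# Hypothetical records: CHROM, POS, ID, REF, ALT.
hc = [["chr1", "100", ".", "A", "T"], ["chr1", "200", ".", "G", "C"]]
ug = [["chr1", "100", ".", "A", "G"], ["chr1", "300", ".", "T", "A"]]

by_pos = combine_records(hc, ug, key_cols=(0, 1))
by_pos_alt = combine_records(hc, ug, key_cols=(0, 1, 4))
print(len(by_pos), len(by_pos_alt))
```

With columns 0 and 1 as the key, the two chr1:100 records collapse into one intersect-tagged record; adding column 4 keeps them apart because their "ALT" values differ.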
222271

223272

224273
### fgenesh2gff.pl
