featurExtract is a Python package for bioinformatics that provides two command-line programs. The first, featurExtract, works on GFF/GTF annotation and provides subcommands including create, gene, promoter, utr, uorf, cds, dorf, exon, intron and igr; its help output lists the full set for the installed version. The create subcommand builds the annotation database that the other subcommands read. The promoter subcommand extracts promoter sequences, uorf extracts upstream open reading frame sequences, and utr extracts untranslated region sequences. The cds subcommand extracts coding sequences, and igr extracts the intergenic sequence between two neighbouring genes. The second program, genBankExtract, works on GenBank files and provides four subcommands: gene, CDS, rRNA and tRNA.
There are two ways to install the featurExtract module.
pip install featurExtract
# or install from source
git clone https://github.com/SitaoZ/featurExtract.git
cd featurExtract
python setup.py install
Requirements:
python >= 3.7.6
pandas >= 1.2.4
gffutils >= 0.10.1
setuptools >= 49.2.0
biopython >= 1.78
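To check an existing environment against these minimum versions, something like the following sketch may help. It is not part of featurExtract; the helper names (`version_tuple`, `at_least`, `check_requirements`) are hypothetical, the version parsing is deliberately simple, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata  # standard library from Python 3.8 onwards

# Minimum versions copied from the requirements list.
REQUIRED = {
    "pandas": "1.2.4",
    "gffutils": "0.10.1",
    "setuptools": "49.2.0",
    "biopython": "1.78",
}


def version_tuple(version):
    """Turn '1.2.4' into (1, 2, 4), keeping only leading digits of each part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def at_least(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(required=REQUIRED):
    """Return a {package: status} report for the current environment."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if at_least(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```

Pre-release suffixes such as `rc1` are ignored by this parser, so treat the report as a quick sanity check rather than a strict resolver.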
featurExtract is designed for GFF and GTF files,
and genBankExtract is suited for GenBank files.
# gff or gtf database
featurExtract -h
Program: featurExtract (tools for genomic feature extract)
Version: 0.2.6.0
Contact: Sitao Zhu <[email protected]>
Usage : featurExtract <command> [parameters]
Command:
    create        create GFF/GTF database
    stat          database statistics
    cds           extract CDS sequence
    dorf          extract dORF sequence
    exon          extract exon sequence
    gene          extract gene sequence
    intron        extract intron sequence
    igr           extract intergenic region
    mrna          extract mRNA sequence
    promoter      extract promoter sequence
    terminator    extract terminator sequence
    transcript    extract transcript sequence
    uorf          extract uORF sequence
    utr           extract 5/3UTR sequence
- create
featurExtract create -h
usage: featurExtract create [-h] -g GENOMEFEATURE -o OUTPUT -p PREFIX
[-s {gff,gtf}]
optional arguments:
-h, --help show this help message and exit
-g GENOMEFEATURE, --genomefeature GENOMEFEATURE
genome annotation file, gff or gtf
-o OUTPUT, --output OUTPUT
database output dir path
-p PREFIX, --prefix PREFIX
database prefix
-s {gff,gtf}, --style {gff,gtf}
genome annotation file format
- stat
featurExtract stat -h
usage: featurExtract stat [-h] -d DATABASE -g GENOME -o OUTPUT [-s {gff,gtf}]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database created from creat command
-g GENOME, --genome GENOME
genome fasta path
-o OUTPUT, --output OUTPUT
stat output
-s {gff,gtf}, --style {gff,gtf}
genome annotation file format
- cds
featurExtract cds -h
usage: featurExtract cds [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
[-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
[-r {mrna,all}] [-s {gff,gtf}] [-v]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database generated by subcommand create
-f {csv,fasta,gff}, --output_format {csv,fasta,gff}
output format
-g GENOME, --genome GENOME
genome fasta
-i TRANSCRIPT, --transcript TRANSCRIPT
specific transcript (optional); if not given, return
whole transcripts
-o OUTPUT, --output OUTPUT
output file path
-p PROCESS, --process PROCESS
number of cds extract process, (default: 4)
-r {mrna,all}, --rna_feature {mrna,all}
The type of RNA for extract cds, (default: mrna)
-s {gff,gtf}, --style {gff,gtf}
gtf database or gff database
-v, --print output to stdout. -v and -o option are mutually
exclusive
- dorf
featurExtract dorf -h
usage: featurExtract dorf [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
[-i TRANSCRIPT] [-l LENGTH] [-m] [-n] [-o OUTPUT]
[-p PROCESS] [-r {mrna,all}] [-s {gff,gtf}] [-v]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database generated by subcommand create
-f {csv,fasta,gff}, --output_format {csv,fasta,gff}
output format
-g GENOME, --genome GENOME
genome fasta
-i TRANSCRIPT, --transcript TRANSCRIPT
specific transcript (optional); if not given, return
whole transcripts
-l LENGTH, --length LENGTH
dorf length, (default: 6)
-m, --schematic_without_intron
schematic figure file for dorf, cds and transcript
without intron
-n, --schematic_with_intron
schematic figure file for dorf, cds and transcript
with intron
-o OUTPUT, --output OUTPUT
output file path
-p PROCESS, --process PROCESS
number of dorf extract process, (default: 4)
-r {mrna,all}, --rna_feature {mrna,all}
The type of RNA for dorf extraction (default: mrna)
-s {gff,gtf}, --style {gff,gtf}
gtf database or gff database
-v, --print output to stdout. -v and -o option are mutually
exclusive
- exon
featurExtract exon -h
usage: featurExtract exon [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
[-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
[-r {mrna,all}] [-s {gff,gtf}] [-v]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database generated by subcommand create
-f {csv,fasta,gff}, --output_format {csv,fasta,gff}
output format
-g GENOME, --genome GENOME
genome fasta
-i TRANSCRIPT, --transcript TRANSCRIPT
specific transcript (optional); if not given, return
whole transcripts
-o OUTPUT, --output OUTPUT
output file path
-p PROCESS, --process PROCESS
number of exon extract process, (default: 4)
-r {mrna,all}, --rna_feature {mrna,all}
The type of RNA for exon extraction (default: mrna)
-s {gff,gtf}, --style {gff,gtf}
gtf database or gff database
-v, --print output to stdout
- gene
featurExtract gene -h
usage: featurExtract gene [-h] -d DATABASE [-f {csv,fasta,gff,gtf}] -g GENOME
[-i GENE] [-o OUTPUT] [-p] [-s {gff,gtf}]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database generated by subcommand create
-f {csv,fasta,gff,gtf}, --output_format {csv,fasta,gff,gtf}
output format
-g GENOME, --genome GENOME
genome fasta
-i GENE, --gene GENE specific gene (optional); if not given, return whole
genes
-o OUTPUT, --output OUTPUT
output file path
-p, --print output to stdout
-s {gff,gtf}, --style {gff,gtf}
gtf database or gff database
- intron
featurExtract intron -h
usage: featurExtract intron [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
[-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
[-r {mrna,all}] [-s {gff,gtf}] [-v]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database generated by subcommand create
-f {csv,fasta,gff}, --output_format {csv,fasta,gff}
output format
-g GENOME, --genome GENOME
genome fasta
-i TRANSCRIPT, --transcript TRANSCRIPT
specific transcript (optional); if not given, return
whole transcripts
-o OUTPUT, --output OUTPUT
output file path
-p PROCESS, --process PROCESS
number of exon extract process, (default: 4)
-r {mrna,all}, --rna_feature {mrna,all}
The type of RNA for intron extraction (default: mrna)
-s {gff,gtf}, --style {gff,gtf}
gtf database or gff database
-v, --print output to stdout
- igr
featurExtract igr -h
usage: featurExtract igr [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
[-l IGR_LENGTH] [-o OUTPUT] [-p PROCESS]
[-s {gff,gtf}] [-v]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database generated by subcommand create
-f {csv,fasta,gff}, --output_format {csv,fasta,gff}
output format
-g GENOME, --genome GENOME
genome fasta
-l IGR_LENGTH, --igr_length IGR_LENGTH
igr length threshold
-o OUTPUT, --output OUTPUT
output fasta file path
-p PROCESS, --process PROCESS
number of igr extract process, (default: 4)
-s {gff,gtf}, --style {gff,gtf}
gtf database only contain protein genes, while gff
database contain protein genes and nocoding genes
-v, --print output to stdout
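To make the idea of an intergenic region concrete, here is a minimal sketch that derives the gaps between genes on one chromosome. It illustrates the concept only and is not featurExtract's implementation. The function name `intergenic_regions` is hypothetical, coordinates are assumed to be 1-based and inclusive as in GFF/GTF, and the length threshold is assumed to be a minimum, which the help text does not specify.

```python
def intergenic_regions(genes, min_length=1):
    """Return (start, end) gaps between genes on a single chromosome.

    genes      -- iterable of (start, end) tuples, 1-based inclusive
    min_length -- assumed meaning of the threshold: keep gaps at least this long
    """
    regions = []
    prev_end = None  # right-most gene end seen so far
    for start, end in sorted(genes):
        # A gap exists only if at least one base separates the two genes.
        if prev_end is not None and start > prev_end + 1:
            gap_start, gap_end = prev_end + 1, start - 1
            if gap_end - gap_start + 1 >= min_length:
                regions.append((gap_start, gap_end))
        # Overlapping or nested genes extend the covered interval.
        prev_end = end if prev_end is None else max(prev_end, end)
    return regions


if __name__ == "__main__":
    example = [(100, 200), (150, 300), (500, 600)]
    print(intergenic_regions(example))
```

Overlapping genes are merged before gaps are reported, so a gene nested inside another never produces a spurious region.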
- mrna
featurExtract mrna -h
usage: featurExtract mrna [-h] -d DATABASE [-f {csv,fasta}] -g GENOME
[-i TRANSCRIPT] [-o OUTPUT] [-p] [-s {gff,gtf}] [-u]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database generated by subcommand create
-f {csv,fasta}, --output_format {csv,fasta}
output format
-g GENOME, --genome GENOME
genome fasta
-i TRANSCRIPT, --transcript TRANSCRIPT
specific transcript (optional); if not given, return
whole transcripts
-o OUTPUT, --output OUTPUT
output file path
-p, --print output to stdout
-s {gff,gtf}, --style {gff,gtf}
gtf database or gff database
-u, --upper upper cds and lower utr
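The `-u` option is described as "upper cds and lower utr". One plausible reading, sketched below, is that the coding sequence is written in uppercase and the flanking UTRs in lowercase so the CDS boundaries are visible in the output. The function name `format_mrna` is hypothetical and the exact output of the tool may differ.

```python
def format_mrna(utr5, cds, utr3):
    """Join the three parts of an mRNA, marking the CDS by letter case."""
    return utr5.lower() + cds.upper() + utr3.lower()


if __name__ == "__main__":
    # Lowercase flanks are UTRs, the uppercase core is the CDS.
    print(format_mrna("acGT", "atgaaatag", "TTcc"))
```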
- promoter
featurExtract promoter -h
usage: featurExtract promoter [-h] -d DATABASE [-f {csv,fasta}] -g GENOME
[-i GENE] [-l PROMOTER_LENGTH] [-o OUTPUT]
[-p PROCESS] [-u UTR5_UPPER_LENGTH] [-v]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database generated by subcommand create
-f {csv,fasta}, --output_format {csv,fasta}
output format
-g GENOME, --genome GENOME
genome fasta path
-i GENE, --gene GENE specific gene (optional); if not given, return whole
genes
-l PROMOTER_LENGTH, --promoter_length PROMOTER_LENGTH
promoter length before TSS (default: 100)
-o OUTPUT, --output OUTPUT
output file path
-p PROCESS, --process PROCESS
number of promoter extract process, (default: 4)
-u UTR5_UPPER_LENGTH, --utr5_upper_length UTR5_UPPER_LENGTH
5' utr length after TSS (default: 10)
-v, --print output to stdout
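The two length options define a window around the transcription start site (TSS): `-l` bases before it and `-u` bases after it. The sketch below shows how such a window could be computed on either strand. It is an illustration under stated assumptions, not the package's code: `promoter_window` is a hypothetical name, coordinates are 1-based inclusive, the TSS is assumed to be the first base of the gene, and the TSS base itself is counted in the downstream part.

```python
def promoter_window(tss, strand, promoter_length=100, utr5_upper_length=10):
    """Return (start, end) genome coordinates of the promoter window.

    On the '-' strand, upstream of the TSS means higher coordinates, and the
    extracted sequence would need to be reverse-complemented afterwards.
    """
    if strand == "+":
        start = tss - promoter_length
        end = tss + utr5_upper_length - 1
    elif strand == "-":
        start = tss - utr5_upper_length + 1
        end = tss + promoter_length
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp at the chromosome start; the chromosome end is not known here.
    return max(1, start), end


if __name__ == "__main__":
    print(promoter_window(1000, "+"))
    print(promoter_window(1000, "-"))
```

With the defaults from the help text (100 and 10) each window spans 110 bases unless it is clamped at the start of the chromosome.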
- terminator
featurExtract terminator -h
usage: featurExtract terminator [-h] -d DATABASE [-f {csv,fasta}] -g GENOME
[-i GENE] [-l TERMINATOR_LENGTH] [-o OUTPUT]
[-u UTR3_LOWER_LENGTH] [-v]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database generated by subcommand create
-f {csv,fasta}, --output_format {csv,fasta}
output format
-g GENOME, --genome GENOME
genome fasta path
-i GENE, --gene GENE specific gene (optional); if not given, return whole
genes
-l TERMINATOR_LENGTH, --terminator_length TERMINATOR_LENGTH
terminator length (default: 100)
-o OUTPUT, --output OUTPUT
output file path
-u UTR3_LOWER_LENGTH, --utr3_lower_length UTR3_LOWER_LENGTH
3' length (default: 10)
-v, --print output to stdout
- transcript
featurExtract transcript -h
usage: featurExtract transcript [-h] -d DATABASE [-f {csv,fasta}] -g GENOME
[-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
[-r {mrna,all}] [-s {gff,gtf}] [-u] [-v]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database generated by subcommand create
-f {csv,fasta}, --output_format {csv,fasta}
output format
-g GENOME, --genome GENOME
genome fasta
-i TRANSCRIPT, --transcript TRANSCRIPT
specific transcript (optional); if not given, return
whole transcripts
-o OUTPUT, --output OUTPUT
output file path
-p PROCESS, --process PROCESS
number of cDNA extract process, (default: 4)
-r {mrna,all}, --rna_feature {mrna,all}
The type of RNA for extract transcript, (default:
mrna)
-s {gff,gtf}, --style {gff,gtf}
gtf database or gff database
-u, --upper upper cds and lower utr
-v, --print output to stdout
- uorf
featurExtract uorf -h
usage: featurExtract uorf [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
[-i TRANSCRIPT] [-l LENGTH] [-m] [-n] [-o OUTPUT]
[-p PROCESS] [-r {mrna,all}] [-s {gff,gtf}] [-v]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database generated by subcommand create
-f {csv,fasta,gff}, --output_format {csv,fasta,gff}
output format (default: csv)
-g GENOME, --genome GENOME
genome fasta
-i TRANSCRIPT, --transcript TRANSCRIPT
specific transcript (optional); if not given, return
whole transcripts
-l LENGTH, --length LENGTH
uorf length, (default: 6)
-m, --schematic_without_intron
schematic figure file for uorf, cds and transcript
without intron
-n, --schematic_with_intron
schematic figure file for uorf, cds and transcript
with intron
-o OUTPUT, --output OUTPUT
output file path
-p PROCESS, --process PROCESS
number of uorf extract process, (default: 4)
-r {mrna,all}, --rna_feature {mrna,all}
The type of RNA for uorf extraction (default: mrna)
-s {gff,gtf}, --style {gff,gtf}
gtf database or gff database
-v, --print output to stdout. -v and -o option are mutually
exclusive
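For readers unfamiliar with uORFs, the following sketch scans a 5' UTR sequence for open reading frames: an ATG followed in frame by a stop codon. It is a simplified illustration and not the algorithm used by featurExtract. The name `find_orfs` is hypothetical, the minimum length is assumed to be in nucleotides including the stop codon, and uORFs that run past the end of the UTR into the CDS are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_length=6):
    """Return (start, end, orf_sequence) tuples, 0-based and end-exclusive."""
    seq = sequence.upper()
    orfs = []
    start = seq.find("ATG")
    while start != -1:
        # Walk codon by codon from the start codon to the first in-frame stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if end - start >= min_length:
                    orfs.append((start, end, seq[start:end]))
                break
        # Continue with the next ATG, which may be in another frame.
        start = seq.find("ATG", start + 1)
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("ccATGAAATAGgg"):
        print(orf)
```

The same scan applied to a 3' UTR sequence would give a rough picture of what the dorf subcommand looks for.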
- utr
featurExtract utr -h
usage: featurExtract utr [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
[-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
[-r {mrna,all}] [-s {gff,gtf}] [-v]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
database generated by subcommand create
-f {csv,fasta,gff}, --output_format {csv,fasta,gff}
output format (default: csv)
-g GENOME, --genome GENOME
genome fasta file
-i TRANSCRIPT, --transcript TRANSCRIPT
specific transcript (optional); if not given, return
whole transcripts
-o OUTPUT, --output OUTPUT
output file path
-p PROCESS, --process PROCESS
number of utr extract process, (default: 4)
-r {mrna,all}, --rna_feature {mrna,all}
The type of RNA for extract utr, (default: mrna)
-s {gff,gtf}, --style {gff,gtf}
gtf database or gff database
-v, --print output to stdout. -v and -o option are mutually
exclusive
# GenBank database
which genBankExtract
genBankExtract -h
genBankExtract gene -h
genBankExtract CDS -h
genBankExtract rRNA -h
genBankExtract tRNA -h
# step 1 create database
time featurExtract create -s gtf -g Araport11_GTF_genes_transposons.Mar202021.gtf -o test/ -p ath
# step 2 extract features from the database
# cds
time featurExtract cds -d test/ath.GFF -g ath_chr.fa -r all -f csv -p 1 -o test/zhusitao_cds3.csv
time featurExtract cds -d test/ath.GFF -g ath_chr.fa -r all -f fasta -p 1 -o test/zhusitao_cds3.fa
time featurExtract cds -d test/ath.GFF -g ath_chr.fa -r all -f gff -p 1 -o test/zhusitao_cds3.gff
# transcript
time featurExtract transcript -d test/ath.GFF -g ath_chr.fa -r all -f csv -p 1 -o test/zhusitao_transcript.csv
# promoter
time featurExtract promoter -d test/ath.GFF -f csv -g ath_chr.fa -l 10 -o test/zhusitao_promoter.csv -u 0 -i AT1G01010 -v
# terminator
time featurExtract terminator -d test/ath.GFF -f csv -g ath_chr.fa -l 10 -o test/zhusitao_terminator.csv -u 0 -i AT1G01010 -v
# exon
time featurExtract exon -d test/ath.GFF -f fasta -g ath_chr.fa -o test/zhusitao_exon.fa -s gff
# intron
time featurExtract intron -d test/ath.GFF -f fasta -g ath_chr.fa -o test/zhusitao_intron.fa -s gff
# uorf
time featurExtract uorf -d test/ath.GFF -g ath_chr.fa -l 1 -r all -f csv -o test/zhusitao_uORF.csv -s gff
# dorf
time featurExtract dorf -d test/ath.GFF -g ath_chr.fa -l 1 -r all -f csv -o test/zhusitao_dorf.csv -s gff
# step 3 GenBank extraction
genBankExtract gene -g NC_000932.gb -f dna -p
genBankExtract CDS -g NC_000932.gb -f dna -p
genBankExtract rRNA -g NC_000932.gb -f dna -p
genBankExtract tRNA -g NC_000932.gb -f dna -p
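The create-then-extract steps can also be driven from a Python script. The sketch below only assembles and runs the command lines using flags shown in the help output; the helper names (`create_cmd`, `cds_cmd`, `run_pipeline`) are hypothetical, and the file names, including the database path `test/ath.GFF`, are taken from the examples rather than verified.

```python
import shutil
import subprocess


def create_cmd(annotation, out_dir, prefix, style="gtf"):
    """Argument list for 'featurExtract create'."""
    return ["featurExtract", "create", "-s", style,
            "-g", annotation, "-o", out_dir, "-p", prefix]


def cds_cmd(database, genome, output, fmt="csv", processes=4, rna_feature="mrna"):
    """Argument list for 'featurExtract cds'."""
    return ["featurExtract", "cds", "-d", database, "-g", genome,
            "-r", rna_feature, "-f", fmt, "-p", str(processes), "-o", output]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    if shutil.which("featurExtract") is None:
        raise RuntimeError("featurExtract is not on PATH; install it first")
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    steps = [
        create_cmd("Araport11_GTF_genes_transposons.Mar202021.gtf", "test/", "ath"),
        cds_cmd("test/ath.GFF", "ath_chr.fa", "test/zhusitao_cds3.csv",
                processes=1, rna_feature="all"),
    ]
    run_pipeline(steps)
```

Passing the arguments as a list avoids shell quoting problems with file names, and `check=True` makes a failed create step abort the run before any extraction starts.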