Overview

featurExtract is a Python package for bioinformatics that provides two command-line programs. The first, featurExtract, includes ten subroutines: create, gene, promoter, UTR, uORF, CDS, dORF, exon, intron and intergenic. The create subroutine builds a database from a genome annotation file, and the other subroutines extract sequences using that database: promoter extracts promoter sequences, uORF extracts upstream open reading frame sequences, UTR extracts untranslated region sequences, CDS extracts coding sequences, and intergenic extracts the intergenic sequences between two genes. The second program, genBankExtract, includes four subroutines: gene, CDS, rRNA and tRNA.

Brief introduction of featurExtract package

Install

There are two ways to install the featurExtract module.

install command line

pip install featurExtract
# or install from source
git clone https://github.com/SitaoZ/featurExtract.git
cd featurExtract
python setup.py install

Requirements

python >= 3.7.6
pandas >= 1.2.4
gffutils >= 0.10.1
setuptools >= 49.2.0
biopython >= 1.78
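
Before installing, it can help to check whether the current environment already satisfies the minimum versions listed above. The following is a minimal sketch, not part of featurExtract: the helper names `version_tuple`, `meets_minimum` and `check_requirements` are hypothetical, and it assumes Python 3.8 or later, where `importlib.metadata` is in the standard library.

```python
from importlib import metadata  # standard library from Python 3.8 on

# Minimum versions copied from the Requirements list.
MINIMUMS = {
    "pandas": "1.2.4",
    "gffutils": "0.10.1",
    "setuptools": "49.2.0",
    "biopython": "1.78",
}


def version_tuple(text):
    """Turn '1.2.4' into (1, 2, 4), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted version strings numerically, padding with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums=MINIMUMS):
    """Return {package: (installed_version_or_None, ok)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_requirements().items():
        print(name, installed, "OK" if ok else "missing or too old")
```

The comparison is deliberately simple and ignores pre-release suffixes such as `rc1`.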

Usage

featurExtract is designed for GFF and GTF files,
and genBankExtract is suited for GenBank files.

featurExtract

# gff or gtf database

featurExtract -h
Program:  featurExtract (tools for genomic feature extract)
Version:  0.2.6.0
Contact:  Sitao Zhu <[email protected]>
Usage  :  featurExtract <command> [parameters] 
Command: 
          create        create GFF/GTF database
          stat          database statistics
          cds           extract CDS sequence
          dorf          extract dORF sequence
          exon          extract exon sequence
          gene          extract gene sequence
          intron        extract intron sequence
          igr           extract intergenic region
          mrna          extract mRNA sequence
          promoter      extract promoter sequence
          terminator    extract terminator sequence
          transcript    extract transcript sequence
          uorf          extract uORF sequence
          utr           extract 5/3UTR sequence

create

featurExtract create -h
usage: featurExtract create [-h] -g GENOMEFEATURE -o OUTPUT -p PREFIX
                            [-s {gff,gtf}]

optional arguments:
  -h, --help            show this help message and exit
  -g GENOMEFEATURE, --genomefeature GENOMEFEATURE
                        genome annotation file, gff or gtf
  -o OUTPUT, --output OUTPUT
                        database output dir path
  -p PREFIX, --prefix PREFIX
                        database prefix
  -s {gff,gtf}, --style {gff,gtf}
                        genome annotation file format

stat

featurExtract stat -h
usage: featurExtract stat [-h] -d DATABASE -g GENOME -o OUTPUT [-s {gff,gtf}]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database created from creat command
  -g GENOME, --genome GENOME
                        genome fasta path
  -o OUTPUT, --output OUTPUT
                        stat output
  -s {gff,gtf}, --style {gff,gtf}
                        genome annotation file format

cds

featurExtract cds -h
usage: featurExtract cds [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                         [-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
                         [-r {mrna,all}] [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of cds extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for extract cds, (default: mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -v, --print           output to stdout. -v and -o option are mutually
                        exclusive
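
To illustrate what extracting a CDS involves, here is a minimal sketch that joins CDS segments from a chromosome sequence and reverse-complements the result for minus-strand transcripts. It is not featurExtract's implementation; the function names and the 1-based inclusive coordinates (GFF style) are assumptions made for the example.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string, preserving letter case."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def splice_cds(chrom_seq, segments, strand):
    """Join CDS segments into one coding sequence.

    segments: list of (start, end) pairs, 1-based and inclusive.
    strand: '+' or '-'.
    """
    ordered = sorted(segments)
    # Convert 1-based inclusive coordinates to Python slices.
    joined = "".join(chrom_seq[start - 1:end] for start, end in ordered)
    return reverse_complement(joined) if strand == "-" else joined


if __name__ == "__main__":
    chrom = "GGATGGCTTTTAAATAACC"
    print(splice_cds(chrom, [(3, 8), (12, 17)], "+"))  # ATGGCTAAATAA
    print(splice_cds(chrom, [(3, 8), (12, 17)], "-"))  # TTATTTAGCCAT
```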

dorf

featurExtract dorf -h 
usage: featurExtract dorf [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                          [-i TRANSCRIPT] [-l LENGTH] [-m] [-n] [-o OUTPUT]
                          [-p PROCESS] [-r {mrna,all}] [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -l LENGTH, --length LENGTH
                        dorf length, (default: 6)
  -m, --schematic_without_intron
                        schematic figure file for dorf, cds and transcript
                        without intron
  -n, --schematic_with_intron
                        schematic figure file for dorf, cds and transcript
                        with intron
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of dorf extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for dorf extraction (default: mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -v, --print           output to stdout. -v and -o option are mutually
                        exclusive

exon

featurExtract exon -h 
usage: featurExtract exon [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                          [-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
                          [-r {mrna,all}] [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of exon extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for exon extraction (default: mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -v, --print           output to stdout

gene

featurExtract gene -h 
usage: featurExtract gene [-h] -d DATABASE [-f {csv,fasta,gff,gtf}] -g GENOME
                          [-i GENE] [-o OUTPUT] [-p] [-s {gff,gtf}]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff,gtf}, --output_format {csv,fasta,gff,gtf}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i GENE, --gene GENE  specific gene (optional); if not given, return whole
                        genes
  -o OUTPUT, --output OUTPUT
                        output file path
  -p, --print           output to stdout
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database

intron

featurExtract intron -h 
usage: featurExtract intron [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                            [-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
                            [-r {mrna,all}] [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of exon extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for intron extraction (default: mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -v, --print           output to stdout
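
Annotation files usually list exons rather than introns, so intron coordinates are typically derived as the gaps between consecutive exons of one transcript. The sketch below shows that idea under the assumption of 1-based inclusive coordinates; the function name is hypothetical and this is not necessarily how featurExtract computes introns.

```python
def introns_from_exons(exons):
    """Return intron (start, end) pairs lying between consecutive exons.

    exons: list of (start, end) pairs for a single transcript,
    1-based and inclusive, in any order.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only report a gap if at least one base separates the two exons.
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


if __name__ == "__main__":
    print(introns_from_exons([(1, 10), (21, 30), (41, 50)]))
    # [(11, 20), (31, 40)]
```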

igr

featurExtract igr -h 
usage: featurExtract igr [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                         [-l IGR_LENGTH] [-o OUTPUT] [-p PROCESS]
                         [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -l IGR_LENGTH, --igr_length IGR_LENGTH
                        igr length threshold
  -o OUTPUT, --output OUTPUT
                        output fasta file path
  -p PROCESS, --process PROCESS
                        number of igr extract process, (default: 4)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database only contain protein genes, while gff
                        database contain protein genes and nocoding genes
  -v, --print           output to stdout
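
An intergenic region is the stretch of a chromosome between neighbouring genes. The following sketch computes such gaps from gene intervals on one chromosome and merges overlapping genes on the way. It is an illustration only: the function name is hypothetical, and treating the length threshold as a minimum length is an assumption, since the help text does not say how `-l` is applied.

```python
def intergenic_regions(genes, min_length=1):
    """Return (start, end) gaps between genes on one chromosome.

    genes: list of (start, end) pairs, 1-based and inclusive.
    min_length: assumed meaning of the threshold, keep gaps at least this long.
    """
    regions = []
    covered_end = None  # right-most position covered by genes seen so far
    for start, end in sorted(genes):
        if covered_end is not None and start > covered_end + 1:
            gap_start, gap_end = covered_end + 1, start - 1
            if gap_end - gap_start + 1 >= min_length:
                regions.append((gap_start, gap_end))
        covered_end = end if covered_end is None else max(covered_end, end)
    return regions


if __name__ == "__main__":
    genes = [(100, 200), (150, 300), (500, 600), (605, 700)]
    print(intergenic_regions(genes))                 # [(301, 499), (601, 604)]
    print(intergenic_regions(genes, min_length=10))  # [(301, 499)]
```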

mrna

featurExtract mrna -h 
usage: featurExtract mrna [-h] -d DATABASE [-f {csv,fasta}] -g GENOME
                          [-i TRANSCRIPT] [-o OUTPUT] [-p] [-s {gff,gtf}] [-u]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta}, --output_format {csv,fasta}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -o OUTPUT, --output OUTPUT
                        output file path
  -p, --print           output to stdout
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -u, --upper           upper cds and lower utr
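
The `-u` option is described as "upper cds and lower utr". A minimal sketch of that kind of case marking, assuming the CDS position within the spliced transcript is already known, could look like this. The function name and the 0-based half-open indices are assumptions for the example, not the package's API.

```python
def case_by_region(transcript, cds_start, cds_end):
    """Lower-case the UTRs and upper-case the CDS of a spliced transcript.

    cds_start, cds_end: 0-based, half-open indices of the CDS
    within the transcript string.
    """
    utr5 = transcript[:cds_start].lower()
    cds = transcript[cds_start:cds_end].upper()
    utr3 = transcript[cds_end:].lower()
    return utr5 + cds + utr3


if __name__ == "__main__":
    print(case_by_region("GGATGGCTTAACC", 2, 11))  # ggATGGCTTAAcc
```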

promoter

featurExtract promoter -h 
usage: featurExtract promoter [-h] -d DATABASE [-f {csv,fasta}] -g GENOME
                              [-i GENE] [-l PROMOTER_LENGTH] [-o OUTPUT]
                              [-p PROCESS] [-u UTR5_UPPER_LENGTH] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta}, --output_format {csv,fasta}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta path
  -i GENE, --gene GENE  specific gene (optional); if not given, return whole
                        genes
  -l PROMOTER_LENGTH, --promoter_length PROMOTER_LENGTH
                        promoter length before TSS (default: 100)
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of promoter extract process, (default: 4)
  -u UTR5_UPPER_LENGTH, --utr5_upper_length UTR5_UPPER_LENGTH
                        5' utr length after TSS (default: 10)
  -v, --print           output to stdout
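
The two length options define a window around the transcription start site (TSS): `-l` bases before it and `-u` bases after it. The sketch below shows one way such a window could be computed on either strand. It is a sketch under stated assumptions, not featurExtract's code: the function name is hypothetical, coordinates are 1-based inclusive, the TSS is taken as the gene start on the plus strand and the gene end on the minus strand, and the TSS base itself is counted as the first of the `-u` bases.

```python
def promoter_window(gene_start, gene_end, strand,
                    promoter_length=100, utr5_upper_length=10):
    """Return the (start, end) genomic window of a promoter.

    Defaults mirror the documented defaults of -l (100) and -u (10).
    The sequence of a minus-strand window would still need to be
    reverse-complemented after slicing.
    """
    if strand == "+":
        tss = gene_start
        start = max(1, tss - promoter_length)  # do not run off the chromosome start
        end = tss + utr5_upper_length - 1
    else:
        tss = gene_end
        start = tss - utr5_upper_length + 1
        end = tss + promoter_length  # caller should clamp to chromosome length
    return start, end


if __name__ == "__main__":
    print(promoter_window(1000, 2000, "+"))  # (900, 1009)
    print(promoter_window(1000, 2000, "-"))  # (1991, 2100)
```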

terminator

featurExtract terminator -h 
usage: featurExtract terminator [-h] -d DATABASE [-f {csv,fasta}] -g GENOME
                                [-i GENE] [-l TERMINATOR_LENGTH] [-o OUTPUT]
                                [-u UTR3_LOWER_LENGTH] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta}, --output_format {csv,fasta}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta path
  -i GENE, --gene GENE  specific gene (optional); if not given, return whole
                        genes
  -l TERMINATOR_LENGTH, --terminator_length TERMINATOR_LENGTH
                        terminator length (default: 100)
  -o OUTPUT, --output OUTPUT
                        output file path
  -u UTR3_LOWER_LENGTH, --utr3_lower_length UTR3_LOWER_LENGTH
                        3' length (default: 10)
  -v, --print           output to stdout

transcript

featurExtract transcript -h 
usage: featurExtract transcript [-h] -d DATABASE [-f {csv,fasta}] -g GENOME
                                [-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
                                [-r {mrna,all}] [-s {gff,gtf}] [-u] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta}, --output_format {csv,fasta}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of cDNA extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for extract transcript, (default:
                        mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -u, --upper           upper cds and lower utr
  -v, --print           output to stdout

uorf

featurExtract uorf -h 
usage: featurExtract uorf [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                          [-i TRANSCRIPT] [-l LENGTH] [-m] [-n] [-o OUTPUT]
                          [-p PROCESS] [-r {mrna,all}] [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format (default: csv)
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -l LENGTH, --length LENGTH
                        uorf length, (default: 6)
  -m, --schematic_without_intron
                        schematic figure file for uorf, cds and transcript
                        without intron
  -n, --schematic_with_intron
                        schematic figure file for uorf, cds and transcript
                        with intron
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of uorf extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for uorf extraction (default: mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -v, --print           output to stdout. -v and -o option are mutually
                        exclusive
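
A uORF is an open reading frame whose start codon lies in the 5' UTR, upstream of the main start codon. The following minimal sketch scans a spliced transcript for such ORFs. It is an illustration rather than the package's algorithm: the function name is hypothetical, only ATG start codons and the three standard stop codons are considered, and the minimum length is assumed to be counted in nucleotides including the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(transcript, cds_start, min_length=6):
    """Find ORFs that start inside the 5' UTR of a spliced transcript.

    transcript: mRNA sequence written with DNA letters.
    cds_start: 0-based index of the main ATG.
    Returns a list of (start, end, sequence) with 0-based half-open indices.
    """
    seq = transcript.upper()
    uorfs = []
    # The start codon must lie entirely within the 5' UTR.
    for i in range(0, cds_start - 2):
        if seq[i:i + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOP_CODONS:
                orf = seq[i:j + 3]
                if len(orf) >= min_length:
                    uorfs.append((i, j + 3, orf))
                break
    return uorfs


if __name__ == "__main__":
    mrna = "GGATGAAATAGCCATGGCTTAA"  # main ATG at index 13
    print(find_uorfs(mrna, cds_start=13))  # [(2, 11, 'ATGAAATAG')]
```

A dORF search would follow the same pattern, starting the scan after the main stop codon instead of before the main start codon.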

utr

featurExtract utr -h 
usage: featurExtract utr [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                         [-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
                         [-r {mrna,all}] [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format (default: csv)
  -g GENOME, --genome GENOME
                        genome fasta file
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of utr extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for extract utr, (default: mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -v, --print           output to stdout. -v and -o option are mutually
                        exclusive

genBankExtract

# GenBank database
which genBankExtract
genBankExtract -h
genBankExtract gene -h
genBankExtract CDS  -h
genBankExtract rRNA -h
genBankExtract tRNA -h

Examples

featurExtract

# step 1 create database
time featurExtract create -s gtf -g Araport11_GTF_genes_transposons.Mar202021.gtf -o test/ -p ath

# step 2 command
# cds 
time featurExtract cds -d test/ath.GFF -g ath_chr.fa -r all -f csv -p 1 -o test/zhusitao_cds3.csv
time featurExtract cds -d test/ath.GFF -g ath_chr.fa -r all -f fasta -p 1 -o test/zhusitao_cds3.fa
time featurExtract cds -d test/ath.GFF -g ath_chr.fa -r all -f gff -p 1 -o test/zhusitao_cds3.gff

# transcript
time featurExtract transcript -d test/ath.GFF -g ath_chr.fa -r all -f csv -p 1 -o test/zhusitao_transcript.csv


# promoter 
time featurExtract promoter -d test/ath.GFF -f csv -g ath_chr.fa -l 10 -o test/zhusitao_promoter.csv -u 0 -i AT1G01010 -v 

# terminator
time featurExtract terminator -d test/ath.GFF -f csv -g ath_chr.fa -l 10 -o test/zhusitao_terminator.csv -u 0 -i AT1G01010 -v 

# exon 
time featurExtract exon -d test/ath.GFF -f fasta -g ath_chr.fa -o test/zhusitao_exon.fa -s gff 

# intron 
time featurExtract intron -d test/ath.GFF -f fasta -g ath_chr.fa -o test/zhusitao_intron.fa -s gff

# uorf 
time featurExtract uorf -d test/ath.GFF -g ath_chr.fa -l 1 -r all -f csv -o test/zhusitao_uORF.csv -s gff

# dorf
time featurExtract dorf -d test/ath.GFF -g ath_chr.fa -l 1 -r all -f csv -o test/zhusitao_dorf.csv -s gff

genBankExtract

# GenBank step 3
genBankExtract gene -g NC_000932.gb -f dna -p  
genBankExtract CDS  -g NC_000932.gb -f dna -p 
genBankExtract rRNA -g NC_000932.gb -f dna -p
genBankExtract tRNA -g NC_000932.gb -f dna -p
