SitaoZ/featurExtract

Overview

featurExtract is a Python package for bioinformatics that provides two command-line programs. The first, featurExtract, works on GFF/GTF annotations and offers fourteen subcommands: create, stat, cds, dorf, exon, gene, intron, igr, mrna, promoter, terminator, transcript, uorf and utr. The create subcommand builds the annotation database that the other subcommands read, and stat reports database statistics. The remaining subcommands extract sequences: promoter and terminator extract promoter and terminator sequences, utr extracts 5' and 3' untranslated regions, uorf and dorf extract upstream and downstream open reading frames, cds extracts coding sequences, and igr extracts intergenic regions between genes. The second program, genBankExtract, works on GenBank files and offers four subcommands: gene, CDS, rRNA and tRNA.

[Figure: brief introduction of the featurExtract package]

Install

There are two ways to install the featurExtract package: from PyPI with pip, or from source.

Install from the command line:

pip install featurExtract
# or install from source
git clone https://github.com/SitaoZ/featurExtract.git
cd featurExtract
python setup.py install

Requirements

python >= 3.7.6
pandas >= 1.2.4
gffutils >= 0.10.1
setuptools >= 49.2.0
biopython >= 1.78
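
The minimum versions above can be checked before running the tools. The following is a minimal sketch using only the standard library; the helper names `parse_version`, `meets_minimum` and `report` are hypothetical, the comparison handles plain dotted numeric versions only, and `importlib.metadata` needs Python 3.8 or later.

```python
import re
from importlib import metadata

# Minimum versions copied from the Requirements list.
MINIMUMS = {
    "pandas": "1.2.4",
    "gffutils": "0.10.1",
    "setuptools": "49.2.0",
    "biopython": "1.78",
}


def parse_version(text):
    """Turn '1.2.4' into (1, 2, 4); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted versions after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report():
    """Print one status line per required package."""
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed (need >= {minimum})")
            continue
        status = "ok" if meets_minimum(installed, minimum) else "too old"
        print(f"{name}: {installed} ({status}, need >= {minimum})")


if __name__ == "__main__":
    report()
```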

Usage

featurExtract is designed for GFF and GTF files,
and genBankExtract is suited to GenBank files.

featurExtract

# gff or gtf database

featurExtract -h
Program:  featurExtract (tools for genomic feature extract)
Version:  0.2.6.0
Contact:  Sitao Zhu <[email protected]>
Usage  :  featurExtract <command> [parameters] 
Command: 
          create        create GFF/GTF database
          stat          database statistics
          cds           extract CDS sequence
          dorf          extract dORF sequence
          exon          extract exon sequence
          gene          extract gene sequence
          intron        extract intron sequence
          igr           extract intergenic region
          mrna          extract mRNA sequence
          promoter      extract promoter sequence
          terminator    extract terminator sequence
          transcript    extract transcript sequence
          uorf          extract uORF sequence
          utr           extract 5/3UTR sequence
  • create
featurExtract create -h
usage: featurExtract create [-h] -g GENOMEFEATURE -o OUTPUT -p PREFIX
                            [-s {gff,gtf}]

optional arguments:
  -h, --help            show this help message and exit
  -g GENOMEFEATURE, --genomefeature GENOMEFEATURE
                        genome annotation file, gff or gtf
  -o OUTPUT, --output OUTPUT
                        database output dir path
  -p PREFIX, --prefix PREFIX
                        database prefix
  -s {gff,gtf}, --style {gff,gtf}
                        genome annotation file format
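
The `-s` choice matters because GFF3 and GTF encode the ninth (attributes) column differently: GFF3 uses `key=value` pairs, while GTF uses `key "value";` pairs. The sketch below illustrates that difference with a hypothetical `parse_attributes` helper and made-up attribute strings. It is an illustration only, not how featurExtract or gffutils parse files, and it ignores edge cases such as semicolons inside quoted values.

```python
def parse_attributes(column, style):
    """Parse the attributes column of one annotation line into a dict.

    style is 'gff' (key=value;key=value) or 'gtf' (key "value"; key "value";).
    Hypothetical helper for illustration; real files have more edge cases.
    """
    attributes = {}
    for field in column.strip().split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece after a trailing semicolon
        if style == "gff":
            key, _, value = field.partition("=")
        elif style == "gtf":
            key, _, value = field.partition(" ")
            value = value.strip().strip('"')
        else:
            raise ValueError(f"unknown style: {style!r}")
        attributes[key] = value
    return attributes


# Made-up attribute strings in each style.
gff_column = "ID=AT1G01010.1;Parent=AT1G01010"
gtf_column = 'transcript_id "AT1G01010.1"; gene_id "AT1G01010";'

print(parse_attributes(gff_column, "gff"))
print(parse_attributes(gtf_column, "gtf"))
```
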
  • stat
featurExtract stat -h
usage: featurExtract stat [-h] -d DATABASE -g GENOME -o OUTPUT [-s {gff,gtf}]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database created from creat command
  -g GENOME, --genome GENOME
                        genome fasta path
  -o OUTPUT, --output OUTPUT
                        stat output
  -s {gff,gtf}, --style {gff,gtf}
                        genome annotation file format
  • cds
featurExtract cds -h
usage: featurExtract cds [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                         [-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
                         [-r {mrna,all}] [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of cds extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for extract cds, (default: mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -v, --print           output to stdout. -v and -o option are mutually
                        exclusive
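
Conceptually, a CDS sequence is assembled by joining a transcript's CDS segments in genomic order and reverse-complementing the result for minus-strand transcripts. The sketch below shows that idea on toy data; the function names and the sequence are hypothetical, coordinates are 1-based and inclusive as in GFF/GTF, and this is not the package's own code.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def splice_cds(chrom_seq, segments, strand):
    """Join CDS segments given as (start, end), 1-based inclusive.

    Segments are joined in ascending genomic order; for the minus strand
    the joined sequence is reverse-complemented.
    """
    pieces = [chrom_seq[start - 1:end] for start, end in sorted(segments)]
    joined = "".join(pieces)
    return reverse_complement(joined) if strand == "-" else joined


# Toy chromosome: two CDS segments (3-8 and 15-17) separated by an intron.
chrom = "CCATGGCCGTAAGTTGAGG"
segments = [(15, 17), (3, 8)]

print(splice_cds(chrom, segments, "+"))  # ATGGCCTGA
print(splice_cds(chrom, segments, "-"))  # TCAGGCCAT
```
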
  • dorf
featurExtract dorf -h 
usage: featurExtract dorf [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                          [-i TRANSCRIPT] [-l LENGTH] [-m] [-n] [-o OUTPUT]
                          [-p PROCESS] [-r {mrna,all}] [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -l LENGTH, --length LENGTH
                        dorf length, (default: 6)
  -m, --schematic_without_intron
                        schematic figure file for dorf, cds and transcript
                        without intron
  -n, --schematic_with_intron
                        schematic figure file for dorf, cds and transcript
                        with intron
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of dorf extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for dorf extraction (default: mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -v, --print           output to stdout. -v and -o option are mutually
                        exclusive
  • exon
featurExtract exon -h 
usage: featurExtract exon [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                          [-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
                          [-r {mrna,all}] [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of exon extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for exon extraction (default: mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -v, --print           output to stdout
  • gene
featurExtract gene -h 
usage: featurExtract gene [-h] -d DATABASE [-f {csv,fasta,gff,gtf}] -g GENOME
                          [-i GENE] [-o OUTPUT] [-p] [-s {gff,gtf}]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff,gtf}, --output_format {csv,fasta,gff,gtf}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i GENE, --gene GENE  specific gene (optional); if not given, return whole
                        genes
  -o OUTPUT, --output OUTPUT
                        output file path
  -p, --print           output to stdout
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  • intron
featurExtract intron -h 
usage: featurExtract intron [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                            [-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
                            [-r {mrna,all}] [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of exon extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for intron extraction (default: mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -v, --print           output to stdout
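
Annotation files usually list exons rather than introns, so introns are commonly derived as the gaps between consecutive exons of one transcript. A minimal sketch of that derivation follows; `introns_from_exons` is a hypothetical name, coordinates are 1-based and inclusive, and the package may compute introns differently.

```python
def introns_from_exons(exons):
    """Return intron (start, end) pairs lying between consecutive exons.

    exons is a list of (start, end) tuples for a single transcript,
    1-based inclusive. Exons that touch each other produce no intron.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Toy transcript with three exons.
print(introns_from_exons([(501, 600), (100, 200), (301, 400)]))
# [(201, 300), (401, 500)]
```
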
  • igr
featurExtract igr -h 
usage: featurExtract igr [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                         [-l IGR_LENGTH] [-o OUTPUT] [-p PROCESS]
                         [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -l IGR_LENGTH, --igr_length IGR_LENGTH
                        igr length threshold
  -o OUTPUT, --output OUTPUT
                        output fasta file path
  -p PROCESS, --process PROCESS
                        number of igr extract process, (default: 4)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database only contain protein genes, while gff
                        database contain protein genes and nocoding genes
  -v, --print           output to stdout
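
An intergenic region is the stretch of a chromosome between two neighbouring genes. The sketch below computes those gaps for one chromosome and filters them by length. The name `intergenic_regions` is hypothetical, and treating the threshold as a minimum length is an assumption, since the help text only says "igr length threshold".

```python
def intergenic_regions(genes, min_length=1):
    """Return gaps between neighbouring genes on one chromosome.

    genes is a list of (start, end) tuples, 1-based inclusive. Overlapping
    genes are handled by tracking the furthest end seen so far. Only gaps
    of at least min_length bases are kept (assumed meaning of the threshold).
    """
    regions = []
    covered_end = None
    for start, end in sorted(genes):
        if covered_end is not None and start > covered_end + 1:
            gap_start, gap_end = covered_end + 1, start - 1
            if gap_end - gap_start + 1 >= min_length:
                regions.append((gap_start, gap_end))
        covered_end = end if covered_end is None else max(covered_end, end)
    return regions


# Toy gene models; the first two overlap.
genes = [(100, 200), (150, 400), (1001, 2000), (2011, 3000)]
print(intergenic_regions(genes))                  # [(401, 1000), (2001, 2010)]
print(intergenic_regions(genes, min_length=100))  # [(401, 1000)]
```
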
  • mrna
featurExtract mrna -h 
usage: featurExtract mrna [-h] -d DATABASE [-f {csv,fasta}] -g GENOME
                          [-i TRANSCRIPT] [-o OUTPUT] [-p] [-s {gff,gtf}] [-u]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta}, --output_format {csv,fasta}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -o OUTPUT, --output OUTPUT
                        output file path
  -p, --print           output to stdout
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -u, --upper           upper cds and lower utr
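
The `-u` option writes the CDS in upper case and the UTRs in lower case, which makes the coding region easy to spot in the output. A minimal sketch of that casing follows; `mark_cds` is a hypothetical helper, and it assumes the CDS boundaries are given as 1-based inclusive positions within the spliced transcript.

```python
def mark_cds(transcript_seq, cds_start, cds_end):
    """Upper-case the CDS and lower-case the flanking UTRs.

    cds_start and cds_end are 1-based inclusive positions within the
    spliced transcript sequence (an assumption of this sketch).
    """
    utr5 = transcript_seq[:cds_start - 1].lower()
    cds = transcript_seq[cds_start - 1:cds_end].upper()
    utr3 = transcript_seq[cds_end:].lower()
    return utr5 + cds + utr3


# Toy transcript: 3 nt 5' UTR, 9 nt CDS, 2 nt 3' UTR.
print(mark_cds("GGCATGAAATAGCC", 4, 12))  # ggcATGAAATAGcc
```
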
  • promoter
featurExtract promoter -h 
usage: featurExtract promoter [-h] -d DATABASE [-f {csv,fasta}] -g GENOME
                              [-i GENE] [-l PROMOTER_LENGTH] [-o OUTPUT]
                              [-p PROCESS] [-u UTR5_UPPER_LENGTH] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta}, --output_format {csv,fasta}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta path
  -i GENE, --gene GENE  specific gene (optional); if not given, return whole
                        genes
  -l PROMOTER_LENGTH, --promoter_length PROMOTER_LENGTH
                        promoter length before TSS (default: 100)
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of promoter extract process, (default: 4)
  -u UTR5_UPPER_LENGTH, --utr5_upper_length UTR5_UPPER_LENGTH
                        5' utr length after TSS (default: 10)
  -v, --print           output to stdout
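
The `-l` and `-u` options together define a window around the transcription start site (TSS): `-l` bases upstream and `-u` bases downstream. The sketch below computes such a window in a strand-aware way. The name `promoter_window` is hypothetical, and counting the TSS base as the first downstream base is an assumption; the package's exact boundary convention may differ.

```python
def promoter_window(tss, strand, promoter_length=100, utr5_length=10,
                    chrom_length=None):
    """Return (start, end), 1-based inclusive, of a promoter window.

    The window covers promoter_length bases upstream of the TSS plus
    utr5_length bases starting at the TSS itself (assumed convention).
    On the minus strand, upstream means higher coordinates; the extracted
    sequence would also need reverse-complementing.
    """
    if strand == "+":
        start, end = tss - promoter_length, tss + utr5_length - 1
    elif strand == "-":
        start, end = tss - utr5_length + 1, tss + promoter_length
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    start = max(start, 1)  # clip at the chromosome start
    if chrom_length is not None:
        end = min(end, chrom_length)  # clip at the chromosome end
    return start, end


print(promoter_window(1000, "+"))  # (900, 1009)
print(promoter_window(1000, "-"))  # (991, 1100)
```
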
  • terminator
featurExtract terminator -h 
usage: featurExtract terminator [-h] -d DATABASE [-f {csv,fasta}] -g GENOME
                                [-i GENE] [-l TERMINATOR_LENGTH] [-o OUTPUT]
                                [-u UTR3_LOWER_LENGTH] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta}, --output_format {csv,fasta}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta path
  -i GENE, --gene GENE  specific gene (optional); if not given, return whole
                        genes
  -l TERMINATOR_LENGTH, --terminator_length TERMINATOR_LENGTH
                        terminator length (default: 100)
  -o OUTPUT, --output OUTPUT
                        output file path
  -u UTR3_LOWER_LENGTH, --utr3_lower_length UTR3_LOWER_LENGTH
                        3' length (default: 10)
  -v, --print           output to stdout
  • transcript
featurExtract transcript -h 
usage: featurExtract transcript [-h] -d DATABASE [-f {csv,fasta}] -g GENOME
                                [-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
                                [-r {mrna,all}] [-s {gff,gtf}] [-u] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta}, --output_format {csv,fasta}
                        output format
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of cDNA extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for extract transcript, (default:
                        mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -u, --upper           upper cds and lower utr
  -v, --print           output to stdout
  • uorf
featurExtract uorf -h 
usage: featurExtract uorf [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                          [-i TRANSCRIPT] [-l LENGTH] [-m] [-n] [-o OUTPUT]
                          [-p PROCESS] [-r {mrna,all}] [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format (default: csv)
  -g GENOME, --genome GENOME
                        genome fasta
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -l LENGTH, --length LENGTH
                        uorf length, (default: 6)
  -m, --schematic_without_intron
                        schematic figure file for uorf, cds and transcript
                        without intron
  -n, --schematic_with_intron
                        schematic figure file for uorf, cds and transcript
                        with intron
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of uorf extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for uorf extraction (default: mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -v, --print           output to stdout. -v and -o option are mutually
                        exclusive
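
A uORF is an open reading frame that starts in the 5' UTR, upstream of the main CDS. The sketch below scans a 5' UTR sequence for ATG start codons followed by an in-frame stop codon. It is a simplified illustration, not the package's algorithm: `find_uorfs` is a hypothetical name, only uORFs that end inside the UTR are reported, and the length filter is assumed to count nucleotides including the start and stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5_seq, min_length=6):
    """Find ORFs that start with ATG and end at the first in-frame stop.

    Returns (start, end, sequence) tuples with 0-based, end-exclusive
    coordinates relative to the UTR. ORFs shorter than min_length
    nucleotides, or without a stop codon inside the UTR, are skipped.
    """
    seq = utr5_seq.upper()
    found = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if end - start >= min_length:
                    found.append((start, end, seq[start:end]))
                break
    return found


print(find_uorfs("CCATGAAATAGGG"))  # [(2, 11, 'ATGAAATAG')]
```
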
  • utr
featurExtract utr -h 
usage: featurExtract utr [-h] -d DATABASE [-f {csv,fasta,gff}] -g GENOME
                         [-i TRANSCRIPT] [-o OUTPUT] [-p PROCESS]
                         [-r {mrna,all}] [-s {gff,gtf}] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        database generated by subcommand create
  -f {csv,fasta,gff}, --output_format {csv,fasta,gff}
                        output format (default: csv)
  -g GENOME, --genome GENOME
                        genome fasta file
  -i TRANSCRIPT, --transcript TRANSCRIPT
                        specific transcript (optional); if not given, return
                        whole transcripts
  -o OUTPUT, --output OUTPUT
                        output file path
  -p PROCESS, --process PROCESS
                        number of utr extract process, (default: 4)
  -r {mrna,all}, --rna_feature {mrna,all}
                        The type of RNA for extract utr, (default: mrna)
  -s {gff,gtf}, --style {gff,gtf}
                        gtf database or gff database
  -v, --print           output to stdout. -v and -o option are mutually
                        exclusive

genBankExtract

# GenBank database
which genBankExtract
genBankExtract -h
genBankExtract gene -h
genBankExtract CDS  -h
genBankExtract rRNA -h
genBankExtract tRNA -h

Examples

featurExtract

# step 1 create database
time featurExtract create -s gtf -g Araport11_GTF_genes_transposons.Mar202021.gtf -o test/ -p ath

# step 2 command
# cds 
time featurExtract cds -d test/ath.GFF -g ath_chr.fa -r all -f csv -p 1 -o test/zhusitao_cds3.csv
time featurExtract cds -d test/ath.GFF -g ath_chr.fa -r all -f fasta -p 1 -o test/zhusitao_cds3.fa
time featurExtract cds -d test/ath.GFF -g ath_chr.fa -r all -f gff -p 1 -o test/zhusitao_cds3.gff

# transcript
time featurExtract transcript -d test/ath.GFF -g ath_chr.fa -r all -f csv -p 1 -o test/zhusitao_transcript.csv


# promoter 
time featurExtract promoter -d test/ath.GFF -f csv -g ath_chr.fa -l 10 -o test/zhusitao_promoter.csv -u 0 -i AT1G01010 -v 

# terminator
time featurExtract terminator -d test/ath.GFF -f csv -g ath_chr.fa -l 10 -o test/zhusitao_terminator.csv -u 0 -i AT1G01010 -v 

# exon 
time featurExtract exon -d test/ath.GFF -f fasta -g ath_chr.fa -o test/zhusitao_exon.fa -s gff 

# intron 
time featurExtract intron -d test/ath.GFF -f fasta -g ath_chr.fa -o test/zhusitao_intron.fa -s gff

# uorf 
time featurExtract uorf -d test/ath.GFF -g ath_chr.fa -l 1 -r all -f csv -o test/zhusitao_uORF.csv -s gff

# dorf
time featurExtract dorf -d test/ath.GFF -g ath_chr.fa -l 1 -r all -f csv -o test/zhusitao_dorf.csv -s gff
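
The same two-step workflow can be driven from Python. The sketch below only builds and prints the command lines, using the flags and file names from the examples above; the helper names `create_command` and `cds_command` are hypothetical, and the `subprocess.run` call is left commented out so that nothing runs unless featurExtract is installed and the input files exist. `shlex.join` needs Python 3.8 or later.

```python
import shlex
# import subprocess  # needed only if the commands are actually executed


def create_command(annotation, out_dir, prefix, style="gtf"):
    """Build the argument list for step 1 (database creation)."""
    return ["featurExtract", "create", "-s", style,
            "-g", annotation, "-o", out_dir, "-p", prefix]


def cds_command(database, genome, output, fmt="csv", processes=1,
                rna_feature="all"):
    """Build the argument list for a CDS extraction in one output format."""
    return ["featurExtract", "cds", "-d", database, "-g", genome,
            "-r", rna_feature, "-f", fmt, "-p", str(processes), "-o", output]


commands = [create_command("Araport11_GTF_genes_transposons.Mar202021.gtf",
                           "test/", "ath")]
# One CDS command per output format, with the file extensions used above.
for fmt, ext in [("csv", "csv"), ("fasta", "fa"), ("gff", "gff")]:
    commands.append(cds_command("test/ath.GFF", "ath_chr.fa",
                                f"test/zhusitao_cds3.{ext}", fmt=fmt))

for argv in commands:
    print(shlex.join(argv))
    # subprocess.run(argv, check=True)  # uncomment to execute
```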

genBankExtract

# GenBank examples
genBankExtract gene -g NC_000932.gb -f dna -p  
genBankExtract CDS  -g NC_000932.gb -f dna -p 
genBankExtract rRNA -g NC_000932.gb -f dna -p
genBankExtract tRNA -g NC_000932.gb -f dna -p
