test_output/testing.log

python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB_test.py
PATH_OF_PRECISIONPRODB: /data/p/xiaolong/PrecisionProDB
OUTPUT_TEST: /data/p/xiaolong/PrecisionProDB/test_output
running test with key_input: Ensembl, key_variant: tsv, sqlite_key: no_sqlite
Running test: without use sqlite file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/tsv/no_sqlite &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/tsv/no_sqlite/Ensembl.tsv.no_sqlite -a Ensembl_GTF --PEFF  -t 4
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/tsv/no_sqlite/Ensembl.tsv.no_sqlite', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='')
variant file is a tsv file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
finish splitting the mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: 1
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
transcripts with no mutations: 467
transcripts with mutations in CDSplus: 196
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 90
finished running perChrom for chromosome: 1
total number of proteins with AA mutation: 90
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 4.464511156082153
Command finished in 5.41 seconds


running test with key_input: Ensembl, key_variant: tsv, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/tsv/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/tsv/sqlite_one_step/Ensembl.tsv.sqlite_one_step -a Ensembl_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/tsv/sqlite_one_step/Ensembl.tsv.sqlite_one_step', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite" does not exist, create one first
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
variant file is a tsv file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
run perChrom for chromosome 1
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 90
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 90
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 6.047613143920898
Command finished in 6.86 seconds


running test with key_input: Ensembl, key_variant: tsv, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/tsv/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz  -a Ensembl_GTF && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz  -o /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/tsv/sqlite_two_step/Ensembl.tsv.sqlite_two_step -a Ensembl_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/tsv/sqlite_two_step/Ensembl.tsv.sqlite_two_step', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
variant file is a tsv file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
run perChrom for chromosome 1
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 90
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 90
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 4.627309560775757
Command finished in 8.01 seconds


running test with key_input: Ensembl, key_variant: vcf, sqlite_key: no_sqlite
Running test: without use sqlite file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/no_sqlite &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/no_sqlite/Ensembl.vcf.no_sqlite -a Ensembl_GTF --PEFF  -t 4
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/no_sqlite/Ensembl.vcf.no_sqlite', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='')
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
{'file_genome': '/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', 'file_gtf': '/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', 'file_mutations': '/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/no_sqlite/Ensembl.vcf.no_sqlite.vcf2mutation_1.tsv', 'file_protein': '/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', 'threads': 4, 'outprefix': '/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/no_sqlite/Ensembl.vcf.no_sqlite_1', 'datatype': 'Ensembl_GTF', 'protein_keyword': 'protein_id', 'keep_all': False, 'tempfolder': '/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/no_sqlite/Ensembl.vcf.no_sqlite_1_temp', 'chromosomes_genome': None, 'chromosomes_genome_description': None, 'chromosomes_mutation': None, 'chromosomes_gtf': None}
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
finish splitting the mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: 1
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
transcripts with no mutations: 501
transcripts with mutations in CDSplus: 162
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 90
finished running perChrom for chromosome: 1
total number of proteins with AA mutation: 90
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
{'file_genome': '/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', 'file_gtf': '/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', 'file_mutations': '/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/no_sqlite/Ensembl.vcf.no_sqlite.vcf2mutation_1.tsv', 'file_protein': '/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', 'threads': 4, 'outprefix': '/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/no_sqlite/Ensembl.vcf.no_sqlite_1', 'datatype': 'Ensembl_GTF', 'protein_keyword': 'protein_id', 'keep_all': False, 'tempfolder': '/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/no_sqlite/Ensembl.vcf.no_sqlite_1_temp', 'chromosomes_genome': ['1'], 'chromosomes_genome_description': ['1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF'], 'chromosomes_mutation': ['1'], 'chromosomes_gtf': ['1'], 'chromosomes_protein': ['1'], 'chromosomes': ['1']}
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
finish splitting the mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: 1
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
transcripts with no mutations: 430
transcripts with mutations in CDSplus: 233
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 129
finished running perChrom for chromosome: 1
total number of proteins with AA mutation: 129
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 219. After removing protein seq_id with the same mutations, 142 sequences left. After remove protein seq_id with same final sequence, 142 sequences left.
proteins changed in at least one strand: 129. Proteins changed in both strands: 90. Proteins changed in both strands and the mutation in both strands are identical: 77. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
perGeno_vcf finished!
PrecisionProDB finished! Total seconds: 10.278377771377563
Command finished in 11.14 seconds


running test with key_input: Ensembl, key_variant: vcf, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/sqlite_one_step/Ensembl.vcf.sqlite_one_step -a Ensembl_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/sqlite_one_step/Ensembl.vcf.sqlite_one_step', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite" does not exist, create one first
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/sqlite_one_step/Ensembl.vcf.sqlite_one_step.vcf2mutation_1.tsv
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
run perChrom for chromosome 1
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 90
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 90
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/sqlite_one_step/Ensembl.vcf.sqlite_one_step.vcf2mutation_2.tsv
run perChrom for chromosome 1
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 129
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 129
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 219. After removing protein seq_id with the same mutations, 142 sequences left. After remove protein seq_id with same final sequence, 142 sequences left.
proteins changed in at least one strand: 129. Proteins changed in both strands: 90. Proteins changed in both strands and the mutation in both strands are identical: 77. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
perGeno_vcf finished!
PrecisionProDB finished! Total seconds: 12.158420085906982
Command finished in 12.91 seconds


running test with key_input: Ensembl, key_variant: vcf, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz  -a Ensembl_GTF && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz  -o /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/sqlite_two_step/Ensembl.vcf.sqlite_two_step -a Ensembl_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/sqlite_two_step/Ensembl.vcf.sqlite_two_step', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/sqlite_two_step/Ensembl.vcf.sqlite_two_step.vcf2mutation_1.tsv
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
run perChrom for chromosome 1
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 90
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 90
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/vcf/sqlite_two_step/Ensembl.vcf.sqlite_two_step.vcf2mutation_2.tsv
run perChrom for chromosome 1
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 129
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 129
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 219. After removing protein seq_id with the same mutations, 142 sequences left. After remove protein seq_id with same final sequence, 142 sequences left.
proteins changed in at least one strand: 129. Proteins changed in both strands: 90. Proteins changed in both strands and the mutation in both strands are identical: 77. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
perGeno_vcf finished!
PrecisionProDB finished! Total seconds: 10.67515230178833
Command finished in 14.16 seconds


running test with key_input: Ensembl, key_variant: str, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/str/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/str/sqlite_one_step/Ensembl.str.sqlite_one_step -a Ensembl_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', mutations='chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A', protein='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/str/sqlite_one_step/Ensembl.str.sqlite_one_step', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite" does not exist, create one first
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
file_mutations is a mutation string chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
run perChrom for chromosome 1
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 24
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 2.2501449584960938
Command finished in 3.06 seconds


running test with key_input: Ensembl, key_variant: str, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/str/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz  -a Ensembl_GTF && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A  -o /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/str/sqlite_two_step/Ensembl.str.sqlite_two_step -a Ensembl_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/str/sqlite_two_step/Ensembl.str.sqlite_two_step', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
file_mutations is a mutation string chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/Ensembl/Ensembl.sqlite
run perChrom for chromosome 1
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 24
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 0.4944791793823242
Command finished in 3.96 seconds


running test with key_input: GENCODE, key_variant: tsv, sqlite_key: no_sqlite
Running test: without use sqlite file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/tsv/no_sqlite &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz -g /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/tsv/no_sqlite/GENCODE.tsv.no_sqlite -a GENCODE_GTF --PEFF  -t 4
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/tsv/no_sqlite/GENCODE.tsv.no_sqlite', datatype='GENCODE_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='')
variant file is a tsv file
input data is provided as from GENCODE_GTF
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
add "chr" to chromosome 1 in mutation file
finish splitting the mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: chr1
from the protein file, totally 1154 protein sequences.
finish reading genome file
number of proteins from the gtf file 1154
finishing get locs and frame
transcripts with no mutations: 665
transcripts with mutations in CDSplus: 489
number of proteins with AA change: 161
finished running perChrom for chromosome: chr1
total number of proteins with AA mutation: 161
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 12.404561996459961
Command finished in 13.28 seconds


running test with key_input: GENCODE, key_variant: tsv, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/tsv/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz -g /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/tsv/sqlite_one_step/GENCODE.tsv.sqlite_one_step -a GENCODE_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/tsv/sqlite_one_step/GENCODE.tsv.sqlite_one_step', datatype='GENCODE_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite" does not exist, create one first
input data is provided as from GENCODE_GTF
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
chr1 No mutation file provided, will not do mutation analysis
from the protein file, totally 1154 protein sequences.
finish reading genome file
number of proteins from the gtf file 1154
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
variant file is a tsv file
input data is provided as from GENCODE_GTF
add "chr" to chromosome 1 in mutation file
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
run perChrom for chromosome chr1
number of proteins with AA change: 161
finished running perchrom_sqlite for chromosome: chr1
total number of proteins with AA mutation: 161
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 16.49066972732544
Command finished in 17.29 seconds


running test with key_input: GENCODE, key_variant: tsv, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/tsv/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz  -a GENCODE_GTF && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz  -o /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/tsv/sqlite_two_step/GENCODE.tsv.sqlite_two_step -a GENCODE_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
input data is provided as from GENCODE_GTF
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
chr1 No mutation file provided, will not do mutation analysis
from the protein file, totally 1154 protein sequences.
finish reading genome file
number of proteins from the gtf file 1154
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/tsv/sqlite_two_step/GENCODE.tsv.sqlite_two_step', datatype='GENCODE_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
variant file is a tsv file
input data is provided as from GENCODE_GTF
add "chr" to chromosome 1 in mutation file
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
run perChrom for chromosome chr1
number of proteins with AA change: 161
finished running perchrom_sqlite for chromosome: chr1
total number of proteins with AA mutation: 161
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 10.553922891616821
Command finished in 18.11 seconds


running test with key_input: GENCODE, key_variant: vcf, sqlite_key: no_sqlite
Running test: without use sqlite file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/no_sqlite &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz -g /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/no_sqlite/GENCODE.vcf.no_sqlite -a GENCODE_GTF --PEFF  -t 4
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/no_sqlite/GENCODE.vcf.no_sqlite', datatype='GENCODE_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='')
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
input data is provided as from GENCODE_GTF
{'file_genome': '/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz', 'file_gtf': '/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz', 'file_mutations': '/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/no_sqlite/GENCODE.vcf.no_sqlite.vcf2mutation_1.tsv', 'file_protein': '/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz', 'threads': 4, 'outprefix': '/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/no_sqlite/GENCODE.vcf.no_sqlite_1', 'datatype': 'GENCODE_GTF', 'protein_keyword': 'transcript_id', 'keep_all': False, 'tempfolder': '/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/no_sqlite/GENCODE.vcf.no_sqlite_1_temp', 'chromosomes_genome': None, 'chromosomes_genome_description': None, 'chromosomes_mutation': None, 'chromosomes_gtf': None}
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
add "chr" to chromosome 1 in mutation file
finish splitting the mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: chr1
from the protein file, totally 1154 protein sequences.
finish reading genome file
number of proteins from the gtf file 1154
finishing get locs and frame
transcripts with no mutations: 766
transcripts with mutations in CDSplus: 388
number of proteins with AA change: 136
finished running perChrom for chromosome: chr1
total number of proteins with AA mutation: 136
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
input data is provided as from GENCODE_GTF
{'file_genome': '/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz', 'file_gtf': '/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz', 'file_mutations': '/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/no_sqlite/GENCODE.vcf.no_sqlite.vcf2mutation_1.tsv', 'file_protein': '/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz', 'threads': 4, 'outprefix': '/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/no_sqlite/GENCODE.vcf.no_sqlite_1', 'datatype': 'GENCODE_GTF', 'protein_keyword': 'transcript_id', 'keep_all': False, 'tempfolder': '/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/no_sqlite/GENCODE.vcf.no_sqlite_1_temp', 'chromosomes_genome': ['chr1'], 'chromosomes_genome_description': ['chr1 1'], 'chromosomes_mutation': ['chr1'], 'chromosomes_gtf': ['chr1'], 'chromosomes_protein': ['chr1'], 'chromosomes': ['chr1']}
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
add "chr" to chromosome 1 in mutation file
finish splitting the mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: chr1
from the protein file, totally 1154 protein sequences.
finish reading genome file
number of proteins from the gtf file 1154
finishing get locs and frame
transcripts with no mutations: 559
transcripts with mutations in CDSplus: 595
number of proteins with AA change: 249
finished running perChrom for chromosome: chr1
total number of proteins with AA mutation: 249
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 385. After removing protein seq_id with the same mutations, 271 sequences left. After remove protein seq_id with same final sequence, 271 sequences left.
proteins changed in at least one strand: 249. Proteins changed in both strands: 136. Proteins changed in both strands and the mutation in both strands are identical: 114. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
perGeno_vcf finished!
PrecisionProDB finished! Total seconds: 24.511504650115967
Command finished in 25.33 seconds


running test with key_input: GENCODE, key_variant: vcf, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz -g /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/sqlite_one_step/GENCODE.vcf.sqlite_one_step -a GENCODE_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/sqlite_one_step/GENCODE.vcf.sqlite_one_step', datatype='GENCODE_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite" does not exist, create one first
input data is provided as from GENCODE_GTF
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
chr1 No mutation file provided, will not do mutation analysis
from the protein file, totally 1154 protein sequences.
finish reading genome file
number of proteins from the gtf file 1154
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
input data is provided as from GENCODE_GTF
add "chr" to chromosome 1 in mutation file
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/sqlite_one_step/GENCODE.vcf.sqlite_one_step.vcf2mutation_1.tsv
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
run perChrom for chromosome chr1
number of proteins with AA change: 136
finished running perchrom_sqlite for chromosome: chr1
total number of proteins with AA mutation: 136
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
input data is provided as from GENCODE_GTF
add "chr" to chromosome 1 in mutation file
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/sqlite_one_step/GENCODE.vcf.sqlite_one_step.vcf2mutation_2.tsv
run perChrom for chromosome chr1
number of proteins with AA change: 249
finished running perchrom_sqlite for chromosome: chr1
total number of proteins with AA mutation: 249
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 385. After removing protein seq_id with the same mutations, 271 sequences left. After remove protein seq_id with same final sequence, 271 sequences left.
proteins changed in at least one strand: 249. Proteins changed in both strands: 136. Proteins changed in both strands and the mutation in both strands are identical: 114. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
perGeno_vcf finished!
PrecisionProDB finished! Total seconds: 28.467210054397583
Command finished in 29.43 seconds


running test with key_input: GENCODE, key_variant: vcf, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz  -a GENCODE_GTF && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz  -o /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/sqlite_two_step/GENCODE.vcf.sqlite_two_step -a GENCODE_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
input data is provided as from GENCODE_GTF
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
chr1 No mutation file provided, will not do mutation analysis
from the protein file, totally 1154 protein sequences.
finish reading genome file
number of proteins from the gtf file 1154
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/sqlite_two_step/GENCODE.vcf.sqlite_two_step', datatype='GENCODE_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
input data is provided as from GENCODE_GTF
add "chr" to chromosome 1 in mutation file
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/sqlite_two_step/GENCODE.vcf.sqlite_two_step.vcf2mutation_1.tsv
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
run perChrom for chromosome chr1
number of proteins with AA change: 136
finished running perchrom_sqlite for chromosome: chr1
total number of proteins with AA mutation: 136
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
input data is provided as from GENCODE_GTF
add "chr" to chromosome 1 in mutation file
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/vcf/sqlite_two_step/GENCODE.vcf.sqlite_two_step.vcf2mutation_2.tsv
run perChrom for chromosome chr1
number of proteins with AA change: 249
finished running perchrom_sqlite for chromosome: chr1
total number of proteins with AA mutation: 249
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 385. After removing protein seq_id with the same mutations, 271 sequences left. After remove protein seq_id with same final sequence, 271 sequences left.
proteins changed in at least one strand: 249. Proteins changed in both strands: 136. Proteins changed in both strands and the mutation in both strands are identical: 114. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
perGeno_vcf finished!
PrecisionProDB finished! Total seconds: 22.309823274612427
Command finished in 29.92 seconds


running test with key_input: GENCODE, key_variant: str, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/str/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A -g /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/str/sqlite_one_step/GENCODE.str.sqlite_one_step -a GENCODE_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz', mutations='chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A', protein='/data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/str/sqlite_one_step/GENCODE.str.sqlite_one_step', datatype='GENCODE_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite" does not exist, create one first
input data is provided as from GENCODE_GTF
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
chr1 No mutation file provided, will not do mutation analysis
from the protein file, totally 1154 protein sequences.
finish reading genome file
number of proteins from the gtf file 1154
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
file_mutations is a mutation string chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
run perChrom for chromosome chr1
number of proteins with AA change: 27
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 6.177743911743164
Command finished in 7.03 seconds


running test with key_input: GENCODE, key_variant: str, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/str/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/GENCODE/GENCODE.gtf.gz  -a GENCODE_GTF && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A  -o /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/str/sqlite_two_step/GENCODE.str.sqlite_two_step -a GENCODE_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
input data is provided as from GENCODE_GTF
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
chr1 No mutation file provided, will not do mutation analysis
from the protein file, totally 1154 protein sequences.
finish reading genome file
number of proteins from the gtf file 1154
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/str/sqlite_two_step/GENCODE.str.sqlite_two_step', datatype='GENCODE_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
file_mutations is a mutation string chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/GENCODE/GENCODE.sqlite
run perChrom for chromosome chr1
number of proteins with AA change: 27
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 0.5567793846130371
Command finished in 7.89 seconds


running test with key_input: RefSeq, key_variant: tsv, sqlite_key: no_sqlite
Running test: without use sqlite file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/tsv/no_sqlite &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz -g /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/tsv/no_sqlite/RefSeq.tsv.no_sqlite -a RefSeq --PEFF  -t 4
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/tsv/no_sqlite/RefSeq.tsv.no_sqlite', datatype='RefSeq', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='')
variant file is a tsv file
input data is provided as from RefSeq
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
chromosomes in mutation file is different from the genome. try to solve that. This is usually True if datatype is RefSeq
    mutation chromosome change 1 to NC_000001.11
finish splitting the mutation file
protein-chromosome is already determined in file /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/tsv/no_sqlite/RefSeq.tsv.no_sqlite_temp/protein2chr.pickle, use previous result
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: NC_000001.11
from the protein file, totally 738 protein sequences.
finish reading genome file
number of proteins from the gtf file 738
finishing get locs and frame
transcripts with no mutations: 481
transcripts with mutations in CDSplus: 257
number of proteins with AA change: 152
finished running perChrom for chromosome: NC_000001.11
total number of proteins with AA mutation: 152
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 6.483970880508423
Command finished in 7.29 seconds


running test with key_input: RefSeq, key_variant: tsv, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/tsv/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz -g /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/tsv/sqlite_one_step/RefSeq.tsv.sqlite_one_step -a RefSeq --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/tsv/sqlite_one_step/RefSeq.tsv.sqlite_one_step', datatype='RefSeq', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite" does not exist, create one first
input data is provided as from RefSeq
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
protein-chromosome is already determined in file /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/tsv/sqlite_one_step/RefSeq.tsv.sqlite_one_step_temp/protein2chr.pickle, use previous result
finish splitting the gtf file
number of chromosomes with proteins: 1
NC_000001.11 No mutation file provided, will not do mutation analysis
from the protein file, totally 738 protein sequences.
finish reading genome file
number of proteins from the gtf file 738
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
variant file is a tsv file
input data is provided as from RefSeq
chromosomes in mutation file is different from the genome. try to solve that. This is usually True if datatype is RefSeq
    mutation chromosome change 1 to NC_000001.11
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
run perChrom for chromosome NC_000001.11
number of proteins with AA change: 152
finished running perchrom_sqlite for chromosome: NC_000001.11
total number of proteins with AA mutation: 152
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 8.597932815551758
Command finished in 9.42 seconds


running test with key_input: RefSeq, key_variant: tsv, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/tsv/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz  -a RefSeq && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz  -o /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/tsv/sqlite_two_step/RefSeq.tsv.sqlite_two_step -a RefSeq --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
input data is provided as from RefSeq
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
protein-chromosome is already determined in file perGeno_temp/protein2chr.pickle, use previous result
finish splitting the gtf file
number of chromosomes with proteins: 1
NC_000001.11 No mutation file provided, will not do mutation analysis
from the protein file, totally 738 protein sequences.
finish reading genome file
number of proteins from the gtf file 738
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/tsv/sqlite_two_step/RefSeq.tsv.sqlite_two_step', datatype='RefSeq', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
variant file is a tsv file
input data is provided as from RefSeq
chromosomes in mutation file is different from the genome. try to solve that. This is usually True if datatype is RefSeq
    mutation chromosome change 1 to NC_000001.11
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
run perChrom for chromosome NC_000001.11
number of proteins with AA change: 152
finished running perchrom_sqlite for chromosome: NC_000001.11
total number of proteins with AA mutation: 152
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 6.752286195755005
Command finished in 10.73 seconds


running test with key_input: RefSeq, key_variant: vcf, sqlite_key: no_sqlite
Running test: without use sqlite file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/no_sqlite &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz -g /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/no_sqlite/RefSeq.vcf.no_sqlite -a RefSeq --PEFF  -t 4
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/no_sqlite/RefSeq.vcf.no_sqlite', datatype='RefSeq', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='')
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
input data is provided as from RefSeq
{'file_genome': '/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz', 'file_gtf': '/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz', 'file_mutations': '/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/no_sqlite/RefSeq.vcf.no_sqlite.vcf2mutation_1.tsv', 'file_protein': '/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz', 'threads': 4, 'outprefix': '/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/no_sqlite/RefSeq.vcf.no_sqlite_1', 'datatype': 'RefSeq', 'protein_keyword': 'protein_id', 'keep_all': False, 'tempfolder': '/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/no_sqlite/RefSeq.vcf.no_sqlite_1_temp', 'chromosomes_genome': None, 'chromosomes_genome_description': None, 'chromosomes_mutation': None, 'chromosomes_gtf': None}
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
chromosomes in mutation file is different from the genome. try to solve that. This is usually True if datatype is RefSeq
    mutation chromosome change 1 to NC_000001.11
finish splitting the mutation file
protein-chromosome is already determined in file /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/no_sqlite/RefSeq.vcf.no_sqlite_1_temp/protein2chr.pickle, use previous result
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: NC_000001.11
from the protein file, totally 738 protein sequences.
finish reading genome file
number of proteins from the gtf file 738
finishing get locs and frame
transcripts with no mutations: 502
transcripts with mutations in CDSplus: 236
number of proteins with AA change: 159
finished running perChrom for chromosome: NC_000001.11
total number of proteins with AA mutation: 159
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
input data is provided as from RefSeq
{'file_genome': '/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz', 'file_gtf': '/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz', 'file_mutations': '/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/no_sqlite/RefSeq.vcf.no_sqlite.vcf2mutation_1.tsv', 'file_protein': '/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz', 'threads': 4, 'outprefix': '/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/no_sqlite/RefSeq.vcf.no_sqlite_1', 'datatype': 'RefSeq', 'protein_keyword': 'protein_id', 'keep_all': False, 'tempfolder': '/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/no_sqlite/RefSeq.vcf.no_sqlite_1_temp', 'chromosomes_genome': ['NC_000001.11'], 'chromosomes_genome_description': ['NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly'], 'chromosomes_mutation': ['NC_000001.11'], 'chromosomes_gtf': ['NC_000001.11'], 'chromosomes_protein': ['NC_000001.11'], 'chromosomes': ['NC_000001.11']}
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
chromosomes in mutation file is different from the genome. try to solve that. This is usually True if datatype is RefSeq
    mutation chromosome change 1 to NC_000001.11
finish splitting the mutation file
protein-chromosome is already determined in file /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/no_sqlite/RefSeq.vcf.no_sqlite_2_temp/protein2chr.pickle, use previous result
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: NC_000001.11
from the protein file, totally 738 protein sequences.
finish reading genome file
number of proteins from the gtf file 738
finishing get locs and frame
transcripts with no mutations: 403
transcripts with mutations in CDSplus: 335
number of proteins with AA change: 200
finished running perChrom for chromosome: NC_000001.11
total number of proteins with AA mutation: 200
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 359. After removing protein seq_id with the same mutations, 234 sequences left. After remove protein seq_id with same final sequence, 234 sequences left.
proteins changed in at least one strand: 200. Proteins changed in both strands: 159. Proteins changed in both strands and the mutation in both strands are identical: 125. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
perGeno_vcf finished!
PrecisionProDB finished! Total seconds: 14.618520975112915
Command finished in 15.43 seconds


running test with key_input: RefSeq, key_variant: vcf, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz -g /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/sqlite_one_step/RefSeq.vcf.sqlite_one_step -a RefSeq --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/sqlite_one_step/RefSeq.vcf.sqlite_one_step', datatype='RefSeq', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite" does not exist, create one first
input data is provided as from RefSeq
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
protein-chromosome is already determined in file /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/sqlite_one_step/RefSeq.vcf.sqlite_one_step_temp/protein2chr.pickle, use previous result
finish splitting the gtf file
number of chromosomes with proteins: 1
NC_000001.11 No mutation file provided, will not do mutation analysis
from the protein file, totally 738 protein sequences.
finish reading genome file
number of proteins from the gtf file 738
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
input data is provided as from RefSeq
chromosomes in mutation file is different from the genome. try to solve that. This is usually True if datatype is RefSeq
    mutation chromosome change 1 to NC_000001.11
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/sqlite_one_step/RefSeq.vcf.sqlite_one_step.vcf2mutation_1.tsv
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
run perChrom for chromosome NC_000001.11
number of proteins with AA change: 159
finished running perchrom_sqlite for chromosome: NC_000001.11
total number of proteins with AA mutation: 159
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
input data is provided as from RefSeq
chromosomes in mutation file is different from the genome. try to solve that. This is usually True if datatype is RefSeq
    mutation chromosome change 1 to NC_000001.11
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/sqlite_one_step/RefSeq.vcf.sqlite_one_step.vcf2mutation_2.tsv
run perChrom for chromosome NC_000001.11
number of proteins with AA change: 200
finished running perchrom_sqlite for chromosome: NC_000001.11
total number of proteins with AA mutation: 200
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 359. After removing protein seq_id with the same mutations, 234 sequences left. After remove protein seq_id with same final sequence, 234 sequences left.
proteins changed in at least one strand: 200. Proteins changed in both strands: 159. Proteins changed in both strands and the mutation in both strands are identical: 125. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
perGeno_vcf finished!
PrecisionProDB finished! Total seconds: 18.232579708099365
Command finished in 19.00 seconds
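
Note on the two-strand summary above: the reported totals can be re-derived from the per-strand counts. The sketch below is a minimal check, not code from PrecisionProDB. The function name `strand_summary` is hypothetical. It assumes a protein whose mutations are identical on both strands is kept only once, which is an inference from the numbers in this log.

```python
def strand_summary(n_strand1, n_strand2, n_identical_in_both):
    """Recompute the two-strand totals from the per-strand counts.

    Assumption (inferred from the log, not from the source code): a protein
    changed identically on both strands is reported once, so the de-duplicated
    count is the raw total minus the number of identical pairs.
    """
    total = n_strand1 + n_strand2
    after_dedup = total - n_identical_in_both
    return total, after_dedup


# Counts from the RefSeq/vcf run above: 159 and 200 proteins per strand,
# 125 proteins with identical mutations on both strands.
total, after_dedup = strand_summary(159, 200, 125)
print(total, after_dedup)  # 359 234
```

The result matches the logged "359" changed proteins and "234 sequences left".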


running test with key_input: RefSeq, key_variant: vcf, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz  -a RefSeq && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz  -o /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/sqlite_two_step/RefSeq.vcf.sqlite_two_step -a RefSeq --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
input data is provided as from RefSeq
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
protein-chromosome is already determined in file perGeno_temp/protein2chr.pickle, use previous result
finish splitting the gtf file
number of chromosomes with proteins: 1
NC_000001.11 No mutation file provided, will not do mutation analysis
from the protein file, totally 738 protein sequences.
finish reading genome file
number of proteins from the gtf file 738
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/sqlite_two_step/RefSeq.vcf.sqlite_two_step', datatype='RefSeq', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
input data is provided as from RefSeq
chromosomes in mutation file is different from the genome. try to solve that. This is usually True if datatype is RefSeq
    mutation chromosome change 1 to NC_000001.11
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/sqlite_two_step/RefSeq.vcf.sqlite_two_step.vcf2mutation_1.tsv
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
run perChrom for chromosome NC_000001.11
number of proteins with AA change: 159
finished running perchrom_sqlite for chromosome: NC_000001.11
total number of proteins with AA mutation: 159
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
input data is provided as from RefSeq
chromosomes in mutation file is different from the genome. try to solve that. This is usually True if datatype is RefSeq
    mutation chromosome change 1 to NC_000001.11
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/vcf/sqlite_two_step/RefSeq.vcf.sqlite_two_step.vcf2mutation_2.tsv
run perChrom for chromosome NC_000001.11
number of proteins with AA change: 200
finished running perchrom_sqlite for chromosome: NC_000001.11
total number of proteins with AA mutation: 200
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 359. After removing protein seq_id with the same mutations, 234 sequences left. After remove protein seq_id with same final sequence, 234 sequences left.
proteins changed in at least one strand: 200. Proteins changed in both strands: 159. Proteins changed in both strands and the mutation in both strands are identical: 125. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
perGeno_vcf finished!
PrecisionProDB finished! Total seconds: 16.733232259750366
Command finished in 20.36 seconds


running test with key_input: RefSeq, key_variant: str, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/str/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A -g /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/str/sqlite_one_step/RefSeq.str.sqlite_one_step -a RefSeq --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz', mutations='chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A', protein='/data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/str/sqlite_one_step/RefSeq.str.sqlite_one_step', datatype='RefSeq', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite" does not exist, create one first
input data is provided as from RefSeq
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
protein-chromosome is already determined in file /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/str/sqlite_one_step/RefSeq.str.sqlite_one_step_temp/protein2chr.pickle, use previous result
finish splitting the gtf file
number of chromosomes with proteins: 1
NC_000001.11 No mutation file provided, will not do mutation analysis
from the protein file, totally 738 protein sequences.
finish reading genome file
number of proteins from the gtf file 738
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
file_mutations is a mutation string chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
run perChrom for chromosome NC_000001.11
number of proteins with AA change: 20
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 2.420304775238037
Command finished in 3.26 seconds


running test with key_input: RefSeq, key_variant: str, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/str/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/RefSeq/RefSeq.gtf.gz  -a RefSeq && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A  -o /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/str/sqlite_two_step/RefSeq.str.sqlite_two_step -a RefSeq --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
input data is provided as from RefSeq
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
protein-chromosome is already determined in file perGeno_temp/protein2chr.pickle, use previous result
finish splitting the gtf file
number of chromosomes with proteins: 1
NC_000001.11 No mutation file provided, will not do mutation analysis
from the protein file, totally 738 protein sequences.
finish reading genome file
number of proteins from the gtf file 738
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/str/sqlite_two_step/RefSeq.str.sqlite_two_step', datatype='RefSeq', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
file_mutations is a mutation string chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/RefSeq/RefSeq.sqlite
run perChrom for chromosome NC_000001.11
number of proteins with AA change: 20
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 0.3854036331176758
Command finished in 4.43 seconds
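
Note on reading this log: each test starts with a "running test with key_input: ..." line and ends with a "Command finished in N seconds" line, so the runs can be tabulated with two regular expressions. The sketch below is a minimal illustration. The function name `summarize_tests` and the sample lines are assumptions for demonstration and are not part of the test script.

```python
import re

# Patterns copied from the line formats that appear in this log.
HEADER = re.compile(
    r"running test with key_input: (\w+), key_variant: (\w+), sqlite_key: (\w+)"
)
FOOTER = re.compile(r"Command finished in ([\d.]+) seconds")


def summarize_tests(lines):
    """Return (key_input, key_variant, sqlite_key, seconds) per completed test."""
    results = []
    current = None
    for line in lines:
        header = HEADER.search(line)
        if header:
            current = header.groups()
            continue
        footer = FOOTER.search(line)
        if footer and current is not None:
            results.append(current + (float(footer.group(1)),))
            current = None
    return results


sample = [
    "running test with key_input: RefSeq, key_variant: str, sqlite_key: sqlite_two_step",
    "PrecisionProDB finished! Total seconds: 0.3854036331176758",
    "Command finished in 4.43 seconds",
]
print(summarize_tests(sample))
```

In the two-step tests the "Command finished" time also covers the preceding buildSqlite.py call, which is why it is larger than the "Total seconds" value printed by PrecisionProDB.py alone. To run the sketch on a whole log, pass an open file handle (for example one opened on test_output/testing.log) instead of `sample`.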


running test with key_input: TransDecoder, key_variant: tsv, sqlite_key: no_sqlite
Running test: without use sqlite file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/tsv/no_sqlite &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz -g /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz -f /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz -o /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/tsv/no_sqlite/TransDecoder.tsv.no_sqlite -a gtf --PEFF  -t 4
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/tsv/no_sqlite/TransDecoder.tsv.no_sqlite', datatype='gtf', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='')
variant file is a tsv file
datatype is not provided. Try to infer datatype. Note: Ensembl datatype cannot be inferred.
input data is general gtf
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
finish splitting the mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: 1
from the protein file, totally 932 protein sequences.
finish reading genome file
number of proteins from the gtf file 932
MSTRG.31.1.p4 strand changed from + to -
MSTRG.43.1.p1 strand changed from + to -
MSTRG.54.1.p7 strand changed from + to -
MSTRG.92.1.p3 strand changed from + to -
finishing get locs and frame
transcripts with no mutations: 564
transcripts with mutations in CDSplus: 368
number of proteins with AA change: 106
finished running perChrom for chromosome: 1
total number of proteins with AA mutation: 106
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 8.925481081008911
Command finished in 9.73 seconds


running test with key_input: TransDecoder, key_variant: tsv, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/tsv/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz -g /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz -f /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz -o /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/tsv/sqlite_one_step/TransDecoder.tsv.sqlite_one_step -a gtf --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/tsv/sqlite_one_step/TransDecoder.tsv.sqlite_one_step', datatype='gtf', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite" does not exist, create one first
datatype is not provided. Try to infer datatype. Note: Ensembl datatype cannot be inferred.
input data is general gtf
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 932 protein sequences.
finish reading genome file
number of proteins from the gtf file 932
MSTRG.31.1.p4 strand changed from + to -
MSTRG.43.1.p1 strand changed from + to -
MSTRG.54.1.p7 strand changed from + to -
MSTRG.92.1.p3 strand changed from + to -
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
variant file is a tsv file
datatype is not provided. Try to infer datatype. Note: Ensembl datatype cannot be inferred.
input data is general gtf
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
run perChrom for chromosome 1
number of proteins with AA change: 106
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 106
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 12.198411703109741
Command finished in 12.96 seconds


running test with key_input: TransDecoder, key_variant: tsv, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/tsv/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz -f /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz  -a gtf && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz  -o /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/tsv/sqlite_two_step/TransDecoder.tsv.sqlite_two_step -a gtf --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
datatype is not provided. Try to infer datatype. Note: Ensembl datatype cannot be inferred.
input data is general gtf
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 932 protein sequences.
finish reading genome file
number of proteins from the gtf file 932
MSTRG.31.1.p4 strand changed from + to -
MSTRG.43.1.p1 strand changed from + to -
MSTRG.54.1.p7 strand changed from + to -
MSTRG.92.1.p3 strand changed from + to -
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/tsv/sqlite_two_step/TransDecoder.tsv.sqlite_two_step', datatype='gtf', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
variant file is a tsv file
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
run perChrom for chromosome 1
number of proteins with AA change: 106
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 106
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 8.262892484664917
Command finished in 13.67 seconds


running test with key_input: TransDecoder, key_variant: vcf, sqlite_key: no_sqlite
Running test: without use sqlite file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/no_sqlite &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz -g /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz -f /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz -o /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/no_sqlite/TransDecoder.vcf.no_sqlite -a gtf --PEFF  -t 4
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/no_sqlite/TransDecoder.vcf.no_sqlite', datatype='gtf', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='')
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
datatype is not provided. Try to infer datatype. Note: Ensembl datatype cannot be inferred.
input data is general gtf
{'file_genome': '/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz', 'file_gtf': '/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz', 'file_mutations': '/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/no_sqlite/TransDecoder.vcf.no_sqlite.vcf2mutation_1.tsv', 'file_protein': '/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz', 'threads': 4, 'outprefix': '/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/no_sqlite/TransDecoder.vcf.no_sqlite_1', 'datatype': 'gtf', 'protein_keyword': 'Parent', 'keep_all': False, 'tempfolder': '/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/no_sqlite/TransDecoder.vcf.no_sqlite_1_temp', 'chromosomes_genome': None, 'chromosomes_genome_description': None, 'chromosomes_mutation': None, 'chromosomes_gtf': None}
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
finish splitting the mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: 1
from the protein file, totally 932 protein sequences.
finish reading genome file
number of proteins from the gtf file 932
MSTRG.31.1.p4 strand changed from + to -
MSTRG.43.1.p1 strand changed from + to -
MSTRG.54.1.p7 strand changed from + to -
MSTRG.92.1.p3 strand changed from + to -
finishing get locs and frame
transcripts with no mutations: 609
transcripts with mutations in CDSplus: 323
number of proteins with AA change: 72
finished running perChrom for chromosome: 1
total number of proteins with AA mutation: 72
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
datatype is not provided. Try to infer datatype. Note: Ensembl datatype cannot be inferred.
input data is general gtf
{'file_genome': '/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz', 'file_gtf': '/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz', 'file_mutations': '/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/no_sqlite/TransDecoder.vcf.no_sqlite.vcf2mutation_1.tsv', 'file_protein': '/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz', 'threads': 4, 'outprefix': '/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/no_sqlite/TransDecoder.vcf.no_sqlite_1', 'datatype': 'gtf', 'protein_keyword': 'Parent', 'keep_all': False, 'tempfolder': '/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/no_sqlite/TransDecoder.vcf.no_sqlite_1_temp', 'chromosomes_genome': ['1'], 'chromosomes_genome_description': ['1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF'], 'chromosomes_mutation': ['1'], 'chromosomes_gtf': ['1'], 'chromosomes_protein': ['1'], 'chromosomes': ['1']}
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
finish splitting the mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: 1
from the protein file, totally 932 protein sequences.
finish reading genome file
number of proteins from the gtf file 932
MSTRG.31.1.p4 strand changed from + to -
MSTRG.43.1.p1 strand changed from + to -
MSTRG.54.1.p7 strand changed from + to -
MSTRG.92.1.p3 strand changed from + to -
finishing get locs and frame
transcripts with no mutations: 498
transcripts with mutations in CDSplus: 434
number of proteins with AA change: 122
finished running perChrom for chromosome: 1
total number of proteins with AA mutation: 122
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 194. After removing protein seq_id with the same mutations, 122 sequences left. After remove protein seq_id with same final sequence, 122 sequences left.
proteins changed in at least one strand: 122. Proteins changed in both strands: 72. Proteins changed in both strands and the mutation in both strands are identical: 72. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
perGeno_vcf finished!
PrecisionProDB finished! Total seconds: 18.4150230884552
Command finished in 19.25 seconds


running test with key_input: TransDecoder, key_variant: vcf, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz -g /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz -f /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz -o /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/sqlite_one_step/TransDecoder.vcf.sqlite_one_step -a gtf --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/sqlite_one_step/TransDecoder.vcf.sqlite_one_step', datatype='gtf', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite" does not exist, create one first
datatype is not provided. Try to infer datatype. Note: Ensembl datatype cannot be inferred.
input data is general gtf
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 932 protein sequences.
finish reading genome file
number of proteins from the gtf file 932
MSTRG.31.1.p4 strand changed from + to -
MSTRG.43.1.p1 strand changed from + to -
MSTRG.54.1.p7 strand changed from + to -
MSTRG.92.1.p3 strand changed from + to -
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
datatype is not provided. Try to infer datatype. Note: Ensembl datatype cannot be inferred.
input data is general gtf
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/sqlite_one_step/TransDecoder.vcf.sqlite_one_step.vcf2mutation_1.tsv
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
run perChrom for chromosome 1
number of proteins with AA change: 72
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 72
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
datatype is not provided. Try to infer datatype. Note: Ensembl datatype cannot be inferred.
input data is general gtf
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/sqlite_one_step/TransDecoder.vcf.sqlite_one_step.vcf2mutation_2.tsv
run perChrom for chromosome 1
number of proteins with AA change: 122
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 122
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 194. After removing protein seq_id with the same mutations, 122 sequences left. After remove protein seq_id with same final sequence, 122 sequences left.
proteins changed in at least one strand: 122. Proteins changed in both strands: 72. Proteins changed in both strands and the mutation in both strands are identical: 72. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
perGeno_vcf finished!
PrecisionProDB finished! Total seconds: 22.27476215362549
Command finished in 23.05 seconds


running test with key_input: TransDecoder, key_variant: vcf, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz -f /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz  -a gtf && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz  -o /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/sqlite_two_step/TransDecoder.vcf.sqlite_two_step -a gtf --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
datatype is not provided. Try to infer datatype. Note: Ensembl datatype cannot be inferred.
input data is general gtf
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 932 protein sequences.
finish reading genome file
number of proteins from the gtf file 932
MSTRG.31.1.p4 strand changed from + to -
MSTRG.43.1.p1 strand changed from + to -
MSTRG.54.1.p7 strand changed from + to -
MSTRG.92.1.p3 strand changed from + to -
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/sqlite_two_step/TransDecoder.vcf.sqlite_two_step', datatype='gtf', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/sqlite_two_step/TransDecoder.vcf.sqlite_two_step.vcf2mutation_1.tsv
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
run perChrom for chromosome 1
number of proteins with AA change: 72
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 72
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/vcf/sqlite_two_step/TransDecoder.vcf.sqlite_two_step.vcf2mutation_2.tsv
run perChrom for chromosome 1
number of proteins with AA change: 122
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 122
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 194. After removing protein seq_id with the same mutations, 122 sequences left. After remove protein seq_id with same final sequence, 122 sequences left.
proteins changed in at least one strand: 122. Proteins changed in both strands: 72. Proteins changed in both strands and the mutation in both strands are identical: 72. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
perGeno_vcf finished!
PrecisionProDB finished! Total seconds: 18.41061806678772
Command finished in 23.86 seconds


running test with key_input: TransDecoder, key_variant: str, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/str/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A -g /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz -f /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz -o /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/str/sqlite_one_step/TransDecoder.str.sqlite_one_step -a gtf --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz', mutations='chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A', protein='/data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/str/sqlite_one_step/TransDecoder.str.sqlite_one_step', datatype='gtf', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite" does not exist, create one first
datatype is not provided. Try to infer datatype. Note: Ensembl datatype cannot be inferred.
input data is general gtf
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 932 protein sequences.
finish reading genome file
number of proteins from the gtf file 932
MSTRG.31.1.p4 strand changed from + to -
MSTRG.43.1.p1 strand changed from + to -
MSTRG.54.1.p7 strand changed from + to -
MSTRG.92.1.p3 strand changed from + to -
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
file_mutations is a mutation string chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
run perChrom for chromosome 1
number of proteins with AA change: 18
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 4.771468639373779
Command finished in 5.54 seconds


running test with key_input: TransDecoder, key_variant: str, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/str/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.pep.gz -f /data/p/xiaolong/PrecisionProDB/examples/TransDecoder/TransDecoder.transcripts.fa.transdecoder.genome.gff3.gz  -a gtf && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A  -o /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/str/sqlite_two_step/TransDecoder.str.sqlite_two_step -a gtf --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
datatype is not provided. Try to infer datatype. Note: Ensembl datatype cannot be inferred.
input data is general gtf
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 932 protein sequences.
finish reading genome file
number of proteins from the gtf file 932
MSTRG.31.1.p4 strand changed from + to -
MSTRG.43.1.p1 strand changed from + to -
MSTRG.54.1.p7 strand changed from + to -
MSTRG.92.1.p3 strand changed from + to -
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/str/sqlite_two_step/TransDecoder.str.sqlite_two_step', datatype='gtf', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
file_mutations is a mutation string chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/TransDecoder/TransDecoder.sqlite
run perChrom for chromosome 1
number of proteins with AA change: 18
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 0.8881592750549316
Command finished in 6.66 seconds


running test with key_input: UniProt, key_variant: tsv, sqlite_key: no_sqlite
Running test: without use sqlite file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/UniProt/tsv/no_sqlite &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/UniProt/tsv/no_sqlite/UniProt.tsv.no_sqlite -a Ensembl_GTF --PEFF  -U /data/p/xiaolong/PrecisionProDB/examples/UniProt/UniProt.protein.fa.gz -t 4 -D Uniprot
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/tsv/no_sqlite/UniProt.tsv.no_sqlite', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='Uniprot', uniprot='/data/p/xiaolong/PrecisionProDB/examples/UniProt/UniProt.protein.fa.gz', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='')
-D --download is set to be UNIPROT

download already finished for UNIPROT
variant file is a tsv file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
finish splitting the mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: 1
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
transcripts with no mutations: 467
transcripts with mutations in CDSplus: 196
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 90
finished running perChrom for chromosome: 1
total number of proteins with AA mutation: 90
remove temp folder
finished!
try to extract Uniprot proteins from Ensembl models
PrecisionProDB finished! Total seconds: 4.627462148666382
Command finished in 5.41 seconds


running test with key_input: UniProt, key_variant: tsv, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/UniProt/tsv/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/UniProt/tsv/sqlite_one_step/UniProt.tsv.sqlite_one_step -a Ensembl_GTF --PEFF  -U /data/p/xiaolong/PrecisionProDB/examples/UniProt/UniProt.protein.fa.gz -t 4 -D Uniprot  -S /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/tsv/sqlite_one_step/UniProt.tsv.sqlite_one_step', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='Uniprot', uniprot='/data/p/xiaolong/PrecisionProDB/examples/UniProt/UniProt.protein.fa.gz', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite" does not exist, create one first
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
variant file is a tsv file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
run perChrom for chromosome 1
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 90
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 90
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
remove temp folder
finished!
try to extract Uniprot proteins from Ensembl models
PrecisionProDB finished! Total seconds: 6.1342010498046875
Command finished in 7.02 seconds


running test with key_input: UniProt, key_variant: tsv, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/UniProt/tsv/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz  -a Ensembl_GTF && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz  -o /data/p/xiaolong/PrecisionProDB/test_output/UniProt/tsv/sqlite_two_step/UniProt.tsv.sqlite_two_step -a Ensembl_GTF --PEFF  -U /data/p/xiaolong/PrecisionProDB/examples/UniProt/UniProt.protein.fa.gz -t 4 -D Uniprot  -S /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/tsv/sqlite_two_step/UniProt.tsv.sqlite_two_step', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='Uniprot', uniprot='/data/p/xiaolong/PrecisionProDB/examples/UniProt/UniProt.protein.fa.gz', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
variant file is a tsv file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/examples/gnomAD.variant.txt.gz
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
run perChrom for chromosome 1
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 90
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 90
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
remove temp folder
finished!
extract all protein sequences from sqlite file
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
try to extract Uniprot proteins from Ensembl models
PrecisionProDB finished! Total seconds: 4.696371555328369
Command finished in 8.16 seconds


running test with key_input: UniProt, key_variant: vcf, sqlite_key: no_sqlite
Running test: without use sqlite file
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/no_sqlite &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/no_sqlite/UniProt.vcf.no_sqlite -a Ensembl_GTF --PEFF  -U /data/p/xiaolong/PrecisionProDB/examples/UniProt/UniProt.protein.fa.gz -t 4 -D Uniprot
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/no_sqlite/UniProt.vcf.no_sqlite', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='Uniprot', uniprot='/data/p/xiaolong/PrecisionProDB/examples/UniProt/UniProt.protein.fa.gz', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='')
-D --download is set to be UNIPROT

download already finished for UNIPROT
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
{'file_genome': '/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', 'file_gtf': '/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', 'file_mutations': '/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/no_sqlite/UniProt.vcf.no_sqlite.vcf2mutation_1.tsv', 'file_protein': '/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', 'threads': 4, 'outprefix': '/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/no_sqlite/UniProt.vcf.no_sqlite_1', 'datatype': 'Ensembl_GTF', 'protein_keyword': 'protein_id', 'keep_all': False, 'tempfolder': '/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/no_sqlite/UniProt.vcf.no_sqlite_1_temp', 'chromosomes_genome': None, 'chromosomes_genome_description': None, 'chromosomes_mutation': None, 'chromosomes_gtf': None}
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
finish splitting the mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: 1
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
transcripts with no mutations: 501
transcripts with mutations in CDSplus: 162
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 90
finished running perChrom for chromosome: 1
total number of proteins with AA mutation: 90
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
{'file_genome': '/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', 'file_gtf': '/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', 'file_mutations': '/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/no_sqlite/UniProt.vcf.no_sqlite.vcf2mutation_1.tsv', 'file_protein': '/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', 'threads': 4, 'outprefix': '/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/no_sqlite/UniProt.vcf.no_sqlite_1', 'datatype': 'Ensembl_GTF', 'protein_keyword': 'protein_id', 'keep_all': False, 'tempfolder': '/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/no_sqlite/UniProt.vcf.no_sqlite_1_temp', 'chromosomes_genome': ['1'], 'chromosomes_genome_description': ['1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF'], 'chromosomes_mutation': ['1'], 'chromosomes_gtf': ['1'], 'chromosomes_protein': ['1'], 'chromosomes': ['1']}
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
finish splitting the mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
started running perChrom for chromosome: 1
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
transcripts with no mutations: 430
transcripts with mutations in CDSplus: 233
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 129
finished running perChrom for chromosome: 1
total number of proteins with AA mutation: 129
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 219. After removing protein seq_id with the same mutations, 142 sequences left. After remove protein seq_id with same final sequence, 142 sequences left.
proteins changed in at least one strand: 129. Proteins changed in both strands: 90. Proteins changed in both strands and the mutation in both strands are identical: 77. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
perGeno_vcf finished!
try to extract Uniprot proteins from Ensembl models
PrecisionProDB finished! Total seconds: 10.193703174591064
Command finished in 10.95 seconds


running test with key_input: UniProt, key_variant: vcf, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/sqlite_one_step/UniProt.vcf.sqlite_one_step -a Ensembl_GTF --PEFF  -U /data/p/xiaolong/PrecisionProDB/examples/UniProt/UniProt.protein.fa.gz -t 4 -D Uniprot  -S /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/sqlite_one_step/UniProt.vcf.sqlite_one_step', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='Uniprot', uniprot='/data/p/xiaolong/PrecisionProDB/examples/UniProt/UniProt.protein.fa.gz', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite" does not exist, create one first
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/sqlite_one_step/UniProt.vcf.sqlite_one_step.vcf2mutation_1.tsv
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
run perChrom for chromosome 1
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 90
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 90
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/sqlite_one_step/UniProt.vcf.sqlite_one_step.vcf2mutation_2.tsv
run perChrom for chromosome 1
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 129
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 129
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 219. After removing protein seq_id with the same mutations, 142 sequences left. After remove protein seq_id with same final sequence, 142 sequences left.
proteins changed in at least one strand: 129. Proteins changed in both strands: 90. Proteins changed in both strands and the mutation in both strands are identical: 77. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
perGeno_vcf finished!
try to extract Uniprot proteins from Ensembl models
PrecisionProDB finished! Total seconds: 12.394466638565063
Command finished in 13.14 seconds


running test with key_input: UniProt, key_variant: vcf, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz  -a Ensembl_GTF && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m /data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz  -o /data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/sqlite_two_step/UniProt.vcf.sqlite_two_step -a Ensembl_GTF --PEFF  -U /data/p/xiaolong/PrecisionProDB/examples/UniProt/UniProt.protein.fa.gz -t 4 -D Uniprot  -S /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='/data/p/xiaolong/PrecisionProDB/examples/celline.vcf.gz', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/sqlite_two_step/UniProt.vcf.sqlite_two_step', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='Uniprot', uniprot='/data/p/xiaolong/PrecisionProDB/examples/UniProt/UniProt.protein.fa.gz', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
variant file is a vcf file
start extracting mutation file from the vcf input
individual is None, select the first sample in vcf file, which is HEK293
before QC, sites with mutations: 26922 of which, homozyous sites: 11242
after QC, number of variants in two dataframes are: 11094 26604
finished extracting mutations from the vcf file
start running PrecisionProDB for first strand of the genome mutation file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/sqlite_two_step/UniProt.vcf.sqlite_two_step.vcf2mutation_1.tsv
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
run perChrom for chromosome 1
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 90
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 90
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
remove temp folder
finished!
start running PrecisionProDB for second strand of the genome mutation file
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
finish splitting the mutation file
/data/p/xiaolong/PrecisionProDB/test_output/UniProt/vcf/sqlite_two_step/UniProt.vcf.sqlite_two_step.vcf2mutation_2.tsv
run perChrom for chromosome 1
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 129
finished running perchrom_sqlite for chromosome: 1
total number of proteins with AA mutation: 129
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
remove temp folder
finished!
total number of changed proteins from two strands of chromosome: 219. After removing protein seq_id with the same mutations, 142 sequences left. After remove protein seq_id with same final sequence, 142 sequences left.
proteins changed in at least one strand: 129. Proteins changed in both strands: 90. Proteins changed in both strands and the mutation in both strands are identical: 77. Proteins changed in both strands, the final protein sequences are identical but the mutations are different: 0
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
perGeno_vcf finished!
extract all protein sequences from sqlite file
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
try to extract Uniprot proteins from Ensembl models
PrecisionProDB finished! Total seconds: 10.563108444213867
Command finished in 13.99 seconds


running test with key_input: UniProt, key_variant: str, sqlite_key: sqlite_one_step
Running test: use SQLite file as intermediate file
/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/UniProt/str/sqlite_one_step &&  python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz -o /data/p/xiaolong/PrecisionProDB/test_output/UniProt/str/sqlite_one_step/UniProt.str.sqlite_one_step -a Ensembl_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
Namespace(genome='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz', gtf='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz', mutations='chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A', protein='/data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/str/sqlite_one_step/UniProt.str.sqlite_one_step', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite')
using sqlite database to speed up
running in sqlite mode
sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite" does not exist, create one first
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
building sqlite done
remove temp folder
finished!
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
file_mutations is a mutation string chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
run perChrom for chromosome 1
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 24
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 2.185404062271118
Command finished in 2.94 seconds


running test with key_input: UniProt, key_variant: str, sqlite_key: sqlite_two_step
Running test: Generate SQLite file in advance and use SQLite
/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite exists. Removing it.
Running command: cd /data/p/xiaolong/PrecisionProDB/test_output/UniProt/str/sqlite_two_step && python /data/p/xiaolong/PrecisionProDB/src/buildSqlite.py -S /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite -g /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.genome.fa.gz -p /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.protein.fa.gz -f /data/p/xiaolong/PrecisionProDB/examples/Ensembl/Ensembl.gtf.gz  -a Ensembl_GTF && python /data/p/xiaolong/PrecisionProDB/src/PrecisionProDB.py -m chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A  -o /data/p/xiaolong/PrecisionProDB/test_output/UniProt/str/sqlite_two_step/UniProt.str.sqlite_two_step -a Ensembl_GTF --PEFF  -t 4  -S /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
input data is provided as from Ensembl_GTF. Note: for some Ensembl models, the protein sequences are not translated based on GTF annotation. Those sequences will be ignored.
split the genome, mutation, gtf and protein file based on chromosomes
finish splitting the genome file
finish assign chromosome for each protein
finish splitting the protein file
file_mutation is not provided, skip splitting mutation file
finish splitting the gtf file
number of chromosomes with proteins: 1
1 No mutation file provided, will not do mutation analysis
from the protein file, totally 663 protein sequences.
finish reading genome file
number of proteins from the gtf file 663
finishing get locs and frame
building sqlite done
remove temp folder
finished!
SQLite creation complete.
Namespace(genome='', gtf='', mutations='chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A', protein='', threads=4, out='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/str/sqlite_two_step/UniProt.str.sqlite_two_step', datatype='Ensembl_GTF', protein_keyword='auto', no_filter=False, sample=None, all_chromosomes=False, download='', uniprot='', uniprot_min_len=20, PEFF=True, keep_all=False, sqlite='/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite')
using sqlite database to speed up
running in sqlite mode
use existing sqlite file "/data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite" for gene annotation
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
file_mutations is a mutation string chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A
Connected to SQLite database: /data/p/xiaolong/PrecisionProDB/test_output/UniProt/UniProt.sqlite
run perChrom for chromosome 1
ENSP00000480678 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000480678, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSGCSSEGPQWPGSPPALLHGRSASEAGPGSAPGGRRPSCRPVLLGEGAASAAPLAVAAECPSRRPGPPSQAPLPGGALGSVPDPRLRLPAPRAGGDVRLAAGAPAEAEPGPAGAARRPPAAEGAGERAPTAAGARDRPAPQRRRRGAAAARGPAGAEPRRGATAGPAPPGAPGLRTPHPVPGLCPASPPEGGSRPCLSAAQRVQGDDGG
ENSP00000482090 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
ENSP00000480678 cannot be processed properly, please check Error: Grouper and axis must be same length
ENSP00000478421 input protein sequences cannot be translated from the CDS sequence in gtf annotation.
warning! protein ENSP00000482090, the original provided protein sequence is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQAEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC
ENSP00000482090 cannot be processed properly, please check Error: Grouper and axis must be same length
warning! protein ENSP00000478421, the original provided protein sequence is MPAVKKEFPGREDLALALATFHPTLAALPLPPLPGYLAPLPAAAALPPAASLPASAAGYEALLAPPLRPPRAYLSLHEAAPHLHLPRDPLALERFSATAAAAPDFQPLLDNGEPCIEVECGANRALLYVRKLCQGSKGPSIRHRGEWLTPNEFQFVSGRETAKDWKRSIRHKGKSLKTLMSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGLAQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGFLPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQRRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTGARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFPYAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDGETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPERELGTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC and translated protein sequence from the GTF annotation is MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLVSALSEASTFEDPQRLYHLGLPSHGEDPPWHDPPHHLPSHGSCLSRPPCCRRRMPLTSPWAPISGPPSWGCPRLCARPQATASCPPRRRRCSPGSRSSCGSRTWPGWSCPPTSCGRRSWRARAHSCWRPRPPCAPTTAPRSCSGAGPCWC
ENSP00000478421 cannot be processed properly, please check Error: Grouper and axis must be same length
number of proteins with AA change: 24
remove temp folder
finished!
PrecisionProDB finished! Total seconds: 0.4095458984375
Command finished in 4.00 seconds