pharokka protein crashed after completing mmseqs searches #300

Open
luisalbertoc95 opened this issue Nov 1, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@luisalbertoc95

luisalbertoc95 commented Nov 1, 2023

  • pharokka version: 1.4 & 1.5.1
  • Python version: Python 3.10.8
  • Operating System: Rocky Linux 8.7 (Green Obsidian)

Description

Hi @gbouras13, when running pharokka_proteins.py on a set of 755,001 ORFs, I get an error caused by a length mismatch between the keys and the columns of a pandas DataFrame. According to the log file, all mmseqs searches completed.

Thank you!

What I Did

Command run: 

pharokka_proteins.py -i ${WD}/out.CAT.predicted_proteins.faa  \
-o ${WD}/pharokka_prot_out_assembly_1Kb_NoPhablesContigs_PhablesresolvedGenomes \
-d /ref/sahlab/data/viral_analysis_DBs/pharokka1.5_DBs \
-t 24 \
-e 1E-03 \
--force

Traceback: 
2023-10-31 21:26:34.164 | INFO     | post_processing:process_vfdb_results:2134 - Processing VFDB output.
2023-10-31 21:26:35.099 | INFO     | post_processing:process_vfdb_results:2197 - 46 VFDB virulence factors identified.
Traceback (most recent call last):
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/bin/pharokka_proteins.py", line 213, in <module>
    main()
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/bin/pharokka_proteins.py", line 172, in main
    pharok.process_dataframes()
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/bin/proteins.py", line 526, in process_dataframes
    (tophits_df, vfdb_results) = process_vfdb_results(self.out_dir, tophits_df)
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/bin/post_processing.py", line 2198, in process_vfdb_results
    merged_df[["genbank", "desc_tmp", "vfdb_species"]] = merged_df[
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/lib/python3.10/site-packages/pandas/core/frame.py", line 4082, in __setitem__
    self._setitem_array(key, value)
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/lib/python3.10/site-packages/pandas/core/frame.py", line 4124, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/lib/python3.10/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

pharokka_proteins_1698789518.5425682.log

@gbouras13
Owner

Hi @luisalbertoc95 ,

Thanks for reporting this bug and using Pharokka! I see you're using Phables too :)

I'm pretty sure this has to do with the VFDB naming (it's annoying :) ).

Would you be able to do a few things:

  1. I'd upgrade to 1.5.1 regardless (that log is from v1.4.0).
  2. Re-run this with --hmm_only. It should work to get all the PHROG annotations, but it will skip CARD and VFDB steps. So do that if you're in a hurry.
  3. I'm sure you want the CARD and VFDB steps too, so would you be able to send me the VFDB output? In particular vfdb_results.tsv. [email protected] (it should be small enough to email or attach here). I'm pretty sure it's because one of the VFDB outputs has a strange character and if so I will implement a fix soon once I can replicate the error.

George
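
For context, the ValueError in the traceback above is what pandas raises when a list of column names is assigned a DataFrame with a different number of columns. A string split with expand=True is one common way to hit it: the number of columns it returns depends on the data. The sketch below reproduces the message and shows one defensive pattern. It is only an illustration: the "hit" column, the ";" delimiter, the sample values and the split_fixed helper are all hypothetical, and this is not pharokka's actual code or its fix.

```python
import pandas as pd

# Target column names taken from the traceback; everything else is made up.
KEYS = ["genbank", "desc_tmp", "vfdb_species"]


def split_fixed(series, sep, keys):
    """Split each string into exactly len(keys) columns (hypothetical helper).

    n caps the number of splits, so extra delimiters stay in the last field;
    reindex adds any missing columns, filled with NaN.
    """
    parts = series.str.split(sep, n=len(keys) - 1, expand=True)
    parts = parts.reindex(columns=range(len(keys)))
    parts.columns = keys
    return parts


# Every row has only two fields, so expand=True yields two columns.
df = pd.DataFrame({"hit": ["AAA;toxin", "BBB;adhesin"]})

try:
    # Two columns assigned to three keys: pandas refuses.
    df[KEYS] = df["hit"].str.split(";", expand=True)
    error = None
except ValueError as exc:
    error = str(exc)

print(error)

# The padded split always has three columns, so the assignment succeeds.
df[KEYS] = split_fixed(df["hit"], ";", KEYS)
print(df)
```

With the padded split, rows that lack a field end up with NaN in the trailing columns instead of aborting the run, which makes the offending records easy to find afterwards.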

@gbouras13 gbouras13 added the bug Something isn't working label Nov 1, 2023
@luisalbertoc95
Author

Hi George,

Thanks a lot for your suggestions. Running the code with --hmm_only worked! I'll send the vfdb_results.tsv to you.

Thank you,

Luis

@gbouras13
Owner

Hi @luisalbertoc95 ,

It took a while but I solved this error - it was a bug in pharokka to do with matching VFDB and other outputs.

If you re-run pharokka now it should work (but seemingly you were happy enough with --hmm_only so maybe you've moved on)

George

@ebueren

ebueren commented Jan 23, 2024

Hello! I'm running pharokka 1.6.1 (fresh env and database install) and still receiving the same error (below). Running in --fast mode fixes the problem, so it seems to be related to the VFDB/CARD databases.

Pharokka version: 1.6.1
Python 3.10.8
OS: Linux, 3.10.0

Command:
pharokka.py -i file.fna -f -o test.out -d /x/x/x/pharokka_db/ -t 32 -m -g prodigal --skip_mash


2024-01-22 20:59:20.921 | INFO     | __main__:main:379 - Post Processing Output.
2024-01-22 20:59:23.455 | INFO     | post_processing:create_mmseqs_tophits:2104 - Processing MMseqs2 outputs.
2024-01-22 20:59:23.455 | INFO     | post_processing:create_mmseqs_tophits:2105 - Processing PHROGs output.
2024-01-22 20:59:30.113 | INFO     | post_processing:process_vfdb_results:2309 - Processing VFDB output.
2024-01-22 20:59:30.149 | INFO     | post_processing:process_vfdb_results:2368 - 17 VFDB virulence factors identified.
Traceback (most recent call last):
  File "/home/ebueren/miniconda3/envs/pharokka1.6/bin/pharokka.py", line 499, in <module>
    main()
  File "/home/ebueren/miniconda3/envs/pharokka1.6/bin/pharokka.py", line 418, in main
    pharok.process_results()
  File "/home/ebueren/miniconda3/envs/pharokka1.6/bin/post_processing.py", line 356, in process_results
    (merged_df, vfdb_results) = process_vfdb_results(
  File "/home/ebueren/miniconda3/envs/pharokka1.6/bin/post_processing.py", line 2369, in process_vfdb_results
    merged_df[["genbank", "desc_tmp", "vfdb_species"]] = merged_df[
  File "/home/ebueren/miniconda3/envs/pharokka1.6/lib/python3.10/site-packages/pandas/core/frame.py", line 4287, in __setitem__
    self._setitem_array(key, value)
  File "/home/ebueren/miniconda3/envs/pharokka1.6/lib/python3.10/site-packages/pandas/core/frame.py", line 4329, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/home/ebueren/miniconda3/envs/pharokka1.6/lib/python3.10/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

@fluhus

fluhus commented May 23, 2024

Hi, I am having this issue as well on a fresh mamba+pharokka (1.7.1) install.

pharokka.py -i vir.fa -o vir.prk -d ~/data/pharokka

Same error. Adding --hmm_only or --fast did not help. Happy to provide additional information that could help debug this!

@gbouras13
Owner

Hi @fluhus ,

How big is your input? Is it very small? I have a feeling this error may be because MMseqs2 found no hits at all. I’ll try to replicate it later this week and put in a fix if so.

george

@fluhus

fluhus commented May 23, 2024

Thanks for the quick response!

Here is the input file (111K unzipped):

vir.fa.gz

@gbouras13
Owner

Hi @fluhus,

I have narrowed down your error to the '#' in the header. If you remove this it will work. I'll put in a bug fix at some point :)

George

gbouras13 added a commit that referenced this issue May 23, 2024
@fluhus

fluhus commented May 24, 2024

Thanks for looking into this! I removed the # signs from the names and now it runs :)
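
For anyone hitting the same problem, the workaround described above (removing '#' from the FASTA headers before running pharokka) could be scripted roughly as follows. This is a minimal sketch using only the Python standard library; the function names and the example records are hypothetical, and it assumes a plain-text (uncompressed) FASTA file.

```python
def sanitize_fasta_headers(lines, bad_chars="#"):
    """Yield FASTA lines with bad_chars removed from header lines only."""
    table = str.maketrans("", "", bad_chars)
    for line in lines:
        if line.startswith(">"):
            # Header line: drop the offending characters.
            yield line.translate(table)
        else:
            # Sequence line: pass through unchanged.
            yield line


def sanitize_fasta_file(src_path, dst_path, bad_chars="#"):
    """Write a copy of src_path with cleaned headers to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(sanitize_fasta_headers(src, bad_chars))


# Small in-memory example (made-up records).
example = [">phage#1 sample\n", "ATGCATGC\n", ">phage#2\n", "GGCCTTAA\n"]
cleaned = list(sanitize_fasta_headers(example))
print("".join(cleaned), end="")
```

Note that removing characters can make two previously distinct headers identical, so it is worth checking that the cleaned names are still unique.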
