ValueError: Duplicate key (while no duplicated ids in input) #361

Open
art-egorov opened this issue Oct 4, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@art-egorov

  • pharokka version: 1.7.3
  • Python version: 3.10.14
  • Operating System: Linux

Description

Hi!

When I run pharokka in meta mode, it fails with a duplicate-key error, and the key looks like a contig ID. It fails only on a subset of sequences (most runs are fine). Moreover, the reported duplicate key is not present anywhere in the input files.

What I Did

Command:

pharokka.py -i FAILED_SEQS.fa  -o pharokka_batches/ALL_SEQS  --meta --split -t 45 --skip_mash --dnaapler  --database pharokka/pharokka_v1.4.0_databases

logs:

2024-10-04 13:21:28.625 | INFO     | external_tools:run:50 - Started running mmseqs createtsv pharokka/pharokka_v1.4.0_databases/vfdb pharokka_batches/ALL_SEQS/VFDB_target_dir/target_seqs pharokka_batches/ALL_SEQS/VFDB/results_mmseqs pharokka_batches/ALL_SEQS/vfdb_results.tsv --full-header --threads 45 ...
2024-10-04 13:21:28.773 | INFO     | external_tools:run:52 - Done running mmseqs createtsv pharokka/pharokka_v1.4.0_databases/vfdb pharokka_batches/ALL_SEQS/VFDB_target_dir/target_seqs pharokka_batches/ALL_SEQS/VFDB/results_mmseqs pharokka_batches/ALL_SEQS/vfdb_results.tsv --full-header --threads 45
2024-10-04 13:21:28.828 | INFO     | __main__:main:364 - Post Processing Output.
Traceback (most recent call last):
  File "/home/aegorov/.conda/envs/pharokka_env/bin/pharokka.py", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/lunarc/nobackup/projects/lutafold/aegorov/Hotspots/Phages/pharokka/bin/pharokka.py", line 489, in <module>
    main()
  File "/lunarc/nobackup/projects/lutafold/aegorov/Hotspots/Phages/pharokka/bin/pharokka.py", line 403, in main
    pharok.process_results()
  File "/lunarc/nobackup/projects/lutafold/aegorov/Hotspots/Phages/pharokka/bin/post_processing.py", line 204, in process_results
    prot_dict = SeqIO.to_dict(SeqIO.parse(fasta_input_aas_tmp, "fasta"))
  File "/home/aegorov/.conda/envs/pharokka_env/lib/python3.10/site-packages/Bio/SeqIO/__init__.py", line 754, in to_dict
    raise ValueError(f"Duplicate key '{key}'")
ValueError: Duplicate key 'TemPhD_cluster_4683480'
(pharokka_env) [aegorov@cn001 Phages]$ cat FAILED_SEQS.fa  | grep TemPhD_cluster_4683480
(pharokka_env) [aegorov@cn001 Phages]$ 
(pharokka_env) [aegorov@cn001 Phages]$ grep TemPhD_cluster_4683480 PhageScope_annotation_filtered.tsv 
(pharokka_env) [aegorov@cn001 Phages]$ 

It seems that a numeric suffix gets appended to the contig ID for the prodigal output, and the result then collides with IDs derived from other contigs?

pharokka_batches/ALL_SEQS/prodigal-gv_aas_tmp.fasta:>TemPhD_cluster_4683480 1299_2654
pharokka_batches/ALL_SEQS/prodigal-gv_aas_tmp.fasta:>TemPhD_cluster_4683480 1_1272
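
To list every repeated ID in the temporary protein FASTA rather than grepping for one key at a time, a small standard-library check along these lines may help. It is only a sketch: `duplicate_fasta_ids` is a hypothetical helper, not part of pharokka, and it treats the first whitespace-delimited token of each header as the ID, which is what Biopython's FASTA parser uses as `record.id`. The sequence lines in the sample are placeholders.

```python
from collections import Counter
from typing import Dict, Iterable


def duplicate_fasta_ids(lines: Iterable[str]) -> Dict[str, int]:
    """Return {record_id: count} for every ID that occurs more than once.

    The ID is the first whitespace-delimited token after '>'.
    """
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            counts[fields[0] if fields else ""] += 1
    return {rec_id: n for rec_id, n in counts.items() if n > 1}


# The two headers reported in this issue; the sequences are placeholders.
sample = [
    ">TemPhD_cluster_4683480 1299_2654",
    "MKV",
    ">TemPhD_cluster_4683480 1_1272",
    "MST",
]
print(duplicate_fasta_ids(sample))
```

Passing an open file handle for `prodigal-gv_aas_tmp.fasta` instead of `sample` should work the same way, since the function only iterates over lines.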

That would fit, because my FASTA file contains, for instance:

(pharokka_env) [aegorov@cn001 Phages]$ grep "4683" FAILED_SEQS.fa 
>TemPhD_cluster_46833
>TemPhD_cluster_46834
>TemPhD_cluster_46835
>TemPhD_cluster_46836
>TemPhD_cluster_46837
>TemPhD_cluster_46838
>TemPhD_cluster_46839
>TemPhD_cluster_4683
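
The suspected collision can be illustrated with a short sketch. It assumes, without having checked pharokka's code, that a protein ID is built by writing a number directly after the contig ID with no separator. `naive_gene_id`, `ambiguous_contig_pairs`, and the gene numbers 480 and 80 are all hypothetical and chosen only to reproduce the key from the traceback.

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def naive_gene_id(contig_id: str, gene_number: int) -> str:
    """Hypothetical naming scheme: contig ID followed directly by a number."""
    return f"{contig_id}{gene_number}"


def ambiguous_contig_pairs(contig_ids: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (shorter, longer) pairs where longer == shorter + digits.

    Such pairs can yield identical IDs under the naive scheme above.
    """
    ids = sorted(set(contig_ids), key=len)
    pairs = []
    for shorter, longer in combinations(ids, 2):
        extra = longer[len(shorter):]
        if longer.startswith(shorter) and extra.isdigit():
            pairs.append((shorter, longer))
    return pairs


# Contig IDs from the grep output above.
contigs = ["TemPhD_cluster_4683"] + [f"TemPhD_cluster_4683{d}" for d in range(3, 10)]

a = naive_gene_id("TemPhD_cluster_4683", 480)   # hypothetical gene number
b = naive_gene_id("TemPhD_cluster_46834", 80)   # hypothetical gene number
print(a == b, a)  # True TemPhD_cluster_4683480

for shorter, longer in sorted(ambiguous_contig_pairs(contigs)):
    print(shorter, "<->", longer)
```

Under that assumption, any input in which one contig ID equals another contig ID plus trailing digits is at risk, even though every contig ID is unique.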

Anyway, is there anything I can do to avoid such exceptions?
Thanks in advance.

Best,
Artyom

@art-egorov
Author

Small update: you can avoid the error by adding a non-numeric suffix to each contig ID. Still, it would be nice to be able to run on any set of unique IDs.
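
A minimal sketch of that workaround, for anyone who hits the same error: the helper below appends a suffix to the ID in every FASTA header before the file is passed to pharokka. `add_id_suffix` and the `_ctg` suffix are hypothetical names picked for illustration; any suffix that ends in a non-digit should serve the same purpose.

```python
from typing import Iterable, Iterator


def add_id_suffix(lines: Iterable[str], suffix: str = "_ctg") -> Iterator[str]:
    """Yield FASTA lines with `suffix` appended to each record ID.

    Only the first token of a header changes; any description is kept.
    """
    if not suffix or suffix[-1].isdigit():
        raise ValueError("suffix must be non-empty and end in a non-digit")
    for line in lines:
        text = line.rstrip("\n")
        if text.startswith(">"):
            parts = text[1:].split(None, 1)
            rec_id = parts[0] if parts else ""
            description = f" {parts[1]}" if len(parts) > 1 else ""
            yield f">{rec_id}{suffix}{description}\n"
        else:
            yield text + "\n"


# Placeholder records using two contig IDs from this issue.
example = [
    ">TemPhD_cluster_4683\n",
    "ACGT\n",
    ">TemPhD_cluster_46834 some description\n",
    "GGCC\n",
]
print("".join(add_id_suffix(example)), end="")
```

To rewrite a whole file, the generator can be fed an open input handle and its output passed to `writelines` on an output handle. The original IDs can be recovered afterwards by stripping the suffix.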

@gbouras13
Owner

I fully agree, @art-egorov. The issue is with phanotate, which makes the output hard to parse, and I didn't know a smarter way back when I originally coded pharokka.

When I have some dev time, I'll try to think of a better solution.

George

@gbouras13 added the bug label Oct 7, 2024