[BUG] Annotation fails, cause mysterious #323

Open
dlhuseby29 opened this issue Oct 1, 2024 · 28 comments

Comments

@dlhuseby29

Describe the bug
I can successfully annotate the test genome, and I can run the annotation process on my own sequence, as long as I run it from the command line in the 'for-my-own-use' way, so errors are ignored. As soon as I run it using the yaml file inputs, it fails within a few minutes. I have stripped down the yaml files to the minimum, but it still doesn't complete the annotation.

Expected behavior
I expect the annotation process to complete without errors --or if it fails to at least fail in a way such that I can tell what has gone wrong and fix it.

Software versions (please complete the following information):
iOS 13.6.9
pgap.py 2024-07-18.build7555
Docker version 27.2.0, build 3ab4256

Log Files
I've attached the full cwltool.log, but the first permanentFail is here:

[2024-10-01 19:17:12] DEBUG [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] initial work dir {}
[2024-10-01 19:17:12] INFO [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] /pgap/output/debug/tmp-outdir/_g21qgzc$ xml_evaluate
-input
/pgap/output/debug/tmpdir/aviqz9nx/stg49659d99-b614-4b88-8754-a66ec8ce1077/sequences.val
-xpath-fail
'//*[
( @Severity="ERROR" or @Severity="REJECT" )
and not(contains(@code, "GENERIC_MissingPubRequirement"))
and not(contains(@code, "SEQ_DESCR_ChromosomeLocation"))
and not(contains(@code, "SEQ_DESCR_MissingLineage"))
and not(contains(@code, "SEQ_DESCR_NoTaxonID"))
and not(contains(@code, "SEQ_DESCR_OrganismIsUndefinedSpecies"))
and not(contains(@code, "SEQ_DESCR_StrainWithEnvironSample"))
and not(contains(@code, "SEQ_DESCR_BacteriaMissingSourceQualifier"))
and not(contains(@code, "SEQ_DESCR_UnwantedCompleteFlag"))
and not(contains(@code, "SEQ_FEAT_BadCharInAuthorLastName"))
and not(contains(@code, "SEQ_FEAT_ShortIntron"))
and not(contains(@code, "SEQ_INST_InternalNsInSeqRaw"))
and not(contains(@code, "SEQ_INST_ProteinsHaveGeneralID"))
and not(contains(@code, "SEQ_PKG_NucProtProblem"))
and not(contains(@code, "SEQ_PKG_ComponentMissingTitle"))
]
' > /pgap/output/debug/tmp-outdir/_g21qgzc/initial_asnval_diag.xml
[2024-10-01 19:17:12] DEBUG Could not collect memory usage, job ended before monitoring began.
[2024-10-01 19:17:12] WARNING [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] exited with status: 1
[2024-10-01 19:17:12] WARNING [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] completed permanentFail

Additional context
The final message in the crash is always that it hates my name for some reason?

Failer nodes:

Bad last name 'Lastname'

Bad first name 'Firstname'

Any suggestions of how to get around this would be really appreciated.

Kind regards,

cwltool.log

@azat-badretdin
Contributor

Thank you for your report, user @dlhuseby29 !

Kudos for looking into permanentFail in cwltool.log. Right move!

The output says

Bad last name 'Lastname'

Bad first name 'Firstname'

This is from our QA which needed to prevent users from submitting genomes to Genbank under the default name from our examples.

I would recommend using a different fictitious name in the input file.

@dlhuseby29
Author

It gave me this name error even when I had a submol yaml file with my name in it....

It seems like there is something going wrong with the processing of the yaml files, since it appears that I can successfully run the pipeline with the following command:

./pgap.py -r --debug -o test_annotation_2 -g EN1740_complete.fasta -s 'Pseudomonas aeruginosa'

But not at all if I use this command:

./pgap.py -r --debug -o result input.yaml

With extremely stripped down input and submol yaml files:

fasta:
  class: File
  location: EN1740_complete.fasta
submol:
  class: File
  location: submol2.yaml

and

organism:
    genus_species: Pseudomonas aeruginosa

@azat-badretdin
Contributor

> It gave me this name error even when I had a submol yaml file with my name in it....

Could you please post the relevant portion of cwltool.log file (as you did before) for this case?

Meanwhile, if the -g/-s command-line method works for you, feel free to use it as a workaround.

@azat-badretdin
Contributor

Also: does it work with the Quick Start example using the -s/-g option combo?

@azat-badretdin
Contributor

I am confused. The cwltool.log you attached does not complain about "Bad name". Instead, it ends with a failure at xml_evaluate.

Please have a look at initial_asnval_diag.xml in the output directory. Is it there? If so, it may hold additional clues: look for messages containing ERROR.

@dlhuseby29
Author

Yes, I have confused the issue.

The bad name is just a side error that pops up every time I run it which makes it seem like the problem I am having is with the name. I added this below the additional context heading, but I realize that this was confusing.

The actual problem is posted above that and it is that whenever I have any yaml files as inputs, the annotation fails within the first 20 minutes or so. If I just run everything from the command line with a manual input of the fasta file and no metadata besides the organism, everything seems to run fine.

Sorry for the confusing post.

@dlhuseby29
Author

I just had it fail in this way using the yaml file inputs. I have posted the cwltool.log for this failure.
cwltool.log

@azat-badretdin
Contributor

Please have a look at initial_asnval_diag.xml in the output directory. Is it there? If so, it may hold additional clues: look for messages containing ERROR.

@dlhuseby29
Author

I don't see that file in the output folder of any of these runs.

@azat-badretdin
Contributor

Do you have any .xml files?

@azat-badretdin
Contributor

Could you please post a listing of the output folder?

@dlhuseby29
Author

dlhuseby29 commented Oct 2, 2024

Here is a screenshot of everything in there and a zipped version of the entire folder.

Archive.zip
Screenshot 2024-10-02 at 14 46 15

@azat-badretdin
Contributor

OK, you already have a debug/ folder. This is good. Please find this file using

find debug/ -type f -name initial_asnval_diag.xml

and post the output of `grep ERROR` run on that file here.

Thanks!
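
The two steps above (locate the file, then search it for errors) can be combined into one. This is a minimal sketch, assuming a POSIX shell and that it is run from the PGAP output directory; the function name `find_asnval_errors` is made up for illustration and is not part of PGAP:

```shell
# Hypothetical helper: print every ERROR line from any initial_asnval_diag.xml
# found under the given directory (defaults to ./debug).
find_asnval_errors() {
    # -H prefixes each match with the file it came from, because cwltool
    # uses random tmp-outdir names and several copies may exist.
    find "${1:-debug}" -type f -name initial_asnval_diag.xml \
        -exec grep -H 'ERROR' {} +
}
```

Usage would be `find_asnval_errors debug/`. When posting the result in an issue, wrapping it in code markup keeps the XML tags from being stripped.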

@dlhuseby29
Author

Maybe I did this wrong, but...

(base) xxxxxxx@UUC-02V8279HTD5 alpkek6w % grep ERROR initial_asnval_diag.xml
Bad last name 'Lastname'
Bad first name 'Firstname'

The whole file is attached (with txt appended for upload purposes), because maybe I did the search wrong.
initial_asnval_diag.txt

@azat-badretdin
Contributor

Thanks. I just found it myself as well, using your Archive.zip link (sorry, I did not notice it right away).

gpipedev21:debug$ less ./debug/tmp-outdir/alpkek6w/initial_asnval_diag.xml
Failer nodes:
<?xml version="1.0" encoding="UTF-8"?>
<message severity="ERROR" seq-id="lcl|contig001" code="GENERIC_BadSubmissionAuthorName">Bad last name 'Lastname'</message>

<?xml version="1.0" encoding="UTF-8"?>
<message severity="ERROR" seq-id="lcl|contig001" code="GENERIC_BadSubmissionAuthorName">Bad first name 'Firstname'</message>

This is becoming interesting.

So what happens when you run the example in Quick Start? It should have the same problem. In the absence of submission parameters for authors, it falls back to the same Lastname/Firstname combo.

Also:

You posted the cwltool.log snippet without wrapping it in "code" markup. As a result, all XML elements were stripped from the cwltool.log output, leaving only their text content. This caused confusion: I did not immediately realize that these "Bad last name" lines were reported in the form of XML.
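
To see which messages would actually trip the `-xpath-fail` expression from the first log excerpt, the filter can be approximated with two `grep` passes. This is only a rough sketch: the real check is an XPath evaluated by xml_evaluate, whereas this version is line-oriented and assumes one `<message ...>` element per line, as in the output shown above. The function name `unignored_errors` is hypothetical.

```shell
# Hypothetical helper: list ERROR/REJECT messages whose code is NOT on the
# ignore list taken from the -xpath-fail expression in cwltool.log.
unignored_errors() {
    ignored='GENERIC_MissingPubRequirement|SEQ_DESCR_ChromosomeLocation'
    ignored="$ignored|SEQ_DESCR_MissingLineage|SEQ_DESCR_NoTaxonID"
    ignored="$ignored|SEQ_DESCR_OrganismIsUndefinedSpecies"
    ignored="$ignored|SEQ_DESCR_StrainWithEnvironSample"
    ignored="$ignored|SEQ_DESCR_BacteriaMissingSourceQualifier"
    ignored="$ignored|SEQ_DESCR_UnwantedCompleteFlag"
    ignored="$ignored|SEQ_FEAT_BadCharInAuthorLastName|SEQ_FEAT_ShortIntron"
    ignored="$ignored|SEQ_INST_InternalNsInSeqRaw|SEQ_INST_ProteinsHaveGeneralID"
    ignored="$ignored|SEQ_PKG_NucProtProblem|SEQ_PKG_ComponentMissingTitle"
    # First pass keeps ERROR/REJECT lines; second pass drops ignored codes.
    grep -E '[Ss]everity="(ERROR|REJECT)"' "$1" |
        grep -Ev "code=\"[^\"]*($ignored)"
}
```

Run against the file from this issue, `unignored_errors initial_asnval_diag.xml` would be expected to print the two GENERIC_BadSubmissionAuthorName lines, since that code is not on the ignore list.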

@dlhuseby29
Author

Sorry about that, I know just enough of this stuff to get myself into trouble, but not really enough to solve any problems. I will try to do the code markup next time.

If I run this as a quick start annotation, it runs cleanly without any trouble. Here is a screenshot of the output folder:
Screenshot 2024-10-02 at 14 50 52

@dlhuseby29
Author

And as I said, this is the entire text of the submol.yaml:

organism:
    genus_species: Pseudomonas aeruginosa

I've run it with other versions of the submol.yaml, but everything seems to generate a similar crash.

@azat-badretdin
Contributor

When you run it with the -g/-s options, you do not need any submol.yaml file.

Did you try


./pgap.py -r -o mg37_results -g $HOME/.pgap/test_genomes/MG37/ASM2732v1.annotation.nucleotide.1.fasta -s 'Mycoplasmoides genitalium'

the example from Quick Start? This example is part of our regular testing of software before the release.

I am failing to see what specifically differs between this example and yours.

@dlhuseby29
Author

dlhuseby29 commented Oct 2, 2024

Yes, I have tried the test annotation, and it works fine.

If I annotate this genome (my data) using the command line (-g/-s), it also works fine.

If I annotate this genome with an input of yaml files, it fails.

I'm currently running a '-g/-s' annotation on one of my fasta files and it has been happily churning for about an hour. If I input a yaml file it dies within 10 minutes or so.

@azat-badretdin
Contributor

> If I annotate this genome with an input of yaml files, it fails.

OK. So let's work on this one. This is better. For this case, I think you have to specify the correct first and last name. If you already did that, as you said, and it still failed, I need both input.yaml and submol.yaml (please post them in "code" markup).

Thanks!

@dlhuseby29
Author

input.yaml

fasta:
  class: File
  location: EN1740_complete.fasta
submol:
  class: File
  location: submol.yaml

submol.yaml

organism:
    genus_species: Pseudomonas aeruginosa 

These are the files that I have been trying to use. Originally, I had a submol.yaml with more metadata, but if the basic form above doesn't work, then the longer version certainly won't work?

Anyway, here is the more complete version:

submol2.yaml

topology: 'circular'
organism:
    genus_species: 'Pseudomonas aeruginosa' 
    strain: 'EN1740'
contact_info:
    last_name: 'Huseby'
    first_name: 'Douglas'
    email: '[email protected]'
    organization: 'Uppsala University'
    department: 'Department of Medical Biochemistry and Microbiology'
    street: 'Husargatan 3, Box 582'
    city: 'Uppsala'
    postal_code: '75124'
    country: 'Sweden'

@azat-badretdin
Contributor

Thanks

Could you please post the relevant portion of the cwltool.log output? Does it complain about the last name Lastname or about your own name?

Also, if the failure persists, I recommend following the example of the submol file that comes in the test_genomes/ directories and including an authors: section in your submol.yaml file.

@dlhuseby29
Author

I'm not sure if you wanted me to actually find it in the cwltool.log or not, since it isn't super obvious there. I've attached that file so you can see if you can find anything.

Finding the initial_asnval_diag.xml file in the debug folder had the same thing in it though:

Failer nodes:
<?xml version="1.0" encoding="UTF-8"?>
<message severity="ERROR" seq-id="lcl|contig001" code="GENERIC_BadSubmissionAuthorName">Bad last name 'Lastname'</message>

<?xml version="1.0" encoding="UTF-8"?>
<message severity="ERROR" seq-id="lcl|contig001" code="GENERIC_BadSubmissionAuthorName">Bad first name 'Firstname'</message>

[cwltool.log](https://github.com/user-attachments/files/17231854/cwltool.log)

@azat-badretdin
Contributor

azat-badretdin commented Oct 2, 2024

I think what is happening is that you need to fill in both the contact_info and authors sections of submol.yaml. Could you please try this?

@dlhuseby29
Author

dlhuseby29 commented Oct 2, 2024

So I took your suggestion and grabbed the submol.yaml from the MG37 test folder.

I ran it as-is only changing the organism name. This seemed to run as if it was going to do the whole annotation. I didn't run it the full time, but if it fails, it actually usually fails in under 5 minutes and this went for 15 minutes before I killed it.

If I delete the 'author' portion and everything after in this submol.yaml, I recreate the error and it fails in 3 minutes due to a BadSubmissionAuthor error.

@azat-badretdin
Contributor

It looks like there is a problem in the YAML input scenario when only contact_info: is specified but authors: is not.

So the workaround is to always specify both.
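
A submol.yaml following this workaround might be generated as below. This is only a sketch: the field names are copied from the files posted earlier in this thread, every value is a placeholder, and the helper name `write_submol` is made up for illustration. Whether all contact_info fields are required has not been verified here.

```shell
# Hypothetical helper: write a submol.yaml carrying BOTH contact_info and
# authors. All values are placeholders; replace them with real ones, and do
# not use 'Firstname'/'Lastname', which the validator rejects.
write_submol() {
    cat > "${1:-submol.yaml}" <<'EOF'
organism:
    genus_species: 'Pseudomonas aeruginosa'
contact_info:
    last_name: 'Doe'
    first_name: 'Jane'
    email: 'jane.doe@example.org'
    organization: 'Example University'
    department: 'Example Department'
    street: '1 Example Street'
    city: 'Exampletown'
    postal_code: '00000'
    country: 'Sweden'
authors:
    - author:
        first_name: 'Jane'
        last_name: 'Doe'
EOF
}
```

After `write_submol submol.yaml`, the run itself is unchanged: `./pgap.py -r --debug -o result input.yaml`, with input.yaml pointing at this submol.yaml.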

@dlhuseby29
Author

dlhuseby29 commented Oct 2, 2024

Looks like I don't even have to do that much. This worked for the submol.yaml:

organism:
    genus_species: Pseudomonas aeruginosa 
authors:
    -     author:
            first_name: 'Arnold'
            last_name: 'Schwarzenegger'

And it would not work with just the organism specified, so it seems like specifying an author is an absolute requirement now?

So, is this just a 'me' problem?

Thanks for working through this with me. Such an oddly specific problem.

@azat-badretdin
Contributor

> And it would not work with just the organism specified, so it seems like specifying an author is an absolute requirement now?

It looks this way. Without this field, Dr. Firstname Lastname sneaks into your submissions and that triggers our validation guards that do not like this hyperactive scientist.
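
Because the failure only shows up a few minutes into a run, a quick pre-flight check can catch a missing authors: section before starting. This is a rough sketch, assuming authors: sits at the top level of the file as in the working example above; it is a plain-text match rather than a real YAML parse, and `check_submol_authors` is a hypothetical name, not a PGAP command:

```shell
# Hypothetical pre-flight check: succeed only if the submol file has a
# top-level "authors:" key (simple text match, not a YAML parser).
check_submol_authors() {
    if grep -q '^authors:' "$1"; then
        echo "ok: authors section found in $1"
    else
        echo "missing authors section in $1" >&2
        return 1
    fi
}
```

It could be chained in front of the run, for example `check_submol_authors submol.yaml && ./pgap.py -r --debug -o result input.yaml`.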
