[BUG] Annotation fails, cause mysterious #323
Comments
Thank you for your report, user @dlhuseby29 ! Kudos for looking into The output says
This is from our QA, which needed to prevent users from submitting genomes to GenBank under the default name from our examples. I would recommend coming up with a different fictitious name in the input file. |
It gave me this name error even when I had a submol yaml file with my name in it.... It seems like there is something going wrong with the processing of the yaml files, since it appears that I can successfully run the pipeline with the following command:
./pgap.py -r --debug -o test_annotation_2 -g EN1740_complete.fasta -s 'Pseudomonas aeruginosa'
But not at all if I use this command:
./pgap.py -r --debug -o result input.yaml
With extremely stripped down input and submol yaml files: fasta: and organism: |
Could you please post the relevant portion of Meanwhile, if input.yaml method works for you feel free to use it as workaround. |
Also: does it work with Quick Start example for -s/-g option combo? |
I am confused. The Please have a look at |
Yes, I have confused the issue. The bad name is just a side error that pops up every time I run it which makes it seem like the problem I am having is with the name. I added this below the additional context heading, but I realize that this was confusing. The actual problem is posted above that and it is that whenever I have any yaml files as inputs, the annotation fails within the first 20 minutes or so. If I just run everything from the command line with a manual input of the fasta file and no metadata besides the organism, everything seems to run fine. Sorry for the confusing post. |
I just had it fail in this way using the yaml file inputs. I have posted the cwltool.log for this failure. |
Please have a look at initial_asnval_diag.xml in the output directory. Is it there? If yes, it might have additional clues: messages with ERROR in them |
I don't see that file in the output folder of any of these runs. |
Do you have any .xml files? |
Could you please post the listing of output folder? |
Here is a screenshot of everything in there and a zipped version of the entire folder. |
OK, you have a debug/ folder already. This is good. Please
and post `grep ERROR ' output here. Thanks! |
Maybe I did this wrong, but... (base) xxxxxxx@UUC-02V8279HTD5 alpkek6w % grep ERROR initial_asnval_diag.xml The whole file is attached (with txt appended for upload purposes), because maybe I did the search wrong. |
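As an alternative to grep, here is a minimal Python sketch of the same check. It assumes the diagnostics file is well-formed XML and that messages carry `Severity` and `code` attributes, which are the names used in the `-xpath-fail` expression in the cwltool.log excerpt from this issue. The element names, the `HYPOTHETICAL_*` codes and the message text in the sample are made up for illustration, and `find_blocking_messages` is a hypothetical helper, not part of PGAP.

```python
import xml.etree.ElementTree as ET


def find_blocking_messages(xml_text, ignored_codes=()):
    """Return (severity, code, text) for every ERROR/REJECT element.

    Attribute names are compared case-insensitively because the exact
    spelling in the real file is an assumption.
    """
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        attrs = {key.lower(): value for key, value in elem.attrib.items()}
        severity = attrs.get("severity")
        if severity not in ("ERROR", "REJECT"):
            continue
        code = attrs.get("code", "")
        # Mirror the not(contains(@code, "...")) clauses of the XPath filter.
        if any(ignored in code for ignored in ignored_codes):
            continue
        hits.append((severity, code, (elem.text or "").strip()))
    return hits


# Made-up sample; the real file layout may differ.
sample = """<validator>
  <message Severity="ERROR" code="SEQ_DESCR_NoTaxonID">ignored by the filter</message>
  <message Severity="ERROR" code="HYPOTHETICAL_Code">hypothetical message text</message>
  <message Severity="WARNING" code="HYPOTHETICAL_Other">not blocking</message>
</validator>"""

hits = find_blocking_messages(sample, ignored_codes=("SEQ_DESCR_NoTaxonID",))
for severity, code, text in hits:
    print(severity, code, text)
```

To run it on a real file, read the file contents into a string and pass it in place of `sample`.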
Thanks. I just found it myself as well, using your Archive.zip link (sorry did not notice right away)
This is becoming interesting. So what happens when you run the example in Quick Start? It should have the same problem. In the absence of submission parameters for authors, it resorts to the same lastname/firstname combo.
Also: you posted the cwltool.log snippet without including it in "code" markup. This resulted in the removal of all XML elements from the cwltool.log output except their content. That caused confusion: I did not immediately realize that the report of these "bad last" lines came in the form of XML. |
And as I said, this is the entire text of the submol.yaml: organism: I've run it with other versions of the submol.yaml, but everything seems to generate a similar crash. |
When you run it with -g/-s options you do not need any submol.yaml files. Did you try
the example from Quick Start? This example is part of our regular testing of software before the release. I am failing to catch what specifically changes between this example and your example |
Yes, I have tried the test annotation, and it works fine. If I annotate this genome (my data) using the command line (-g/-s), it also works fine. If I annotate this genome with an input of yaml files, it fails. I'm currently running a '-g/-s' annotation on one of my fasta files and it has been happily churning for about an hour. If I input a yaml file it dies within 10 minutes or so. |
OK. So let's work on this one. This is better. For this case I think you have to specify the correct first and last name and if you did that already as you said and failed, I need both input.yaml and submol.yaml (please post them as "code" markup) Thanks! |
input.yaml
submol.yaml
These are the files that I have been trying to use. Originally, I had a submol.yaml with more metadata, but if the basic form above doesn't work, then the longer version certainly won't work? Anyway, here is the more complete version: submol2.yaml
|
Thanks. Could you please post the relevant portion of the Also, if the failure persists, I would recommend following the example of the submol that comes in the test_genomes/ directories and include |
I'm not sure if you wanted me to actually find it in the cwltool.log or not, since it isn't super obvious there. I've attached that file so you can see if you can find anything. Finding the initial_asnval_diag.xml file in the debug folder had the same thing in it though:
|
I think what happens is that you need to fill in both the contact info and authors sections of submol.yaml. Could you please try this? |
So I took your suggestion and grabbed the submol.yaml from the MG37 test folder. I ran it as-is only changing the organism name. This seemed to run as if it was going to do the whole annotation. I didn't run it the full time, but if it fails, it actually usually fails in under 5 minutes and this went for 15 minutes before I killed it. If I delete the 'author' portion and everything after in this submol.yaml, I recreate the error and it fails in 3 minutes due to a BadSubmissionAuthor error. |
It looks like there is a problem in the YAML input scenario when only contact_info: is specified but not authors. So the workaround is to always specify them both. |
Looks like I don't even have to do that much. This worked for the submol.yaml:
And it would not work with just the organism specified, so it seems like specifying an author is an absolute requirement now? So, is this just a 'me' problem? Thanks for working through this with me. Such an oddly specific problem. |
It looks this way. Without this field, Dr. Firstname Lastname sneaks into your submissions and that triggers our validation guards that do not like this hyperactive scientist. |
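The conclusion of this thread can be turned into a quick pre-flight check before launching a long run. The sketch below is not part of PGAP. It assumes submol.yaml has already been parsed into a dictionary, and that the `authors` section is a list of `author` entries with `first_name` and `last_name` keys, modelled on the submol.yaml in the test_genomes/ directories. That key layout and the helper name `missing_author_fields` are assumptions for illustration.

```python
def missing_author_fields(submol):
    """Return a list of problems with the authors section of a parsed submol.

    The key layout (authors -> author -> first_name/last_name) is an
    assumption modelled on the test_genomes example, not a documented schema.
    """
    authors = submol.get("authors")
    if not authors:
        return ["no 'authors' section"]

    problems = []
    for index, entry in enumerate(authors):
        author = entry.get("author") if isinstance(entry, dict) else None
        if not isinstance(author, dict):
            problems.append(f"authors[{index}] has no 'author' mapping")
            continue
        for key in ("first_name", "last_name"):
            if not author.get(key):
                problems.append(f"authors[{index}] is missing '{key}'")
    return problems


# A submol with only the organism, as in the failing runs.
organism_only = {"organism": {"genus_species": "Pseudomonas aeruginosa"}}
print(missing_author_fields(organism_only))

# The same submol with one author added (names are placeholders).
with_author = dict(
    organism_only,
    authors=[{"author": {"first_name": "Jane", "last_name": "Doe"}}],
)
print(missing_author_fields(with_author))
```

With PyYAML installed, the dictionary could come from `yaml.safe_load(open("submol.yaml"))`. An empty result only means the section is present, not that the validator will accept the names.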
Describe the bug
I can successfully annotate the test genome, and I can run the annotation process on my own sequence, as long as I run it from the command line in the 'for-my-own-use' way, so errors are ignored. As soon as I run it using the yaml file inputs, it fails within a few minutes. I have stripped down the yaml files to the minimum, but it still doesn't complete the annotation.
Expected behavior
I expect the annotation process to complete without errors --or if it fails to at least fail in a way such that I can tell what has gone wrong and fix it.
Software versions (please complete the following information):
iOS 13.6.9
pgap.py 2024-07-18.build7555
Docker version 27.2.0, build 3ab4256
Log Files
I've attached the full cwltool.log, but the first permanentFail is here:
[2024-10-01 19:17:12] DEBUG [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] initial work dir {}
[2024-10-01 19:17:12] INFO [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] /pgap/output/debug/tmp-outdir/_g21qgzc$ xml_evaluate
-input
/pgap/output/debug/tmpdir/aviqz9nx/stg49659d99-b614-4b88-8754-a66ec8ce1077/sequences.val
-xpath-fail
'//*[
( @Severity="ERROR" or @Severity="REJECT" )
and not(contains(@code, "GENERIC_MissingPubRequirement"))
and not(contains(@code, "SEQ_DESCR_ChromosomeLocation"))
and not(contains(@code, "SEQ_DESCR_MissingLineage"))
and not(contains(@code, "SEQ_DESCR_NoTaxonID"))
and not(contains(@code, "SEQ_DESCR_OrganismIsUndefinedSpecies"))
and not(contains(@code, "SEQ_DESCR_StrainWithEnvironSample"))
and not(contains(@code, "SEQ_DESCR_BacteriaMissingSourceQualifier"))
and not(contains(@code, "SEQ_DESCR_UnwantedCompleteFlag"))
and not(contains(@code, "SEQ_FEAT_BadCharInAuthorLastName"))
and not(contains(@code, "SEQ_FEAT_ShortIntron"))
and not(contains(@code, "SEQ_INST_InternalNsInSeqRaw"))
and not(contains(@code, "SEQ_INST_ProteinsHaveGeneralID"))
and not(contains(@code, "SEQ_PKG_NucProtProblem"))
and not(contains(@code, "SEQ_PKG_ComponentMissingTitle"))
]
' > /pgap/output/debug/tmp-outdir/_g21qgzc/initial_asnval_diag.xml
[2024-10-01 19:17:12] DEBUG Could not collect memory usage, job ended before monitoring began.
[2024-10-01 19:17:12] WARNING [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] exited with status: 1
[2024-10-01 19:17:12] WARNING [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] completed permanentFail
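For anyone hunting for the same spot in a long cwltool.log, here is a minimal Python sketch that returns the first job marked permanentFail. It relies only on the line format visible in the excerpt above (`[timestamp] WARNING [job NAME] completed permanentFail`). Whether every cwltool version logs failures in exactly this form is an assumption, and `first_permanent_fail` is a hypothetical helper name.

```python
import re

# Matches lines such as:
# [2024-10-01 19:17:12] WARNING [job some_job] completed permanentFail
FAIL_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] WARNING \[job (?P<job>[^\]]+)\] completed permanentFail"
)


def first_permanent_fail(lines):
    """Return (line_number, timestamp, job_name) of the first failure, or None."""
    for number, line in enumerate(lines, start=1):
        match = FAIL_RE.match(line)
        if match:
            return number, match.group("timestamp"), match.group("job")
    return None


# The last three log lines quoted above, used as sample input.
sample_log = [
    "[2024-10-01 19:17:12] DEBUG Could not collect memory usage, job ended before monitoring began.",
    "[2024-10-01 19:17:12] WARNING [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] exited with status: 1",
    "[2024-10-01 19:17:12] WARNING [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] completed permanentFail",
]

result = first_permanent_fail(sample_log)
print(result)
```

For a real log, pass an open file object instead of `sample_log`, since iterating over it yields one line at a time.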
Additional context
The final message in the crash is always that it hates my name for some reason?
Failer nodes:
Bad last name 'Lastname'
Bad first name 'Firstname'
Any suggestions of how to get around this would be really appreciated.
Kind regards,
cwltool.log