How to prepare a novel strain/isolate of a bacteria? #71

Open
peterjc opened this issue Nov 24, 2016 · 11 comments

@peterjc
Contributor

peterjc commented Nov 24, 2016

(Some months back I did this successfully to submit a new strain from a different genus, so while I might be doing something wrong/different, I suspect the ENA validator has become stricter in the meantime)

For an un-named Serratia which does not (yet) have a unique NCBI taxonomy entry - the parent would be Serratia, taxid 613,

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=613&lvl=3&lin=f&keep=1&srchmode=1&unlock

I have tried that, and the entry Serratia sp., taxid 616

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=616&lvl=5&lin=f&keep=1&srchmode=1&unlock

$ gff3_to_embl --authors "Other A.N." -m "Serratia sp. XYZ annotated using Prokka." -g circular -c PROK -l XYZ -n 11 -f XYZ.embl "Serratia sp." 616 PRJEB00000 "Serratia sp. XYZ" XYZ.gff

Either taxid approach fails validation:

$ java -jar embl-api-validator-1.1.149.jar XYZ.embl
...
ERROR: Scientific_name "Serratia sp." is not submittable. (MasterEntrySourceCheck_2)  line: 1 of XYZ.embl
ERROR: At least one of the following qualifiers "strain, environmental_sample, isolate" must exist when organism belongs to Bacteria. (OrganismAndRequiredQualifierCheck)  line: 17 of XYZ.embl
...

Here line 17 was the source feature. Manually editing the EMBL file to add a strain qualifier to the feature worked for me, but what exactly it wants for species name eludes me.
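For anyone wanting to script the manual edit described above rather than do it by hand, here is a minimal Python sketch. It is not part of gff3_to_embl; the function name `add_source_qualifier` is hypothetical, and it assumes the usual EMBL flat-file layout (feature keys in column 6, qualifiers in column 22) and a value that contains no double quotes.

```python
import re

# Assumed EMBL layout: feature keys start in column 6, qualifiers in column 22.
FEATURE_KEY = re.compile(r"^FT {3}(\S+)")
QUALIFIER_INDENT = "FT" + " " * 19


def add_source_qualifier(embl_text, name, value):
    """Append /name="value" to every source feature that does not already have it."""
    new_line = f'{QUALIFIER_INDENT}/{name}="{value}"'
    out = []
    in_source = False
    has_qualifier = False
    for line in embl_text.splitlines():
        is_continuation = line.startswith(QUALIFIER_INDENT)
        if in_source and not is_continuation:
            # The source feature has ended; add the qualifier if it was missing.
            if not has_qualifier:
                out.append(new_line)
            in_source = False
        match = FEATURE_KEY.match(line)
        if match and match.group(1) == "source":
            in_source = True
            has_qualifier = False
        elif in_source and line[len(QUALIFIER_INDENT):].startswith(f"/{name}="):
            has_qualifier = True
        out.append(line)
    if in_source and not has_qualifier:
        # The file ended while still inside a source feature.
        out.append(new_line)
    return "\n".join(out) + "\n"
```

Calling `add_source_qualifier(text, "strain", "XYZ")` on the contents of XYZ.embl should give the same result as the hand edit. Running it a second time leaves the text unchanged, because a source feature that already carries the qualifier is skipped.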

Am I missing something simple?

[Update: Yes, I was not giving the full organism name to gff3_to_embl, but also there was a problem with this version of the validator]

Should gff3_to_embl have options for inserting source feature qualifiers "strain, environmental_sample, isolate" (or should I have done this in prokka)?

Thanks!

@andrewjpage
Member

Hi Peter,
A few months ago they blocked high-level taxa. Apparently they want you to use more specific taxa. For completely new species there's a chicken-and-egg problem. In the olden days every assembly got a new taxon ID (which is why there are nearly 2 million). However, NCBI (who assign taxon IDs) now demand a publication before they will grant one, so you have to use a temporary taxon, then update it later. It's quite convoluted.

As for strain, we submit using their API, so we have to provide a header in the EMBL file, which then gets overwritten with whatever metadata is in the BioSample. It's possible they have moved the goalposts again in the week since we last submitted data...

@peterjc
Contributor Author

peterjc commented Nov 24, 2016

Ah. My hunch was right, and yes - this is exactly the chicken-and-egg situation I am facing.

Could you elaborate on what you meant by using a temporary taxon?

@peterjc
Contributor Author

peterjc commented Nov 28, 2016

See enasequence/sequencetools#15

This error turned out to be with the validator's internal settings:

ERROR: Scientific_name "Serratia sp." is not submittable. (MasterEntrySourceCheck_2)

However, to avoid this error I currently need to manually edit the source feature in my EMBL file:

ERROR: At least one of the following qualifiers "strain, environmental_sample, isolate" must exist when organism belongs to Bacteria. (OrganismAndRequiredQualifierCheck)

Perhaps for people like me using the ENA webin (web interface), rather than the API, there needs to be an extra set of options on gff3_to_embl to record the strain, environmental sample or isolate fields?

[Update: Human error, see below - I was not giving the full organism name to gff3_to_embl]

@peterjc
Contributor Author

peterjc commented Nov 28, 2016

(I've not actually submitted this new sequence yet - but I intend to try using the genus level taxid as before)

@andrewjpage
Member

andrewjpage commented Nov 29, 2016

Hi Peter,
I can't replicate your error with the latest version of the validator. The following EMBL file validates fine (without a strain, environmental_sample or isolate qualifier). Might there be another issue somewhere?

ID   XXX; XXX; circular; genomic DNA; STD; PROK; 240 BP.
XX
AC   XXX;
XX
AC * _ERS111111SCcontig000001
XX
PR   Project:PRJEB1111;
XX
DE   XXX;
XX
RN   [1]
RA   Pathogen Genomics;
RT   "Draft assembly annotated with Prokka";
RL   Submitted (24-Nov-2016) to the INSDC.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..240
FT                   /organism="Staphylococcus aureus"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:1280"
FT                   /note="ERS11111|SC|contig000001"
FT   tRNA            143..218
FT                   /product="tRNA-Val(tac)"
FT                   /inference="COORDINATES:profile:Aragorn:1.2.36"
FT                   /locus_tag="SAMEA1111111_00001"
SQ   Sequence 240 BP; 60 A; 60 C; 60 G; 60 T; 0 other;
     aatctacatt catatgtctg gtgactatag caaggaggtc acacctgttc ccatgccgaa        60
     cacagaagtt aagctcctta gcgtcgatgg tagttggact tacgttccgc tagagtagaa       120
     cgttgccagg caatgataaa tcggagaatt agctcagctg ggagagcatc tgccttacaa       180
     gcagagggtc ggcggttcga acccgtcatt ctccaccatt tattcttaca tattgccggc       240
//

@peterjc
Contributor Author

peterjc commented Nov 29, 2016

If you could edit your example above on GitHub to wrap it in triple back-ticks, GitHub will render it as a code block, and preserve the white space (so I can copy and paste it for testing here).

I suspect the key difference is your example has a taxid for a full species name, Staphylococcus aureus taxon 1280.

What happens if you change the example to pretend you have a new species/strain without a pre-existing taxon id, say Staphylococcus sp. XYZ, and try either taxon 1279 (Staphylococcus) or 29387 (Staphylococcus sp.)?
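The substitution suggested above can be scripted for quick testing. The following is only a sketch: the helper name `retarget_source` is hypothetical, and it assumes the `/organism` and `/db_xref="taxon:..."` qualifiers each fit on a single line, as they do in the example file.

```python
import re


def retarget_source(embl_text, organism, taxid):
    """Rewrite the /organism and /db_xref="taxon:..." qualifiers in an EMBL file."""
    # Callables are used as replacements so the organism name is inserted
    # literally rather than being parsed as a regex replacement template.
    text = re.sub(
        r'(?m)^(FT +/organism=)"[^"\n]*"',
        lambda m: f'{m.group(1)}"{organism}"',
        embl_text,
    )
    text = re.sub(
        r'(?m)^(FT +/db_xref=)"taxon:\d+"',
        lambda m: f'{m.group(1)}"taxon:{taxid}"',
        text,
    )
    return text
```

For the experiment above this would be called as `retarget_source(text, "Staphylococcus sp. XYZ", 1279)`, or with 29387 for the Staphylococcus sp. entry, and the result passed to the validator.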

@andrewjpage
Member

Here's the file (as a file).
example_embl.txt

So the genus taxon 1279 (Staphylococcus) gets through the validator, but you'll get an email in a few days/weeks informing you that the 'computer says NO'.

@peterjc
Contributor Author

peterjc commented Nov 30, 2016

Confirmed using embl-api-validator-1.1.150.jar. Likewise using taxon 613 and Serratia sp. XYZ passes validation:

FT   source          1..240
FT                   /organism="Serratia sp. XYZ"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:613"

This was my problematic version:

FT   source          1..5090820
FT                   /organism="Serratia sp."
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:613"

I can pass validation by adding /strain="XYZ" (as mentioned above) or more simply by giving the full organism name as /organism="Serratia sp. XYZ". With hindsight this seems obvious; your example was very helpful, thank you.

So there were at least two problems: I was not telling gff3_to_embl the full organism name, and the version of the validator I was using was (wrongly) being too strict.
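To catch the first of these problems before running the validator, a rough pre-check along the following lines might help. It only encodes what was observed in this thread (a bare "Genus sp." organism with no strain, isolate or environmental_sample qualifier was rejected) and does not reproduce the validator's actual rules. The function names are hypothetical, and only single-line qualifiers in the first source feature are read.

```python
import re

# Assumed EMBL layout: qualifiers start in column 22, after "FT" and 19 spaces.
QUALIFIER = re.compile(r'^FT {19}/(\w+)(?:="?([^"]*)"?)?\s*$')
IDENTIFYING = ("strain", "isolate", "environmental_sample")


def source_qualifiers(embl_text):
    """Return the qualifiers of the first source feature as a dict."""
    quals = {}
    in_source = False
    for line in embl_text.splitlines():
        if line.startswith("FT   source"):
            in_source = True
            continue
        if in_source:
            match = QUALIFIER.match(line)
            if not match:
                break  # next feature, wrapped value, or end of feature table
            quals[match.group(1)] = match.group(2) or ""
    return quals


def needs_strain_hint(quals):
    """True if the organism is a bare 'Genus sp.' with no identifying qualifier."""
    organism = quals.get("organism", "")
    bare_genus_sp = re.fullmatch(r"\S+ sp\.", organism) is not None
    has_identifier = any(key in quals for key in IDENTIFYING)
    return bare_genus_sp and not has_identifier
```

If `needs_strain_hint(source_qualifiers(text))` returns True, the file resembles the problematic version shown above. Either giving the full organism name or adding a strain qualifier makes the check return False.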

I hope to submit this week, anticipating a query back about this being a novel species without a taxon ID. I will report back later with an update for future readers of this issue. Thanks!

@andrewjpage
Member

Good luck with your submission!

@peterjc
Contributor Author

peterjc commented Jan 25, 2017

Update on the ENA side of interest: http://listserver.ebi.ac.uk/pipermail/ena-announce/2017-January/000165.html

@andrewjpage
Member

Thanks
