Plasmid reads #9

Open
erinyoung opened this issue Mar 27, 2023 · 3 comments

@erinyoung

I would like to contribute to this effort, but I want to make sure that my methods are sound. I would love feedback and insight.

I think I can create a toy dataset for some plasmids containing AMR genes.

Here's my current plan:

  1. Assemble nanopore reads into genome with flye (or do hybrid assembly with unicycler) to create a closed genome
    • I'd be using existing assemblies, which are mostly Citrobacter and Acinetobacter, but there are some other organisms I could look into if needed
  2. Use minimap2 to map nanopore reads to the assembled genome
  3. Separate nanopore reads by plasmid
  4. Ensure that nanopore read subset re-assembles into a similar plasmid using flye (and perhaps other assemblers like raven?)
  5. Ensure that the nanopore fastq.gz files are "small enough" for github
  6. Use minimap2 to map illumina reads to the assembled genome
  7. Separate illumina reads by plasmid
  8. Ensure that the nanopore + illumina read subset still assembles with unicycler
  9. Ensure that the illumina fastq.gz files are "small enough" for github
  10. Add resultant files to this repo via a PR
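
As an illustration of the "separate reads by plasmid" steps, here is a minimal Python sketch, not part of the original plan as posted. It assumes the reads were mapped with minimap2 to PAF output (the default when `-a` is not given) and assigns each read to the contig of its best alignment. The function name `reads_by_contig`, the `min_mapq` cutoff of 20, and the contig names in the demo are all hypothetical choices made for this example.

```python
from collections import defaultdict


def reads_by_contig(paf_lines, min_mapq=20):
    """Group read names by the contig of their best alignment.

    PAF columns used (0-based): 0 = read name, 5 = target (contig) name,
    9 = number of matching bases, 11 = mapping quality.
    The min_mapq cutoff is an assumption for this sketch, not a recommendation.
    """
    best = {}  # read name -> (matching bases, contig)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        read, contig = fields[0], fields[5]
        matches, mapq = int(fields[9]), int(fields[11])
        if mapq < min_mapq:
            continue  # ambiguous placement, e.g. sequence shared between replicons
        if read not in best or matches > best[read][0]:
            best[read] = (matches, contig)

    groups = defaultdict(set)
    for read, (_, contig) in best.items():
        groups[contig].add(read)
    return dict(groups)


# Made-up PAF records for demonstration only.
demo = [
    "\t".join(["read1", "5000", "0", "5000", "+", "chromosome", "4800000",
               "100", "5100", "4900", "5000", "60"]),
    "\t".join(["read2", "3000", "0", "3000", "-", "plasmid_1", "85000",
               "10", "3010", "2950", "3000", "60"]),
    # read2 also has a weaker hit on the chromosome; the plasmid hit wins
    "\t".join(["read2", "3000", "0", "800", "+", "chromosome", "4800000",
               "900", "1700", "700", "800", "30"]),
    # read3 maps with MAPQ 0 and is dropped by the cutoff
    "\t".join(["read3", "2000", "0", "2000", "+", "plasmid_1", "85000",
               "500", "2500", "1900", "2000", "0"]),
]
print(reads_by_contig(demo))
```

The resulting read-name sets could then be used to pull the matching records out of the original fastq.gz files. Reads that span sequence shared between a plasmid and the chromosome may need different handling than the simple "best alignment wins" rule used here.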
@lskatz
Member

lskatz commented Mar 29, 2023

This is interesting, thank you! I think any contribution would be appreciated. I haven't yet updated the spec to say where we would host the datasets, but I will brainstorm more. For now, I think a dataset with accessions and perhaps AMR results would be the most helpful. Let me know!

@erinyoung
Author

erinyoung commented Mar 30, 2023

Don't thank me just yet.

I've attached a file that may be helpful to you.

There are six columns in this file:

  • Organism: predicted organism (many of which have changed since submission)
  • ID: ARLN ID of the isolate, in case googling is needed
  • Illumina SRA: SRA accession of paired-end Illumina reads
  • nanopore SRA: SRA accession of nanopore reads
  • Accessions: NCBI genome accessions of chromosome and plasmids
    • These are listed from initial accession to last accession (for example: CP118189-CP118194 actually means CP118189, CP118190, CP118191, CP118192, CP118193, and CP118194)
  • AMR: potential AMR gene located in sequence
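
To make the range notation in the Accessions column concrete, here is a small Python sketch that expands a range into individual accessions. It is only an illustration: the helper name `expand_accessions` is hypothetical, and it assumes both ends of a range share the same letter prefix and carry no version suffix (such as `.1`).

```python
import re

# Letters (and optional underscore) followed by digits, e.g. "CP118189".
_ACCESSION = re.compile(r"([A-Za-z_]+)(\d+)")


def expand_accessions(spec):
    """Expand 'CP118189-CP118194' into a list of individual accessions.

    A single accession with no '-' is returned unchanged in a one-item list.
    Assumes unversioned accessions with a shared prefix (sketch only).
    """
    spec = spec.strip()
    if "-" not in spec:
        return [spec]

    first, last = (part.strip() for part in spec.split("-", 1))
    m_first = _ACCESSION.fullmatch(first)
    m_last = _ACCESSION.fullmatch(last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError(f"cannot interpret accession range: {spec!r}")

    prefix = m_first.group(1)
    width = len(m_first.group(2))  # keep any zero padding
    start, end = int(m_first.group(2)), int(m_last.group(2))
    if end < start:
        raise ValueError(f"range runs backwards: {spec!r}")

    return [f"{prefix}{number:0{width}d}" for number in range(start, end + 1)]


print(expand_accessions("CP118189-CP118194"))
# ['CP118189', 'CP118190', 'CP118191', 'CP118192', 'CP118193', 'CP118194']
```

Which of the expanded accessions is the chromosome and which are plasmids is not encoded in the range itself and would still need to be checked against the NCBI records.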

There are some caveats to this file. It may contain assemblies or SRA accessions that are not yet publicly available. Also, some of these isolates may have their AMR gene on their chromosome as opposed to a plasmid. I wanted to vet these problems first, but I do not think that I'll have the time for that for a while.

I may come back and edit or filter this information in the future, but it's here in case it is useful in the meantime.

LR Seq of ARLN.csv

@gbouras13

Hi @erinyoung,

Just came across this issue while I was looking for some more benchmarking datasets for my tool Plassembler, which implements a good chunk of what you outline :) It doesn't go to the individual plasmid level, though.

It's still a work in progress for now, but I just thought I would share. Based on your comments, I think I'm going to implement a "--keep fastqs" flag now, so thanks for that, as others may find it useful!

https://github.com/gbouras13/plassembler

George


3 participants