remove the need for ncbi_assembly_metadata #172

Open
SchwarzMarek opened this issue Oct 2, 2024 · 1 comment
@SchwarzMarek
Description of feature

As discussed in #170, I suggest dropping the --ncbi_assembly_metadata requirement and obtaining the relevant assemblies directly from their assembly accessions.

Below is a python3 script that downloads assemblies by accession (using NCBI's API).

At the moment the script downloads fasta, gff and gbff (for my convenience); this can be adjusted to whatever bacass needs.

Possible interfaces:

  • python import of download function
  • cli (2 modes for my convenience, can be easily simplified)

dependencies:

  • urllib3 (could probably be rewritten to use the standard-library urllib)

result:

  • the downloaded data ends up under [target dir]/ncbi_dataset/data/ (this can be adjusted at the cost of added complexity)

known limitations:

  • works well with a low to medium number of assemblies; personally, I would keep this under 50 per request. Larger but still reasonable numbers could be handled in the script by iterating over chunks. If we aim for much larger numbers (thousands), I would advise using ncbi-datasets-cli (available e.g. from conda).
import urllib3
import sys
import zipfile
from io import BytesIO


def download(accs, target):
    url = (f"https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/"
           f"{','.join(accs)}/download?"
           f"include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GBFF&include_annotation_type=GENOME_GFF"
           f"&hydrated=FULLY_HYDRATED")

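    with urllib3.PoolManager() as http:
        resp = http.request("GET", url)  # the whole archive is read into memory
        if resp.status != 200:
            raise RuntimeError(f"NCBI request failed with HTTP status {resp.status}")

        # Unpack the in-memory zip archive into the target directory.
        with zipfile.ZipFile(BytesIO(resp.data)) as z:
            z.extractall(target)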


if __name__ == "__main__":
    if len(sys.argv) < 4:
        print('USAGE: python3 dbkref.py accs [TARGET DIR] SPACE DELIMITED ASSEMBLY ACCESSIONS')
        print('USAGE: python3 dbkref.py kmerfinder_summary [TARGET DIR] PATH_TO_KMERFINDER_SUMMARY_FILE')
        raise ValueError('Invalid input, please see usage.')
        
    target = sys.argv[2]
    if sys.argv[1] == 'accs':
        accs = set(sys.argv[3:])
        download(accs, target)
    elif sys.argv[1] == 'kmerfinder_summary':
        kmersumm = sys.argv[3]
        with open(kmersumm) as f:
            _ = f.readline()  # ditch first line
            accs = {line.split(',')[1] for line in f if line.strip()}
        download(accs, target)
    else:
        raise ValueError('invalid mode choice, valid are "kmerfinder_summary" or "accs"')
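
The chunked iteration mentioned under known limitations is not part of the script above. A minimal sketch of how it might look follows, assuming a batch size of 50 (taken from the "under 50 per request" advice) and two hypothetical helpers, `chunked` and `download_in_batches`; the `download` argument stands for the function from the script.

```python
from itertools import islice

BATCH_SIZE = 50  # assumption: follows the "under 50 per request" advice


def chunked(accessions, size=BATCH_SIZE):
    """Yield sorted, de-duplicated accessions in lists of at most `size`."""
    it = iter(sorted(set(accessions)))
    while batch := list(islice(it, size)):
        yield batch


def download_in_batches(accessions, target, download, size=BATCH_SIZE):
    """Call download(batch, target) once per batch and return the batch count.

    `download` is expected to behave like the download() function in the
    script, i.e. it takes an iterable of accessions and a target directory.
    """
    n_batches = 0
    for batch in chunked(accessions, size):
        download(batch, target)
        n_batches += 1
    return n_batches
```

Note that every batch is extracted into the same target directory, so metadata files shipped at the top level of each archive may be overwritten by later batches; only the per-assembly subdirectories are expected to accumulate.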

If you are interested in this, could you please check whether urllib3 is available in the bacass python3 environment?
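
In case urllib3 turns out not to be available, the same request could probably be made with the standard library alone. The sketch below is untested against the live API and keeps the endpoint and query parameters from the script; `build_url` and the `opener` parameter are hypothetical additions, the latter only so the function can be exercised without network access.

```python
import urllib.request
import zipfile
from io import BytesIO
from urllib.parse import quote, urlencode

API_ROOT = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession"


def build_url(accs):
    """Build the download URL for a collection of assembly accessions."""
    query = urlencode(
        [("include_annotation_type", t)
         for t in ("GENOME_FASTA", "GENOME_GBFF", "GENOME_GFF")]
        + [("hydrated", "FULLY_HYDRATED")]
    )
    # Keep commas unescaped, as in the original script.
    joined = quote(",".join(sorted(accs)), safe=",")
    return f"{API_ROOT}/{joined}/download?{query}"


def download(accs, target, opener=urllib.request.urlopen):
    """Fetch the zip archive for `accs` and extract it into `target`.

    urlopen raises urllib.error.HTTPError for error statuses, so no
    explicit status check is needed here.
    """
    with opener(build_url(accs)) as resp:
        payload = resp.read()  # whole archive is held in memory
    with zipfile.ZipFile(BytesIO(payload)) as z:
        z.extractall(target)
```

As with the urllib3 version, the archive is buffered in memory before extraction, so the same advice about keeping requests small applies.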

dbkref.py.zip

@SchwarzMarek SchwarzMarek added the enhancement New feature or request label Oct 2, 2024
@Daniel-VM
Contributor

Thank you so much, @SchwarzMarek ! I've been out of office these days. 🙏🏾 I plan to test it locally by next week and share my thoughts. 😉

@Daniel-VM Daniel-VM self-assigned this Oct 4, 2024