remove the need for `ncbi_assembly_metadata` #172

SchwarzMarek · 2024-10-02T08:41:56Z

Description of feature

As discussed in #170, I suggest dropping the `--ncbi_assembly_metadata` requirement and obtaining the relevant assemblies directly from their assembly IDs.

Below is a Python 3 script that downloads assemblies by accession using NCBI's API.

At the moment the script downloads FASTA, GFF, and GBFF files (for my convenience); this can be adjusted to whatever bacass needs.

Possible interfaces:

Python import of the `download` function
CLI (two modes for my convenience; can easily be simplified)

Dependencies:

urllib3 (could probably be rewritten to use the standard-library urllib)
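
For reference, a minimal sketch of what a standard-library version might look like, assuming the same endpoint and query parameters as the script in this post. The names `build_url` and `download_stdlib` are hypothetical, and sorting and quoting the accessions is an addition of this sketch, not something the original script does.

```python
import urllib.request
import zipfile
from io import BytesIO
from urllib.parse import quote

# Endpoint and annotation types copied from the urllib3 script in this issue.
API_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession"
ANNOTATION_TYPES = ("GENOME_FASTA", "GENOME_GBFF", "GENOME_GFF")


def build_url(accs):
    """Build the download URL for a collection of assembly accessions."""
    joined = ",".join(quote(a) for a in sorted(accs))
    query = "&".join(f"include_annotation_type={t}" for t in ANNOTATION_TYPES)
    return f"{API_BASE}/{joined}/download?{query}&hydrated=FULLY_HYDRATED"


def download_stdlib(accs, target):
    """Fetch the zip archive with urllib.request and unpack it into target."""
    # urlopen raises urllib.error.HTTPError for non-2xx responses.
    with urllib.request.urlopen(build_url(accs)) as resp:
        payload = resp.read()
    with zipfile.ZipFile(BytesIO(payload)) as z:
        z.extractall(target)
```

Keeping the URL construction in its own function means it can be checked without touching the network.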

Result:

The downloaded data ends up under [target dir]/ncbi_dataset/data/ (this can be changed at the cost of added complexity).
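
As a rough sketch of how a downstream step could pick the files up, the helper below lists what was extracted. It assumes that each assembly gets its own subdirectory named after its accession under ncbi_dataset/data/, which should be verified against a real download; `list_assembly_files` is a hypothetical name.

```python
from pathlib import Path


def list_assembly_files(target):
    """Map each subdirectory of ncbi_dataset/data/ to its sorted file names."""
    data_dir = Path(target) / "ncbi_dataset" / "data"
    return {
        d.name: sorted(p.name for p in d.iterdir() if p.is_file())
        for d in sorted(data_dir.iterdir())
        if d.is_dir()  # skip loose files sitting next to the directories
    }
```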

Known limitations:

This works well for a low to medium number of assemblies; personally, I would keep it under 50 per request. Larger but still reasonable numbers could be handled by the script by iterating over chunks. If we aimed for much larger numbers (thousands), then I would advise using ncbi-datasets-cli instead (available e.g. from conda).
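
The chunk iteration mentioned above could be sketched roughly as follows. The helper names `chunked` and `download_in_chunks` are hypothetical, and the download function is passed in as a parameter (it is assumed to take `(accs, target)` like the `download` function in the script).

```python
from itertools import islice


def chunked(accs, size=50):
    """Yield de-duplicated, sorted accessions in lists of at most `size`."""
    it = iter(sorted(set(accs)))
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def download_in_chunks(accs, target, download_fn, size=50):
    """Call download_fn(batch, target) once per batch; return the batch count."""
    n = 0
    for batch in chunked(accs, size):
        download_fn(batch, target)
        n += 1
    return n
```

One caveat with this approach: every batch is extracted into the same target directory, so any file that appears in every archive would be overwritten by later batches.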

import urllib3
import sys
import zipfile
from io import BytesIO


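def download(accs, target):
    """Download the assemblies in accs from NCBI and unpack them into target."""
    # One request for all accessions; the API answers with a single zip archive.
    url = (f"https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/"
           f"{','.join(accs)}/download?"
           f"include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GBFF&include_annotation_type=GENOME_GFF"
           f"&hydrated=FULLY_HYDRATED")

    with urllib3.PoolManager() as http:
        resp = http.request("GET", url)
        # Fail with a clear message instead of a BadZipFile error further down.
        if resp.status != 200:
            raise RuntimeError(f"NCBI request failed with HTTP status {resp.status}")

        # The archive is held in memory and extracted straight into target.
        with zipfile.ZipFile(BytesIO(resp.data)) as z:
            z.extractall(target)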


if __name__ == "__main__":
    if len(sys.argv) < 4:
        print('USAGE: python3 dbkref.py accs [TARGET DIR] SPACE DELIMITED ASSEMBLY ACCESSIONS')
        print('USAGE: python3 dbkref.py kmerfinder_summary [TARGET DIR] PATH_TO_KMERFINDER_SUMMARY_FILE')
        raise ValueError('Invalid input, please see usage.')

    target = sys.argv[2]
    if sys.argv[1] == 'accs':
        accs = set(sys.argv[3:])
        download(accs, target)
    elif sys.argv[1] == 'kmerfinder_summary':
        kmersumm = sys.argv[3]
        with open(kmersumm) as f:
            _ = f.readline()  # ditch first line
            accs = {l.split(',')[1] for l in f if l.strip() != ''}
        download(accs, target)
    else:
        raise ValueError('invalid mode choice, valid are "kmerfinder_summary" or "accs"')
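
The `kmerfinder_summary` branch could also be written with the `csv` module. This is only a sketch under the same assumptions the script makes (one header line, accession in the second comma-separated column); `accessions_from_summary` is a hypothetical name. Unlike the `split(',')` version, it skips rows with fewer than two columns instead of raising an IndexError, and it strips whitespace around the accession.

```python
import csv


def accessions_from_summary(lines):
    """Collect the second-column values from an iterable of CSV lines."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header line
    return {
        row[1].strip()
        for row in reader
        if len(row) > 1 and row[1].strip()  # ignore blank or short rows
    }
```

An open file object can be passed directly, for example `with open(kmersumm, newline="") as f: accs = accessions_from_summary(f)`.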

If you are interested in this, could you please check whether urllib3 is available in the bacass Python 3 environment?

dbkref.py.zip

Daniel-VM · 2024-10-04T16:10:44Z

Thank you so much, @SchwarzMarek ! I've been out of office these days. 🙏🏾 I plan to test it locally by next week and share my thoughts. 😉

SchwarzMarek added the "enhancement" (New feature or request) label Oct 2, 2024

SchwarzMarek mentioned this issue Oct 2, 2024

kmerfinder update and optimization #170

Closed

Daniel-VM self-assigned this Oct 4, 2024
