Move metadata from mongodb into the index manifest? #18
Comments
Interesting idea @luizirber! Offhand I think it makes sense, but I don't have a great handle on the mechanics. Were you thinking of a similar approach to filtering metadata from bigquery, but adding it to the manifest rather than to the mongodb? My initial thoughts on drawbacks:

1 - This would not constrain the frontend as-is, but it may make things clunky if we wanted some sort of overall visualization of every accession and its metadata (like here: https://web.app.ufz.de/marmdb/).

2 - The way metadata is organized in the app right now could be drastically improved. I took a bit of a 'take what we get' approach, because what is available from the SRA varies so much and it would take a lot of time to improve it even slightly. As branchwater updates, I'm assuming it wouldn't pull the entire SRA metadata, just the metadata for the newly added accessions? As long as adding the metadata to the manifest doesn't make it a huge pain to update the entire manifest (for example, if the SRA reorganizes it or someone improves my filter method), I don't see an issue.

3 - One reason we went with mongodb is that it's super fast to search the accessions and pull the selected metadata of interest in one query. How do you think the manifest will compare? I'm guessing all the metadata would be returned and we'd filter to the metadata of interest on the flask server as a second step.
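To make the "filter on the server as a second step" idea concrete, here is a minimal sketch, assuming the manifest is a CSV with one row per accession. The column names (`name`, `organism`, `biosample`, `geo_loc_name`), the accessions, and both function names are hypothetical and only for illustration; this is not the actual branchwater schema or code.

```python
import csv
import io

# Hypothetical extended manifest: "name" holds the accession, the other
# columns are illustrative SRA metadata fields (not the real schema).
MANIFEST_CSV = """name,organism,biosample,geo_loc_name
SRR0000001,soil metagenome,SAMN01,USA
SRR0000002,marine metagenome,SAMN02,Germany
SRR0000003,gut metagenome,SAMN03,Brazil
"""


def load_metadata(csv_text):
    """Index manifest rows by accession so each lookup is a dict access."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["name"]: row for row in reader}


def select_metadata(index, accessions, fields):
    """Return only the requested fields for the accessions that matched."""
    results = []
    for acc in accessions:
        row = index.get(acc)
        if row is None:
            continue  # no metadata row for this accession; skip it
        record = {"acc": acc}
        record.update({field: row.get(field, "") for field in fields})
        results.append(record)
    return results


index = load_metadata(MANIFEST_CSV)
hits = select_metadata(
    index,
    ["SRR0000003", "SRR0000001", "SRR9999999"],
    ["organism", "geo_loc_name"],
)
```

Loading the index once at startup keeps each request to a handful of dictionary lookups, roughly mirroring the "match accessions, project selected fields" shape of the current query. How this compares to mongodb at full SRA scale would need to be measured.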
I really like it from a simplicity point of view, plus extended manifests would support additional utility for sourmash and/or API access. I don't have a sense for potential performance drawbacks, though!
hot take: don't do it in the manifest by adding columns, but support multiple files that key on There's some description of this over in https://sourmash.readthedocs.io/en/latest/sourmash-internals.html#taxonomy-and-assigning-lineages, but for this crowd & channeling sourmash-bio/sourmash#1790 - taxonomy in sourmash works by getting results from (e.g.) sourmash gather that contain space-separated identifiers ( So, for example, the following taxonomy spreadsheet:
would let us identify results for genomes with This scheme has proven to be pretty robust and debuggable in practice, and it allows us to support multiple different taxonomies in sourmash (I think we're up to NCBI, GTDB, LINS, and ICTV!) with only a moderate amount of @bluegenes blood, sweat, and tears. So the modified proposal I'd suggest here - we combine In the case of branchwater-web we could more rapidly evolve the format of this metadata file to meet needs. Heck, it could even remain in the mongodb, maybe. Over in sourmash I'd probably suggest adding generic support for this into our plugin interface so that we could try things out freely. Conveniently, this would also help support private/custom/user-specific metadata, so that people could build up their own annotated/curated databases of SRA info and then use them as picklists for the search output - perhaps something to support in future versions of the web app?
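A rough sketch of the "separate file keyed on an identifier" approach, in plain Python. Everything here is an assumption for illustration: the column names (`match_name`, `containment`, `ident`, `biome`, `collection_year`), the sample rows, the helper names, and the rule that the identifier is the first whitespace-separated token of the match name. It does not use any sourmash API.

```python
import csv
import io

# Hypothetical search output: "match_name" may carry free text after the identifier.
SEARCH_RESULTS = """match_name,containment
SRR0000001 soil sample,0.82
SRR0000002 seawater sample,0.41
SRR0000004 unknown,0.10
"""

# Hypothetical sidecar metadata file, keyed on the "ident" column.
SIDECAR = """ident,biome,collection_year
SRR0000001,soil,2019
SRR0000002,marine,2021
"""


def ident_of(name):
    """Assumption: the identifier is the first whitespace-separated token."""
    parts = name.split()
    return parts[0] if parts else ""


def annotate(results_text, sidecar_text):
    """Attach sidecar columns to each search result that has a matching ident."""
    meta = {row["ident"]: row for row in csv.DictReader(io.StringIO(sidecar_text))}
    annotated = []
    for row in csv.DictReader(io.StringIO(results_text)):
        merged = dict(row)
        merged["ident"] = ident_of(row["match_name"])
        for key, value in meta.get(merged["ident"], {}).items():
            if key != "ident":
                merged[key] = value
        annotated.append(merged)
    return annotated


annotated = annotate(SEARCH_RESULTS, SIDECAR)
```

Because the sidecar is independent of the index, swapping in a different or user-curated metadata file only changes the second argument; results without a matching row simply come back unannotated.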
Over at sourmash-bio/sourmash#3006 (comment) I mentioned adding extra columns to the manifest to hold metadata not available in a signature. I think we can use the same approach to store the SRA metadata in the manifest and remove the mongodb dependency, returning the metadata from the search index together with the containment.
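As a minimal sketch of what "extra columns in the manifest" could look like, assuming the manifest is a CSV with optional leading `#` comment lines and a `name` column holding the accession: the `extend_manifest` helper, the `organism` column, and the sample rows are all hypothetical, and this does not go through sourmash's own manifest classes.

```python
import csv
import io


def extend_manifest(manifest_text, metadata, new_columns, key="name"):
    """Return manifest CSV text with extra metadata columns added or refreshed.

    Leading '#' comment lines are passed through unchanged. Rows without
    metadata get empty strings in the new columns.
    """
    lines = manifest_text.splitlines()
    comments = []
    while lines and lines[0].startswith("#"):
        comments.append(lines.pop(0))

    reader = csv.DictReader(lines)
    fieldnames = list(reader.fieldnames)
    fieldnames += [col for col in new_columns if col not in fieldnames]

    out = io.StringIO()
    for comment in comments:
        out.write(comment + "\n")
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        extra = metadata.get(row[key], {})
        for col in new_columns:
            row[col] = extra.get(col, "")  # overwrite so re-runs update values
        writer.writerow(row)
    return out.getvalue()


# Hypothetical two-row manifest and metadata for one of the accessions.
MANIFEST = "# example header\nname,md5\nSRR0000001,abc\nSRR0000002,def\n"
extended = extend_manifest(
    MANIFEST,
    {"SRR0000001": {"organism": "soil metagenome"}},
    ["organism"],
)
```

Running the same function again with fresh metadata overwrites the added columns, which is one way the manifest could be updated if the SRA metadata or the filtering method changes.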
More refs on the sourmash context: sourmash-bio/sourmash#2180
But... is it a good idea?
Over at #4 I'm trying to make it easy to bring up a new branchwater installation, and there is a bit of a dance for building the index, bringing up mongo, loading metadata, and then bringing up the server/frontend. Moving the metadata into the index building step makes things easier, but requires being able to update the manifest in the index in case we want different data (which is not that hard, it's a CSV). It can be more constraining for developing new frontend features, though?

Pinging @bluegenes and @SuzanneFleishman for ideas =]