feat: clean up data organization/add basic fetching scripts #558

Merged
jsstevenson merged 11 commits into dev from download on Dec 20, 2024

Conversation

@jsstevenson (Contributor) commented Dec 19, 2024

  • Use a centralized location for data in user space, rather than storing everything within a subdirectory of the repo. This is really more of a convenience for me because it means keeping fewer copies of the latest ChEMBL release around. It reflects a strategy we've used in our lil data management tool.
  • Add a script to download all DGIdb-relevant files, including the latest versions of the updateable sources. I know we'd previously maintained some updater modules for this purpose, but I'd prefer something that reuses the data management tool mentioned above; it's a little less maintenance for me.
  • Put all DGIdb static files into an S3 bucket. By default this won't be accessible to non-NCH users but we could probably arrange something if necessary. Hypothetically some of this data is supposed to be non-public, so I don't want to put it on a public bucket for now.
  • Assign version values dynamically where possible. Previously, DGIdb generally expected files to be named "claims.tsv" or some variation thereof, but in our other projects we've used a file naming pattern along the lines of <source>_<the kind of stuff>_<version>.<filetype>, since not every file includes its version as part of its data. So now the file downloader is in charge of identifying the version of the data and naming the file accordingly, and the DGIdb importers can just read version values from filenames (see the sketch after this list).
  • Remove some unused/no-longer-functional/now-redundant code related to updating/downloading
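
As a rough sketch of the filename-based version lookup described above, assuming the <source>_<the kind of stuff>_<version>.<filetype> pattern (the helper name and example filenames here are hypothetical, just to show the idea):

```ruby
# Hypothetical helper: recover the version from a wags-tails-style filename,
# where the version is whatever follows the final underscore in the stem.
def version_from_filename(path)
  stem = File.basename(path, File.extname(path))
  stem.split("_").last
end

version_from_filename("chembl_data_34.db")          # => "34"
version_from_filename("drugbank_drugs_5.1.12.csv")  # => "5.1.12"
```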

The workflow for updating would go like this:

  1. rake db:reset
  2. fetch/update data with python3 scripts/download_files.py
  3. run imports

Unfortunately this entails some awkwardness about running Python modules before the rake tasks. I thought a bit about doing something like shell commands from the Ruby side to call the Python, or even looking at Ruby-Python FFIs, but I don't think any particular solution will be terribly elegant, so I'm not worrying about it for now.

@jsstevenson jsstevenson marked this pull request as ready for review December 20, 2024 14:39
@jsstevenson jsstevenson requested a review from acoffman December 20, 2024 14:41
@jsstevenson jsstevenson added the priority:low Low priority label Dec 20, 2024
@acoffman (Collaborator) left a comment

I like this idea, and the diff looks good overall to me; I think it's fine to merge as is. A couple of thoughts:

I do see "#{Dir.home}/.local/share/wags_tails" scattered in a few places around the code (though mostly it's used via the default_data_dir method you wrote).

Since it looks like this PR adds a dependency on wags_tails anyway, I wonder if it's worth having the DGIdb importers "ask" wags-tails where its data root path is somehow. That's definitely not a huge/blocking issue, since only a few people will even use this process, but it does feel a little strange to hardcode the default path for another tool and hope nobody has configured it to point to a non-default location.
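
For example, something as small as this in default_data_dir would at least honor an explicit override (assuming wags-tails reads a WAGS_TAILS_DIR environment variable for this; I haven't verified that, so treat the variable name as a guess):

```ruby
# Sketch only: prefer an explicitly configured wags-tails data directory,
# falling back to the tool's default location under the user's home.
def default_data_dir
  ENV.fetch("WAGS_TAILS_DIR") { File.join(Dir.home, ".local", "share", "wags_tails") }
end
```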

re: calling Python from within the DGIdb codebase, it's not too bad; we have an example here calling civicpy and capturing stderr, stdout, and the return code, or the simpler but less robust direct shell-out here. The challenging part is if you want to use a venv or similar, making sure you're getting the correct versions of everything in that context, but it's doable if we want to automate the process more going forward.
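
For reference, the capture-everything variant looks roughly like this (the script path is just illustrative):

```ruby
require "open3"

# Run the external Python script and capture stdout, stderr, and the exit
# status so a failure can be surfaced instead of silently ignored.
stdout, stderr, status = Open3.capture3("python3", "scripts/download_files.py")
unless status.success?
  raise "download_files.py failed (exit #{status.exitstatus}): #{stderr}"
end
puts stdout
```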

Anyways, those are just thoughts, not blockers. LGTM!

@jsstevenson (Contributor, Author) commented Dec 20, 2024

Yeah, we've sort of written the data management code to enable this kind of thing (although we'd envisioned it more in the context of shell scripting/workflow files). What you have in CIViC looks good so I went ahead and copied it over into the places where we need to look up the root data directory -- let me know if it looks ok.

In the long term it would also be nice to just call wags-tails get-latest <source> at the beginning of every source import, but getting all of the ad hoc sources plugged into the CLI is probably not a priority right now.

@jsstevenson jsstevenson requested a review from acoffman December 20, 2024 17:53
@acoffman (Collaborator) left a comment

Looks good!

@jsstevenson jsstevenson merged commit 31dcf81 into dev Dec 20, 2024
3 checks passed
@jsstevenson jsstevenson deleted the download branch December 20, 2024 21:21