feat: clean up data organization/add basic fetching scripts #558

Merged
jsstevenson merged 11 commits into dev from download on Dec 20, 2024

Conversation

@jsstevenson (Contributor) commented Dec 19, 2024

  • Use a centralized location for data in user space, rather than storing everything within a subdirectory of the repo. This is really more of a convenience for me because it means keeping fewer copies of the latest ChEMBL release around. It reflects a strategy we've used in our lil data management tool.
  • Add a script to download all DGIdb-relevant files, including the latest versions of the updateable sources. I know we'd previously maintained some updater modules for this purpose, but I'd prefer something that reuses the data management tool mentioned above; it's a little less maintenance for me.
  • Put all DGIdb static files into an S3 bucket. By default this won't be accessible to non-NCH users but we could probably arrange something if necessary. Hypothetically some of this data is supposed to be non-public, so I don't want to put it on a public bucket for now.
  • Assign version values dynamically where possible. Previously, DGIdb generally expected files to be named "claims.tsv" or some variation thereof, but in our other projects we've used a file naming pattern along the lines of <source>_<the kind of stuff>_<version>.<filetype>, since not every file includes its version as part of its data. So now the file downloader is in charge of identifying the version of the data and naming the file accordingly, and the DGIdb importers can just read version values from filenames (see the sketch after this list).
  • Remove some unused/no-longer-functional/now-redundant code related to updating/downloading
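
As a rough sketch of the filename-based version lookup described above, assuming the <source>_<the kind of stuff>_<version>.<filetype> pattern (the helper name and example filenames here are hypothetical, just to show the idea):

```ruby
# Hypothetical helper: recover the version from a wags-tails-style filename,
# where the version is whatever follows the final underscore in the stem.
def version_from_filename(path)
  stem = File.basename(path, File.extname(path))
  stem.split("_").last
end

version_from_filename("chembl_data_34.db")          # => "34"
version_from_filename("drugbank_drugs_5.1.12.csv")  # => "5.1.12"
```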

The workflow for updating would go like this:

  1. rake db:reset
  2. fetch/update data with python3 scripts/download_files.py
  3. run imports

Unfortunately this entails some awkwardness about running Python modules before the rake tasks. I thought a bit about doing something like shell commands from the Ruby side to call the Python, or even looking at Ruby-Python FFIs, but I don't think any particular solution will be terribly elegant, so I'm not worrying about it for now.

@jsstevenson jsstevenson marked this pull request as ready for review December 20, 2024 14:39
@jsstevenson jsstevenson requested a review from acoffman December 20, 2024 14:41
@jsstevenson jsstevenson added the priority:low Low priority label Dec 20, 2024
@acoffman (Collaborator) left a comment

I like this idea, and the diff looks good overall to me; I think it's fine to merge as is. A couple of thoughts:

I do see "#{Dir.home}/.local/share/wags_tails" scattered in a few places around the code (though mostly it's used via the default_data_dir method you wrote).

Since it looks like this PR adds a dependency on wags_tails anyway, I wonder if it's worth having the DGIdb importers "ask" wags-tails where its data root path is somehow. That's definitely not a huge/blocking issue, since only a few people will even use this process, but it does feel a little strange to hardcode the default path for another tool and hope nobody has configured it to point to a non-default location.
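
For example, something as small as this in default_data_dir would at least honor an explicit override (assuming wags-tails reads a WAGS_TAILS_DIR environment variable for this; I haven't verified that, so treat the variable name as a guess):

```ruby
# Sketch only: prefer an explicitly configured wags-tails data directory,
# falling back to the tool's default location under the user's home.
def default_data_dir
  ENV.fetch("WAGS_TAILS_DIR") { File.join(Dir.home, ".local", "share", "wags_tails") }
end
```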

re: calling Python from within the DGIdb codebase, it's not too bad; we have an example here calling civicpy and capturing stderr, stdout, and the return code, or the simpler but less robust direct shell-out here. The challenging part is if you want to use a venv or similar, making sure you're getting the correct versions of everything in that context, but it's doable if we want to automate the process more going forward.
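
For reference, the capture-everything variant looks roughly like this (the script path is just illustrative):

```ruby
require "open3"

# Run the external Python script and capture stdout, stderr, and the exit
# status so a failure can be surfaced instead of silently ignored.
stdout, stderr, status = Open3.capture3("python3", "scripts/download_files.py")
unless status.success?
  raise "download_files.py failed (exit #{status.exitstatus}): #{stderr}"
end
puts stdout
```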

Anyways, those are just thoughts, not blockers. LGTM!

@jsstevenson (Contributor, Author) commented Dec 20, 2024

Yeah, we've sort of written the data management code to enable this kind of thing (although we'd envisioned it more in the context of shell scripting/workflow files). What you have in CIViC looks good so I went ahead and copied it over into the places where we need to look up the root data directory -- let me know if it looks ok.

In the long term it would also be nice to just call wags-tails get-latest <source> at the beginning of every source import, but getting all of the ad hoc sources plugged into the CLI is probably not a priority right now.

@jsstevenson jsstevenson requested a review from acoffman December 20, 2024 17:53
@acoffman (Collaborator) left a comment

Looks good!

@jsstevenson jsstevenson merged commit 31dcf81 into dev Dec 20, 2024
3 checks passed
@jsstevenson jsstevenson deleted the download branch December 20, 2024 21:21