Reviving load #384

blue442 · 2023-07-06T19:01:17Z

Addresses #379. This PR separates loading of the metadata and dataset from instantiation of a Foundry object, so that datasets can be browsed and searched before anything is downloaded. It also refactors the metadata and data download functions so that their names say more clearly what each call does.

Specific changes

Made load() a public function again
Removed the metadata arg from load()
Removed if metadata: res = metadata from load() method
if is_doi(name) and not metadata: -> if is_doi(name):
Remove call to load() from the __init__() function
Changed f.load() -> f.fetch_data()
Changed download=True to metadata_only=False

New usage example snippet

import foundry
f = foundry.Foundry()
f.search()
f.fetch_data('foundry_assorted_computational_band_gaps_v1.1', metadata_only=True)
f.download_full_dataset()
data_dictionary = f.load_data()

Documentation notes:

load_data() loads the data into memory, while download_data() downloads it to disk, correct? If so, the docstring should probably reflect this.
Moved 'todos' from code into docstrings (google-style)
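
The load/download distinction raised above could be captured in Google-style docstrings along these lines. This is a minimal sketch, not Foundry's actual code: the function bodies, the `records` and `path` parameters, and the JSON file format are assumptions made only so the example runs on its own.

```python
import json
from pathlib import Path


def download_data(records: list, path: Path) -> Path:
    """Write a dataset to disk; nothing is kept in memory for later use.

    Hypothetical stand-in for the real download step.

    Args:
        records: Rows to persist (stands in for the remote dataset).
        path: Destination file on the local disk.

    Returns:
        The path of the file that was written.

    Todo:
        * Replace the in-memory source with a real transfer.
    """
    path.write_text(json.dumps(records))
    return path


def load_data(path: Path) -> list:
    """Read a previously downloaded dataset from disk into memory.

    Args:
        path: File produced by ``download_data``.

    Returns:
        The dataset rows as Python objects.

    Raises:
        FileNotFoundError: If the dataset has not been downloaded yet.
    """
    return json.loads(path.read_text())
```

Keeping the `Todo:` section inside the docstring mirrors the second documentation note; whether the real docstrings need a `Raises:` entry depends on how the library reports a missing download.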

TODO:

Need to update gitbook
Need to update notebooks
From __init__() - remove metadata (dict): **For debug purposes.** A search result analog to prepopulate metadata.
Not sure about "# TODO: Creating a new Foundry instance is a problematic way to update the metadata, we should find a way to abstract this." (I don't understand this comment)
Why do we only return the first of multiple search results? We're already getting all of the results and just throwing all but the first away - should we handle this differently?
Calling load() (now fetch_data()): does a user need to pass in the 'source_id' generated by list() as the name argument, or can they pass in the entry under 'name'? ('name' doesn't seem to work)
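
On the question of discarding all but the first search result, one option is to return a single record only when the match is unambiguous and otherwise surface every candidate. The sketch below illustrates that idea under stated assumptions and does not describe Foundry's behavior: `resolve_dataset`, `AmbiguousDatasetError`, and the dict records with a `source_id` key are all hypothetical.

```python
class AmbiguousDatasetError(LookupError):
    """Raised when a query matches more than one dataset (hypothetical)."""

    def __init__(self, query, candidates):
        self.candidates = candidates
        ids = ", ".join(c["source_id"] for c in candidates)
        super().__init__(f"{query!r} matched {len(candidates)} datasets: {ids}")


def resolve_dataset(query, results):
    """Return the single matching record, or fail loudly.

    `results` stands in for whatever list the search backend returned.
    """
    if not results:
        raise LookupError(f"no dataset matched {query!r}")
    if len(results) > 1:
        # Do not silently keep results[0]; let the caller choose.
        raise AmbiguousDatasetError(query, results)
    return results[0]
```

A caller could catch `AmbiguousDatasetError` and print `candidates` so the user can pick a `source_id` explicitly.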

what-the-diff · 2023-07-06T19:01:47Z

PR Summary

Python Setup Version Update in tests.yml
The version of actions/setup-python was updated from v2 to v4.
Foundry.py Updates
- Removed FoundryDataset import statement
- Replaced argument download with metadata_only in __init__ along with similar replacements in other relevant methods
- The dataset attribute has been renamed to metadata and is now used for loading dataset metadata
- download_dataset now downloads the data if metadata_only is false, replacing the original download method
- metadata is now used instead of dataset in various methods, such as load_data and _get_inputs_targets
- The unused res variable has been removed, streamlining the code
- FoundryBase class has been simplified to include metadata and use FoundryConfig for configuration settings, removing unused attributes such as dc, mdf, and dataset
Models.py Updates
- Type of the metadata attribute in FoundryBase class has been changed from FoundryDataset to FoundryMetadata
Test_foundry.py Updates
- Test updates mirror changes made in foundry.py, using metadata_only argument instead of the download argument
- Import updates reflect changes in foundry module
- Test names have been updated to match the updated method names
- test_metadata_pull, test_download_https, test_dataframe_load, and other various tests have been updated to use metadata_only argument.

Outcome: This refactoring streamlines the code and improves both the utility and readability of the program. The metadata_only argument lets callers decide whether the dataset itself is downloaded or only its metadata. Renaming dataset to metadata also clarifies what the attribute holds.

ascourtas

Hi Steve! I was discussing this solution with @blaiszik, and he pointed out that there wouldn't be a case where the user would want to download the dataset but not load the metadata (but there are cases where a user would want to load the metadata, but not download the dataset). He suggested the following refactor which I think is very smooth:

"I think we should just make the metadata loading occur without user intervention. It’s a very cheap operation, and I don’t see really any use to have the Foundry object without the metadata"

from foundry import Foundry
f = Foundry("$DOI$")  # this gets the metadata, and never gets the data
f.get_data()  # this downloads the data
X, y = f.load_data()  # this loads the data into RAM

I think get_data() could also be called download_dataset(), depending on what we think is clearest. What are your thoughts?

blue442 · 2023-07-28T16:46:41Z

Updated based on feedback to reflect three possible use cases:

initialize foundry object w/o name (for publishing purposes)
initialize foundry object w/ name, downloads metadata AND dataset
initialize foundry object w/ name and pass metadata_only=True so dataset is not downloaded

Also changed function names to download_dataset and download_metadata for clarity.
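
The three cases can be sketched as constructor logic. This is a self-contained illustration under stated assumptions, not the code in this PR: `SketchFoundry` is a hypothetical stand-in class, the method bodies are placeholders, and only the names `download_metadata`, `download_dataset`, and `metadata_only` come from the discussion above.

```python
class SketchFoundry:
    """Hypothetical stand-in for the Foundry object; not the real class."""

    def __init__(self, name=None, metadata_only=False):
        self.name = name
        self.metadata = None
        self.downloaded = False
        if name is None:
            # Case 1: no dataset name, e.g. the object is only used to publish.
            return
        # Cases 2 and 3: metadata is cheap, so fetch it whenever a name is given.
        self.download_metadata()
        if not metadata_only:
            # Case 2: also fetch the dataset itself.
            self.download_dataset()

    def download_metadata(self):
        # Placeholder for the real metadata lookup.
        self.metadata = {"name": self.name}

    def download_dataset(self):
        if self.metadata is None:
            raise ValueError("no dataset selected; pass a name first")
        # Placeholder for the real transfer to disk.
        self.downloaded = True
```

With this shape, `SketchFoundry("some_name", metadata_only=True)` supports browsing, and a later explicit `download_dataset()` call fetches the data once the user has decided they want it.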

Upgraded setup-python in the GitHub Action to v4 (the older version was causing an error).

ascourtas

Just need a little clarity on things before I can approve!

foundry/foundry.py

foundry/models.py

ascourtas

Thanks for the changes Steve, almost there! Last thing for this PR: can you please remove the unnecessary added cells and the big outputs (i.e., for importing libraries, etc.) from the data publishing notebook?

Also, was there anything you changed in the data publishing notebook that is supposed to be there? I didn't see anything but it's hard to tell without a proper diff.

The last thing will be to update the Foundry documentation and example notebooks with the syntax change before we can cut this release.

codecov-commenter · 2023-08-07T14:32:19Z

Codecov Report

Merging #384 (37eb21c) into main (0ccb39b) will decrease coverage by 1.13%.
Report is 2 commits behind head on main.
The diff coverage is 67.74%.

❗ Current head 37eb21c differs from pull request most recent head 6812a41. Consider uploading reports for the commit 6812a41 to get more accurate results

@@            Coverage Diff             @@
##             main     #384      +/-   ##
==========================================
- Coverage   72.08%   70.96%   -1.13%     
==========================================
  Files           9        9              
  Lines         541      527      -14     
==========================================
- Hits          390      374      -16     
- Misses        151      153       +2

Files Changed        Coverage Δ
foundry/foundry.py   58.64% <66.66%> (-0.76%) ⬇️
foundry/models.py    85.89% <100.00%> (-1.75%) ⬇️

... and 1 file with indirect coverage changes

blue442 added 2 commits June 30, 2023 16:17

extract load from __init__ and restructure

9f5a2e9

refactoring for clarity

1c360fb

blue442 requested a review from ascourtas July 6, 2023 19:02

blue442 added 5 commits July 20, 2023 13:41

testing auth issues

1577bdd

fixing auth issues

341f24a

remove import

5fcd0a1

remove globus from Foundry object instantiation

764788f

update call to fetch_data()

ddd995b

ascourtas requested changes Jul 26, 2023

blue442 added 3 commits July 28, 2023 09:35

updating to allow instantiation with dataset name

074857d

update setup-python in gh action to v4

17fa692

remove too many blank lines

5baf531

blue442 requested a review from ascourtas July 28, 2023 16:41

ascourtas reviewed Jul 28, 2023

foundry/foundry.py (resolved)

foundry/foundry.py (resolved)

foundry/foundry.py (outdated, resolved)

foundry/models.py (resolved)

blue442 added 2 commits July 28, 2023 16:36

adding docstrings, moving check for empty name

6966030

removing foundry_dataset model

294c001

ascourtas requested changes Aug 4, 2023

Clean up publishing dataset example notebook

6812a41

ascourtas added the DO NOT MERGE label Nov 9, 2023
