Skip to content

Reviving load #384

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: main-bak
Choose a base branch
from
Open

Reviving load #384

wants to merge 13 commits into from

Conversation

blue442
Copy link
Collaborator

@blue442 blue442 commented Jul 6, 2023

Addressing #379, implemented changes to separate the loading of the metadata and dataset from the instantiation of a foundry object in order to facilitate browsing/searching of datasets prior to downloading. In addition, I refactored the downloading of metadata and data to utilize clearer verbiage when calling the functions.

Specific changes

  • Made load() a public function again
  • Removed the metadata arg from load()
  • Removed if metadata: res = metadata from load() method
  • if is_doi(name) and not metadata: -> if is_doi(name):
  • Remove call to load() from the __init__() function
  • Changed f.load() -> f.fetch_data()
  • Changed download=True to metadata_only=False

New usage example snippet

import foundry
f = foundry.Foundry()
f.search()
f.fetch_data('foundry_assorted_computational_band_gaps_v1.1', metadata_only=True)
f.download_full_dataset()
data_dictionary = f.load_data()

Documentation notes:

  • load_data() loads the data into memory, while download_data() downloads it to disk, correct? If so, the docstring should probably reflect this.
  • Moved 'todos' from code into docstrings (google-style)

TODO:

  • Need to update gitbook
  • Need to update notebooks
  • From __init__() - remove metadata (dict): **For debug purposes.** A search result analog to prepopulate metadata.
  • Not sure about "# TODO: Creating a new Foundry instance is a problematic way to update the metadata, we should find a way to abstract this." (I don't understand this comment)
  • Why do we only return the first of multiple search results? We're already getting all of the results and just throwing all but the first away - should we handle this differently?
  • Calling load() (now fetch_data())- does a user need to pass in the 'source_id' generated by list() as the name argument, or can they pass in the entry under 'name'? ('name' doesn't seem to work)

@what-the-diff
Copy link

what-the-diff bot commented Jul 6, 2023

PR Summary

  • Python Setup Version Update in tests.yml
    The version of actions/setup-python was updated from v2 to v4.
  • Foundry.py Updates
    • Removed FoundryDataset import statement
    • Replaced argument download with metadata_only in __init__ along with similar replacements in other relevant methods
    • The dataset attribute has been renamed to metadata and is now used for loading dataset metadata
    • download_dataset now downloads the data if metadata_only is false, replacing the original download method
    • metadata is now used instead of dataset in various methods, such as load_data and _get_inputs_targets
    • The unused res variable has been removed, streamlining the code
    • FoundryBase class has simplified to include metadata and use FoundryConfig for configuration settings, removing unused attributes such as dc, mdf, and dataset
  • Models.py Updates
    • Type of the metadata attribute in FoundryBase class has been changed from FoundryDataset to FoundryMetadata
  • Test_foundry.py Updates
    • Test updates mirror changes made in foundry.py, using metadata_only argument instead of the download argument
    • Import updates reflect changes in foundry module
    • Test names have been updated to match the updated method names
    • test_metadata_pull, test_download_https, test_dataframe_load, and other various tests have been updated to use metadata_only argument.

Outcome: This refactoring significantly streamlines the code and improves both the utility and readability of the program. The use of metadata_only gives more control over the features while downloading. Plus, renaming dataset to metadata clarifies its use in the program.

@blue442 blue442 requested a review from ascourtas July 6, 2023 19:02
Copy link
Contributor

@ascourtas ascourtas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Steve! I was discussing this solution with @blaiszik, and he pointed out that there wouldn't be a case where the user would want to download the dataset but not load the metadata (but there are cases where a user would want to load the metadata, but not download the dataset). He suggested the following refactor which I think is very smooth:

"I think we should just make the metadata loading occur without user intervention. It’s a very cheap operation, and I don’t see really any use to have the Foundry object without the metadata"

from foundry-ml import Foundry
f = Foundry("$DOI$") # this gets the metadata, and never gets the data
f.get_data() # this downloads the data
X,y = f.load_data() #this loads the data into RAM

I think get_data() could also be called download_dataset(), depending on what we think is clearest. What are your thoughts?

@blue442 blue442 requested a review from ascourtas July 28, 2023 16:41
@blue442
Copy link
Collaborator Author

blue442 commented Jul 28, 2023

Updated based on feedback to reflect three possible use cases:

  • initialize foundry object w/o name (for publishing purposes)
  • initialize foundry object w/ name, downloads metadata AND dataset
  • initialize foundry object w/ name and pass metadata_only=True so dataset is not downloaded

Also changed function names to download_dataset and download_metadata for clarity.

Upgraded python-setup in github action to v4 (was causing an error).

Copy link
Contributor

@ascourtas ascourtas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just need a little clarity on things before I can approve!

Copy link
Contributor

@ascourtas ascourtas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes Steve, almost there! Last thing for this PR: can you please remove the unnecessary added cells and remove the big outputs (ie for importing libraries, etc) from the data publishing notebook?

Also, where there anything you changed in the data publishing notebook that is supposed to be there? I didn't see anything but it's hard to tell without a proper diff.

The last thing will be to update the Foundry documentation and example notebooks with the syntax change before we can cut this release.

@codecov-commenter
Copy link

codecov-commenter commented Aug 7, 2023

Codecov Report

Merging #384 (37eb21c) into main (0ccb39b) will decrease coverage by 1.13%.
Report is 2 commits behind head on main.
The diff coverage is 67.74%.

❗ Current head 37eb21c differs from pull request most recent head 6812a41. Consider uploading reports for the commit 6812a41 to get more accurate results

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##             main     #384      +/-   ##
==========================================
- Coverage   72.08%   70.96%   -1.13%     
==========================================
  Files           9        9              
  Lines         541      527      -14     
==========================================
- Hits          390      374      -16     
- Misses        151      153       +2     
Files Changed Coverage Δ
foundry/foundry.py 58.64% <66.66%> (-0.76%) ⬇️
foundry/models.py 85.89% <100.00%> (-1.75%) ⬇️

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants