Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft classes of object/opinionated layer #267

Draft
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

vinisalazar
Copy link
Contributor

@vinisalazar vinisalazar commented Sep 13, 2022

Hi,

this PR is a draft of how the new object/opinionated layer should look like. It does not have any "proper" code yet, just a scaffold of how the classes would look like. I am submitting it so we can discuss the best way of implementing this layer based on the discussion on #228.

I have a couple of questions:

  • What would be the best name for the subpackage and main module? I went with objects/ and objects.py so there wouldn't be any confusion with the Python object reserved word, but I also considered orm as the name.
  • On GSoC 2022 ideas #228, @ocefpaf suggests that on the ERDDAPConnection class:

@property(server) -> ERDDAPConnection
Return a new ERDDAPConnection if trying to set a new server, or change other attributes rather than changing it in place.

I wasn't sure of what's the best way to do this. I wrote a .to_string method which takes either a string (with the new server URL) or an instance of ERDDAPConnection and returns just a string, or a TypeError:

@classmethod
def to_string(cls, value):
"""Convert an instance of ERDDAPConnection to a string."""
if isinstance(value, str):
return value
elif isinstance(value, cls):
return value.server
else:
raise TypeError(
f"Server must be either a string or an instance of ERDDAPConnection. '{value}' was "
f"passed.",
)

That function is then used in the .server setter and in the ERDDAPServer __init__ and connection. How should the ERDDAPConnection.server setter return a new instance? This doesn't look like it would work:

    @connection.setter
    def connection(self, value: str):
        return ERDDAPConnection(value)

Once you approve this, I will start implementing the core functions in this layer. Looking forward to hearing your feedback.

Thanks,
Vini

@vinisalazar vinisalazar force-pushed the classes branch 2 times, most recently from 62583c8 to b84c858 Compare September 13, 2022 17:44
@vinisalazar
Copy link
Contributor Author

I think test_to_iris_tabledap is failing because of the latest version of iris. I can reproduce the same error after updating to 3.3.0.

I'll attempt a fix on a separate PR.

@ocefpaf
Copy link
Member

ocefpaf commented Sep 19, 2022

Small request, as we discussed today, let's change the name from objects.py to erddap.py (I know, zero creativity but I just want to avoid clobbering a python built-in).

Regarding the draft: this look great. I'd like to wait for @abkfenris to be back from his vacation so to get his input. IMO the best strategy to get this merge is to issue new PRs with 1 class at a time instead of everything we have in this draft.

Copy link
Contributor

@abkfenris abkfenris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that would probably help is to break up the classes into separate files. As is, maybe read my notes starting with connection, then server, then datasets.

* What would be the best name for the subpackage and main module? I went with `objects/` and `objects.py` so there wouldn't be any confusion with the Python `object` reserved word, but I also considered `orm` as the name.

What about something like array_like?

Even if we aren't implementing any/all of the Python array manipulation dunder-methods at this point, it gives us an interface something to aspire to.

* On [GSoC 2022 ideas #228](https://github.com/ioos/erddapy/issues/228), @ocefpaf suggests that on the `ERDDAPConnection` class:

@property(server) -> ERDDAPConnection
Return a new ERDDAPConnection if trying to set a new server, or change other attributes rather than changing it in place.

I wasn't sure of what's the best way to do this. I wrote a .to_string method which takes either a string (with the new server URL) or an instance of ERDDAPConnection and returns just a string, or a TypeError:

@classmethod
def to_string(cls, value):
"""Convert an instance of ERDDAPConnection to a string."""
if isinstance(value, str):
return value
elif isinstance(value, cls):
return value.server
else:
raise TypeError(
f"Server must be either a string or an instance of ERDDAPConnection. '{value}' was "
f"passed.",
)

That function is then used in the .server setter and in the ERDDAPServer __init__ and connection. How should the ERDDAPConnection.server setter return a new instance? This doesn't look like it would work:

    @connection.setter
    def connection(self, value: str):
        return ERDDAPConnection(value)

See my inline comments, but I think we can simplify the connection and servers so that this isn't an issue.

FilePath = Union[str, Path]


class ERDDAPConnection:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would probably want to split each class out into it's own file. Maybe use the init docstring to give a big overview of relationship between classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ERDDAPConnection largely acts as a wrapper around requests/htmx/urllib that then can be used by the other classes when data access needs to happen.

Comment on lines 20 to 22
def __init__(self, server: str):
"""Initialize instance of ERDDAPConnection."""
self._server = self.to_string(server)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure if we actually need to store the server on here, which may simplify other things.

Comment on lines 48 to 50
def open(self, url_part: str) -> FilePath:
"""Yield file-like object for access for file types that don't enjoy getting passed a string."""
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe .open() should be a context manager that gives a pathlib.Path to a temporary file.

We could also have an explicit .save() method that uses .open() under the hood.

erddapy/objects/objects.py Outdated Show resolved Hide resolved
Comment on lines 63 to 78
class ERDDAPDataset:
"""Base class for more focused table or grid datasets."""

def __init__(
self,
dataset_id: str,
connection: str | ERDDAPConnection,
variables,
constraints,
):
"""Initialize instance of ERDDAPDataset."""
self.dataset_id = dataset_id
self._connection = ERDDAPConnection(ERDDAPConnection.to_string(connection))
self._variables = variables
self._constraints = constraints
self._meta = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the datasets, I envisions supporting some subset of the xarray/pandas filtering/subsetting API.

For instance if I had a GridDataset for http://www.neracoos.org/erddap/griddap/WW3_72_GulfOfMaine_latest.html as eds I could do something like to filter down:

eds = GridDataset.from_url("http://www.neracoos.org/erddap/griddap/WW3_72_GulfOfMaine_latest.html")

eds = eds[['hs', 'dir']]  # select just the wave height and direction variables
eds = eds.sel(time=slice("2022-09-23T00:00:00Z", "2022-09-26T00:00:00Z")) # only for a subset of the time range

Then there are different ways that we could consume the dataset.

ds = eds.to_xarray() # creates a xarray dataset via OpenDAP and filters/slices based on the earlier operations

netcdf_url = eds.url('nc')  # return the NetCDF url

eds.save('nc', NETCDF_PATH)  # Save a netcdf locally.

We probably want to validate that filtering/slicing is working on legit values by default, but have a way to enable/disable it both globally/per dataset. To validate it, we would need to retrieve all the valid variables and dimensions from the server.

@ocefpaf ocefpaf added the GSoC22 label Sep 27, 2022
@vinisalazar
Copy link
Contributor Author

Hi @ocefpaf and @abkfenris,

thank you very much for the review. I just arrived in Australia after a long trip and will take a couple of days to settle, so I will address the comments next week, and for this week I will attempt a fix to #261. Please let me know if there are any problems.

Best,
V

@vinisalazar
Copy link
Contributor Author

The dataset used in test_to_objects.py is blocking the CI from passing. It's necessary to merge #271 or #272 before continuing

vinisalazar and others added 15 commits October 29, 2022 09:54
  - Add interfaces.py to process responses into third-party library objects
  - Use existing dataset
  - Pass requests_kwargs to 'to_pandas' method
  This commit introduces a rough draft of the classes to be introduced in the
  object/opinionated layer. For now, the subpackage containing them is named 'objects'
  but that may change in the future. For discussion, see ioos#228.

  - Add 'objects' subpackage with init module
  - Add 'objects.py' module with 5 new classes
    - Methods will be implemented in the following commits.
  The '|' operator for typing was introduced in Python 3.10. This import
  allows previous Python versions to work with this operator.
- code review

Co-authored-by: Alex Kerney <[email protected]>
  - Rename 'objects' to 'array_like' to avoid clobbering Python built-ins
  - Refactor imports
  - Create connection.py, datasets.py, server.py
  Replaced by connections, datasets, and server modules.
  Add __future__.annotations import to modules 'datasets' and 'server'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants