Adding reader_options kwargs to open_virtual_dataset. #67

Merged May 14, 2024 — 37 commits merged into main from reader_options (changes shown from 26 commits)

Commits:
4c6cb63
adding reader_options kwargs to open_virtual_dataset
norlandrhagen Mar 29, 2024
adf311a
Merge branch 'main' into reader_options
TomNicholas Apr 30, 2024
ba5ac6d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 30, 2024
ea30914
fix typing
TomNicholas Apr 30, 2024
448800b
modifies _automatically_determine_filetype to open file with fsspec t…
norlandrhagen May 1, 2024
8c5dff7
using UPath to get file protocol and open with fsspec
norlandrhagen May 1, 2024
6cd77ce
tests passing locally. Reading over s3/local w+w/o indexes & guessing…
norlandrhagen May 2, 2024
f0daafe
merge w/ main
norlandrhagen May 2, 2024
ed3d0f4
add s3fs to test
norlandrhagen May 2, 2024
beec724
typing school 101
norlandrhagen May 2, 2024
e669841
anon
norlandrhagen May 2, 2024
09f89a6
tying
norlandrhagen May 2, 2024
e4db860
test_anon update
norlandrhagen May 2, 2024
ba8b1e3
anon failing
norlandrhagen May 2, 2024
b12d32c
double down on storage_options
norlandrhagen May 2, 2024
f9478b9
fsspec nit
norlandrhagen May 3, 2024
6958b59
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 3, 2024
aefa22d
seting s3 defaults as empty to try to appease the cruel boto3 gods
norlandrhagen May 3, 2024
464ffd3
merge
norlandrhagen May 3, 2024
d108978
added fpath to SingleHDF5ToZarr
norlandrhagen May 3, 2024
5cc5ecd
hardcode in empty storage opts for s3
norlandrhagen May 3, 2024
3509a1f
hardcode default + unpack test
norlandrhagen May 3, 2024
80cf22b
changed reader_options defaults
norlandrhagen May 3, 2024
a3fc72e
Merge branch 'main' into reader_options
norlandrhagen May 3, 2024
0235f51
updated docs install
norlandrhagen May 3, 2024
1e9e2fe
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 3, 2024
55031f9
changed docstring type in utils to numpy style
norlandrhagen May 6, 2024
6a3d7be
added TYPE_CHECKING for fsspec and s3fs mypy type hints
norlandrhagen May 8, 2024
5aec9db
merged w/ main and lint
norlandrhagen May 8, 2024
83b3c4b
fixed TYPE_CHECKING import
norlandrhagen May 8, 2024
a143cf4
pinned xarray to latest commit on github
norlandrhagen May 9, 2024
9d124ef
merged w/ main to pin xarray and kerchunk
norlandrhagen May 13, 2024
3a29b41
re-add upath
norlandrhagen May 13, 2024
b9c056a
Merge branch 'main' into reader_options
norlandrhagen May 13, 2024
13fc295
merged w/ main
norlandrhagen May 14, 2024
4f766d9
ådds section to usage
norlandrhagen May 14, 2024
e6f047f
Minor formatting nit of code example in docs
TomNicholas May 14, 2024
2 changes: 1 addition & 1 deletion ci/doc.yml
@@ -13,4 +13,4 @@ dependencies:
- "sphinx_design"
- "sphinx_togglebutton"
- "sphinx-autodoc-typehints"
- -e ..
- -e "..[test]"
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -23,6 +23,7 @@ dynamic = ["version"]
dependencies = [
"xarray@git+https://github.com/TomNicholas/xarray.git@concat-avoid-index-auto-creation#egg=xarray",
"kerchunk==0.2.2",
"universal-pathlib",
"h5netcdf",
"pydantic",
"numpy",
@@ -40,6 +41,7 @@ test = [
"scipy",
"pooch",
"ruff",
"s3fs"

]

28 changes: 17 additions & 11 deletions virtualizarr/kerchunk.py
@@ -5,6 +5,7 @@
import xarray as xr

from virtualizarr.zarr import ZArray, ZAttrs
from virtualizarr.utils import _fsspec_openfile_from_filepath

# Distinguishing these via type hints makes it a lot easier to mentally keep track of what the opaque kerchunk "reference dicts" actually mean
# (idea from https://kobzol.github.io/rust/python/2023/05/20/writing-python-like-its-rust.html)
@@ -35,7 +36,9 @@ class FileType(AutoName):
zarr = auto()

def read_kerchunk_references_from_file(
filepath: str, filetype: Optional[FileType]
filepath: str, filetype: Optional[FileType],
reader_options: Optional[dict] = {'storage_options':{'key':'', 'secret':'', 'anon':True}}

) -> KerchunkStoreRefs:
"""
Read a single legacy file and return kerchunk references to its contents.
@@ -47,55 +50,57 @@ def read_kerchunk_references_from_file(
filetype : FileType, default: None
Type of file to be opened. Used to determine which kerchunk file format backend to use.
If not provided will attempt to automatically infer the correct filetype from the the filepath's extension.
reader_options: dict, default {'storage_options':{'key':'', 'secret':'', 'anon':True}}
Dict passed into Kerchunk file readers. Note: Each Kerchunk file reader has distinct arguments,
so ensure reader_options match selected Kerchunk reader arguments.
"""

if filetype is None:
filetype = _automatically_determine_filetype(filepath)
filetype = _automatically_determine_filetype(filepath=filepath, reader_options=reader_options)

# if filetype is user defined, convert to FileType
filetype = FileType(filetype)

if filetype.name.lower() == "netcdf3":
from kerchunk.netCDF3 import NetCDF3ToZarr
refs = NetCDF3ToZarr(filepath, inline_threshold=0).translate()

refs = NetCDF3ToZarr(filepath, inline_threshold=0, **reader_options).translate()
elif filetype.name.lower() == "netcdf4":
from kerchunk.hdf import SingleHdf5ToZarr

refs = SingleHdf5ToZarr(filepath, inline_threshold=0).translate()
refs = SingleHdf5ToZarr(filepath, inline_threshold=0, **reader_options).translate()
elif filetype.name.lower() == "grib":
# TODO Grib files should be handled as a DataTree object
# see https://github.com/TomNicholas/VirtualiZarr/issues/11
raise NotImplementedError(f"Unsupported file type: {filetype}")
elif filetype.name.lower() == "tiff":
from kerchunk.tiff import tiff_to_zarr

refs = tiff_to_zarr(filepath, inline_threshold=0)
refs = tiff_to_zarr(filepath, inline_threshold=0, **reader_options)
elif filetype.name.lower() == "fits":
from kerchunk.fits import process_file

refs = process_file(filepath, inline_threshold=0)
refs = process_file(filepath, inline_threshold=0, **reader_options)
else:
raise NotImplementedError(f"Unsupported file type: {filetype.name}")

# TODO validate the references that were read before returning?
return refs


def _automatically_determine_filetype(filepath: str) -> FileType:
def _automatically_determine_filetype(*,filepath: str, reader_options: Optional[dict]={}) -> FileType:
file_extension = Path(filepath).suffix
fpath = _fsspec_openfile_from_filepath(filepath=filepath,reader_options=reader_options)

if file_extension == ".nc":
# based off of: https://github.com/TomNicholas/VirtualiZarr/pull/43#discussion_r1543415167
with open(filepath, 'rb') as f:
magic = f.read()
magic = fpath.read()

if magic[0:3] == b"CDF":
filetype = FileType.netcdf3
elif magic[1:4] == b"HDF":
filetype = FileType.netcdf4
else:
raise ValueError(".nc file does not appear to be NETCDF3 OR NETCDF4")

elif file_extension == ".zarr":
# TODO we could imagine opening an existing zarr store, concatenating it, and writing a new virtual one...
raise NotImplementedError()
@@ -108,6 +113,7 @@ def _automatically_determine_filetype(filepath: str) -> FileType:
else:
raise NotImplementedError(f"Unrecognised file extension: {file_extension}")

fpath.close()
return filetype


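The `.nc` branch of `_automatically_determine_filetype` above distinguishes netCDF-3 from netCDF-4 purely by leading magic bytes. A stdlib-only sketch of that same check (the function name here is illustrative, not part of the library):

```python
def detect_netcdf_flavor(magic: bytes) -> str:
    """Classify a .nc file from its leading bytes, mirroring the PR's check."""
    if magic[0:3] == b"CDF":
        # classic netCDF-3 files start with "CDF"
        return "netcdf3"
    if magic[1:4] == b"HDF":
        # netCDF-4 is HDF5 underneath; the HDF5 signature is b"\x89HDF\r\n\x1a\n"
        return "netcdf4"
    raise ValueError(".nc file does not appear to be NETCDF3 OR NETCDF4")

print(detect_netcdf_flavor(b"CDF\x01rest-of-header"))   # netcdf3
print(detect_netcdf_flavor(b"\x89HDF\r\n\x1a\nmore"))   # netcdf4
```

Reading only the first few bytes is enough here, which is why the helper can reuse the same fsspec file object it opened for protocol handling.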
4 changes: 2 additions & 2 deletions virtualizarr/tests/test_kerchunk.py
@@ -159,8 +159,8 @@ def test_automatically_determine_filetype_netcdf3_netcdf4():
ds.to_netcdf(netcdf3_file_path, engine="scipy", format="NETCDF3_CLASSIC")
ds.to_netcdf(netcdf4_file_path, engine="h5netcdf")

assert FileType("netcdf3") == _automatically_determine_filetype(netcdf3_file_path)
assert FileType("netcdf4") == _automatically_determine_filetype(netcdf4_file_path)
assert FileType("netcdf3") == _automatically_determine_filetype(filepath = netcdf3_file_path)
assert FileType("netcdf4") == _automatically_determine_filetype(filepath = netcdf4_file_path)



16 changes: 16 additions & 0 deletions virtualizarr/tests/test_xarray.py
@@ -4,6 +4,7 @@
import xarray as xr
import xarray.testing as xrt
from xarray.core.indexes import Index
import pytest

from virtualizarr import open_virtual_dataset
from virtualizarr.manifests import ChunkManifest, ManifestArray
@@ -271,6 +272,19 @@ def test_combine_by_coords(self, netcdf4_files):
assert combined_vds.xindexes["time"].to_pandas_index().is_monotonic_increasing







pytest.importorskip("s3fs")
@pytest.mark.parametrize("filetype", ['netcdf4', None], ids=["netcdf4 filetype", "None filetype"])
@pytest.mark.parametrize("indexes", [None, {}], ids=["None index", "empty dict index"])
def test_anon_read_s3(filetype, indexes):
#TODO: Switch away from this s3 url after minIO is implemented.
fpath = 's3://carbonplan-share/virtualizarr/local.nc'
assert open_virtual_dataset(fpath, filetype=filetype, indexes=indexes)

class TestLoadVirtualDataset:
def test_loadable_variables(self, netcdf4_file):
vars_to_load = ['air', 'time']
@@ -283,6 +297,8 @@ def test_loadable_variables(self, netcdf4_file):
assert isinstance(vds[name].data, ManifestArray), name

full_ds = xr.open_dataset(netcdf4_file)

for name in full_ds.variables:

if name in vars_to_load:
xrt.assert_identical(vds.variables[name], full_ds.variables[name])
46 changes: 46 additions & 0 deletions virtualizarr/utils.py
@@ -0,0 +1,46 @@
from typing import Optional, Union

# TODO: importing fsspec and s3fs to get typing. Is there a better way incase these are optional deps?
from s3fs.core import S3File
from fsspec.implementations.local import LocalFileOpener


def _fsspec_openfile_from_filepath(*, filepath: str, reader_options: Optional[dict] = {'storage_options':{'key':'', 'secret':'', 'anon':True}}) -> Union[S3File, LocalFileOpener]:
"""Utility function to facilitate reading remote file paths using fsspec.

:param filepath: Input filepath
:type filepath: str
:param reader_options: Dict containing options to pass to fsspec file reader. Default: {'storage_options':{'key':'', 'secret':'', 'anon':True}}
:type reader_options: Optional[dict]
:rtype: Union[S3File, LocalFileOpener]
"""
import fsspec
from upath import UPath

universal_filepath = UPath(filepath)
protocol = universal_filepath.protocol

# why does UPath give an empty string for a local file protocol :(
# import pdb; pdb.set_trace()

if protocol == '':

fpath = fsspec.open(filepath, 'rb').open()

elif protocol in ["s3"]:
s3_anon_defaults = {'key':'', 'secret':'', 'anon':True}
if not bool(reader_options):
storage_options = s3_anon_defaults

else:
storage_options = reader_options.get('storage_options') #type: ignore

# using dict merge operator to add in defaults if keys are not specified
storage_options = s3_anon_defaults | storage_options

fpath = fsspec.filesystem(protocol, **storage_options).open(filepath)

else:
raise NotImplementedError("Only local and s3 file protocols are currently supported")

return fpath
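Two pieces of `_fsspec_openfile_from_filepath` can be illustrated without fsspec, s3fs, or UPath installed: inferring the protocol from the path (the empty-string case is a local file, as the code comments note), and the dict-merge defaulting of `storage_options`. A minimal stdlib sketch, with `urlparse` standing in for UPath — the names below are illustrative, not the library's API:

```python
from urllib.parse import urlparse

S3_ANON_DEFAULTS = {"key": "", "secret": "", "anon": True}

def infer_protocol(filepath):
    """Return the URL scheme; '' means a plain local path, as in the helper above."""
    return urlparse(filepath).scheme

def merged_storage_options(reader_options):
    """Fill in anonymous-S3 defaults for any keys the caller omitted."""
    if not reader_options:
        return dict(S3_ANON_DEFAULTS)
    storage_options = reader_options.get("storage_options") or {}
    # dict merge operator (Python 3.9+): the right-hand operand wins,
    # so caller-supplied keys override the anonymous defaults
    return S3_ANON_DEFAULTS | storage_options

print(infer_protocol("s3://carbonplan-share/virtualizarr/local.nc"))         # s3
print(infer_protocol("/tmp/local.nc"))                                       # (empty string)
print(merged_storage_options({"storage_options": {"anon": False}})["anon"])  # False
```

The merge order matters: putting `S3_ANON_DEFAULTS` on the left is what lets an explicit `{'anon': False}` from the caller take precedence.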
11 changes: 10 additions & 1 deletion virtualizarr/xarray.py
@@ -9,6 +9,7 @@
from xarray.core.variable import IndexVariable

import virtualizarr.kerchunk as kerchunk
from virtualizarr.utils import _fsspec_openfile_from_filepath
from virtualizarr.kerchunk import KerchunkStoreRefs, FileType
from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.zarr import dataset_to_zarr, attrs_from_zarr_group_json, metadata_from_zarr_json
@@ -27,6 +28,7 @@ def open_virtual_dataset(
loadable_variables: Optional[Iterable[str]] = None,
indexes: Optional[Mapping[str, Index]] = None,
virtual_array_class=ManifestArray,
reader_options: Optional[dict] = {'storage_options':{'key':'', 'secret':'', 'anon':True}},
TomNicholas (Member) commented:

    When you normally point xr.open_dataset at an S3 url, you don't need to pass reader_options do you? Can we try to follow the signature of xr.open_dataset as closely as possible? (Maybe this already is as close as we can get)

norlandrhagen (Collaborator, Author) replied:

    I could be wrong, but I thought you had to pass in some sort of fsspec/s3fs mapper.

    For me this fails:

    ds = xr.open_dataset('s3://carbonplan-share/virtualizarr/local.nc')

) -> xr.Dataset:
"""
Open a file or store as an xarray Dataset wrapping virtualized zarr arrays.
@@ -55,13 +57,17 @@
virtual_array_class
Virtual array class to use to represent the references to the chunks in each on-disk array.
Currently can only be ManifestArray, but once VirtualZarrArray is implemented the default should be changed to that.
reader_options: dict, default {'storage_options':{'key':'', 'secret':'', 'anon':True}}
Dict passed into Kerchunk file readers. Note: Each Kerchunk file reader has distinct arguments,
so ensure reader_options match selected Kerchunk reader arguments.

Returns
-------
vds
An xarray Dataset containing instances of virtual_array_cls for each variable, or normal lazily indexed arrays for each variable in loadable_variables.
"""


if drop_variables is None:
drop_variables = []
elif isinstance(drop_variables, str):
@@ -103,7 +109,9 @@
# TODO we are reading a bunch of stuff we know we won't need here, e.g. all of the data variables...
# TODO it would also be nice if we could somehow consolidate this with the reading of the kerchunk references
# TODO really we probably want a dedicated xarray backend that iterates over all variables only once
ds = xr.open_dataset(filepath, drop_variables=drop_variables)
fpath = _fsspec_openfile_from_filepath(filepath=filepath,reader_options=reader_options)

ds = xr.open_dataset(fpath, drop_variables=drop_variables)

if indexes is None:
# add default indexes by reading data from file
@@ -139,6 +147,7 @@
return vds



def open_virtual_dataset_from_v3_store(
storepath: str,
drop_variables: List[str],
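Putting the pieces together, a hypothetical caller-side sketch of what this PR enables (the `open_virtual_dataset` call and the S3 URL come from the diff and review thread above; the call itself is left commented out since actually reading from S3 needs s3fs and network access, so only the plain-dict plumbing is exercised here):

```python
# reader_options as this PR defines it: a dict whose 'storage_options'
# entry is splatted into the fsspec filesystem / kerchunk file reader.
reader_options = {"storage_options": {"key": "", "secret": "", "anon": True}}

# Hypothetical call shape after this PR merges (not executed here):
# vds = open_virtual_dataset(
#     "s3://carbonplan-share/virtualizarr/local.nc",
#     filetype=None,                 # auto-detected from magic bytes
#     indexes={},                    # skip default index creation
#     reader_options=reader_options,
# )

storage_options = reader_options["storage_options"]
print(sorted(storage_options))  # ['anon', 'key', 'secret']
```

Note that each kerchunk reader has distinct keyword arguments, so (as the docstring warns) the contents of `reader_options` must match the reader that ends up being selected.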