Added PNC backend to xarray (#1905)
* Added PNC backend to xarray

PNC is used for GEOS-Chem, CAMx, CMAQ, and other atmospheric datasets that have
their own file formats and metadata conventions. It can provide a CF-compliant, netCDF-like interface.
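
A minimal sketch of that netCDF-like interface (the file name is illustrative;
the attributes and methods shown are the ones the new backend relies on, and
the format can also be detected heuristically):

    from PseudoNetCDF import pncopen

    ds = pncopen('example.uamiv', format='uamiv')
    print(ds.dimensions)  # netCDF-like dimensions mapping
    for name, var in ds.variables.items():
        print(name, var.dimensions, var.ncattrs())  # CF-style metadata
    ds.close()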

* Added whats-new documentation

* Updating pnc_ to remove DunderArrayMixin dependency

* Adding basic tests for pnc

Right now, pnc is simply being tested as a reader for NetCDF3 files

* Updating for flake8 compliance

* flake8 does not like an unused e

* Updating pnc to PseudoNetCDF

* Remove outer except

* Updating pnc to PseudoNetCDF

* Added open and updated init

Based on shoyer review

* Updated indexing and test fix

Indexing supports #1899

* Added PseudoNetCDF to doc/io.rst

* Changing test subtype

* Changing test subtype
removing pdb

* pnc test case requires netcdf3only

For now, pnc is only supporting the classic data model

* adding backend_kwargs default as dict

This ensures **mapping is possible.
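
A minimal illustration of why the empty-dict default matters (the function and
argument names here are illustrative, not xarray API):

    def open_with_backend(opener, backend_kwargs=None):
        # Normalize None to {} so **-unpacking always works and no mutable
        # default dictionary is shared between calls.
        if backend_kwargs is None:
            backend_kwargs = {}
        return opener(**backend_kwargs)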

* Upgrading tests to CFEncodedDataTest

Some tests are bypassed. PseudoNetCDF string treatment is not currently
compatible with xarray. This will be addressed soon.

* Not currently supporting autoclose

I do not fully understand the use case, so I have not implemented these tests.

* Minor updates for flake8

* Explicit skipping

Using pytest.mark.skip to skip unsupported tests

* removing trailing whitespace from pytest skip

* Adding pip support

* Addressing comments

* Bypassing pickle, mask/scale, and object

These tests cause errors that do not affect desired backend performance.

* Added uamiv test

PseudoNetCDF reads other formats. This adds a uamiv test to the standard
backend test suite and skips the mask/scale, object, and boolean tests.

* Adding support for autoclose

Ensures that open is called before accessing variable data.

* Adding backend_kwargs to all backends

Most backends currently take no keywords, so an empty dictionary is appropriate.

* Small tweaks to PNC backend

* remove warning and update whats-new

* Separating install and IO PseudoNetCDF docs and updating what's new

* fixing line length in test

* Tests now use non-netcdf files

* Removing support for netCDF files with unknown metadata conventions.

* flake8 cleanup

* Using python 2 and 3 compat testing

* Disabling mask_and_scale by default

prevents inadvertent double scaling in PNC formats
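
The new default mirrors this logic in open_dataset (quoted from the api.py
change below; passing mask_and_scale=True explicitly opts back in):

    if mask_and_scale is None:
        # True for every engine except 'pseudonetcdf', whose readers
        # already apply their own scaling.
        mask_and_scale = not engine == 'pseudonetcdf'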

* consistent with 3.0.0

Updates in 3.0.1 will fix close in uamiv.

* Updating readers and line length

* Updating readers and line length

* Updating readers and line length

* Adding open_mfdataset test

Testing by opening the same file twice and stacking it.
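
A sketch of that test pattern (the concatenation dimension name is illustrative):

    import xarray as xr

    # Open the same uamiv file twice and stack the copies along a new
    # dimension, exercising the pseudonetcdf engine via open_mfdataset.
    stacked = xr.open_mfdataset(['example.uamiv', 'example.uamiv'],
                                engine='pseudonetcdf', concat_dim='STACK',
                                backend_kwargs={'format': 'uamiv'})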

* Using conda version of PseudoNetCDF

* Removing xfail for netcdf

Mask and scale with PseudoNetCDF and NetCDF4 is not supported, but
not prevented.

* Moving pseudonetcdf to v0.15

* Updating what's new

* Fixing open_dataarray CF options

mask_and_scale is None (diagnosed by open_dataset) and decode_cf should be True
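
A hedged illustration of the corrected call path (the file name is
illustrative and is assumed to contain a single data variable):

    import xarray as xr

    # mask_and_scale is forwarded as None so open_dataset can choose the
    # engine-appropriate default; decode_cf keeps its default of True.
    da = xr.open_dataarray('single_variable.uamiv', engine='pseudonetcdf',
                           backend_kwargs={'format': 'uamiv'})
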
barronh authored and shoyer committed Jun 1, 2018
1 parent 4106b94 commit cf19528
Showing 11 changed files with 440 additions and 17 deletions.
1 change: 1 addition & 0 deletions ci/requirements-py36.yml
@@ -20,6 +20,7 @@ dependencies:
- rasterio
- bottleneck
- zarr
- pseudonetcdf>=3.0.1
- pip:
- coveralls
- pytest-cov
7 changes: 5 additions & 2 deletions doc/installing.rst
@@ -28,6 +28,9 @@ For netCDF and IO
- `cftime <https://unidata.github.io/cftime>`__: recommended if you
want to encode/decode datetimes for non-standard calendars or dates before
year 1678 or after year 2262.
- `PseudoNetCDF <http://github.com/barronh/pseudonetcdf/>`__: recommended
for accessing CAMx, GEOS-Chem (bpch), NOAA ARL files, ICARTT files
(ffi1001), and many others.

For accelerating xarray
~~~~~~~~~~~~~~~~~~~~~~~
@@ -65,9 +68,9 @@ with its recommended dependencies using the conda command line tool::

.. _conda: http://conda.io/

We recommend using the community maintained `conda-forge <https://conda-forge.github.io/>`__ channel if you need difficult\-to\-build dependencies such as cartopy or pynio::
We recommend using the community maintained `conda-forge <https://conda-forge.github.io/>`__ channel if you need difficult\-to\-build dependencies such as cartopy, pynio or PseudoNetCDF::

$ conda install -c conda-forge xarray cartopy pynio
$ conda install -c conda-forge xarray cartopy pynio pseudonetcdf

New releases may also appear in conda-forge before being updated in the default
channel.
23 changes: 22 additions & 1 deletion doc/io.rst
@@ -650,7 +650,26 @@ We recommend installing PyNIO via conda::

.. _PyNIO: https://www.pyngl.ucar.edu/Nio.shtml

.. _combining multiple files:
.. _io.PseudoNetCDF:

Formats supported by PseudoNetCDF
---------------------------------

xarray can also read CAMx, BPCH, ARL PACKED BIT, and many other file
formats supported by PseudoNetCDF_, if PseudoNetCDF is installed.
PseudoNetCDF can also add Climate and Forecast (CF) Conventions metadata to
CMAQ files. In addition, PseudoNetCDF can automatically register custom
readers that subclass ``PseudoNetCDF.PseudoNetCDFFile``. PseudoNetCDF
identifies the appropriate reader heuristically, or the format can be
specified via a key in ``backend_kwargs``.

To use PseudoNetCDF to read such files, supply
``engine='pseudonetcdf'`` to :py:func:`~xarray.open_dataset`.

Add ``backend_kwargs={'format': '<format name>'}``, where the available
``<format name>`` options are listed in the PseudoNetCDF documentation.
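
For example, a CAMx ``uamiv`` file like the one added to the test suite in
this commit can be opened with (a minimal sketch; the file name is
illustrative)::

    import xarray as xr

    camxfile = xr.open_dataset('example.uamiv',
                               engine='pseudonetcdf',
                               backend_kwargs={'format': 'uamiv'})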

.. _PseudoNetCDF: http://github.com/barronh/PseudoNetCDF


Formats supported by Pandas
@@ -662,6 +681,8 @@ exporting your objects to pandas and using its broad range of `IO tools`_.
.. _IO tools: http://pandas.pydata.org/pandas-docs/stable/io.html


.. _combining multiple files:


Combining multiple files
------------------------
4 changes: 4 additions & 0 deletions doc/whats-new.rst
@@ -41,6 +41,10 @@ Enhancements
dask<0.17.4. (related to :issue:`2203`)
By `Keisuke Fujii <https://github.com/fujiisoup>`_.

- Added a PseudoNetCDF backend for many atmospheric data formats including
GEOS-Chem, CAMx, NOAA ARL packed bit, and many others.
By `Barron Henderson <https://github.com/barronh>`_.

- :py:meth:`~DataArray.cumsum` and :py:meth:`~DataArray.cumprod` now support
aggregation over multiple dimensions at the same time. This is the default
behavior when dimensions are not specified (previously this raised an error).
2 changes: 2 additions & 0 deletions xarray/backends/__init__.py
@@ -10,6 +10,7 @@
from .pynio_ import NioDataStore
from .scipy_ import ScipyDataStore
from .h5netcdf_ import H5NetCDFStore
from .pseudonetcdf_ import PseudoNetCDFDataStore
from .zarr import ZarrStore

__all__ = [
@@ -21,4 +22,5 @@
'ScipyDataStore',
'H5NetCDFStore',
'ZarrStore',
'PseudoNetCDFDataStore',
]
55 changes: 42 additions & 13 deletions xarray/backends/api.py
@@ -152,9 +152,10 @@ def _finalize_store(write, store):


def open_dataset(filename_or_obj, group=None, decode_cf=True,
mask_and_scale=True, decode_times=True, autoclose=False,
mask_and_scale=None, decode_times=True, autoclose=False,
concat_characters=True, decode_coords=True, engine=None,
chunks=None, lock=None, cache=None, drop_variables=None):
chunks=None, lock=None, cache=None, drop_variables=None,
backend_kwargs=None):
"""Load and decode a dataset from a file or file-like object.
Parameters
@@ -178,7 +179,8 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
taken from variable attributes (if they exist). If the `_FillValue` or
`missing_value` attribute contains multiple values a warning will be
issued and all array values matching one of the multiple values will
be replaced by NA.
be replaced by NA. mask_and_scale defaults to True except for the
pseudonetcdf backend.
decode_times : bool, optional
If True, decode times encoded in the standard NetCDF datetime format
into datetime objects. Otherwise, leave them encoded as numbers.
@@ -194,7 +196,7 @@
decode_coords : bool, optional
If True, decode the 'coordinates' attribute to identify coordinates in
the resulting dataset.
engine : {'netcdf4', 'scipy', 'pydap', 'h5netcdf', 'pynio'}, optional
engine : {'netcdf4', 'scipy', 'pydap', 'h5netcdf', 'pynio', 'pseudonetcdf'}, optional
Engine to use when reading files. If not provided, the default engine
is chosen based on available dependencies, with a preference for
'netcdf4'.
@@ -219,6 +221,10 @@
A variable or list of variables to exclude from being parsed from the
dataset. This may be useful to drop variables with problems or
inconsistent values.
backend_kwargs: dictionary, optional
A dictionary of keyword arguments to pass on to the backend. This
may be useful when backend options would improve performance or
allow user control of dataset processing.
Returns
-------
@@ -229,6 +235,10 @@
--------
open_mfdataset
"""

if mask_and_scale is None:
mask_and_scale = not engine == 'pseudonetcdf'

if not decode_cf:
mask_and_scale = False
decode_times = False
@@ -238,6 +248,9 @@
if cache is None:
cache = chunks is None

if backend_kwargs is None:
backend_kwargs = {}

def maybe_decode_store(store, lock=False):
ds = conventions.decode_cf(
store, mask_and_scale=mask_and_scale, decode_times=decode_times,
@@ -303,18 +316,26 @@ def maybe_decode_store(store, lock=False):
if engine == 'netcdf4':
store = backends.NetCDF4DataStore.open(filename_or_obj,
group=group,
autoclose=autoclose)
autoclose=autoclose,
**backend_kwargs)
elif engine == 'scipy':
store = backends.ScipyDataStore(filename_or_obj,
autoclose=autoclose)
autoclose=autoclose,
**backend_kwargs)
elif engine == 'pydap':
store = backends.PydapDataStore.open(filename_or_obj)
store = backends.PydapDataStore.open(filename_or_obj,
**backend_kwargs)
elif engine == 'h5netcdf':
store = backends.H5NetCDFStore(filename_or_obj, group=group,
autoclose=autoclose)
autoclose=autoclose,
**backend_kwargs)
elif engine == 'pynio':
store = backends.NioDataStore(filename_or_obj,
autoclose=autoclose)
autoclose=autoclose,
**backend_kwargs)
elif engine == 'pseudonetcdf':
store = backends.PseudoNetCDFDataStore.open(
filename_or_obj, autoclose=autoclose, **backend_kwargs)
else:
raise ValueError('unrecognized engine for open_dataset: %r'
% engine)
@@ -334,9 +355,10 @@ def maybe_decode_store(store, lock=False):


def open_dataarray(filename_or_obj, group=None, decode_cf=True,
mask_and_scale=True, decode_times=True, autoclose=False,
mask_and_scale=None, decode_times=True, autoclose=False,
concat_characters=True, decode_coords=True, engine=None,
chunks=None, lock=None, cache=None, drop_variables=None):
chunks=None, lock=None, cache=None, drop_variables=None,
backend_kwargs=None):
"""Open an DataArray from a netCDF file containing a single data variable.
This is designed to read netCDF files with only one data variable. If
@@ -363,7 +385,8 @@ def open_dataarray(filename_or_obj, group=None, decode_cf=True,
taken from variable attributes (if they exist). If the `_FillValue` or
`missing_value` attribute contains multiple values a warning will be
issued and all array values matching one of the multiple values will
be replaced by NA.
be replaced by NA. mask_and_scale defaults to True except for the
pseudonetcdf backend.
decode_times : bool, optional
If True, decode times encoded in the standard NetCDF datetime format
into datetime objects. Otherwise, leave them encoded as numbers.
@@ -403,6 +426,10 @@ def open_dataarray(filename_or_obj, group=None, decode_cf=True,
A variable or list of variables to exclude from being parsed from the
dataset. This may be useful to drop variables with problems or
inconsistent values.
backend_kwargs: dictionary, optional
A dictionary of keyword arguments to pass on to the backend. This
may be useful when backend options would improve performance or
allow user control of dataset processing.
Notes
-----
@@ -417,13 +444,15 @@
--------
open_dataset
"""

dataset = open_dataset(filename_or_obj, group=group, decode_cf=decode_cf,
mask_and_scale=mask_and_scale,
decode_times=decode_times, autoclose=autoclose,
concat_characters=concat_characters,
decode_coords=decode_coords, engine=engine,
chunks=chunks, lock=lock, cache=cache,
drop_variables=drop_variables)
drop_variables=drop_variables,
backend_kwargs=backend_kwargs)

if len(dataset.data_vars) != 1:
raise ValueError('Given file dataset contains more than one data '
101 changes: 101 additions & 0 deletions xarray/backends/pseudonetcdf_.py
@@ -0,0 +1,101 @@
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import functools

import numpy as np

from .. import Variable
from ..core.pycompat import OrderedDict
from ..core.utils import (FrozenOrderedDict, Frozen)
from ..core import indexing

from .common import AbstractDataStore, DataStorePickleMixin, BackendArray


class PncArrayWrapper(BackendArray):

    def __init__(self, variable_name, datastore):
        self.datastore = datastore
        self.variable_name = variable_name
        array = self.get_array()
        self.shape = array.shape
        self.dtype = np.dtype(array.dtype)

    def get_array(self):
        self.datastore.assert_open()
        return self.datastore.ds.variables[self.variable_name]

    def __getitem__(self, key):
        key, np_inds = indexing.decompose_indexer(
            key, self.shape, indexing.IndexingSupport.OUTER_1VECTOR)

        with self.datastore.ensure_open(autoclose=True):
            array = self.get_array()[key.tuple]  # index backend array

        if len(np_inds.tuple) > 0:
            # index the loaded np.ndarray
            array = indexing.NumpyIndexingAdapter(array)[np_inds]
        return array


class PseudoNetCDFDataStore(AbstractDataStore, DataStorePickleMixin):
    """Store for accessing datasets via PseudoNetCDF
    """
    @classmethod
    def open(cls, filename, format=None, writer=None,
             autoclose=False, **format_kwds):
        from PseudoNetCDF import pncopen
        opener = functools.partial(pncopen, filename, **format_kwds)
        ds = opener()
        mode = format_kwds.get('mode', 'r')
        return cls(ds, mode=mode, writer=writer, opener=opener,
                   autoclose=autoclose)

    def __init__(self, pnc_dataset, mode='r', writer=None, opener=None,
                 autoclose=False):

        if autoclose and opener is None:
            raise ValueError('autoclose requires an opener')

        self._ds = pnc_dataset
        self._autoclose = autoclose
        self._isopen = True
        self._opener = opener
        self._mode = mode
        super(PseudoNetCDFDataStore, self).__init__()

    def open_store_variable(self, name, var):
        with self.ensure_open(autoclose=False):
            data = indexing.LazilyOuterIndexedArray(
                PncArrayWrapper(name, self)
            )
            attrs = OrderedDict((k, getattr(var, k)) for k in var.ncattrs())
            return Variable(var.dimensions, data, attrs)

    def get_variables(self):
        with self.ensure_open(autoclose=False):
            return FrozenOrderedDict((k, self.open_store_variable(k, v))
                                     for k, v in self.ds.variables.items())

    def get_attrs(self):
        with self.ensure_open(autoclose=True):
            return Frozen(dict([(k, getattr(self.ds, k))
                                for k in self.ds.ncattrs()]))

    def get_dimensions(self):
        with self.ensure_open(autoclose=True):
            return Frozen(self.ds.dimensions)

    def get_encoding(self):
        encoding = {}
        encoding['unlimited_dims'] = set(
            [k for k in self.ds.dimensions
             if self.ds.dimensions[k].isunlimited()])
        return encoding

    def close(self):
        if self._isopen:
            self.ds.close()
        self._isopen = False
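
A minimal sketch of how this store is exercised, either directly or through
the engine='pseudonetcdf' path added in api.py (the file name is illustrative):

    import xarray as xr
    from xarray.backends import PseudoNetCDFDataStore

    # Construct the store directly and hand it to open_dataset...
    store = PseudoNetCDFDataStore.open('example.uamiv')
    ds = xr.open_dataset(store)

    # ...or let open_dataset build the store from the engine name.
    ds = xr.open_dataset('example.uamiv', engine='pseudonetcdf')
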
1 change: 1 addition & 0 deletions xarray/tests/__init__.py
@@ -68,6 +68,7 @@ def _importorskip(modname, minversion=None):
has_netCDF4, requires_netCDF4 = _importorskip('netCDF4')
has_h5netcdf, requires_h5netcdf = _importorskip('h5netcdf')
has_pynio, requires_pynio = _importorskip('Nio')
has_pseudonetcdf, requires_pseudonetcdf = _importorskip('PseudoNetCDF')
has_cftime, requires_cftime = _importorskip('cftime')
has_dask, requires_dask = _importorskip('dask')
has_bottleneck, requires_bottleneck = _importorskip('bottleneck')
31 changes: 31 additions & 0 deletions xarray/tests/data/example.ict
@@ -0,0 +1,31 @@
27, 1001
Henderson, Barron
U.S. EPA
Example file with artificial data
JUST_A_TEST
1, 1
2018, 04, 27, 2018, 04, 27
0
Start_UTC
7
1, 1, 1, 1, 1
-9999, -9999, -9999, -9999, -9999
lat, degrees_north
lon, degrees_east
elev, meters
TEST_ppbv, ppbv
TESTM_ppbv, ppbv
0
8
ULOD_FLAG: -7777
ULOD_VALUE: N/A
LLOD_FLAG: -8888
LLOD_VALUE: N/A, N/A, N/A, N/A, 0.025
OTHER_COMMENTS: www-air.larc.nasa.gov/missions/etc/IcarttDataFormat.htm
REVISION: R0
R0: No comments for this revision.
Start_UTC, lat, lon, elev, TEST_ppbv, TESTM_ppbv
43200, 41.00000, -71.00000, 5, 1.2345, 2.220
46800, 42.00000, -72.00000, 15, 2.3456, -9999
50400, 42.00000, -73.00000, 20, 3.4567, -7777
50400, 42.00000, -74.00000, 25, 4.5678, -8888
Binary file added xarray/tests/data/example.uamiv
Binary file not shown.