
Data file formats #114

Open
ali-ramadhan opened this issue Mar 19, 2019 · 34 comments

@ali-ramadhan
Member

ali-ramadhan commented Mar 19, 2019

For Oceananigans.jl we settled on supporting NetCDF output early as it's the most ubiquitous and familiar data format (see CliMA/Oceananigans.jl#31).

We've been using NetCDF.jl, but it seems to be an inactive package, so we're thinking of switching to the more active and thoroughly tested NCDatasets.jl. NetCDF.jl is missing some features and seems slow. Output writing and compression were so slow compared to time stepping on the GPU that we ended up introducing asynchronous output writing (CliMA/Oceananigans.jl#137).

Anyway, just curious whether NetCDF output is something you guys have looked at yet. If not, it might be worth working on a common solution with the features we all need (fast NetCDF output, an option for asynchronous output from the GPU, standardized names, etc.).
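For reference, the asynchronous-writing idea is just a producer/consumer hand-off. A minimal sketch (not Oceananigans' actual implementation; `write_snapshot` is a hypothetical callback):

```julia
# Minimal sketch: the time-stepping loop pushes snapshots into a Channel
# while a separate task drains it to disk. Not Oceananigans' actual
# implementation; `write_snapshot` is a hypothetical callback.
function make_async_writer(write_snapshot; buffer = 4)
    ch = Channel{Any}(buffer)
    writer = @async for snapshot in ch
        write_snapshot(snapshot)  # slow NetCDF write happens off the hot loop
    end
    return ch, writer
end

ch, writer = make_async_writer(s -> println("writing iteration ", s.iteration))
for i in 1:10
    # ... time step on the GPU, then copy fields to the host ...
    put!(ch, (iteration = i, time = 0.1i))  # blocks only if the buffer is full
end
close(ch)
wait(writer)
```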

@ali-ramadhan
Member Author

Another option (not sure if it is good, as I have never used it) is a meta-package like ADIOS (https://www.olcf.ornl.gov/center-projects/adios/), which supports parallel I/O and is somewhat file-format agnostic.
View in Slack

@ali-ramadhan
Member Author

<@UCSM0C54P> re: your NetCDF issue, why not HDF5?
View in Slack

@ali-ramadhan
Member Author

if you do need NetCDF compatibility (e.g. for exporting to another program), it might be possible to enforce the necessary restrictions at the Julia level
View in Slack

@ali-ramadhan
Member Author

My short exposure to HDF5 has all been positive
View in Slack

@ali-ramadhan
Member Author

The main complaints I have heard about HDF5 have been when people try to update or mutate data, and corrupt their dataset
View in Slack

@ali-ramadhan
Member Author

I think that if we want to encourage broad use, it matters that netCDF is widely ingrained in the climate communities. Some ability to produce netCDF for downstream use might be wise. Possibly that is not sufficient for full DG info though? What is the thinking on the street re targeting the netCDF/CF style and engaging a bit with the Pangeo community (https://pangeo.io)? Pangeo has been leaning on netCDF to transparently include a zarr backend option (https://zarr.readthedocs.io/en/stable/), which plays well with cloud object stores out of the box.
View in Slack

@rabernat

I found this repo thanks to a tweet. Kudos to everyone for following modern software development best practices in terms of CI, docs, etc! This bodes very well for the CLIMA project!

It's correct that in Pangeo we are exploring storage options beyond netCDF. The primary reason is that HDF5, the library underlying netCDF4, does not currently play very well with object storage (e.g. Amazon S3, Google Cloud Storage, etc; see blog post about why). We are very excited about using the cloud as a shared environment for the analysis of big climate datasets, so we have been experimenting with other options.

Zarr is the thing we are most excited about. Zarr was originally developed for genomic analysis by researchers who were frustrated by the poor parallel write support in HDF5. It uses an open spec and is quite simple, yet it performs very well in single-workstation, HPC, and cloud environments. Here is a little talk I gave about Zarr at AGU. TileDB is a similar product, the difference being that it is backed by a company and, while open source, doesn't have an open spec. After the experience of the HDF Group going commercial, I think the community is now a little wary of that.

Zarr has been adopted in a number of different fields and implemented in a number of different languages (Python, C++, and Java). @meggart of MPI has a working Julia implementation. (See also https://github.com/zarr-developers/zarr/issues/284.)

Although Zarr is working great for us, it is still experimental in the sense that it is not yet a community standard. The netCDF data model and the CF conventions themselves are excellent. They are the result of decades of tedious yet important work to define common standards for exchanging data. Going forward, I hope we can preserve these while experimenting with new underlying storage containers like zarr. netCDF != HDF. In this vein, it's been great to learn that the Unidata netCDF group themselves are working on creating a zarr-based backend for the netCDF core C library. I would expect them to release something within one year.
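As a concrete illustration of what preserving the CF conventions looks like at the Julia level, here is a minimal NCDatasets.jl sketch (file name, sizes, and attribute values are made up for the example):

```julia
# Sketch: writing a CF-style variable with NCDatasets.jl. The file name,
# sizes, and attribute values are placeholders.
using NCDatasets

ds = NCDataset("sst.nc", "c")
defDim(ds, "lon", 144)
defDim(ds, "lat", 72)
sst = defVar(ds, "sst", Float32, ("lon", "lat"), attrib = Dict(
    "standard_name" => "sea_surface_temperature",  # CF standard name
    "units"         => "degree_Celsius",           # CF-style units string
))
sst[:, :] = fill(20.0f0, 144, 72)
close(ds)
```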

In the meantime, it would be great to see the CLIMA project experiment with zarr. The first step would be to check in at https://github.com/meggart/ZarrNative.jl to understand better where the julia library stands.

@simonbyrne
Member

Another thing we might want to look at, especially if we're looking to use HDF5: https://www.openpmd.org

@simonbyrne changed the title from "What do you guys do for NetCDF output?" to "Data file formats" on Jun 14, 2020
@simonbyrne
Member

simonbyrne commented Jun 14, 2020

So in case anyone is interested, here is a brief summary of the different data formats.

HDF

HDF (Hierarchical Data Format) is a set of file formats and libraries, maintained by the HDF Group, which represent hierarchies of groups and datasets (like a file system stored inside a file).

  • HDF4 is an old format, with many hard-coded limitations (limited types, 32-bit integer indexing).
  • HDF5 is newer and more general, supporting different data layouts, compression, etc. It is extremely complex, but an open standard.

The hdf5 library can optionally be built against MPI, for parallel I/O.
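Jumping ahead to the Julia packages below: with an MPI-enabled libhdf5, HDF5.jl can open files collectively over a communicator. A sketch using its current API (requires libhdf5 built with MPI support and a matching MPI.jl setup):

```julia
# Sketch of collective parallel I/O through HDF5.jl; requires a libhdf5
# built with MPI support and MPI.jl using the same MPI implementation.
using MPI, HDF5
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

f = h5open("parallel.h5", "w", comm, MPI.Info())     # collective open via MPI-IO
dset = create_dataset(f, "u", Float64, (8, nranks))  # metadata ops are collective
dset[:, rank + 1] = fill(Float64(rank), 8)           # each rank writes its column
close(f)
MPI.Finalize()
```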

Given its standard but flexible nature, many other software packages have used HDF5 as the basis for their own formats (NetCDF-4, below, is one example).

Packages

  • HDF5.jl wraps the hdf5 library. It is an old package, and appears to be intermittently maintained.
  • JLD.jl is built on top of HDF5.jl, for convenient saving/loading of arbitrary Julia objects.
  • JLD2.jl is a pure-Julia implementation of a subset of HDF5, and similarly supports saving/loading of Julia objects.
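A quick sketch of the two styles using the current HDF5.jl and JLD2.jl APIs (file names and data are placeholders):

```julia
using HDF5, JLD2

# HDF5.jl: explicit groups/datasets, readable by any HDF5 tool
h5open("run.h5", "w") do f
    f["fields/u"] = rand(16, 16)            # creates the group and dataset
    attrs(f["fields/u"])["units"] = "m/s"   # attach an attribute
end
u = h5read("run.h5", "fields/u")

# JLD2.jl: save/load arbitrary Julia objects in an HDF5-compatible layout
params = (ν = 1.0e-4, Δt = 0.1)
jldsave("run.jld2"; u, params)
u2 = load("run.jld2", "u")
```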

NetCDF

NetCDF, developed by Unidata at UCAR, is the name for any one of a set of file formats which can be read and written by the netcdf library. These are:

  1. NetCDF Classic (PnetCDF calls this CDF-1)
  2. NetCDF 64-bit offset, which is a variant of 1 with 64-bit offsets for large arrays (PnetCDF calls this CDF-2)
  3. A variant of 2 with support for 64-bit integer values, defined by PnetCDF (see below), which calls it CDF-5 (this format does not appear to be supported by the netcdf library).

Formats 1-3 are often labelled NetCDF-3.

  4. NetCDF-4, which is a subset of HDF5. It supports additional features such as custom datatypes.
  5. NetCDF-4 Classic, which is a subset of NetCDF-4 limited to the same features as the NetCDF Classic format (e.g. no custom datatypes), so that datasets can be losslessly converted between the two formats.

The netcdf library can also read (but not write) HDF4 files.

Parallel I/O is supported as follows:

  • Formats 1, 2, and 3: via the PnetCDF package, developed by ANL, which is a completely separate library built on top of MPI. It can either be used standalone, or netcdf can be built with PnetCDF support (--enable-pnetcdf) so that these formats can use the same interface as formats 4 and 5.
  • Formats 4 and 5: supported directly, if the underlying HDF5 library was built with MPI support.

Packages

There are two Julia packages which wrap the netcdf library:

  • NetCDF.jl is the older package and had been inactive, though it recently appears to have received some attention.
  • NCDatasets.jl is a newer package with a slightly different interface: this is what we're currently using, and by default it uses the NetCDF-4 format.
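To give a feel for the interface difference, a sketch (file and variable names are placeholders):

```julia
# NetCDF.jl: function-style interface
using NetCDF
temp = NetCDF.ncread("output.nc", "temperature")

# NCDatasets.jl: dataset/dictionary-style interface (NetCDF-4 by default)
using NCDatasets
NCDataset("output.nc") do ds
    temp = ds["temperature"][:, :]
end
```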

@kpamnany
Contributor

From here:

NETCDF3_CLASSIC was the original netcdf binary format, and was limited to file sizes less than 2 Gb. NETCDF3_64BIT_OFFSET was introduced in version 3.6.0 of the library, and extended the original binary format to allow for file sizes greater than 2 Gb. NETCDF3_64BIT_DATA is a new format that requires version 4.4.0 of the C library - it extends the NETCDF3_64BIT_OFFSET binary format to allow for unsigned/64 bit integer data types and 64-bit dimension sizes. NETCDF3_64BIT is an alias for NETCDF3_64BIT_OFFSET. NETCDF4_CLASSIC files use the version 4 disk format (HDF5), but omits features not found in the version 3 API. They can be read by netCDF 3 clients only if they have been relinked against the netCDF 4 library. They can also be read by HDF5 clients. NETCDF4 files use the version 4 disk format (HDF5) and use the new features of the version 4 API.
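For reference, with NCDatasets.jl the on-disk variant described above is selected via the `format` keyword at creation time; a minimal sketch:

```julia
# Sketch: NCDatasets.jl selects the on-disk variant via the format keyword.
using NCDatasets

# Supported values include :netcdf4 (the default), :netcdf4_classic,
# :netcdf3_classic, and :netcdf3_64bit_offset.
ds = NCDataset("classic.nc", "c", format = :netcdf3_64bit_offset)
defDim(ds, "x", 10)
v = defVar(ds, "u", Float64, ("x",))
v[:] = rand(10)
close(ds)
```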

@rabernat

I thought I would mention that Zarr.jl has continued to mature during this time. Here is an example of using Zarr.jl to read CMIP6 data directly from Google Cloud Storage: https://github.com/pangeo-data/pangeo-julia-examples
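For anyone wanting a quick feel for the API, here is a minimal local sketch (path, shape, and chunking are made up; stores for cloud object storage work the same way):

```julia
# Sketch: create and reopen a chunked, compressed Zarr array on local disk.
using Zarr

z = zcreate(Float32, 360, 180, 100;
            path = "sst.zarr", chunks = (90, 90, 10))
z[:, :, 1] = rand(Float32, 360, 180)

z2 = zopen("sst.zarr")   # reopen from the directory store
```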

NCAR has been testing out Zarr with CESM and has found excellent performance: https://www2.cisl.ucar.edu/sites/default/files/Weile_Wei_Slides.pdf

Zarr has also received a CZI grant, putting it on a path for long-term sustainability.

We (Zarr developers) would be pleased to try to work with you if you were interested in playing around with Zarr as an I/O format.

@simonbyrne
Member

@rabernat Thanks, that is interesting, especially the benchmarks.

@simonbyrne
Member

simonbyrne commented Jun 15, 2020

To summarize discussion with @jakebolewski @kpamnany @christophernhill @skandalaCLIMA:

  • we will inevitably need to dump & stage raw binaries for huge runs
  • ideally small runs can just output directly
  • we need to support NetCDF for compatibility with existing toolchains
  • we can't cross-compile an MPI-aware HDF5 (or NetCDF with parallel support) ([HDF5] Cross-compile for all platforms JuliaPackaging/Yggdrasil#567)
  • we might be able to get away with just using PnetCDF directly, since we don't seem to need a lot of extra functionality. It has a completely different API, however, so it will require new Julia wrapper packages (see the sketch below). @kpamnany is going to look into this.
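For a sense of scale of that wrapping effort, a hypothetical ccall sketch against the real `ncmpi_create` entry point (the library name, the MPICH-compatible ABI assumption, and the error handling are all illustrative, not an existing package):

```julia
# Hypothetical sketch of a thin PnetCDF wrapper via ccall. The C function
# ncmpi_create is real; the library name and the MPICH-compatible ABI
# (MPI_Comm/MPI_Info as C ints) are assumptions for illustration.
using MPI

const libpnetcdf = "libpnetcdf"       # assumed discoverable by the loader
const NC_CLOBBER      = Cint(0x0000)
const NC_64BIT_DATA   = Cint(0x0020)  # the CDF-5 format flag

function pnetcdf_create(comm::MPI.Comm, path::AbstractString,
                        cmode::Cint = NC_CLOBBER | NC_64BIT_DATA)
    ncid = Ref{Cint}(-1)
    # int ncmpi_create(MPI_Comm, const char*, int, MPI_Info, int*)
    st = ccall((:ncmpi_create, libpnetcdf), Cint,
               (Cint, Cstring, Cint, Cint, Ptr{Cint}),
               comm.val, path, cmode, MPI.INFO_NULL.val, ncid)
    st == 0 || error("ncmpi_create returned status $st")
    return ncid[]
end
```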

@jakebolewski
Contributor

jakebolewski commented Jun 15, 2020

@rabernat what are common tricks used to increase write performance on HPC systems with Zarr? Just use larger chunk sizes?

Zarr and TileDB have similar models (the main difference being that TileDB can stream multiple zarr chunks into a single file + index), and TileDB was found not to perform well at all on HPC systems due to the number of filesystem metadata requests, which I would imagine would be a similar limitation for Zarr. Admittedly, the design focus was on object storage rather than traditional shared POSIX filesystems.
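For what it's worth, chunking is just a creation-time parameter, so the trade-off is easy to experiment with; a sketch (sizes are illustrative):

```julia
# Sketch: fewer, larger chunks means fewer files and fewer filesystem
# metadata operations on a POSIX directory store.
using Zarr

# 32 × 32 = 1024 chunk files
small = zcreate(Float64, 4096, 4096; path = "small.zarr", chunks = (128, 128))
# 4 × 4 = 16 chunk files for the same array
large = zcreate(Float64, 4096, 4096; path = "large.zarr", chunks = (1024, 1024))
```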

@rabernat

@jakebolewski -- All good questions! You'll probably have to email the author of that presentation to get some answers. Your point about metadata requests is indeed a serious limitation in traditional HPC. My own work has focused mostly on integration between Zarr, object storage, and distributed processing in the cloud.

I understand your need to support NetCDF for compatibility with existing toolchains. One piece of good news is that the NetCDF Zarr backend development at Unidata is moving forward quickly. They should have some sort of release within a month or so. That will give us the best of both worlds (trusted data format + cloud-friendly backend).

@jakebolewski
Contributor

That's great to hear that the NetCDF Zarr back-end release is happening soon.

Since we will have to stage the parallel upload to GCS anyway, doing the Zarr conversion as a post-processing step doesn't seem too burdensome, especially if it is natively supported by the netcdf library.
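That post-processing conversion is only a few lines with the existing packages; a hypothetical sketch (file names, variable name, and chunking are placeholders, and the variable is assumed to be 3-D with no missing values):

```julia
# Hypothetical sketch of a NetCDF -> Zarr post-processing step.
using NCDatasets, Zarr

NCDataset("model_output.nc") do ds
    u = nomissing(ds["u"][:, :, :])   # drop the Union{Missing,T} eltype
    z = zcreate(eltype(u), size(u)...;
                path = "model_output.zarr", chunks = (90, 90, 16))
    z[:, :, :] = u
end
```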

@simonbyrne
Member

simonbyrne commented Oct 19, 2020

@jedbrown suggested we also look into making use of Conduit, which can be used with Ascent for visualization.

@jedbrown

jedbrown commented Oct 19, 2020

I think there are two distinct goals here:

  1. output formats for downstream analysis, with significant preference for CF conventions and simple spatial gridding
  2. high-fidelity representation (checkpoint or 3D viz) of high-order spaces

Goal 2 requires describing the finite element spaces, and thus more schema than needed by structured grids. The intent of Conduit is to give a rich in-memory description of your data for in-situ analysis/viz or writing to disk for postprocessing. It supports a few persistent formats and they would welcome more as demand warrants. The goal is to only need to provide that rich description of your data once, and the application doesn't need to think about what is being done with it.

It seems like this Zarr backend for NetCDF would be trying to satisfy Goal 1. I don't know the pros/cons relative to HDF5 VOL, which has been developed to address HDF5's shortcomings with respect to object storage (whose use in HPC is expected to become mainstream; see DAOS).

@PetrKryslUCSD

@jedbrown : I have to say that I am confused by the purpose of the Conduit infrastructure. It seems to rely on an ASCII file to store the mesh. That is fine for academic purposes and small meshes, but how is that applicable to large-scale computing? Do you know if there is a good description of the intent and design of this software? I found some tutorials, but they left me more perplexed than before.

@jedbrown

jedbrown commented Nov 2, 2020

Conduit is a no-copy in-memory description with in-situ viz/rendering capability and various storage backends (if you opt to persist the mesh/field data rather than render in-situ). Perhaps you're reading a JSON or YAML example, but it has binary support (raw, HDF5, and ADIOS).

The philosophy is that you write a Conduit interface once and can select at run time whether/when to persist it to disk or connect to an in-situ analysis pipeline. This avoids the previous hassle of needing to maintain in-situ capability (with ParaView's Catalyst or VisIt's Libsim) in addition to persistent read/write capability.

@PetrKryslUCSD

Okay, that makes a little bit more sense. I will have to see if I can find an example of in-situ sharing of the data.

@PetrKryslUCSD

It seems kind of like in-memory HDF5, doesn't it?

@PetrKryslUCSD

I wonder why the switch to the JSON-like representation? Why not stick to the HDF5 semantics?

@jedbrown

jedbrown commented Nov 2, 2020

repo and examples

They show JSON in the small examples just because it needs no libraries and is easy to inspect/tinker with.

@PetrKryslUCSD

I was rather referring to the schema itself. It appears to be modeled on JSON, if I am not mistaken. It might have been modeled on HDF5 instead, which would have made more sense given that it would be closer to a practically important file format.

@PetrKryslUCSD

repo and examples

They show JSON in the small examples just because it needs no libraries and is easy to inspect/tinker with.

Many thanks. That is helpful.

I wonder why there have been no updated presentations in the past four years. The code still seems to be actively worked on (the commit heartbeat shows the latest commit was introduced today).

@jedbrown

jedbrown commented Nov 2, 2020

Maybe @mclarsen could let you know if there are recent public slides posted somewhere. (He was at our CEED meeting in August, but I don't know if I'm free to post slides here.)

Matt could also answer any questions about potential use of Conduit and other Alpine components with CLiMA.

@PetrKryslUCSD

Maybe @mclarsen could let you know if there are recent public slides posted somewhere. (He was at our CEED meeting in August, but I don't know if I'm free to post slides here.)

Matt could also answer any questions about potential use of Conduit and other Alpine components with CLiMA.

Very cool, many thanks. The concept looks like something I should try to understand in depth. There was a discussion among several people about mesh interfaces in Julia... This may be very relevant to that conversation.

@mclarsen

mclarsen commented Nov 3, 2020

So, I haven't processed the entire thread, but I can chime in. Conduit is primarily an in-memory way to describe any hierarchical data. The examples are in human-readable JSON and YAML for clarity, but the underlying data is most definitely not ASCII. Conduit has some conventions (called the mesh blueprint) that impose some structure so that meshes can be shared in memory. Conduit has conventions for both low-order and high-order meshes.

We currently use this mesh representation to share meshes between simulation codes and analysis libraries (e.g., visualization). Additionally, Conduit can write out HDF5 files that can be used for simulation checkpoints (we are doing this at LLNL) and for visualization dumps.

I work on Ascent, which is an in situ visualization library, and we use Conduit as our interface to simulation data. Additionally, ParaView Catalyst has announced that it will be using Conduit as its interface in an upcoming release (https://catalyst-in-situ.readthedocs.io/en/latest/introduction.html#relationship-with-conduit).

As for an example of what using conduit in a simulation code looks like, here is a simple example:
https://github.com/Alpine-DAV/ascent/blob/c866652e522dce11db8ccda3601f6ebd84a339b9/src/examples/proxies/lulesh2.0.3/lulesh-init.cc#L199

I am happy to answer any questions.

@PetrKryslUCSD

PetrKryslUCSD commented Nov 3, 2020

@mclarsen Matt: Thank you, the information you provided is helpful. Are there any updated documents for Conduit and/or the mesh blueprint? Is there an actual spec of the mesh blueprint? I did find some user documentation from 2017, but it was pretty sparse.

@mclarsen

mclarsen commented Nov 3, 2020

https://llnl-conduit.readthedocs.io/en/latest/blueprint_mesh.html is the current documentation. If you have any specific questions about any of that, I am happy to answer. Like all documentation, it can be improved.

@PetrKryslUCSD

PetrKryslUCSD commented Nov 3, 2020

Yes, I did find that. Unfortunately it looks like it doesn't quite nail down the mesh blueprint. You sort of have to dig through the examples and make guesses as to the meaning of the supplied data. :-\ Guessing == not good.

@PetrKryslUCSD

@mclarsen Thank you for your generous offer to answer some questions. Would you prefer that I pose the questions elsewhere, or is this good?

@mclarsen

mclarsen commented Nov 3, 2020

I don't want to hijack this discussion, but I am happy to answer here if it's appropriate. If not, we can take it offline.
