
HDF5 Image Schema

Adrianna Pińska edited this page Nov 13, 2018 · 15 revisions

Requirements

Our aim is to produce a schema which translates existing data and attributes from a FITS file to HDF5 as simply and directly as possible, while also letting us store additional derived data that is worth caching on disk. To meet IDIA's requirements, a schema needs to support the following information, in the form of additional datasets or metadata:

  • Average datasets, constructed by averaging the dataset along a particular axis (normally Z). These generally have a higher signal-to-noise ratio, and are useful during data visualisation. Calculating them on the fly is computationally expensive, and results can easily be baked into the file. The name of the axis along which the average is taken should be indicated by the dataset name.
  • Permuted datasets (i.e. rotating XYZ -> ZYX) will allow for enormous speedups when reading image slices along non-principal axes. The schema should define how optional permuted datasets are stored in a standardised manner, so software supporting the schema can check for these datasets when performing I/O-intensive dataset slices, such as reading a Z-profile at a given (X,Y) pixel value. The name of the permuted dataset should indicate the rotated layout.
  • Mip-mapped datasets store a copy of the dataset, down-sampled across a particular image plane (e.g. XY). As the visualization of large data generally requires down-sampling of generated images to match the user's viewport, this allows for the visualization of large data sets without loading entire image planes and performing down-sampling for each generated image. In addition, this will enable efficient delivery of images to the client using tiling techniques commonly used in geographic information system (GIS) applications.
  • Histograms, defined along a particular image plane (e.g. XY or YZ), are computationally expensive to compute, but relatively small and simple to store. For example, calculating the histogram for a 4096x4096 image slice takes approximately 80 ms of calculation time. Using the “square root” guideline (the number of bins is the square root of the number of pixels), a histogram with N=4096 bins would take an additional 16 KB of storage space at 4 bytes per bin. Histograms could then be used to calculate approximate percentile values.
  • Percentile values are expensive to compute, but reusable and easily stored in metadata. We currently don't calculate and store exact percentile values, instead using approximations calculated from histogram data.
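
The histogram and percentile requirements can be illustrated with a minimal NumPy sketch. It is not the converter's implementation: the helper names `sqrt_rule_bins` and `approx_percentile`, the linear interpolation inside a bin, and the small random plane standing in for a 4096x4096 slice are all assumptions made for illustration.

```python
import numpy as np

def sqrt_rule_bins(num_pixels):
    """Number of bins under the "square root" guideline: ceil(sqrt(n))."""
    return int(np.ceil(np.sqrt(num_pixels)))

def approx_percentile(counts, hist_min, hist_max, rank):
    """Estimate the value at percentile `rank` (0-100) from histogram counts.

    Assumes equal-width bins spanning [hist_min, hist_max] and interpolates
    linearly inside the bin where the cumulative count reaches the rank.
    """
    counts = np.asarray(counts, dtype=np.float64)
    edges = np.linspace(hist_min, hist_max, counts.size + 1)
    cumulative = np.cumsum(counts)
    target = rank / 100.0 * cumulative[-1]
    # First bin whose cumulative count reaches the target.
    i = min(int(np.searchsorted(cumulative, target, side="left")), counts.size - 1)
    below = cumulative[i - 1] if i > 0 else 0.0
    frac = (target - below) / counts[i] if counts[i] > 0 else 0.0
    return edges[i] + frac * (edges[i + 1] - edges[i])

# A small stand-in for one XY plane (the planes in the example file are 4096 x 4096).
rng = np.random.default_rng(0)
plane = rng.normal(size=(256, 256)).astype(np.float32)

finite = plane[np.isfinite(plane)]          # NaNs are excluded from the histogram
num_bins = sqrt_rule_bins(finite.size)      # 256 bins for 256 x 256 pixels
counts, _ = np.histogram(finite, bins=num_bins, range=(finite.min(), finite.max()))

approx = approx_percentile(counts, finite.min(), finite.max(), 99.0)
exact = np.percentile(finite, 99.0)
```

The approximation can only be as precise as the bin width, which is why the stored histogram range (the minimum and maximum it was computed over) matters as much as the counts themselves.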

Example

As a working example, the DEEP_2_I_cube.fits file available on CyberSKA (ZA) is used. It consists of a single HDU, with a 4D dataset (4096x4096 in the XY dimensions, 150 in the Z, and a single Stokes parameter).

If we use our prototype converter on this file with no additional parameters, we will only translate the dataset and a selection of attributes from the header:

DEEP_2_I_cube.h5
-- VERSION (our schema version)
-- 0
   -- COMMENT
   -- HISTORY
   -- DATA (Dataset with shape 1 x 150 x 4096 x 4096 containing FP32 data)

We write both the data and the attributes to a group called 0, which corresponds to the first HDU of the original FITS file. We are considering including other HDUs, if they are present, as separate numbered groups at the top level of the HDF5 file, but the expected role and format of these HDUs need to be investigated further.

We currently keep the following header attributes: BUNIT, DATE-OBS, EQUINOX, INSTR, OBSDEC, OBSERVER, OBSGEO-X, OBSGEO-Y, OBSGEO-Z, OBSRA, RADESYS, TELE, TIMESYS, as well as any attributes beginning with CDELT, CROTA, CRPIX, CRVAL, CTYPE, CUNIT or NAXIS, but this is not a final list. We may decide to preserve all header attributes, or possibly the original header in its entirety, if we decide that lossless translation from FITS to HDF5 and back to FITS is valuable.

We translate the COMMENT and HISTORY attributes to datasets rather than multidimensional attributes.
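
The minimal layout described above could be written with h5py roughly as follows. This is a sketch under stated assumptions, not the prototype converter: the header values, comment strings and version string are placeholders, the cube is tiny, and storing VERSION as a root attribute is a guess, since the listing does not say whether it is an attribute or a dataset. The file is kept in memory so that nothing is written to disk.

```python
import h5py
import numpy as np

# Placeholder stand-ins for values read from the first HDU of a FITS file.
header = {"BUNIT": "placeholder-unit", "CTYPE1": "placeholder-type", "NAXIS": 4}
comments = ["first comment card", "second comment card"]
history = ["first history card"]
data = np.zeros((1, 3, 8, 8), dtype=np.float32)  # tiny W x Z x Y x X cube

# The "core" driver with backing_store=False keeps the whole file in memory.
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    f.attrs["VERSION"] = "placeholder"  # assumption: VERSION is a root attribute

    hdu = f.create_group("0")           # group for the first HDU
    for key, value in header.items():
        hdu.attrs[key] = value          # selected header cards become attributes
    hdu.create_dataset("DATA", data=data)

    # COMMENT and HISTORY are stored as 1D datasets of variable-length strings.
    str_dtype = h5py.string_dtype(encoding="utf-8")
    hdu.create_dataset("COMMENT", data=comments, dtype=str_dtype)
    hdu.create_dataset("HISTORY", data=history, dtype=str_dtype)

    layout = sorted(hdu.keys())
    data_shape = hdu["DATA"].shape
    raw = hdu["COMMENT"][0]
    # Older h5py versions return str here, newer ones return bytes.
    first_comment = raw.decode("utf-8") if isinstance(raw, bytes) else raw
```

Storing the multi-line cards as datasets avoids having to pack a variable number of strings into a single attribute.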

To add statistics, permuted datasets or other features described above, we would pass additional parameters to our converter: for example, one or more sets of axes along which statistics, histograms or percentiles should be calculated. We show an outline of our proposed schema for storing these additional features in the HDF5 file:

DEEP_2_I_cube.h5
-- VERSION (our schema version)
-- 0
   -- COMMENT
   -- HISTORY
   -- DATA (Dataset with shape 1 x 150 x 4096 x 4096 containing FP32 data)

   -- PERCENTILE_RANKS (Dataset of length M, containing the list of percentile ranks, e.g. 99.9, 0.01)

   -- PermutedData (group which collects all permuted datasets)
      -- ZYXW (Dataset with shape 1 x 4096 x 4096 x 150 containing FP32 data. Rotated XYZW -> ZYXW)

   -- Mipmaps (group which collects mipmaps for particular datasets, including permuted and averaged datasets)
      -- DATA (group which collects mipmaps derived from the DATA dataset)
         -- XY_2 (Dataset with shape 1 x 150 x 2048 x 2048, mipmap of each XY plane)
         -- XY_4 (Dataset with shape 1 x 150 x 1024 x 1024, mipmap of each XY plane)
      -- ZYXW (group which collects mipmaps derived from the permuted PermutedData/ZYXW dataset)
         -- ZY_2 (Dataset with shape 1 x 4096 x 2048 x 75, mipmap of each ZY plane)
      -- MEAN_Z (group which collects mipmaps derived from Statistics/Z/MEAN)

   -- Statistics (group which collects groups of statistics averaged along different sets of axes)
      -- XY (Group containing statistics of XY planes: datasets with shape 1 x 150)
         -- MEAN (Mean values for each XY plane)
         -- MAX (Max values for each XY plane)
         -- MIN (Min values for each XY plane)
         -- NAN_COUNT (Count of NaN values for each XY plane)
         -- PERCENTILES (Dataset with shape 1 x 150 x M, containing the values of the percentiles for each XY plane)
         -- HISTOGRAM (Dataset with shape 1 x 150 x N, storing the bin values of the histogram for each XY plane)
         -- HIST_MIN (Per XY plane. If this exists, it determines the minimum of the histogram. Otherwise MIN is used)
         -- HIST_MAX (Per XY plane. If this exists, it determines the maximum of the histogram. Otherwise MAX is used)
      -- Z (Group containing statistics averaged along Z: datasets with shape 1 x 4096 x 4096)
      -- XYZ (Group containing statistics averaged along XYZ: datasets with shape 1 x 1)
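
To make the shapes in the outline concrete, the following NumPy sketch derives a permuted dataset, two mipmaps and a few statistics from a tiny stand-in cube. The helper `mipmap_last_two` and the choice of block averaging for down-sampling are assumptions; the outline itself only fixes the resulting shapes.

```python
import numpy as np

# Tiny stand-in for DATA, laid out as W x Z x Y x X
# (1 x 150 x 4096 x 4096 in the example file).
data = np.arange(1 * 6 * 8 * 8, dtype=np.float32).reshape(1, 6, 8, 8)

# PermutedData/ZYXW: keep W outermost and swap X and Z, so Z varies fastest.
zyxw = np.ascontiguousarray(data.transpose(0, 3, 2, 1))  # W x X x Y x Z

# A Z-profile at pixel (x, y) is a contiguous run in the permuted layout.
x, y = 5, 2
profile_from_data = data[0, :, y, x]
profile_from_permuted = zyxw[0, x, y, :]

def mipmap_last_two(array, factor):
    """Down-sample the last two axes by averaging factor x factor blocks.

    Block averaging is an assumption; both axes must be divisible by factor.
    """
    *lead, h, w = array.shape
    blocks = array.reshape(*lead, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(-3, -1))

xy_2 = mipmap_last_two(data, 2)  # like Mipmaps/DATA/XY_2
zy_2 = mipmap_last_two(zyxw, 2)  # like Mipmaps/ZYXW/ZY_2

# Statistics/XY holds one value per XY plane; Statistics/Z one value per pixel.
mean_xy = np.nanmean(data, axis=(2, 3))         # W x Z
nan_count_xy = np.isnan(data).sum(axis=(2, 3))  # W x Z
mean_z = np.nanmean(data, axis=1)               # W x Y x X
```

With the full-size cube, the same operations give the shapes listed in the outline: 1 x 4096 x 4096 x 150 for ZYXW, 1 x 150 x 2048 x 2048 for XY_2, 1 x 4096 x 2048 x 75 for ZY_2, 1 x 150 for the XY statistics and 1 x 4096 x 4096 for the Z statistics.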

Schema diagram

A file will generally contain only a selection of the above additional features, depending on the application. We can strip features out by selectively copying datasets when offering downloads to clients. For example, permuted datasets and mipmaps are stored purely for performance reasons, and can be removed when we offer a download to clients, to minimise file size.
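
Stripping the performance-only groups for a download could look roughly like the h5py sketch below. The function name `copy_without_performance_groups`, the in-memory files and the tiny datasets are illustrative assumptions; only the group names PermutedData and Mipmaps come from the outline above.

```python
import h5py
import numpy as np

# Groups which the schema stores purely for performance reasons.
PERFORMANCE_ONLY = {"PermutedData", "Mipmaps"}

def copy_without_performance_groups(src, dst):
    """Copy root attributes and all top-level members from src to dst,
    skipping PermutedData and Mipmaps inside each HDU group."""
    for key, value in src.attrs.items():
        dst.attrs[key] = value
    for hdu_name, member in src.items():
        if not isinstance(member, h5py.Group):
            src.copy(member, dst, name=hdu_name)  # top-level dataset: copy as-is
            continue
        out = dst.create_group(hdu_name)
        for key, value in member.attrs.items():
            out.attrs[key] = value
        for name in member:
            if name not in PERFORMANCE_ONLY:
                src.copy(member[name], out, name=name)  # recursive copy

def open_in_memory(name):
    """Open a new HDF5 file which lives only in memory."""
    return h5py.File(name, "w", driver="core", backing_store=False)

with open_in_memory("full.h5") as full, open_in_memory("download.h5") as slim:
    g = full.create_group("0")
    g.create_dataset("DATA", data=np.zeros((1, 2, 4, 4), dtype=np.float32))
    g.create_dataset("PermutedData/ZYXW", data=np.zeros((1, 4, 4, 2), dtype=np.float32))
    g.create_dataset("Mipmaps/DATA/XY_2", data=np.zeros((1, 2, 2, 2), dtype=np.float32))
    g.create_dataset("Statistics/XY/MEAN", data=np.zeros((1, 2), dtype=np.float32))

    copy_without_performance_groups(full, slim)
    kept = sorted(slim["0"].keys())
    stats_kept = "Statistics/XY/MEAN" in slim["0"]
```

Because the original data, header attributes and statistics are all kept, the smaller file remains valid under the schema; a reader simply finds that the optional groups are absent.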
