HDF5 Image Schema
We have considered adopting the HDFITS [1] schema as a starting point, as this closely resembles the FITS standard, and would allow for easy conversion of N-dimensional data sets.
In order to meet IDIA's requirements, a schema needs to support the following information, in the form of additional datasets or metadata:
- Average datasets, constructed by averaging the dataset along a particular axis (normally Z). These generally have a higher signal-to-noise ratio and are useful during data visualisation. Calculating them on the fly is computationally expensive, whereas the results can easily be baked into the file. The dataset name should indicate the axis along which the average is taken.
- Swizzled datasets (i.e. datasets with rotated axes, XYZ -> ZYX) allow for enormous speedups when reading image slices along non-principal axes. The schema should define how optional swizzled datasets are stored in a standardised manner, so that software supporting the schema can check for these datasets when performing I/O-intensive dataset slices, such as reading a Z-profile at a given (X,Y) pixel. The name of the swizzled dataset should indicate the swizzled layout.
- Mip-mapped datasets (this feature is still up for discussion)
- Histograms, defined along a particular image plane (e.g. XY or YZ), are computationally expensive to compute, but relatively small and simple to store. For example, calculating the histogram for a 4096x4096 image slice takes approximately 80 ms. Using the “square root” guideline (the number of bins is the square root of the number of pixels), such a slice needs N=4096 bins, which would take an additional 16 KB of storage space at 4 bytes per bin. Histograms could then be used to calculate approximate percentile values.
- Percentile values are expensive to compute, but reusable and easily stored in metadata.
- Combination of the above (e.g. histograms and percentile values for the average image)
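The derived products listed above can be sketched with NumPy on a toy cube. This is only an illustration under stated assumptions: the array is random data with axes ordered (W, Z, Y, X), the helper `approx_percentile` is a hypothetical name, and linear interpolation of the histogram's cumulative distribution is one possible way to estimate percentiles, not necessarily the method the schema authors intend.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for DATA with axes (W, Z, Y, X); the real cube is 1 x 150 x 4096 x 4096.
data = rng.normal(size=(1, 6, 64, 64)).astype(np.float32)

# Average dataset along Z (axis 1). nanmean ignores blanked (NaN) pixels.
mean_z = np.nanmean(data, axis=1)  # shape (W, Y, X)

# Swizzled copy: reverse the order of the X, Y, Z axes so that Z varies fastest.
data_zyxw = np.ascontiguousarray(data.transpose(0, 3, 2, 1))  # shape (W, X, Y, Z)

# A Z-profile at pixel (x, y) is now a single contiguous run of values.
x, y = 3, 5
z_profile = data_zyxw[0, x, y, :]

# Histogram of one XY plane, with the bin count taken from the square-root guideline.
plane = data[0, 0]
finite = plane[np.isfinite(plane)]
n_bins = int(np.sqrt(finite.size))
counts, edges = np.histogram(finite, bins=n_bins)


def approx_percentile(counts, edges, rank):
    """Estimate the percentile at `rank` (0..100) by interpolating the histogram CDF."""
    cdf = np.concatenate(([0.0], np.cumsum(counts))) / counts.sum()
    return float(np.interp(rank / 100.0, cdf, edges))


p50_approx = approx_percentile(counts, edges, 50.0)
p50_exact = float(np.percentile(finite, 50.0))
```

The approximate value can only be as accurate as the bin width allows, which is the trade-off for not having to sort the full plane again.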
As a working example, the DEEP_2_I_cube.fits file available on CyberSKA (ZA) is used. It consists of a single HDU, with a 4D dataset (4096x4096 in the XY dimensions, 150 in the Z, and a single Stokes parameter). After processing through fits2hdf5, the file DEEP_2_I_cube.h5 is produced, with the file outline as:
DEEP_2_I_cube.h5
-- Primary (Group. Attributes of group correspond to FITS header variables)
   -- COMMENT (Text dataset containing comments from FITS file)
   -- HISTORY (Text dataset containing history from FITS file)
   -- DATA (Image dataset containing 1 x 150 x 4096 x 4096 dataset of FP32 data)
The attributes of the Primary group are translated directly from the FITS header file. For example, values such as CRPIX1, CDELT4, CTYPE4 and so on are stored as attributes. The proposed data structure for additional features described above would be:
DEEP_2_I_cube.h5
-- Primary
   -- COMMENT
   -- HISTORY
   -- PERCENTILE_RANKS (Dataset of length M, containing the list of percentile ranks, e.g. 99.9, 0.01)
   -- DATA (Image dataset containing 1 x 150 x 4096 x 4096 dataset of FP32 data)
   -- SwizzledData (group which collects all swizzled datasets)
      -- DATA_ZYXW (Image dataset containing 1 x 4096 x 4096 x 150 dataset of FP32 data. Swizzled XYZW -> ZYXW)
   -- Mipmaps (group which collects mipmaps for particular datasets, including swizzled and averaged datasets)
      -- DATA (group which collects mipmaps derived from the DATA dataset)
         -- XY_2 (Image dataset containing 1 x 150 x 2048 x 2048 dataset, mipmap of each XY plane)
         -- XY_4 (Image dataset containing 1 x 150 x 1024 x 1024 dataset, mipmap of each XY plane)
      -- DATA_ZYXW (group which collects mipmaps derived from the swizzled DATA_ZYXW dataset)
         -- ZY_2 (Image dataset containing 1 x 4096 x 2048 x 75 dataset, mipmap of each ZY plane)
      -- DATA_MEAN_Z (group which collects mipmaps derived from Statistics/DATA/Z/MEAN)
   -- Statistics (group which collects groups of statistics for particular datasets, including swizzled and averaged datasets)
      -- DATA (group which collects statistics derived from the DATA dataset)
         -- XY (Group containing calculated statistics of XY planes for DATA dataset: datasets with shape 1 x 150)
            -- MEAN (Mean values for each XY plane)
            -- MAX (Max values for each XY plane)
            -- MIN (Min values for each XY plane)
            -- NAN_COUNT (Count of NaN values for each XY plane)
            -- PERCENTILES (Dataset with shape 1 x 150 x M, containing the values of the percentiles for each XY plane)
            -- HISTOGRAM (Dataset with shape 1 x 150 x N, storing the histogram bin counts for each XY plane)
         -- Z (Group containing calculated statistics of DATA averaged along Z: datasets with shape 1 x 4096 x 4096)
         -- XYZ (Group containing calculated statistics of DATA averaged along XYZ: datasets with shape 1 x 1)
      -- DATA_ZYXW
         -- ZY (Group containing calculated statistics of ZY planes for swizzled DATA_ZYXW dataset)
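A minimal h5py sketch of how part of this outline could be written for a toy cube follows. Several things here are assumptions rather than part of the proposal: the additional groups are placed inside the Primary group, mipmaps are produced by 2x2 block averaging, the `CTYPE4` value is a placeholder, and the file is kept in memory so that nothing is written to disk.

```python
import h5py
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1, 4, 8, 8)).astype(np.float32)  # toy cube, axes (W, Z, Y, X)
ranks = np.array([0.1, 1.0, 50.0, 99.0, 99.9])            # M = 5 percentile ranks

shapes = {}


def record_shape(name, obj):
    """Collect the shape of every dataset so the layout can be inspected."""
    if isinstance(obj, h5py.Dataset):
        shapes[name] = obj.shape


# driver="core" with backing_store=False keeps the file purely in memory.
with h5py.File("toy_cube.h5", "w", driver="core", backing_store=False) as f:
    hdu = f.create_group("Primary")
    hdu.attrs["CTYPE4"] = "STOKES"  # placeholder header value
    hdu.create_dataset("DATA", data=data)
    hdu.create_dataset("PERCENTILE_RANKS", data=ranks)

    # h5py creates the intermediate groups named in the path.
    swizzled = np.ascontiguousarray(data.transpose(0, 3, 2, 1))
    hdu.create_dataset("SwizzledData/DATA_ZYXW", data=swizzled)

    # Assumed mipmap method: average each 2x2 block of every XY plane.
    w, z, ny, nx = data.shape
    xy_2 = data.reshape(w, z, ny // 2, 2, nx // 2, 2).mean(axis=(3, 5))
    hdu.create_dataset("Mipmaps/DATA/XY_2", data=xy_2)

    # Per-plane statistics: one value (or one row) per XY plane.
    xy = hdu.create_group("Statistics/DATA/XY")
    xy.create_dataset("MEAN", data=np.nanmean(data, axis=(2, 3)))
    xy.create_dataset("MIN", data=np.nanmin(data, axis=(2, 3)))
    xy.create_dataset("MAX", data=np.nanmax(data, axis=(2, 3)))
    xy.create_dataset("NAN_COUNT", data=np.isnan(data).sum(axis=(2, 3)))
    pct = np.nanpercentile(data, ranks, axis=(2, 3))  # shape (M, W, Z)
    xy.create_dataset("PERCENTILES", data=np.moveaxis(pct, 0, -1))

    # np.histogram spans min..max of each plane by default, so in this sketch
    # the bin edges could be rebuilt from MIN and MAX.
    n_bins = int(np.sqrt(ny * nx))
    hist = np.array([[np.histogram(p[np.isfinite(p)], bins=n_bins)[0] for p in cube]
                     for cube in data])
    xy.create_dataset("HISTOGRAM", data=hist)

    # Statistics of DATA taken along Z.
    hdu.create_dataset("Statistics/DATA/Z/MEAN", data=np.nanmean(data, axis=1))

    f.visititems(record_shape)
```

Printing `shapes` after the block shows each dataset path with its shape, which is a quick way to compare a file against the outline.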
A file will generally contain only a selection of the above additional features, depending on the application. Features can be stripped by selectively copying datasets when offering downloads to clients. For example, swizzled datasets and mipmaps are stored purely for performance reasons, and can be omitted from a download to minimise file size.
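Selective copying could look like the following h5py sketch. The function name `copy_without`, the `BUNIT` value and the in-memory toy files are hypothetical, and the sketch assumes that every top-level object is an HDU group with the optional feature groups stored directly inside it.

```python
import h5py
import numpy as np

# Groups kept purely for read performance, safe to drop from a download.
PERFORMANCE_ONLY = ("SwizzledData", "Mipmaps")


def copy_without(src, dst, skip=PERFORMANCE_ONLY):
    """Copy every HDU group from src to dst, omitting the child groups named in skip."""
    for hdu_name, hdu in src.items():
        out = dst.create_group(hdu_name)
        # Header values live in the group attributes, so copy them explicitly.
        for key, value in hdu.attrs.items():
            out.attrs[key] = value
        for child in hdu:
            if child not in skip:
                # Group.copy copies the object and everything below it.
                src.copy(hdu[child], out, name=child)


def open_in_memory(name):
    """Open a scratch HDF5 file that never touches the disk."""
    return h5py.File(name, "w", driver="core", backing_store=False)


with open_in_memory("full.h5") as full, open_in_memory("download.h5") as slim:
    hdu = full.create_group("Primary")
    hdu.attrs["BUNIT"] = "Jy/beam"  # placeholder header value
    cube = np.zeros((1, 4, 8, 8), dtype=np.float32)
    hdu.create_dataset("DATA", data=cube)
    hdu.create_dataset("SwizzledData/DATA_ZYXW", data=cube.transpose(0, 3, 2, 1))
    hdu.create_dataset("Statistics/DATA/XY/MEAN", data=cube.mean(axis=(2, 3)))

    copy_without(full, slim)
    kept = sorted(slim["Primary"].keys())
    kept_attr = slim["Primary"].attrs["BUNIT"]
    kept_mean_shape = slim["Primary/Statistics/DATA/XY/MEAN"].shape
```

Here `kept` lists the children that survive the copy, so the statistics stay available to the client while the swizzled data is left behind.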