Master file structure for each granule type could reduce code significantly and add flexibility? #58
Looking at this more, I think the undertaking is more than I had anticipated. The idea is that a lot of the code looks like this:

```julia
z = open_dataset(group, "land_ice_segments/h_li")[start:step:stop]::Vector{Float32}
sigma_geo_h = open_dataset(group, "land_ice_segments/sigma_geo_h")[start:step:stop]::Vector{Float32}
h_li_sigma = open_dataset(group, "land_ice_segments/h_li_sigma")[start:step:stop]::Vector{Float32}
t = open_dataset(group, "land_ice_segments/delta_time")[start:step:stop]::Vector{Float64}
q = open_dataset(group, "land_ice_segments/atl06_quality_summary")[start:step:stop]::Vector{Int8}
dem = open_dataset(group, "land_ice_segments/dem/dem_h")[start:step:stop]::Vector{Float32}
spot_number = read_attribute(group, "atlas_spot_number")::String
atlas_beam_type = read_attribute(group, "atlas_beam_type")::String
```

and could be replaced with a more composable form along the lines of:

```julia
vars = [
    ("land_ice_segments/h_li", Vector{Float32}, :variable)
    ("land_ice_segments/h_li_sigma", Vector{Float32}, :variable)
    ...
    ("atlas_beam_type", String, :attribute)
]

# this part would take some fancy metaprogramming
for var in vars
    if var[3] == :variable
        open_dataset(group, var[1])[start:step:stop]::var[2]
    else
        read_attribute(group, var[1])::var[2]
    end
end
```

---
I guess what we actually need (at least I do) is a higher-level h5 function that can accept a filename or fid and a list of variables and attributes to extract, and it would return a structure... I guess this would be an h5 subsetter, with the ability to subset coordinate ranges. Does something like this exist? I wrote a bad one for MATLAB that's in the Climate Data Toolbox: https://www.mathworks.com/matlabcentral/fileexchange/70338-climate-data-toolbox-for-matlab
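Something along those lines could be sketched with plain HDF5.jl; everything in this sketch (the `h5_extract` name, the argument shapes, the NamedTuple return) is hypothetical and untested:

```julia
using HDF5

# Hypothetical subsetter: read a list of dataset paths (optionally sliced by
# `range`) plus a list of (object path, attribute name) pairs, and return
# everything in one NamedTuple keyed by the last path component.
function h5_extract(fn::AbstractString, datasets::Vector{String},
                    attributes::Vector{Tuple{String,String}}; range=(:))
    h5open(fn, "r") do fid
        data = [Symbol(basename(p)) => fid[p][range] for p in datasets]
        atts = [Symbol(name) => read_attribute(fid[path], name) for (path, name) in attributes]
        (; data..., atts...)
    end
end
```

---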
Yeah, this is something we need if we're to maintain this package long-term and support more products/versions. I love the idea of pre-generating most of this data; that helps a lot with validation. But we also need something to implement

---
Are you thinking of this for

---
I just meant that we need to take care to make sure all output has the same length/common dimension. `Fill` is ideal for attributes indeed.
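For reference, a minimal sketch of that idea with FillArrays.jl; the column names and lengths here are made up:

```julia
using DataFrames, FillArrays

n = 10_000                           # length of the shared dimension
df = DataFrame(h_li = rand(Float32, n))
# A scalar attribute broadcast to the common length without allocating
# n copies; Fill stores one value plus a length.
df.spot_number = Fill("3", n)
```

---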
@evetion should we use higher-level HDF5.jl functionality and read directly to a table? E.g.:

```julia
using DataFrames
using HDF5

vars = ["gt1l/land_segments/longitude", ...]
var_names = ["lat", ...]

h5 = HDF5.h5open(ATL08_fn)
df = DataFrame()
for i = eachindex(vars)
    if isa(h5[vars[i]], HDF5.Dataset)
        df[!, "$(var_names[i])"] = h5[vars[i]][]
    else
        # this would be replaced with a FillArray
        df[!, "$(var_names[i])"] .= h5[vars[i]][]
    end
end
```

If this seems like an intelligent approach then I would add support for:

The remaining items would happen in post-processing. This would essentially be an `h5_to_df` function... I might even move the renaming of variables to a post-processing step.

---
Well, the DataFrame is a pain only because of the multiple tracks that need/can be concatenated. In this case, the selection of vars (I would handle attributes as a separate argument altogether) is just hard for the user API alone. How do you give a name, location, and postprocess function for multiple vars easily? A Var object seems most logical?

---
Yes, I think a Var and Att object is the way to go. For multiple tracks we could just vcat DataFrames. In my workflows I'm finding that moving to a single value per table cell is the most efficient and intuitive way to work with the data. Using FillArrays means we can do this with minimal overhead. If we went down this road it would allow you to concat Arrow files on disk without needing to read the data into memory.
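A quick sketch of that per-track concatenation, with hypothetical column names; the `Fill` track column stays cheap until it is materialized:

```julia
using DataFrames, FillArrays

tracks = ["gt1l", "gt2l", "gt3l"]
dfs = map(tracks) do t
    n = 100                                  # per-track segment count (made up)
    DataFrame(h_li = rand(Float32, n), track = Fill(t, n))
end
df = reduce(vcat, dfs)                       # one row per segment, all tracks
```

---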
Maybe an object like this?

```julia
"""
    GetDataset()

A dataset request object.
"""
@kwdef struct GetDataset
    dataset_path::String                                          # e.g. "gt1l/land_segments/terrain/h_te_mean"
    new_name::Union{String,Nothing} = nothing                     # e.g. "height"
    apply_function::Union{Function,Nothing} = nothing             # e.g. x -> x .> 0
    dimension_ranges::Union{Vector{NamedTuple},Nothing} = nothing # e.g. [(name="latitude", min=50, max=60)]
    description::Union{String,Nothing} = nothing                  # e.g. "Standard land-ice segment height"
    unit::Union{String,Nothing} = nothing                         # e.g. "m above the WGS 84 ellipsoid"
end
```

One would then pass a `Vector{GetDataset}` to an `h5_to_table` and it would return a table/dataframe.
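A hedged sketch of what that `h5_to_table` might look like, assuming the `GetDataset` struct above and 1-D datasets; `dimension_ranges` handling is omitted:

```julia
using HDF5, DataFrames

function h5_to_table(fn::AbstractString, requests::Vector{GetDataset})
    h5open(fn, "r") do fid
        df = DataFrame()
        for req in requests
            data = fid[req.dataset_path][:]                     # read the full dataset
            isnothing(req.apply_function) || (data = req.apply_function(data))
            # fall back to the raw path when no new name is given
            df[!, something(req.new_name, req.dataset_path)] = data
        end
        df
    end
end
```

---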
I'm not sure what

The other design issue I see is that for some products (ICESat, GEDI) we do know beforehand what all the names in an HDF5 file are, so it's easy to check correct keys, dimensions, and types, and to do something about repeated groups like tracks. If we go purely generic, that becomes much harder. Although we could also parse an HDF5 file and generate such records from it (like generating a struct based on some JSON file, which you can do with JSON3). In terms of an interface, I wouldn't even go with h5_to_table, but index directly on a granule, or on a new HDF5Table wrapper around File? Anyway, feel free to implement a first prototype that could work for ICESat-2 or GEDI, and let's generalize from there?
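Parsing a file into such records could look roughly like this; a sketch only, and the `walk` helper is made up, not SpaceLiDAR API:

```julia
using HDF5

# Recursively collect (path, eltype, size) records for every dataset in a
# file, which could then be saved in the package per product/version.
function walk(g::Union{HDF5.File,HDF5.Group}, prefix::String="", out=NamedTuple[])
    for name in keys(g)
        obj = g[name]
        path = prefix == "" ? name : prefix * "/" * name
        if obj isa HDF5.Group
            walk(obj, path, out)
        elseif obj isa HDF5.Dataset
            push!(out, (path=path, eltype=eltype(obj), size=size(obj)))
        end
    end
    out
end
```

---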
I think a HDF5Table wrapper around File is the way to go... it would greatly reduce the data wrangling needed to work with unique h5 structures. HDF5Table would use FillArrays to populate lower-dimensional data. I guess one restriction is that requested datasets must share at least one dimension and are assumed to be unchanging across others.

---
Any operations on the data or name changes would be handled outside of HDF5Table.
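One possible shape for such a wrapper; hypothetical and simplified to 1-D datasets:

```julia
using HDF5, FillArrays

struct HDF5Table
    file::HDF5.File
    n::Int                              # length of the shared dimension
end

# Columns at the common length come back as-is; anything shorter is
# assumed constant along the shared dimension and expanded lazily with Fill.
function Base.getindex(t::HDF5Table, path::String)
    data = t.file[path][:]
    length(data) == t.n ? data : Fill(only(data), t.n)
end
```

---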
Here's my first attempt at H5ToTable

---
Sorry for taking so long to come up with an answer; it's looking great. I came up with something similar in a dead branch on SpaceLiDAR (not as polished), so it seems a good direction to go in. However, I got stuck on integrating it as a predefined structure for each product, needing to include the size/datatype and its shared dimensions(!), as well as a function that would be applied on loading (needed for the timestamp at the very least). I might've made it too complex by making it specific to ICESat-2 and GEDI, not sure. This is functionality I would love to see in SpaceLiDAR before we submit it to JOSS.

---
Both #55, #57 and most others consist of a lot of find-and-replace because everything is hardcoded. If we were to autogenerate maps of the h5 data tree with variable types for each granule type and save the output in the package, then a lot of the repetitive code could be replaced with loops, and we could autogenerate new data trees each time a new version of the package was created. It may also allow us to move to a single `points` function for all granule types. This would also allow the user to pass a list of variables to the `points` function if they wanted more/less than the defaults, and would make it easier to accommodate new granule types.
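To make the idea concrete, a sketch of what an autogenerated map and a generic `points` loop could look like; the constant name and the function signature here are invented for illustration:

```julia
using HDF5

# Hypothetical autogenerated map for one granule type (ATL06), saved in the
# package and regenerated when a new product version comes out.
const ATL06_VARS = Dict(
    "land_ice_segments/h_li"                  => Float32,
    "land_ice_segments/delta_time"            => Float64,
    "land_ice_segments/atl06_quality_summary" => Int8,
)

# A single generic points function: loop over the map instead of
# hardcoding one read per variable and granule type.
function points(group, varmap::Dict{String,DataType})
    cols = [Symbol(basename(path)) => group[path][:]::Vector{T} for (path, T) in varmap]
    (; cols...)
end
```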