Master file structure for each granule type could reduce code significantly and add flexibility? #58
Looking at this more, I think the undertaking is more than I had anticipated. The idea is that a lot of the code looks like this:

```julia
z = open_dataset(group, "land_ice_segments/h_li")[start:step:stop]::Vector{Float32}
sigma_geo_h = open_dataset(group, "land_ice_segments/sigma_geo_h")[start:step:stop]::Vector{Float32}
h_li_sigma = open_dataset(group, "land_ice_segments/h_li_sigma")[start:step:stop]::Vector{Float32}
t = open_dataset(group, "land_ice_segments/delta_time")[start:step:stop]::Vector{Float64}
q = open_dataset(group, "land_ice_segments/atl06_quality_summary")[start:step:stop]::Vector{Int8}
dem = open_dataset(group, "land_ice_segments/dem/dem_h")[start:step:stop]::Vector{Float32}
spot_number = read_attribute(group, "atlas_spot_number")::String
atlas_beam_type = read_attribute(group, "atlas_beam_type")::String
```

and could be replaced with a more composable form along the lines of:

```julia
vars = [
    ("land_ice_segments/h_li", Vector{Float32}, :variable)
    ("land_ice_segments/h_li_sigma", Vector{Float32}, :variable)
    ...
    ("atlas_beam_type", String, :attribute)
]

# this part would take some fancy metaprogramming
for var in vars
    if var[3] == :variable
        open_dataset(group, var[1])[start:step:stop]::var[2]
    else
        read_attribute(group, var[1])::var[2]
    end
end
```

---
I guess what we actually need (at least I do) is a higher-level h5 function that can accept a filename or fid and a list of variables and attributes to extract, and it would return a structure... I guess this would be an h5 subsetter, with the ability to subset coordinate ranges. Does something like this exist? I wrote a bad one for MATLAB that's in the Climate Data Toolbox: https://www.mathworks.com/matlabcentral/fileexchange/70338-climate-data-toolbox-for-matlab
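Something along those lines could be sketched with plain HDF5.jl; everything in this sketch (the `h5_extract` name, the argument shapes, the NamedTuple return) is hypothetical and untested:

```julia
using HDF5

# Hypothetical subsetter: read a list of dataset paths (optionally sliced by
# `range`) plus a list of (object path, attribute name) pairs, and return
# everything in one NamedTuple keyed by the last path component.
function h5_extract(fn::AbstractString, datasets::Vector{String},
                    attributes::Vector{Tuple{String,String}}; range=(:))
    h5open(fn, "r") do fid
        data = [Symbol(basename(p)) => fid[p][range] for p in datasets]
        atts = [Symbol(name) => read_attribute(fid[path], name) for (path, name) in attributes]
        (; data..., atts...)
    end
end
```

---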
Yeah, this is something we need if we're to maintain this package long-term and support more products/versions. I love the idea of pre-generating most of this data; that helps a lot with validation. But we also need something to implement

---
Are you thinking of this for

---
I just meant that we need to take care to make sure all output has the same length/common dimension. `Fill` is ideal for attributes indeed.
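For reference, a minimal sketch of that idea with FillArrays.jl; the column names and lengths here are made up:

```julia
using DataFrames, FillArrays

n = 10_000                           # length of the shared dimension
df = DataFrame(h_li = rand(Float32, n))
# A scalar attribute broadcast to the common length without allocating
# n copies; Fill stores one value plus a length.
df.spot_number = Fill("3", n)
```

---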
@evetion should we use higher-level HDF5.jl functionality and read directly to a table? E.g.:

```julia
using DataFrames
using HDF5

vars = ["gt1l/land_segments/longitude", ...]
var_names = ["lat", ...]

h5 = HDF5.h5open(ATL08_fn)
df = DataFrame()
for i = eachindex(vars)
    if isa(h5[vars[i]], HDF5.Dataset)
        df[!, "$(var_names[i])"] = h5[vars[i]][]
    else
        # this would be replaced with a FillArray
        df[!, "$(var_names[i])"] .= h5[vars[i]][]
    end
end
```

If this seems like an intelligent approach then I would add support for:

The remaining items would happen in post-processing. This would essentially be an `h5_to_df` function... I might even move the renaming of variables to a post-processing step.

---
Well, the DataFrame is a pain only because of the multiple tracks that need/can be concatenated. In this case, the selection of vars (I would handle attributes as a separate argument altogether) is just hard for the user API alone. How do you give a name, location, and postprocess function for multiple vars easily? A Var object seems most logical?

---
Yes, I think a Var and Att object is the way to go. For multiple tracks we could just vcat DataFrames. In my workflows I'm finding that moving to a single value per table cell is the most efficient and intuitive way to work with the data. Using FillArrays means we can do this with minimal overhead. If we went down this road it would allow you to concat Arrow files on disk without needing to read the data into memory.
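A quick sketch of that per-track concatenation, with hypothetical column names; the `Fill` track column stays cheap until it is materialized:

```julia
using DataFrames, FillArrays

tracks = ["gt1l", "gt2l", "gt3l"]
dfs = map(tracks) do t
    n = 100                                  # per-track segment count (made up)
    DataFrame(h_li = rand(Float32, n), track = Fill(t, n))
end
df = reduce(vcat, dfs)                       # one row per segment, all tracks
```

---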
Maybe an object like this?

```julia
"""
    GetDataset()

A dataset request object.
"""
@kwdef struct GetDataset
    dataset_path::String                                          # e.g. "gt1l/land_segments/terrain/h_te_mean"
    new_name::Union{String,Nothing} = nothing                     # e.g. "height"
    apply_function::Union{Function,Nothing} = nothing             # e.g. x -> x .> 0
    dimension_ranges::Union{Vector{NamedTuple},Nothing} = nothing # e.g. [(name="latitude", min=50, max=60)]
    description::Union{String,Nothing} = nothing                  # e.g. "Standard land-ice segment height"
    unit::Union{String,Nothing} = nothing                         # e.g. "m above the WGS 84 ellipsoid"
end
```

One would then pass a `Vector{GetDataset}` to an `h5_to_table` and it would return a table/dataframe.
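A hedged sketch of what that `h5_to_table` might look like, assuming the `GetDataset` struct above and 1-D datasets; `dimension_ranges` handling is omitted:

```julia
using HDF5, DataFrames

function h5_to_table(fn::AbstractString, requests::Vector{GetDataset})
    h5open(fn, "r") do fid
        df = DataFrame()
        for req in requests
            data = fid[req.dataset_path][:]                     # read the full dataset
            isnothing(req.apply_function) || (data = req.apply_function(data))
            # fall back to the raw path when no new name is given
            df[!, something(req.new_name, req.dataset_path)] = data
        end
        df
    end
end
```

---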
I'm not sure what

The other design issue I see is that for some products (ICESat, GEDI) we do know beforehand what all the names in an HDF5 file are, so it's easy to check correct keys, dimensions, and types, and to do something about repeated groups like tracks. If we go purely generic, that becomes much harder. Although we could also parse an HDF5 file and generate such records from it (like generating a struct based on some JSON file, which you can do with JSON3). In terms of an interface, I wouldn't even go with h5_to_table, but index directly on a granule, or on a new HDF5Table wrapper around File? Anyway, feel free to implement a first prototype that could work for ICESat-2 or GEDI, and let's generalize from there?
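Parsing a file into such records could look roughly like this; a sketch only, and the `walk` helper is made up, not SpaceLiDAR API:

```julia
using HDF5

# Recursively collect (path, eltype, size) records for every dataset in a
# file, which could then be saved in the package per product/version.
function walk(g::Union{HDF5.File,HDF5.Group}, prefix::String="", out=NamedTuple[])
    for name in keys(g)
        obj = g[name]
        path = prefix == "" ? name : prefix * "/" * name
        if obj isa HDF5.Group
            walk(obj, path, out)
        elseif obj isa HDF5.Dataset
            push!(out, (path=path, eltype=eltype(obj), size=size(obj)))
        end
    end
    out
end
```

---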
I think a HDF5Table wrapper around File is the way to go... it would greatly reduce the data wrangling needed to work with unique h5 structures. HDF5Table would use FillArrays to populate lower-dimensional data. I guess one restriction is that requested datasets must share at least one dimension and are assumed to be unchanging across others.

---
Any operations on the data or name changes would be handled outside of HDF5Table.
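One possible shape for such a wrapper; hypothetical and simplified to 1-D datasets:

```julia
using HDF5, FillArrays

struct HDF5Table
    file::HDF5.File
    n::Int                              # length of the shared dimension
end

# Columns at the common length come back as-is; anything shorter is
# assumed constant along the shared dimension and expanded lazily with Fill.
function Base.getindex(t::HDF5Table, path::String)
    data = t.file[path][:]
    length(data) == t.n ? data : Fill(only(data), t.n)
end
```

---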
Here's my first attempt at H5ToTable

---
Sorry for taking so long to come up with an answer; it's looking great. I came up with something similar in a dead branch on SpaceLiDAR (not as polished), so it seems a good direction to go in. However, I got stuck on integrating it as a predefined structure for each product, needing to include the size/datatype and its shared dimensions(!), as well as a function that would be applied on loading (needed for the timestamp at the very least). I might've made it too complex by making it specific to ICESat-2 and GEDI, not sure. This is functionality I would love to see in SpaceLiDAR before we submit it to JOSS.

---
Both #55, #57 and most others consist of a lot of find-and-replace because everything is hardcoded. If we were to autogenerate maps of the h5 data tree with variable types for each granule type and save the output in the package, then a lot of the repetitive code could be replaced with loops, and we could autogenerate new data trees each time a new version of the package was created. It may also allow us to move to a single `points` function for all granule types. This would also allow the user to pass a list of variables to the `points` function if they wanted more/less than the defaults, and would make it easier to accommodate new granule types.
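To make the idea concrete, a sketch of what an autogenerated map and a generic `points` loop could look like; the constant name and the function signature here are invented for illustration:

```julia
using HDF5

# Hypothetical autogenerated map for one granule type (ATL06), saved in the
# package and regenerated when a new product version comes out.
const ATL06_VARS = Dict(
    "land_ice_segments/h_li"                  => Float32,
    "land_ice_segments/delta_time"            => Float64,
    "land_ice_segments/atl06_quality_summary" => Int8,
)

# A single generic points function: loop over the map instead of
# hardcoding one read per variable and granule type.
function points(group, varmap::Dict{String,DataType})
    cols = [Symbol(basename(path)) => group[path][:]::Vector{T} for (path, T) in varmap]
    (; cols...)
end
```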