Nested Dictionaries #1187

Open
DanDeepPhase opened this issue Jan 26, 2025 · 3 comments

@DanDeepPhase

My mental model of an HDF5 file is a folder structure, where related data is grouped together and buried in a nested, hierarchical format. Currently the read functions deliver a flat dictionary, and the hierarchy is held in the key strings rather than in the structure. The alternative that matches my mental model is to read an HDF5 file as a nested dictionary, where the value of a key is the data if the key refers to a dataset, and a dictionary if the key refers to a group.

So for an HDF5 like:

📂 h5file
├─ 🔢 B
└─ 📂 groupA
       ├─ 🔢 A1
       └─ 🔢 A2

The current read generates:

Dict(
   "B" => Bval,
   "groupA/A1" => A1val,
   "groupA/A2" => A2val,
)

And I'd prefer an option to read_nested as:

Dict(
   "B" => Bval,
   "groupA" => Dict(
          "A1" => A1val,
          "A2" => A2val,
         ),
)

I've written this code locally (plus a corresponding write_nested). Would it be reasonable to include it here?
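To make the two layouts concrete, here is a minimal sketch in plain Julia of converting between them. The helper names `nest` and `flatten` are hypothetical and not part of HDF5.jl; the sketch assumes keys are "/"-separated paths and that no key is both a dataset and a group prefix.

```julia
# Hypothetical helpers (not part of HDF5.jl): convert between the flat
# "group/dataset" key layout and a nested Dict layout.

# Build a nested Dict from a flat Dict whose keys are "/"-separated paths.
function nest(flat::AbstractDict{String})
    nested = Dict{String,Any}()
    for (path, value) in flat
        parts = split(path, '/'; keepempty=false)
        node = nested
        # Walk down, creating one Dict per intermediate group name.
        for name in parts[1:end-1]
            node = get!(() -> Dict{String,Any}(), node, String(name))
        end
        node[String(parts[end])] = value
    end
    return nested
end

# Inverse: flatten a nested Dict back into "/"-separated keys.
function flatten(nested::AbstractDict, prefix::String="")
    flat = Dict{String,Any}()
    for (name, value) in nested
        path = isempty(prefix) ? name : prefix * "/" * name
        if value isa AbstractDict
            merge!(flat, flatten(value, path))
        else
            flat[path] = value
        end
    end
    return flat
end

flat = Dict{String,Any}("B" => 1, "groupA/A1" => [1.0 2.0], "groupA/A2" => "x")
nested = nest(flat)
```

With these, `nested["groupA"]["A1"]` holds the same value as `flat["groupA/A1"]`, and `flatten(nested)` recovers the flat layout.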

@mkitti
Member

mkitti commented Jan 28, 2025

I have several questions:

  1. What is the function that you are using to "read" the file which creates the Dict{String, Any}? It appears you might be using FileIO.load.
  2. Is there a reason we need to create new functions? An alternative would be to add keyword arguments to FileIO.load.
  3. Would the result also be a Dict{String, Any}? Would you have options for an OrderedDict from OrderedCollections.jl or a Dictionary from Dictionaries.jl?

@mkitti
Member

mkitti commented Jan 28, 2025

One of my reservations about this is that it seems to encourage the user to read the entire file at once, and I would tend to encourage the user to use lazy interfaces for this purpose.

The nested dictionary behavior that you describe seems to closely resemble the current dictionary-like interface.

julia> h5f = h5open("test.h5")
🗂️ HDF5.File: (read-only) test.h5
├─ 🔢 B
└─ 📂 groupA
   ├─ 🔢 A1
   └─ 🔢 A2

julia> h5f["groupA"]["A1"][]
2×2 Matrix{Float64}:
 0.43893   0.583493
 0.546226  0.652598

The only difference here is the final [] to actually access the data. Perhaps with your eager Dict, the way to access the contents of A1 would be just h5f["groupA"]["A1"]?
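As a rough illustration of the lazy interface described here, the sketch below writes a small file with the same layout and then reads from it. It assumes HDF5.jl is installed; the temporary file path and the dataset values are made up for the example.

```julia
using HDF5

path = tempname() * ".h5"   # throwaway file, for illustration only

# Write a small file with the same layout as the example above.
h5open(path, "w") do f
    f["B"] = [1, 2, 3]
    g = create_group(f, "groupA")
    g["A1"] = [1.0 2.0; 3.0 4.0]
    g["A2"] = [5.0, 6.0]
end

h5open(path, "r") do f
    ds = f["groupA"]["A1"]        # lazy handle: no data has been read yet
    @assert ds isa HDF5.Dataset
    whole = ds[]                  # the trailing [] reads the whole dataset
    @assert whole == read(f["groupA"], "A1")
    part = ds[1:1, :]             # range indexing reads only that slice
    @assert part == [1.0 2.0]
end
```

The handle stays cheap until it is indexed, which is why the lazy interface scales to files that do not fit in memory.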

How would you then also deal with attributes?

@DanDeepPhase
Author

DanDeepPhase commented Jan 28, 2025

Thanks for the feedback! Here are my replies:

  1. What is the function that you are using to "read" the file which creates the Dict{String, Any}? It appears you might be using FileIO.load.

I wrote a custom read function which uses h5open. The reader code is as follows (the writing code is similar). It's not too different from the code in FileIOExt.jl, just with a different output:

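using HDF5, OrderedCollections  # OrderedDict comes from OrderedCollections.jl

# Open the file, read everything eagerly, and close it again.
function load_nested(filename)
    h5open(filename) do fid
        read_group(fid)
    end
end

# Recursively read a file or group into a nested OrderedDict.
function read_group(parent)
    d = OrderedDict{String,Any}()
    for key in keys(parent)
        d[key] = read_dataset(parent[key])
    end
    d
end

# Groups recurse; datasets are read into memory.
read_dataset(val::HDF5.Group) = read_group(val)
read_dataset(val::HDF5.Dataset) = read(val)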
  2. Is there a reason we need to create new functions? An alternative would be to add keyword arguments to FileIO.load.

I would prefer not to create a new function. My new functions are just my local solution to avoid type piracy. So load_nested(file) could be load(file; nested=true) or similar. "hierarchical = true" is something I'd probably mis-spell...

  3. Would the result also be a Dict{String, Any}? Would you have options for an OrderedDict from OrderedCollections.jl or a Dictionary from Dictionaries.jl?

I actually wrote it for OrderedDict, but it could mirror the current load function's sink / typeflag: julia> load("track_order.h5"; dict=OrderedDict())

One of my reservations about this is that it seems to encourage the user to read the entire file at once, and I would tend to encourage the user to use lazy interfaces for this purpose.

That's probably best practice, especially if data volumes are large. In my case, I want to mutate the structure without risking modifying the file contents. In the example you provided, the file is opened read-only. In the past (in other languages and data types) I got into the habit of not leaving files "open".

The nested dictionary behavior that you describe seems to closely resemble the current dictionary-like interface.
The only difference here is the final [] to actually access the data. Perhaps with your eager Dict, the way to access the contents of A1 would be just h5f["groupA"]["A1"]?

That's interesting. So h5f["groupA"]["A1"][] is equivalent to read(h5f["groupA"], "A1") (but only for datasets, not groups)... I didn't find that method in the documentation. I've gotten used to something similar working with observables in Makie, but it's still a little strange to me. I don't know what it means, I just know it's how I access that type. Maybe it's a concept from pointers?
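For what it's worth, the empty-bracket syntax also shows up in Base Julia, where `x[]` is just indexing with zero indices. The short sketch below shows it on a `Ref` and on a zero-dimensional array; that datasets and observables follow the same convention by analogy is an assumption on my part, not something stated in this thread.

```julia
# x[] is indexing with zero indices: it fetches "the one value" held by a container.
r = Ref(42)        # a mutable box holding a single value
@assert r[] == 42  # dereference, loosely like *p for a pointer in C
r[] = 7            # assign through the reference
@assert r[] == 7

z = fill(3.0)      # a zero-dimensional array holding one element
@assert z[] == 3.0
```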

So the intent of this would be a different structure for the FileIO.load "high level method".

# current:
h5f = load(file)                 # creates a flat dict
A1 = h5f["groupA/A1"]

# new 
h5f = load(file; nested = true)  # creates a nested dict
A1 = h5f["groupA"]["A1"]

How would you then also deal with attributes?

I'm not sure. Does the current load function handle attributes?
