Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce allocations and a bit of memory. #57

Merged
merged 3 commits into from
Aug 14, 2023
Merged

Reduce allocations and a bit of memory. #57

merged 3 commits into from
Aug 14, 2023

Conversation

evetion
Copy link
Owner

@evetion evetion commented Aug 12, 2023

Apart from reducing the overall allocations and reducing some memory usage from an overuse of attrs and keys, all which (sadly) has a negligible impact on performance, I've also added a way of getting a DataFrame of a classified ATL03 pointset without extra allocations (as you would currently get from classify, having to go with reduce(vcat, DataFrame.())

struct ClassifyATL03
    atl03::ICESat2_Granule{:ATL03}
    atl08::ICESat2_Granule{:ATL08}
end
classify(::ClassifyATL03) == classify(::ICESat2_Granule{:ATL03}, ::ICESat2_Granule{:ATL08})
DataFrame(::ClassifyATL03)  # like DataFrame(::ICESat2_Granule)

I'm not yet sure about the name ClassifyATL03, but this lazy way of combining granules/operations seems useful. In the end, like icepyx or DataFrames Groupby, one could have Query structs that detail some (lazy) operations.

Also added test for all DataFrame(granule) calls for each data product.

src/GEDI/L2A.jl Show resolved Hide resolved
src/GEDI/L2A.jl Show resolved Hide resolved
"""
classify(gg::ClassifyATL03; kwargs...) = classify(gg.atl03, gg.atl08; kwargs...)


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me see if I have this right... the idea hear is to lazily map between ATL03 and ATL08... by creating a type ClassifyATL03 you can dispatch on classify. Why is a new type needed? Why could you not just have a DF row of ICESat2_Granule{:ATL03} and a corresponding row of ICESat2_Granule{:ATL08} that could be dispatched on classify(gg.atl03, gg.atl08) when the classes are desired? Or could you not simply pass classify(gg.atl03, nothing) and the ATL08 file would be read in.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, if one wants a classified ATL03, one does classify(atl03, atl08=nothing), and you get the vector of namedtuples (just like points(atl08). To go from a DataFrame from there is costly.

Example of ATL08:

julia> @time points(g);
  0.008027 seconds (1.80 k allocations: 430.406 KiB)
julia> @time df = reduce(vcat, DataFrame.(points(g)));
  0.008888 seconds (3.17 k allocations: 952.445 KiB)

That's why I implemented the Tables.jl interface on top of a granule, so DataFrame(granule) calls points(granule) itself, but is vcatted with much less overhead.

julia> @time df2 = DataFrame(g);
  0.010479 seconds (2.24 k allocations: 466.641 KiB)
julia> isequal(df, df2) == true

To make this work one needs to define Tables.partitions on a struct. So we can either change the output of all points and classify methods to return a new generic SpaceLiDAR.Table thing, which implements the Table interface (breaking). And/Or we go the route of Query like objects (which this ClassifyATL03 is, and they could also define which columns one wants in the future). Something like

q = Query(::Granule; columns=("longitude",); attributes=("atlas_beam_type",))
df = DataFrame(q)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll have a better idea of this than I but it seems like we might want to bite the bullet and move toward Tables.partitions on a struct. It seems like this would be more flexible in the end and would place less burden on the users/developers to implement a number of Query methods. I personally find the Query clever buy difficult to intuit and I suspect users will as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said... there's an old saying that if "it ain't broken don't fix it" so I'm also supportive of non-breaking changes as the package is working really great now.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if I can hack a new type that subtypes a vector of namedtuples, so all documented code keeps working, and gives us the benefit of defining the Tables interface on it.

Will drop de Classify from this PR so we can merge the other changes.

test/runtests.jl Outdated
SL.materialize!(df)
@test df.classification isa Vector{String}

C = SL.ClassifyATL03(g, g8)
dfc = DataFrame(C)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dfc fails to save as an arrow file.

using Arrow

Arrow.write("test.arrow", dfc) fails:

 MethodError: no method matching arrowtype(::Arrow.FlatBuffers.Builder, ::Type{Ptr{Nothing}})

Closest candidates are:
  arrowtype(::Any, ::Union{Arrow.DenseUnion{S, Arrow.UnionT{T, typeIds, U}}, Arrow.SparseUnion{S, Arrow.UnionT{T, typeIds, U}}}) where {S, T, typeIds, U}
   @ Arrow ~/.julia/packages/Arrow/R2Rvz/src/eltypes.jl:495
  arrowtype(::Any, ::Arrow.Struct{T, S}) where {T, S}
   @ Arrow ~/.julia/packages/Arrow/R2Rvz/src/eltypes.jl:479
  arrowtype(::Any, ::Arrow.FixedSizeList{T, A}) where {T, A}
   @ Arrow ~/.julia/packages/Arrow/R2Rvz/src/eltypes.jl:431
  ...

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we don't test Arrow currently. But since C implements the table interface, you can just do:

Arrow.write("test.arrow", C)
Arrow.Table("test.arrow")

Which works, not sure why the DataFrame variant fails.

test/runtests.jl Outdated
SL.materialize!(df)
@test df.classification isa Vector{String}

C = SL.ClassifyATL03(g, g8)
dfc = DataFrame(C)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JLD2 works like a charm and keeps lazy evaluation in place like magic.

using JLD2

jldsave("test.jld2"; dfc)
dfc = load("test.jld2","dfc")

@evetion evetion merged commit 97d0d7f into master Aug 14, 2023
9 of 12 checks passed
@evetion evetion deleted the fix/allocations branch August 14, 2023 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants