Reduce allocations and a bit of memory. #57
Conversation
""" | ||
classify(gg::ClassifyATL03; kwargs...) = classify(gg.atl03, gg.atl08; kwargs...) | ||
|
||
|
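For context, a minimal sketch of what this wrapper type could look like (the field names follow the method above and the granule types mentioned below; the actual definition in the PR may differ):
struct ClassifyATL03
    atl03::ICESat2_Granule{:ATL03}
    atl08::ICESat2_Granule{:ATL08}
end
# classify(gg::ClassifyATL03; kwargs...) then simply forwards to the two stored granules,
# as shown in the diff above.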
Let me see if I have this right... the idea here is to lazily map between ATL03 and ATL08: by creating a type ClassifyATL03 you can dispatch on classify. Why is a new type needed? Why could you not just have a DF row of ICESat2_Granule{:ATL03} and a corresponding row of ICESat2_Granule{:ATL08} that could be dispatched on classify(gg.atl03, gg.atl08) when the classes are desired? Or could you not simply pass classify(gg.atl03, nothing) and have the ATL08 file read in?
At the moment, if one wants a classified ATL03, one does classify(atl03, atl08=nothing) and gets a vector of namedtuples (just like points(atl08)). Going from there to a DataFrame is costly.
Example of ATL08:
julia> @time points(g);
0.008027 seconds (1.80 k allocations: 430.406 KiB)
julia> @time df = reduce(vcat, DataFrame.(points(g)));
0.008888 seconds (3.17 k allocations: 952.445 KiB)
That's why I implemented the Tables.jl interface on top of a granule, so DataFrame(granule) calls points(granule) itself, but the result is vcatted with much less overhead.
julia> @time df2 = DataFrame(g);
0.010479 seconds (2.24 k allocations: 466.641 KiB)
julia> isequal(df, df2) == true
To make this work one needs to define Tables.partitions on a struct. So we can either change the output of all points and classify methods to return a new generic SpaceLiDAR.Table thing, which implements the Tables interface (breaking). And/or we go the route of Query-like objects (which this ClassifyATL03 is, and they could also define which columns one wants in the future). Something like
q = Query(::Granule; columns=("longitude",); attributes=("atlas_beam_type",))
df = DataFrame(q)
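For the first option, a rough sketch under assumptions (the type name, fields, and per-beam layout here are hypothetical, not the PR's actual code) of what defining the Tables interface, including Tables.partitions, could look like:
using Tables
# Hypothetical container: one NamedTuple of column vectors per beam, as returned by points().
struct GranuleTable
    beams::Vector{NamedTuple}
end
Tables.istable(::Type{GranuleTable}) = true
Tables.rowaccess(::Type{GranuleTable}) = true
# Iterate rows across all beams lazily, so a sink like DataFrame avoids
# the reduce(vcat, DataFrame.(points(g))) overhead shown above.
Tables.rows(t::GranuleTable) = Iterators.flatten(Tables.rows(b) for b in t.beams)
# Each per-beam NamedTuple is itself a valid column table, so expose it as a partition;
# partition-aware sinks (e.g. Arrow.write) can then stream beam by beam.
Tables.partitions(t::GranuleTable) = t.beams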
You'll have a better idea of this than I do, but it seems like we might want to bite the bullet and move toward Tables.partitions on a struct. It seems like this would be more flexible in the end and would place less burden on the users/developers to implement a number of Query methods. I personally find the Query approach clever but difficult to intuit, and I suspect users will as well.
That said... there's an old saying that "if it ain't broke, don't fix it", so I'm also supportive of non-breaking changes, as the package is working really great now.
Let's see if I can hack a new type that subtypes a vector of namedtuples, so all documented code keeps working, and gives us the benefit of defining the Tables interface on it.
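A hedged sketch of that direction (names and fields hypothetical): keep the AbstractVector behaviour so documented code that indexes or iterates the per-beam namedtuples keeps working, and layer the Tables methods from the sketch above on top.
struct BeamVector{T<:NamedTuple} <: AbstractVector{T}
    beams::Vector{T}
end
# Minimal AbstractVector interface, so it behaves like the current Vector of namedtuples.
Base.size(v::BeamVector) = size(v.beams)
Base.getindex(v::BeamVector, i::Int) = v.beams[i]
# Tables.istable, Tables.rows and Tables.partitions would then be defined on BeamVector
# exactly as in the GranuleTable sketch above.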
Will drop the Classify from this PR so we can merge the other changes.
test/runtests.jl
Outdated
SL.materialize!(df)
@test df.classification isa Vector{String}

C = SL.ClassifyATL03(g, g8)
dfc = DataFrame(C)
dfc fails to save as an Arrow file.
using Arrow
Arrow.write("test.arrow", dfc) fails:
MethodError: no method matching arrowtype(::Arrow.FlatBuffers.Builder, ::Type{Ptr{Nothing}})
Closest candidates are:
arrowtype(::Any, ::Union{Arrow.DenseUnion{S, Arrow.UnionT{T, typeIds, U}}, Arrow.SparseUnion{S, Arrow.UnionT{T, typeIds, U}}}) where {S, T, typeIds, U}
@ Arrow ~/.julia/packages/Arrow/R2Rvz/src/eltypes.jl:495
arrowtype(::Any, ::Arrow.Struct{T, S}) where {T, S}
@ Arrow ~/.julia/packages/Arrow/R2Rvz/src/eltypes.jl:479
arrowtype(::Any, ::Arrow.FixedSizeList{T, A}) where {T, A}
@ Arrow ~/.julia/packages/Arrow/R2Rvz/src/eltypes.jl:431
...
Hmm, we don't test Arrow currently. But since C implements the table interface, you can just do:
Arrow.write("test.arrow", C)
Arrow.Table("test.arrow")
which works; not sure why the DataFrame variant fails.
JLD2 works like a charm and keeps lazy evaluation in place like magic.
using JLD2
jldsave("test.jld2"; dfc)
dfc = load("test.jld2","dfc")
Apart from reducing the overall allocations and reducing some memory usage from an overuse of attrs and keys, all of which (sadly) has a negligible impact on performance, I've also added a way of getting a DataFrame of a classified ATL03 pointset without extra allocations (as you would currently get from classify, having to go with reduce(vcat, DataFrame.())).
I'm not yet sure about the name ClassifyATL03, but this lazy way of combining granules/operations seems useful. In the end, like icepyx or DataFrames' groupby, one could have Query structs that detail some (lazy) operations.
Also added tests for all DataFrame(granule) calls for each data product.
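As a usage sketch (variable names are illustrative; the tests use g and g8), the difference amounts to:
using DataFrames
# Before: classify returns a vector of per-beam namedtuples, which have to be
# materialized into per-beam DataFrames and vcatted.
df_old = reduce(vcat, DataFrame.(classify(atl03, atl08)))
# With this PR: the lazy ClassifyATL03 wrapper implements the Tables.jl interface,
# so a single DataFrame call does the same with far fewer allocations.
df_new = DataFrame(SL.ClassifyATL03(atl03, atl08))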