Reduce allocations and a bit of memory. #57
Conversation
""" | ||
classify(gg::ClassifyATL03; kwargs...) = classify(gg.atl03, gg.atl08; kwargs...) | ||
|
||
|
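For context, a minimal sketch of what this wrapper type could look like (the field names follow the method above and the granule types mentioned below; the actual definition in the PR may differ):
struct ClassifyATL03
    atl03::ICESat2_Granule{:ATL03}
    atl08::ICESat2_Granule{:ATL08}
end
# classify(gg::ClassifyATL03; kwargs...) then simply forwards to the two stored granules,
# as shown in the diff above.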
Let me see if I have this right... the idea here is to lazily map between ATL03 and ATL08: by creating a type ClassifyATL03 you can dispatch on classify. Why is a new type needed? Why could you not just have a DF row of ICESat2_Granule{:ATL03} and a corresponding row of ICESat2_Granule{:ATL08} that could be dispatched on classify(gg.atl03, gg.atl08) when the classes are desired? Or could you not simply pass classify(gg.atl03, nothing) and have the ATL08 file read in?
At the moment, if one wants a classified ATL03, one does classify(atl03, atl08=nothing) and gets a vector of namedtuples (just like points(atl08)). Going from there to a DataFrame is costly.
Example of ATL08:
julia> @time points(g);
0.008027 seconds (1.80 k allocations: 430.406 KiB)
julia> @time df = reduce(vcat, DataFrame.(points(g)));
0.008888 seconds (3.17 k allocations: 952.445 KiB)
That's why I implemented the Tables.jl interface on top of a granule, so DataFrame(granule) calls points(granule) itself, but the result is vcatted with much less overhead.
julia> @time df2 = DataFrame(g);
0.010479 seconds (2.24 k allocations: 466.641 KiB)
julia> isequal(df, df2) == true
To make this work one needs to define Tables.partitions on a struct. So we can either change the output of all points and classify methods to return a new generic SpaceLiDAR.Table thing, which implements the Tables interface (breaking). And/or we go the route of Query-like objects (which this ClassifyATL03 is, and they could also define which columns one wants in the future). Something like
q = Query(::Granule; columns=("longitude",); attributes=("atlas_beam_type",))
df = DataFrame(q)
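For the first option, a rough sketch under assumptions (the type name, fields, and per-beam layout here are hypothetical, not the PR's actual code) of what defining the Tables interface, including Tables.partitions, could look like:
using Tables
# Hypothetical container: one NamedTuple of column vectors per beam, as returned by points().
struct GranuleTable
    beams::Vector{NamedTuple}
end
Tables.istable(::Type{GranuleTable}) = true
Tables.rowaccess(::Type{GranuleTable}) = true
# Iterate rows across all beams lazily, so a sink like DataFrame avoids
# the reduce(vcat, DataFrame.(points(g))) overhead shown above.
Tables.rows(t::GranuleTable) = Iterators.flatten(Tables.rows(b) for b in t.beams)
# Each per-beam NamedTuple is itself a valid column table, so expose it as a partition;
# partition-aware sinks (e.g. Arrow.write) can then stream beam by beam.
Tables.partitions(t::GranuleTable) = t.beams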
You'll have a better idea of this than I do, but it seems like we might want to bite the bullet and move toward Tables.partitions on a struct. It seems like this would be more flexible in the end and would place less burden on the users/developers to implement a number of Query methods. I personally find the Query approach clever but difficult to intuit, and I suspect users will as well.
That said... there's an old saying that "if it ain't broke, don't fix it", so I'm also supportive of non-breaking changes, as the package is working really great now.
Let's see if I can hack a new type that subtypes a vector of namedtuples, so all documented code keeps working, and gives us the benefit of defining the Tables interface on it.
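A hedged sketch of that direction (names and fields hypothetical): keep the AbstractVector behaviour so documented code that indexes or iterates the per-beam namedtuples keeps working, and layer the Tables methods from the sketch above on top.
struct BeamVector{T<:NamedTuple} <: AbstractVector{T}
    beams::Vector{T}
end
# Minimal AbstractVector interface, so it behaves like the current Vector of namedtuples.
Base.size(v::BeamVector) = size(v.beams)
Base.getindex(v::BeamVector, i::Int) = v.beams[i]
# Tables.istable, Tables.rows and Tables.partitions would then be defined on BeamVector
# exactly as in the GranuleTable sketch above.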
Will drop the Classify from this PR so we can merge the other changes.
test/runtests.jl
Outdated
SL.materialize!(df)
@test df.classification isa Vector{String}

C = SL.ClassifyATL03(g, g8)
dfc = DataFrame(C)
dfc fails to save as an Arrow file.
using Arrow
Arrow.write("test.arrow", dfc) fails:
MethodError: no method matching arrowtype(::Arrow.FlatBuffers.Builder, ::Type{Ptr{Nothing}})
Closest candidates are:
arrowtype(::Any, ::Union{Arrow.DenseUnion{S, Arrow.UnionT{T, typeIds, U}}, Arrow.SparseUnion{S, Arrow.UnionT{T, typeIds, U}}}) where {S, T, typeIds, U}
@ Arrow ~/.julia/packages/Arrow/R2Rvz/src/eltypes.jl:495
arrowtype(::Any, ::Arrow.Struct{T, S}) where {T, S}
@ Arrow ~/.julia/packages/Arrow/R2Rvz/src/eltypes.jl:479
arrowtype(::Any, ::Arrow.FixedSizeList{T, A}) where {T, A}
@ Arrow ~/.julia/packages/Arrow/R2Rvz/src/eltypes.jl:431
...
Hmm, we don't test Arrow currently. But since C implements the table interface, you can just do:
Arrow.write("test.arrow", C)
Arrow.Table("test.arrow")
which works; not sure why the DataFrame variant fails.
JLD2 works like a charm and keeps lazy evaluation in place like magic.
using JLD2
jldsave("test.jld2"; dfc)
dfc = load("test.jld2","dfc")
Apart from reducing the overall allocations and reducing some memory usage from an overuse of attrs and keys, all of which (sadly) has a negligible impact on performance, I've also added a way of getting a DataFrame of a classified ATL03 pointset without extra allocations (as you would currently get from classify, having to go with reduce(vcat, DataFrame.())).
I'm not yet sure about the name ClassifyATL03, but this lazy way of combining granules/operations seems useful. In the end, like icepyx or DataFrames' groupby, one could have Query structs that detail some (lazy) operations.
Also added tests for all DataFrame(granule) calls for each data product.
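As a usage sketch (variable names are illustrative; the tests use g and g8), the difference amounts to:
using DataFrames
# Before: classify returns a vector of per-beam namedtuples, which have to be
# materialized into per-beam DataFrames and vcatted.
df_old = reduce(vcat, DataFrame.(classify(atl03, atl08)))
# With this PR: the lazy ClassifyATL03 wrapper implements the Tables.jl interface,
# so a single DataFrame call does the same with far fewer allocations.
df_new = DataFrame(SL.ClassifyATL03(atl03, atl08))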