Arrow.write performance on large DataFrame #473

Arrow.write is taking a long time (over an hour, still going) on a 25M by ~100 (floats, ints, inlinestrings, missings) DataFrame to an NVMe SSD. Is this to be expected? I have InlineStrings #main (after JuliaStrings/InlineStrings.jl#66) installed.
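(For context, a minimal sketch of one way to track the #main branch mentioned above via Pkg's rev keyword; this exact command is an assumption, not quoted from the post.)

using Pkg
# Track the main branch of InlineStrings rather than a registered release.
Pkg.add(name="InlineStrings", rev="main")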
Comments
julia> 25 * 10^6 * 8/1024^3
0.1862645149230957

Roughly 0.18 GB? Yeah, that's way too slow. Is there any way to generate the same schema with dummy data?
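(Not from the thread, but a rough sketch of what "same schema with dummy data" could look like. The dummy_frame helper, column names, string lengths, and row count are made up; the column eltypes mirror a few entries from the countmap posted in the next comment, and only a handful of the 71 columns are shown.)

using DataFrames, InlineStrings, Random

# Build a table whose column eltypes mimic the reported schema, filled with
# random values. Names like `id` and `code` are placeholders.
function dummy_frame(nrows::Integer)
    maybe(x) = rand() < 0.1 ? missing : x   # sprinkle in ~10% missings
    DataFrame(
        id    = rand(Int64, nrows),                                   # Int64
        count = [maybe(rand(Int64)) for _ in 1:nrows],                # Union{Missing, Int64}
        code  = [String15(randstring(8)) for _ in 1:nrows],           # String15
        label = [maybe(String127(randstring(40))) for _ in 1:nrows],  # Union{Missing, String127}
        blank = fill(missing, nrows),                                 # Missing
    )
end

df_dummy = dummy_frame(1_000_000)   # scale up toward the real 23_558_194 rows as needed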
julia> size(df)
(23558194, 71)
julia> countmap(eltype.(eachcol(df)))
Dict{Type, Int64} with 14 entries:
Int64 => 7
Union{Missing, String15} => 4
Union{Missing, String127} => 11
Union{Missing, String63} => 3
String15 => 5
Union{Missing, Int64} => 2
String63 => 3
String7 => 6
String127 => 2
Missing => 14
String255 => 1
String31 => 6
Union{Missing, String31} => 5
Union{Missing, String255} => 2

It takes ~1 second to write ~1 GB, so naively, writing 81 GB shouldn't take very long either; I expect some overhead, but hours is a lot.

julia> let r = rand(UInt8, 1024^3)
           @time open("data/temp", "w") do f
               write(f, r)
           end
       end;
  1.118401 seconds (2.61 k allocations: 190.905 KiB, 0.87% compilation time)
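(A follow-up idea, not something actually run in this thread: time Arrow.write on a slice of the real table to see whether serialization, rather than raw disk throughput, is the bottleneck. `df` here is assumed to be the 23,558,194 x 71 table from above; the file path is made up.)

using Arrow, DataFrames

# Write the first million rows and extrapolate. If this already takes minutes,
# the slowdown is in serialization or memory pressure, not the SSD.
sample = df[1:1_000_000, :]
@time Arrow.write("data/sample.arrow", sample)
# Naive estimate for the full table: elapsed * nrow(df) / nrow(sample)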
Computer is swapping.
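(A hedged aside, not from the thread: if swapping is the suspect, comparing the in-memory size of the table against available RAM is a quick sanity check. `df` is again assumed to be the full table.)

# Base.summarysize walks the whole object, so it can be slow on a 23M-row
# table, but it gives a realistic in-RAM footprint to compare against free RAM.
println("DataFrame: ", round(Base.summarysize(df) / 1024^3, digits=1), " GiB")
println("Total RAM: ", round(Sys.total_memory() / 1024^3, digits=1), " GiB")
println("Free RAM:  ", round(Sys.free_memory() / 1024^3, digits=1), " GiB")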