Arrow.write performance on large DataFrame #473

Closed
jariji opened this issue Jul 2, 2023 · 3 comments

Comments

@jariji

jariji commented Jul 2, 2023

Arrow.write is taking a long time (over an hour and still running) to write a DataFrame of 25M rows by ~100 columns (floats, ints, InlineStrings, missings) to an NVMe SSD. Is this expected?

I have InlineStrings #main (after JuliaStrings/InlineStrings.jl#66) installed.

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900XT 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 24 on 24 virtual cores

  [69666777] Arrow v2.6.2
  [a93c6f00] DataFrames v1.5.0
  [842dd82b] InlineStrings v1.4.0 `https://github.com/JuliaStrings/InlineStrings.jl#main`
@Moelf
Contributor

Moelf commented Jul 2, 2023

julia> 25 * 10^6 * 8/1024^3
0.1862645149230957

Roughly 0.18 GB? Yeah, that's way too slow. Is there any way to generate the same schema with dummy data?

@jariji
Author

jariji commented Jul 2, 2023

Base.summarysize(df) says 81 GB.

julia> size(df)
(23558194, 71)

julia> countmap(eltype.(eachcol(df)))
Dict{Type, Int64} with 14 entries:
  Int64                     => 7
  Union{Missing, String15}  => 4
  Union{Missing, String127} => 11
  Union{Missing, String63}  => 3
  String15                  => 5
  Union{Missing, Int64}     => 2
  String63                  => 3
  String7                   => 6
  String127                 => 2
  Missing                   => 14
  String255                 => 1
  String31                  => 6
  Union{Missing, String31}  => 5
  Union{Missing, String255} => 2

Writing ~1 GB of raw bytes takes ~1 second, so naively, writing 81 GB should take a couple of minutes at most; I expect some overhead, but hours is a lot.

julia> let r = rand(UInt8, 1024^3)
           @time open("data/temp", "w") do f
               write(f, r)
           end
       end;
  1.118401 seconds (2.61 k allocations: 190.905 KiB, 0.87% compilation time)
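
As a rough cross-check of the 81 GB figure, the footprint can be estimated from the column types alone. This is only a sketch under stated assumptions that are not from the thread: an InlineStrings `StringN` is taken to occupy N + 1 bytes, a `Union{Missing, T}` column is taken to store one extra type-tag byte per element, and an all-`Missing` column is taken to store no per-element data. The `schema` table is a hand-copied version of the `countmap` output above.

```julia
# Rough in-memory footprint of the DataFrame, estimated from its column types.
# Assumptions (not from the thread): StringN occupies N + 1 bytes, a
# Union{Missing, T} column adds one type-tag byte per element, and a column
# whose eltype is Missing stores no per-element data.

nrows = 23_558_194

# (element type, bytes per element, allows missing?, number of columns)
schema = [
    ("Int64",       8, false,  7),
    ("String15",   16, true,   4),
    ("String127", 128, true,  11),
    ("String63",   64, true,   3),
    ("String15",   16, false,  5),
    ("Int64",       8, true,   2),
    ("String63",   64, false,  3),
    ("String7",     8, false,  6),
    ("String127", 128, false,  2),
    ("Missing",     0, false, 14),
    ("String255", 256, false,  1),
    ("String31",   32, false,  6),
    ("String31",   32, true,   5),
    ("String255", 256, true,   2),
]

# Total column count; should match size(df, 2).
ncols = sum(n for (name, bytes, nullable, n) in schema)

# Bytes needed for one row across all columns, including the type-tag byte.
bytes_per_row = sum((bytes + (nullable ? 1 : 0)) * n for (name, bytes, nullable, n) in schema)

total_bytes = bytes_per_row * nrows

println("columns:       ", ncols)
println("bytes per row: ", bytes_per_row)
println("total:         ", round(total_bytes / 10^9, digits = 1), " GB")
```

Under those assumptions the estimate comes to 71 columns and about 81.5 GB, which is in line with what `Base.summarysize(df)` reports, so the table really is that large in memory.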

@jariji
Author

jariji commented Jul 2, 2023

The computer is swapping, which explains the slow write.
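
For anyone hitting the same thing, a quick sanity check before a large write is to compare the object's size with the machine's RAM. The sketch below is illustrative only: `likely_fits_in_ram` and its `headroom` parameter are hypothetical names, not part of Arrow.jl or DataFrames.jl, and the 0.5 default is an arbitrary safety margin rather than a measured requirement.

```julia
# Sketch: check whether an object of `nbytes` is small relative to physical RAM.
# `likely_fits_in_ram` and `headroom` are hypothetical, chosen for illustration.
function likely_fits_in_ram(nbytes::Integer;
                            total::Integer = Sys.total_memory(),
                            headroom::Real = 0.5)
    # Treat the object as "fitting" only if it uses at most `headroom` of RAM,
    # leaving the rest for the OS, other processes, and temporary buffers.
    return nbytes <= headroom * total
end

# Example use in a session that holds `df`:
#   likely_fits_in_ram(Base.summarysize(df)) ||
#       @warn "df is large relative to RAM; writing it may cause swapping"
println(likely_fits_in_ram(1024))
```

If the check fails, the slowdown is more likely memory pressure than anything in the write path itself.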

jariji closed this as completed Jul 2, 2023