Reading a Parquet file into a DataFrame is ~170x slower than using CSV.read with the same data. I'm not sure I can help improve performance, but this is limiting my use of ParquetFiles.jl.
MWE:
(@v1.4) pkg> st
Status `~/.julia/environments/v1.4/Project.toml`
[6e4b80f9] BenchmarkTools v0.5.0
[336ed68f] CSV v0.6.2
[a93c6f00] DataFrames v0.21.2
[626c502c] Parquet v0.4.0
[46a55296] ParquetFiles v0.2.0
using ParquetFiles, BenchmarkTools, CSV, DataFrames
CSV.read("data.csv")
DataFrame(load("data.parquet"))
Loading times for ParquetFiles:
@benchmark DataFrame(load("data.parquet"))
BenchmarkTools.Trial:
  memory estimate:  45.66 MiB
  allocs estimate:  961290
  --------------
  minimum time:     287.492 ms (0.00% GC)
  median time:      290.843 ms (0.00% GC)
  mean time:        296.344 ms (1.64% GC)
  maximum time:     326.041 ms (8.46% GC)
  --------------
  samples:          17
  evals/sample:     1
Loading times for CSV:
@benchmark CSV.read("data.csv")
BenchmarkTools.Trial:
  memory estimate:  758.14 KiB
  allocs estimate:  2299
  --------------
  minimum time:     1.690 ms (0.00% GC)
  median time:      1.735 ms (0.00% GC)
  mean time:        1.772 ms (1.43% GC)
  maximum time:     14.096 ms (63.93% GC)
  --------------
  samples:          2817
  evals/sample:     1
As compared to pandas:
import pandas as pd
%timeit pd.read_parquet("data.parquet")
# 3.61 ms ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.read_csv("data.csv")
# 4.73 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I think one of the reasons is that ParquetFiles.jl doesn't implement the Tables.columns interface, so DataFrame(...) falls back to the generic path, which builds the table by appending row by row.
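To make the distinction concrete, here is a minimal sketch of the two access paths in Tables.jl. It assumes only that Tables.jl is installed. `RowOnlySource` and `ColumnSource` are hypothetical types invented for this example; they are not the types ParquetFiles.jl actually uses, and this does not reproduce its internals.

```julia
using Tables

# Hypothetical source that only exposes rows (assumption for illustration).
struct RowOnlySource
    rows::Vector{NamedTuple{(:a, :b), Tuple{Int, Float64}}}
end
Tables.istable(::Type{RowOnlySource}) = true
Tables.rowaccess(::Type{RowOnlySource}) = true
Tables.rows(s::RowOnlySource) = s.rows

# Hypothetical source that exposes its columns directly (assumption for illustration).
struct ColumnSource
    a::Vector{Int}
    b::Vector{Float64}
end
Tables.istable(::Type{ColumnSource}) = true
Tables.columnaccess(::Type{ColumnSource}) = true
Tables.columns(s::ColumnSource) = (a = s.a, b = s.b)

rowsrc = RowOnlySource([(a = 1, b = 1.0), (a = 2, b = 2.0), (a = 3, b = 3.0)])
colsrc = ColumnSource([1, 2, 3], [1.0, 2.0, 3.0])

# No Tables.columns method is defined for RowOnlySource, so Tables.jl's generic
# fallback iterates the rows and builds fresh column vectors from them.
built = Tables.columns(rowsrc)
col_a_from_rows = Tables.getcolumn(built, :a)

# ColumnSource hands back its existing vectors without iterating any rows.
direct = Tables.columns(colsrc)
col_a_direct = Tables.getcolumn(direct, :a)
```

Both paths yield the same data, but the row-only path has to visit every row and allocate new columns, while the column path can reuse the vectors the source already holds. If the guess above is right, a source that defines `Tables.columns` itself would let `DataFrame(...)` skip the row iteration.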
Data are included in zip file:
data.zip