
Reading Parquet to DataFrame is slow  #32

Open
@tclements

Reading a Parquet file into a DataFrame is ~170× slower than reading the same data with CSV.read (median 290.8 ms vs. 1.735 ms in the benchmarks below). I'm not sure I can help improve performance, but this is limiting my use of ParquetFiles.jl.

MWE:

(@v1.4) pkg> st
Status `~/.julia/environments/v1.4/Project.toml`
  [6e4b80f9] BenchmarkTools v0.5.0
  [336ed68f] CSV v0.6.2
  [a93c6f00] DataFrames v0.21.2
  [626c502c] Parquet v0.4.0
  [46a55296] ParquetFiles v0.2.0

using ParquetFiles, BenchmarkTools, CSV, DataFrames
CSV.read("data.csv")
DataFrame(load("data.parquet"))

Loading times for ParquetFiles:

@benchmark DataFrame(load("data.parquet"))
BenchmarkTools.Trial: 
  memory estimate:  45.66 MiB
  allocs estimate:  961290
  --------------
  minimum time:     287.492 ms (0.00% GC)
  median time:      290.843 ms (0.00% GC)
  mean time:        296.344 ms (1.64% GC)
  maximum time:     326.041 ms (8.46% GC)
  --------------
  samples:          17
  evals/sample:     1

Loading times for CSV:

@benchmark CSV.read("data.csv")
BenchmarkTools.Trial: 
  memory estimate:  758.14 KiB
  allocs estimate:  2299
  --------------
  minimum time:     1.690 ms (0.00% GC)
  median time:      1.735 ms (0.00% GC)
  mean time:        1.772 ms (1.43% GC)
  maximum time:     14.096 ms (63.93% GC)
  --------------
  samples:          2817
  evals/sample:     1

As compared to pandas:

import pandas as pd
%timeit pd.read_parquet("data.parquet")
# 3.61 ms ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pd.read_csv("data.csv")
# 4.73 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
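
For reference, here is a rough sketch of how the slowdown factors follow from the timings above. The numbers are copied from the benchmark output in this issue. The variable names and the `slowdown` helper are only illustrative, and the pandas comparison mixes a Julia median with a `%timeit` mean, so treat that ratio as approximate.

```python
# Timings copied from the benchmark output in this issue, in milliseconds.
julia_parquet_ms = 290.843   # median of DataFrame(load("data.parquet"))
julia_csv_ms = 1.735         # median of CSV.read("data.csv")
pandas_parquet_ms = 3.61     # %timeit mean of pd.read_parquet("data.parquet")


def slowdown(slow_ms: float, fast_ms: float) -> float:
    """Return how many times slower `slow_ms` is than `fast_ms`."""
    return slow_ms / fast_ms


# ParquetFiles.jl versus CSV.jl on the same data (median vs. median).
parquet_vs_csv = slowdown(julia_parquet_ms, julia_csv_ms)

# ParquetFiles.jl versus pandas reading the same Parquet file
# (median vs. mean, so only a rough comparison).
parquet_vs_pandas = slowdown(julia_parquet_ms, pandas_parquet_ms)

print(f"ParquetFiles vs CSV.read:        {parquet_vs_csv:.0f}x slower")
print(f"ParquetFiles vs pd.read_parquet: {parquet_vs_pandas:.0f}x slower")
```

So the Parquet read is roughly two orders of magnitude slower than either the CSV read in Julia or the Parquet read in pandas.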

Data are included in zip file:
data.zip
