
Serious performance issues with read #73

Open · xiaodaigh opened this issue May 18, 2020 · 19 comments

@xiaodaigh (Contributor)

See the repo and the code below.

https://github.com/xiaodaigh/parquet-data-collection/blob/master/test_files.jl

To replicate, clone the repo and run the file linked above (it is included in the repo).

It takes a LONG time.

@xiaodaigh (Contributor, Author)

I suspect the issue is that RecordCursor loops through each row, which is highly inefficient, especially considering Parquet is a columnar format.

I think the lowest level of cursor/iteration should be at the column-chunk level, not the row level. The row-level cursor should live in a utility library like ParquetFiles.jl to keep this package clean.
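
As a purely illustrative pure-Julia sketch (not Parquet.jl code), the difference between the two designs is roughly rows-of-structs versus struct-of-arrays: a row-level cursor materialises one small object per record, while a column-chunk-level read hands back whole vectors, and extracting a single column from row-oriented output requires a pass over every row.

    # Hypothetical sketch of the two layouts; no Parquet.jl API involved.
    n = 1_000_000

    # What a row-level cursor effectively produces: one small object per record.
    rows = [(x = i, y = i / 2) for i in 1:n]

    # What a column-chunk-level read can produce: one contiguous vector per column.
    cols = (x = collect(1:n), y = collect(1:n) ./ 2)

    # Getting one column from the row-oriented form needs a full pass over all rows...
    col_x_rowwise = [r.x for r in rows]
    # ...whereas the columnar form already has it.
    col_x_colwise = cols.x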

@tanmaykm (Member)

Data does get read in chunks internally.

Could you post some benchmark results comparing times of Parquet.jl master to the last release?

@xiaodaigh (Contributor, Author)

Could you post some benchmark results comparing times of Parquet.jl master to the last release?

I mean to say it's dead slow on both master and the latest release.

@tanmaykm (Member)

I was able to trace some issues, PR here: #77
But more can be done.

@xiaodaigh (Contributor, Author)

xiaodaigh commented May 19, 2020

The issue seems to go much deeper, into the reader design. In the fork linked below, I can read the file in 3–4 seconds, but with Parquet.jl #77 it takes 130s.

See the fork here https://github.com/xiaodaigh/Parquet.jl/blob/master/README.md

@tanmaykm (Member)

I tried the fork, following the README linked above. Not sure if I got it right, but read_parquet seems to return a single row in 13 sec.

julia> using Parquet

julia> @time df =  read_parquet("dsd50p.parquet");
Progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:07
 13.827823 seconds (60.12 M allocations: 2.331 GiB, 4.70% gc time)

julia> isa(df, NamedTuple)
true

@xiaodaigh (Contributor, Author)

xiaodaigh commented May 19, 2020

but read_parquet seems to return a single row in 13 sec.

Try this

df.tile_count

and you will see that it returns the whole column, not just one row. Each name is attached to a vector, not a single value (see the usage sketch at the end of this comment).

As to speed, it's probably more due to differences in hardware: I consistently get 3s on my 6-core machine, and I also have 2 SSDs in RAID 0.

Update

BTW, 3s should be considered really slow, since R's arrow package can read it and summarise it in about ~0.13s. So lots of room to improve.
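
A small usage sketch of the point above (assuming the fork's read_parquet really does return a NamedTuple of column vectors, as the REPL output earlier suggests): a NamedTuple of vectors is a valid Tables.jl table, so it can be handed straight to DataFrames.jl. The names below are an illustrative stand-in, not output from the actual file.

    using DataFrames  # assumes DataFrames.jl is installed

    # Stand-in for the NamedTuple of vectors returned by the fork's read_parquet.
    nt = (tile_count = [1, 2, 3], name = ["a", "b", "c"])

    nt.tile_count          # the whole column as a vector, not a single value
    table = DataFrame(nt)  # any Tables.jl-compatible table converts directly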

xiaodaigh changed the title from "Serious performance issues with master read" to "Serious performance issues with read" on May 19, 2020
@tanmaykm (Member)

tanmaykm commented May 19, 2020

Okay, I see the column now. But it looks like you are disregarding the nested schema? So this will work only for non-nested data?

@tanmaykm (Member)

And you seem to be using the same underlying page and column chunk read methods of Parquet.jl, right? So the current performance issues are on the layer above that.

@xiaodaigh (Contributor, Author)

So this will work only for non-nested data?

That is correct, and that is the issue bothering me: I don't understand the nestedness, so I actually don't know how to deal with it.

Ideally, there might be things in my approach that can be used to handle nested schemas?

Honestly, the reason I am not motivated to do the nested-schema work is that I have never needed or used it.

Would it be acceptable to have a set of functions just to read non-nested data?

@xiaodaigh (Contributor, Author)

So the current performance issues are on the layer above that.

Yes, I think so. Although there's definitely room to improve the values method, e.g. it uses push!, which causes much more allocation than necessary for unnested column reads (not sure about nested columns, so I've kept my comment to unnested columns).
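
To illustrate the allocation point in isolation (plain Julia, not the actual values method): repeatedly push!-ing into a vector grows it through repeated reallocations, whereas preallocating with Vector{T}(undef, n) avoids most of that when the length is known up front, as it is for a column chunk.

    # Hypothetical sketch of the push! vs. preallocation trade-off; not Parquet.jl code.
    function collect_push(n)
        out = Int[]
        for i in 1:n
            push!(out, i)        # grows the vector, triggering occasional reallocations
        end
        return out
    end

    function collect_prealloc(n)
        out = Vector{Int}(undef, n)  # length is known up front for a column chunk
        for i in 1:n
            out[i] = i
        end
        return out
    end

    # @time collect_push(10^7); @time collect_prealloc(10^7)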

@tanmaykm (Member)

Would it be acceptable to have a set of functions just to read non-nested data?

I guess. But I would try and figure out why we cannot handle nestedness and be performant as well.

@xiaodaigh (Contributor, Author)

xiaodaigh commented May 19, 2020

why we cannot handle nestedness and be performant as well.

I don't see a reason. The issue is that I don't have the knowledge to contribute code that handles nestedness and is also performant. Also, I have no motivation there, since I never use nested columns.

So I only see three options:

  1. we try a PR with just unnested columns (which I think is the majority of use cases out there)
  2. someone other than me writes performant code to handle nested and unnested columns
  3. do nothing

I would prefer 1 to 3, but I can only contribute to 1 to prevent 3, not 2. If 2 can happen, it would be great! But if 2 takes a year, then I would rather see 1 than 3 as the eventuality. I can commit to doing 1.

@tanmaykm (Member)

I don't understand the nestedness, so I actually don't know how to deal with it.

This is basically what the definition levels and repetition levels allow. It is based on Dremel; there should be some material out there if you search for it.
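
For readers landing here, a tiny worked example of what those levels encode (a hypothetical Dremel-style sketch in plain Julia, not Parquet.jl's actual decoder): for a single repeated string column holding the rows ["a", "b"], [], ["c"], the file stores only the defined leaf values plus one (repetition level, definition level) pair per entry, and the nested rows can be reassembled from them.

    # Hypothetical sketch of Dremel-style record assembly for one repeated column.
    values    = ["a", "b", "c"]   # only defined leaf values are stored
    replevels = [0, 1, 0, 0]      # 0 = entry starts a new row, 1 = continues the current list
    deflevels = [1, 1, 0, 1]      # max level (1) = value present, 0 = the row's list is empty

    function assemble(values, replevels, deflevels)
        rows = Vector{Vector{String}}()
        vi = 1
        for (r, d) in zip(replevels, deflevels)
            r == 0 && push!(rows, String[])   # repetition level 0 opens a new record
            if d == 1                         # fully defined => consume the next value
                push!(rows[end], values[vi])
                vi += 1
            end
        end
        return rows
    end

    assemble(values, replevels, deflevels)  # => [["a", "b"], [], ["c"]]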

@xiaodaigh (Contributor, Author)

there should be some material out there if you search for it.

Yeah, the issue is that I am not motivated, since I have never needed it. If you are too busy or also not motivated, do you think we could ask NumFOCUS for funding? I would be happy to read the Dremel paper and implement it for some monetary compensation.

If we get funding, do you want to split it? I'm happy for it to be a solo effort on either side too. Happy either way.

@tanmaykm (Member)

we try a PR with just unnested columns (which I think is the majority of use cases out there)

I am open to this. Maybe we should keep it in a sub module to keep the interface clean. The sub module can implement the simpler non-nested read and write using the other lower level methods.

@xiaodaigh (Contributor, Author)

xiaodaigh commented May 19, 2020

I am open to this. Maybe we should keep it in a sub module to keep the interface clean.

OK. It will be quite a release, with both the writer and column-based reads. Or do you want to finish off the writer PR first?

Update
See WIP PR #79

@xiaodaigh (Contributor, Author)

Maybe we should keep it in a sub module

Any ideas for the sub module name? UnnestedColumnReader? It might be painful when the reader is fixed to deal with nested columns too.

@ym-han

ym-han commented Aug 16, 2021

Just wanted to chime in real quick to say that it would be great if we could improve the read performance! The difference between the Julia Parquet reader and fastparquet (via PyCall) was like night and day for me. But I totally understand if there's no funding, etc.
