
Serious performance issues with read #73

Open · xiaodaigh opened this issue May 18, 2020 · 19 comments

@xiaodaigh (Contributor)

See the repo and the code below.

https://github.com/xiaodaigh/parquet-data-collection/blob/master/test_files.jl

To replicate, clone the repo and run the file linked above (it is included in the repo).

It takes a LONG time.

@xiaodaigh (Contributor, Author)

I suspect the issue is that RecordCursor loops through each row, which is highly inefficient, especially considering Parquet is a columnar format.

I think the lowest level of cursor/iteration should be at the column-chunk level, not the row level. The row-level cursor should live in a utility library like ParquetFiles.jl to keep this package clean.
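
As a purely illustrative pure-Julia sketch (not Parquet.jl code), the difference between the two designs is roughly rows-of-structs versus struct-of-arrays: a row-level cursor materialises one small object per record, while a column-chunk-level read hands back whole vectors, and extracting a single column from row-oriented output requires a pass over every row.

    # Hypothetical sketch of the two layouts; no Parquet.jl API involved.
    n = 1_000_000

    # What a row-level cursor effectively produces: one small object per record.
    rows = [(x = i, y = i / 2) for i in 1:n]

    # What a column-chunk-level read can produce: one contiguous vector per column.
    cols = (x = collect(1:n), y = collect(1:n) ./ 2)

    # Getting one column from the row-oriented form needs a full pass over all rows...
    col_x_rowwise = [r.x for r in rows]
    # ...whereas the columnar form already has it.
    col_x_colwise = cols.x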

@tanmaykm (Member)

Data does get read in chunks internally.

Could you post some benchmark results comparing times of Parquet.jl master to the last release?

@xiaodaigh (Contributor, Author)

Could you post some benchmark results comparing times of Parquet.jl master to the last release?

I mean to say it's dead slow on both master and the latest release.

@tanmaykm (Member)

I was able to trace some issues, PR here: #77
But more can be done.

@xiaodaigh (Contributor, Author)

xiaodaigh commented May 19, 2020

The issue seems to go much deeper, into the reader design. In the fork linked below, I can read the file in 3–4 seconds, but with Parquet.jl #77 it takes 130s.

See the fork here https://github.com/xiaodaigh/Parquet.jl/blob/master/README.md

@tanmaykm (Member)

I tried the fork, following the README linked above. Not sure if I got it right, but read_parquet seems to return a single row in 13 sec.

julia> using Parquet

julia> @time df =  read_parquet("dsd50p.parquet");
Progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:07
 13.827823 seconds (60.12 M allocations: 2.331 GiB, 4.70% gc time)

julia> isa(df, NamedTuple)
true

@xiaodaigh (Contributor, Author)

xiaodaigh commented May 19, 2020

but read_parquet seems to return a single row in 13 sec.

Try this

df.tile_count

and you will see that it returns the whole column, not just one row. Each name is attached to a vector, not a single value (see the usage sketch at the end of this comment).

As to speed, it's probably more due to differences in hardware: I consistently get 3s on my 6-core machine, and I also have 2 SSDs in RAID 0.

Update

BTW, 3s should be considered really slow, since R's arrow package can read it and summarise it in about ~0.13s. So lots of room to improve.
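
A small usage sketch of the point above (assuming the fork's read_parquet really does return a NamedTuple of column vectors, as the REPL output earlier suggests): a NamedTuple of vectors is a valid Tables.jl table, so it can be handed straight to DataFrames.jl. The names below are an illustrative stand-in, not output from the actual file.

    using DataFrames  # assumes DataFrames.jl is installed

    # Stand-in for the NamedTuple of vectors returned by the fork's read_parquet.
    nt = (tile_count = [1, 2, 3], name = ["a", "b", "c"])

    nt.tile_count          # the whole column as a vector, not a single value
    table = DataFrame(nt)  # any Tables.jl-compatible table converts directly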

xiaodaigh changed the title from "Serious performance issues with master read" to "Serious performance issues with read" on May 19, 2020
@tanmaykm (Member)

tanmaykm commented May 19, 2020

Okay, I see the column now. But it looks like you are disregarding the nested schema? So this will work only for non-nested data?

@tanmaykm (Member)

And you seem to be using the same underlying page and column chunk read methods of Parquet.jl, right? So the current performance issues are on the layer above that.

@xiaodaigh (Contributor, Author)

So this will work only for non-nested data?

That is correct, and that is the issue bothering me: I don't understand the nestedness, so I actually don't know how to deal with it.

Ideally, there might be things in my approach that can be used to handle nested schemas?

Honestly, the reason I am not motivated to do the nested-schema work is that I have never needed or used it.

Would it be acceptable to have a set of functions just to read non-nested data?

@xiaodaigh (Contributor, Author)

So the current performance issues are on the layer above that.

Yes, I think so. Although there's definitely room to improve the values method, e.g. it uses push!, which causes much more allocation than necessary for unnested column reads (not sure about nested columns, so I've kept my comment to unnested columns).
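
To illustrate the allocation point in isolation (plain Julia, not the actual values method): repeatedly push!-ing into a vector grows it through repeated reallocations, whereas preallocating with Vector{T}(undef, n) avoids most of that when the length is known up front, as it is for a column chunk.

    # Hypothetical sketch of the push! vs. preallocation trade-off; not Parquet.jl code.
    function collect_push(n)
        out = Int[]
        for i in 1:n
            push!(out, i)        # grows the vector, triggering occasional reallocations
        end
        return out
    end

    function collect_prealloc(n)
        out = Vector{Int}(undef, n)  # length is known up front for a column chunk
        for i in 1:n
            out[i] = i
        end
        return out
    end

    # @time collect_push(10^7); @time collect_prealloc(10^7)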

@tanmaykm (Member)

Would it be acceptable to have a set of functions just to read non-nested data?

I guess. But I would try and figure out why we cannot handle nestedness and be performant as well.

@xiaodaigh (Contributor, Author)

xiaodaigh commented May 19, 2020

why we cannot handle nestedness and be performant as well.

I don't see a reason. The issue is that I don't have the knowledge to contribute code that handles nestedness and is also performant. Also, I have no motivation there, since I never use nested columns.

So I only see three options:

  1. we try a PR with just unnested columns (which I think is the majority of use cases out there)
  2. someone other than me writes performant code to handle nested and unnested columns
  3. do nothing

I would prefer 1 to 3, but I can only contribute to 1 to prevent 3, not 2. If 2 can happen, it would be great! But if 2 takes a year, then I would rather see 1 than 3 as the eventuality. I can commit to doing 1.

@tanmaykm (Member)

I don't understand the nestedness, so I actually don't know how to deal with it.

This is basically what the definition levels and repetition levels allow. It is based on Dremel; there should be some material out there if you search for it.
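
For readers landing here, a tiny worked example of what those levels encode (a hypothetical Dremel-style sketch in plain Julia, not Parquet.jl's actual decoder): for a single repeated string column holding the rows ["a", "b"], [], ["c"], the file stores only the defined leaf values plus one (repetition level, definition level) pair per entry, and the nested rows can be reassembled from them.

    # Hypothetical sketch of Dremel-style record assembly for one repeated column.
    values    = ["a", "b", "c"]   # only defined leaf values are stored
    replevels = [0, 1, 0, 0]      # 0 = entry starts a new row, 1 = continues the current list
    deflevels = [1, 1, 0, 1]      # max level (1) = value present, 0 = the row's list is empty

    function assemble(values, replevels, deflevels)
        rows = Vector{Vector{String}}()
        vi = 1
        for (r, d) in zip(replevels, deflevels)
            r == 0 && push!(rows, String[])   # repetition level 0 opens a new record
            if d == 1                         # fully defined => consume the next value
                push!(rows[end], values[vi])
                vi += 1
            end
        end
        return rows
    end

    assemble(values, replevels, deflevels)  # => [["a", "b"], [], ["c"]]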

@xiaodaigh (Contributor, Author)

there should be some material out there if you search for it.

Yeah, the issue is that I am not motivated, since I have never needed it. If you are too busy or also not motivated, do you think we could ask NumFOCUS for funding? I would be happy to read the Dremel paper and implement it for some monetary compensation.

If we get funding, do you want to split it? I'm happy for it to be a solo effort on either side too. Happy either way.

@tanmaykm (Member)

we try a PR with just unnested columns (which I think is the majority of use cases out there)

I am open to this. Maybe we should keep it in a sub module to keep the interface clean. The sub module can implement the simpler non-nested read and write using the other lower level methods.

@xiaodaigh (Contributor, Author)

xiaodaigh commented May 19, 2020

I am open to this. Maybe we should keep it in a sub module to keep the interface clean.

OK. It will be quite a release, with both the writer and column-based reads. Or do you want to finish off the writer PR first?

Update
See WIP PR #79

@xiaodaigh (Contributor, Author)

Maybe we should keep it in a sub module

Any ideas for the sub module name? UnnestedColumnReader? It might be painful when the reader is fixed to deal with nested columns too.

@ym-han

ym-han commented Aug 16, 2021

Just wanted to chime in real quick to say that it would be great if we could improve the read performance! The difference between the Julia Parquet reader and fastparquet (via PyCall) was like night and day for me. But I totally understand if there's no funding, etc.
