Use of sorting_columns parquet metadata #18785

Open
2 tasks done
Thomzoy opened this issue Sep 17, 2024 · 0 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np
import polars as pl

data = dict(col=np.arange(1000))

# LazyFrame
lf = pl.LazyFrame(data)
lf = lf.set_sorted("col")
lf.sink_parquet("data_from_polars.parquet")

# Pyarrow Table
table = pa.Table.from_pydict(data)

# Declare "col" as sorted (ascending by default) in the row group metadata
sorting_columns = [
    pq.SortingColumn(table.column_names.index("col"))
]
pq.write_table(table, "data_from_pyarrow.parquet", sorting_columns=sorting_columns)

Then, checking the Parquet metadata of both files with pyarrow:

pq.ParquetFile("data_from_polars.parquet").metadata.row_group(0).sorting_columns
# Out: ()

pq.ParquetFile("data_from_pyarrow.parquet").metadata.row_group(0).sorting_columns
# Out: (SortingColumn(column_index=0, descending=False, nulls_first=False),)

Log output

No response

Issue description

It seems that when sinking or writing frames to Parquet, Polars does not save the sorting_columns metadata, whereas pyarrow, for example, can.
As a side note, I have trouble figuring out if and when Polars would use this information, whether read directly from the Parquet metadata or set through set_sorted().

Expected behavior

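Polars should write the sorting_columns Parquet metadata when sinking or writing a sorted frame, and read it back when scanning, so that further processing can take advantage of the known sort order.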
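
To make the read side of this concrete, here is a rough sketch of what I currently have to do by hand. It is only a workaround under stated assumptions, not how Polars works internally: `SortKey`, `leading_sort_key` and `scan_parquet_sorted` are hypothetical helper names, the schema is assumed to be flat (so `column_index` maps directly onto a column name), and the file is assumed to be sorted as a whole. The Parquet metadata only describes the order inside each row group, so the last assumption is something the writer has to guarantee.

```python
from typing import NamedTuple, Optional, Sequence


class SortKey(NamedTuple):
    """Hypothetical container for the leading sort column of a file."""
    name: str
    descending: bool


def leading_sort_key(column_names: Sequence[str],
                     per_row_group: Sequence[Sequence]) -> Optional[SortKey]:
    """Return the leading sort key if every row group declares the same one.

    `per_row_group` holds one sequence of SortingColumn-like objects
    (with `column_index` and `descending` attributes) per row group.
    """
    keys = set()
    for sorting_columns in per_row_group:
        if not sorting_columns:
            # At least one row group carries no sort information.
            return None
        first = sorting_columns[0]
        keys.add(SortKey(column_names[first.column_index], bool(first.descending)))
    # Only trust the metadata when all row groups agree.
    return keys.pop() if len(keys) == 1 else None


def scan_parquet_sorted(path: str):
    """Scan `path` with Polars and flag the leading sort column, if declared.

    Assumption: the whole file is sorted, not just each row group.
    """
    # Imported here so the pure helper above works without either library.
    import polars as pl
    import pyarrow.parquet as pq

    metadata = pq.read_metadata(path)
    per_row_group = [
        metadata.row_group(i).sorting_columns
        for i in range(metadata.num_row_groups)
    ]
    key = leading_sort_key(metadata.schema.names, per_row_group)

    lf = pl.scan_parquet(path)
    if key is None:
        return lf
    return lf.set_sorted(key.name, descending=key.descending)
```

With the files from the example above, `scan_parquet_sorted("data_from_pyarrow.parquet")` would flag `col` as sorted, while the file written by Polars would come back unflagged because its row group reports no sorting columns. Only the first sorting column is used, since secondary keys are ordered only within ties of the first one.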

Installed versions

--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            Linux-4.18.0-425.10.1.el8_7.x86_64-x86_64-with-glibc2.35
Python:              3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.5.0
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                2.0.1
openpyxl             <not installed>
pandas               2.2.2
pyarrow              16.1.0
pydantic             2.8.2
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@Thomzoy Thomzoy added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Sep 17, 2024