Schema assumes the column order in the data when reading a CSV #18821

Open
2 tasks done
vmgustavo opened this issue Sep 18, 2024 · 1 comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@vmgustavo

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Input 1

from pathlib import Path
import polars as pl

data = [
    {'b': 1, 'a': 2, 'c': 3},
    {'b': 1, 'a': 2, 'c': 3},
    {'b': 1, 'a': 2, 'c': 3},
]

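outpath = Path('/tmp') / 'test.csv'
pl.DataFrame(data).write_csv(outpath)  # header is written in dict order: b,a,c
pl.read_csv(outpath)  # no schema: columns come back as b, a, c

# Schema keys are listed as a, b, c while the file's columns are b, a, c.
pl.read_csv(outpath, schema={'a': pl.Int16, 'b': pl.Int16, 'c': pl.Int16})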

Output 1

┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i16 ┆ i16 ┆ i16 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 1   ┆ 2   ┆ 3   │
│ 1   ┆ 2   ┆ 3   │
└─────┴─────┴─────┘

Input 2

from pathlib import Path
import polars as pl

data = [
    {'b': 1, 'a': 'asd', 'c': 2},
    {'b': 1, 'a': 'asd', 'c': 2},
    {'b': 1, 'a': 'asd', 'c': 2},
]

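outpath = Path('/tmp') / 'test.csv'
pl.DataFrame(data).write_csv(outpath)  # header is written in dict order: b,a,c
pl.read_csv(outpath)  # no schema: columns come back as b, a, c

# Same key order as before, but the file's second column (a) now holds strings.
pl.read_csv(outpath, schema={'a': pl.String, 'b': pl.Int16, 'c': pl.Int16})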

Output 2

ComputeError: could not parse `asd` as dtype `i16` at column 'b' (column number 2)

The current offset in the file is 8 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `asd` to the `null_values` list.

Original error: ```remaining bytes non-empty```

Log output

No response

Issue description

When reading a CSV file with a schema, Polars assumes that the order of the schema entries matches the order of the columns in the file. If the order differs and a column's values cannot be parsed as the dtype at that position, the read fails with the error shown above. If the values happen to parse, the read succeeds silently and the resulting DataFrame is wrong: in Output 1 the values of the file's column b appear under a, and the values of a appear under b.
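
The two outcomes can be illustrated with a toy model in plain Python. This is only a sketch of positional versus by-name pairing, not Polars' actual reader code; the variable names (`header`, `row`, `schema_names`, `positional`, `by_name`) are made up for illustration.

```python
# Toy model only: this does not call Polars or reproduce its internals.
header = ["b", "a", "c"]          # column order in the CSV file (Input 2)
row = ["1", "asd", "2"]           # first data row, in file order
schema_names = ["a", "b", "c"]    # key order of the schema dict

# Positional pairing: the i-th schema name is attached to the i-th field.
positional = dict(zip(schema_names, row))

# By-name pairing: each schema name looks up the field under the same header name.
fields = dict(zip(header, row))
by_name = {name: fields[name] for name in schema_names}

print(positional)  # {'a': '1', 'b': 'asd', 'c': '2'}
print(by_name)     # {'a': 'asd', 'b': '1', 'c': '2'}
```

Under positional pairing, `asd` lands under `b`, which the schema declares as Int16; that is consistent with the parse error in Output 2.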

Expected behavior

The order of the schema may differ from the order of the columns in the file, so the key order of a schema dict should not affect how the file is read. Schema entries should be matched to columns by name.
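
Until the behavior changes, one possible workaround is to reorder the schema dict to follow the file's header before passing it to the reader. The sketch below uses only the standard library; `align_schema_to_header` is a hypothetical helper, not part of Polars, and it assumes the first line of the file is a header row.

```python
import csv
from pathlib import Path
from typing import Any, Dict, Union


def align_schema_to_header(
    path: Union[str, Path], schema: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of `schema` whose keys follow the CSV header order.

    Hypothetical helper: assumes the first line of the file is the header.
    """
    with open(path, newline="") as f:
        header = next(csv.reader(f), [])

    # Refuse to guess when the schema and the header disagree on column names.
    missing = [name for name in header if name not in schema]
    extra = [name for name in schema if name not in header]
    if missing or extra:
        raise ValueError(
            f"schema/header mismatch: missing={missing}, extra={extra}"
        )

    # Dicts preserve insertion order, so this fixes the key order.
    return {name: schema[name] for name in header}
```

The returned dict could then be passed as `schema=` to `pl.read_csv(outpath, ...)`, so that its key order matches the file's columns (b, a, c in the examples above).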

Installed versions

--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            Linux-6.8.0-44-generic-x86_64-with-glibc2.39
Python:              3.10.15 (main, Sep  9 2024, 17:41:51) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.2
pyarrow              12.0.1
pydantic             2.9.1
pyiceberg            <not installed>
sqlalchemy           1.4.54
torch                <not installed>
xlsx2csv             0.8.3
xlsxwriter           <not installed>
@cmdlineluser
Contributor

There is a pending PR to update the docs.

  1. docs(python): Clarify documentation for schema in read_csv function #18759

There is a general issue for schema= rules.

  1. Consistency around the behavior of the schema argument across the API #11723 (comment)
