Schema assumes the column order in the data when reading a CSV #18821

vmgustavo · 2024-09-18T19:54:42Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Input 1

from pathlib import Path
import polars as pl

data = [
    {'b': 1, 'a': 2, 'c': 3},
    {'b': 1, 'a': 2, 'c': 3},
    {'b': 1, 'a': 2, 'c': 3},
]

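outpath = Path('/tmp') / 'test.csv'
pl.DataFrame(data).write_csv(outpath)  # header on disk: b,a,c
pl.read_csv(outpath)  # no schema: columns come back as b, a, c

# schema keys are ordered a, b, c while the file is ordered b, a, c
pl.read_csv(outpath, schema={'a': pl.Int16, 'b': pl.Int16, 'c': pl.Int16})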

Output 1

┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i16 ┆ i16 ┆ i16 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 1   ┆ 2   ┆ 3   │
│ 1   ┆ 2   ┆ 3   │
└─────┴─────┴─────┘

Input 2

from pathlib import Path
import polars as pl

data = [
    {'b': 1, 'a': 'asd', 'c': 2},
    {'b': 1, 'a': 'asd', 'c': 2},
    {'b': 1, 'a': 'asd', 'c': 2},
]

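outpath = Path('/tmp') / 'test.csv'
pl.DataFrame(data).write_csv(outpath)  # header on disk: b,a,c
pl.read_csv(outpath)  # no schema: reads fine, 'a' is inferred as a string

# schema keys are ordered a, b, c while the file is ordered b, a, c
pl.read_csv(outpath, schema={'a': pl.String, 'b': pl.Int16, 'c': pl.Int16})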

Output 2

ComputeError: could not parse `asd` as dtype `i16` at column 'b' (column number 2)

The current offset in the file is 8 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `asd` to the `null_values` list.

Original error: ```remaining bytes non-empty```

Log output

No response

Issue description

When reading a CSV file with a schema, Polars assumes that the order of the schema matches the order of the columns in the file. If the order differs and a column's values cannot be parsed as the dtype at that position, the read fails with the error shown in Output 2. If the values happen to parse, the read succeeds silently and the resulting DataFrame is wrong: in Output 1 the schema's names are applied by position, so the values of the file's column b (all 1) end up under the name a.

Expected behavior

The keys of a schema dict may be in a different order than the columns in the file, so the order of the dict should not affect how the file is read. Dtypes should be matched to columns by name.
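Until the positional behavior changes, one possible workaround is to reorder the schema to follow the file's header before passing it to read_csv. The sketch below is only an illustration: `schema_in_file_order` is a hypothetical helper, not part of Polars, and it assumes the first row of the file is the header and that the schema has an entry for every column.

```python
import csv


def schema_in_file_order(path, schema):
    """Reorder a name -> dtype mapping to follow the CSV header at `path`.

    Hypothetical helper, not part of Polars. Assumes the first row of the
    file is the header and that every header name has an entry in `schema`.
    """
    with open(path, newline="") as fh:
        header = next(csv.reader(fh))
    missing = [name for name in header if name not in schema]
    if missing:
        raise KeyError(f"schema has no dtype for columns: {missing}")
    # Dicts preserve insertion order, so building the dict by walking the
    # header yields a schema whose positions line up with the file.
    return {name: schema[name] for name in header}
```

With this helper, the call from Input 2 could be written as `pl.read_csv(outpath, schema=schema_in_file_order(outpath, {'a': pl.String, 'b': pl.Int16, 'c': pl.Int16}))`, so that the String dtype lands on the file's second column rather than its first.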

Installed versions

--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            Linux-6.8.0-44-generic-x86_64-with-glibc2.39
Python:              3.10.15 (main, Sep  9 2024, 17:41:51) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.2
pyarrow              12.0.1
pydantic             2.9.1
pyiceberg            <not installed>
sqlalchemy           1.4.54
torch                <not installed>
xlsx2csv             0.8.3
xlsxwriter           <not installed>

cmdlineluser · 2024-09-19T08:57:45Z

There is a pending PR to update the docs.

docs(python): Clarify documentation for schema in read_csv function #18759

There is a general issue for schema= rules.

Consistency around the behavior of the schema argument across the API #11723 (comment)

vmgustavo added labels on Sep 18, 2024: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)
