read_excel plus "calamine" engine issues when loading Excel data with some empty values #14174

adrivn · 2024-02-01T10:02:57Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Load the attached Excel data sample file, containing sparse data.

import polars as pl

pl.read_excel(
    "sample_data_blanks_instead_of_nulls.xlsx",
    engine="calamine",
)

Log output

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[17], line 1
----> 1 pl.read_excel(
      2     "C:/Users/adrivn/Desktop/Unorganized Stuff/sample_data_blanks_instead_of_nulls.xlsx",
      3     engine="calamine",
      4 )

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\utils\deprecation.py:133, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    128 @wraps(function)
    129 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    130     _rename_keyword_argument(
    131         old_name, new_name, kwargs, function.__name__, version
    132     )
--> 133     return function(*args, **kwargs)

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\io\spreadsheet\functions.py:249, in read_excel(source, sheet_id, sheet_name, engine, engine_options, read_csv_options, schema_overrides, raise_if_empty)
    246     msg = f"cannot specify `read_csv_options` when engine={engine!r}"
    247     raise ValueError(msg)
--> 249 return _read_spreadsheet(
    250     sheet_id,
    251     sheet_name,
    252     source=source,
    253     engine=engine,
    254     engine_options=engine_options,
    255     read_csv_options=read_csv_options,
    256     schema_overrides=schema_overrides,
    257     raise_if_empty=raise_if_empty,
    258 )

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\io\spreadsheet\functions.py:428, in _read_spreadsheet(sheet_id, sheet_name, source, engine, engine_options, read_csv_options, schema_overrides, raise_if_empty)
    425 try:
    426     # parse data from the indicated sheet(s)
    427     sheet_names, return_multi = _get_sheet_names(sheet_id, sheet_name, worksheets)
--> 428     parsed_sheets = {
    429         name: reader_fn(
    430             parser=parser,
    431             sheet_name=name,
    432             read_csv_options=read_csv_options,
    433             schema_overrides=schema_overrides,
    434             raise_if_empty=raise_if_empty,
    435         )
    436         for name in sheet_names
    437     }
    438 finally:
    439     if hasattr(parser, "close"):

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\io\spreadsheet\functions.py:429, in <dictcomp>(.0)
    425 try:
    426     # parse data from the indicated sheet(s)
    427     sheet_names, return_multi = _get_sheet_names(sheet_id, sheet_name, worksheets)
    428     parsed_sheets = {
--> 429         name: reader_fn(
    430             parser=parser,
    431             sheet_name=name,
    432             read_csv_options=read_csv_options,
    433             schema_overrides=schema_overrides,
    434             raise_if_empty=raise_if_empty,
    435         )
    436         for name in sheet_names
    437     }
    438 finally:
    439     if hasattr(parser, "close"):

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\io\spreadsheet\functions.py:784, in _read_spreadsheet_calamine(parser, sheet_name, read_csv_options, schema_overrides, raise_if_empty)
    781         type_checks.append(check_cast)
    783 if type_checks:
--> 784     apply_downcast = df.select([d[0] for d in type_checks]).row(0)
    786     # do a similar check for datetime columns that have only 00:00:00 times.
    787     if downcast := [
    788         cast for apply, (_, cast) in zip(apply_downcast, type_checks) if apply
    789     ]:

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\dataframe\frame.py:8144, in DataFrame.select(self, *exprs, **named_exprs)
   8044 def select(
   8045     self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
   8046 ) -> DataFrame:
   8047     """
   8048     Select columns from this DataFrame.
   8049 
   (...)
   8142     └───────────┘
   8143     """
-> 8144     return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\lazyframe\frame.py:1940, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager)
   1937 if background:
   1938     return InProcessQuery(ldf.collect_concurrently())
-> 1940 return wrap_df(ldf.collect())

ComputeError: Series length 91 doesn't match the DataFrame height of 287

Issue description

sample_data_blanks_instead_of_nulls.xlsx
sample_data_nulls.xlsx

For simplicity of reproduction, the attached Excel spreadsheets each contain 12 columns (A:L) and 287 data rows (288 including the header). In one file, empty values are marked with the placeholder string "NULL"; in the other, the cells are left blank. The data includes integer, string, float, and timestamp-formatted date columns. Some columns have every row populated; others are sparse (55/287, 38/287, etc.).

Loading the data using read_excel with the new engine="calamine" integration raises ComputeError: Series length <# of rows without empty values> doesn't match the DataFrame height of <# total rows in the Excel spreadsheet>

I have tested the same files using the openpyxl and default xlsx2csv engines, and both read the data correctly.
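Since the other engines read the same file, a temporary workaround is to fall back to one of them when the calamine engine fails. The following is a minimal sketch, not part of Polars: the helper name read_with_engine_fallback is hypothetical, and the reader is passed in as a callable (for example pl.read_excel) so the helper itself has no hard dependency on Polars. Catching every Exception is a deliberate simplification; narrowing it to the ComputeError shown above would be stricter.

```python
from typing import Any, Callable, Sequence, Tuple


def read_with_engine_fallback(
    path: str,
    reader: Callable[..., Any],
    engines: Sequence[str] = ("calamine", "openpyxl"),
) -> Tuple[str, Any]:
    """Try each engine in order and return (engine_name, result) for the first that works.

    `reader` is expected to accept `reader(path, engine=...)`, as pl.read_excel does.
    """
    errors = []
    for engine in engines:
        try:
            return engine, reader(path, engine=engine)
        except Exception as exc:  # broad on purpose: this is only a workaround sketch
            errors.append(f"{engine}: {exc}")
    # Every engine failed, so surface all collected messages at once.
    raise RuntimeError("all engines failed: " + "; ".join(errors))
```

With Polars installed, it could be called as read_with_engine_fallback("sample_data_blanks_instead_of_nulls.xlsx", pl.read_excel); the returned engine name shows which engine actually produced the frame.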

Expected behavior

The data should load correctly into memory as a DataFrame, with the datatypes inferred as they are when using the openpyxl engine.

shape: (287, 12)

ID	ALMOST_INT	NEARLY_STR	FULL_STR	FEW_STR	FEWER_STR	SINGLE_STR	ALMOST_FLOAT	HALF_INT	HALF_DATE	FEW_INT	FEW_DATE
i64	i64	str	str	str	str	null	f64	i64	datetime[μs]	i64	datetime[μs]
1	1	"AAA"	"BBB"	"CCC"	null	null	null	41	2024-02-01 05:20:37.525	null	null
2	null	null	"BBB"	null	null	null	39.6	null	null	null	null
3	null	null	"BBB"	null	null	null	55.2	null	null	null	null
4	1	null	"BBB"	null	null	null	44.4	null	null	null	null
5	1	"AAA"	"BBB"	null	null	null	19.2	null	null	null	null
6	1	"AAA"	"BBB"	"CCC"	"DDD"	null	67.2	null	null	null	null
7	null	null	"BBB"	null	null	null	70.8	null	null	null	null
8	null	null	"BBB"	null	null	null	56.4	null	null	null	null
9	null	null	"BBB"	null	null	null	55.2	null	null	null	null
10	1	"AAA"	"BBB"	"CCC"	null	null	50.4	null	null	null	null
11	null	null	"BBB"	"CCC"	"DDD"	null	3.6	44	2024-02-01 05:20:37.525	null	null
12	null	null	"BBB"	null	null	null	39.6	null	null	null	null
…	…	…	…	…	…	…	…	…	…	…	…
276	1	"AAA"	"BBB"	null	null	null	31.2	null	null	null	null
277	1	"AAA"	"BBB"	null	null	null	40.8	null	null	null	null
278	1	"AAA"	"BBB"	null	null	null	31.2	null	null	null	null
279	1	"AAA"	"BBB"	null	null	null	33.6	null	null	null	null
280	1	"AAA"	"BBB"	null	null	null	52.8	null	null	null	null
281	1	"AAA"	"BBB"	null	null	null	42.0	null	null	null	null
282	1	"AAA"	"BBB"	null	null	null	18.0	25	2024-02-01 05:20:37.525	null	null
283	1	"AAA"	"BBB"	null	null	null	26.4	null	null	null	null
284	1	"AAA"	"BBB"	null	null	null	52.8	null	null	null	null
285	1	"AAA"	"BBB"	"CCC"	"DDD"	null	10.8	28	2024-02-01 05:20:37.525	null	null
286	1	"AAA"	"BBB"	null	null	null	40.8	null	null	null	null
287	1	"AAA"	"BBB"	null	null	null	4.8	null	null	null	null

Installed versions

--------Version info---------
Polars:               0.20.6
Index type:           UInt32
Platform:             Windows-10-10.0.19044-SP0
Python:               3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               1.0.10
sqlalchemy:           2.0.23
xlsx2csv:             0.8.2
xlsxwriter:           3.1.9


deanm0000 · 2024-02-02T00:18:11Z

That engine uses the package fastexcel. I think ToucanToco/fastexcel#164 will fix it.

deanm0000 · 2024-02-02T00:24:59Z

If it still doesn't work after they release their next version with that merged, try it directly with their library. If that doesn't work either, post an issue on their board. If it works directly with fastexcel but not with Polars, post again here.
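Trying the file directly with fastexcel could look roughly like the sketch below. The helper name load_first_sheet_directly is hypothetical, and the calls used (fastexcel.read_excel, sheet_names, load_sheet_by_name, to_polars) are assumed to match the fastexcel release installed; newer releases may prefer a different loading method, so check the fastexcel documentation for your version.

```python
def load_first_sheet_directly(path: str):
    """Read the first worksheet with fastexcel alone, bypassing pl.read_excel."""
    # Imported lazily so the helper can be defined without fastexcel installed.
    import fastexcel

    reader = fastexcel.read_excel(path)
    first_sheet = reader.sheet_names[0]
    # Assumption: load_sheet_by_name is available in the installed fastexcel version.
    sheet = reader.load_sheet_by_name(first_sheet)
    # Convert to a Polars DataFrame so the result is comparable with pl.read_excel.
    return sheet.to_polars()
```

If load_first_sheet_directly("sample_data_blanks_instead_of_nulls.xlsx") returns a frame with 287 rows while pl.read_excel(..., engine="calamine") still raises, the problem is more likely on the Polars side; if the direct call fails too, the report belongs on the fastexcel tracker.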

adrivn added the bug, needs triage, and python labels · Feb 1, 2024

deanm0000 closed this as completed Feb 2, 2024
