Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_excel plus "calamine" engine issues when loading Excel data with some empty values #14174

Closed
2 tasks done
adrivn opened this issue Feb 1, 2024 · 2 comments
Closed
2 tasks done
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@adrivn
Copy link

adrivn commented Feb 1, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Load the attached Excel data sample file, containing sparse data.

import polars as pl

pl.read_excel(
    "sample_data_blanks_instead_of_nulls.xlsx",
    engine="calamine",
)

Log output

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[17], line 1
----> 1 pl.read_excel(
      2     "C:/Users/adrivn/Desktop/Unorganized Stuff/sample_data_blanks_instead_of_nulls.xlsx",
      3     engine="calamine",
      4 )

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\utils\deprecation.py:133, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    128 @wraps(function)
    129 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    130     _rename_keyword_argument(
    131         old_name, new_name, kwargs, function.__name__, version
    132     )
--> 133     return function(*args, **kwargs)

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\io\spreadsheet\functions.py:249, in read_excel(source, sheet_id, sheet_name, engine, engine_options, read_csv_options, schema_overrides, raise_if_empty)
    246     msg = f"cannot specify `read_csv_options` when engine={engine!r}"
    247     raise ValueError(msg)
--> 249 return _read_spreadsheet(
    250     sheet_id,
    251     sheet_name,
    252     source=source,
    253     engine=engine,
    254     engine_options=engine_options,
    255     read_csv_options=read_csv_options,
    256     schema_overrides=schema_overrides,
    257     raise_if_empty=raise_if_empty,
    258 )

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\io\spreadsheet\functions.py:428, in _read_spreadsheet(sheet_id, sheet_name, source, engine, engine_options, read_csv_options, schema_overrides, raise_if_empty)
    425 try:
    426     # parse data from the indicated sheet(s)
    427     sheet_names, return_multi = _get_sheet_names(sheet_id, sheet_name, worksheets)
--> 428     parsed_sheets = {
    429         name: reader_fn(
    430             parser=parser,
    431             sheet_name=name,
    432             read_csv_options=read_csv_options,
    433             schema_overrides=schema_overrides,
    434             raise_if_empty=raise_if_empty,
    435         )
    436         for name in sheet_names
    437     }
    438 finally:
    439     if hasattr(parser, "close"):

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\io\spreadsheet\functions.py:429, in <dictcomp>(.0)
    425 try:
    426     # parse data from the indicated sheet(s)
    427     sheet_names, return_multi = _get_sheet_names(sheet_id, sheet_name, worksheets)
    428     parsed_sheets = {
--> 429         name: reader_fn(
    430             parser=parser,
    431             sheet_name=name,
    432             read_csv_options=read_csv_options,
    433             schema_overrides=schema_overrides,
    434             raise_if_empty=raise_if_empty,
    435         )
    436         for name in sheet_names
    437     }
    438 finally:
    439     if hasattr(parser, "close"):

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\io\spreadsheet\functions.py:784, in _read_spreadsheet_calamine(parser, sheet_name, read_csv_options, schema_overrides, raise_if_empty)
    781         type_checks.append(check_cast)
    783 if type_checks:
--> 784     apply_downcast = df.select([d[0] for d in type_checks]).row(0)
    786     # do a similar check for datetime columns that have only 00:00:00 times.
    787     if downcast := [
    788         cast for apply, (_, cast) in zip(apply_downcast, type_checks) if apply
    789     ]:

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\dataframe\frame.py:8144, in DataFrame.select(self, *exprs, **named_exprs)
   8044 def select(
   8045     self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
   8046 ) -> DataFrame:
   8047     """
   8048     Select columns from this DataFrame.
   8049 
   (...)
   8142     └───────────┘
   8143     """
-> 8144     return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)

File c:\Users\adrivn\envs\main\Lib\site-packages\polars\lazyframe\frame.py:1940, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager)
   1937 if background:
   1938     return InProcessQuery(ldf.collect_concurrently())
-> 1940 return wrap_df(ldf.collect())

ComputeError: Series length 91 doesn't match the DataFrame height of 287

Issue description

sample_data_blanks_instead_of_nulls.xlsx
sample_data_nulls.xlsx

The attached Excel spreadsheets contain, for simplicity of reproduction, a total of 12 columns (A:L) with a row count of 287 (288 if you include header). One file has null/empty values with the "NULL" string placeholder, the other does not (empty/blank cell value instead). The data has integer columns, strings, float/double/numeric, and dates in timestamp format. Some of the columns have every row populated, some do not (55/287, 38/287, etc.)

Upon loading the data using read_excel with the new engine=calamine integration, it results in a ComputeError: Series lenght <# of rows without empty values> doesn't match the DataFrame height of <# total rows in the Excel spreadsheet>

I have tested this same behavior using the openpyxl and default xlsx_to_csv engines and the data can and is read correctly.

Expected behavior

The data should be loaded correctly into memory as a DataFrame, and the datatypes inferred as it happens when using the openpyxl engine.

shape: (287, 12)

IDALMOST_INTNEARLY_STRFULL_STRFEW_STRFEWER_STRSINGLE_STRALMOST_FLOATHALF_INTHALF_DATEFEW_INTFEW_DATE
i64i64strstrstrstrnullf64i64datetime[μs]i64datetime[μs]
11"AAA""BBB""CCC"nullnullnull412024-02-01 05:20:37.525nullnull
2nullnull"BBB"nullnullnull39.6nullnullnullnull
3nullnull"BBB"nullnullnull55.2nullnullnullnull
41null"BBB"nullnullnull44.4nullnullnullnull
51"AAA""BBB"nullnullnull19.2nullnullnullnull
61"AAA""BBB""CCC""DDD"null67.2nullnullnullnull
7nullnull"BBB"nullnullnull70.8nullnullnullnull
8nullnull"BBB"nullnullnull56.4nullnullnullnull
9nullnull"BBB"nullnullnull55.2nullnullnullnull
101"AAA""BBB""CCC"nullnull50.4nullnullnullnull
11nullnull"BBB""CCC""DDD"null3.6442024-02-01 05:20:37.525nullnull
12nullnull"BBB"nullnullnull39.6nullnullnullnull
2761"AAA""BBB"nullnullnull31.2nullnullnullnull
2771"AAA""BBB"nullnullnull40.8nullnullnullnull
2781"AAA""BBB"nullnullnull31.2nullnullnullnull
2791"AAA""BBB"nullnullnull33.6nullnullnullnull
2801"AAA""BBB"nullnullnull52.8nullnullnullnull
2811"AAA""BBB"nullnullnull42.0nullnullnullnull
2821"AAA""BBB"nullnullnull18.0252024-02-01 05:20:37.525nullnull
2831"AAA""BBB"nullnullnull26.4nullnullnullnull
2841"AAA""BBB"nullnullnull52.8nullnullnullnull
2851"AAA""BBB""CCC""DDD"null10.8282024-02-01 05:20:37.525nullnull
2861"AAA""BBB"nullnullnull40.8nullnullnullnull
2871"AAA""BBB"nullnullnull4.8nullnullnullnull

Installed versions

--------Version info---------
Polars:               0.20.6
Index type:           UInt32
Platform:             Windows-10-10.0.19044-SP0
Python:               3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               1.0.10
sqlalchemy:           2.0.23
xlsx2csv:             0.8.2
xlsxwriter:           3.1.9

@adrivn adrivn added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 1, 2024
@deanm0000
Copy link
Collaborator

That engine uses the package fastexcel. I think ToucanToco/fastexcel#164 will fix it.

@deanm0000
Copy link
Collaborator

If, after they get their next version out with that merged, it still doesn't work then give it a try directly with their library. If that doesn't work post an issue on their board. If it works directly with them but not polars then post again here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

2 participants