Panic during broadcasting struct series containing datetime type #19277

Closed
2 tasks done
thomasaarholt opened this issue Oct 17, 2024 · 2 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

@thomasaarholt
Contributor

thomasaarholt commented Oct 17, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from datetime import datetime

s_foo = pl.Series("foo", [1,2,]) # height 2
s_bar = pl.Series("bar", [{"datetime": datetime(2024, 1, 1)}]) # struct, height 1

# this worked in polars 1.6:
pl.DataFrame().with_columns([s_foo, s_bar])
# shape: (2, 2)
# ┌─────┬───────────────────────┐
# │ foo ┆ bar                   │
# │ --- ┆ ---                   │
# │ i64 ┆ struct[1]             │
# ╞═════╪═══════════════════════╡
# │ 1   ┆ {2024-01-01 00:00:00} │
# │ 2   ┆ {2024-01-01 00:00:00} │
# └─────┴───────────────────────┘

# error with polars 1.7+
thread '<unnamed>' panicked at crates/polars-core/src/scalar/mod.rs:46:92:
called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("unexpected value while building Series of type Datetime(Microseconds, None); found value of type Datetime(Microseconds, None): 2024-01-01 00:00:00"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-6-47202b4391ba> in ?()
----> 1 pl.DataFrame().with_columns([s_foo, s_bar2])

~/repos/patito/.venv/lib/python3.12/site-packages/decorator.py in ?(*args, **kw)
    229         def fun(*args, **kw):
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)

~/repos/patito/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py in ?(self)
   1177     def __repr__(self) -> str:
-> 1178         return self.__str__()

~/repos/patito/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py in ?(self)
   1174     def __str__(self) -> str:
-> 1175         return self._df.as_str()

PanicException: called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("unexpected value while building Series of type Datetime(Microseconds, None); found value of type Datetime(Microseconds, None): 2024-01-01 00:00:00"))

Issue description

We discovered this one at patito, which is a dataframe validation library for polars. We have a .examples() method, which constructs single- or multiple-row example series for a given schema; these are then concatenated using with_columns.

Broadcasting a series of length 1 works fine with ints, datetimes, etc., but not with structs containing datetimes, as in the repro above.
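For reference, the broadcasting behaviour expected here can be sketched in plain Python. This is only a simplified conceptual model, not Polars' implementation; the `broadcast_columns` helper and its error message are hypothetical names made up for illustration.

```python
from datetime import datetime


def broadcast_columns(columns: dict[str, list]) -> dict[str, list]:
    """Simplified model: repeat length-1 columns up to the common height.

    Hypothetical helper for illustration only; not how Polars does it internally.
    """
    # The target height is the longest column (0 if there are no columns).
    height = max((len(values) for values in columns.values()), default=0)
    out: dict[str, list] = {}
    for name, values in columns.items():
        if len(values) == height:
            out[name] = list(values)
        elif len(values) == 1:
            # A length-1 column is repeated, whatever its element type is.
            out[name] = values * height
        else:
            raise ValueError(
                f"column {name!r} has length {len(values)}, expected 1 or {height}"
            )
    return out


result = broadcast_columns(
    {
        "foo": [1, 2],  # height 2
        "bar": [{"datetime": datetime(2024, 1, 1)}],  # struct-like, height 1
    }
)
```

In this model the element type plays no role, which is why the struct-with-datetime case would be expected to behave like the int and float cases.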

Using other types works fine:

s_foo = pl.Series("foo", [1,2,])
s_bar = pl.Series("bar", [datetime(2024, 1, 1)])
s_baz = pl.Series("baz", [2.0])
pl.DataFrame().with_columns([s_foo, s_bar, s_baz])
# shape: (2, 3)
# ┌─────┬─────────────────────┬─────┐
# │ foo ┆ bar                 ┆ baz │
# │ --- ┆ ---                 ┆ --- │
# │ i64 ┆ datetime[μs]        ┆ f64 │
# ╞═════╪═════════════════════╪═════╡
# │ 1   ┆ 2024-01-01 00:00:00 ┆ 2.0 │
# │ 2   ┆ 2024-01-01 00:00:00 ┆ 2.0 │
# └─────┴─────────────────────┴─────┘

Using a struct of ints and floats works fine:

s_foo = pl.Series("foo", [1,2,])
s_bar = pl.Series("bar", [{"a":1, "b":2.0}])
pl.DataFrame().with_columns([s_foo, s_bar])
# shape: (2, 2)
# ┌─────┬───────────┐
# │ foo ┆ bar       │
# │ --- ┆ ---       │
# │ i64 ┆ struct[2] │
# ╞═════╪═══════════╡
# │ 1   ┆ {1,2.0}   │
# │ 2   ┆ {1,2.0}   │
# └─────┴───────────┘

Broadcasting structs using with_columns used to work in polars 1.6. For the MWE above, polars 1.7.1 raises the following error:

---------------------------------------------------------------------------
InvalidOperationError                     Traceback (most recent call last)
Cell In[1], line 8
      5 s_bar = pl.Series("bar", [{"datetime": datetime(2024, 1, 1)}]) # struct, height 1
      7 # this worked in polars 1.6:
----> 8 pl.DataFrame().with_columns([s_foo, s_bar])

File ~/codes/polars-struct/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py:9141, in DataFrame.with_columns(self, *exprs, **named_exprs)
   8995 def with_columns(
   8996     self,
   8997     *exprs: IntoExpr | Iterable[IntoExpr],
   8998     **named_exprs: IntoExpr,
   8999 ) -> DataFrame:
   9000     """
   9001     Add columns to this DataFrame.
   9002
   (...)
   9139     └─────┴──────┴─────────────┘
   9140     """
-> 9141     return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)

File ~/codes/polars-struct/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2032, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2030 # Only for testing purposes
   2031 callback = _kwargs.get("post_opt_callback", callback)
-> 2032 return wrap_df(ldf.collect(callback))

InvalidOperationError: Series bar, length 1 doesn't match the DataFrame height of 0

If you want this Series to be broadcasted, ensure it is a scalar (for instance by adding '.first()').

Expected behavior

I'd expect the old behaviour, as shown in the commented-out output in the MWE.

Installed versions

pl.show_versions()
--------Version info---------
Polars:              1.9.0
Index type:          UInt32
Platform:            macOS-15.0.1-arm64-arm-64bit
Python:              3.12.4 (main, Jul 25 2024, 22:11:22) [Clang 18.1.8 ]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         1.6.0
numpy                2.1.2
openpyxl             <not installed>
pandas               2.2.3
pyarrow              17.0.0
pydantic             2.9.2
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@thomasaarholt thomasaarholt added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Oct 17, 2024
@cmdlineluser
Contributor

This does run again on main:

>>> pl.DataFrame().with_columns([s_foo, s_bar])
shape: (2, 2)
┌─────┬───────────────────────┐
│ foo ┆ bar                   │
│ --- ┆ ---                   │
│ i64 ┆ struct[1]             │
╞═════╪═══════════════════════╡
│ 1   ┆ {2024-01-01 00:00:00} │
│ 2   ┆ {2024-01-01 00:00:00} │
└─────┴───────────────────────┘

It seems it was fixed by #19148

@thomasaarholt
Contributor Author

Ah! That’s the best case! Thanks!
