Series length 1 doesn't match DataFrame height 3 in select()
#18896
Comments
Can you show the query? You might need to add |
The query is not trivial as it involves multiple functions, but the dumped JSON for v1.7.0 is:
import io
import polars as pl
plan = r'''{"IR":{"version":1428,"dsl":{"MapFunction":{"input":{"HStack":{"input":{"IR":{"version":1351,"dsl":{"HStack":{"input":{"MapFunction":{"input":{"Select":{"expr":[{"Column":"__timestamp"},{"Function":{"input":[{"Function":{"input":[{"Column":"line"},{"Literal":"Null"}],"function":{"StringExpr":"StripChars"},"options":{"collect_groups":"ElementWise","fmt_str":"","check_lengths":true,"flags":"ALLOW_GROUP_AWARE"}}}],"function":{"StringExpr":{"ExtractGroups":{"dtype":{"Struct":[{"name":"data","dtype":"String"},{"name":"event","dtype":"String"}]},"pat":"(?:[\\w@]+):? *(?:data=(?P<data>.+?)(?: +|$)|event=(?P<event>.+?)(?: +|$)|\\w+=\\S+?(?: +|$))+"}}},"options":{"collect_groups":"ElementWise","fmt_str":"","check_lengths":true,"flags":"ALLOW_GROUP_AWARE"}}}],"input":{"Filter":{"input":{"MapFunction":{"input":{"Select":{"expr":[{"Column":"__event"},{"Column":"__fields"},{"Column":"__timestamp"},{"Column":"line"}],"input":{"IR":{"version":1269,"dsl":{"HStack":{"input":{"Select":{"expr":[{"Column":"__timestamp"},{"Column":"__event"},{"Column":"__fields"},{"Column":"line"}],"input":{"HStack":{"input":{"DataFrameScan":{"df":{"columns":[{"name":"__event","datatype":"Binary","bit_settings":"","values":[[114,116,97,112,112,95,109,97,105,110],[114,116,97,112,112,95,109,97,105,110],[114,116,97,112,112,95,109,97,105,110]]},{"name":"__fields","datatype":"Binary","bit_settings":"","values":[[101,118,101,110,116,61,115,116,97,114,116],[101,118,101,110,116,61,99,108,111,99,107,95,114,101,102,32,100,97,116,97,61,52,55,49,51,50,52,56,54,48],[101,118,101,110,116,61,101,110,100]]},{"name":"__timestamp","datatype":"Int64","bit_settings":"","values":[471410977940,471412970020,472920141960]},{"name":"line","datatype":"Binary","bit_settings":"","values":[[114,116,97,112,112,95,109,97,105,110,58,32,101,118,101,110,116,61,115,116,97,114,116,10],[114,116,97,112,112,95,109,97,105,110,58,32,101,118,101,110,116,61,99,108,111,99,107,95,114,101,102,32,100,97,116,97,61,52,55,49,51,50,52,56,54,48
,10],[114,116,97,112,112,95,109,97,105,110,58,32,101,118,101,110,116,61,101,110,100,10]]}]},"schema":{"fields":{"__event":"Binary","__fields":"Binary","__timestamp":"Int64","line":"Binary"}},"output_schema":null,"filter":null}},"exprs":[{"Cast":{"expr":{"DtypeColumn":["Binary"]},"dtype":"String","options":"Strict"}}],"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}},"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}},"exprs":[{"Cast":{"expr":{"Column":"__event"},"dtype":{"Categorical":[null,"Physical"]},"options":"Strict"}}],"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}}}},"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}},"function":{"Drop":{"to_drop":[{"Root":{"Column":"__fields"}}],"strict":false}}}},"predicate":{"BinaryExpr":{"left":{"Column":"__event"},"op":"Eq","right":{"Literal":{"String":"rtapp_main"}}}}}},"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}},"function":{"Unnest":[{"Root":{"Column":"line"}}]}}},"exprs":[{"Function":{"input":[{"DtypeColumn":["String"]},{"Literal":"Null"}],"function":{"StringExpr":"StripChars"},"options":{"collect_groups":"ElementWise","fmt_str":"","check_lengths":true,"flags":"ALLOW_GROUP_AWARE"}}}],"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}}}},"exprs":[{"Cast":{"expr":{"Column":"__timestamp"},"dtype":"Int64","options":"Strict"}},{"Cast":{"expr":{"Column":"data"},"dtype":{"Categorical":[null,"Physical"]},"options":"Strict"}},{"Cast":{"expr":{"Column":"event"},"dtype":{"Categorical":[null,"Physical"]},"options":"Strict"}}],"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}},"function":{"Rename":{"existing":["__timestamp"],"new":["Time"]}}}}}}'''
plan = io.StringIO(plan)
df = pl.LazyFrame.deserialize(plan, format='json')
print(df)
print(df.collect())
The code leading to this lazyframe is either:
EDIT: the LazyFrame pretty printed is:
EDIT 2: I'm rebuilding to get an up-to-date JSON for v1.8. |
Part of the problem here is that the error happens very late, when collecting, at which point location information is completely lost. Is there a way to make polars validate at every step, as some kind of debug mode? EDIT: the other part of the problem is that the issue is not reproducible locally, but happens 100% of the time when building our documentation in the readthedocs.org runner. |
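As a rough illustration of the "validate at every step" idea raised above, here is a minimal sketch of a user-side wrapper. It is not a Polars feature: `run_with_validation` and `StepError` are hypothetical names, and the sketch assumes each pipeline step is a callable that takes and returns the same kind of object (for example a LazyFrame) and that `validate` is anything that raises on a bad intermediate result.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


class StepError(RuntimeError):
    """Raised when validation fails right after a named pipeline step."""


def run_with_validation(
    initial: T,
    steps: Iterable[Tuple[str, Callable[[T], T]]],
    validate: Callable[[T], object],
) -> T:
    """Apply each (name, step) in order and validate after every step.

    The first validation failure is re-raised with the name of the step
    that produced the bad intermediate value, so the location is not lost.
    """
    current = initial
    for name, step in steps:
        current = step(current)
        try:
            validate(current)
        except Exception as exc:
            raise StepError(
                f"validation failed after step {name!r}: {exc}"
            ) from exc
    return current
```

With a LazyFrame pipeline, `validate` could plausibly be `lambda lf: lf.collect()`, at the cost of re-running the plan after every step, so this would only make sense as a temporary debugging aid.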
Can you serialize the result before running? I cannot reproduce this locally; if I can reproduce it, I will know what it can be. The formatted query plan doesn't seem to have any scalar misuse, and that's where the error comes from: Polars checks at runtime whether the literal is allowed to be broadcast, which is only allowed if the literal is a scalar. |
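The broadcasting rule described above can be pictured with a small pure-Python model. This is only a toy sketch of the rule as stated in the comment, not Polars' actual implementation: the `Column` class, its `is_scalar` flag and `broadcast_to_height` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Column:
    """Toy stand-in for a Series (hypothetical, not a Polars type)."""

    name: str
    values: List[object]
    is_scalar: bool = False  # True only for values known to be scalar literals


def broadcast_to_height(col: Column, height: int) -> List[object]:
    """Return the column's values, stretched to `height` rows if allowed."""
    if len(col.values) == height:
        # Lengths already agree: nothing to broadcast.
        return list(col.values)
    if col.is_scalar and len(col.values) == 1:
        # Only a true scalar may be repeated to fill the frame.
        return list(col.values) * height
    # A length-1 column that is not a scalar is rejected, which is the
    # situation the error message in this issue describes.
    raise ValueError(
        f"Series length {len(col.values)} doesn't match "
        f"DataFrame height {height}"
    )
```

In this model, a length-1 column passes only if it is flagged as a scalar, which mirrors the hint in the panic message about adding `.first()` to make a Series a scalar.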
If you mean serialize to JSON before calling |
Could you compile from source with #18904? That will try to print the expression that is at fault. |
I'll give it a go but this will probably take a while. Are the compiled binaries statically linked and fully portable? I have no idea what libc is used on that runner, and compiling in situ is impossible because of build timeouts. |
Ok, will patch tonight with an option to temporarily silence this error. Note that it will become a hard error in the future, but hopefully with the new better error message we can find the culprit. |
You can then silence it by setting |
I tried to build a wheel and commit it to a branch to try it out:
The DSO shipping in the .whl file seems to only depend on glibc, but I haven't tried to check what runs in the CI runner.
And then it failed to install in the readthedocs runner for some unknown reason:
|
So I just re-ran the job, which installed 1.8.2 and the error stays:
https://readthedocs.org/api/v2/build/25738204.txt
I tried with POLARS_ALLOW_NON_SCALAR_EXP=1 just before collect():
import os
os.environ['POLARS_ALLOW_NON_SCALAR_EXP'] = '1'
df = df.collect()
And still get the same error:
So this makes me wonder if either the error is coming from another place, or v1.8.2 does not have this code in it somehow. Pip does install 1.8.2 and nothing else according to the log.
EDIT: this makes me realize that setting
EDIT 2: I used the RTD facility to set an env var for the whole runner before any code runs, and still get the same error ... |
Can you set
You must set it before Polars is imported. |
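Since the advice above is that the variable must be set before Polars is imported, a small guard can make a too-late setting fail loudly instead of being silently ignored. The following is a minimal sketch using only the standard library; `import_with_env` is a hypothetical helper, not part of Polars.

```python
import importlib
import os
import sys
from types import ModuleType
from typing import Mapping


def import_with_env(module_name: str, env: Mapping[str, str]) -> ModuleType:
    """Set environment variables, then import `module_name` for the first time.

    Refuses to run if the module is already loaded, because a library that
    reads its configuration at import time would not see the new values.
    """
    if module_name in sys.modules:
        raise RuntimeError(
            f"{module_name!r} is already imported; "
            "environment variables set now may be ignored"
        )
    os.environ.update(env)
    return importlib.import_module(module_name)


# Intended usage (assumes polars is installed; the variable name is the one
# quoted earlier in this thread):
# pl = import_with_env("polars", {"POLARS_ALLOW_NON_SCALAR_EXP": "1"})
```

The check on `sys.modules` only covers imports in the current process; a variable set for the whole runner before any code runs, as tried above, avoids the ordering question entirely.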
Here we go:
|
Hmm... I don't understand where the parquet writer comes from? Is that somewhere else? |
I think this failure is down the line in another place, on another dataframe. So enabling these options "fixed" the issue reported here. I'll make another run and see whether that behavior is stable. |
Did another run, same output failing in sink_parquet(): https://readthedocs.org/api/v2/build/25750324.txt |
I re-ran the code with the sink_parquet() removed, and the issue is the same as before, now with polars 1.9.0:
And the verbose backtrace: thread '' panicked at /home/runner/work/polars/polars/crates/polars-error/src/lib.rs:45:37: If you want this Series to be broadcasted, ensure it is a scalar (for instance by adding '.first()'). |
Alright, I know where it happens now. Added the same trick, so you should (hopefully) get an error message showing the expression that caused it. And a way to temporarily silence the error. Can you come back with the faulting expression after next release? |
Sounds good, thanks |
Looks like I get another issue now with 1.11, related to the type of the column (Time should be u64, not a string): https://readthedocs.org/api/v2/build/26067058.txt I'll investigate further. With enough luck it can be reproduced locally, and it's possibly hiding the original issue we discussed here. |
@ritchie46 I tried locally with polars 1.11 both with Python 3.11 and 3.12 and it runs without problems. It only fails in the CI, so I guess this is just the new manifestation of the same underlying issue: https://readthedocs.org/api/v2/build/26068615.txt
This happens while converting the following pandas df to polars:
df.info() shows this
And the index is float64:
EDIT: s/include_index='Time'/include_index=True/ |
Hi @douglas-raillard-arm any update on this one? |
@ritchie46 Just gave it a go and the issue(s) is still there in polars 1.15:
https://app.readthedocs.org/api/v2/build/26403808.txt If I skip that
https://app.readthedocs.org/api/v2/build/26403916.txt Both builds were executed with:
|
Polars 1.16 still exhibits the same issue: https://app.readthedocs.org/api/v2/build/26455310.txt
|
I notice that the process may be getting forked. It could be worth testing without forking, or moving the polars import until after the process has been forked. |
@nameexhaustion if you mean setting "POLARS_ALLOW_FORKING_THREAD=1", this is due to this issue: So the process should not be forked (at least not in LISA nor devlib unless I missed a spot, in another dependency it's unlikely but not impossible). |
I see. I had a closer look at the latest logs you sent, and I realized the error message looks like it is due to attempting to Could you check if a |
Yes, this was set on that advice: #18896 (comment) I'll try with the collect() first. AFAIR that triggers the originally reported issue, so let's see what happens on the current polars version. |
@nameexhaustion It looks like the original issue is either gone or masked by another problem:
https://readthedocs.org/api/v2/build/26530794.txt It's quite possible that new problem just triggers on a dataframe processed before the one that was triggering the issue. |
So if I remove all code that tries to write to parquet (I can since it's only a cache), I hit that error:
https://readthedocs.org/api/v2/build/26530866.txt EDIT: I managed to reproduce locally, so assuming this is a bug on my part, it may be that the original problem is solved and all I need is to fix that and disable the env var that turns errors into panics. (The Duration(nanoseconds) issue will probably constitute another bug report if it's real.) EDIT2: looks like that issue is coming from the env var: #20228 |
So it seems the original issue is fixed. The only one left is the env var problem for which I opened a separate ticket, but it's secondary. |
Re-opening #18719 as it is still failing in v1.8.1
Checks
Reproducible example
Same reproducer as on #18719 but re-ran with 1.8.1:
Log output
Issue description
Collecting that LazyFrame triggers an exception in the readthedocs CI but not locally, even after re-creating the same environment (pip freeze). The only material difference I can think of is some StringCache() state that is different for whatever reason.
Note that this issue only started occurring from polars 1.7.0. Before that, the code was working.
Also note that the JSON plan is only there to make reproduction of the issue easier (both for me to extract the data from the CI log and for that bug report). The issue originally happened without that JSON layer (at least not at this spot). I also ended up trying the reported reproducer verbatim both in the CI and locally, with the same result (fails in the CI, succeeds locally).
Expected behavior
This should either work or fail, but consistently everywhere. Most likely it should work.
Installed versions
Polars upgraded to 1.8.1 compared to initial report.