Rust Polars 0.41.0
💥 Breaking changes
- Make `hive_partitioning` parameter default to `None`, which is automatically enabled for single directory inputs, and disabled otherwise (#17106)
- Split `replace` functionality into two separate functions (#16921)
- Rename `DataFrame.melt` to `unpivot` and make parameters consistent with `pivot` (#17095)
- Default to writing binview data to IPC (#17084)
- Do not parse hive partitions from user provided directory/glob path (#17055)
- Add `strict` parameter to `DataFrame/LazyFrame.drop` and fix behavior to default to True (#17044)
- Remove supertype definition of List and non-List types (#16918)
- Native `selector` XOR set operation, guarantee consistent selector column-order (#16833)
- move offset_by implementation from polars-plan to polars-time, rename feature from DateOffset to OffsetBy (#16796)
- Rename `str.concat` to `str.join` and update default delimiter (#16790)
- Remove deprecated parameters in `Series.cut/qcut` and update struct field names (#16741)
- Expedited removal of certain deprecated functionality (#16754)
- Update some error types to more appropriate variants (#15030)
- Change default `offset` in `group_by_dynamic` from 'negative `every`' to 'zero' (#16658)
- Update `clip` to no longer propagate nulls in the given bounds (#14413)
- Change `str.to_datetime` to default to microsecond precision for format specifiers `"%f"` and `"%.f"` (#13597)
- Update resulting column names in `pivot` when pivoting by multiple values (#16439)
- Preserve nulls in `ewm_mean`, `ewm_std`, and `ewm_var` (#15503)
- Restrict casting for temporal data types (#14142)
- Rename struct fields of `rle` output to `len`/`value` and update data type of `len` field (#15249)
- Add `check_names` parameter to `Series.equals` and default to `False` (#16610)
- Deprecate `str.explode` in favor of `str.split("").explode()` (#16508)
- Deprecate `how="outer"` join type in favour of `how="full"` (left/right are *also* outer joins) (#16417)
- Change `DataFrame.is_empty()` to check `height == 0` instead of `width == 0` (#16351)
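The new `hive_partitioning` default can be read as a three-state switch: an explicit `True` or `False` is honoured, while `None` falls back to a rule based on the input. The sketch below is only an illustration of that rule in plain Python. The helper name `resolve_hive_partitioning` is hypothetical and is not part of Polars, and treating "single directory input" as "exactly one path that is a directory" is an assumption drawn from the wording of the release note.

```python
import os
import tempfile
from typing import Optional, Sequence


def resolve_hive_partitioning(
    paths: Sequence[str], hive_partitioning: Optional[bool] = None
) -> bool:
    """Hypothetical helper: decide whether hive partition parsing is on.

    An explicit True/False always wins. With the default of None, parsing is
    enabled only when the input is exactly one path and that path is a
    directory (an assumption based on the release note, not on Polars code).
    """
    if hive_partitioning is not None:
        return hive_partitioning
    return len(paths) == 1 and os.path.isdir(paths[0])


with tempfile.TemporaryDirectory() as root:
    file_path = os.path.join(root, "data.parquet")
    open(file_path, "wb").close()  # empty placeholder file, never parsed

    single_dir = resolve_hive_partitioning([root])
    single_file = resolve_hive_partitioning([file_path])
    two_dirs = resolve_hive_partitioning([root, root])
    forced_on = resolve_hive_partitioning([file_path], hive_partitioning=True)
```

Under these assumptions only the single-directory call enables parsing by default; passing an explicit boolean bypasses the rule entirely.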
🚀 Performance improvements
- Default to writing binview data to IPC (#17084)
- Parallelize arrow conversion if binview -> large_bin (#17083)
- GC buffers in if_then_else view kernel (#16993)
- Desugar `AND` filter into multiple nodes (#16992)
- Optimize generic argsort of row-encoding (#16894)
- Improve rle_id iteration perf and set sorted flags (#16893)
- Optimize string/binary sort (#16871)
- Use `split_at` in `split` (#16865)
- Use `split_at` instead of double slice in chunk splits. (#16856)
- Don't rechunk in `align_` if arrays are aligned (#16850)
- Don't create small chunks in parallel collect. (#16845)
- Add dedicated no-null branch in `arg_sort` (#16808)
- Speed up `dt.offset_by` 2x for constant durations (#16728)
- Toggle coalesce in `join` if non-coalesced key isn't projected (#16677)
- Make `dt.truncate` 1.5x faster when `every` is just a single duration (and not an expression) (#16666)
- Always prune unused columns in semi/anti join (#16665)
- make truncate 4x faster in simple cases (#16615)
- Cache arenas (and conversion) in SQL context (#16566)
- Partial schema cache. (#16549)
- improved numeric fill_(forward/backward) (#16475)
- only rechunk once per aggregate (#16469)
- Fix pathological small chunk parquet writing (#16433)
✨ Enhancements
- Make `hive_partitioning` parameter default to `None`, which is automatically enabled for single directory inputs, and disabled otherwise (#17106)
- Split `replace` functionality into two separate functions (#16921)
- Improve schema inference for hive partitions (#17079)
- Rename `DataFrame.melt` to `unpivot` and make parameters consistent with `pivot` (#17095)
- print row index in explain + dot (#17074)
- Support top-level `pl.col` autocompletion for iPython (#17080)
- predicate + projection pushdown in NDJson (#17068)
- Allow (non-)coalescing in join_asof (#17066)
- Turn off coalescing and fix mutation of join on expressions (#17061)
- Expand NDJson glob into one SCAN (#17063)
- Do not parse hive partitions from user provided directory/glob path (#17055)
- Support directory paths in scans for Parquet, IPC and CSV (#17017)
- Implement general array equality checks (#17043)
- Add `strict` parameter to `DataFrame/LazyFrame.drop` and fix behavior to default to True (#17044)
- allow experimental metadata use on release (#17005)
- first working prototype of new streaming engine (#16970)
- Desugar `AND` filter into multiple nodes (#16992)
- use min/max metadata on debug builds with `POLARS_METADATA_FLAGS=extensive` (#16963)
- Add SQL support for `INTERSECT` and `EXCEPT` ops (#16960)
- Allow setting file cache TTL on a per-file basis (#16891)
- Implement multiply and division for lhs duration (#16948)
- Raise on invalid temporal arithmetic (#16934)
- Always end with an in-memory sink on collect (#16928)
- Normalize `value_counts` (#16917)
- add `eq`/`ne` for more `FixedSizeList`s (#16902)
- setup skeleton (#16900)
- add fundamentals for new async-based streaming execution engine (#16884)
- Cache downloaded cloud IPC files (#16892)
- Improve `read_csv` SQL table reading function defaults (better handle dates) (#16866)
- Support SQL `VALUES` clause and inline renaming of columns in CTE & derived table definitions (#16851)
- convert to given time zone in `.str.to_datetime` when values are offset-aware (#16742)
- Support `SQL` "SELECT" with no tables, optimise registration of globals (#16836)
- Native `selector` XOR set operation, guarantee consistent selector column-order (#16833)
- Extend recognised `EXTRACT` and `DATE_PART` SQL part abbreviations (#16767)
- Improve error message when raising integers to negative integers, improve docs (#16827)
- Return datetime for mean/median of Date column (#16795)
- Expose overflowing cast (#16805)
- Expose a few more expression nodes in the expression IR (#16781)
- Support array arithmetic for equally sized shapes (#16791)
- Support cloud storage in `scan_csv` (#16674)
- Streamline SQL `INTERVAL` handling and improve related error messages, update `sqlparser-rs` lib (#16744)
- Support use of ordinal values in SQL `ORDER BY` clause (#16745)
- Support executing polars SQL against `pandas` and `pyarrow` objects (#16746)
- add `env` locked metadata functions (#16719)
- Remove deprecated parameters in `Series.cut/qcut` and update struct field names (#16741)
- Expedited removal of certain deprecated functionality (#16754)
- Update `date_range` to no longer produce datetime ranges (#16734)
- Remove deprecated `top_k` parameters `nulls_last`, `maintain_order`, and `multithreaded` (#16599)
- Support order-by in window functions (#16743)
- Add SQL support for `NULLS FIRST/LAST` ordering (#16711)
- Update some error types to more appropriate variants (#15030)
- Initial SQL support for `INTERVAL` strings (#16732)
- Enforce deprecation of `offset` arg in `truncate` and `round` (#16655)
- eliminate ProjectionExprs and handle CSE by stacking extra columns (#16682)
- Change default `offset` in `group_by_dynamic` from 'negative `every`' to 'zero' (#16658)
- Update `clip` to no longer propagate nulls in the given bounds (#14413)
- Change `str.to_datetime` to default to microsecond precision for format specifiers `"%f"` and `"%.f"` (#13597)
- Update resulting column names in `pivot` when pivoting by multiple values (#16439)
- Preserve nulls in `ewm_mean`, `ewm_std`, and `ewm_var` (#15503)
- Restrict casting for temporal data types (#14142)
- Add many more auto-inferable datetime formats for `str.to_datetime` (#16634)
- Rename struct fields of `rle` output to `len`/`value` and update data type of `len` field (#15249)
- Add `check_names` parameter to `Series.equals` and default to `False` (#16610)
- Dedicated `SQLInterface` and `SQLSyntax` errors (#16635)
- Add `DIV` function support to the SQL interface (#16678)
- add additional control to `write_parquet::statistics` parameter (#16575)
- Support non-coalescing streaming left join (#16672)
- Allow wildcard and exclude before struct expansions (#16671)
- Support per-column `nulls_last` on sort operations (#16639)
- Add `split_at` method to arrow `Array` (#16620)
- Initial support for SQL `ARRAY` literals and the `UNNEST` table function (#16330)
- Don't allow `struct.with_fields` in grouping (#16629)
- Add SQL support for `TRY_CAST` function (#16589)
- add fuzzer for expressions (#16581)
- handle CSE dtypes in NodeTraverser.get_dtype (#16552)
- check if by column is sorted, rather than just checking sorted flag, in `group_by_dynamic`, `upsample`, and `rolling` (#16494)
- Add general metadata structure to `ChunkedArray` (#16399)
- Add `is_column_selection()` to expression meta, enhance `expand_selector` (#16479)
- NDarray/Tensor support (#16466)
- Allow designation of a custom name for the `value_counts` "count" column (#16434)
- Default rechunk=False for read_parquet (#16427)
- Add `field` expression as selector with a struct scope (#16402)
- Field expansion renaming (#16397)
- add cluster_with_columns plan optimization (#16274)
- Change `DataFrame.is_empty()` to check `height == 0` instead of `width == 0` (#16351)
- add Expr.interpolate_by (#16313)
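The `DataFrame.is_empty()` change matters for frames that have columns but no rows. The sketch below uses a hypothetical plain-Python stand-in (`TinyFrame`, with `is_empty_old` and `is_empty_new`) rather than Polars itself. It only contrasts the two checks named in the release note, `width == 0` and `height == 0`, and assumes a frame without columns has a height of zero.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TinyFrame:
    """Hypothetical stand-in for a data frame: column name -> list of values."""

    columns: Dict[str, List[object]] = field(default_factory=dict)

    @property
    def width(self) -> int:
        # Number of columns.
        return len(self.columns)

    @property
    def height(self) -> int:
        # Number of rows; assumed to be 0 when there are no columns at all.
        if not self.columns:
            return 0
        return len(next(iter(self.columns.values())))

    def is_empty_old(self) -> bool:
        # Previous check described in the release note.
        return self.width == 0

    def is_empty_new(self) -> bool:
        # New check described in the release note.
        return self.height == 0


no_rows = TinyFrame({"a": [], "b": []})
with_rows = TinyFrame({"a": [1, 2], "b": [3, 4]})
no_columns = TinyFrame()
```

In this model a frame with two columns and zero rows is not empty under the old check but is empty under the new one, which is the case code relying on the old behaviour would need to revisit.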
🐞 Bug fixes
- Expand i128 primitive type match (#17076)
- Fix decompress_impl for csv with n_rows set (#17118)
- adds "polars-ops/timezones" dependency for "timezones" feature (#17115)
- Fix incorrect window std for chunked series (#17110)
- make `GetOutput::get_field` fallible (#17114)
- bubble error when no available bitrepr (#17116)
- Fix melt panic (#17088)
- Exclude index from expansion in rolling/group_by_dynamic (#17086)
- fix #17043 binary compare (#17052)
- Fix oob of join with literals and empty table (#17047)
- Don't silently accept multi-table FROM clauses (implicit JOIN syntax) (#17028)
- fix get categories on multiple row groups (#17041)
- Don't split up ANDed filters that are group-aware (#17031)
- Harden "async" check for users with out-of-date `sqlalchemy` libraries (#17029)
- error when sort_by of unequal length (#17026)
- properly catch not found explode cols (#17020)
- Correctly convert data frames to NumPy for C index order (#17000)
- Raise on invalid arithmetic shapes (#16986)
- Don't pushdown predicates in cross join if they refer to both tables (#16983)
- Fix projection pushdown with literal joins (#16981)
- Handle strictness for Decimal Series construction (#15309)
- properly set `FAST_EXPLODE_LIST` metadata (#16951)
- Raise informative error when writing object to file (#16954)
- Remove supertype definition of List and non-List types (#16918)
- Reject non-integral offset and length in AExpr::Slice (#16874)
- properly read/write fixed-sized lists from/to parquet files (#16747)
- Remove unwrap in `extend()` (#16890)
- Fix `should_rechunk` check (#16852)
- Standardised additional SQL interface errors (#16829)
- Ensure that split ChunkedArray also flattens chunks (#16837)
- Reduce needless panics in comparisons (#16831)
- Reset if next caller clones inner series (#16812)
- Raise on non-positive json schema inference (#16770)
- `describe`/`explain` streaming plan (#16771)
- Rewrite implementation of `top_k/bottom_k` and fix a variety of bugs (#16804)
- Fix comparison of UInt64 with zero (#16799)
- properly set boolean distinct count (#16782)
- Fix incorrect parquet statistics written for UInt64 values > Int64::MAX (#16766)
- Fix boolean distinct (#16765)
- `DATE_PART` SQL syntax/parsing, improve some error messages (#16761)
- Column selection wasn't applied when reading CSV with no rows (#16739)
- Only flush if operator can flush in streaming outer join (#16723)
- Raise unsupported cat array (#16717)
- Restrict casting for temporal data types (#14142)
- Improve `read_database` check for SQLAlchemy async Session objects (#16680)
- Full null on dyn int (#16679)
- Fix filter shape on empty null (#16670)
- Potentially deal with empty range (#16650)
- Use of SQL `ORDER BY` should not cause reordering of `SELECT` cols (#16579)
- ensure df in empty parquet (#16621)
- get_dtype handles input node schema and CSE (#16582)
- Crash using empty `Series` in `LazyFrame.select()` (#16592)
- small safety issue in CWC filtermap (#16591)
- bail on bool `floordiv` (#16578)
- Resolve multiple SQL `JOIN` issues (#16507)
- solve panic in `cluster_with_columns`, found with small fuzzer (#16562)
- Project last column if count query (#16569)
- Properly split struct columns (#16563)
- Ensure strict chunking in chunked partitioned group by (#16561)
- Error selecting columns after non-coalesced join (multiple join keys) (#16559)
- Don't panic on hashing nested list types (#16555)
- deal with realiases in `cluster_with_columns` (#16548)
- Ensure deduced join key names are unique (#16551)
- Crash selecting columns after non-coalesced join (#16541)
- Fix group gather of single literal (#16539)
- throw an invalid operation exception on performing a `sum` over a `list` of `str`s (#16521)
- properly fill null into unknown struct fields (#16518)
- Fix df.chunked for struct (#16504)
- Mix of column and field expansion (#16502)
- Fix `split_chunks` for nested dtypes (#16493)
- Handle struct.fields as special case of alias (#16484)
- Correct schema for list.sum (#16483)
- properly display other length in simple project and display more columns up to a maximum length (#16464)
- allow search_sorted directly on multiple chunks, and fix behavior around nulls (#16447)
- Fix use of `COUNT(*)` in SQL `GROUP BY` operations (#16465)
- properly set schemas in cluster_with_columns when reassigning to columns (#16463)
- Fix panic when computing min() of Duration series. (#16455)
- keep track of removed indices in CWC (#16443)
- Fix struct 'with_fields' schema for update dtypes (#16428)
- Fix error reading lists of CSV files that contain comments (#16426)
- Fix struct arithmetic schema (#16396)
- Fix don't panic on chunked to_numpy conversion (#16393)
- Don't check nulls before conversion (#16392)
- fix a printing issue in IR::Filter (#16378)
- correct AExpr.to_field for bitwise and logical and/or (#16360)
📖 Documentation
- Fix some warnings during doc build (#17077)
- Lots of additions to the SQL reference docs (#16990)
- Update the Overview section of the contributing guide (#15674)
- Update outdated performance section (#16409)
📦 Build system
🛠️ Other improvements
- Split `replace` functionality into two separate functions (#16921)
- Rename `DataFrame.melt` to `unpivot` and make parameters consistent with `pivot` (#17095)
- Add pivot test #17081 (#17090)
- Make Series and ChunkedArray ops fallible (#16965)
- Support directory paths in scans for Parquet, IPC and CSV (#17017)
- expose all metadata to polars-expr (#17018)
- Move in-memory engine to its own crate (#17039)
- Remove file cache test (#17038)
- Point polars-stream to crates/ again (#17024)
- Fix failing file cache test in CI (#17014)
- Use proper join type in test (#16994)
- Fix file cache verbose logging leakage during pytest (#16984)
- Remove redundant projection attribute in IR::DataFrameScan (#16952)
- Follow-up changes to polars-parquet (#16949)
- Factor out some apply calls in duration namespace (#16941)
- extend new streaming engine with some initial nodes (#16940)
- setup skeleton (#16900)
- Refactor parts of IR. (#16899)
- Remove inner `Arc` from `FileCacheEntry` (#16870)
- Update links to API references (#16843)
- Prepare update of API reference URLs (#16816)
- Rename allow_overflow to wrap_numerical (#16807)
- Remove unneeded code (#16838)
- Don't enter streaming engine for groupby-> agg mean/median … (#16810)
- move offset_by implementation from polars-plan to polars-time, rename feature from DateOffset to OffsetBy (#16796)
- Improve safety of amortized_iter (#16820)
- start further use of polars-compute in polars-parquet (#16788)
- Remove deprecated `MutableBitmap.null_count` method (#16797)
- Rename `str.concat` to `str.join` and update default delimiter (#16790)
- add distinct_non_null_count method (#16789)
- Remove needless inner type clone (#16718)
- Simplify NodeTraverser.get_dtype (#16712)
- Fix incorrect debug assertion in `ChunkedArray::from_chunks_and_dtype` (#16697)
- separate `Aggregation` evaluation PhysicalExpr from conversion (#16688)
- Update version resolver for `1.0.0` release (#16705)
- Branch earlier in binary type resolving (#16685)
- Avoid AWS pinning to outdated crc32c version (#16681)
- use binary `try_{add, mul, ...}` ops in borrowed dispatch (#16580)
- Split `dates_times` into separate `date` and `time` modules (#16667)
- Add test for 16642 (#16646)
- Remove duplicate tag in CODEOWNERS (#16625)
- CWC prealloc `pushable` and `potential_pushable` (#16626)
- Update dprint hook versions and enable JSON linting (#16611)
- defensively invalidate metadata and start on copying of `min_value`, `max_value` and `distinct_count` (#16593)
- Deprecate `str.explode` in favor of `str.split("").explode()` (#16508)
- make Parquet `Statistics` into `enum` instead of `trait` (#16485)
- small change on CWC where we can reuse the old `pushable_set_bits` and reserve space for `input_exprs` (#16468)
- Deprecate `how="outer"` join type in favour of `how="full"` (left/right are *also* outer joins) (#16417)
- Include license file in polars-expr crate (#16421)
Thank you to all our contributors for making this release possible!
@BGR360, @JulianCologne, @KDruzhkin, @Kylea650, @MarcoGorelli, @Mottl, @Object905, @alexander-beedie, @ankane, @bertiewooster, @borchero, @c-peters, @cmdlineluser, @coastalwhite, @dangotbanned, @datenzauberai, @dependabot, @dependabot[bot], @hattajr, @henryharbeck, @itamarst, @machow, @marenwestermann, @mcrumiller, @mdavis-xyz, @messense, @montanarograziano, @nameexhaustion, @orlp, @p3i0t, @r-brink, @ritchie46, @siddharth-gulia, @stinodego, @tkellogg, @twoertwein, @universalmind303 and @wence-