diff --git a/docs/releases/upgrade/1.md b/docs/releases/upgrade/1.md
index 3c0a32176f3d..2d60a3bb8663 100644
--- a/docs/releases/upgrade/1.md
+++ b/docs/releases/upgrade/1.md
@@ -1,9 +1,5 @@
 # Version 1
 
-!!! warning "Work in progress"
-
-    This upgrade guide is not yet complete. Check back when 1.0.0 is released for the full overview of breaking changes.
-
 ## Breaking changes
 
 ### Properly apply `strict` parameter in Series constructor
@@ -184,6 +180,55 @@ Traceback (most recent call last):
 polars.exceptions.InvalidOperationError: conversion from `i64` to `u8` failed in column 'a' for 1 out of 3 values: [300]
 ```
 
+### Update `read/scan_parquet` to disable Hive partitioning by default for file inputs
+
+Parquet reading functions now also support directory inputs.
+Hive partitioning is enabled by default for directories, but is now _disabled_ by default for file inputs.
+File inputs include single files, globs, and lists of files.
+Explicitly pass `hive_partitioning=True` to restore previous behavior.
+
+**Example**
+
+Before:
+
+```pycon
+>>> pl.read_parquet("dataset/a=1/foo.parquet")
+shape: (2, 2)
+┌─────┬─────┐
+│ a   ┆ x   │
+│ --- ┆ --- │
+│ i64 ┆ f64 │
+╞═════╪═════╡
+│ 1   ┆ 1.0 │
+│ 1   ┆ 2.0 │
+└─────┴─────┘
+```
+
+After:
+
+```pycon
+>>> pl.read_parquet("dataset/a=1/foo.parquet")
+shape: (2, 1)
+┌─────┐
+│ x   │
+│ --- │
+│ f64 │
+╞═════╡
+│ 1.0 │
+│ 2.0 │
+└─────┘
+>>> pl.read_parquet("dataset/a=1/foo.parquet", hive_partitioning=True)
+shape: (2, 2)
+┌─────┬─────┐
+│ a   ┆ x   │
+│ --- ┆ --- │
+│ i64 ┆ f64 │
+╞═════╪═════╡
+│ 1   ┆ 1.0 │
+│ 1   ┆ 2.0 │
+└─────┴─────┘
+```
+
 ### Update `reshape` to return Array types instead of List types
 
 `reshape` now returns an Array type instead of a List type.
@@ -218,6 +263,83 @@ Series: '' [array[i64, 3]]
 ]
 ```
 
+### Read 2D NumPy arrays as `Array` type instead of `List`
+
+The Series constructor now parses 2D NumPy arrays as an `Array` type rather than a `List` type.
+
+**Example**
+
+Before:
+
+```pycon
+>>> import numpy as np
+>>> arr = np.array([[1, 2], [3, 4]])
+>>> pl.Series(arr)
+shape: (2,)
+Series: '' [list[i64]]
+[
+	[1, 2]
+	[3, 4]
+]
+```
+
+After:
+
+```pycon
+>>> import numpy as np
+>>> arr = np.array([[1, 2], [3, 4]])
+>>> pl.Series(arr)
+shape: (2,)
+Series: '' [array[i64, 2]]
+[
+	[1, 2]
+	[3, 4]
+]
+```
+
+### Split `replace` functionality into two separate methods
+
+The API for `replace` has proven to be confusing to many users, particularly with regard to the `default` argument and the resulting data type.
+
+It has been split up into two methods: `replace` and `replace_strict`.
+`replace` now always keeps the existing data type _(breaking, see example below)_ and is meant for replacing some values in your existing column.
+Its parameters `default` and `return_dtype` have been deprecated.
+
+The new method `replace_strict` is meant for creating a new column, mapping some or all of the values of the original column, and optionally specifying a default value. If no default is provided, it raises an error if any non-null values are not mapped.
+
+**Example**
+
+Before:
+
+```pycon
+>>> s = pl.Series([1, 2, 3])
+>>> s.replace(1, "a")
+shape: (3,)
+Series: '' [str]
+[
+	"a"
+	"2"
+	"3"
+]
+```
+
+After:
+
+```pycon
+>>> s.replace(1, "a")
+Traceback (most recent call last):
+...
+polars.exceptions.InvalidOperationError: conversion from `str` to `i64` failed in column 'literal' for 1 out of 1 values: ["a"]
+>>> s.replace_strict(1, "a", default=s)
+shape: (3,)
+Series: '' [str]
+[
+	"a"
+	"2"
+	"3"
+]
+```
+
 ### Preserve nulls in `ewm_mean`, `ewm_std`, and `ewm_var`
 
 Polars will no longer forward-fill null values in `ewm` methods.
@@ -291,38 +413,6 @@ shape: (3, 1)
 └──────┘
 ```
 
-### Read 2D NumPy arrays as `Array` type instead of `List`
-
-**Example**
-
-Before:
-
-```pycon
->>> import numpy as np
->>> arr = np.array([[1, 2], [3, 4]])
->>> pl.Series(arr)
-shape: (2,)
-Series: '' [list[i64]]
-[
-	[1, 2]
-	[3, 4]
-]
-```
-
-After:
-
-```pycon
->>> import numpy as np
->>> arr = np.array([[1, 2], [3, 4]])
->>> pl.Series(arr)
-shape: (2,)
-Series: '' [array[i64, 2]]
-[
-	[1, 2]
-	[3, 4]
-]
-```
-
 ### Change `str.to_datetime` to default to microsecond precision for format specifiers `"%f"` and `"%.f"`
 
 In `.str.to_datetime`, when specifying `%.f` as the format, the default was to set the resulting datatype to nanosecond precision. This has been changed to microsecond precision.
@@ -701,6 +791,39 @@ Series: '' [i64]
 ]
 ```
 
+### Change default engine for `read_excel` to `"calamine"`
+
+The `calamine` engine (available through the `fastexcel` package) has been added to Polars relatively recently.
+It's much faster than the other engines, and was already the default for `xlsb` and `xls` files.
+We have now made it the default for all Excel files.
+
+There may be subtle differences between this engine and the previous default (`xlsx2csv`).
+One clear difference is that the `calamine` engine does not support the `engine_options` parameter.
+If you cannot get your desired behavior with the `calamine` engine, specify `engine="xlsx2csv"` to restore previous behavior.
+
+**Example**
+
+Before:
+
+```pycon
+>>> pl.read_excel("data.xlsx", engine_options={"skip_empty_lines": True})
+```
+
+After:
+
+```pycon
+>>> pl.read_excel("data.xlsx", engine_options={"skip_empty_lines": True})
+Traceback (most recent call last):
+...
+TypeError: read_excel() got an unexpected keyword argument 'skip_empty_lines'
+```
+
+Instead, explicitly specify the `xlsx2csv` engine or omit the `engine_options`:
+
+```pycon
+>>> pl.read_excel("data.xlsx", engine="xlsx2csv", engine_options={"skip_empty_lines": True})
+```
+
 ### Remove class variables from some DataTypes
 
 Some DataType classes had class variables.
@@ -779,6 +902,52 @@ shape: (3, 2)
 └────────────┴───────────┘
 ```
 
+### Change default serialization format of `LazyFrame/DataFrame/Expr`
+
+The only serialization format available for the `serialize/deserialize` methods on Polars objects was JSON.
+We added a more optimized binary format and made this the default.
+JSON serialization is still available by passing `format="json"`.
+
+**Example**
+
+Before:
+
+```pycon
+>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
+>>> serialized = lf.serialize()
+>>> serialized
+'{"MapFunction":{"input":{"DataFrameScan":{"df":{"columns":[{"name":...'
+>>> from io import StringIO
+>>> pl.LazyFrame.deserialize(StringIO(serialized)).collect()
+shape: (1, 1)
+┌─────┐
+│ a   │
+│ --- │
+│ i64 │
+╞═════╡
+│ 6   │
+└─────┘
+```
+
+After:
+
+```pycon
+>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
+>>> serialized = lf.serialize()
+>>> serialized
+b'\xa1kMapFunction\xa2einput\xa1mDataFrameScan\xa4bdf...'
+>>> from io import BytesIO  # Note: using BytesIO instead of StringIO
+>>> pl.LazyFrame.deserialize(BytesIO(serialized)).collect()
+shape: (1, 1)
+┌─────┐
+│ a   │
+│ --- │
+│ i64 │
+╞═════╡
+│ 6   │
+└─────┘
+```
+
 ### Constrain access to globals from `DataFrame.sql` in favor of `pl.sql`
 
 The `sql` methods on `DataFrame` and `LazyFrame` can no longer access global variables.
@@ -831,3 +1000,64 @@ shape: (4, 2)
 │ 2   ┆ 4   │
 └─────┴─────┘
 ```
+
+### Remove re-export of type aliases
+
+We have a lot of type aliases defined in the `polars.type_aliases` module.
+Some of these were re-exported at the top-level and in the `polars.datatypes` module.
+These re-exports have been removed.
+
+We plan on adding a public `polars.typing` module in the future with a number of curated type aliases.
+Until then, please define your own type aliases, or import from our `polars.type_aliases` module.
+Note that the `type_aliases` module is not technically public, so use at your own risk.
+
+**Example**
+
+Before:
+
+```python
+def foo(dtype: pl.PolarsDataType) -> None: ...
+```
+
+After:
+
+```python
+PolarsDataType = pl.DataType | type[pl.DataType]
+
+def foo(dtype: PolarsDataType) -> None: ...
+```
+
+### Streamline optional dependency definitions in `pyproject.toml`
+
+We revisited the optional dependency definitions and made some minor changes.
+If you were using the extras `fastexcel`, `gevent`, `matplotlib`, or `async`, this is a breaking change.
+Please update your Polars installation to use the new extras.
+
+**Example**
+
+Before:
+
+```bash
+pip install 'polars[fastexcel,gevent,matplotlib]'
+```
+
+After:
+
+```bash
+pip install 'polars[calamine,async,graph]'
+```
+
+## Deprecations
+
+### Issue `PerformanceWarning` when LazyFrame properties `schema/dtypes/columns/width` are used
+
+Recent improvements to the correctness of schema resolving in the lazy engine have significantly increased the cost of resolving the schema.
+It is no longer 'free' - in fact, in complex pipelines with lazy file reading, resolving the schema can be relatively expensive.
+
+Because of this, the schema-related properties on LazyFrame were no longer good API design.
+Properties represent information that is already available and just needs to be retrieved.
+However, for the LazyFrame properties, accessing these may come at a significant performance cost.
+
+To solve this, we added the `LazyFrame.collect_schema` method, which retrieves the schema and returns a `Schema` object.
+The properties raise a `PerformanceWarning` and tell the user to use `collect_schema` instead.
+We chose not to deprecate the properties for now to facilitate writing code that is generic for both DataFrames and LazyFrames.