Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: Improve replace and replace_all docstring explanation of the "$" character with reference to capture groups (vs use as a literal) #19529

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
89 changes: 69 additions & 20 deletions py-polars/polars/expr/string.py
Original file line number Diff line number Diff line change
Expand Up @@ -1908,13 +1908,36 @@ def replace(

Notes
-----
The dollar sign (`$`) is a special character related to capture groups.
To refer to a literal dollar sign, use `$$` instead or set `literal` to `True`.

To modify regular expression behaviour (such as case-sensitivity) with flags,
use the inline `(?iLmsuxU)` syntax. See the regex crate's section on
`grouping and flags <https://docs.rs/regex/latest/regex/#grouping-and-flags>`_
for additional information about the use of inline expression modifiers.
* To modify regular expression behaviour (such as case-sensitivity) with flags,
use the inline `(?iLmsuxU)` syntax. See the regex crate's section on
`grouping and flags <https://docs.rs/regex/latest/regex/#grouping-and-flags>`_
rodrigogiraoserrao marked this conversation as resolved.
Show resolved Hide resolved
for additional information about the use of inline expression modifiers.

* The dollar sign (`$`) is a special character related to capture groups; if you
want to replace some target pattern with characters that include a literal `$`
you should escape it by doubling it up as `$$`, or set `literal=True` if you
do not need a full regular expression pattern match. Otherwise, you will be
referencing a (potentially non-existent) capture group.

In the example below we need to double up `$` (to represent a literal dollar
sign, and then refer to the capture group using `$n` or `${n}`, hence the
three consecutive `$` characters in the replacement value:

.. code-block:: python

>>> df = pl.DataFrame({"cost": ["#12.34", "#56.78"]})
>>> df.with_columns(
... cost_usd=pl.col("cost").str.replace(r"#(\d+)", "$$${1}")
... )
shape: (2, 2)
┌────────┬──────────┐
│ cost ┆ cost_usd │
│ --- ┆ --- │
│ str ┆ str │
╞════════╪══════════╡
│ #12.34 ┆ $12.34 │
│ #56.78 ┆ $56.78 │
└────────┴──────────┘

Examples
--------
Expand All @@ -1930,9 +1953,9 @@ def replace(
│ 2 ┆ abc456 │
└─────┴────────┘

Capture groups are supported. Use `${1}` in the `value` string to refer to the
first capture group in the `pattern`, `${2}` to refer to the second capture
group, and so on. You can also use named capture groups.
Capture groups are supported. Use `$1` or `${1}` in the `value` string to refer
to the first capture group in the `pattern`, `$2` or `${2}` to refer to the
second capture group, and so on. You can also use *named* capture groups.

>>> df = pl.DataFrame({"word": ["hat", "hut"]})
>>> df.with_columns(
Expand Down Expand Up @@ -1999,13 +2022,39 @@ def replace_all(

Notes
-----
The dollar sign (`$`) is a special character related to capture groups.
To refer to a literal dollar sign, use `$$` instead or set `literal` to `True`.

To modify regular expression behaviour (such as case-sensitivity) with flags,
use the inline `(?iLmsuxU)` syntax. See the regex crate's section on
`grouping and flags <https://docs.rs/regex/latest/regex/#grouping-and-flags>`_
for additional information about the use of inline expression modifiers.
* To modify regular expression behaviour (such as case-sensitivity) with flags,
use the inline `(?iLmsuxU)` syntax. See the regex crate's section on
`grouping and flags <https://docs.rs/regex/latest/regex/#grouping-and-flags>`_
for additional information about the use of inline expression modifiers.

* The dollar sign (`$`) is a special character related to capture groups; if you
want to replace some target pattern with characters that include a literal `$`
you should escape it by doubling it up as `$$`, or set `literal=True` if you
do not need a full regular expression pattern match. Otherwise, you will be
referencing a (potentially non-existent) capture group.

In the example below we need to double up `$` to represent a literal dollar
sign, otherwise we are referring to a capture group (which may or may not
exist):

.. code-block:: python

>>> df = pl.DataFrame({"text": ["ab12cd34ef", "gh45ij67kl"]})
>>> df.with_columns(
... # the replacement pattern refers back to the capture group
... text1=pl.col("text").str.replace_all(r"(?<N>\d{2,})", "$N$"),
... # doubling-up the `$` results in it appearing as a literal value
... text2=pl.col("text").str.replace_all(r"(?<N>\d{2,})", "$$N$$"),
... )
shape: (2, 3)
┌────────────┬──────────────┬──────────────┐
│ text ┆ text1 ┆ text2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞════════════╪══════════════╪══════════════╡
│ ab12cd34ef ┆ ab12$cd34$ef ┆ ab$N$cd$N$ef │
│ gh45ij67kl ┆ gh45$ij67$kl ┆ gh$N$ij$N$kl │
└────────────┴──────────────┴──────────────┘

Examples
--------
Expand All @@ -2021,9 +2070,9 @@ def replace_all(
│ 2 ┆ 123-123 │
└─────┴─────────┘

Capture groups are supported. Use `${1}` in the `value` string to refer to the
first capture group in the `pattern`, `${2}` to refer to the second capture
group, and so on. You can also use named capture groups.
Capture groups are supported. Use `$1` or `${1}` in the `value` string to refer
to the first capture group in the `pattern`, `$2` or `${2}` to refer to the
second capture group, and so on. You can also use *named* capture groups.

>>> df = pl.DataFrame({"word": ["hat", "hut"]})
>>> df.with_columns(
Expand Down
107 changes: 79 additions & 28 deletions py-polars/polars/series/string.py
Original file line number Diff line number Diff line change
Expand Up @@ -1109,13 +1109,46 @@ def replace(

Notes
-----
The dollar sign (`$`) is a special character related to capture groups.
To refer to a literal dollar sign, use `$$` instead or set `literal` to `True`.

To modify regular expression behaviour (such as case-sensitivity) with flags,
use the inline `(?iLmsuxU)` syntax. See the regex crate's section on
`grouping and flags <https://docs.rs/regex/latest/regex/#grouping-and-flags>`_
for additional information about the use of inline expression modifiers.
* To modify regular expression behaviour (such as case-sensitivity) with flags,
use the inline `(?iLmsuxU)` syntax. (See the regex crate's section on
`grouping and flags <https://docs.rs/regex/latest/regex/#grouping-and-flags>`_
for additional information about the use of inline expression modifiers).

* The dollar sign (`$`) is a special character related to capture groups; if you
want to replace some target pattern with characters that include a literal `$`
you should escape it by doubling it up as `$$`, or set `literal=True` if you
do not need a full regular expression pattern match. Otherwise, you will be
referencing a (potentially non-existent) capture group.

If not escaped, the `$0` in the replacement value (below) represents a capture
group:

.. code-block:: python

>>> s = pl.Series("cents", ["000.25", "00.50", "0.75"])
>>> s.str.replace(r"^(0+)\.", "$0.")
shape: (3,)
Series: 'cents' [str]
[
"000..25"
"00..50"
"0..75"
]

To have `$` represent a literal value, it should be doubled-up as `$$`
(or, for simpler find/replace operations, set `literal=True` if you do
not require a full regular expression match):

.. code-block:: python

>>> s.str.replace(r"^(0+)\.", "$$0.")
shape: (3,)
Series: 'cents' [str]
[
"$0.25"
"$0.50"
"$0.75"
]

Examples
--------
Expand All @@ -1128,24 +1161,24 @@ def replace(
"abc456"
]

Capture groups are supported. Use `${1}` in the `value` string to refer to the
first capture group in the `pattern`, `${2}` to refer to the second capture
group, and so on. You can also use named capture groups.
Capture groups are supported. Use `$1` or `${1}` in the `value` string to refer
to the first capture group in the `pattern`, `$2` or `${2}` to refer to the
second capture group, and so on. You can also use *named* capture groups.

>>> s = pl.Series(["hat", "hut"])
>>> s.str.replace("h(.)t", "b${1}d")
shape: (2,)
Series: '' [str]
[
"bad"
"bud"
"bad"
"bud"
]
>>> s.str.replace("h(?<vowel>.)t", "b${vowel}d")
shape: (2,)
Series: '' [str]
[
"bad"
"bud"
"bad"
"bud"
]

Apply case-insensitive string replacement using the `(?i)` flag.
Expand Down Expand Up @@ -1181,13 +1214,31 @@ def replace_all(self, pattern: str, value: str, *, literal: bool = False) -> Ser

Notes
-----
The dollar sign (`$`) is a special character related to capture groups.
To refer to a literal dollar sign, use `$$` instead or set `literal` to `True`.

To modify regular expression behaviour (such as case-sensitivity) with flags,
use the inline `(?iLmsuxU)` syntax. See the regex crate's section on
`grouping and flags <https://docs.rs/regex/latest/regex/#grouping-and-flags>`_
for additional information about the use of inline expression modifiers.
* To modify regular expression behaviour (such as case-sensitivity) with flags,
use the inline `(?iLmsuxU)` syntax. (See the regex crate's section on
`grouping and flags <https://docs.rs/regex/latest/regex/#grouping-and-flags>`_
for additional information about the use of inline expression modifiers).

* The dollar sign (`$`) is a special character related to capture groups; if you
want to replace some target pattern with characters that include a literal `$`
you should escape it by doubling it up as `$$`, or set `literal=True` if you
do not need a full regular expression pattern match. Otherwise, you will be
referencing a (potentially non-existent) capture group.

In the example below we need to double up `$` (to represent a literal dollar
sign, and then refer to the capture group using `$n` or `${n}`, hence the
three consecutive `$` characters in the replacement value:

.. code-block:: python

>>> s = pl.Series("cost", ["#12.34", "#56.78"])
>>> s.str.replace_all(r"#(\d+)", "$$${1}").alias("cost_usd")
shape: (2,)
Series: 'cost_usd' [str]
[
"$12.34"
"$56.78"
]

Examples
--------
Expand All @@ -1200,24 +1251,24 @@ def replace_all(self, pattern: str, value: str, *, literal: bool = False) -> Ser
"abc456"
]

Capture groups are supported. Use `${1}` in the `value` string to refer to the
first capture group in the `pattern`, `${2}` to refer to the second capture
group, and so on. You can also use named capture groups.
Capture groups are supported. Use `$1` or `${1}` in the `value` string to refer
to the first capture group in the `pattern`, `$2` or `${2}` to refer to the
second capture group, and so on. You can also use *named* capture groups.

>>> s = pl.Series(["hat", "hut"])
>>> s.str.replace_all("h(.)t", "b${1}d")
shape: (2,)
Series: '' [str]
[
"bad"
"bud"
"bad"
"bud"
]
>>> s.str.replace_all("h(?<vowel>.)t", "b${vowel}d")
shape: (2,)
Series: '' [str]
[
"bad"
"bud"
"bad"
"bud"
]

Apply case-insensitive string replacement using the `(?i)` flag.
Expand Down
Loading