feat: Add `str.head` and `str.tail` #14425

mcrumiller · 2024-02-11T21:22:48Z

Resolves #10337.

Julian-J-S · 2024-02-12T07:41:49Z

nice addition! 😃

I would recommend a small change in the test

the check for "foobar" with

head(-3) == "foo"
tail(-3) == "bar"

is a little confusing because this would also work if the function just took the absolute value.

"abcde" with

head(2) == "ab" & head(-2) == "abc"
tail(2) == "de" & tail(-2) == "cde"

would be a little clearer to understand and is not ambiguous
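
To make the suggested check concrete, here is a minimal pure-Python sketch of the semantics being discussed. The helpers `head` and `tail` are hypothetical stand-ins that model the expected result on a single string; they are not the Polars implementation.

```python
def head(s: str, n: int) -> str:
    """Model of str.head: first n characters; a negative n drops the last |n|."""
    if n >= 0:
        return s[:n]
    return s[: max(len(s) + n, 0)]


def tail(s: str, n: int) -> str:
    """Model of str.tail: last n characters; a negative n drops the first |n|."""
    if n >= 0:
        return s[len(s) - min(n, len(s)):]
    return s[min(-n, len(s)):]


# "foobar" with n = 3 cannot tell a correct implementation from one using abs(n).
ambiguous = head("foobar", 3) == head("foobar", -3)

# "abcde" with n = 2 gives different results for positive and negative n.
results = {
    "head(2)": head("abcde", 2),
    "head(-2)": head("abcde", -2),
    "tail(2)": tail("abcde", 2),
    "tail(-2)": tail("abcde", -2),
}
```

With an odd-length string and a small `n`, every one of the four calls returns a distinct value, so an implementation that silently took the absolute value would fail the test.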

crates/polars-ops/src/chunked_array/strings/namespace.rs

stinodego · 2024-02-12T10:08:40Z

crates/polars-plan/src/dsl/string.rs

@@ -526,6 +526,26 @@ impl StringNameSpace {
        )
    }

+    /// Take the first `n` characters of the string values.
+    pub fn head(self, n: Expr) -> Expr {


I'm wondering if we can't just define these in terms of a slice operation - that would save a lot of code bloat. But that might not work with negative indices.

Hi @stinodego -- this was my initial intent, when I said I would piggyback on @reswqa's implementation of str.slice. I realized soon after that the negative indexing for head requires calculation of the string length to determine the end of the slice.

Here are the operations and their slice equivalents:

s.str.head(3)   # s.str.slice(0, 3)
s.str.head(-3)  # no equivalent: must know string length to compute the end offset
s.str.tail(3)   # s.str.slice(-3, None)
s.str.tail(-3)  # s.str.slice(3, None)

So for tail we could do it, because slice can run to the end of the string by itself, but for head we have no recourse. Defining tail via slice would save a little bloat, but it would make the code asymmetric. On the other hand, the dedicated tail implementation is a bit more performant than slice because it has one fewer parameter and therefore more fast paths. So it's a tradeoff. What do you think?

I guess you can use slice in combination with len_chars, but clearly it will be a bit more efficient to have a dedicated implementation like in this PR. I'll leave it to Ritchie to be the judge here.
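
The asymmetry described above can be sketched in plain Python. `slice_chars` is a hypothetical single-string model of an offset/length slice, written only for illustration; it is an assumption about the general shape of `str.slice`, not its actual code.

```python
from typing import Optional


def slice_chars(s: str, offset: int, length: Optional[int] = None) -> str:
    """Hypothetical single-string model of str.slice(offset, length)."""
    # A negative offset counts from the end of the string.
    start = max(len(s) + offset, 0) if offset < 0 else offset
    if length is None:
        return s[start:]
    return s[start : start + max(length, 0)]


s = "abcde"
head_pos = slice_chars(s, 0, 2)            # head(2)
tail_pos = slice_chars(s, -2)              # tail(2)
tail_neg = slice_chars(s, 2)               # tail(-2)
# head(-2) has no fixed-argument form: the slice length depends on len(s).
head_neg = slice_chars(s, 0, len(s) - 2)   # head(-2)
```

Three of the four cases need only constants as slice arguments. Only `head` with a negative `n` needs the per-row string length, which is what the `len_chars` combination mentioned above would supply.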

py-polars/tests/unit/namespaces/string/test_string.py

stinodego

Thanks! I left a few comments, I'll leave the proper review to Ritchie or Orson.

stinodego

The docstrings could use a bit more love (you can probably copy paste some stuff from slice) and a doc entry is missing in the API reference.

If you could address this, I'll approve and leave it to someone to judge the Rust side of things.

py-polars/polars/series/string.py

mcrumiller · 2024-02-12T19:13:01Z

@stinodego I've updated the docstrings with more detail and more examples.

I cannot for the life of me figure out why the CI doctest is failing with an "unexpected indentation" error. My doctests pass fine locally and I can't determine which part is causing the error.

I do note that when I run code locally, Series show 8 spaces of indentation:

>>> import polars as pl
>>> s = pl.Series(["pear", None, "papaya", "dragonfruit"])
>>> s.str.head(-3)
shape: (4,)
Series: '' [str]
[
        "p"
        null
        "pap"
        "dragonfr"
]

And the examples in string.py are a hodgepodge of 4 or 8 spaces. str.explode, for example, has 8 spaces in its docstring examples, but those do not seem to cause an error, whereas str.contains has only 4 spaces in its examples, and also does not cause an error. Could this be the issue?

Edit: I suspect this is the case, as my local doctest does not complain. I've reduced to 4 and we'll see how that fares.
Edit2: nope, still failing.

stinodego · 2024-02-12T19:25:24Z

It's probably an unclosed backtick. I can take a look. There are some issues with the docstring formatting anyway that I can see won't render.

py-polars/polars/series/string.py

crates/polars-ops/src/chunked_array/strings/substring.rs

stinodego

All right, Python side looks good to me. Leaving the Rust side review to @ritchie46

coolstudio1678 · 2024-03-16T06:59:21Z

When can it be used in rust?

mcrumiller · 2024-03-16T16:43:28Z

It needs to be approved first. @ritchie46 would you mind taking a look?

ritchie46 · 2024-03-18T08:17:04Z

I hope to get to this today. I am a bit worried about slices that are within char boundaries.

mcrumiller · 2024-03-18T14:06:44Z

I am a bit worried about slices that are within char boundaries.

Do we have reason not to trust str.len(), or do you mean that we must be very careful?

orlp · 2024-03-18T14:30:11Z

crates/polars-ops/src/chunked_array/strings/substring.rs

+        (_, 1) => {
+            // SAFETY: `n` was verified to have at least 1 element.
+            let n = unsafe { n.get_unchecked(0) };
+            unary_elementwise(ca, |str_val| head_binary(str_val, n)).with_name(ca.name())


This needs to be changed, string head/tail must be defined in terms of codepoints, not bytes! Otherwise you get illegal UTF-8 and general nonsense. Please change this and add a test-case that tests this, for example:

import polars as pl
from polars.testing import assert_frame_equal

df = pl.DataFrame({"s": ["你好世界"]})
head = pl.DataFrame({"s": ["你好"]})
tail = pl.DataFrame({"s": ["世界"]})
assert_frame_equal(df.select(pl.col.s.str.head(2)), head)
assert_frame_equal(df.select(pl.col.s.str.tail(2)), tail)
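
As a rough illustration of why byte-based slicing is a problem, the following standard-library sketch (independent of Polars) compares cutting the same string by bytes and by code points. The variable names are made up for this example.

```python
text = "你好世界"
data = text.encode("utf-8")

num_codepoints = len(text)  # Python str is indexed by code point
num_bytes = len(data)       # each of these characters takes 3 bytes in UTF-8

# Taking the first two *bytes* cuts the first character apart,
# so the result is no longer valid UTF-8.
try:
    data[:2].decode("utf-8")
    byte_head_is_valid = True
except UnicodeDecodeError:
    byte_head_is_valid = False

# Taking the first and last two *code points* gives the expected strings.
codepoint_head = text[:2]
codepoint_tail = text[-2:]
```

A byte-based `head(2)` on this input would have to return two bytes of a three-byte character, which is the illegal UTF-8 mentioned above.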

Thanks @orlp, may have to make a few changes.

FYI I am noticing that str.slice does not properly index codepoints with negative indexes. Using your example:

s = pl.Series(["你好世界"])
tail = "界"
s.str.slice(-1)  # should be equivalent to "tail"
# shape: (1,)
# Series: '' [str]
# [
#         ""
# ]

I'll see if I can address this as a separate issue once I have finished with this one. Edit: opened #15136.
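
For readers wondering what a code-point-aware negative offset involves, here is one possible approach, sketched in Python over raw UTF-8 bytes. The function `tail_byte_offset` is hypothetical and is not taken from the Polars source or from the fix in the linked issue; it only shows the idea of walking backwards and skipping continuation bytes.

```python
def tail_byte_offset(data: bytes, n: int) -> int:
    """Return the byte index at which the last n code points of valid UTF-8 begin."""
    i = len(data)
    remaining = n
    while i > 0 and remaining > 0:
        i -= 1
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        # Any other byte is the first byte of a code point.
        if data[i] & 0xC0 != 0x80:
            remaining -= 1
    return i


data = "你好世界".encode("utf-8")
start = tail_byte_offset(data, 1)
last_char = data[start:].decode("utf-8")
```

Treating `-1` as "one byte from the end" would land inside the last character, whereas counting lead bytes lands on its first byte.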

@orlp the new impl respects code points instead of bytes. I added some specific code point tests using your example. Let me know if anything looks off to you!

Looks good now.

codecov · 2024-03-18T20:27:48Z

Codecov Report

Attention: Patch coverage is 94.20290%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 81.15%. Comparing base (dcee934) to head (0b9f353).
Report is 4 commits behind head on main.

Files	Patch %	Lines
...rates/polars-plan/src/dsl/function_expr/strings.rs	80.76%	5 Missing ⚠️
py-polars/src/expr/general.rs	50.00%	2 Missing ⚠️
.../polars-ops/src/chunked_array/strings/substring.rs	98.52%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #14425      +/-   ##
==========================================
+ Coverage   81.14%   81.15%   +0.01%     
==========================================
  Files        1363     1363              
  Lines      175282   175408     +126     
  Branches     2527     2527              
==========================================
+ Hits       142236   142360     +124     
- Misses      32568    32571       +3     
+ Partials      478      477       -1


orlp · 2024-03-18T21:30:10Z

Looks good now.

ritchie46

Alright. Thanks @mcrumiller. Sorry for the delay on this one.

github-actions bot added the enhancement, python, and rust labels Feb 11, 2024

mcrumiller force-pushed the str-head-tail branch from fff8a71 to 0e6703d February 11, 2024 22:18

mcrumiller marked this pull request as ready for review February 11, 2024 22:44

mcrumiller requested review from ritchie46, stinodego, c-peters, alexander-beedie, MarcoGorelli and orlp as code owners February 11, 2024 22:44

mcrumiller force-pushed the str-head-tail branch from 0e6703d to 496f4de February 12, 2024 00:58

stinodego reviewed Feb 12, 2024

crates/polars-ops/src/chunked_array/strings/namespace.rs (outdated, resolved)

stinodego reviewed Feb 12, 2024

py-polars/tests/unit/namespaces/string/test_string.py (outdated, resolved)

stinodego requested changes Feb 12, 2024

mcrumiller added 3 commits February 12, 2024 09:26

Add str_head and str_tail

1988f37

Improve test coverage

76ac085

Use strict cast

dfd6b66

mcrumiller force-pushed the str-head-tail branch from 0bf53a1 to dfd6b66 February 12, 2024 14:26

stinodego requested changes Feb 12, 2024

py-polars/polars/series/string.py (outdated, resolved)

py-polars/polars/series/string.py (outdated, resolved)

mcrumiller force-pushed the str-head-tail branch from 36b313f to a2626a2 February 12, 2024 18:50

mcrumiller added 2 commits February 12, 2024 14:00

Update docstrings

939d6c5

Update API

2df1b08

mcrumiller force-pushed the str-head-tail branch from a2626a2 to 2df1b08 February 12, 2024 19:00

Reduce spacing in doctest Series output

54b0cf1

mcrumiller force-pushed the str-head-tail branch from 8af424a to cf53b6b February 12, 2024 20:04

mcrumiller force-pushed the str-head-tail branch from cf53b6b to 0bc19f2 February 12, 2024 20:05

Fix doctests

788484f

mcrumiller force-pushed the str-head-tail branch from 0bc19f2 to 788484f February 12, 2024 20:13

stinodego requested changes Feb 14, 2024

py-polars/polars/series/string.py (outdated, resolved)

crates/polars-ops/src/chunked_array/strings/substring.rs (outdated, resolved)

mcrumiller added 2 commits February 14, 2024 08:56

Remove default values

e2e51db

Improve Safety section comments

66a8ecd

stinodego approved these changes Feb 14, 2024

orlp requested changes Mar 18, 2024

mcrumiller marked this pull request as draft March 18, 2024 14:34

mcrumiller added 3 commits March 18, 2024 10:35

Merge branch 'main' into str-head-tail

3acf020

Update head to use codepoints

0e0d146

Use codepoints for tail, extend unit tests

8d90538

mcrumiller marked this pull request as ready for review March 18, 2024 19:55

orlp mentioned this pull request Mar 18, 2024

fix: incorrect negative offset in multi-byte string slicing #15140

Merged

orlp approved these changes Mar 18, 2024

mcrumiller requested a review from reswqa as a code owner April 8, 2024 15:59

Merge branch 'main' into str-head-tail

0b9f353

ritchie46 reviewed Apr 13, 2024

ritchie46 merged commit 429d3dd into pola-rs:main Apr 13, 2024
24 checks passed

mcrumiller deleted the str-head-tail branch April 13, 2024 15:43
