-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add str.head
and str.tail
#14425
Conversation
fff8a71
to
0e6703d
Compare
0e6703d
to
496f4de
Compare
nice addition! 😃 I would recommend a small change in the test the check for "foobar" with
is a little confusing because this would also work if the function just took the absolute value. "abcde" with
would be a little clearer to understand and is not ambiguous |
@@ -526,6 +526,26 @@ impl StringNameSpace { | |||
) | |||
} | |||
|
|||
/// Take the first `n` characters of the string values. | |||
pub fn head(self, n: Expr) -> Expr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we can't just define these in terms of a slice
operation - that would save a lot of code bloat. But that might not work with negative indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @stinodego -- this was my initial intent, when I said I would piggyback on @reswqa's implementation of str.slice
. I realized soon after that the negative indexing for head requires calculation of the string length to determine the end of the slice.
Here are the operations and their slice
equivalents:
s.str.head(3) # s.str.slice(3, None)
s.str.head(-3) # no equivalent: must know string length for start offset
s.str.tail(3) # s.str.slice(-3, None)
s.str.tail(-3) # s.str.slice(3, None)
So for tail
, we could do it because slice
can run to the end of the string by itself, but for head we have no recourse. I suppose this would save us a little bit of bloat but it does make the code a bit asymmetric, but on the other hand the tail
implementation is a bit more performant than slice
because we have one fewer parameters, and so we have more fast paths. So it's a tradeoff here. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you can use slice
in combination with len_chars
, but clearly it will be a bit more efficient to have a dedicated implementation like in this PR. I'll leave it to Ritchie to be the judge here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I left a few comments, I'll leave the proper review to Ritchie or Orson.
0bf53a1
to
dfd6b66
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstrings could use a bit more love (you can probably copy paste some stuff from slice
) and a doc entry is missing in the API reference.
If you could address this, I'll approve and leave it to someone to judge the Rust side of things.
36b313f
to
a2626a2
Compare
a2626a2
to
2df1b08
Compare
@stinodego I've updated the docstrings with more detail and more examples. I cannot for the life of me figure out why the CI doctest is failing with an "unexpected indentation" error. My doctests pass fine locally and I can't determine which part is causing the error. I do note that when I run code locally, Series show 8 spaces of indentation: >>> import polars as pl
>>> s = pl.Series(["pear", None, "papaya", "dragonfruit"])
>>> s.str.head(-3)
shape: (4,)
Series: '' [str]
[
"p"
null
"pap"
"dragonfr"
] And the examples in Edit: I suspect this is the case, as my local doctest does not complain. I've reduced to 4 and we'll see how that fares. |
It's probably an unclosed backtick. I can take a look. There are some issues with the docstring formatting anyway that I can see won't render. |
8af424a
to
cf53b6b
Compare
cf53b6b
to
0bc19f2
Compare
0bc19f2
to
788484f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All right, Python side looks good to me. Leaving the Rust side review to @ritchie46
When can it be used in rust? |
It needs to be approved first. @ritchie46 would you mind taking a look? |
I hope to get to this today. I am a bit worried about sliced that are within char boundaries. |
Do we have reason not to trust |
(_, 1) => { | ||
// SAFETY: `n` was verified to have at least 1 element. | ||
let n = unsafe { n.get_unchecked(0) }; | ||
unary_elementwise(ca, |str_val| head_binary(str_val, n)).with_name(ca.name()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be changed, string head/tail must be defined in terms of codepoints, not bytes! Otherwise you get illegal UTF-8 and general nonsense. Please change this and add a test-case that tests this, for example:
import polars as pl
df = pl.DataFrame({"s": ["你好世界"]})
head = pl.DataFrame({"s": ["你好"]})
tail = pl.DataFrame({"s": ["世界"]})
assert_frame_equal(df.select(pl.col.s.str.head(2)), head)
assert_frame_equal(df.select(pl.col.s.str.tail(2)), tail)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @orlp, may have to make a few changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI I am noticing that str.slice
does not properly index codepoints with negative indexes. Using your example:
s = pl.Series(["你好世界"])
tail = "界"
s.str.slice(-1) # should be equivalent to "tail"
# shape: (1,)
# Series: '' [str]
# [
# ""
# ]
I'll see if I can address this as a separate issue once I have finished with this one. Edit: opened #15136.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orlp the new impl respects code points instead of bytes. I added some specific code point tests using your example. Let me know if anything looks off to you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #14425 +/- ##
==========================================
+ Coverage 81.14% 81.15% +0.01%
==========================================
Files 1363 1363
Lines 175282 175408 +126
Branches 2527 2527
==========================================
+ Hits 142236 142360 +124
- Misses 32568 32571 +3
+ Partials 478 477 -1 ☔ View full report in Codecov by Sentry. |
(_, 1) => { | ||
// SAFETY: `n` was verified to have at least 1 element. | ||
let n = unsafe { n.get_unchecked(0) }; | ||
unary_elementwise(ca, |str_val| head_binary(str_val, n)).with_name(ca.name()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright. Thanks @mcrumiller. Sorry for the delay on this one.
Resolves #10337.