-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CoreLib] String.splitMap #59686
base: main
Are you sure you want to change the base?
[CoreLib] String.splitMap #59686
Conversation
Introduced a new public method splitMap to convert a string into a generic iterable, inspired by splitMapJoin, this new method uses the same parameters and returns Iterable<T> Closes dart-lang#40072
Thank you for your contribution! This project uses Gerrit for code reviews. Your pull request has automatically been converted into a code review at: https://dart-review.googlesource.com/c/sdk/+/399620 Please wait for a developer to review your code review at the above link; you can speed up the review if you sign into Gerrit and manually add a reviewer that has recently worked on the relevant code. See CONTRIBUTING.md to learn how to upload changes to Gerrit directly. Additional commits pushed to this PR will update both the PR and the corresponding Gerrit CL. After the review is complete on the CL, your reviewer will merge the CL (automatically closing this PR). |
cb744e1
to
60eab33
Compare
https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request. |
1 similar comment
https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request. |
60eab33
to
f809105
Compare
https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request. |
4 similar comments
https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request. |
https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request. |
https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request. |
https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request. |
thanks for this |
/// onNonMatch: (n) => TextSpan(text: n), | ||
/// ); | ||
/// ``` | ||
Iterable<T> splitMap<T>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some notes on API design here: there is no sensible way to make an arbitrary T
out of String
. I noticed some implementation do as
, some even do unsafeCast
(which is incorrect - that would cause all kinds of issues). However this will only work for T
being a super type of String
. I think that indicates that onMatch
and onNonMatch
can't be optional and have to be required.
I can see that this method is a parallel to an existing splitMapJoin
- but for that method at least there is a meaningful default for onMatch
and onNonMatch
. I also think on*
names are badly chosen - they should have been something like transformMatch
, transformSeparator
or something like that... But the ship has sailed on this once we added splitMapJoin
.
To be completely honest I would have probably designed this completely different producing the Iterable
of matches and letting the caller apply a map
(or do whatever):
Iterable<(Match match, {bool matched})> tokenize(Pattern pattern);
Then the usage becomes:
final result = 'Here is a [link](https://dart.dev) to the dart website'
.tokenize(
RegExp(
r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'),
)
.map((tok) => switch (tok) {
(matched: true, m) => TextSpan(
text: m[0]!,
style: TextStyle(color: Colors.blue),
),
(matched: false, m) => TextSpan(m[0]!),
})
.toList();
Anyway, just some highlevel thoughts. It is @lrhn's call to make around the design. I can see that argument of just maintaining the consistency with an existing API (even if we think the existing API is maybe not well thought out).
PS. If we decide to go this route we should consider implementing splitMapJoin
on top of splitMap
to avoid duplicated code for no good reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't look too much to splitMapJoin
for design. If this matches, that's fine. If not, this should be its own method. (And the name was never great.)
It's not completely trivial to combine the "match" and "non-match" into a single iterable.
The bigger issue is that a non-match is not a Match
. We could wrap the substring in a Match
(and it might or might not be worth it memory-wise if we don't need to allocate the substring, but 99% of the time, the user will just do [0]!
on it anyway) but then we have to choose a Match.pattern
for it, which will be spurious since it's not actually the result of a pattern match. I guess we can use the string itself, so a non-match substring s
becomes _StringMatch(s, s, prevEnd, start)
. It's still a little iffy, and it makes it harder to distinguish a match from a non-match. Wrapping that into a pair with a bool is just not my kind of API.
It's more like a heterogenous iterable, an iterable of a (distinguishable) union.
Then it makes a kind of sense to have a map
operation that handles each part independently, so the onMatch
/onNonMatch
operations make sense. I'm fine with the names, and with onNonMatch
taking a String
instead of a Match
.
I agree that the parameters being optional is iffy.
The behavior if a parameter is omitted should be to emit nothing for that case, so
s.splitMap<int>(p, onNonMatch: (string) => string.length)
would emit nothing for matches.
That's a suspicious API, though, because s.splitMap(p, onMatch: f)
is just the same as p.allMatches(s).map(f)
, so there is no real need for the new API. And s.splitMap(p)
is really just Itearble.empty
.
So the onNonMatch
should probably really be required. If you don't need it, you should use another function.
Then it feels a little odd that onMatch
is optional. And s.splitMap(p, onNonMatch: f)
is just s.aplit(p).map(f)
, at least if split
had properly been lazy. (It really should be.)
We could have a .nonMatches(pattern)
that gets only the parts between matches if that's what you need,
and then have .splitMap(pattern, onMatch, onNonMatch)
with both callback parameters beging required (and positional).
So:
class String ... {
/// ...
Iterable<T> splitMap<T>(Pattern pattern, T Function(Match) onMatch, T Function(String) onNonMatch) {
// ...
}
}
Then one can question whether you'd want the start/end positions of the non-match, so we should make it a Match
object (with itself as pattern, or something), or introduce a StringSlice
that is a lazy substring of another string. (A Match
object has four references, the pattern, the string, and the start and end integers, a string slice would have three, both may be more effort and memory to allocate than just the substring, depending on its length.)
That all said, I think I would go for the "Match
object for non-matches" approach anyway.
It has the advantage of giving you the positions of the non-matches, which may be important
information. And then it can be on Iterable<Match>
, you can recognize the kind of matches by
whether their pattern
is the pattern you split on.
I'd probably put it on Pattern
instead, since it's a way to create Match
es, which is what Pattern
does, not what String
does. The String
operations just create more strings.
And I'd consider having different versions depending on whether I want a match for empty non-matches.
Something like:
extension PatternNonMatches on Pattern {
/// The [Match]es of matching this pattern against [string], and non-empty strings between them.
///
/// For each [Match] of [allMatches] called with [string] and [start] as arguments,
/// if the match starts later than where the previous match ended (or [start] for the first match),
/// first emit a "non-matched substring" [Match] object representing the substring between
/// the two matches.
/// This [Match] object has no captures, a [Match.start] at the end of the previous match,
/// and a [Match.end] at the start of the new match from [allMatches], the [string] as [Match.source]
/// and the matched substring also as the [Match.pattern].
/// Then the new match is emitted as the next element.
/// If the last match found by [allMatches] does not end at the end of [string], a [Match] object
/// for the substring from the end of the last match to the end of [string] is emitted at the end.
///
/// If [allMatches] has no match, and [start] is less than the length of [string], a single
/// [Match] is emitted for the substring of [string] from [start] to its end.
///
/// The non-matched substring `Match` objects can be recognized by their [Match.pattern]
/// not being this pattern. _(If this pattern is a string, there won't be a non-matched substring
/// with that string as pattern, because that substring would have been matched.)_
///
/// Notice: Matches and non-matches are not guaranteed to alternate, nor is the first and last
/// emitted [Match] object guaranteed to be a non-match. Only non-empty substrings before,
/// between or after matches are emitted, a non-match `Match` object will never be the empty string.
Iterable<Match> allMatchesAndNonMatches(String string, [int start = 0]) { ... }
/// The [Match]es of matching this pattern against [string], and the strings between them.
///
/// For each [Match] of [allMatches] called with [string] and [start] as arguments,
/// first emit a "non-matched substring" [Match] object representing the substring between
/// the end of the prior match (or [start] for the first match) and the start of the new match.
/// This [Match] object has no captures, a [Match.start] at the end of the previous match,
/// and a [Match.end] at the start of the new match from [allMatches], the [string] as [Match.source]
/// and the matched substring also as the [Match.pattern].
/// Then the new match is emitted as the next element.
/// A non-matched substring [Match] object is also emitted after the last match, for the substring
/// from the end of the last match to the end of [string].
///
/// If [allMatches] has no match, a single non-matched substring [Match] is emitted for
/// the substring of [string] from [start] to its end.
///
/// The non-matched substring `Match` objects can be recognized by their [Match.pattern]
/// not being this pattern. _(If this pattern is a string, there won't be a non-matched substring
/// with that string as pattern, because that substring would have been matched.)_
///
/// Matches and non-matches are guaranteed to alternate, with the first and last emitted [Match]
/// objects guaranteed to be a non-match substring. The emitted [Match] objects are always
/// consecutive, with the [Match.start] of each being the [Match.end] of the prior one.
/// A non-match `Match` object may be for the empty string.
Iterable<Match> allMatchesAndSeparators(String string, [int start = 0]) { ... }
/// The non-empty substrings before, between and after matches in [string].
///
/// Finds [allMatches] of [string] starting at [start], and emits a "non-mached substring" [Match]
/// object for any non-empty substring before (between [start] and the start of the first match),
/// between (the [Match.end] of a previous match and [Match.start] of the next match) or after
/// (between [Match.end] of the last match and the end of [string]) that wasn't matched by
/// this pattern.
///
/// The non-matched substring [Match] object has [string] as [Match.source],
/// [Match.start] and [Match.end] being the boundaries of the surrounding matches,
/// and [Match.pattern] set to the same substring as the matched string (`match[0]!`).
///
/// May not emit a substring between any two matches, only when there is a non-empty
/// substring between the matches. Will never emit an empty match.
///
/// If [string] has no matches of this patten, and has a length greater than [start],
/// a single non-matched substring `Match` object is emitted for that entire substring.
///
/// If you need both the matches and the non-matches, using [allMatchesAndNonMatches]
/// or [allMatchesAndSeparators] instead, rather than iterating both this and [allMatches]
/// in parallel.
Iterable<Match> nonMatches(String string, [int start = 0])
/// The substrings before, between and after matches in [string].
///
/// Finds [allMatches] of [string] starting at [start], and emits a "non-mached substring" [Match]
/// object for any substring before (between [start] and the start of the first match),
/// between (the [Match.end] of a previous match and [Match.start] of the next match) or after
/// (between [Match.end] of the last match and the end of [string]).
///
/// The non-matched substring [Match] object has [string] as [Match.source],
/// [Match.start] and [Match.end] being the boundaries of the surrounding matches,
/// and [Match.pattern] set to the same substring as the matched string (`match[0]!`).
///
/// If [string] has no matches of this patten,
/// a single non-matched substring `Match` object is emitted for that entire substring
/// starting at [start].
///
/// Will always emit one match more than [allMatches] finds, and may emit empty
/// non-match substrings.
///
/// If you need both the matches and the non-matches, using [allMatchesAndNonMatches]
/// or [allMatchesAndSeparators] instead, rather than iterating both this and [allMatches]
/// in parallel.
Iterable<Match> separators(String string, [int start = 0]) { ... }
}
This can all be backed by the same general implementation. (I'd combine the nonMatches
and separators
into one with a {bool allowEmpty = false}
if we could have both optional positional and named parameters on the same function. But we still can't.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But open to arguments for why splitMap
would be better in some cases, it's not a bad or inconsistent design if the parameters are required.
Have to decide whether to call onNonMatch
for empty substrings between adjacent matches or not, and for empty substrings before the first match or after the last match.
I can see uses for both. But probably more for the "separators" style where you know you get a match between elements, but no sure you want something before the first match or after the last match then.
Heck, could do:
Iterable<T> splitMap<T>(
Pattern pattern, T Function(Match) onMatch, T Function(String) onNonMatch, {
bool emptyLead = false, bool emptyTail = false, bool emptySeparator = false}) {...}
where you can decide whether an empty substring before, after or between matches should call onNonEmpty
. We can even throw in an emptyMatch
to not get called for empty matches too.
All possible design points. Might be worth looking at use-cases to see what's actually wanted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohk, thanks for taking me through your thinking process over many design considerations, I should have discussed this on the #40072 thread before doing any implementations, but then I thought to myself Talk is cheap, let me submit diffs first as a conversation starter
Agree that having Match
objects for onNonMatch
gives useful positional information for non-matches, and lets us combine the output into a single Iterable<Match>
, but then experienced users can do pattern.allMatches(String)
and get that anyway,
My goal and motivation for String.splitMap
.
I implemented this method for beginners who are getting started with Dart, as one of the first things they learn is the String
class and the various methods and what they let them do,
As it happened to me, I stumbled on String.splitMapJoin
by just exploring LSP options in the String
class, and I was surprised by how useful and creative it was, and it kind of introduced me to explore more Pattern
and other useful but underrated classes.
So I see this method as a way to give an easily accessible way (as I said, having this method on String
instead of Pattern
will be very impactful for new users b/c it's one of the first things they learn, this can be a great way to introduce them to more complex use cases to with Pattern
)
Apart from that, I'm open to and agree on anything technical side of the API like,
- the Iterator should be lazy and not eagerly load all elements at once,
- helpers should be private
onMatch(Match)
andonNonMatch(String)
should be required and named
I would like to continue this method if you agree on my goal & motivation of introducing as mentioned above
cc @lrhn for review. |
return stringSplitMapUnchecked(this, pattern, onMatch, onNonMatch); | ||
} | ||
|
||
Iterable<T> stringSplitMapUnchecked<T>(String receiver, Pattern pattern, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Just style comments, not sure this is the design we'll do.)
Should be private. All helper functions should.
Iterable<T> stringSplitMapUnchecked<T>(String receiver, Pattern pattern, | ||
T Function(Match)? onMatch, T Function(String)? onNonMatch) { | ||
onMatch ??= (match) => match[0]! as T; | ||
onNonMatch ??= (string) => string as T; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely not sound. If the parameters are optional, I'd emit nothing for a match/non-match if onMatch
/onNonMatch
is null
.
(And I'd probably make them named parameters today.)
if (pattern.isEmpty) { | ||
return stringSplitEmptyMapUnchecked(receiver, onMatch, onNonMatch); | ||
} | ||
List<T> result = <T>[]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be lazy, it shouldn't build a List
eagerly and return it as an Iterable.
result.add(onNonMatch("")); | ||
while (i < length) { | ||
result.add(onMatch(StringMatch(i, receiver, ""))); | ||
// Special case to avoid splitting a surrogate pair. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A string.split("")
would split surrogate pairs. So would string.split(RegExp(r'(?:)'))
without the unicode:true
flag.
(It'd be nice if it didn't, but then it'd be even niceer if it didn't break grapheme clusters either.)
(I can see that splitMapJoin
does the same thing. Probably my bad, it shouldn't have.)
startIndex = match.end; | ||
} | ||
result.add(onNonMatch(receiver.substring(startIndex))); | ||
return result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Identical to stringSplitJSRegExpMapUnchecked
, so no need to special case for JSRegExp
.
Match match = matches.current; | ||
if (match.start == _index) { | ||
_current = _onMatch(match); | ||
_index = match.end; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can cause infinite loop if match.start == match.end
. Empty matches can happen.
/// ```dart | ||
/// final result = 'Here is a [link](https://dart.dev) to the dart website' | ||
/// .splitMap<TextSpan>( | ||
/// RegExp( r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe A little too complicated a RegExp for an example. And too long for one line.
I don't expect someone to be able to read and understand this. (I don't expect anyone to be able to read a RegExp that isn't extremely trivial.)
Also, as a style hint, since the code here doesn't use captures, the (...)
s should be (?:...)
to not waste time doing a capture. There is no need to escape /
or +
inside a character range, and the parentheses around [-a-z....=]*
are unnecessary.
So the RegExp could be:
RegExp(r'https?://(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b[-a-zA-Z0-9()@:%_+.~#?&/=]*'),
(Not sure why we want to accept ()
after the .
, this looks like it would match https://dart.dev)
including the )
, but not including the (
before it.)
/// onNonMatch: (n) => TextSpan(text: n), | ||
/// ); | ||
/// ``` | ||
Iterable<T> splitMap<T>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But open to arguments for why splitMap
would be better in some cases, it's not a bad or inconsistent design if the parameters are required.
Have to decide whether to call onNonMatch
for empty substrings between adjacent matches or not, and for empty substrings before the first match or after the last match.
I can see uses for both. But probably more for the "separators" style where you know you get a match between elements, but no sure you want something before the first match or after the last match then.
Heck, could do:
Iterable<T> splitMap<T>(
Pattern pattern, T Function(Match) onMatch, T Function(String) onNonMatch, {
bool emptyLead = false, bool emptyTail = false, bool emptySeparator = false}) {...}
where you can decide whether an empty substring before, after or between matches should call onNonEmpty
. We can even throw in an emptyMatch
to not get called for empty matches too.
All possible design points. Might be worth looking at use-cases to see what's actually wanted.
Introduces a new public method
String.splitMap
to convert a string into a generic iterable, inspired byString.splitMapJoin
, this new method uses the same parameters and returnsIterable<T>
Closes #40072
It can be convenient while dealing with string mappings like in this case I'm converting a text into List of
TextSpan
widgetsThis is a generic function signature of the newly added method