Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CoreLib] String.splitMap #59686

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

drxddy
Copy link
Contributor

@drxddy drxddy commented Dec 8, 2024

Introduces a new public method String.splitMap to convert a string into a generic iterable, inspired by String.splitMapJoin, this new method uses the same parameters and returns Iterable<T>

Closes #40072

It can be convenient while dealing with string mappings like in this case I'm converting a text into List of TextSpan widgets

final linkRegex =  RegExp( r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)');
final result = 'Here is a [link](https://dart.dev) to the dart website'
    .splitMap<TextSpan>(
       linkRegex,
       onMatch: (m) => TextSpan(
         text: m[0]!,
         style: TextStyle(color: Colors.blue),
         recognizer: TapGestureRecognizer()
                ..onTap = () {
                  launchUrl(m[0]!);
                },
       ),
       onNonMatch: (n) => TextSpan(text: n),
   );

This is a generic function signature of the newly added method

 Iterable<T> splitMap<T>(
    Pattern pattern, {
    T Function(Match match)? onMatch, 
    T Function(String nonMatch)? onNonMatch,
  });

Introduced a new public method splitMap to convert a string into a generic iterable,
inspired by splitMapJoin, this new method uses the same parameters and returns Iterable<T>

Closes dart-lang#40072
Copy link

Thank you for your contribution! This project uses Gerrit for code reviews. Your pull request has automatically been converted into a code review at:

https://dart-review.googlesource.com/c/sdk/+/399620

Please wait for a developer to review your code review at the above link; you can speed up the review if you sign into Gerrit and manually add a reviewer that has recently worked on the relevant code. See CONTRIBUTING.md to learn how to upload changes to Gerrit directly.

Additional commits pushed to this PR will update both the PR and the corresponding Gerrit CL. After the review is complete on the CL, your reviewer will merge the CL (automatically closing this PR).

@drxddy drxddy force-pushed the feat/generic-string-split-map branch from cb744e1 to 60eab33 Compare December 8, 2024 17:17
Copy link

https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request.

1 similar comment
Copy link

https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request.

@drxddy drxddy force-pushed the feat/generic-string-split-map branch from 60eab33 to f809105 Compare December 8, 2024 17:28
Copy link

https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request.

4 similar comments
Copy link

https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request.

Copy link

https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request.

Copy link

https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request.

Copy link

https://dart-review.googlesource.com/c/sdk/+/399620 has been updated with the latest commits from this pull request.

@ArchishmanSengupta
Copy link

thanks for this

/// onNonMatch: (n) => TextSpan(text: n),
/// );
/// ```
Iterable<T> splitMap<T>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes on API design here: there is no sensible way to make an arbitrary T out of String. I noticed some implementation do as, some even do unsafeCast (which is incorrect - that would cause all kinds of issues). However this will only work for T being a super type of String. I think that indicates that onMatch and onNonMatch can't be optional and have to be required.

I can see that this method is a parallel to an existing splitMapJoin - but for that method at least there is a meaningful default for onMatch and onNonMatch. I also think on* names are badly chosen - they should have been something like transformMatch, transformSeparator or something like that... But the ship has sailed on this once we added splitMapJoin.

To be completely honest I would have probably designed this completely different producing the Iterable of matches and letting the caller apply a map (or do whatever):

Iterable<(Match match, {bool matched})> tokenize(Pattern pattern);

Then the usage becomes:

final result = 'Here is a [link](https://dart.dev) to the dart website'
    .tokenize(
      RegExp(
          r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'),
    )
    .map((tok) => switch (tok) {
          (matched: true, m) => TextSpan(
              text: m[0]!,
              style: TextStyle(color: Colors.blue),
            ),
          (matched: false, m) => TextSpan(m[0]!),
        })
    .toList();

Anyway, just some highlevel thoughts. It is @lrhn's call to make around the design. I can see that argument of just maintaining the consistency with an existing API (even if we think the existing API is maybe not well thought out).

PS. If we decide to go this route we should consider implementing splitMapJoin on top of splitMap to avoid duplicated code for no good reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the input @mraleph,

yes i should have made onMatch and onNonMatch required as there is no proper default case when they are null, as per the naming conventions, i just wanted to keep them consistent and similar to splitMapJoin

let me know if any api design changes are required here @lrhn

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't look too much to splitMapJoin for design. If this matches, that's fine. If not, this should be its own method. (And the name was never great.)

It's not completely trivial to combine the "match" and "non-match" into a single iterable.
The bigger issue is that a non-match is not a Match. We could wrap the substring in a Match (and it might or might not be worth it memory-wise if we don't need to allocate the substring, but 99% of the time, the user will just do [0]! on it anyway) but then we have to choose a Match.pattern for it, which will be spurious since it's not actually the result of a pattern match. I guess we can use the string itself, so a non-match substring s becomes _StringMatch(s, s, prevEnd, start). It's still a little iffy, and it makes it harder to distinguish a match from a non-match. Wrapping that into a pair with a bool is just not my kind of API.

It's more like a heterogenous iterable, an iterable of a (distinguishable) union.
Then it makes a kind of sense to have a map operation that handles each part independently, so the onMatch/onNonMatch operations make sense. I'm fine with the names, and with onNonMatch taking a String instead of a Match.

I agree that the parameters being optional is iffy.
The behavior if a parameter is omitted should be to emit nothing for that case, so

  s.splitMap<int>(p, onNonMatch: (string) => string.length)

would emit nothing for matches.

That's a suspicious API, though, because s.splitMap(p, onMatch: f) is just the same as p.allMatches(s).map(f), so there is no real need for the new API. And s.splitMap(p) is really just Itearble.empty.
So the onNonMatch should probably really be required. If you don't need it, you should use another function.
Then it feels a little odd that onMatch is optional. And s.splitMap(p, onNonMatch: f) is just s.aplit(p).map(f), at least if split had properly been lazy. (It really should be.)

We could have a .nonMatches(pattern) that gets only the parts between matches if that's what you need,
and then have .splitMap(pattern, onMatch, onNonMatch) with both callback parameters beging required (and positional).

So:

class String ... {
   /// ...
   Iterable<T> splitMap<T>(Pattern pattern, T Function(Match) onMatch, T Function(String) onNonMatch) {
     // ...
   }
}

Then one can question whether you'd want the start/end positions of the non-match, so we should make it a Match object (with itself as pattern, or something), or introduce a StringSlice that is a lazy substring of another string. (A Match object has four references, the pattern, the string, and the start and end integers, a string slice would have three, both may be more effort and memory to allocate than just the substring, depending on its length.)

That all said, I think I would go for the "Match object for non-matches" approach anyway.
It has the advantage of giving you the positions of the non-matches, which may be important
information. And then it can be on Iterable<Match>, you can recognize the kind of matches by
whether their pattern is the pattern you split on.

I'd probably put it on Pattern instead, since it's a way to create Matches, which is what Pattern does, not what String does. The String operations just create more strings.

And I'd consider having different versions depending on whether I want a match for empty non-matches.

Something like:

extension PatternNonMatches on Pattern {
  /// The [Match]es of matching this pattern against [string], and non-empty strings between them.
  ///
  /// For each [Match] of [allMatches] called with [string] and [start] as arguments,
  /// if the match starts later than where the previous match ended (or [start] for the first match), 
  /// first emit a "non-matched substring" [Match] object representing the substring between
  /// the two matches.
  /// This [Match] object has no captures, a [Match.start] at the end of the previous match,
  /// and a [Match.end] at the start of the new match from [allMatches], the [string] as [Match.source]
  /// and the matched substring also as the [Match.pattern].
  /// Then the new match is emitted as the next element.
  /// If the last match found by [allMatches] does not end at the end of [string], a [Match] object
  /// for the substring from the end of the last match to the end of [string] is emitted at the end.
  ///
  /// If [allMatches] has no match, and [start] is less than the length of [string], a single
  /// [Match] is emitted for the substring of [string] from [start] to its end.
  ///
  /// The non-matched substring `Match` objects can be recognized by their [Match.pattern]
  /// not being this pattern. _(If this pattern is a string, there won't be a non-matched substring
  /// with that string as pattern, because that substring would have been matched.)_
  ///
  /// Notice: Matches and non-matches are not guaranteed to alternate, nor is the first and last
  /// emitted [Match] object guaranteed to be a non-match. Only non-empty substrings before,
  /// between or after matches are emitted, a non-match `Match` object will never be the empty string.
  Iterable<Match> allMatchesAndNonMatches(String string, [int start = 0]) { ... }

  /// The [Match]es of matching this pattern against [string], and the strings between them.
  ///
  /// For each [Match] of [allMatches] called with [string] and [start] as arguments,
  /// first emit a "non-matched substring" [Match] object representing the substring between
  /// the end of the prior match (or [start] for the first match) and the start of the new match.
  /// This [Match] object has no captures, a [Match.start] at the end of the previous match,
  /// and a [Match.end] at the start of the new match from [allMatches], the [string] as [Match.source]
  /// and the matched substring also as the [Match.pattern].
  /// Then the new match is emitted as the next element.
  /// A non-matched substring [Match] object is also emitted after the last match, for the substring
  /// from the end of the last match to the end of [string].
  ///
  /// If [allMatches] has no match, a single non-matched substring [Match] is emitted for
  /// the substring of [string] from [start] to its end.
  ///
  /// The non-matched substring `Match` objects can be recognized by their [Match.pattern]
  /// not being this pattern. _(If this pattern is a string, there won't be a non-matched substring
  /// with that string as pattern, because that substring would have been matched.)_
  ///
  /// Matches and non-matches are guaranteed to alternate, with the first and last emitted [Match]
  /// objects guaranteed to be a non-match substring. The emitted [Match] objects are always
  /// consecutive, with the [Match.start] of each being the [Match.end] of the prior one.
  /// A non-match `Match` object may be for the empty string.
  Iterable<Match> allMatchesAndSeparators(String string, [int start = 0]) { ... }

  /// The non-empty substrings before, between and after matches in [string].
  ///
  /// Finds [allMatches] of [string] starting at [start], and emits a "non-mached substring" [Match]
  /// object for any non-empty substring before (between [start] and the start of the first match),
  /// between (the [Match.end] of a previous match and [Match.start] of the next match) or after
  /// (between [Match.end] of the last match and the end of [string]) that wasn't matched by
  /// this pattern.
  ///
  /// The non-matched substring [Match] object has [string] as [Match.source], 
  /// [Match.start] and [Match.end] being the boundaries of the surrounding matches,
  /// and [Match.pattern] set to the same substring as the matched string (`match[0]!`).
  ///
  /// May not emit a substring between any two matches, only when there is a non-empty
  /// substring between the matches. Will never emit an empty match.
  ///
  /// If [string] has no matches of this patten, and has a length greater than [start], 
  /// a single non-matched substring `Match` object is emitted for that entire substring.
  ///
  /// If you need both the matches and the non-matches, using [allMatchesAndNonMatches]
  /// or [allMatchesAndSeparators] instead, rather than iterating both this and [allMatches]
  /// in parallel.
  Iterable<Match> nonMatches(String string, [int start = 0])

  /// The substrings before, between and after matches in [string].
  ///
  /// Finds [allMatches] of [string] starting at [start], and emits a "non-mached substring" [Match]
  /// object for any substring before (between [start] and the start of the first match),
  /// between (the [Match.end] of a previous match and [Match.start] of the next match) or after
  /// (between [Match.end] of the last match and the end of [string]).
  ///
  /// The non-matched substring [Match] object has [string] as [Match.source], 
  /// [Match.start] and [Match.end] being the boundaries of the surrounding matches,
  /// and [Match.pattern] set to the same substring as the matched string (`match[0]!`).
  ///
  /// If [string] has no matches of this patten,
  /// a single non-matched substring `Match` object is emitted for that entire substring
  /// starting at [start].
  ///
  /// Will always emit one match more than [allMatches] finds, and may emit empty 
  /// non-match substrings.
  ///
  /// If you need both the matches and the non-matches, using [allMatchesAndNonMatches]
  /// or [allMatchesAndSeparators] instead, rather than iterating both this and [allMatches]
  /// in parallel.
  Iterable<Match> separators(String string, [int start = 0]) { ... }
}

This can all be backed by the same general implementation. (I'd combine the nonMatches and separators into one with a {bool allowEmpty = false} if we could have both optional positional and named parameters on the same function. But we still can't.)

Copy link
Member

@lrhn lrhn Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But open to arguments for why splitMap would be better in some cases, it's not a bad or inconsistent design if the parameters are required.

Have to decide whether to call onNonMatch for empty substrings between adjacent matches or not, and for empty substrings before the first match or after the last match.
I can see uses for both. But probably more for the "separators" style where you know you get a match between elements, but no sure you want something before the first match or after the last match then.

Heck, could do:

Iterable<T> splitMap<T>(
    Pattern pattern, T Function(Match) onMatch, T Function(String) onNonMatch, {
    bool emptyLead = false, bool emptyTail = false, bool emptySeparator = false}) {...}

where you can decide whether an empty substring before, after or between matches should call onNonEmpty. We can even throw in an emptyMatch to not get called for empty matches too.

All possible design points. Might be worth looking at use-cases to see what's actually wanted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohk, thanks for taking me through your thinking process over many design considerations, I should have discussed this on the #40072 thread before doing any implementations, but then I thought to myself Talk is cheap, let me submit diffs first as a conversation starter

Agree that having Match objects for onNonMatch gives useful positional information for non-matches, and lets us combine the output into a single Iterable<Match>, but then experienced users can do pattern.allMatches(String) and get that anyway,

My goal and motivation for String.splitMap.

I implemented this method for beginners who are getting started with Dart, as one of the first things they learn is the String class and the various methods and what they let them do,

As it happened to me, I stumbled on String.splitMapJoin by just exploring LSP options in the String class, and I was surprised by how useful and creative it was, and it kind of introduced me to explore more Pattern and other useful but underrated classes.

So I see this method as a way to give an easily accessible way (as I said, having this method on String instead of Pattern will be very impactful for new users b/c it's one of the first things they learn, this can be a great way to introduce them to more complex use cases to with Pattern)

Apart from that, I'm open to and agree on anything technical side of the API like,

  • the Iterator should be lazy and not eagerly load all elements at once,
  • helpers should be private
  • onMatch(Match) and onNonMatch(String) should be required and named

I would like to continue this method if you agree on my goal & motivation of introducing as mentioned above

@mraleph mraleph requested a review from lrhn December 9, 2024 08:20
@mraleph
Copy link
Member

mraleph commented Dec 9, 2024

cc @lrhn for review.

return stringSplitMapUnchecked(this, pattern, onMatch, onNonMatch);
}

Iterable<T> stringSplitMapUnchecked<T>(String receiver, Pattern pattern,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just style comments, not sure this is the design we'll do.)
Should be private. All helper functions should.

Iterable<T> stringSplitMapUnchecked<T>(String receiver, Pattern pattern,
T Function(Match)? onMatch, T Function(String)? onNonMatch) {
onMatch ??= (match) => match[0]! as T;
onNonMatch ??= (string) => string as T;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not sound. If the parameters are optional, I'd emit nothing for a match/non-match if onMatch/onNonMatch is null.
(And I'd probably make them named parameters today.)

if (pattern.isEmpty) {
return stringSplitEmptyMapUnchecked(receiver, onMatch, onNonMatch);
}
List<T> result = <T>[];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be lazy, it shouldn't build a List eagerly and return it as an Iterable.

result.add(onNonMatch(""));
while (i < length) {
result.add(onMatch(StringMatch(i, receiver, "")));
// Special case to avoid splitting a surrogate pair.
Copy link
Member

@lrhn lrhn Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A string.split("") would split surrogate pairs. So would string.split(RegExp(r'(?:)')) without the unicode:true flag.

(It'd be nice if it didn't, but then it'd be even niceer if it didn't break grapheme clusters either.)

(I can see that splitMapJoin does the same thing. Probably my bad, it shouldn't have.)

startIndex = match.end;
}
result.add(onNonMatch(receiver.substring(startIndex)));
return result;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Identical to stringSplitJSRegExpMapUnchecked, so no need to special case for JSRegExp.

Match match = matches.current;
if (match.start == _index) {
_current = _onMatch(match);
_index = match.end;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can cause infinite loop if match.start == match.end. Empty matches can happen.

/// ```dart
/// final result = 'Here is a [link](https://dart.dev) to the dart website'
/// .splitMap<TextSpan>(
/// RegExp( r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe A little too complicated a RegExp for an example. And too long for one line.
I don't expect someone to be able to read and understand this. (I don't expect anyone to be able to read a RegExp that isn't extremely trivial.)

Also, as a style hint, since the code here doesn't use captures, the (...)s should be (?:...) to not waste time doing a capture. There is no need to escape / or + inside a character range, and the parentheses around [-a-z....=]* are unnecessary.
So the RegExp could be:

     RegExp(r'https?://(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b[-a-zA-Z0-9()@:%_+.~#?&/=]*'),

(Not sure why we want to accept () after the ., this looks like it would match https://dart.dev) including the ), but not including the ( before it.)

/// onNonMatch: (n) => TextSpan(text: n),
/// );
/// ```
Iterable<T> splitMap<T>(
Copy link
Member

@lrhn lrhn Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But open to arguments for why splitMap would be better in some cases, it's not a bad or inconsistent design if the parameters are required.

Have to decide whether to call onNonMatch for empty substrings between adjacent matches or not, and for empty substrings before the first match or after the last match.
I can see uses for both. But probably more for the "separators" style where you know you get a match between elements, but no sure you want something before the first match or after the last match then.

Heck, could do:

Iterable<T> splitMap<T>(
    Pattern pattern, T Function(Match) onMatch, T Function(String) onNonMatch, {
    bool emptyLead = false, bool emptyTail = false, bool emptySeparator = false}) {...}

where you can decide whether an empty substring before, after or between matches should call onNonEmpty. We can even throw in an emptyMatch to not get called for empty matches too.

All possible design points. Might be worth looking at use-cases to see what's actually wanted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

String.splitMap
5 participants