performance improvements:

chocolateboy · chocolateboy · commit 420a210d3273 · 2020-08-22T13:50:10.000+01:00
- delegate to String#split where possible
- use a regular class for Split rather than values.rb
- create Split objects directly rather than allocating intermediate
  hashes
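
The second and third points can be illustrated with a minimal sketch. It assumes a plain Ruby class built with keyword arguments plus an `update!` method, mirroring how the diff below calls `Split.new(...)` and `split.update!(count: ..., index: ...)`. The class name `SketchSplit` is hypothetical, and the actual contents of `lib/string_splitter/split.rb` are not shown on this page.

```ruby
# Hypothetical stand-in for the Split class; the real
# lib/string_splitter/split.rb is not shown here.
class SketchSplit
  attr_accessor :captures, :count, :index, :lhs, :rhs, :separator

  def initialize(captures:, lhs:, rhs:, separator:)
    @captures = captures
    @lhs = lhs
    @rhs = rhs
    @separator = separator
  end

  # fill in the fields which are only known once all splits are collected,
  # mutating the object in place rather than merging into a new hash
  def update!(count:, index:)
    @count = count
    @index = index
  end

  # 1-based position relative to the start
  def position
    index + 1
  end

  # 0-based index relative to the end
  def rindex
    count - position
  end

  # 1-based position relative to the end
  def rposition
    count + 1 - position
  end
end

split = SketchSplit.new(captures: [], lhs: 'foo', rhs: 'bar', separator: ':')
split.update!(count: 5, index: 0)
p [split.position, split.rindex, split.rposition] # => [1, 4, 5]
```

Because the object is created once and then updated, no intermediate hash is allocated per split, which is the saving the commit message describes.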
diff --git a/.gitignore b/.gitignore
@@ -59,3 +59,4 @@ Gemfile.lock
 /dev/
 temp.*
 /*.rb
+/profile.*
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,12 @@
+## 0.7.1 - TBD
+
+#### Changes
+
+- performance improvements
+  - use `String#split` where possible
+  - use a regular class for Split rather than values.rb
+  - create Split objects directly rather than allocating intermediate hashes
+
 ## 0.7.0 - 2020-08-21
 
 #### Breaking Changes
diff --git a/Gemfile b/Gemfile
@@ -5,6 +5,9 @@ source 'https://rubygems.org'
 unless ENV['CI']
   group :development do
     gem 'irb', '~> 1.2' # XXX work around Arch Linux's broken ruby packaging
+
+    # "2.3-compatible analysis was dropped after version 0.81."
+    # gem 'rubocop', '0.81'
     gem 'rubocop', '~> 0.89'
   end
 end
diff --git a/README.md b/README.md
@@ -11,7 +11,7 @@
 - [DESCRIPTION](#description)
 - [WHY?](#why)
 - [CAVEATS](#caveats)
-  - [Differences from String#split](#differences-from-string%23split)
+  - [Differences from String#split](#differences-from-stringsplit)
 - [COMPATIBILITY](#compatibility)
 - [VERSION](#version)
 - [SEE ALSO](#see-also)
@@ -130,7 +130,7 @@ end
 Many languages have built-in `split` functions/methods for strings. They behave
 similarly (notwithstanding the occasional
 [surprise](https://chriszetter.com/blog/2017/10/29/splitting-strings/)), and
-handle a few common cases e.g.:
+handle a few common cases, e.g.:
 
 * limiting the number of splits
 * including the separator(s) in the results
@@ -140,7 +140,7 @@ But, because the API is squeezed into two overloaded parameters (the delimiter
 and the limit), achieving the desired results can be tricky. For instance,
 while `String#split` removes empty trailing fields (by default), it provides no
 way to remove *all* empty fields. Likewise, the cramped API means there's no
-way to e.g. combine a limit (positive integer) with the option to preserve
+way to, e.g., combine a limit (positive integer) with the option to preserve
 empty fields (negative integer), or use backreferences in a delimiter pattern
 without including its captured subexpressions in the result.
 
@@ -192,7 +192,7 @@ to a regex or a full-blown parser.
 As an example, the nominally unstructured output of many Unix commands is often
 formatted in a way that's tantalizingly close to being
 [machine-readable](https://en.wikipedia.org/wiki/Delimiter-separated_values),
-apart from a few pesky exceptions e.g.:
+apart from a few pesky exceptions, e.g.:
 
 ```bash
 $ ls -l
@@ -205,7 +205,7 @@ drwxr-xr-x 3 user users 4096 Jun 19 22:56 lib
 ```
 
 These lines can *almost* be parsed into an array of fields by splitting them on
-whitespace. The exception is the date (columns 6-8) i.e.:
+whitespace. The exception is the date (columns 6-8), i.e.:
 
 ```ruby
 line = "-rw-r--r-- 1 user users   87 Jun 18 18:16 CHANGELOG.md"
@@ -224,15 +224,15 @@ instead of:
 ["-rw-r--r--", "1", "user", "users", "87", "Jun 18 18:16", "CHANGELOG.md"]
 ```
 
-One way to work around this is to parse the whole line e.g.:
+One way to work around this is to parse the whole line, e.g.:
 
 ```ruby
 line.match(/^(\S+) \s+ (\d+) \s+ (\S+) \s+ (\S+) \s+ (\d+) \s+ (\S+ \s+ \d+ \s+ \S+) \s+ (.+)$/x)
 ```
 
 But that requires us to specify *everything*. What we really want is a version
 of `split` which allows us to veto splitting for the 6th and 7th delimiters
-(and to stop after the 8th delimiter) i.e. control over which splits are
+(and to stop after the 8th delimiter), i.e. control over which splits are
 accepted, rather than being restricted to the single, baked-in strategy
 provided by the `limit` parameter.
 
@@ -258,7 +258,7 @@ ss.split(line, at: [1..5, 8])
 ## Differences from String#split
 
 Unlike `String#split`, StringSplitter doesn't trim the string before splitting
-(with `String#strip`) if the delimiter is omitted or a single space, e.g.:
+if the delimiter is omitted or a single space, e.g.:
 
 ```ruby
 " foo bar baz ".split          # => ["foo", "bar", "baz"]
diff --git a/Rakefile b/Rakefile
@@ -18,6 +18,6 @@ task :console do
 end
 
 # FIXME this runs after the release!
-task :release => %i[rubocop test]
+task :release => %i[lint test]
 
 task :default => :test
diff --git a/lib/string_splitter.rb b/lib/string_splitter.rb
@@ -1,8 +1,8 @@
 # frozen_string_literal: true
 
 require 'set'
-require 'values'
 
+require_relative 'string_splitter/split'
 require_relative 'string_splitter/version'
 
 # This class extends the functionality of +String#split+ by:
@@ -17,9 +17,10 @@
 # These enhancements allow splits to handle many cases that otherwise require bigger
 # guns, e.g. regex matching or parsing.
 #
-# Implementation-wise, we split the string with a scanner which works in a similar
-# way to +String#split+ and parse the resulting tokens into an array of Split objects
-# with the following fields:
+# Implementation-wise, we split the string either with String#split, or with a custom
+# scanner if the delimiter may contain captures (since String#split doesn't handle
+# them correctly), and parse the resulting tokens into an array of Split objects with
+# the following attributes:
 #
 #   - captures:  separator substrings captured by parentheses in the delimiter pattern
 #   - count:     the number of splits
@@ -43,42 +44,6 @@ class StringSplitter
   DEFAULT_DELIMITER = /\s+/.freeze
   REMOVE = [].freeze
 
-  Split = Value.new(:captures, :count, :index, :lhs, :rhs, :separator) do
-    def position
-      index + 1
-    end
-
-    alias_method :pos, :position
-
-    # 0-based index relative to the end of the array, e.g. for 5 items:
-    #
-    #  index | rindex
-    #  ------|-------
-    #    0   |   4
-    #    1   |   3
-    #    2   |   2
-    #    3   |   1
-    #    4   |   0
-    def rindex
-      count - position
-    end
-
-    # 1-based position relative to the end of the array, e.g. for 5 items:
-    #
-    #   position | rposition
-    #  ----------|----------
-    #      1     |    5
-    #      2     |    4
-    #      3     |    3
-    #      4     |    2
-    #      5     |    1
-    def rposition
-      count + 1 - position
-    end
-
-    alias_method :rpos, :rposition
-  end
-
   # simulate an enum. the value is returned by the case statement
   # in the generated block if the positions match
   module Action
@@ -130,9 +95,10 @@ def split(
 
     return result unless splits
 
-    splits.each_with_index do |hash, index|
-      split = Split.with(hash.merge({ count: count, index: index }))
-      result << split.lhs if result.empty?
+    result << splits.first.lhs
+
+    splits.each_with_index do |split, index|
+      split.update!(count: count, index: index)
 
       if accept.call(split)
         result << split.captures << split.rhs
@@ -166,9 +132,10 @@ def rsplit(
 
     return result unless splits
 
-    splits.reverse_each.with_index do |hash, index|
-      split = Split.with(hash.merge({ count: count, index: index }))
-      result.unshift(split.rhs) if result.empty?
+    result.unshift(splits.last.rhs)
+
+    splits.reverse_each.with_index do |split, index|
+      split.update!(count: count, index: index)
 
       if accept.call(split)
         # [lhs + captures] + result
@@ -190,7 +157,7 @@ def rsplit(
   # the following fields:
   #
   #   - result: the array of separated strings to return from +split+ or +rsplit+.
-  #     if the splits arry is empty, the caller returns this array immediately
+  #     if the splits array is empty, the caller returns this array immediately
   #     without any further processing
   #
   #   - splits: an array of hashes containing the lhs, rhs, separator and captured
@@ -202,23 +169,76 @@ def rsplit(
   #     accepted (true) or rejected (false)
   #
   def init(string:, delimiter:, select:, reject:, block:)
-    if reject
-      positions = reject
-      action = Action::REJECT
-    elsif select
-      positions = select
-      action = Action::SELECT
+    return [[]] if string.empty?
+
+    unless block
+      if reject
+        positions = reject
+        action = Action::REJECT
+      elsif select
+        positions = select
+        action = Action::SELECT
+      else
+        block = ACCEPT_ALL
+      end
     end
 
-    splits = parse(string, delimiter)
+    # use String#split if we can
+    #
+    # NOTE +reject!+ is no faster than +reject+ on MRI and significantly slower
+    # on TruffleRuby
+
+    if delimiter.is_a?(String)
+      limit = -1
+
+      if delimiter == ' '
+        delimiter = / / # don't trim
+      elsif delimiter.empty?
+        limit = 0 # remove the trailing empty string
+      end
+
+      result = string.split(delimiter, limit)
+
+      return [result] if result.length == 1 # delimiter not found: no splits
+
+      if block == ACCEPT_ALL # return the (2 or more) fields
+        result = result.reject(&:empty?) if @remove_empty_fields
+        return [result]
+      end
+
+      splits = []
+
+      result.each_cons(2) do |lhs, rhs| # 2 or more fields
+        splits << Split.new(
+          captures: [],
+          lhs: lhs,
+          rhs: rhs,
+          separator: delimiter
+        )
+      end
+    elsif delimiter == DEFAULT_DELIMITER && block == ACCEPT_ALL
+      # non-empty separators so -1 is safe
+
+      if @remove_empty_fields
+        result = []
+        string.split(delimiter, -1) do |field|
+          result << field unless field.empty?
+        end
+      else
+        result = string.split(delimiter, -1)
+      end
 
-    if splits.empty?
-      result = string.empty? ? [] : [string]
       return [result]
+    else
+      splits = parse(string, delimiter)
     end
 
-    block ||= positions ? compile(positions, action, splits.length) : ACCEPT_ALL
-    [[], splits, splits.length, block]
+    count = splits.length
+
+    return [[string]] if count.zero?
+
+    block ||= compile(positions, action, count)
+    [[], splits, count, block]
   end
 
   def render(values)
@@ -227,6 +247,7 @@ def render(values)
         value.empty? && @remove_empty_fields ? REMOVE : [value]
       elsif @include_captures
         if @spread_captures
+          # TODO make sure compact can return a Capture
           @spread_captures == :compact ? value.compact : value
         elsif value.empty?
           # we expose non-captures (string delimiters or regexps with no
@@ -247,7 +268,7 @@ def render(values)
   # the delimiter, returning an array of objects (hashes) representing each split.
   # e.g. for:
   #
-  #   parse.split("foo:bar:baz:quux", ":")
+  #   parse("foo:bar:baz:quux", ":")
   #
   # we return:
   #
@@ -258,6 +279,7 @@ def render(values)
   #   ]
   #
   def parse(string, delimiter)
+    # has_names = delimiter.is_a?(Regexp) && !delimiter.names.empty?
     result = []
     start = 0
 
@@ -273,21 +295,23 @@ def parse(string, delimiter)
       next if separator.empty? && (index.zero? || after == string.length)
 
       lhs = string.slice(start, index - start)
-      result.last[:rhs] = lhs unless result.empty?
+      result.last.rhs = lhs unless result.empty?
 
       # this is correct for the last/only match, but gets updated to the next
       # match's lhs for other matches
       rhs = match.post_match
 
-      result << {
+      # captures = (has_names ? Captures.new(match) : match.captures)
+
+      result << Split.new(
         captures: match.captures,
         lhs: lhs,
         rhs: rhs,
-        separator: separator,
-      }
+        separator: separator
+      )
 
-      # move the start index (the start of the next lhs) to the index after the
-      # last character of the separator
+      # advance the start index (the start of the next lhs) to the position
+      # after the last character of the separator
       start = after
     end
 
@@ -297,8 +321,8 @@ def parse(string, delimiter)
   # returns a lambda which splits at (i.e. accepts or rejects splits at, depending
   # on the action) the supplied positions
   #
-  # positions are preprocessed to support additional features: negative
-  # ranges, infinite ranges, and descending ranges, e.g.:
+  # positions are preprocessed to support negative indices, infinite ranges, and
+  # descending ranges, e.g.:
   #
   #   ss.split("foo:bar:baz:quux", ":", at: -1)
   #
@@ -309,9 +333,8 @@ def parse(string, delimiter)
   # and
   #
   #   ss.split("1:2:3:4:5:6:7:8:9", ":", -3..)
-  #   ss.split("1:2:3:4:5:6:7:8:9", ":", -3..)
   #
-  # translate to:
+  # translates to:
   #
   #   ss.split("foo:bar:baz:quux", ":", at: 6..8)
   #
diff --git a/lib/string_splitter/split.rb b/lib/string_splitter/split.rb
diff --git a/resources/rubocop/rubocop.yml b/resources/rubocop/rubocop.yml
diff --git a/string_splitter.gemspec b/string_splitter.gemspec
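
The `String#split` fast path in `init` above relies on a few behaviors of the built-in method. The snippet below is a small standalone sketch of those behaviors only; the variable names are illustrative and it does not use StringSplitter itself.

```ruby
# a single-space string delimiter trims and collapses whitespace ...
trimmed = ' foo bar '.split(' ')

# ... whereas a regexp space does not, and limit -1 keeps trailing empty fields
untrimmed = ' foo bar '.split(/ /, -1)

# an empty delimiter with limit -1 leaves a trailing empty string ...
chars_with_tail = 'abc'.split('', -1)

# ... which limit 0 removes
chars = 'abc'.split('', 0)

# adjacent fields pair up as lhs/rhs, one pair per split
pairs = 'a:b:c'.split(':', -1).each_cons(2).to_a

p trimmed         # => ["foo", "bar"]
p untrimmed       # => ["", "foo", "bar", ""]
p chars_with_tail # => ["a", "b", "c", ""]
p chars           # => ["a", "b", "c"]
p pairs           # => [["a", "b"], ["b", "c"]]
```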
