Commit 420a210

performance improvements:

- delegate to String#split where possible
- use a regular class for Split rather than values.rb
- create Split objects directly rather than allocating intermediate hashes
1 parent e066613 commit 420a210

File tree

9 files changed

+178
-85
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -59,3 +59,4 @@ Gemfile.lock
 /dev/
 temp.*
 /*.rb
+/profile.*

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -1,3 +1,12 @@
+## 0.7.1 - TBD
+
+#### Changes
+
+- performance improvements
+  - use `String#split` where possible
+  - use a regular class for Split rather than values.rb
+  - create Split objects directly rather than allocating intermediate hashes
+
 ## 0.7.0 - 2020-08-21

 #### Breaking Changes

Gemfile

Lines changed: 3 additions & 0 deletions
@@ -5,6 +5,9 @@ source 'https://rubygems.org'
 unless ENV['CI']
   group :development do
     gem 'irb', '~> 1.2' # XXX work around Arch Linux's broken ruby packaging
+
+    # "2.3-compatible analysis was dropped after version 0.81."
+    # gem 'rubocop', '0.81'
     gem 'rubocop', '~> 0.89'
   end
 end

README.md

Lines changed: 8 additions & 8 deletions
@@ -11,7 +11,7 @@
 - [DESCRIPTION](#description)
 - [WHY?](#why)
 - [CAVEATS](#caveats)
-  - [Differences from String#split](#differences-from-string%23split)
+  - [Differences from String#split](#differences-from-stringsplit)
 - [COMPATIBILITY](#compatibility)
 - [VERSION](#version)
 - [SEE ALSO](#see-also)
@@ -130,7 +130,7 @@ end
 Many languages have built-in `split` functions/methods for strings. They behave
 similarly (notwithstanding the occasional
 [surprise](https://chriszetter.com/blog/2017/10/29/splitting-strings/)), and
-handle a few common cases e.g.:
+handle a few common cases, e.g.:

 * limiting the number of splits
 * including the separator(s) in the results
@@ -140,7 +140,7 @@ But, because the API is squeezed into two overloaded parameters (the delimiter
 and the limit), achieving the desired results can be tricky. For instance,
 while `String#split` removes empty trailing fields (by default), it provides no
 way to remove *all* empty fields. Likewise, the cramped API means there's no
-way to e.g. combine a limit (positive integer) with the option to preserve
+way to, e.g., combine a limit (positive integer) with the option to preserve
 empty fields (negative integer), or use backreferences in a delimiter pattern
 without including its captured subexpressions in the result.

@@ -192,7 +192,7 @@ to a regex or a full-blown parser.
 As an example, the nominally unstructured output of many Unix commands is often
 formatted in a way that's tantalizingly close to being
 [machine-readable](https://en.wikipedia.org/wiki/Delimiter-separated_values),
-apart from a few pesky exceptions e.g.:
+apart from a few pesky exceptions, e.g.:

 ```bash
 $ ls -l
@@ -205,7 +205,7 @@ drwxr-xr-x 3 user users 4096 Jun 19 22:56 lib
 ```

 These lines can *almost* be parsed into an array of fields by splitting them on
-whitespace. The exception is the date (columns 6-8) i.e.:
+whitespace. The exception is the date (columns 6-8), i.e.:

 ```ruby
 line = "-rw-r--r-- 1 user users 87 Jun 18 18:16 CHANGELOG.md"
@@ -224,15 +224,15 @@ instead of:
 ["-rw-r--r--", "1", "user", "users", "87", "Jun 18 18:16", "CHANGELOG.md"]
 ```

-One way to work around this is to parse the whole line e.g.:
+One way to work around this is to parse the whole line, e.g.:

 ```ruby
 line.match(/^(\S+) \s+ (\d+) \s+ (\S+) \s+ (\S+) \s+ (\d+) \s+ (\S+ \s+ \d+ \s+ \S+) \s+ (.+)$/x)
 ```

 But that requires us to specify *everything*. What we really want is a version
 of `split` which allows us to veto splitting for the 6th and 7th delimiters
-(and to stop after the 8th delimiter) i.e. control over which splits are
+(and to stop after the 8th delimiter), i.e. control over which splits are
 accepted, rather than being restricted to the single, baked-in strategy
 provided by the `limit` parameter.

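
Reviewer note: the "veto" idea described in this README passage can be sketched in plain Ruby without the gem. The helper name `split_at` and its `accept` predicate are hypothetical, the sketch only handles string delimiters, and it is not how StringSplitter itself is implemented.

```ruby
# Hypothetical helper (not part of StringSplitter): split on a string
# delimiter, but keep only the splits whose 1-based position the predicate
# accepts. A rejected split is undone by gluing the delimiter and the
# right-hand field back onto the previous field.
def split_at(string, delimiter, accept)
  fields = string.split(delimiter, -1) # -1 preserves trailing empty fields
  return fields if fields.length < 2   # empty string or delimiter not found

  result = [fields.first]

  fields.drop(1).each.with_index(1) do |rhs, position|
    if accept.call(position)
      result << rhs
    else
      result[-1] = result.last + delimiter + rhs
    end
  end

  result
end

line = '-rw-r--r-- 1 user users 87 Jun 18 18:16 CHANGELOG.md'
wanted = [1..5, 8] # accept splits 1-5 and 8, veto splits 6 and 7

fields = split_at(line, ' ', ->(position) { wanted.any? { |w| w === position } })
# fields keeps "Jun 18 18:16" together as a single element
```

The predicate mirrors the `at: [1..5, 8]` call quoted in the next hunk header; everything else here is an assumption made for illustration.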
@@ -258,7 +258,7 @@ ss.split(line, at: [1..5, 8])
 ## Differences from String#split

 Unlike `String#split`, StringSplitter doesn't trim the string before splitting
-(with `String#strip`) if the delimiter is omitted or a single space, e.g.:
+if the delimiter is omitted or a single space, e.g.:

 ```ruby
 " foo bar baz ".split # => ["foo", "bar", "baz"]

Rakefile

Lines changed: 1 addition & 1 deletion
@@ -18,6 +18,6 @@ task :console do
 end

 # FIXME this runs after the release!
-task :release => %i[rubocop test]
+task :release => %i[lint test]

 task :default => :test

lib/string_splitter.rb

Lines changed: 92 additions & 69 deletions
@@ -1,8 +1,8 @@
 # frozen_string_literal: true

 require 'set'
-require 'values'

+require_relative 'string_splitter/split'
 require_relative 'string_splitter/version'

 # This class extends the functionality of +String#split+ by:
@@ -17,9 +17,10 @@
 # These enhancements allow splits to handle many cases that otherwise require bigger
 # guns, e.g. regex matching or parsing.
 #
-# Implementation-wise, we split the string with a scanner which works in a similar
-# way to +String#split+ and parse the resulting tokens into an array of Split objects
-# with the following fields:
+# Implementation-wise, we split the string either with String#split, or with a custom
+# scanner if the delimiter may contain captures (since String#split doesn't handle
+# them correctly) and parse the resulting tokens into an array of Split objects with
+# the following attributes:
 #
 # - captures: separator substrings captured by parentheses in the delimiter pattern
 # - count: the number of splits
@@ -43,42 +44,6 @@ class StringSplitter
   DEFAULT_DELIMITER = /\s+/.freeze
   REMOVE = [].freeze

-  Split = Value.new(:captures, :count, :index, :lhs, :rhs, :separator) do
-    def position
-      index + 1
-    end
-
-    alias_method :pos, :position
-
-    # 0-based index relative to the end of the array, e.g. for 5 items:
-    #
-    #   index | rindex
-    #   ------|-------
-    #   0     | 4
-    #   1     | 3
-    #   2     | 2
-    #   3     | 1
-    #   4     | 0
-    def rindex
-      count - position
-    end
-
-    # 1-based position relative to the end of the array, e.g. for 5 items:
-    #
-    #   position | rposition
-    #   ----------|----------
-    #   1         | 5
-    #   2         | 4
-    #   3         | 3
-    #   4         | 2
-    #   5         | 1
-    def rposition
-      count + 1 - position
-    end
-
-    alias_method :rpos, :rposition
-  end
-
   # simulate an enum. the value is returned by the case statement
   # in the generated block if the positions match
   module Action
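
Reviewer note: the replacement class lives in `lib/string_splitter/split.rb`, which is not shown in this excerpt. The following is only a sketch of what a plain class with the same surface might look like, inferred from the calls visible in this diff (`Split.new` with keyword arguments, `update!(count:, index:)`, the `rhs=` writer) and from the removed `Value` block. The actual file may well differ.

```ruby
# Hypothetical stand-in for lib/string_splitter/split.rb (not shown in the diff)
class Split
  attr_accessor :captures, :count, :index, :lhs, :rhs, :separator

  def initialize(captures:, lhs:, rhs:, separator:)
    @captures = captures
    @lhs = lhs
    @rhs = rhs
    @separator = separator
  end

  # fill in the fields which are only known once all splits are collected
  def update!(count:, index:)
    @count = count
    @index = index
  end

  # 1-based position from the start
  def position
    index + 1
  end

  alias pos position

  # 0-based index relative to the end
  def rindex
    count - position
  end

  # 1-based position relative to the end
  def rposition
    count + 1 - position
  end

  alias rpos rposition
end

split = Split.new(captures: [], lhs: 'foo', rhs: 'bar', separator: ':')
split.update!(count: 5, index: 0)
```

Mutating one object in place of building a hash and then a value object per split is presumably where the allocation savings mentioned in the commit message come from.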
@@ -130,9 +95,10 @@ def split(

     return result unless splits

-    splits.each_with_index do |hash, index|
-      split = Split.with(hash.merge({ count: count, index: index }))
-      result << split.lhs if result.empty?
+    result << splits.first.lhs
+
+    splits.each_with_index do |split, index|
+      split.update!(count: count, index: index)

       if accept.call(split)
         result << split.captures << split.rhs
@@ -166,9 +132,10 @@ def rsplit(

     return result unless splits

-    splits.reverse_each.with_index do |hash, index|
-      split = Split.with(hash.merge({ count: count, index: index }))
-      result.unshift(split.rhs) if result.empty?
+    result.unshift(splits.last.rhs)
+
+    splits.reverse_each.with_index do |split, index|
+      split.update!(count: count, index: index)

       if accept.call(split)
         # [lhs + captures] + result
@@ -190,7 +157,7 @@ def rsplit(
   # the following fields:
   #
   # - result: the array of separated strings to return from +split+ or +rsplit+.
-  #   if the splits arry is empty, the caller returns this array immediately
+  #   if the splits array is empty, the caller returns this array immediately
   #   without any further processing
   #
   # - splits: an array of hashes containing the lhs, rhs, separator and captured
@@ -202,23 +169,76 @@ def rsplit(
   #   accepted (true) or rejected (false)
   #
   def init(string:, delimiter:, select:, reject:, block:)
-    if reject
-      positions = reject
-      action = Action::REJECT
-    elsif select
-      positions = select
-      action = Action::SELECT
+    return [[]] if string.empty?
+
+    unless block
+      if reject
+        positions = reject
+        action = Action::REJECT
+      elsif select
+        positions = select
+        action = Action::SELECT
+      else
+        block = ACCEPT_ALL
+      end
     end

-    splits = parse(string, delimiter)
+    # use String#split if we can
+    #
+    # NOTE +reject!+ is no faster than +reject+ on MRI and significantly slower
+    # on TruffleRuby
+
+    if delimiter.is_a?(String)
+      limit = -1
+
+      if delimiter == ' '
+        delimiter = / / # don't trim
+      elsif delimiter.empty?
+        limit = 0 # remove the trailing empty string
+      end
+
+      result = string.split(delimiter, limit)
+
+      return [result] if result.length == 1 # delimiter not found: no splits
+
+      if block == ACCEPT_ALL # return the (2 or more) fields
+        result = result.reject(&:empty?) if @remove_empty_fields
+        return [result]
+      end
+
+      splits = []
+
+      result.each_cons(2) do |lhs, rhs| # 2 or more fields
+        splits << Split.new(
+          captures: [],
+          lhs: lhs,
+          rhs: rhs,
+          separator: delimiter
+        )
+      end
+    elsif delimiter == DEFAULT_DELIMITER && block == ACCEPT_ALL
+      # non-empty separators so -1 is safe
+
+      if @remove_empty_fields
+        result = []
+        string.split(delimiter, -1) do |field|
+          result << field unless field.empty?
+        end
+      else
+        result = string.split(delimiter, -1)
+      end

-    if splits.empty?
-      result = string.empty? ? [] : [string]
       return [result]
+    else
+      splits = parse(string, delimiter)
     end

-    block ||= positions ? compile(positions, action, splits.length) : ACCEPT_ALL
-    [[], splits, splits.length, block]
+    count = splits.length
+
+    return [[string]] if count.zero?
+
+    block ||= compile(positions, action, count)
+    [[], splits, count, block]
   end

   def render(values)
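
Reviewer note: the fast path above leans on a few `String#split` behaviours that are easy to forget. The snippet below uses only core Ruby and illustrates why the hunk swaps `' '` for `/ /`, why it picks a limit of `0` for the empty delimiter and `-1` otherwise, and how `each_cons(2)` turns the resulting fields into lhs/rhs pairs. The variable names are made up for illustration.

```ruby
# a single-space string delimiter is special-cased: whitespace runs are
# collapsed and leading/trailing whitespace is ignored...
awk_style = ' foo bar '.split(' ') # => ["foo", "bar"]

# ...whereas the equivalent regexp splits on every space, so nothing is trimmed
untrimmed = ' foo bar '.split(/ /, -1) # => ["", "foo", "bar", ""]

# a negative limit keeps trailing empty fields, which the default drops
dropped = 'a:b::'.split(':')     # => ["a", "b"]
kept    = 'a:b::'.split(':', -1) # => ["a", "b", "", ""]

# with an empty delimiter, a negative limit yields a spurious trailing
# empty string, hence the limit of 0 for that case
with_tail = 'abc'.split('', -1) # => ["a", "b", "c", ""]
chars     = 'abc'.split('', 0)  # => ["a", "b", "c"]

# adjacent fields pair up as the lhs and rhs of each split
pairs = 'foo:bar:baz'.split(':', -1).each_cons(2).to_a
# => [["foo", "bar"], ["bar", "baz"]]
```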
@@ -227,6 +247,7 @@ def render(values)
         value.empty? && @remove_empty_fields ? REMOVE : [value]
       elsif @include_captures
         if @spread_captures
+          # TODO make sure compact can return a Capture
           @spread_captures == :compact ? value.compact : value
         elsif value.empty?
           # we expose non-captures (string delimiters or regexps with no
247268
# the delimiter, returning an array of objects (hashes) representing each split.
248269
# e.g. for:
249270
#
250-
# parse.split("foo:bar:baz:quux", ":")
271+
# parse("foo:bar:baz:quux", ":")
251272
#
252273
# we return:
253274
#
@@ -258,6 +279,7 @@ def render(values)
   #   ]
   #
   def parse(string, delimiter)
+    # has_names = delimiter.is_a?(Regexp) && !delimiter.names.empty?
     result = []
     start = 0

@@ -273,21 +295,23 @@ def parse(string, delimiter)
       next if separator.empty? && (index.zero? || after == string.length)

       lhs = string.slice(start, index - start)
-      result.last[:rhs] = lhs unless result.empty?
+      result.last.rhs = lhs unless result.empty?

       # this is correct for the last/only match, but gets updated to the next
       # match's lhs for other matches
       rhs = match.post_match

-      result << {
+      # captures = (has_names ? Captures.new(match) : match.captures)
+
+      result << Split.new(
         captures: match.captures,
         lhs: lhs,
         rhs: rhs,
-        separator: separator,
-      }
+        separator: separator
+      )

-      # move the start index (the start of the next lhs) to the index after the
-      # last character of the separator
+      # advance the start index (the start of the next lhs) to the position
+      # after the last character of the separator
       start = after
     end

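
Reviewer note: the reason a custom scanner is still needed for regexp delimiters is that `String#split` interleaves capture groups with the fields, so fields and separators can no longer be told apart. The sketch below shows that behaviour, followed by a simplified scanner in the spirit of `parse`. It uses plain hashes where the commit uses `Split` objects, the name `scan_splits` is hypothetical, and it skips every empty separator, whereas the context line at the top of the hunk only skips them at the start and end of the string.

```ruby
# String#split mixes captured separators into the result
mixed = 'foo:bar'.split(/(:)/) # => ["foo", ":", "bar"]

# Hypothetical, simplified scanner: walk the matches and record each split's
# lhs, rhs, separator and captures separately
def scan_splits(string, pattern)
  splits = []
  start = 0 # start of the next lhs

  string.scan(pattern) do
    match = Regexp.last_match
    separator = match[0]
    next if separator.empty? # simplification: ignore all empty separators

    lhs = string[start...match.begin(0)]

    # the previous split's rhs ends where this separator begins
    splits.last[:rhs] = lhs unless splits.empty?

    splits << {
      captures: match.captures,
      lhs: lhs,
      rhs: match.post_match, # correct for the last match; patched for others
      separator: separator
    }

    start = match.end(0)
  end

  splits
end

splits = scan_splits('foo:bar:baz', /(:)/)
```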
@@ -297,8 +321,8 @@ def parse(string, delimiter)
   # returns a lambda which splits at (i.e. accepts or rejects splits at, depending
   # on the action) the supplied positions
   #
-  # positions are preprocessed to support additional features: negative
-  # ranges, infinite ranges, and descending ranges, e.g.:
+  # positions are preprocessed to support negative indices, infinite ranges, and
+  # descending ranges, e.g.:
   #
   #   ss.split("foo:bar:baz:quux", ":", at: -1)
   #
@@ -309,9 +333,8 @@ def parse(string, delimiter)
   # and
   #
   #   ss.split("1:2:3:4:5:6:7:8:9", ":", -3..)
-  #   ss.split("1:2:3:4:5:6:7:8:9", ":", -3..)
   #
-  # translate to:
+  # translates to:
   #
   #   ss.split("foo:bar:baz:quux", ":", at: 6..8)
   #
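
Reviewer note: the preprocessing described in this comment can be sketched as a small normalization step. The helper name `normalize` is hypothetical and this is not the body of `compile`, which is not shown in the diff. The sketch covers negative indices and endless ranges only, not descending ranges. With nine fields there are eight splits, so `-3..` maps to `6..8`, which matches the `at: 6..8` in the comment.

```ruby
# Hypothetical helper: map an Integer or Range position onto 1-based
# positions, given the total number of splits
def normalize(position, count)
  case position
  when Integer
    # -1 is the last split, -2 the one before it, and so on
    position.negative? ? count + 1 + position : position
  when Range
    first = normalize(position.begin || 1, count)
    last = normalize(position.end || count, count)
    last -= 1 if position.end && position.exclude_end?
    first..last
  end
end

normalize(-1, 3)      # last of 3 splits ("foo:bar:baz:quux")
normalize((-3..), 8)  # last three of 8 splits ("1:2:3:4:5:6:7:8:9")
```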
