Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor char/string and byte search #54667

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

jakobnissen
Copy link
Contributor

@jakobnissen jakobnissen commented Jun 4, 2024

This is a refactoring of base/string/search.jl. It is purely internal, and comes with no changes in behaviour. It's based on #54593 and #54579, so those needs to get merged first, then this PR will be rebased onto master.

Included changes are:

  • The char/string search functions now use the last byte to memchr, not the first byte. Because the last bytes are more varied, this is much faster on small non-ASCII alphabets (like searching Greek or Cyrillic text) and somewhat faster on large non-ASCII ones (like Japanese). Speed on ASCII alphabets (like English) in unchanged.
  • Several unused or redundant methods have been removed
  • Moved boundschecks from the inner _search and _rsearch functions to the outer top-level functions that call them. This is because the former may be called in a loop where repeated boundschecking is needless. This should speed up search a bit.
  • Char/string search functions are now implemented in terms of an internal lazy iterator. This allows findall and findnext to share implementation, and will also make it trivially easy to implement a lazy findall in the future (see Implement lazy findall (Iterators.findall, perhaps?) #43737)

IMO there is still more work to be done on this file, but this requires a decision to be made on #43737, #54581 or #54584

Benchmarks

using BenchmarkTools
using Random

rng = Xoshiro(55)

greek = join(rand(rng, 'Α':'ψ', 100000)) * 'ω'
@btime findfirst('ω', greek)

@btime findfirst(==('\xce'), greek)

english = join(rand(rng, 'A':'y', 100000)) * 'z'
@btime findfirst('z', english)

@btime findall('A', english)
@btime findall('\xff', english)
nothing

1.11.0-beta2:

  100.049 μs (1 allocation: 16 bytes)
  474.084 μs (0 allocations: 0 bytes)
  689.110 ns (1 allocation: 16 bytes)
  93.536 μs (9 allocations: 21.84 KiB)
  72.316 μs (1 allocation: 32 bytes)

This PR:

  1.319 μs (1 allocation: 16 bytes)
  398.011 μs (0 allocations: 0 bytes)
  681.550 ns (1 allocation: 16 bytes)
  8.867 μs (8 allocations: 21.81 KiB)
  683.962 ns (1 allocation: 32 bytes)

@jakobnissen jakobnissen added strings "Strings!" search & find The find* family of functions performance Must go faster labels Jun 4, 2024
@jakobnissen jakobnissen marked this pull request as ready for review September 12, 2024 06:13
@jakobnissen jakobnissen added the status: waiting for PR reviewer PR is complete and seems ready to merge. Has tests and news/compat if needed. CI failures unrelated. label Sep 12, 2024
@jakobnissen
Copy link
Contributor Author

This is good to go now. Test failures are unrelated.

base/strings/search.jl Outdated Show resolved Hide resolved
@jakobnissen jakobnissen changed the title WIP: Refactor char/string and byte search Refactor char/string and byte search Sep 12, 2024
@jakobnissen
Copy link
Contributor Author

Bump. CI failures are unrelated.

In text, the first UTF8 bytes of characters are typically more repetitive than
the last byte. For example, most Greek characters start with 0xce or 0xcf.
By searching for the more unique last byte, more time is spent in the memchr
fast path. This gives a significant speedup.
It's more Julian to return nothing directly from the search function.
Many of these are identical to the generic fallback
This has two advantages: First, it consolidates the implementation of findnext
and findall. Second, it allows a hypothetical lazy findall iterator to be
trivially implemented later.
The search functions are a basic building block of the other functions, and
may e.g. be called in a loop. It's wasteful to check bounds in these, as they
are often called when we know for sure we are inbounds.
Move the boundscheck closer to the top-level calls.
This should slightly improve efficiency.
Take fast path not in every iteration, but just once, outside the loop.
@jakobnissen
Copy link
Contributor Author

Bump

@oscardssmith
Copy link
Member

@nanosoldier runbenchmarks(!scalar)

@oscardssmith
Copy link
Member

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@oscardssmith
Copy link
Member

The perf results here look great! We really should try to get this reviewed and merged ASAP

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster search & find The find* family of functions status: waiting for PR reviewer PR is complete and seems ready to merge. Has tests and news/compat if needed. CI failures unrelated. strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants