Skip to content

Refactor char/string and byte search #54667

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Mar 6, 2025

Conversation

jakobnissen
Copy link
Member

@jakobnissen jakobnissen commented Jun 4, 2024

This is a refactoring of base/string/search.jl. It is purely internal, and comes with no changes in behaviour. It's based on #54593 and #54579, so those needs to get merged first, then this PR will be rebased onto master.

Included changes are:

  • The char/string search functions now use the last byte to memchr, not the first byte. Because the last bytes are more varied, this is much faster on small non-ASCII alphabets (like searching Greek or Cyrillic text) and somewhat faster on large non-ASCII ones (like Japanese). Speed on ASCII alphabets (like English) in unchanged.
  • Several unused or redundant methods have been removed
  • Moved boundschecks from the inner _search and _rsearch functions to the outer top-level functions that call them. This is because the former may be called in a loop where repeated boundschecking is needless. This should speed up search a bit.
  • Char/string search functions are now implemented in terms of an internal lazy iterator. This allows findall and findnext to share implementation, and will also make it trivially easy to implement a lazy findall in the future (see Implement lazy findall (Iterators.findall, perhaps?) #43737)

IMO there is still more work to be done on this file, but this requires a decision to be made on #43737, #54581 or #54584

Benchmarks

using BenchmarkTools
using Random

rng = Xoshiro(55)

greek = join(rand(rng, 'Α':'ψ', 100000)) * 'ω'
@btime findfirst('ω', greek)

@btime findfirst(==('\xce'), greek)

english = join(rand(rng, 'A':'y', 100000)) * 'z'
@btime findfirst('z', english)

@btime findall('A', english)
@btime findall('\xff', english)
nothing

1.11.0-beta2:

  100.049 μs (1 allocation: 16 bytes)
  474.084 μs (0 allocations: 0 bytes)
  689.110 ns (1 allocation: 16 bytes)
  93.536 μs (9 allocations: 21.84 KiB)
  72.316 μs (1 allocation: 32 bytes)

This PR:

  1.319 μs (1 allocation: 16 bytes)
  398.011 μs (0 allocations: 0 bytes)
  681.550 ns (1 allocation: 16 bytes)
  8.867 μs (8 allocations: 21.81 KiB)
  683.962 ns (1 allocation: 32 bytes)

@jakobnissen jakobnissen added strings "Strings!" search & find The find* family of functions performance Must go faster labels Jun 4, 2024
@jakobnissen jakobnissen marked this pull request as ready for review September 12, 2024 06:13
@jakobnissen
Copy link
Member Author

This is good to go now. Test failures are unrelated.

@jakobnissen jakobnissen changed the title WIP: Refactor char/string and byte search Refactor char/string and byte search Sep 12, 2024
@jakobnissen
Copy link
Member Author

Bump. CI failures are unrelated.

@jakobnissen
Copy link
Member Author

Bump

@oscardssmith
Copy link
Member

@nanosoldier runbenchmarks(!scalar)

@oscardssmith
Copy link
Member

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@oscardssmith
Copy link
Member

The perf results here look great! We really should try to get this reviewed and merged ASAP

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

In text, the first UTF8 bytes of characters are typically more repetitive than
the last byte. For example, most Greek characters start with 0xce or 0xcf.
By searching for the more unique last byte, more time is spent in the memchr
fast path. This gives a significant speedup.
It's more Julian to return nothing directly from the search function.
Many of these are identical to the generic fallback
This has two advantages: First, it consolidates the implementation of findnext
and findall. Second, it allows a hypothetical lazy findall iterator to be
trivially implemented later.
The search functions are a basic building block of the other functions, and
may e.g. be called in a loop. It's wasteful to check bounds in these, as they
are often called when we know for sure we are inbounds.
Move the boundscheck closer to the top-level calls.
This should slightly improve efficiency.
Take fast path not in every iteration, but just once, outside the loop.
@jakobnissen
Copy link
Member Author

jakobnissen commented Jan 13, 2025

Nanosoldier found a regression for short ASCII searches, where extra book-keeping in this PR, which speeds up the special cases, but costs around 5 nanoseconds, becomes significant. I've addressed this by manually adding a fast path in findnext and findprev when the char is ASCII.
This fast path would have been hit eventually, but now a bunch of setup code is skipped, saving this handful of nanoseconds.

@oscardssmith
Copy link
Member

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@oscardssmith
Copy link
Member

I think this is good to go, but I do want someone else to get another set of eyes on this.

@KristofferC
Copy link
Member

Just PkgEval it and if it looks good, merge?

@jakobnissen
Copy link
Member Author

@nanosoldier runtests()

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

❗ Packages that crashed

2 packages crashed only on the current version.

  • A segmentation fault happened: 2 packages

13 packages crashed on the previous version too.

✖ Packages that failed

16 packages failed only on the current version.

  • Illegal method overwrites during precompilation: 1 packages
  • Package has test failures: 5 packages
  • Package tests unexpectedly errored: 1 packages
  • Tests became inactive: 2 packages
  • Test duration exceeded the time limit: 7 packages

1111 packages failed on the previous version too.

✔ Packages that passed tests

30 packages passed tests only on the current version.

  • Other: 30 packages

5325 packages passed tests on the previous version too.

~ Packages that at least loaded

20 packages successfully loaded only on the current version.

  • Other: 20 packages

2935 packages successfully loaded on the previous version too.

➖ Packages that were skipped altogether

908 packages were skipped on the previous version too.

@KristofferC KristofferC merged commit 59320c6 into JuliaLang:master Mar 6, 2025
8 checks passed
@jakobnissen jakobnissen deleted the find_refactor branch March 6, 2025 12:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster search & find The find* family of functions strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants