Refactor char/string and byte search #54667

jakobnissen · 2024-06-04T13:28:26Z

This is a refactoring of base/strings/search.jl. It is purely internal and comes with no changes in behaviour. It's based on #54593 and #54579, so those need to be merged first; this PR will then be rebased onto master.

Included changes are:

The char/string search functions now pass the last byte of the needle to memchr, not the first byte. Because the last bytes are more varied, this is much faster on small non-ASCII alphabets (like searching Greek or Cyrillic text) and somewhat faster on large non-ASCII ones (like Japanese). Speed on ASCII alphabets (like English) is unchanged.
Several unused or redundant methods have been removed.
Bounds checks have been moved from the inner _search and _rsearch functions to the outer top-level functions that call them. The inner functions may be called in a loop, where repeated bounds checking is needless. This should speed up search a bit.
Char/string search functions are now implemented in terms of an internal lazy iterator. This allows findall and findnext to share an implementation, and will also make it trivially easy to implement a lazy findall in the future (see #43737, "Implement lazy findall (Iterators.findall, perhaps?)").
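The last-byte strategy from the first item above can be sketched as follows. This is only an illustration under stated assumptions, not the code in base/strings/search.jl: the name `find_char_lastbyte` is hypothetical, and a plain `findnext` over the code units stands in for the memchr call.

```julia
# Hypothetical helper for illustration; not the PR's implementation.
# Assumes `s` is valid UTF-8 and `start >= 1`.
function find_char_lastbyte(c::Char, s::String, start::Int=1)
    needle = codeunits(string(c))   # UTF-8 bytes of the char being searched for
    n = length(needle)
    bytes = codeunits(s)
    last_byte = needle[end]
    # The last byte of a match that begins at `start` sits at start + n - 1.
    i = start + n - 1
    while i <= length(bytes)
        # Stand-in for memchr: jump to the next occurrence of the last byte.
        i = findnext(==(last_byte), bytes, i)
        i === nothing && return nothing
        first_idx = i - n + 1
        # Only now compare the whole encoding, including the leading bytes.
        if view(bytes, first_idx:i) == needle
            return first_idx
        end
        i += 1
    end
    return nothing
end
```

Because the varied last byte rarely matches by accident, the full comparison runs rarely and most of the time is spent in the byte scan.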

IMO there is still more work to be done on this file, but this requires a decision to be made on #43737, #54581 or #54584

Benchmarks

using BenchmarkTools
using Random

rng = Xoshiro(55)

greek = join(rand(rng, 'Α':'ψ', 100000)) * 'ω'
@btime findfirst('ω', greek)

@btime findfirst(==('\xce'), greek)

english = join(rand(rng, 'A':'y', 100000)) * 'z'
@btime findfirst('z', english)

@btime findall('A', english)
@btime findall('\xff', english)
nothing

1.11.0-beta2:

  100.049 μs (1 allocation: 16 bytes)
  474.084 μs (0 allocations: 0 bytes)
  689.110 ns (1 allocation: 16 bytes)
  93.536 μs (9 allocations: 21.84 KiB)
  72.316 μs (1 allocation: 32 bytes)

This PR:

  1.319 μs (1 allocation: 16 bytes)
  398.011 μs (0 allocations: 0 bytes)
  681.550 ns (1 allocation: 16 bytes)
  8.867 μs (8 allocations: 21.81 KiB)
  683.962 ns (1 allocation: 32 bytes)
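
As a rough reading of the two result sets, the speedup per benchmark can be computed directly from the timings above. The snippet assumes the five timings are printed in the same order as the five `@btime` calls; the labels are just those calls written out.

```julia
# Timings copied from the two result blocks above, converted to nanoseconds.
labels = [
    "findfirst('ω', greek)",
    "findfirst(==('\\xce'), greek)",
    "findfirst('z', english)",
    "findall('A', english)",
    "findall('\\xff', english)",
]
before_ns = [100.049e3, 474.084e3, 689.110, 93.536e3, 72.316e3]  # 1.11.0-beta2
after_ns  = [1.319e3, 398.011e3, 681.550, 8.867e3, 683.962]      # this PR

speedup = before_ns ./ after_ns

for (label, x) in zip(labels, speedup)
    println(rpad(label, 32), round(x; digits=1), "x")
end
```

Under that assumption, the Greek char search and the `findall` for an absent char improve the most, while the ASCII `findfirst` is essentially unchanged, which matches the description.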

jakobnissen · 2024-09-12T06:14:55Z

This is good to go now. Test failures are unrelated.

base/strings/search.jl

jakobnissen · 2024-11-18T10:37:31Z

Bump. CI failures are unrelated.

jakobnissen · 2025-01-03T09:27:02Z

Bump

oscardssmith · 2025-01-03T23:35:11Z

@nanosoldier runbenchmarks(!scalar)

oscardssmith · 2025-01-03T23:35:59Z

@nanosoldier runbenchmarks(!"scalar", vs=":master")

oscardssmith · 2025-01-03T23:40:37Z

The perf results here look great! We really should try to get this reviewed and merged ASAP

nanosoldier · 2025-01-04T11:10:13Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

In text, the first UTF-8 byte of each character is typically more repetitive than the last byte. For example, most Greek characters start with 0xce or 0xcf. By searching for the more distinctive last byte, fewer false candidates need to be checked and more time is spent in the memchr fast path. This gives a significant speedup.
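
The claim about Greek text can be checked with a few lines. This is a small illustration, not part of the PR; it looks only at the code points from capital alpha (U+0391) through small omega (U+03C9).

```julia
# Greek code points from capital alpha (U+0391) through small omega (U+03C9).
chars = '\u0391':'\u03c9'

# Collect the distinct first and last UTF-8 bytes over that range.
first_bytes = Set(first(codeunits(string(c))) for c in chars)
last_bytes = Set(last(codeunits(string(c))) for c in chars)

println("distinct first bytes: ", length(first_bytes))
println("distinct last bytes:  ", length(last_bytes))
```

Every character in this range starts with one of the same two bytes, while each has its own last byte, so a scan for the last byte produces far fewer false candidates.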

It's more Julian to return nothing directly from the search function.

Many of these are identical to the generic fallback

This has two advantages: First, it consolidates the implementation of findnext and findall. Second, it allows a hypothetical lazy findall iterator to be trivially implemented later.
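
A minimal sketch of the idea follows. The PR's internal iterator is not shown in this thread, so the type `CharMatches` and the two wrapper functions are hypothetical names, and the scan is a plain character loop rather than the optimized byte search.

```julia
# Hypothetical lazy iterator over the indices where `needle` occurs in `haystack`.
struct CharMatches
    needle::Char
    haystack::String
end

Base.IteratorSize(::Type{CharMatches}) = Base.SizeUnknown()
Base.eltype(::Type{CharMatches}) = Int

# The iteration state is the index to resume from; it must be a valid char index.
function Base.iterate(it::CharMatches, i::Int=firstindex(it.haystack))
    s = it.haystack
    while i <= ncodeunits(s)
        if s[i] == it.needle
            return (i, nextind(s, i))
        end
        i = nextind(s, i)
    end
    return nothing
end

# Both functions reuse the same iterator: findall drains it, findnext takes one step.
findall_via_iter(c::Char, s::String) = collect(CharMatches(c, s))

function findnext_via_iter(c::Char, s::String, start::Int)
    step = iterate(CharMatches(c, s), start)
    return step === nothing ? nothing : first(step)
end
```

With this shape, a lazy findall would amount to returning the iterator itself instead of collecting it.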

The search functions are a basic building block of the other functions, and may e.g. be called in a loop. It's wasteful to check bounds in these, as they are often called when we know for sure we are in bounds. Move the bounds check closer to the top-level calls. This should slightly improve efficiency.
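
The pattern can be sketched like this. The function names below are hypothetical and deliberately differ from the PR's `_search` and `_rsearch`; the point is only where the check lives.

```julia
# Inner worker: trusts the caller to pass `i >= firstindex(bytes)`.
function _scan_byte(bytes::AbstractVector{UInt8}, b::UInt8, i::Int)
    @inbounds while i <= lastindex(bytes)
        bytes[i] == b && return i
        i += 1
    end
    return nothing
end

# Top-level entry point: validates the start index once, then delegates.
function find_byte(bytes::AbstractVector{UInt8}, b::UInt8, start::Int)
    if start < firstindex(bytes) || start > lastindex(bytes) + 1
        throw(BoundsError(bytes, start))
    end
    return _scan_byte(bytes, b, start)
end

# A loop over the inner worker never re-checks bounds, because every index
# it passes is derived from a previous in-bounds match.
function find_all_bytes(bytes::AbstractVector{UInt8}, b::UInt8)
    out = Int[]
    i = firstindex(bytes)
    while true
        i = _scan_byte(bytes, b, i)
        i === nothing && break
        push!(out, i)
        i += 1
    end
    return out
end
```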

Take fast path not in every iteration, but just once, outside the loop.

jakobnissen · 2025-01-13T15:00:18Z

Nanosoldier found a regression for short ASCII searches. The extra bookkeeping in this PR speeds up the special cases but costs around 5 nanoseconds, which becomes significant when the search itself is that short. I've addressed this by manually adding a fast path in findnext and findprev when the char is ASCII.
This fast path would have been hit eventually anyway, but now a bunch of setup code is skipped, saving this handful of nanoseconds.
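
The shape of such a fast path might look like the sketch below. It is an assumption about the structure, not the merged code: `findnext_ascii_fast` is a hypothetical name, and the general path simply falls back to Base's predicate-based `findnext`.

```julia
# Hypothetical wrapper showing an early exit for ASCII needles.
function findnext_ascii_fast(c::Char, s::String, start::Int)
    if isascii(c)
        # An ASCII char is a single byte that never appears inside a
        # multi-byte character, so a raw byte scan is enough and no
        # per-search setup is needed.
        return findnext(==(UInt8(c)), codeunits(s), start)
    end
    # General path (stand-in): defer to the generic char search.
    return findnext(==(c), s, start)
end
```

The early return is what saves the setup cost: for an ASCII needle, nothing after the first branch runs.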

oscardssmith · 2025-01-13T15:46:30Z

@nanosoldier runbenchmarks(!"scalar", vs=":master")

nanosoldier · 2025-01-14T02:00:18Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

oscardssmith · 2025-01-14T05:33:58Z

I think this is good to go, but I do want another set of eyes on this.

KristofferC · 2025-02-07T14:27:50Z

Just PkgEval it and if it looks good, merge?

jakobnissen · 2025-03-03T11:22:00Z

@nanosoldier runtests()

nanosoldier · 2025-03-06T11:36:05Z

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

❗ Packages that crashed

2 packages crashed only on the current version.

A segmentation fault happened: 2 packages

13 packages crashed on the previous version too.

✖ Packages that failed

16 packages failed only on the current version.

Illegal method overwrites during precompilation: 1 packages
Package has test failures: 5 packages
Package tests unexpectedly errored: 1 packages
Tests became inactive: 2 packages
Test duration exceeded the time limit: 7 packages

1111 packages failed on the previous version too.

✔ Packages that passed tests

30 packages passed tests only on the current version.

Other: 30 packages

5325 packages passed tests on the previous version too.

~ Packages that at least loaded

20 packages successfully loaded only on the current version.

Other: 20 packages

2935 packages successfully loaded on the previous version too.

➖ Packages that were skipped altogether

908 packages were skipped on the previous version too.

jakobnissen added the strings, search & find, and performance labels Jun 4, 2024

jakobnissen force-pushed the find_refactor branch from df9c1d8 to ba4e410 on September 11, 2024 06:46

jakobnissen marked this pull request as ready for review September 12, 2024 06:13

jakobnissen added the status: waiting for PR reviewer label Sep 12, 2024

KristofferC reviewed Sep 12, 2024

jakobnissen changed the title from "WIP: Refactor char/string and byte search" to "Refactor char/string and byte search" Sep 12, 2024

jakobnissen force-pushed the find_refactor branch from 0b08a60 to c1f87e0 on November 18, 2024 07:26

jakobnissen force-pushed the find_refactor branch from ec0ef94 to 39a6449 on November 27, 2024 11:49

jakobnissen added 13 commits January 13, 2025 15:15

Various fixes to searching (squashed JuliaLang#54579)

1b6c200

Remove nothing_sentinel

fc37747

It's more Julian to return nothing directly from the search function.

Remove unused functions

8b4a858

Many of these are identical to the generic fallback

Impl byte/string search as lazy iterator

5f7233e

This has two advantages: First, it consolidates the implementation of findnext and findall. Second, it allows a hypothetical lazy findall iterator to be trivially implemented later.

Make findall slightly faster

5a9f226

Take fast path not in every iteration, but just once, outside the loop.

Fix typos

a5d727c

Fixup

6f22d80

Switch internal docstrings to comments

96bb59a

Remove duplicate tests from merge mistake

54c55f0

Add missing end

bb6492f

Manually inline fast path

a5fbc1f

jakobnissen force-pushed the find_refactor branch from 7511e93 to a5fbc1f on January 13, 2025 14:57

Merge branch 'master' into find_refactor

4fb1345

Merge branch 'master' into find_refactor

5a3c757

KristofferC merged commit 59320c6 into JuliaLang:master Mar 6, 2025
8 checks passed

jakobnissen deleted the find_refactor branch March 6, 2025 12:27

jakobnissen removed the status: waiting for PR reviewer label Mar 6, 2025
