Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Regex engine on PCRE2 #12840

Merged
merged 4 commits into from
Dec 16, 2022

Conversation

straight-shoota
Copy link
Member

This is an initial implementation of a Regex engine backend based on PCRE2 (see #12790).

All existing regex specs succeed with PCRE2 🎉

This is an MVP and does not include JIT compilation (I'm planning a follow-up).
It also does not integrate specific tests for both library versions into CI (another followup).

The engine version is automatically selected at compile time. libpcre2 takes preference and is used when available via pkg-config.
I think this makes sense to see PCRE2 used (when available) without requiring any additional configuration. This should help gather some feedback on usage problems. We can revisit if we want to prioritize the original libpcre for the next release.
Library selection can be forced with the the compiler flags -Dforce_pcre and -Dforce_pcre2.

If you have Crystal software that makes heavy use of regular expressions, please give this patch a try in order to identify potential issues in practical use cases. Performance comparisons are also welcome.

  • We're using the 8-byte version of PCRE2, so the library name is actually libpcre2-8 (this is also the package name on some systems).
  • pkg-config needs the .pc configuration which may require installing the devel package of libpcre2.

src/regex/engine.cr Outdated Show resolved Hide resolved
@jwoertink
Copy link
Contributor

I just gave it a shot locally on my app. We have a chat processor that works like a middleware stack. When you send a chat, each handler in the stack does some sort of regex to check what you're sending. The stack checks memory and timing with Benchmark.memory and Benchmark.realtime respectively.

All the regex we use works with pcre2! The interesting thing is when I run this branch against my app, it increases memory and runs slower. I tried to make a separate repo to reproduce, but of course, in that sample, the memory is actually reduces by half and timing is way faster 😂

# Crystal 1.6.2
❯ ./regex 
TIMING: 00:00:00.000095431
MEMORY: 5952

# This PR with -Dforce_pcre
❯ ./regex 
TIMING: 00:00:00.000096031
MEMORY: 5952

# This PR with -Dforce_pcre2
❯ ./regex 
TIMING: 00:00:00.000038081
MEMORY: 3616

Guess I need to figure out what I'm doing wrong in my app 😅

@jwoertink
Copy link
Contributor

oh, so not sure if this matters or not, but I wanted to benchmark the 3 different versions together. I built my sample code (with release) using 1.6.2, this branch with force_pcre, and again with pcre2.

# bench.cr
require "benchmark"

Benchmark.ips do |x|
  x.report("current") do
    `./regex`
  end
  x.report("pcre1") do
    `./regex1`
  end
  x.report("pcre2") do
    `./regex2`
  end
end
Development/crystal/sandbox is 📦 v0.1.0 via 🔮 v1.6.2 took 14s 
❯ ./bench 
current 295.54  (  3.38ms) (± 0.68%)  36.0kB/op   1.01× slower
  pcre1 299.58  (  3.34ms) (± 0.81%)  36.0kB/op        fastest
  pcre2 171.36  (  5.84ms) (± 1.22%)  36.0kB/op   1.75× slower

Development/crystal/sandbox is 📦 v0.1.0 via 🔮 v1.6.2 took 21s 
❯ ./bench 
current 295.95  (  3.38ms) (± 1.09%)  36.0kB/op   1.01× slower
  pcre1 300.19  (  3.33ms) (± 1.22%)  36.0kB/op        fastest
  pcre2 171.68  (  5.82ms) (± 0.99%)  36.0kB/op   1.75× slower

This actually seems closer to what I'm seeing in my app.

I ran it twice just to be sure, but it seems this is lining up more with what I'm seeing in my app.

Here's a link to the code https://gist.github.com/jwoertink/2af47a1450a071ad4934e691259ea5e5

crystal build --release regex.cr
~/Development/crystal/lang/bin/crystal build --release -Dforce_pcre regex.cr -o regex1
~/Development/crystal/lang/bin/crystal build --release -Dforce_pcre2 regex.cr -o regex2
crystal build --release bench.cr

Maybe someone else has a better way to benchmark this?

@asterite
Copy link
Member

I tried to make a separate repo to reproduce, but of course, in that sample, the memory is actually reduces by half and timing is way faster

You should deploy that separate repo to production then ;-)

Copy link
Member

@beta-ziliani beta-ziliani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving it in the hope that more testing and the follow up with JIT compilation will give us finer numbers on its performance. In the worst case scenario, we revert the bit about PCRE2 being the default for 1.7.


str = File.read(datapath("large_single_line_string.txt"))
str.matches?(/^(?:[A-Za-z0-9+\/]{4})*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)?$/).should be_false
str.matches?(/^(?:[A-Za-z0-9+\/]{4})*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)?$/)
# We don't care whether this actually matches or not, it's just to make
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little problem is that the name of the test is wrong. It matches a large single line string, but it was not expected to match. A better name could be doesn't crash with a...

@beta-ziliani beta-ziliani added this to the 1.7.0 milestone Dec 15, 2022
@straight-shoota straight-shoota merged commit a5373ad into crystal-lang:master Dec 16, 2022
@straight-shoota straight-shoota deleted the feature/pcre2 branch December 16, 2022 11:07
@dmgk
Copy link
Contributor

dmgk commented Dec 16, 2022

This appears to have broken a few std specs on FreeBSD 13.1-STABLE, pcre2-10.40 (please ignore UDP multicast failure, it's unrelated):

Failures:

  1) HTTP::Cookie::Parser parse_cookies parses multiple cookies
     Failure/Error: first.name.should eq("foo")

       Expected: "foo"
            got: "foobar"

     # spec/std/http/cookie_spec.cr:164

  2) StringScanner raises when there is no subgroup
     Failure/Error: expect_raises(KeyError, "Capture group 'something' does not exist") { s["something"] }

       Expected KeyError, got #<IndexError: Index out of bounds> with backtrace:
         # src/slice.cr:213:28 in '[]'
         # src/regex/pcre2.cr:160:12 in '[]'
         # src/string_scanner.cr:204:5 in '[]'
         # spec/std/string_scanner_spec.cr:159:75 in '->'
         # src/spec/example.cr:45:13 in 'internal_run'
         # src/spec/example.cr:33:16 in 'run'
         # src/spec/context.cr:18:23 in 'internal_run'
         # src/spec/context.cr:345:7 in 'run'
         # src/spec/context.cr:18:23 in 'internal_run'
         # src/spec/context.cr:158:7 in 'run'
         # src/spec/dsl.cr:220:7 in '->'
         # src/crystal/at_exit_handlers.cr:14:19 in 'run'
         # src/crystal/main.cr:50:14 in 'exit'
         # src/crystal/main.cr:45:5 in 'main'
         # src/crystal/main.cr:127:3 in 'main'

     # spec/std/string_scanner_spec.cr:159

  3) String #rindex by regex with offset assert
     Failure/Error: it { "bbbb".rindex(/b/, -4).should eq(0) }

       Expected: 0
            got: 1

     # spec/std/string_spec.cr:1116

  4) String #rindex by regex with offset assert
     Failure/Error: it { "bbbb".rindex(/b/, -2).should eq(2) }

       Expected: 2
            got: 3

     # spec/std/string_spec.cr:1114

  5) String #rindex by regex with offset assert
     Failure/Error: it { "日本語日本語".rindex(/日本/, 2).should eq(0) }

       Expected: 0
            got: 3

     # spec/std/string_spec.cr:1117

  6) String #rindex by regex with offset assert
     Failure/Error: it { "bbbb".rindex(/b/, 2).should eq(2) }

       Expected: 2
            got: 3

     # spec/std/string_spec.cr:1111

  7) String #rindex! by regex with offset assert
     Failure/Error: it { "bbbb".rindex!(/b/, 2).should eq(2) }

       Expected: 2
            got: 3

     # spec/std/string_spec.cr:1176

  8) String scan does with number and string
     Failure/Error: "1ab4".scan(/\d+/).map(&.[0]).should eq(["1", "4"])

       Expected: ["1", "4"]
            got: ["4", "4"]

     # spec/std/string_spec.cr:2271

  9) String scan does without block
     Failure/Error: a.scan(/\w+/).map(&.[0]).should eq(["cruel", "world"])

       Expected: ["cruel", "world"]
            got: ["world", "world"]

     # spec/std/string_spec.cr:2217

 10) String scan works when match is empty
     Failure/Error: "hello".scan(r).map(&.[0]).should eq(["hello", ""])

       Expected: ["hello", ""]
            got: ["", ""]

     # spec/std/string_spec.cr:2248

 11) Compress::Deflate::Reader should rewind

       deflate: invalid stored block lengths (Compress::Deflate::Error)
         from src/compress/deflate/reader.cr:112:12 in 'unbuffered_read'
         from src/io/buffered.cr:261:5 in 'fill_buffer'
         from src/io/buffered.cr:103:7 in 'peek'
         from src/io.cr:656:37 in 'gets'
         from src/io.cr:629:5 in 'gets'
         from src/io.cr:599:5 in 'gets'
         from src/io.cr:598:3 in 'gets'
         from spec/std/compress/deflate/deflate_spec.cr:32:7 in '->'
         from src/spec/example.cr:45:13 in 'internal_run'
         from src/spec/example.cr:33:16 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:345:7 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:158:7 in 'run'
         from src/spec/dsl.cr:220:7 in '->'
         from src/crystal/at_exit_handlers.cr:14:19 in 'run'
         from src/crystal/main.cr:50:14 in 'exit'
         from src/crystal/main.cr:45:5 in 'main'
         from src/crystal/main.cr:127:3 in 'main'
       

 12) Compress::Deflate::Reader should read byte by byte (#4192)

       deflate: invalid stored block lengths (Compress::Deflate::Error)
         from src/compress/deflate/reader.cr:112:12 in 'unbuffered_read'
         from src/io/buffered.cr:261:5 in 'fill_buffer'
         from src/io/buffered.cr:51:5 in 'read_byte'
         from spec/std/compress/deflate/deflate_spec.cr:20:19 in '->'
         from src/spec/example.cr:45:13 in 'internal_run'
         from src/spec/example.cr:33:16 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:345:7 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:158:7 in 'run'
         from src/spec/dsl.cr:220:7 in '->'
         from src/crystal/at_exit_handlers.cr:14:19 in 'run'
         from src/crystal/main.cr:50:14 in 'exit'
         from src/crystal/main.cr:45:5 in 'main'
         from src/crystal/main.cr:127:3 in 'main'
       

 13) StringScanner raises when there is no subgroup

       Index out of bounds (IndexError)
         from src/slice.cr:213:28 in '[]'
         from src/regex/pcre2.cr:160:12 in '[]?'
         from src/string_scanner.cr:230:23 in '[]?'
         from spec/std/string_scanner_spec.cr:191:5 in '->'
         from src/spec/example.cr:45:13 in 'internal_run'
         from src/spec/example.cr:33:16 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:345:7 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:158:7 in 'run'
         from src/spec/dsl.cr:220:7 in '->'
         from src/crystal/at_exit_handlers.cr:14:19 in 'run'
         from src/crystal/main.cr:50:14 in 'exit'
         from src/crystal/main.cr:45:5 in 'main'
         from src/crystal/main.cr:127:3 in 'main'
       

 14) UDPSocket using IPv6 joins and transmits to multicast groups

       Error sending datagram to [ff02::102]:20588: Network is unreachable (Socket::Error)
         from src/crystal/system/unix/socket.cr:90:5 in 'system_send_to'
         from src/socket.cr:214:5 in 'send'
         from spec/std/socket/udp_socket_spec.cr:75:5 in '->'
         from src/spec/example.cr:45:13 in 'internal_run'
         from src/spec/example.cr:33:16 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:345:7 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:345:7 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:158:7 in 'run'
         from src/spec/dsl.cr:220:7 in '->'
         from src/crystal/at_exit_handlers.cr:14:19 in 'run'
         from src/crystal/main.cr:50:14 in 'exit'
         from src/crystal/main.cr:45:5 in 'main'
         from src/crystal/main.cr:127:3 in 'main'
       

 15) Compress::Zlib::Reader should not freeze when reading empty slice

       Invalid header (Compress::Zlib::Error)
         from src/compress/zlib/reader.cr:109:5 in 'invalid_header'
         from src/compress/zlib/reader.cr:35:7 in 'read_header'
         from src/compress/zlib/reader.cr:17:5 in 'initialize'
         from src/compress/zlib/reader.cr:16:3 in 'new'
         from spec/std/compress/zlib/reader_spec.cr:73:7 in '->'
         from src/spec/example.cr:45:13 in 'internal_run'
         from src/spec/example.cr:33:16 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:345:7 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:158:7 in 'run'
         from src/spec/dsl.cr:220:7 in '->'
         from src/crystal/at_exit_handlers.cr:14:19 in 'run'
         from src/crystal/main.cr:50:14 in 'exit'
         from src/crystal/main.cr:45:5 in 'main'
         from src/crystal/main.cr:127:3 in 'main'
       

 16) Compress::Zlib::Reader rewinds

       Invalid header (Compress::Zlib::Error)
         from src/compress/zlib/reader.cr:109:5 in 'invalid_header'
         from src/compress/zlib/reader.cr:35:7 in 'read_header'
         from src/compress/zlib/reader.cr:17:5 in 'initialize'
         from src/compress/zlib/reader.cr:16:3 in 'new'
         from spec/std/compress/zlib/reader_spec.cr:30:7 in '->'
         from src/spec/example.cr:45:13 in 'internal_run'
         from src/spec/example.cr:33:16 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:345:7 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:158:7 in 'run'
         from src/spec/dsl.cr:220:7 in '->'
         from src/crystal/at_exit_handlers.cr:14:19 in 'run'
         from src/crystal/main.cr:50:14 in 'exit'
         from src/crystal/main.cr:45:5 in 'main'
         from src/crystal/main.cr:127:3 in 'main'
       

 17) Compress::Zlib::Reader should be able to read

       Invalid header (Compress::Zlib::Error)
         from src/compress/zlib/reader.cr:109:5 in 'invalid_header'
         from src/compress/zlib/reader.cr:35:7 in 'read_header'
         from src/compress/zlib/reader.cr:17:5 in 'initialize'
         from src/compress/zlib/reader.cr:16:3 in 'new'
         from spec/std/compress/zlib/reader_spec.cr:17:7 in '->'
         from src/spec/example.cr:45:13 in 'internal_run'
         from src/spec/example.cr:33:16 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:345:7 in 'run'
         from src/spec/context.cr:18:23 in 'internal_run'
         from src/spec/context.cr:158:7 in 'run'
         from src/spec/dsl.cr:220:7 in '->'
         from src/crystal/at_exit_handlers.cr:14:19 in 'run'
         from src/crystal/main.cr:50:14 in 'exit'
         from src/crystal/main.cr:45:5 in 'main'
         from src/crystal/main.cr:127:3 in 'main'
       

Finished in 1:03 minutes
15071 examples, 10 failures, 7 errors, 13 pending

Failed examples:

crystal spec spec/std/http/cookie_spec.cr:160 # HTTP::Cookie::Parser parse_cookies parses multiple cookies
crystal spec spec/std/string_scanner_spec.cr:152 # StringScanner raises when there is no subgroup
crystal spec spec/std/string_spec.cr:1116 # String #rindex by regex with offset assert
crystal spec spec/std/string_spec.cr:1114 # String #rindex by regex with offset assert
crystal spec spec/std/string_spec.cr:1117 # String #rindex by regex with offset assert
crystal spec spec/std/string_spec.cr:1111 # String #rindex by regex with offset assert
crystal spec spec/std/string_spec.cr:1176 # String #rindex! by regex with offset assert
crystal spec spec/std/string_spec.cr:2270 # String scan does with number and string
crystal spec spec/std/string_spec.cr:2215 # String scan does without block
crystal spec spec/std/string_spec.cr:2246 # String scan works when match is empty
crystal spec spec/std/compress/deflate/deflate_spec.cr:28 # Compress::Deflate::Reader should rewind
crystal spec spec/std/compress/deflate/deflate_spec.cr:14 # Compress::Deflate::Reader should read byte by byte (#4192)
crystal spec spec/std/string_scanner_spec.cr:185 # StringScanner raises when there is no subgroup
crystal spec spec/std/socket/udp_socket_spec.cr:75 # UDPSocket using IPv6 joins and transmits to multicast groups
crystal spec spec/std/compress/zlib/reader_spec.cr:71 # Compress::Zlib::Reader should not freeze when reading empty slice
crystal spec spec/std/compress/zlib/reader_spec.cr:27 # Compress::Zlib::Reader rewinds
crystal spec spec/std/compress/zlib/reader_spec.cr:14 # Compress::Zlib::Reader should be able to read

@asterite
Copy link
Member

In fact this is also failing in master. I think this was merged but CI said it was failing? Strange!

@straight-shoota
Copy link
Member Author

straight-shoota commented Dec 16, 2022

@asterite Yes, it's merged. But this branch was green before merging, and master is also (mostly) green after merging (https://github.com/crystal-lang/crystal/actions/runs/3712561573). The failing spec is an unrelated interpreter test which appears to be failing every now and then.

@straight-shoota
Copy link
Member Author

@dmgk Thanks for reporting. It's striking that no spec for regex itself is failing. The failures seem to be related to string scanner (which seems to be used way too much!).
Either the scanner implementation is wrong or we're missing specs for some regex edge case.

@HertzDevil
Copy link
Contributor

HertzDevil commented Dec 16, 2022

Is CI really using PCRE2? Is there any indication the pkg-config command in src/regex/engine.cr actually succeeded on CI?

Also I got even more failures on my M2, even for Regex itself:

  1) Regex #match returns matchdata
     Failure/Error: md[0].should eq "Cr"

       Expected: "Cr"
            got: ""

     # spec/std/regex_spec.cr:39

  2) Regex #match_at_byte_index positive index
     Failure/Error: md.begin.should eq 1

       Expected: 1
            got: 0

     # spec/std/regex_spec.cr:103

  3) Regex #match_at_byte_index multibyte index
     Failure/Error: md.begin.should eq 1

       Expected: 1
            got: 0

     # spec/std/regex_spec.cr:114

  4) Regex #=== assigns captures
     Failure/Error: $1.should eq("oo")

       Expected: "oo"
            got: ""

     # spec/std/regex_spec.cr:252

  5) Regex #=~ returns match index or nil
     Failure/Error: (/foo/ =~ "bar foo baz").should eq(4)

       Expected: 4
            got: 0

     # spec/std/regex_spec.cr:262

  6) Regex #=~ assigns captures
     Failure/Error: $1.should eq("oo")

       Expected: "oo"
            got: ""

     # spec/std/regex_spec.cr:269

  7) Regex .union constructs a Regex that matches things any of its arguments match
     Failure/Error: re.match("Skiing").not_nil![0].should eq "Skiing"

       Expected: "Skiing"
            got: ""

     # spec/std/regex_spec.cr:388

It appears that this match_data_free zeroes the ovector array, most likely because LibC.free on macOS 13+ does this to the deallocated memory immediately:

private def match_impl(str, byte_index, options)
  match_data = match_data(str, byte_index, options) || return

  ovector = LibPCRE2.get_ovector_pointer(match_data)
  ovector_count = LibPCRE2.get_ovector_count(match_data)
  LibPCRE2.match_data_free(match_data) # clears `ovector`

  ::Regex::MatchData.new(self, @re, str, byte_index, ovector, ovector_count.to_i32 - 1)
end

By the way I created #12847 for Windows support, where pkg-config doesn't exist, so all spec failures there must be due to PCRE2.

@asterite
Copy link
Member

asterite commented Dec 16, 2022

@straight-shoota Can you try checking out master and running this?

bin/crystal spec spec/std/http/cookie_spec.cr:160

It's failing for me.

Also all of the failing specs mentioned in #12840 (comment) are failing for me on master. Maybe CI isn't working well then, if it was able to merge that PR. Oh, it must be what HertzDevil is saying (Is CI really using PCRE2?)

@straight-shoota
Copy link
Member Author

Yeah right, CI isn't using PCRE2 yet.

I think I had tested the entire stdlib spec suit locally, though.

straight-shoota added a commit that referenced this pull request Dec 16, 2022
@straight-shoota straight-shoota removed this from the 1.7.0 milestone Dec 16, 2022
@straight-shoota
Copy link
Member Author

This was reverted in #12850 due to bugs and missing tests.

I'll re-issue a patched version shortly.

straight-shoota added a commit that referenced this pull request Dec 17, 2022
straight-shoota added a commit that referenced this pull request Dec 19, 2022
carlhoerberg pushed a commit to carlhoerberg/crystal that referenced this pull request Dec 21, 2022
@straight-shoota straight-shoota added the status:reverted PR was reverted or reverts another one, and is not part of a milestone label Mar 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:feature status:reverted PR was reverted or reverts another one, and is not part of a milestone topic:stdlib:text
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants