Emoji in regex breaks MS search #838

Open
tripleee opened this issue Jan 8, 2021 · 3 comments
Labels
area: search Post search on metasmoke

tripleee commented Jan 8, 2021

What problem has occurred? What issues has it caused?

Charcoal-SE/SmokeDetector#5550 links to https://metasmoke.erwaysoftware.com/search?utf8=%E2%9C%93&body_is_regex=1&body=%28%3Fs%3A%5Cb%5B%5Cs.%3E%5D%2A%F0%9F%98%8D%F0%9F%98%8D%2B%5CW%2A%5Cb%29 which however produces a Ruby traceback for me.

Mysql2::Error: Got error 'nothing to repeat at offset 14' from regexp: SELECT COUNT(*) AS count_all, `posts`.`is_tp` AS posts_is_tp, `posts`.`is_fp` AS posts_is_fp, `posts`.`is_naa` AS posts_is_naa FROM `posts` WHERE (IFNULL(`posts`.`body`, '') REGEXP '(?s:\\b[\\s.>]*😍😍+\\W*\\b)') GROUP BY `posts`.`is_tp`, `posts`.`is_fp`, `posts`.`is_naa`

  respond_to do |format|
       format.html do
>>>      @counts_by_accuracy_group = @results.group(:is_tp, :is_fp, :is_naa).count
         @counts_by_feedback = %i[is_tp is_fp is_naa].each_with_index.map do |symbol, i|
           [symbol, @counts_by_accuracy_group.select { |k, _v| k[i] }.values.sum]
         end.to_h

What would you like to happen/not happen?

The regex is not really wrong; the search should run and show the hits instead of crashing.

Looks like the regex engine in MariaDB doesn't think an emoji is something you can repeat? Dunno if we can devise a workaround or should just defer this upstream.

tripleee commented Jan 8, 2021

Just https://metasmoke.erwaysoftware.com/search?utf8=%E2%9C%93&body_is_regex=1&body=%F0%9F%98%8D stunningly crashes with "nothing to repeat" so it's the emoji itself which produces the error.
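For reference, here is a minimal Ruby sketch (not metasmoke code) that decodes the `body` parameter from that link and shows what the pattern actually contains. The variable names are only for illustration. It suggests the whole pattern is a single character whose code point lies above U+FFFF.

```ruby
require 'uri'

# The value of the `body` query parameter in the link above.
encoded = '%F0%9F%98%8D'

# Percent-decode it; this method produces a UTF-8 string by default.
pattern = URI.decode_www_form_component(encoded)

puts pattern.length               # number of characters in the pattern
puts pattern.bytesize             # number of UTF-8 bytes
puts format('U+%X', pattern.ord)  # code point of the first (and only) character
puts pattern.ord > 0xFFFF         # true when the character is outside the Basic Multilingual Plane
```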

@tripleee tripleee changed the title Broken MS search link in #5550 Emoji search breaks MS Jan 8, 2021
@tripleee tripleee changed the title Emoji search breaks MS Emoji search breaks MS regex Jan 8, 2021
@tripleee tripleee changed the title Emoji search breaks MS regex Emoji in regex breaks MS search Jan 8, 2021
@makyen makyen transferred this issue from Charcoal-SE/SmokeDetector Feb 12, 2021
@thesecretmaster thesecretmaster added the area: search Post search on metasmoke label Dec 23, 2021
makyen commented Jul 21, 2022

This appears to be a limitation in the Regex implementation which is used in the database. It doesn't accept, or ignores, characters which are > 0xFFFF (either as characters or as Unicode escapes; e.g. \x{0b03}, which can have a max of 4 hex digits), so a lot of emoji just won't be recognized.
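If that is the cause, one possible stopgap would be to detect such characters before the pattern reaches the database. The sketch below is only an assumption about how that check could look: `chars_beyond_bmp` is a hypothetical helper, not something that exists in metasmoke, and whether to reject the search, warn the user, or fall back to a non-regex search is left open.

```ruby
# Hypothetical helper: list the distinct characters above U+FFFF in a pattern.
def chars_beyond_bmp(pattern)
  pattern.each_char.select { |ch| ch.ord > 0xFFFF }.uniq
end

# The regex from the original report, with the emoji written as an escape.
heart_eyes = "\u{1F60D}"
pattern = '(?s:\b[\s.>]*' + heart_eyes + heart_eyes + '+\W*\b)'

offenders = chars_beyond_bmp(pattern)
if offenders.empty?
  puts 'pattern only uses characters up to U+FFFF'
else
  offenders.each do |ch|
    # Characters above U+FFFF need a surrogate pair (two code units) in UTF-16.
    units = ch.encode('UTF-16BE').bytesize / 2
    puts format('U+%X needs %d UTF-16 code units', ch.ord, units)
  end
end
```

This only identifies the problematic characters; it does not make the database regex engine match them.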
