Case insensitive doesn't handle multi bytes UTF-8 character #490

RadhiFadlillah · 2024-09-16T15:30:31Z

Hi @skvadrik, I'm sorry for opening another issue this quickly. This issue might be related to #118, but since that one is from 9 years ago I thought it's better to create a new issue than to necro that one.

For example, I have the following string:

brÖther may I have söme föÖds

I want to search for ö case-insensitively, so I created the following template:

func countRune(input string) int {
	var count int
	var cursor, marker int
	_ = marker

	input += string(rune(0)) // add terminating null
	limit := len(input) - 1  // limit points at the terminating null

	for { /*!re2c
		re2c:eof              = 0;
		re2c:yyfill:enable    = 0;
		re2c:posix-captures   = 0;
		re2c:case-insensitive = 1;

		re2c:define:YYCTYPE     = byte;
		re2c:define:YYPEEK      = "input[cursor]";
		re2c:define:YYSKIP      = "cursor++";
		re2c:define:YYBACKUP    = "marker = cursor";
		re2c:define:YYRESTORE   = "cursor = marker";
		re2c:define:YYLESSTHAN  = "limit <= cursor";

		ö { count++; continue }
		* { continue }
		$ { return count }
		*/
	}
}

When we run the generated code, the function above only finds the 2 lowercase ö characters and ignores the capital Ö. As a workaround, we can use square brackets to explicitly specify both of them:

	for { /*!re2c
		...
-		ö { count++; continue }
+		[öÖ] { count++; continue }
		...
		*/
	}

However, it would be nice if re2c could handle this internally.

Thanks!


RadhiFadlillah · 2024-09-16T15:42:57Z

By the way, I also noticed that re2go doesn't apply case insensitivity to characters inside square brackets. For example:

func countA(input string) int {
	for { /*!re2c
		a   { count++; continue }  // This will match a and A
		[a] { count++; continue }  // This will only match a
		* { continue }
		$ { return count }
		*/
	}
}

Is that the expected behavior?

skvadrik · 2024-09-16T15:55:54Z

> Hi @skvadrik, I'm sorry for opening another issue this quickly.

You are very welcome!

> This issue might be related to #118, but since that one is from 9 years ago I thought it's better to create a new issue than to necro that one.

Sure. It is related, and I'm afraid no progress has been made in this area. It is definitely a good issue that I'd like to fix, but I don't have the resources to fix it in the upcoming release 4.0. Leave this bug report open as a reminder and I'll prioritize it for the next release.

> By the way, I also noticed that re2go doesn't apply case insensitivity to characters inside square brackets. Is that the expected behavior?

I'd say no, although I'm not sure what the original intention was, as these features predate my experience with re2c. I think that with --case-insensitive, ranges in square brackets should also be case-insensitive, just as both single-quoted and double-quoted strings are. Making it so would break backwards compatibility, but would arguably make the behavior more intuitive. As it's not the most popular option, it may be acceptable. Alternatively we may add a new option, but that also has its downsides.

Anyway, it's a good point.

pmetzger · 2024-09-16T18:31:31Z

This should be an obvious remark, but: if default behavior does get changed, it should be very prominently marked in the release notes.

skvadrik · 2024-09-16T18:51:52Z

@pmetzger True. So far re2c has done a good job of not breaking backwards compatibility, and we should hold on to this. I thought at first that in this case we'll only increase the subset of matching strings, so it won't break any existing code. But then I thought of the negative ranges and range subtraction, and it's not so easy after all.
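One way to read the concern about negative ranges is sketched below in plain Go. The `class` type and `fold` function are hypothetical models made up for this illustration and are not re2c internals; `fold` assumes ASCII-only folding applied to the listed characters before negation. Under that assumption, folding a positive class like `[a]` enlarges the set of matching bytes, while folding a negated class like `[^a]` shrinks it, so input that matches today could stop matching.

```go
package main

import "fmt"

// class is a hypothetical model of a byte class such as [a] or [^a]:
// a set of listed bytes plus a negation flag.
type class struct {
	members map[byte]bool
	negated bool
}

// matches reports whether b is accepted by the class.
func (c class) matches(b byte) bool {
	return c.members[b] != c.negated
}

// fold returns a copy of c in which every listed ASCII letter also
// lists its other-case counterpart. The negation flag is kept as is.
func fold(c class) class {
	out := class{members: map[byte]bool{}, negated: c.negated}
	for b := range c.members {
		out.members[b] = true
		switch {
		case 'a' <= b && b <= 'z':
			out.members[b-'a'+'A'] = true
		case 'A' <= b && b <= 'Z':
			out.members[b-'A'+'a'] = true
		}
	}
	return out
}

func main() {
	pos := class{members: map[byte]bool{'a': true}}                // models [a]
	neg := class{members: map[byte]bool{'a': true}, negated: true} // models [^a]

	// Positive class: 'A' starts matching after folding (set grows).
	fmt.Println(pos.matches('A'), fold(pos).matches('A')) // false true

	// Negated class: 'A' stops matching after folding (set shrinks).
	fmt.Println(neg.matches('A'), fold(neg).matches('A')) // true false
}
```

Range subtraction would be affected the same way in this model, since the subtracted side grows when it is folded.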

helly25 · 2024-09-16T21:36:49Z

IMO, the best way to address both issues, extended Unicode case insensitivity and the same for ranges, is to add new flags and options. There's no reason to change the existing behavior and break people who rely on it. Cheers, Marcus

