Case insensitive doesn't handle multi bytes UTF-8 character #490

Open
RadhiFadlillah opened this issue Sep 16, 2024 · 5 comments

@RadhiFadlillah

Hi @skvadrik, I'm sorry for opening another issue so soon. This issue might be related to #118, but since that one is from 9 years ago, I thought it better to create a new issue than to necro that one.

For example, I have the following string:

brÖther may I have söme föÖds

I want to search for ö case-insensitively, so I created the following template:

func countRune(input string) int {
	var count int
	var cursor, marker int
	_ = marker

	input += string(rune(0)) // add terminating null
	limit := len(input) - 1  // limit points at the terminating null

	for { /*!re2c
		re2c:eof              = 0;
		re2c:yyfill:enable    = 0;
		re2c:posix-captures   = 0;
		re2c:case-insensitive = 1;

		re2c:define:YYCTYPE     = byte;
		re2c:define:YYPEEK      = "input[cursor]";
		re2c:define:YYSKIP      = "cursor++";
		re2c:define:YYBACKUP    = "marker = cursor";
		re2c:define:YYRESTORE   = "cursor = marker";
		re2c:define:YYLESSTHAN  = "limit <= cursor";

		ö { count++; continue }
		* { continue }
		$ { return count }
		*/
	}
}

When we run the generated code, the function above finds only the 2 lowercase ö characters and ignores the capital Ö. As a workaround, we can use square brackets to specify both of them explicitly:

	for { /*!re2c
		...
-		ö { count++; continue }
+		[öÖ] { count++; continue }
		...
		*/
	}

However, it would be nice if re2c could handle this internally.

Thanks!

@RadhiFadlillah
Author

By the way, I also noticed that re2go doesn't apply case-insensitivity to characters inside square brackets. For example:

func countA(input string) int {
	for { /*!re2c
		a   { count++; continue }  // This will match a and A
		[a] { count++; continue }  // This will only match a
		* { continue }
		$ { return count }
		*/
	}
}

Is that the expected behavior?

@skvadrik
Owner

> Hi @skvadrik, I'm sorry for opening another issue so soon.

You are very welcome!

> This issue might be related to #118, but since that one is from 9 years ago, I thought it better to create a new issue than to necro that one.

Sure. It is related, and I'm afraid no progress has been made in this area. It is definitely a good issue that I'd like to fix, but I don't have the resources to fix it in the upcoming release 4.0. Leave this bug report open as a reminder and I'll prioritize it for the next release.

> By the way, I also noticed that re2go doesn't apply case-insensitivity to characters inside square brackets. Is that the expected behavior?

I'd say no, although I'm not sure what the original intention was, as these features predate my experience with re2c. I think that with --case-insensitive, ranges in square brackets should also be case-insensitive, just as both single-quoted and double-quoted strings are. Making it so would break backwards compatibility, but would arguably make the behavior more intuitive. As it's not the most popular option, that may be acceptable. Alternatively, we could add a new option, but that also has its downsides.

Anyway, it's a good point.

@pmetzger
Contributor

This should be an obvious remark, but: if default behavior does get changed, it should be very prominently marked in the release notes.

@skvadrik
Owner

@pmetzger True. So far re2c has done a good job of not breaking backwards compatibility, and we should hold on to this. I thought at first that in this case we'll only increase the subset of matching strings, so it won't break any existing code. But then I thought of the negative ranges and range subtraction, and it's not so easy after all.

@helly25
Collaborator

helly25 commented Sep 16, 2024 via email
