BUG: Multicharacter support #466
It's been a while since I touched the code, but from what I remember the matching was done on bytes rather than characters. It should be able to match unicode byte sequences just as well as ASCII ones. However, that relies on both languages.json and the source code using the same encoding. If they're both utf8 encoded then the byte sequences are identical, and in theory it should be able to match them. On the other hand, if one is utf8 and the other is utf16, the underlying binary representations won't match, so a byte sequence comparison will fail.
Yep, you are correct. That's the exact reason for it. What I am debating is whether it should be fixed in the state machine, OR whether there should be the ability to add new state machines for individual languages. That's something I wanted anyway in order to resolve some annoying issues, and as a potential performance improvement: common languages such as Java could get their own specific code path.
I remember seeing comments about the idea of language specific state machines to support the case of having one language nested within another. The current state machine is built around looking up the bytes against a fixed size 256 element array, which wouldn't adapt well to unicode runes. The trees are probably not dense enough to justify a binary search within a node, so doing a linear search within each node might be the most obvious solution. Another solution would be to use the current…
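The trade-off described above can be sketched as follows. This is a hypothetical illustration, not scc's actual node types: the current design is a dense table indexed directly by byte value, while a rune-based node would hold a sparse edge list scanned linearly:

```go
package main

import "fmt"

// byteNode mirrors the current approach: a dense 256-entry table
// indexed directly by the next input byte. O(1) lookup, but the
// index must fit in a byte, which runes do not.
type byteNode struct {
	next  [256]*byteNode
	match bool
}

// runeNode is a sketch of the alternative: rune transitions are sparse,
// so each node holds a small slice of edges that is scanned linearly.
type runeNode struct {
	transitions []runeEdge
	match       bool
}

type runeEdge struct {
	r    rune
	next *runeNode
}

// step does a linear scan over the node's edges; with only a handful of
// edges per node this is typically cheaper than a binary search.
func (n *runeNode) step(r rune) *runeNode {
	for _, e := range n.transitions {
		if e.r == r {
			return e.next
		}
	}
	return nil
}

func main() {
	// Tiny machine matching a single hypothetical one-rune token "若".
	end := &runeNode{match: true}
	root := &runeNode{transitions: []runeEdge{{'若', end}}}

	state := root
	for _, r := range "若" { // ranging over a string decodes runes, not bytes
		state = state.step(r)
	}
	fmt.Println("matched:", state != nil && state.match) // matched: true
}
```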
The idea behind the language specific state machine was yes, to support that, as well as to skip a heap of logic that does not apply where possible. For example, Java does not support nested multiline comments, so skip the bookkeeping for that and hopefully gain a performance improvement. I would only be looking to do this for the most common languages though, as the generic version is much better for dealing with ad-hoc additions. It also would allow for dealing with any really oddball languages that pop up. My initial plan is to just see what effect some rune conversion has as a baseline and go from there.
Describe the bug
The addition of the wenyan language via #465 highlights an issue in how scc matches. If you run it against any wenyan file, it will not count the complexity. This is down to how the scc state machine works: it matches on bytes, not runes.
To Reproduce
go run . ./examples/language/wenyan.wy
from inside the scc directory on a recent checkout.
Expected behavior
It is expected that there will be some complexity counted here.