Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify compile_class_not_nested (followup to ECLASS) #589

Open
1 task
NWilson opened this issue Dec 2, 2024 · 1 comment
Open
1 task

Simplify compile_class_not_nested (followup to ECLASS) #589

NWilson opened this issue Dec 2, 2024 · 1 comment
Labels
untidiness Not exactly a bug, but could do better
Milestone

Comments

@NWilson
Copy link
Member

NWilson commented Dec 2, 2024

The function compile_class_not_nested is very big and complicated. I reckon we can help it out:

  • Always have cranges. For the no-wide-char cases, we can just use a cranges object on the stack (to avoid malloc) and write into that. The wide-char cases would still need to malloc a cranges. But the advantage would be: all the code which follows can avoid having two separate paths.
  • We should put the bitmap into the struct cranges. Then, the struct cranges would be a complete description of the characters within the class (that is - all the non-property characters in the class), which would make it easier to reason about. "The characters matched by the class are identified by this bitmap, these cranges/clists, plus the Unicode properties."
  • We have code which is split out into separate places. For example, POSIX classes and escapes are handled twice - once when building the cranges, to add in the non-ASCII characters, and again after building the cranges, to add the ASCII characters to the bitmap.

I think we should do build the cranges from the METAs (ideally in one pass, although two passes would be acceptable); then one single pass over the cranges to build the output (without needing to back over the METAs - except possibly just to pick up Unicode property escapes which were completely ignored in the cranges).

Currently, the 16-bit library always, unconditionally, mallocs a cranges object even for super-simple character classes that just match ASCII characters! Using a small stack-alloced cranges as the starting point would be very nice, and we would only grow it to a heap-alloced one if the character class is too big to fit.

Items in the character class should handled in exactly one place: if the item is a character literal or POSIX class then it goes in the cranges, and shouldn't need to be handled elsewhere; if it's a Unicode escape then it's ignored in the cranges and is handled just once later on.

Additionally:

  • Consider removing XCL_END
@zherczeg
Copy link
Collaborator

zherczeg commented Dec 2, 2024

The function simplified a lot compared to its original form. It could cache the entire compiled class though. Then no processing during the lengthptr != case, just a memory copy.

@NWilson NWilson added the untidiness Not exactly a bug, but could do better label Dec 6, 2024
@NWilson NWilson added this to the 10.45-RC1 milestone Dec 6, 2024
@NWilson NWilson self-assigned this Dec 10, 2024
@NWilson NWilson modified the milestones: 10.45-RC1, 10.45 Dec 18, 2024
@NWilson NWilson modified the milestones: 10.45, 10.46 Jan 8, 2025
@NWilson NWilson removed their assignment Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
untidiness Not exactly a bug, but could do better
Projects
None yet
Development

No branches or pull requests

2 participants