Simplify compile_class_not_nested (followup to ECLASS) #589

NWilson · 2024-12-02T09:44:02Z

The function compile_class_not_nested is very big and complicated. I reckon we can help it out:

Always have cranges. For the no-wide-char cases, we can just use a cranges object on the stack (to avoid malloc) and write into that. The wide-char cases would still need to malloc a cranges. But the advantage would be: all the code which follows can avoid having two separate paths.
We should put the bitmap into the struct cranges. Then, the struct cranges would be a complete description of the characters within the class (that is - all the non-property characters in the class), which would make it easier to reason about. "The characters matched by the class are identified by this bitmap, these cranges/clists, plus the Unicode properties."
We have code which is split out into separate places. For example, POSIX classes and escapes are handled twice - once when building the cranges, to add in the non-ASCII characters, and again after building the cranges, to add the ASCII characters to the bitmap.

I think we should do build the cranges from the METAs (ideally in one pass, although two passes would be acceptable); then one single pass over the cranges to build the output (without needing to back over the METAs - except possibly just to pick up Unicode property escapes which were completely ignored in the cranges).

Currently, the 16-bit library always, unconditionally, mallocs a cranges object even for super-simple character classes that just match ASCII characters! Using a small stack-alloced cranges as the starting point would be very nice, and we would only grow it to a heap-alloced one if the character class is too big to fit.

Items in the character class should handled in exactly one place: if the item is a character literal or POSIX class then it goes in the cranges, and shouldn't need to be handled elsewhere; if it's a Unicode escape then it's ignored in the cranges and is handled just once later on.

Additionally:

Consider removing XCL_END

zherczeg · 2024-12-02T10:32:21Z

The function simplified a lot compared to its original form. It could cache the entire compiled class though. Then no processing during the lengthptr != case, just a memory copy.

NWilson added the untidiness Not exactly a bug, but could do better label Dec 6, 2024

NWilson added this to the 10.45-RC1 milestone Dec 6, 2024

NWilson self-assigned this Dec 10, 2024

NWilson modified the milestones: 10.45-RC1, 10.45 Dec 18, 2024

NWilson mentioned this issue Dec 26, 2024

Fix UCP-only handling of caseless character classes #644

Merged

NWilson modified the milestones: 10.45, 10.46 Jan 8, 2025

NWilson removed their assignment Jan 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify compile_class_not_nested (followup to ECLASS) #589

Simplify compile_class_not_nested (followup to ECLASS) #589

NWilson commented Dec 2, 2024 •

edited

Loading

zherczeg commented Dec 2, 2024

Simplify compile_class_not_nested (followup to ECLASS) #589

Simplify compile_class_not_nested (followup to ECLASS) #589

Comments

NWilson commented Dec 2, 2024 • edited Loading

zherczeg commented Dec 2, 2024

NWilson commented Dec 2, 2024 •

edited

Loading