Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace hand-translated machines with nom #76

Closed
wants to merge 1 commit into from

Conversation

notgull
Copy link
Collaborator

@notgull notgull commented Aug 19, 2023

The _machine.rs files are unmaintainable, as they are a 1:1 handwritten mirror of the machine-generated state machine files for Harfbuzz. My goal is to make this crate more maintainable by writing these files in such a way that it corresponds with the .rl files. This way, changes to the .rl files can be easily reflected in rustybuzz.

My original attempt to replace the indic_machine.rs file, however, has not passed tests. I am not familiar with ragel itself and I am mystified by its semantics. As there isn't a ragel chat room I can ask for help, as far as I know, I figure that this is the best place to ask.

I've focused on one specific test, indic_old_spec_003. The input to the parser looks like this:

categories: [C, H, CM, H, X]

The parser for the consonant_syllable rule looks like this:

c = (C | Ra);			# is_consonant
n = ((ZWNJ?.RS)? (N.N?)?);	# is_consonant_modifier
z = ZWJ|ZWNJ;			# is_joiner
reph = (Ra H | Repha);		# possible reph

cn = c.ZWJ?.n?;
symbol = Symbol.N?;
matra_group = z*.(M | SM? MPst).N?.H?;
syllable_tail = (z?.SM.SM?.ZWNJ?)? (A | VD)*;
halant_group = (z?.H.(ZWJ.N?)?);
final_halant_group = halant_group | H.ZWNJ;
medial_group = CM?;
halant_or_matra_group = (final_halant_group | matra_group*);

complex_syllable_tail = (halant_group.cn)* medial_group halant_or_matra_group syllable_tail;
consonant_syllable =	(Repha|CS)? cn complex_syllable_tail;
broken_cluster =	reph? n? complex_syllable_tail;
other =			any;

Source

For this rule, the first (Repha|CS)? block evaluates to no input, as the first item is C which is neither Repha nor CS. The next item is cn, which matches the C tag and consumed it. The next tag is H, which doesn't match ZWJ or n, so the cn rule completes and we move on to complex_syllable_tail.

The complex_syllable_tail rule starts with (halant_group.cn)*, which would originally match the H, but there is no C at the end, so this rule evaluates to no input. the next medial_group rule is CM?. As the current input is H, CM doesn't match, so this rule evaluates to no input as well. halant_or_matra_group goes to final_halant_group which goes to halant_group which matches H and nothing else, consuming it. Finally, syllable_tail matches the lack of input at the end. Therefore the range from 0..2 is classified as a constant syllable.

Then, the next item on the chopping block is CM. Out of all the rules, this matches the complex_syllable_tail part of broken_cluster, along with the H. Therefore 2..4 is a broken cluster. Finally, the X at the end becomes a non-indic character.

However, this fails the test. After wiring some telemetry to the current rustybuzz master, I've found that it classifies the range from 0..4 as a consonant syllable and the 4..5 range as non-indic. I'm not sure how it does this; it feels like what would happen is that the H is somehow consumed by either the cn or the halant_group.cn before the CM is consumed by the medial_group. Unless ragel's semantics are wildly different from what I understand, it is unclear to me how this would happen.

Is there anyone who knows ragel well enough to help my understanding of it here?

cc #74, @bluebear94

@RazrFalcon
Copy link
Collaborator

RazrFalcon commented Aug 19, 2023

I'm not sure I would call the nom variant more maintainable. It's a complete gibberish for me. And performance is a big question as well.

If you have time, I would suggest compiling master ragel (which is an autotools hell and you need colm as well) and try its Rust output generator. I've tried it a couple of weeks ago and it was failing with a cryptic error.
But I think that modifying hb ragel files to output Rust code directly via ragel is far better from maintainability standpoint.

@notgull
Copy link
Collaborator Author

notgull commented Aug 19, 2023

Superseded by #77

@notgull notgull closed this Aug 19, 2023
@notgull notgull deleted the machine branch August 19, 2023 23:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants