Replace hand-translated machines with nom #76
Closed
The `_machine.rs` files are unmaintainable, as they are a 1:1 handwritten mirror of the machine-generated state machine files for HarfBuzz. My goal is to make this crate more maintainable by writing these files so that they correspond directly to the `.rl` files. This way, changes to the `.rl` files can be easily reflected in `rustybuzz`.

My original attempt to replace the `indic_machine.rs` file, however, has not passed the tests. I am not familiar with `ragel` itself and I am mystified by its semantics. As far as I know, there isn't a `ragel` chat room where I can ask for help, so I figure that this is the best place to ask.

I've focused on one specific test, `indic_old_spec_003`. The input to the parser looks like this:

The parser for the `consonant_syllable` rule looks like this:

Source

For this rule, the first `(Repha|CS)?` block matches no input, as the first item is `C`, which is neither `Repha` nor `CS`. The next part of the rule is `cn`, which matches the `C` tag and consumes it. The next tag is `H`, which doesn't match `ZWJ` or `n`, so the `cn` rule completes and we move on to `complex_syllable_tail`.

The `complex_syllable_tail` rule starts with `(halant_group.cn)*`, which would initially match the `H`, but there is no `C` after it, so this rule matches no input. The next rule, `medial_group`, is `CM?`. As the current input is `H`, `CM` doesn't match, so this rule matches no input as well. `halant_or_matra_group` goes to `final_halant_group`, which goes to `halant_group`, which matches `H` and nothing else, consuming it. Finally, `syllable_tail` matches without consuming any further input. Therefore the range `0..2` is classified as a consonant syllable.

Then, the next item on the chopping block is
`CM`. Out of all the rules, this matches the `complex_syllable_tail` part of `broken_cluster`, along with the `H`. Therefore `2..4` is a broken cluster. Finally, the `X` at the end becomes a non-Indic character.

However, this fails the test. After wiring some telemetry into the current `rustybuzz` master, I've found that it classifies the range `0..4` as a consonant syllable and the range `4..5` as non-Indic. I'm not sure how it does this; it seems as if the `H` is somehow consumed by either the `cn` or the `halant_group.cn` before the `CM` is consumed by the `medial_group`. Unless `ragel`'s semantics are wildly different from what I understand, it is unclear to me how this would happen.

Is there anyone who knows `ragel` well enough to help my understanding of it here?

cc #74, @bluebear94
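
To make the reading of the grammar described above concrete, here is a minimal sketch of that greedy, left-to-right interpretation in plain Rust. It is not the nom code from this PR and not the `rustybuzz` implementation. It assumes the input tags are `C`, `H`, `CM`, `H`, `X` (inferred from the ranges discussed above), and every type and function name in it is hypothetical. The rules are heavily reduced: `n`, `(Repha|CS)?`, matras, and `syllable_tail` are left out, and `halant_group` is treated as a bare `H`.

```rust
// Hypothetical, simplified tag set; the real Indic categories are richer.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Tag { C, H, CM, ZWJ, X }

#[derive(Debug, PartialEq)]
enum Kind { ConsonantSyllable, BrokenCluster, NonIndic }

/// cn: a consonant followed by an optional ZWJ (the `n` part is omitted).
/// Returns the index just past the match, or None if there is no match.
fn cn(t: &[Tag], i: usize) -> Option<usize> {
    if t.get(i) != Some(&Tag::C) {
        return None;
    }
    let mut j = i + 1;
    if t.get(j) == Some(&Tag::ZWJ) {
        j += 1;
    }
    Some(j)
}

/// halant_group, reduced here to a bare H (an assumption of this sketch).
fn halant_group(t: &[Tag], i: usize) -> Option<usize> {
    if t.get(i) == Some(&Tag::H) { Some(i + 1) } else { None }
}

/// complex_syllable_tail:
/// (halant_group.cn)* medial_group halant_or_matra_group syllable_tail
/// Every part is optional, so this always succeeds and returns the new index.
fn complex_syllable_tail(t: &[Tag], mut i: usize) -> usize {
    // (halant_group.cn)*: a repetition only counts if both parts match.
    while let Some(j) = halant_group(t, i).and_then(|k| cn(t, k)) {
        i = j;
    }
    // medial_group = CM?
    if t.get(i) == Some(&Tag::CM) {
        i += 1;
    }
    // halant_or_matra_group, reduced to an optional halant_group.
    if let Some(j) = halant_group(t, i) {
        i = j;
    }
    // syllable_tail is treated as always empty in this sketch.
    i
}

/// consonant_syllable = (Repha|CS)? cn complex_syllable_tail
/// The (Repha|CS)? prefix is omitted: neither tag is in the assumed input.
fn consonant_syllable(t: &[Tag], i: usize) -> Option<usize> {
    cn(t, i).map(|j| complex_syllable_tail(t, j))
}

/// Only the complex_syllable_tail part of broken_cluster is modelled,
/// and it must consume at least one tag to count as a match.
fn broken_cluster(t: &[Tag], i: usize) -> Option<usize> {
    let j = complex_syllable_tail(t, i);
    if j > i { Some(j) } else { None }
}

/// Splits the input into ranges, trying the rules in a fixed order and
/// falling back to a single non-Indic tag when nothing matches.
fn segment(t: &[Tag]) -> Vec<(std::ops::Range<usize>, Kind)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < t.len() {
        let (end, kind) = if let Some(j) = consonant_syllable(t, i) {
            (j, Kind::ConsonantSyllable)
        } else if let Some(j) = broken_cluster(t, i) {
            (j, Kind::BrokenCluster)
        } else {
            (i + 1, Kind::NonIndic)
        };
        out.push((i..end, kind));
        i = end;
    }
    out
}

fn main() {
    let input = [Tag::C, Tag::H, Tag::CM, Tag::H, Tag::X];
    for (range, kind) in segment(&input) {
        println!("{:?} {:?}", range, kind);
    }
}
```

Under these assumptions the sketch yields `0..2` as a consonant syllable, `2..4` as a broken cluster, and `4..5` as non-Indic, which is the segmentation derived by hand above and not the `0..4` / `4..5` result observed on master.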