The current architecture works, but it is pretty hideous. The choices were made for various (not necessarily correct) reasons at the time this was written, and they now need to be replaced with better ones.
There is one lalrpop file which generates the "initial" parser, whose job is to take the raw text input, make sure the syntax is correct, and generate a list of label maps, data, and code instructions (as strings).
Next is the interpreter, which again takes the instructions as strings, parses them, and runs them.
Finally, there is a print parser, dedicated to parsing and running print instructions.
This is horrible for many reasons:
Three parsers means it takes a lot of time to generate each one from its lalrpop file, which is pretty irritating during development.
The print parser does not need to exist; its syntax is simple enough to lex and parse by hand.
The initial parser uses the default lexer from lalrpop, which:
does not report \n, so we have to go through the input before anything else to build the newline mapping
produces conflicts when two regexp-defined tokens overlap, e.g. see issue blind student #5 (comment). There, the regexp for db text gobbles up everything up to the last quote it can find, so it includes the next 'db' and its text in the string. If we try to fix that regexp, it collides with the string regexp. We cannot stop the db string at EOL, as lalrpop does not give access to \n
We really don't need to parse the instructions from text again for interpreting. We can use an enum to represent instructions, store the related params in it, and match on it, which would be a much better scheme overall.
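To make the enum idea concrete, here is a minimal sketch of what that could look like. The names `Reg`, `Instruction`, `Cpu`, `step` and `run` are all made up for illustration and are not taken from the current code; only two registers and three instructions are modelled, and flags are ignored.

```rust
/// Hypothetical register set; the real project may model registers differently.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Reg {
    Ax,
    Bx,
}

/// Hypothetical instruction enum: each variant carries its own operands,
/// so the interpreter never has to parse text again.
#[derive(Debug, Clone, PartialEq)]
enum Instruction {
    MovImm(Reg, u16),
    AddReg(Reg, Reg),
    Hlt,
}

/// Hypothetical machine state (no flags, no memory).
#[derive(Debug, Default, PartialEq)]
struct Cpu {
    ax: u16,
    bx: u16,
    halted: bool,
}

impl Cpu {
    fn get(&self, r: Reg) -> u16 {
        match r {
            Reg::Ax => self.ax,
            Reg::Bx => self.bx,
        }
    }

    fn set(&mut self, r: Reg, v: u16) {
        match r {
            Reg::Ax => self.ax = v,
            Reg::Bx => self.bx = v,
        }
    }

    /// Executes one instruction by matching on the enum variant.
    fn step(&mut self, ins: &Instruction) {
        match ins {
            Instruction::MovImm(dst, imm) => self.set(*dst, *imm),
            Instruction::AddReg(dst, src) => {
                let sum = self.get(*dst).wrapping_add(self.get(*src));
                self.set(*dst, sum);
            }
            Instruction::Hlt => self.halted = true,
        }
    }
}

/// Runs instructions in order until the program ends or a Hlt is executed.
fn run(program: &[Instruction]) -> Cpu {
    let mut cpu = Cpu::default();
    for ins in program {
        if cpu.halted {
            break;
        }
        cpu.step(ins);
    }
    cpu
}
```

With something like this, the initial parser would produce a `Vec<Instruction>` and the interpreter would be one `match`, with the compiler checking that every variant is handled.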
Currently the two strategies are:
Make a custom lexer which will be used with lalrpop as the parser to do the initial parsing. The custom lexer will take care of newline mapping, as well as handling upper- and lowercase letters.
Make a custom lexer + (recursive descent?) parser, and remove the lalrpop dependency completely.
Even though the second option is desirable, it is equally tricky, so shifting to a custom lexer first, and then to a custom parser separately, might be a better way.
Either way, we should make the initial parser generate an enum instead of text, remove the "interpreter parser", and remove the print parser as well.
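As a rough sketch of the custom lexer (under assumptions, not a proposal for the final token set): it tracks line numbers itself, upper-cases words so that `db` and `DB` lex the same, and ends a string at the first closing quote, reporting an error if the line ends first. `Token`, `LexError` and `lex` are hypothetical names, and double-quoted strings without escapes are assumed.

```rust
/// Hypothetical token set, far smaller than what the real asm needs.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String), // mnemonics, registers, labels; stored upper-cased
    Str(String),  // contents of a quoted string, without the quotes
    Comma,
    Newline,
}

#[derive(Debug, PartialEq)]
struct LexError {
    line: usize,
    msg: &'static str,
}

/// Splits `src` into (line number, token) pairs.
fn lex(src: &str) -> Result<Vec<(usize, Token)>, LexError> {
    let mut out = Vec::new();
    let mut line = 1;
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            '\n' => {
                chars.next();
                out.push((line, Token::Newline));
                line += 1;
            }
            ',' => {
                chars.next();
                out.push((line, Token::Comma));
            }
            '"' => {
                chars.next(); // consume the opening quote
                let mut s = String::new();
                loop {
                    match chars.next() {
                        // The first closing quote ends the string.
                        Some('"') => break,
                        // A string may not run past the end of its line.
                        Some('\n') | None => {
                            return Err(LexError { line, msg: "unterminated string" })
                        }
                        Some(ch) => s.push(ch),
                    }
                }
                out.push((line, Token::Str(s)));
            }
            c if c.is_whitespace() => {
                chars.next();
            }
            _ => {
                let mut w = String::new();
                while let Some(&ch) = chars.peek() {
                    if ch.is_whitespace() || ch == ',' || ch == '"' {
                        break;
                    }
                    w.push(ch.to_ascii_uppercase());
                    chars.next();
                }
                out.push((line, Token::Word(w)));
            }
        }
    }
    Ok(out)
}
```

Because the lexer sees `\n` itself, two consecutive `db "..."` lines come out as two separate strings, and the separate newline-mapping pass is no longer needed.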
Tracking:
Replace the print parser with a custom lexer + parser
Add a lexer for "normal" asm, i.e. the main lexer (possibly integrate with the print parser somehow?)
Integrate this lexer's tokens into lalrpop using the custom token support which lalrpop provides, so at least the issue mentioned above can be mitigated in the short term
Define an enum for asm opcodes, so the original / lalrpop parser can (eventually) emit this instead of text
Port the "initial" parser from emitting text instructions to emitting the enum defined above; simultaneously port the "interpreter" from lalrpop to a giant match statement on this enum's values
Add a custom (recursive descent) parser for the "initial" parser, so that the lalrpop dependency will be completely removed. This is still up for discussion; we need to see if it will actually provide any benefit, otherwise with custom tokens the lalrpop parser file will be much simpler anyway.
Just noticed that the 8086 manual also includes hex codes for instructions; if we can use them, we can actually store instructions in memory and remove that barrier.
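A tiny sketch of what storing instructions as bytes could look like, for just two instructions. In the 8086 encoding, `MOV r16, imm16` is `B8 + reg` followed by the immediate in little-endian order, and `HLT` is `F4`. The `Reg16`, `Encodable` and `encode` names are hypothetical, and everything that needs a ModR/M byte is left out.

```rust
/// The first four 16-bit registers with their 8086 register numbers.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Reg16 {
    Ax = 0,
    Cx = 1,
    Dx = 2,
    Bx = 3,
}

/// Hypothetical subset of instructions that need no ModR/M byte.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Encodable {
    MovImm16(Reg16, u16),
    Hlt,
}

/// Appends the machine-code bytes for `ins` to `mem`.
fn encode(ins: Encodable, mem: &mut Vec<u8>) {
    match ins {
        Encodable::MovImm16(reg, imm) => {
            // Opcode B8+reg, then the 16-bit immediate, low byte first.
            mem.push(0xB8 + reg as u8);
            mem.extend_from_slice(&imm.to_le_bytes());
        }
        Encodable::Hlt => mem.push(0xF4),
    }
}
```

For example, encoding `MovImm16(Reg16::Bx, 0x1234)` appends the bytes `BB 34 12`.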
I have a basic lexer written for another project, which I can share in a gist, and you can adapt it for this. A good idea might be to start with the print interpreter and convert it from lalrpop to hand-written code. It's pretty small and pretty easy. Then we can think about converting the rest.
YJDoc2 changed the title from "Overhall the lexer parse and interpreter architecture to a better scheme" to "Overhaul the lexer parse and interpreter architecture to a better scheme" on Sep 21, 2022
I think we can implement strat 1 first and then we can think of removing lalrpop completely. Starting with the print parser sounds good to me, will help me understand this project piece by piece better.