Lexer skeleton #5
Conversation
#[test]
fn lexer_usage() {
    // build the lexer once for project environment with extensions.
    let lexer = Lexer::new(LexerOptions::default(), vec![]);
I agree with this reduced signature for the Lexer constructor. :-)
I hope I didn't make too many comments. Awaiting your replies. ;-)
Force-pushed from 14daf30 to 82dda1b
Now I am looking at your […]. Notable differences:
Right now I have implemented the whole lexer as an iterator, but that does not prevent doing all the lexing on the first token request :). Right now the lexer is using […]. I am thinking about sharing […].
And then I thought about it. If I am striving for separation of concerns, I should keep lexer errors in the lexer mod. Now a lexer error can be one of two things: […]
Cool, I will have a look (later). :-)
impl<'i, 't> Iterator for Tokens<'i, 't> {
    type Item = LexingResult<ItemRef<'t>>;

    fn next(&mut self) -> Option<LexingResult<ItemRef<'t>>> {
TL;DR at the end ^^
Too bad Rust does not have guaranteed tail-call elimination. I did some investigation into what we could do instead.
First, I analyzed the nesting of sub-calls in my code and twigphp to look for any possibly infinite recursions.
The short answer is, there are definitely two possible recursion loops:
- state::Data::tokenize (aka lex_data) calls itself (luckily in a tail position)
- lex_expression calls itself (a) directly and possibly via (b) lex_expression -> lex_string -> lex_interpolation -> lex_expression
These recursions are separately closed in a sense they can be rewritten individually like
'lex_data: loop {
    // do something
    if condition { continue 'lex_data } // mimic tail-call
    // ..
    if condition { break 'lex_data } // or return
    // ..
}
However there might be another recursion related to interpolation in double-quoted strings. This boils down to "can an expression/variable contain a string that contains a variable that contains another string that .."
Here are some details. I denote subcalls by "->", where tail-/non-tail-positions are marked with "*" and "_". No sub-calls: "()".
Initial -> *Data
Data -> *Data + *Final + _lex_comment + _lex_verbatim_data + _Block + _Expression
Final -> ()
lex_comment -> ()
lex_verbatim_data -> ()
Block -> *lex_expression
Expression/Variable -> *lex_expression
lex_expression -> *String + *PARENT (e.g. Block, Expression/Variable or Interpolation -> continue 'PARENT )
String -> *Interpolation
Interpolation -> *lex_expression (-> we can only reach this inside another lex_expression, so we could just `continue 'lex_expression` in our loop version - if it wasn't for the call from lex_expression to *PARENT -> this loop might be the most difficult one)
The good news is: sub-calls in non-tail-positions "_" don't create recursions (they only occur in Data, so this is easy to analyse).
- The lex_data-recursion is easy to rewrite (see above).
- And I am very confident the lex_expression recursion can be rewritten as a flat loop, too - however we probably want to share the code between Block and Expression/Variable.
We can't break a loop of the caller from inside a function - even a generic function is not flexible enough - but macros can. For readability I put the macros to the end - but for compilation they must be defined first.
'lex_block: loop {
    // ..
    // not in tail position - but no guarantee needed, because not recursive
    lex_expression!("block successfully parsed", break 'lex_block, continue 'lex_block);
    continue 'lex_data // equivalent to a tail-call guarantee
    // ..
}
// shared code via macro - a bit nasty but it works(tm)
macro_rules! lex_expression {
    ( $message:expr, $break_parent:stmt, $continue_parent:stmt ) => {
        'lex_expression: loop {
            // do something
            println!($message);
            $break_parent;
            unreachable!(); // just to check the break_parent stmt really jumps somewhere else
            // ...
            // we could embed the lex_string logic completely in this macro, or delegate to a sub-macro
            lex_string!("some argument", break 'lex_expression, continue 'lex_expression)
            //
        }
    };
}
// shared code via macro - a bit nasty but it works(tm)
macro_rules! lex_string {
    ( $message:expr, $break_parent:stmt, $continue_parent:stmt ) => {
        'lex_string: loop {
            // do something
            println!($message);
            $continue_parent;
            unreachable!(); // just to check the continue_parent stmt really jumps somewhere else
            // ...
        }
    };
}
I know this macro-thing probably feels a bit weird. But if we want to factor away most function calls in the lexer, it seems possible. The only thing that needs a closer look is the possible expression-interpolation recursion (this is where the stack could still blow up, if we can't reduce it to a flat loop). But at this point even I begin to question whether it is really worth it...
TL;DR
So finally I arrive at the point to admit "let's just proceed with some enum-based match-branching in a loop - which would include the iterator pattern you are suggesting". ;-)
OK, but we can definitely do it in a way that does not block future improvements.
request to rename […]
    pub code: SyntaxErrorCode,
    pub starts_at: Position,
    pub ends_at: Option<Position>,
}
ok :-)
I decided to raise the bar up to Rust's level when it comes to error location display :)
Yeah, that's just a detail.
    }
}

pub type LexingResult<T> = result::Result<T, LexingError>;
With the current error management this should become result::Result<T, Traced<LexingError>>.
    }
}

pub fn expect(&mut self, expected: TokenRef<'t>) -> LexingResult<TokenRef<'t>> {
In my code I tried to be a bit more abstract, like expect<T: Pattern> where both Token and TokenType implement Pattern. This might help to match any number, without specifying the value.
I moved the generic error stuff to […]. Currently I imagine something like […].
The Display + Debug traits should only add a thin layer of information - and we can still discuss and optimize the order of appearance on the screen. A user calling functions on the […]. We can also think about re-exporting all specific errors in […].
PS: Here is what I mean (from #12).
Ok, looks great, I will do that.
Continued in #21.
The idea here is to figure out how the Lexer is going to look and work.
I placed all the modules into a single file; when we finish the discussion I will move everything to appropriate files. The // comments will be removed.
There are a few decisions I made, quite possibly biased, so I will let you know my ideas:
First, I think it is possible to separate the Lexer from extensions completely and initialize it with a collected list of options. The list is not big: just operators, and the lexer does not even care whether they are binary or unary.
Second, I want to keep the iterator with the idea of "lexing on demand". That is, lexing only happens when someone (the parser) requests the next token. This is an internal detail, but good to keep in mind.
Third, I want to try to keep only slices into the original string. What to do about new lines? Well, I suggest we think about that later :)
So, basically, this is a starting point. I won't merge until we are happy with the result.