Lexer skeleton #5

Closed
wants to merge 7 commits

Conversation

@Nercury (Member) commented Nov 23, 2015

The idea here is to figure out how Lexer is going to look and work.

I placed all the modules into a single file, and when we finish the discussion I will move everything to appropriate files.

The // comments will be removed.

There are a few decisions I made, quite possibly biased, so let me explain my reasoning:

First, I think it is possible to separate the Lexer from the extensions completely and initialize it with a collected list of options. The list is not big: just operators, and the lexer does not even care whether they are binary or unary.

Second, I want to keep the iterator with the idea of "lexing on demand". That is, lexing only happens when someone (the parser) requests the next token. This is an internal detail, but good to keep in mind.

I want to try to keep only slices into the original string. What to do about newlines? Well, I suggest we think about that later :)
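The "lexing on demand" idea combined with borrowed slices could be sketched roughly like this (all names here are hypothetical stand-ins, not the PR's actual API): the iterator only scans the input when next() is called, and every token borrows a slice of the original string.

```rust
// Hypothetical sketch of "lexing on demand": tokens are produced lazily,
// and each token borrows a slice of the original template string.

#[derive(Debug, PartialEq)]
enum Token<'t> {
    Word(&'t str),
    Whitespace(&'t str),
}

struct Tokens<'t> {
    rest: &'t str,
}

impl<'t> Iterator for Tokens<'t> {
    type Item = Token<'t>;

    fn next(&mut self) -> Option<Token<'t>> {
        // Nothing is scanned until the caller asks for the next token.
        let first = self.rest.chars().next()?;
        let is_ws = first.is_whitespace();
        // Find where the current run of (non-)whitespace ends.
        let end = self
            .rest
            .find(|c: char| c.is_whitespace() != is_ws)
            .unwrap_or(self.rest.len());
        let (tok, rest) = self.rest.split_at(end);
        self.rest = rest;
        Some(if is_ws { Token::Whitespace(tok) } else { Token::Word(tok) })
    }
}

fn main() {
    let mut tokens = Tokens { rest: "hello world" };
    assert_eq!(tokens.next(), Some(Token::Word("hello")));
    assert_eq!(tokens.next(), Some(Token::Whitespace(" ")));
    assert_eq!(tokens.next(), Some(Token::Word("world")));
    assert_eq!(tokens.next(), None);
}
```

Because each Token borrows from the input, the lexer allocates nothing per token; the open question about newlines only affects how positions would later be computed from these slices.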

So, basically, this is a starting point. I won't merge until we are happy with the result.

#[test]
fn lexer_usage() {
    // build the lexer once for the project environment with extensions.
    let lexer = Lexer::new(LexerOptions::default(), vec![]);
Member

I agree with this reduced signature for the Lexer constructor. :-)

@colin-kiegel (Member)

I hope I didn't make too many comments. Awaiting your replies. ;-)

@Nercury (Member Author) commented Dec 6, 2015

Now that I am looking at your colin/pr_twig_skeleton as an example, it looks like we are still using a somewhat different structure.

Notable differences:

  • I decided not to hide the lexer inside the parser for now (I know, I keep changing my mind :) ).
  • I am naming the container of the lexer and parser the api. For now :/.
  • I would like the engine with its setup to be separate.
  • Any options the lexer needs should be passed by the engine using an Options object the lexer defines. The lexer will have no idea about the engine.
  • I don't like macros in error handling :/. I will try to use them though.

@Nercury (Member Author) commented Dec 6, 2015

Right now I have implemented the whole lexer as an iterator, but that does not prevent doing all the lexing on the first token request :).

Right now the lexer is using SyntaxError/SyntaxErrorCode, where SyntaxErrorCode contains just the message and SyntaxError additionally has the position in the template. Note that here I have used neither my At trait nor my ErrorCode trait.

I am thinking about sharing SyntaxError between the lexer and the parser, so I have left it in api/error (instead of api/lexer/error).
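A minimal sketch of the split described above, under assumed names: only the code/starts_at/ends_at field names come from this PR's diff; the Position type, the error-code variants, and the at() helper (echoing the At trait mentioned, which this code does not actually use) are illustrative.

```rust
// Sketch: SyntaxErrorCode carries only the "what went wrong" part,
// SyntaxError adds "where in the template" on top of it.

#[derive(Debug, Clone, Copy, PartialEq)]
struct Position {
    line: usize,
    column: usize,
}

#[derive(Debug)]
enum SyntaxErrorCode {
    // Hypothetical variants for illustration.
    UnexpectedCharacter { character: char },
    UnclosedBlock,
}

#[derive(Debug)]
struct SyntaxError {
    code: SyntaxErrorCode,
    starts_at: Position,
    ends_at: Option<Position>,
}

impl SyntaxErrorCode {
    /// Attach a template position to the bare error code.
    fn at(self, starts_at: Position) -> SyntaxError {
        SyntaxError { code: self, starts_at, ends_at: None }
    }
}

fn main() {
    let err = SyntaxErrorCode::UnexpectedCharacter { character: '%' }
        .at(Position { line: 3, column: 7 });
    assert_eq!(err.starts_at, Position { line: 3, column: 7 });
    assert!(err.ends_at.is_none());
}
```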

@Nercury (Member Author) commented Dec 6, 2015

And then I thought about it. If I am striving for separation of concerns, I should keep lexer errors in the lexer mod.

Now the lexer error can be one of two things: io::Error or twig::api::SyntaxError.
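The two-variant lexer error could look roughly like this (a sketch under assumed names; SyntaxError is reduced to a bare message here for brevity). The From impls let try!/? convert either source error automatically.

```rust
use std::io;

// Sketch: the lexer error is "one of two things", with From conversions
// so callers can use `try!`/`?` without manual wrapping.

#[derive(Debug)]
struct SyntaxError {
    message: String, // stand-in for the real SyntaxError
}

#[derive(Debug)]
enum LexingError {
    Io(io::Error),
    Syntax(SyntaxError),
}

impl From<io::Error> for LexingError {
    fn from(err: io::Error) -> LexingError {
        LexingError::Io(err)
    }
}

impl From<SyntaxError> for LexingError {
    fn from(err: SyntaxError) -> LexingError {
        LexingError::Syntax(err)
    }
}

fn main() {
    let err: LexingError = SyntaxError { message: "unexpected token".into() }.into();
    match err {
        LexingError::Syntax(e) => assert_eq!(e.message, "unexpected token"),
        LexingError::Io(_) => unreachable!(),
    }
}
```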

@colin-kiegel (Member)

Cool, I will have a look (later). :-)

impl<'i, 't> Iterator for Tokens<'i, 't> {
    type Item = LexingResult<ItemRef<'t>>;

    fn next(&mut self) -> Option<LexingResult<ItemRef<'t>>> {
Member

TL;DR at the end ^^

Too bad Rust does not have guaranteed tail-call elimination. I did some investigating into what we could do instead.

First, I analyzed the nesting of sub-calls in my code and in twigphp to look for any potentially infinite recursions.
The short answer is that there are definitely two possible recursion loops:

  • state::Data::tokenize (aka lex_data) calls itself (luckily in a tail position)
  • lex_expression calls itself (a) directly and possibly via (b) lex_expression -> lex_string -> lex_interpolation -> lex_expression

These recursions are separately closed, in the sense that they can be rewritten individually like:

'lex_data: loop {
    // do something
    if /* .. */ { continue 'lex_data; /* mimic tail-call */ }
    // ..
    if /* .. */ { break 'lex_data; /* or return */ }
    // ..
}

However there might be another recursion related to interpolation in double-quoted strings. This boils down to "can an expression/variable contain a string that contains a variable that contains another string that .."

Here are some details. I denote sub-calls by "->", where tail-/non-tail-positions are marked with "*" and "_". No sub-calls: "()".

Initial -> *Data
Data -> *Data + *Final + _lex_comment + _lex_verbatim_data + _Block + _Expression
Final -> ()
lex_comment -> ()
lex_verbatim_data -> ()
Block -> *lex_expression
Expression/Variable -> *lex_expression
lex_expression -> *String + *PARENT (e.g. Block, Expression/Variable or Interpolation -> continue 'PARENT )
String -> *Interpolation
Interpolation -> *lex_expression (-> we can only reach this inside another lex_expression, so we could just `continue 'lex_expression` in our loop version - if it wasn't for the call from lex_expression to *PARENT -> this loop might be the most difficult one)

The good news is: sub-calls in non-tail-positions "_" don't create recursions (they only occur in Data, so this is easy to analyse).

  • The lex_data-recursion is easy to rewrite (see above).
  • And I am very confident the lex_expression recursion can be rewritten as a flat loop, too - however we probably want to share the code between Block and Expression/Variable.

We can't break out of a caller's loop from inside a function - even a generic function is not flexible enough - but macros can do it. For readability I put the macros at the end - but for compilation they must be defined first.

'lex_block: loop {
    // ..
    lex_expression!("block successfully parsed", break 'lex_block, continue 'lex_block); // not in tail position - but no guarantee needed, because not recursive
    continue 'lex_data; // equivalent to a tail-call guarantee
    // ..
}

// shared code via macro - a bit nasty but it works(tm)
macro_rules! lex_expression {
    ( $message:expr, $break_parent:stmt, $continue_parent:stmt ) => {
        'lex_expression: loop {
            // do something
            println!($message);
            $break_parent;
            unreachable!(); // just to check the break_parent stmt really jumps somewhere else
            // ...
            // we could embed the lex_string logic completely in this macro, or delegate to a sub-macro
            lex_string!("some argument", break 'lex_expression, continue 'lex_expression)
        }
    };
}

// shared code via macro - a bit nasty but it works(tm)
macro_rules! lex_string {
    ( $message:expr, $break_parent:stmt, $continue_parent:stmt ) => {
        'lex_string: loop {
            // do something
            println!($message);
            $continue_parent;
            unreachable!(); // just to check the continue_parent stmt really jumps somewhere else
            // ...
        }
    };
}

I know this macro thing probably feels a bit weird. But if we want to factor away most function calls in the lexer, it seems possible. The only thing that needs a closer look is the possible expression-interpolation recursion (this is where the stack could still blow up if we can't reduce it to a flat loop). But at this point even I begin to question whether it is really worth it...

TL;DR

So finally I arrive at the point to admit "let's just proceed with some enum-based match-branching in a loop - which would include the iterator pattern you are suggesting". ;-)
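The enum-based match-branching in a loop could be sketched like this (the states and transitions are purely illustrative, not the real Twig lexer states): every "call" between lexing states becomes a value of a State enum matched inside a single flat loop, so the call stack cannot grow no matter how states chain into each other.

```rust
// Illustrative flat state machine: recursion between lexing states is
// replaced by matching on a State enum inside one loop.

#[derive(Debug, PartialEq)]
enum State {
    Data,
    Block,
    Expression,
    Done,
}

// Returns the trace of visited states, to show the transitions.
fn run() -> Vec<State> {
    let mut trace = Vec::new();
    let mut state = State::Data;
    loop {
        state = match state {
            // Each arm does its lexing work, then *returns* the next
            // state instead of calling it - no recursion, no stack growth.
            State::Data => { trace.push(State::Data); State::Block }
            State::Block => { trace.push(State::Block); State::Expression }
            State::Expression => { trace.push(State::Expression); State::Done }
            State::Done => break,
        };
    }
    trace
}

fn main() {
    assert_eq!(run(), vec![State::Data, State::Block, State::Expression]);
}
```

For the expression-interpolation recursion discussed above, the same shape still works if the machine keeps an explicit Vec<State> as a stack of "parents to return to", trading call-stack depth for heap growth.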

Member Author

OK, but we can definitely do it in a way that does not block future improvements.

@colin-kiegel (Member)

request to rename mod tokens -> mod token?
(I would prefer singular nouns as module names).

    pub code: SyntaxErrorCode,
    pub starts_at: Position,
    pub ends_at: Option<Position>,
}
Member

ok :-)

Member Author

I decided to raise the bar up to Rust level when it comes to error location display :)
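A hypothetical sketch of what rustc-style error location display could look like (the function and its output format are my own illustration, not this PR's code): print the offending template line with a caret under the error column.

```rust
// Sketch: render an error location rustc-style, with the source line
// quoted and a caret pointing at the column. Lines/columns are 1-based.

fn display_error(source: &str, line: usize, column: usize, message: &str) -> String {
    let offending = source.lines().nth(line - 1).unwrap_or("");
    format!(
        "error: {}\n --> template:{}:{}\n  |\n  | {}\n  | {}^\n",
        message,
        line,
        column,
        offending,
        " ".repeat(column - 1), // pad so the caret sits under the column
    )
}

fn main() {
    let rendered = display_error("{{ name }\n", 1, 9, "unclosed variable tag");
    assert!(rendered.contains("template:1:9"));
    assert!(rendered.contains("{{ name }"));
    println!("{}", rendered);
}
```

This is also where the "keep only slices" decision bites: producing line/column numbers from byte offsets needs a newline scan at error time, which is cheap since errors are rare.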

@Nercury (Member Author) commented Dec 12, 2015

request to rename mod tokens -> mod token?

Yeah, that's just a detail.

}
}

pub type LexingResult<T> = result::Result<T, LexingError>;
Member

With the current error management this should become result::Result<T, Traced<LexingError>>.
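A minimal sketch of what a Traced<E> wrapper might look like: only the Traced<LexingError> name comes from the comment above; the shape (wrapped error plus a list of locations collected as the error bubbles up) is an assumption.

```rust
// Hypothetical Traced<E>: carries the wrapped error plus a record of
// where it was raised and propagated.

#[derive(Debug)]
struct Traced<E> {
    error: E,
    trace: Vec<&'static str>, // real code might record file!()/line!() here
}

impl<E> Traced<E> {
    fn new(error: E, location: &'static str) -> Traced<E> {
        Traced { error, trace: vec![location] }
    }

    /// Append another location as the error bubbles up the call chain.
    fn trace(mut self, location: &'static str) -> Traced<E> {
        self.trace.push(location);
        self
    }
}

#[derive(Debug, PartialEq)]
struct LexingError;

fn main() {
    let err = Traced::new(LexingError, "lexer.rs:10").trace("api.rs:42");
    assert_eq!(err.error, LexingError);
    assert_eq!(err.trace, vec!["lexer.rs:10", "api.rs:42"]);
}
```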

}
}

pub fn expect(&mut self, expected: TokenRef<'t>) -> LexingResult<TokenRef<'t>> {
Member

In my code I tried to be a bit more abstract ~ like `expect<T: Pattern>`, where both `Token` and `TokenType` implement `Pattern`. This might help to match any number, without specifying the value.
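The more abstract expect could be sketched with a Pattern trait along these lines (all types and variants here are illustrative stand-ins): passing a TokenType matches any token of that kind, while passing a concrete Token matches only that exact value.

```rust
// Sketch: both Token and TokenType implement Pattern, so `expect` can
// take either "this exact token" or "any token of this type".

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Number(i64),
    Name(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenType {
    Number,
    Name,
}

trait Pattern {
    fn matches(&self, token: &Token) -> bool;
}

impl Pattern for Token {
    // A concrete token matches only itself.
    fn matches(&self, token: &Token) -> bool {
        self == token
    }
}

impl Pattern for TokenType {
    // A token type matches any token of that kind, ignoring the value.
    fn matches(&self, token: &Token) -> bool {
        match (*self, token) {
            (TokenType::Number, &Token::Number(_)) => true,
            (TokenType::Name, &Token::Name(_)) => true,
            _ => false,
        }
    }
}

fn expect<P: Pattern>(next: Token, pattern: P) -> Result<Token, String> {
    if pattern.matches(&next) {
        Ok(next)
    } else {
        Err(format!("unexpected token {:?}", next))
    }
}

fn main() {
    // Match any number, without specifying the value:
    assert!(expect(Token::Number(42), TokenType::Number).is_ok());
    // Or require one exact token:
    assert!(expect(Token::Number(42), Token::Number(7)).is_err());
}
```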

@colin-kiegel (Member)

I moved the generic error stuff to src/api/error/* (extension-dev perspective). I would like to try to separate the abstract error wrappers/helpers a bit. Can we relocate the syntax error somewhere else? I think it belongs to the lexer, i.e. src/api/lexer/error.

Currently I imagine something like src/error.rs to define a Traced<TwigError>, where

pub enum TwigError {
    LexingError(LexingError),
    // ...
}

Where the Display + Debug traits should only add a thin layer of information - and we can still discuss and optimize the order of appearance on the screen. A user calling functions on the facade should always receive a Traced<TwigError>. If they need to destructure this, they should start with TwigError.

We can also think about re-exporting all specific errors in src/error.rs - this would be a clear separation from abstract helpers in src/api/error.rs

@colin-kiegel (Member)

PS: Here is what I mean (from #12 src/engine/error.rs).

#[derive(Debug)]
pub enum TwigError {
    Lexer(LexerError),
    // ...
}

impl From<LexerError> for TwigError {
    fn from(err: LexerError) -> TwigError {
        TwigError::Lexer(err)
    }
}

impl Error for TwigError {
    fn description(&self) -> &str {
        match *self {
            TwigError::Lexer(..) => "Twig lexer error.",
            // ...
        }
    }
}

impl Display for TwigError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        try!(write!(f, "{}", self.description()));

        match *self {
            TwigError::Lexer(ref e) => e.fmt(f),
            // ...
        }
    }
}

@Nercury (Member Author) commented Dec 13, 2015

Ok, looks great, I will do that.

@Nercury (Member Author) commented Oct 6, 2016

Continued in #21.

@Nercury Nercury closed this Oct 6, 2016
@Nercury Nercury deleted the lexer-skeleton branch October 6, 2016 21:34