Lexer skeleton #5

Closed
wants to merge 7 commits

Conversation

@Nercury (Member) commented Nov 23, 2015

The idea here is to figure out how Lexer is going to look and work.

I placed all the modules into a single file, and when we finish the discussion I will move everything to appropriate files.

The // comments will be removed.

There are a few decisions I made, quite possibly biased, so let me explain my reasoning:

First, I think it is possible to separate the Lexer from the extensions completely and initialize it with a collected list of options. The list is not big: just operators, and the lexer does not even care whether they are binary or unary.

Second, I want to keep the iterator with the idea of "lexing on demand". That is, lexing only happens when someone (the parser) requests the next token. This is an internal detail, but good to keep in mind.

I want to try to keep only slices into the original string. What to do about newlines? Well, I suggest we think about that later :)
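The "lexing on demand" idea combined with borrowed slices could be sketched roughly like this (all names here are hypothetical stand-ins, not the PR's actual API): the iterator only scans the input when next() is called, and every token borrows a slice of the original string.

```rust
// Hypothetical sketch of "lexing on demand": tokens are produced lazily,
// and each token borrows a slice of the original template string.

#[derive(Debug, PartialEq)]
enum Token<'t> {
    Word(&'t str),
    Whitespace(&'t str),
}

struct Tokens<'t> {
    rest: &'t str,
}

impl<'t> Iterator for Tokens<'t> {
    type Item = Token<'t>;

    fn next(&mut self) -> Option<Token<'t>> {
        // Nothing is scanned until the caller asks for the next token.
        let first = self.rest.chars().next()?;
        let is_ws = first.is_whitespace();
        // Find where the current run of (non-)whitespace ends.
        let end = self
            .rest
            .find(|c: char| c.is_whitespace() != is_ws)
            .unwrap_or(self.rest.len());
        let (tok, rest) = self.rest.split_at(end);
        self.rest = rest;
        Some(if is_ws { Token::Whitespace(tok) } else { Token::Word(tok) })
    }
}

fn main() {
    let mut tokens = Tokens { rest: "hello world" };
    assert_eq!(tokens.next(), Some(Token::Word("hello")));
    assert_eq!(tokens.next(), Some(Token::Whitespace(" ")));
    assert_eq!(tokens.next(), Some(Token::Word("world")));
    assert_eq!(tokens.next(), None);
}
```

Because each Token borrows from the input, the lexer allocates nothing per token; the open question about newlines only affects how positions would later be computed from these slices.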

So, basically, this is a starting point. I won't merge until we are happy with the result.

#[test]
fn lexer_usage() {
    // build the lexer once for the project environment with extensions.
    let lexer = Lexer::new(LexerOptions::default(), vec![]);
Member

I agree with this reduced signature for the Lexer constructor. :-)

@colin-kiegel (Member)

I hope I didn't make too many comments. Awaiting your replies. ;-)

@Nercury (Member Author) commented Dec 6, 2015

Now that I am looking at your colin/pr_twig_skeleton as an example, it looks like we are still using a somewhat different structure.

Notable differences:

  • I decided not to hide the lexer inside the parser for now (I know, I keep changing my mind :) ).
  • I am naming the container of the lexer and parser the api. For now :/.
  • I would like the engine with its setup to be separate.
  • Any options the lexer needs should be passed by the engine using an Options object the lexer defines. The lexer will have no idea about the engine.
  • I don't like macros in error handling :/. I will try to use them though.

@Nercury (Member Author) commented Dec 6, 2015

Right now I have implemented the whole lexer as an iterator, but that does not prevent doing all the lexing on the first token request :).

Right now the lexer is using SyntaxError/SyntaxErrorCode, where SyntaxErrorCode contains just the message and SyntaxError additionally has the position in the template. Note that here I have used neither my At trait nor my ErrorCode trait.

I am thinking about sharing SyntaxError between the lexer and the parser, so I have left it in api/error (instead of api/lexer/error).
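A minimal sketch of the split described above, under assumed names: only the code/starts_at/ends_at field names come from this PR's diff; the Position type, the error-code variants, and the at() helper (echoing the At trait mentioned, which this code does not actually use) are illustrative.

```rust
// Sketch: SyntaxErrorCode carries only the "what went wrong" part,
// SyntaxError adds "where in the template" on top of it.

#[derive(Debug, Clone, Copy, PartialEq)]
struct Position {
    line: usize,
    column: usize,
}

#[derive(Debug)]
enum SyntaxErrorCode {
    // Hypothetical variants for illustration.
    UnexpectedCharacter { character: char },
    UnclosedBlock,
}

#[derive(Debug)]
struct SyntaxError {
    code: SyntaxErrorCode,
    starts_at: Position,
    ends_at: Option<Position>,
}

impl SyntaxErrorCode {
    /// Attach a template position to the bare error code.
    fn at(self, starts_at: Position) -> SyntaxError {
        SyntaxError { code: self, starts_at, ends_at: None }
    }
}

fn main() {
    let err = SyntaxErrorCode::UnexpectedCharacter { character: '%' }
        .at(Position { line: 3, column: 7 });
    assert_eq!(err.starts_at, Position { line: 3, column: 7 });
    assert!(err.ends_at.is_none());
}
```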

@Nercury (Member Author) commented Dec 6, 2015

And then I thought about it. If I am striving for separation of concerns, I should keep lexer errors in the lexer mod.

Now the lexer error can be one of two things: io::Error or twig::api::SyntaxError.
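The two-variant lexer error could look roughly like this (a sketch under assumed names; SyntaxError is reduced to a bare message here for brevity). The From impls let try!/? convert either source error automatically.

```rust
use std::io;

// Sketch: the lexer error is "one of two things", with From conversions
// so callers can use `try!`/`?` without manual wrapping.

#[derive(Debug)]
struct SyntaxError {
    message: String, // stand-in for the real SyntaxError
}

#[derive(Debug)]
enum LexingError {
    Io(io::Error),
    Syntax(SyntaxError),
}

impl From<io::Error> for LexingError {
    fn from(err: io::Error) -> LexingError {
        LexingError::Io(err)
    }
}

impl From<SyntaxError> for LexingError {
    fn from(err: SyntaxError) -> LexingError {
        LexingError::Syntax(err)
    }
}

fn main() {
    let err: LexingError = SyntaxError { message: "unexpected token".into() }.into();
    match err {
        LexingError::Syntax(e) => assert_eq!(e.message, "unexpected token"),
        LexingError::Io(_) => unreachable!(),
    }
}
```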

@colin-kiegel (Member)

Cool, I will have a look (later). :-)

impl<'i, 't> Iterator for Tokens<'i, 't> {
    type Item = LexingResult<ItemRef<'t>>;

    fn next(&mut self) -> Option<LexingResult<ItemRef<'t>>> {
Member

TL;DR at the end ^^

Too bad Rust does not have guaranteed tail-call elimination. I did some investigating into what we could do instead.

First, I analyzed the nesting of sub-calls in my code and in twigphp to look for any potentially infinite recursions.
The short answer is that there are definitely two possible recursion loops:

  • state::Data::tokenize (aka lex_data) calls itself (luckily in a tail position)
  • lex_expression calls itself (a) directly and possibly via (b) lex_expression -> lex_string -> lex_interpolation -> lex_expression

These recursions are separately closed, in the sense that they can be rewritten individually like:

'lex_data: loop {
    // do something
    if /* .. */ { continue 'lex_data; /* mimic tail-call */ }
    // ..
    if /* .. */ { break 'lex_data; /* or return */ }
    // ..
}

However there might be another recursion related to interpolation in double-quoted strings. This boils down to "can an expression/variable contain a string that contains a variable that contains another string that .."

Here are some details. I denote sub-calls by "->", where tail-/non-tail-positions are marked with "*" and "_". No sub-calls: "()".

Initial -> *Data
Data -> *Data + *Final + _lex_comment + _lex_verbatim_data + _Block + _Expression
Final -> ()
lex_comment -> ()
lex_verbatim_data -> ()
Block -> *lex_expression
Expression/Variable -> *lex_expression
lex_expression -> *String + *PARENT (e.g. Block, Expression/Variable or Interpolation -> continue 'PARENT )
String -> *Interpolation
Interpolation -> *lex_expression (-> we can only reach this inside another lex_expression, so we could just `continue 'lex_expression` in our loop version - if it wasn't for the call from lex_expression to *PARENT -> this loop might be the most difficult one)

The good news is: sub-calls in non-tail-positions "_" don't create recursions (they only occur in Data, so this is easy to analyse).

  • The lex_data-recursion is easy to rewrite (see above).
  • And I am very confident the lex_expression recursion can be rewritten as a flat loop, too - however we probably want to share the code between Block and Expression/Variable.

We can't break out of a caller's loop from inside a function - even a generic function is not flexible enough - but macros can do it. For readability I put the macros at the end - but for compilation they must be defined first.

'lex_block: loop {
    // ..
    lex_expression!("block successfully parsed", break 'lex_block, continue 'lex_block); // not in tail position - but no guarantee needed, because not recursive
    continue 'lex_data; // equivalent to a tail-call guarantee
    // ..
}

// shared code via macro - a bit nasty but it works(tm)
macro_rules! lex_expression {
    ( $message:expr, $break_parent:stmt, $continue_parent:stmt ) => {
        'lex_expression: loop {
            // do something
            println!($message);
            $break_parent;
            unreachable!(); // just to check the break_parent stmt really jumps somewhere else
            // ...
            // we could embed the lex_string logic completely in this macro, or delegate to a sub-macro
            lex_string!("some argument", break 'lex_expression, continue 'lex_expression)
        }
    };
}

// shared code via macro - a bit nasty but it works(tm)
macro_rules! lex_string {
    ( $message:expr, $break_parent:stmt, $continue_parent:stmt ) => {
        'lex_string: loop {
            // do something
            println!($message);
            $continue_parent;
            unreachable!(); // just to check the continue_parent stmt really jumps somewhere else
            // ...
        }
    };
}

I know this macro thing probably feels a bit weird. But if we want to factor away most function calls in the lexer, it seems possible. The only thing that needs a closer look is the possible expression-interpolation recursion (this is where the stack could still blow up if we can't reduce it to a flat loop). But at this point even I begin to question whether it is really worth it...

TL;DR

So finally I arrive at the point to admit "let's just proceed with some enum-based match-branching in a loop - which would include the iterator pattern you are suggesting". ;-)
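The enum-based match-branching in a loop could be sketched like this (the states and transitions are purely illustrative, not the real Twig lexer states): every "call" between lexing states becomes a value of a State enum matched inside a single flat loop, so the call stack cannot grow no matter how states chain into each other.

```rust
// Illustrative flat state machine: recursion between lexing states is
// replaced by matching on a State enum inside one loop.

#[derive(Debug, PartialEq)]
enum State {
    Data,
    Block,
    Expression,
    Done,
}

// Returns the trace of visited states, to show the transitions.
fn run() -> Vec<State> {
    let mut trace = Vec::new();
    let mut state = State::Data;
    loop {
        state = match state {
            // Each arm does its lexing work, then *returns* the next
            // state instead of calling it - no recursion, no stack growth.
            State::Data => { trace.push(State::Data); State::Block }
            State::Block => { trace.push(State::Block); State::Expression }
            State::Expression => { trace.push(State::Expression); State::Done }
            State::Done => break,
        };
    }
    trace
}

fn main() {
    assert_eq!(run(), vec![State::Data, State::Block, State::Expression]);
}
```

For the expression-interpolation recursion discussed above, the same shape still works if the machine keeps an explicit Vec<State> as a stack of "parents to return to", trading call-stack depth for heap growth.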

Member Author

OK, but we can definitely do it in a way that does not block future improvements.

@colin-kiegel (Member)

request to rename mod tokens -> mod token?
(I would prefer singular nouns as module names).

    pub code: SyntaxErrorCode,
    pub starts_at: Position,
    pub ends_at: Option<Position>,
}
Member

ok :-)

Member Author

I decided to raise the bar up to Rust level when it comes to error location display :)
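A hypothetical sketch of what rustc-style error location display could look like (the function and its output format are my own illustration, not this PR's code): print the offending template line with a caret under the error column.

```rust
// Sketch: render an error location rustc-style, with the source line
// quoted and a caret pointing at the column. Lines/columns are 1-based.

fn display_error(source: &str, line: usize, column: usize, message: &str) -> String {
    let offending = source.lines().nth(line - 1).unwrap_or("");
    format!(
        "error: {}\n --> template:{}:{}\n  |\n  | {}\n  | {}^\n",
        message,
        line,
        column,
        offending,
        " ".repeat(column - 1), // pad so the caret sits under the column
    )
}

fn main() {
    let rendered = display_error("{{ name }\n", 1, 9, "unclosed variable tag");
    assert!(rendered.contains("template:1:9"));
    assert!(rendered.contains("{{ name }"));
    println!("{}", rendered);
}
```

This is also where the "keep only slices" decision bites: producing line/column numbers from byte offsets needs a newline scan at error time, which is cheap since errors are rare.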

@Nercury (Member Author) commented Dec 12, 2015

request to rename mod tokens -> mod token?

Yeah, that's just a detail.

}
}

pub type LexingResult<T> = result::Result<T, LexingError>;
Member

With the current error management this should become result::Result<T, Traced<LexingError>>.
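A minimal sketch of what a Traced<E> wrapper might look like: only the Traced<LexingError> name comes from the comment above; the shape (wrapped error plus a list of locations collected as the error bubbles up) is an assumption.

```rust
// Hypothetical Traced<E>: carries the wrapped error plus a record of
// where it was raised and propagated.

#[derive(Debug)]
struct Traced<E> {
    error: E,
    trace: Vec<&'static str>, // real code might record file!()/line!() here
}

impl<E> Traced<E> {
    fn new(error: E, location: &'static str) -> Traced<E> {
        Traced { error, trace: vec![location] }
    }

    /// Append another location as the error bubbles up the call chain.
    fn trace(mut self, location: &'static str) -> Traced<E> {
        self.trace.push(location);
        self
    }
}

#[derive(Debug, PartialEq)]
struct LexingError;

fn main() {
    let err = Traced::new(LexingError, "lexer.rs:10").trace("api.rs:42");
    assert_eq!(err.error, LexingError);
    assert_eq!(err.trace, vec!["lexer.rs:10", "api.rs:42"]);
}
```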

}
}

pub fn expect(&mut self, expected: TokenRef<'t>) -> LexingResult<TokenRef<'t>> {
Member

In my code I tried to be a bit more abstract ~ like `expect<T: Pattern>`, where both `Token` and `TokenType` implement `Pattern`. This might help to match any number, without specifying the value.
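The more abstract expect could be sketched with a Pattern trait along these lines (all types and variants here are illustrative stand-ins): passing a TokenType matches any token of that kind, while passing a concrete Token matches only that exact value.

```rust
// Sketch: both Token and TokenType implement Pattern, so `expect` can
// take either "this exact token" or "any token of this type".

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Number(i64),
    Name(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenType {
    Number,
    Name,
}

trait Pattern {
    fn matches(&self, token: &Token) -> bool;
}

impl Pattern for Token {
    // A concrete token matches only itself.
    fn matches(&self, token: &Token) -> bool {
        self == token
    }
}

impl Pattern for TokenType {
    // A token type matches any token of that kind, ignoring the value.
    fn matches(&self, token: &Token) -> bool {
        match (*self, token) {
            (TokenType::Number, &Token::Number(_)) => true,
            (TokenType::Name, &Token::Name(_)) => true,
            _ => false,
        }
    }
}

fn expect<P: Pattern>(next: Token, pattern: P) -> Result<Token, String> {
    if pattern.matches(&next) {
        Ok(next)
    } else {
        Err(format!("unexpected token {:?}", next))
    }
}

fn main() {
    // Match any number, without specifying the value:
    assert!(expect(Token::Number(42), TokenType::Number).is_ok());
    // Or require one exact token:
    assert!(expect(Token::Number(42), Token::Number(7)).is_err());
}
```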

@colin-kiegel (Member)

I moved the generic error stuff to src/api/error/* (extension-dev perspective). I would like to try to separate the abstract error wrappers/helpers a bit. Can we relocate the syntax error somewhere else? I think it belongs to the lexer, i.e. src/api/lexer/error.

Currently I imagine something like src/error.rs to define a Traced<TwigError>, where

pub enum TwigError {
    LexingError(LexingError),
    // ...
}

Where the Display + Debug traits should only add a thin layer of information - and we can still discuss and optimize the order of appearance on the screen. A user calling functions on the facade should always receive a Traced<TwigError>. If they need to destructure this, they should start with TwigError.

We can also think about re-exporting all specific errors in src/error.rs - this would be a clear separation from abstract helpers in src/api/error.rs

@colin-kiegel (Member)

PS: Here is what I mean (from #12 src/engine/error.rs).

#[derive(Debug)]
pub enum TwigError {
    Lexer(LexerError),
    // ...
}

impl From<LexerError> for TwigError {
    fn from(err: LexerError) -> TwigError {
        TwigError::Lexer(err)
    }
}

impl Error for TwigError {
    fn description(&self) -> &str {
        match *self {
            TwigError::Lexer(..) => "Twig lexer error.",
            // ...
        }
    }
}

impl Display for TwigError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        try!(write!(f, "{}", self.description()));

        match *self {
            TwigError::Lexer(ref e) => e.fmt(f),
            // ...
        }
    }
}

@Nercury (Member Author) commented Dec 13, 2015

Ok, looks great, I will do that.

@Nercury (Member Author) commented Oct 6, 2016

Continued in #21.

@Nercury Nercury closed this Oct 6, 2016
@Nercury Nercury deleted the lexer-skeleton branch October 6, 2016 21:34