Lexer
The first stage of the compiler is the lexer. It splits the source text into tokens and returns the list of identified tokens for further use, usually by a parser. Our lexer is built with SLY and contains rules that recognize the tokens of the PCL language: keywords, operators, identifier names, constants, ignored characters, and comments.
The keywords are defined in a dictionary called keywords as follows
keywords = { r'if' : 'IF', r'while' : 'WHILE', ... }
so that they can be separated from the rest of the identifiers (or names). Each rule is defined as a Python regular expression, and certain tokens get special handling through rule functions. If a rule's name starts with ignore_, SLY matches the pattern and discards it. The ignore attribute is a string whose individual characters are skipped. A rule for a certain pattern is defined as
def TOKEN_NAME(self, t):
    # do something
    return t
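As a rough illustration of why the keyword table helps, the sketch below matches an identifier-shaped word first and only then decides whether it is a keyword. It uses Python's re module as a stand-in rather than SLY, and the identifier pattern, the four-entry table, and the name classify_word are assumptions made for the example, not the actual PCL definitions.

```python
import re

# Hypothetical subset of the keyword table; the real dictionary is longer.
KEYWORDS = {'if': 'IF', 'while': 'WHILE', 'then': 'THEN', 'do': 'DO'}

# Assumed identifier shape: a letter followed by letters, digits or underscores.
NAME_RE = re.compile(r'[A-Za-z][A-Za-z0-9_]*')

def classify_word(text, pos=0):
    """Match an identifier-shaped word at pos and return (type, value), or None."""
    m = NAME_RE.match(text, pos)
    if m is None:
        return None
    word = m.group()
    # Keywords share the identifier pattern, so reclassify after matching.
    return KEYWORDS.get(word, 'NAME'), word

print(classify_word('while x > 1 do'))  # ('WHILE', 'while')
print(classify_word('whilex := 1'))     # ('NAME', 'whilex')
```

Looking the word up after the match means a name such as whilex is never split into a keyword plus a leftover identifier.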
If a token is invalid, PCLLexerError is raised. The sly.Lexer class tries the rules in the order in which they are defined and does not greedily pick the longest match across rules. For instance, 3.14 is hard to tokenize correctly when the integer rule comes first: the lexer matches 3 as an integer constant and returns it, unlike traditional lexer generators such as Flex and ML-Lex, which prefer the longest match. To address this, either special patterns have to be added or (easier) rule functions are specified to handle the case.
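The effect of rule order can be reproduced with a small first-match tokenizer. This is a sketch built on Python's ordered regex alternation, not SLY's own code; the names first_match_lexer, REAL_CONS and DOT are hypothetical and chosen only for the demonstration.

```python
import re

def first_match_lexer(rules):
    """Build a tokenizer that tries the rules in the given order."""
    master = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in rules))

    def tokenize(text):
        pos, out = 0, []
        while pos < len(text):
            m = master.match(text, pos)
            if m is None:
                raise ValueError(f'illegal character {text[pos]!r} at index {pos}')
            # lastgroup is the name of the alternative that matched.
            out.append((m.lastgroup, m.group()))
            pos = m.end()
        return out

    return tokenize

# Integer rule first: the real constant is broken apart.
int_first = first_match_lexer(
    [('INT_CONS', r'\d+'), ('REAL_CONS', r'\d+\.\d+'), ('DOT', r'\.')])
# Longer pattern first: the real constant survives as one token.
real_first = first_match_lexer(
    [('REAL_CONS', r'\d+\.\d+'), ('INT_CONS', r'\d+'), ('DOT', r'\.')])

print(int_first('3.14'))   # [('INT_CONS', '3'), ('DOT', '.'), ('INT_CONS', '14')]
print(real_first('3.14'))  # [('REAL_CONS', '3.14')]
```

Listing the longer pattern first is the "special patterns" route; the alternative is a single numeric rule whose function decides the token type after the match.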
Suppose that we have the following program
program collatz;

var x : integer;

begin
  x := 6;
  while x > 1 do
  begin
    writeInteger(x);
    if x mod 2 = 0 then x := x div 2
    else x := 3 * x + 1;
  end;

end.
Then invoking
pclc.py collatz.pcl --pipeline lex pprint
yields
Token(type='PROGRAM', value='program', lineno=1, index=0)
Token(type='NAME', value='collatz', lineno=1, index=8)
Token(type='SEMICOLON', value=';', lineno=1, index=15)
Token(type='VAR', value='var', lineno=3, index=18)
Token(type='NAME', value='x', lineno=3, index=22)
Token(type='DCOLON', value=':', lineno=3, index=24)
Token(type='INTEGER', value='integer', lineno=3, index=26)
Token(type='SEMICOLON', value=';', lineno=3, index=33)
Token(type='BEGIN', value='begin', lineno=5, index=36)
Token(type='NAME', value='x', lineno=6, index=44)
Token(type='SET', value=':=', lineno=6, index=46)
Token(type='INT_CONS', value='6', lineno=6, index=49)
Token(type='SEMICOLON', value=';', lineno=6, index=50)
Token(type='WHILE', value='while', lineno=7, index=54)
Token(type='NAME', value='x', lineno=7, index=60)
Token(type='GT', value='>', lineno=7, index=62)
Token(type='INT_CONS', value='1', lineno=7, index=64)
Token(type='DO', value='do', lineno=7, index=66)
Token(type='BEGIN', value='begin', lineno=8, index=71)
Token(type='NAME', value='writeInteger', lineno=9, index=81)
Token(type='LPAREN', value='(', lineno=9, index=93)
Token(type='NAME', value='x', lineno=9, index=94)
Token(type='RPAREN', value=')', lineno=9, index=95)
Token(type='SEMICOLON', value=';', lineno=9, index=96)
Token(type='IF', value='if', lineno=10, index=102)
Token(type='NAME', value='x', lineno=10, index=105)
Token(type='MOD', value='mod', lineno=10, index=107)
Token(type='INT_CONS', value='2', lineno=10, index=111)
Token(type='EQUAL', value='=', lineno=10, index=113)
Token(type='INT_CONS', value='0', lineno=10, index=115)
Token(type='THEN', value='then', lineno=10, index=117)
Token(type='NAME', value='x', lineno=10, index=122)
Token(type='SET', value=':=', lineno=10, index=124)
Token(type='NAME', value='x', lineno=10, index=127)
Token(type='DIV', value='div', lineno=10, index=129)
Token(type='INT_CONS', value='2', lineno=10, index=133)
Token(type='ELSE', value='else', lineno=11, index=139)
Token(type='NAME', value='x', lineno=11, index=144)
Token(type='SET', value=':=', lineno=11, index=146)
Token(type='INT_CONS', value='3', lineno=11, index=149)
Token(type='TIMES', value='*', lineno=11, index=151)
Token(type='NAME', value='x', lineno=11, index=153)
Token(type='PLUS', value='+', lineno=11, index=155)
Token(type='INT_CONS', value='1', lineno=11, index=157)
Token(type='SEMICOLON', value=';', lineno=11, index=158)
Token(type='END', value='end', lineno=12, index=162)
Token(type='SEMICOLON', value=';', lineno=12, index=165)
Token(type='END', value='end', lineno=14, index=168)
Token(type='COLON', value='.', lineno=14, index=171)
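The index field in the output above is the absolute character offset of the token in the source text, which is why it keeps growing across lines. A column number can be derived from it together with the text. The helper below is a small sketch; the name find_column and the two-line excerpt are chosen for illustration and are not taken from the PCL compiler.

```python
def find_column(text, index):
    """Return the 1-based column of the character at `index` in `text`."""
    # rfind returns -1 when there is no earlier newline, so line_start becomes 0.
    line_start = text.rfind('\n', 0, index) + 1
    return index - line_start + 1

# Excerpt of the program above: the declaration sits on line 3.
excerpt = "program collatz;\n\nvar x : integer;\n"

print(find_column(excerpt, 8))   # 9, the 'collatz' token (index=8)
print(find_column(excerpt, 22))  # 5, the 'x' token (index=22)
```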