Parser issue while parsing C code #1113

pth14 · 2022-09-17T15:07:43Z

pth14
Sep 17, 2022

Hi,

I am trying to parse C code to extract some information.
Parsing fails on a C struct initializer that mixes code and comments.
Here is my unit test:

var match = _parserLib.InterfaceIdList(@"static INTERFACE_t p_list [] =
{
    /*Comment1*/
    // Comment2
    {ID_E_P_P_M  , (void*)&e_p_p_m,   sizeof(e_p_p_m)   , ID_FLOAT32, PROD}, //TEXT=""E: IPM"" DU=""ppm""
    {ID_E_SC_E_S_NPM  , (void*)&s_mpm,  sizeof(s_mpm.speed_mpm)  , ID_FLOAT32, PROD}, //TEXT=""SE: Speed"" DU=""m/min""
	// Sp Det
	{ID_E_S_D_I_S , (void*)&s_i_s, sizeof(s_i_s)  , ID_UINT8,   PROD}, //TEXT=""SD: Input S""     DU=""pm""
};");
  Assert.That(match.Name, Is.EqualTo("p_list"));
  Assert.That(match.Items.Count(), Is.EqualTo(3));
  Assert.That(match.Items.First().Name, Is.EqualTo("ID_E_P_P_M"));
  Assert.That(match.Items.First().Comment.Text, Is.EqualTo("E: IPM"));
  Assert.That(match.Items.First().Comment.DisplayUnit, Is.EqualTo("ppm"));
  Assert.That(match.Items.Last().Name, Is.EqualTo("ID_E_S_D_I_S"));
  Assert.That(match.Items.Last().Comment.Text, Is.EqualTo("SD: Input S"));
  Assert.That(match.Items.Last().Comment.DisplayUnit, Is.EqualTo("pm"));

Here is the parser code:

  var interfaceEntry = str("static INTERFACE_t").label("interfaceId list entry");
  var openCloseBrackets = str("[]").label("open close brackets");
  var equal = ch('=').label("equal");
  var openBracket = ch('{').label("open bracket");
  var closeStruct = str("};").label("close struct");
  var noneOfCloseBracket = many(noneOf("}")).label("none of close bracket");
  var noneOfQuote = asString(many1(noneOf("\""))).label("none of quote");
  var comma = ch(',').label(" comma");
  var commentStart = str("//").label("comment start");
  var commentText = str("TEXT=\"").label("comment Text");
  var quote = ch('"').label("quote");
  var commentDU = str("DU=\"").label("comment DU");

  VariableParser = from first in firstVarChar
    from w in otherVarChar
    from s in spaces
    select first + w;

  InterfaceIdStructCommentParser = from s in commentStart
    from t in commentText
    from t2 in noneOfQuote
    // consume ending '"'
    from e in quote
    from sp in spaces
    from d in commentDU
    from d2 in noneOfQuote
    // consume ending '"'
    from e2 in quote
    select new InterfaceIdListItemComment { Text = t2, DisplayUnit = d2 };

  InterfaceIdStructCommentParser = from s in commentStart
    from t in commentText
    from t2 in noneOfQuote
    // consume ending '"'
    from e in quote
    from sp in spaces
    from d in commentDisplayUnit
    from d2 in noneOfQuote
    // consume ending '"'
    from e2 in quote
    select new InterfaceIdListItemComment { Text = t2, DisplayUnit = d2 };

  InterfaceIdParser = from s in spaces
    from o in openBracket
    from n in VariableParser
    // remaining
    from r in noneOfCloseBracket
    from e in closeBracket
    select n;

  InterfaceIdCommentParser = from ii in InterfaceIdParser
    from c in comma
    from s in spaces
    from c2 in InterfaceIdStructCommentParser
    select new InterfaceIdListItem { Name = ii, Comment = c2, IsIgnore = false };

  InterfaceIdListParser = from a in interfaceEntry
    from s in spaces
    from na in VariableParser
    from s2 in spaces
    from a2 in openCloseBrackets
    from s3 in spaces
    from a3 in equal
    from s4 in spaces
    from a4 in openBracket
    from s5 in spaces
    from c in many(either(InterfaceIdCommentParser,ignoreInterfaceIdStructCommentParser))
    from s6 in spaces
    from a5 in closeStruct
    select new InterfaceIdList { Name = na, Items = c.Where(ii => !ii.IsIgnore), IsValid = true };

The error is: "error at (line 7, column 5): unexpected "/", expecting space or open bracket"
Any help would be appreciated; this is my first time using LanguageExt.

louthy · 2022-09-17T20:07:21Z

louthy
Sep 17, 2022
Maintainer

The main thing to note is that all parsers have one of four result states:

Consumed Success - the parser succeeded
Consumed Failure - the parser parsed some tokens, but failed midway through
Empty Success - the parser succeeded without parsing anything (this can happen with empty lists, for example)
Empty Failure - the parser failed without parsing anything

Imagine this:

var p = either(str("Hello"), str("Hi"));

If you try to use the parser p with the string "Hi", it won't succeed, even though "Hi" is given as a valid option. That's because the first parser, str("Hello"), matches the 'H' and then fails when it expects 'e' and finds 'i'. This is a Consumed Failure. The parsec library won't automatically rewind to the start for the subsequent cases.

And so you need attempt:

var p = either(attempt(str("Hello")), str("Hi"));

This will make sure that any consumption is undone when starting subsequent cases.
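A minimal sketch of the two variants side by side, assuming the LanguageExt.Parsec package is referenced and using only combinators that appear in this thread. The variable names are illustrative:

```csharp
using System;
using LanguageExt;
using LanguageExt.Parsec;
using static LanguageExt.Parsec.Char;
using static LanguageExt.Parsec.Prim;

// Without attempt: per the explanation above, str("Hello") consumes the 'H'
// before failing, so the "Hi" alternative is not tried from the start.
var withoutAttempt = either(str("Hello"), str("Hi"));

// With attempt: a failure of str("Hello") is rewound to the start of the
// input, so str("Hi") gets a clean try.
var withAttempt = either(attempt(str("Hello")), str("Hi"));

// ToEither() turns the parser result into Either<string, string>:
// Left holds the error text, Right holds the parsed value.
Console.WriteLine(withoutAttempt.Parse("Hi").ToEither());
Console.WriteLine(withAttempt.Parse("Hi").ToEither());
```

Comparing the two printed results on the same input is a quick way to confirm whether a missing attempt is the cause of a failure.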

This is also true when using many1 and many: the iteration that ends the repetition may consume some input before failing, which turns the end of an otherwise successful run into a Consumed Failure. So also use attempt within many:

    many(attempt(p));
    many1(attempt(p));

I notice you don't have any attempt parsers in your example, and so it's likely to fail for that reason.
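To make the many case concrete, here is a hedged sketch (assuming the LanguageExt.Parsec package; the input and names are made up for illustration). The repeated item "ab" shares its first character with the terminator "ac", so the iteration that ends the repetition starts to match before it fails:

```csharp
using System;
using LanguageExt;
using LanguageExt.Parsec;
using static LanguageExt.Parsec.Char;
using static LanguageExt.Parsec.Prim;

// Without attempt: on the final iteration str("ab") reads the 'a' of "ac"
// before failing, which per the explanation above is a consumed failure.
var withoutAttempt =
    from xs in many(str("ab"))
    from tail in str("ac")
    select xs.Count;

// With attempt: the failed final iteration is rewound, many stops cleanly,
// and str("ac") then sees the untouched "ac".
var withAttempt =
    from xs in many(attempt(str("ab")))
    from tail in str("ac")
    select xs.Count;

Console.WriteLine(withoutAttempt.Parse("ababac").ToEither());
Console.WriteLine(withAttempt.Parse("ababac").ToEither());
```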

Another good technique when parsing languages is to build a token parser:

public static Parser<A> token<A>(Parser<A> p) =>
   from r in p
   from _ in spaces    // you need to build a spaces and comments parser to use here
   select r;

Then you can wrap any other parser with token(p) to make it automatically strip the spaces and comments in-between tokens.

For example, you could build a symbol parser:

public static Parser<string> symbol(string sym) =>
    token(str(sym));
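As a rough sketch of what such a hand-built "spaces and comments" parser could look like, the following skips whitespace and "//" line comments after every token. The helpers lineComment, junk, Token and Symbol are hypothetical names for this illustration, not part of the library, and block comments are left out for brevity:

```csharp
using System;
using LanguageExt;
using LanguageExt.Parsec;
using static LanguageExt.Prelude;
using static LanguageExt.Parsec.Char;
using static LanguageExt.Parsec.Prim;

// Hypothetical helper: a "//" comment running up to the end of the line.
Parser<Unit> lineComment =
    from _open in str("//")
    from _body in skipMany(noneOf("\n"))
    select unit;

// Hypothetical helper: any run of whitespace and line comments.
// Each alternative consumes input when it succeeds, so skipMany terminates.
Parser<Unit> junk = skipMany(either(skipMany1(space), attempt(lineComment)));

// Token parser in the shape shown above, with junk in place of spaces.
Parser<A> Token<A>(Parser<A> p) =>
    from r in p
    from _ in junk
    select r;

Parser<string> Symbol(string sym) => Token(str(sym));

var pair =
    from a in Symbol("alpha")
    from _c in Symbol(",")
    from b in Symbol("beta")
    from _e in eof
    select (a, b);

var result = pair.Parse("alpha // first\n , beta // second").ToEither();
Console.WriteLine(result);
```

The attempt around lineComment matters for the same reason as before: a lone "/" would otherwise be partially consumed by str("//").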

Building language parsers is quite hard to get right, especially when it comes to things like string literals (with escape codes), floating-point numbers, nested multi-line comments, etc. For that reason there's built-in functionality for producing most of the parsers you will need:

using static LanguageExt.Parsec.Token;

var reservedNames = List("void", "char", "short", "int", "long", "float",
                         "double", "signed", "unsigned", "struct", "union", "const", "volatile",
                         "auto", "register", "extern", "typedef",
                         "goto", "continue", "break", "return",
                         "while", "do", "for", 
                         "if", "switch", "case", "default");
        
// Use the built-in language definition for Java
// Modify the defaults using .With
var def = Language.JavaStyle.With(ReservedNames: reservedNames);

// Build common parsers from the language definition
lexer = makeTokenParser(def) ?? throw new InvalidOperationException();

Language.JavaStyle is the closest to C as a language, so you can just use that as your base. When makeTokenParser is run, it will populate lexer with lots of useful parsers (that automatically strip spaces and comments). Note that token is defined as lexer.Lexeme(p) below; this allows you to build new parsers and leverage everything that's been prebuilt for the 'Java style' language.

    static Parser<string> ident => lexer.Identifier;
    static Parser<string> dot => lexer.Dot;
    static Parser<string> comma => lexer.Comma;
    static Parser<string> colon => lexer.Colon;
    static Parser<string> op => lexer.Operator;
    static Parser<string> stringLiteral => lexer.StringLiteral;
    static Parser<char> charLiteral => lexer.CharLiteral;
    static Parser<int> natLiteral => lexer.Natural;
    static Parser<int> intLiteral => lexer.Decimal;
    static Parser<Unit> whiteSpace => lexer.WhiteSpace;  // This will parse comments too
    static Parser<Either<int, double>> natOrFloatLiteral => lexer.NaturalOrFloat;
    static Parser<double> floatLiteral => lexer.Float;
    static Parser<string> reserved(string ident) => lexer.Reserved(ident);
    static Parser<string> symbol(string sym) => lexer.Symbol(sym);
    static Parser<A> token<A>(Parser<A> p) => lexer.Lexeme(p);
    static Parser<A> parens<A>(Parser<A> p) => lexer.Parens(p);
    static Parser<A> braces<A>(Parser<A> p) => lexer.Braces(p);
    static Parser<A> brackets<A>(Parser<A> p) => lexer.Brackets(p);
    static Parser<A> angles<A>(Parser<A> p) => lexer.Angles(p);
    static Parser<Seq<A>> semiSep<A>(Parser<A> p) => lexer.SemiSep(p);
    static Parser<Seq<A>> semiSep1<A>(Parser<A> p) => lexer.SemiSep1(p);
    static Parser<Seq<A>> commaSep<A>(Parser<A> p) => lexer.CommaSep(p);
    static Parser<Seq<A>> commaSep1<A>(Parser<A> p) => lexer.CommaSep1(p);
    static Parser<Seq<A>> bracketsCommaSep<A>(Parser<A> p) => lexer.BracketsCommaSep(p);
    static Parser<Seq<A>> bracketsCommaSep1<A>(Parser<A> p) => lexer.BracketsCommaSep1(p);
    static Parser<Seq<A>> parensCommaSep<A>(Parser<A> p) => lexer.ParensCommaSep(p);
    static Parser<Seq<A>> parensCommaSep1<A>(Parser<A> p) => lexer.ParensCommaSep1(p);
    static Parser<Seq<A>> anglesCommaSep<A>(Parser<A> p) => lexer.AnglesCommaSep(p);
    static Parser<Seq<A>> anglesCommaSep1<A>(Parser<A> p) => lexer.AnglesCommaSep1(p);
    static Parser<Seq<A>> bracesCommaSep<A>(Parser<A> p) => lexer.BracesCommaSep(p);
    static Parser<Seq<A>> bracesCommaSep1<A>(Parser<A> p) => lexer.BracesCommaSep1(p);
    static Parser<Seq<A>> bracketsSemiSep<A>(Parser<A> p) => lexer.BracketsSemiSep(p);
    static Parser<Seq<A>> bracketsSemiSep1<A>(Parser<A> p) => lexer.BracketsSemiSep1(p);
    static Parser<Seq<A>> parensSemiSep<A>(Parser<A> p) => lexer.ParensSemiSep(p);
    static Parser<Seq<A>> parensSemiSep1<A>(Parser<A> p) => lexer.ParensSemiSep1(p);
    static Parser<Seq<A>> anglesSemiSep<A>(Parser<A> p) => lexer.AnglesSemiSep(p);
    static Parser<Seq<A>> anglesSemiSep1<A>(Parser<A> p) => lexer.AnglesSemiSep1(p);
    static Parser<Seq<A>> bracesSemiSep<A>(Parser<A> p) => lexer.BracesSemiSep(p);
    static Parser<Seq<A>> bracesSemiSep1<A>(Parser<A> p) => lexer.BracesSemiSep1(p);

One final tip is to know where the end of the stream is, and make sure it's there. So for your InterfaceIdList parser, you could do something like this:

    from _1 in whiteSpace
    from rs  in InterfaceIdList
    from _2 in eof
    select rs;

That will strip the spaces and comments at the start, parse the tokens (which automatically strip the comments and spaces after each token), and so we expect the end-of-stream after that. It should be there, or your parse has failed.
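A small, hedged sketch of that shape using the prebuilt Java-style lexer (assuming the LanguageExt.Parsec package; the input string is made up). It strips leading spaces and comments, parses a brace-delimited, comma-separated list of identifiers, and then requires the end of the stream:

```csharp
using System;
using LanguageExt;
using LanguageExt.Parsec;
using static LanguageExt.Parsec.Prim;
using static LanguageExt.Parsec.Token;

// Token parsers for the built-in Java-style language definition.
var lexer = makeTokenParser(Language.JavaStyle);

var names =
    from _ws in lexer.WhiteSpace                                  // leading spaces/comments
    from ids in lexer.Braces(lexer.CommaSep(lexer.Identifier))   // { a, b, c }
    from _end in eof                                              // nothing may follow
    select ids;

var source = "  /* leading */ { a, /* inline */ b, c // trailing\n }  ";
var result = names.Parse(source).ToEither();

Console.WriteLine(result.Match(
    Right: ids => $"parsed {ids.Count} identifiers",
    Left: err => $"failed: {err}"));
```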

One other thing to look at is the BNF definition for C. That will give you a good insight into how to build your parsers (if you're looking to expand out to a more general C parser).

6 replies

pth14 Sep 19, 2022
Author

Hi,

I made some progress.
Using the built-in functionality greatly simplified the code, see below:

var interfaceIdListEntry = str("static INTERFACE_t").label("interfaceId list entry");
var equal = ch('=').label("equal");
var noneOfCloseBracket = many(noneOf("}")).label("none of close bracket");

var reservedNames = new Lst<string>(List.create<string>("static, struct", "enum"));
ParserResult<InterfaceIdList> parseResult = null;
var def = Language.JavaStyle.With(ReservedNames: reservedNames);
var lexer = makeTokenParser(def);
var structItemParser = from i in lexer.Identifier
                                from _s in spaces
                                from _r in noneOfCloseBracket
                                select i;
var structItemsParser = from si in lexer.Braces(structItemParser)
                                  from _c in lexer.Comma
                                  select new InterfaceIdListItem { Name = si, Comment = new InterfaceIdListItemComment(), IsIgnore = false };
var structParser = from _s1 in spaces
                         from _i1 in interfaceIdListEntry
                         from _s2 in spaces
                         from i in lexer.Identifier
                         from b in lexer.Brackets(spaces)
                         from _2 in spaces
                         from e in equal
                         from _s3 in spaces
                         from sip in lexer.Braces(many(structItemsParser))
                         from _s4 in spaces
                         from _sc in lexer.Semi
                         from _e in eof
                         select new InterfaceIdList { Name = i, Items = sip.Where(ii => !ii.IsIgnore), IsValid = true };
parseResult = parse(structParser, token);

However, I forgot to mention that I need to parse some comments and ignore others.
I should ignore ordinary comments (starting with "//" or "/*"), but I need to parse the comment information starting with "TEXT=" or "DU=", as you can see in my unit test.
So, I still have some issues with this simplified version:

Even if I use "spaces" instead of lexer.WhiteSpace, this simplified version removes all comments from parsing.
It is much slower, about 5 times slower.
Even if I use "eof", I get the following error: "unexpected end of stream, expecting end of input"
I was not really able to benefit from reserved keyword parsing.

pth14 Sep 19, 2022
Author

I understand consumed failure and the usage of attempt. I used it in other code that I didn't post. I don't think this is the cause of the issue. Anyway, I will try to apply the other advice. Thank you again.

You were right, I made some progress by adding attempt to the following line:

InterfaceIdCommentParser = from ii in attempt(InterfaceIdParser)

Now it seems I have an issue signalling the end of parsing, as I get the following error:
"error at (line 9, column 3): unexpected many: combinator 'many' is applied to a parser that accepts an empty string."

louthy Sep 19, 2022
Maintainer

A few things should be changed:

var interfaceIdListEntry = str("static INTERFACE_t").label("interfaceId list entry");
var equal = ch('=').label("equal");

Should be:

var interfaceIdListEntry = lexer.Symbol("static INTERFACE_t").label("interfaceId list entry");
var equal = lexer.Symbol("=");

This guarantees they'll strip the spaces and comments after the token.

> Even if I use "spaces" instead of lexer.WhiteSpace, this simplified version removes all comments from parsing.

makeTokenParser is a convenience for the most common use-case, i.e. removing all spaces and comments. If you look at the source code for makeTokenParser, you can see how the GenTokenParser parsers are built. Simply cut and paste that into your own project and provide your alternative comment parsers. You must take the whole thing, because every other parser in makeTokenParser uses the lexeme parsing, which strips comments and spaces in their entirety. So don't call makeTokenParser at all; just copy the source into your own project and modify it.

The code from the start of the function down to the whiteSpace parser is the section you need to look at for stripping spaces and comments.

> It is much slower, about 5 times slower.

Firstly, you're doing more work than is necessary. All the calls to spaces should be dropped; that is done automatically by the provided parsers.

This:

var noneOfCloseBracket = many(noneOf("}")).label("none of close bracket");

Can be changed to:

var noneOfCloseBracket = skipMany1(noneOf("}")).label("scoped body");

many will build up a Seq, whereas skipMany and skipMany1 will just drop the parsed items.

Parsing tokens properly just isn't as fast as the more naive approach. For example, identifiers need to check whether you've actually got a reserved name; reserved names need to check that they're not immediately followed by valid identifier characters; string literals need to deal with escape codes. This is just slower.

If you're looking for outright performance, then the best way is to tokenise the string first, then use Parser<I, O> with the token stream. There are other things you can do in terms of structuring your parser so that it doesn't need to look ahead too far before failing, but in this case I don't think you're doing that much.

Also, pre-building the parsers themselves is a good idea. Initialise them in a static constructor, so you don't need to keep building them each time you call parse.

> Even if I use "eof", I get the following error: "unexpected end of stream, expecting end of input"

Then you're not at the end of the stream. Be mindful to check the result state, though, because there can still be error messages even on success: errors 'bubble up'. If you have a result of Consumed Success, it's worked. Alternatively, convert to Either<string, A> to get a simpler result.
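A short sketch of the Either route, assuming the LanguageExt.Parsec package (the word parser and inputs are made up for illustration). The same parser is run on input that ends where expected and on input with something left over:

```csharp
using System;
using LanguageExt;
using LanguageExt.Parsec;
using static LanguageExt.Parsec.Char;
using static LanguageExt.Parsec.Prim;

// One or more letters, then the end of the stream.
var word =
    from w in asString(many1(letter))
    from _ in eof
    select w;

// Right holds the parsed value, Left holds the error text.
var clean = word.Parse("abc").ToEither();
var trailing = word.Parse("abc!").ToEither();   // '!' is left over, so eof fails

Console.WriteLine(clean.Match(Right: w => $"ok: {w}", Left: err => $"failed: {err}"));
Console.WriteLine(trailing.Match(Right: w => $"ok: {w}", Left: err => $"failed: {err}"));
```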

I have modified your example, and I am able to parse the source correctly. Note that I have made the changes I mention above, and moved the eof out of the struct parser into a stand-alone parser, so that you can compose structParser with other parsers. Also note that I strip spaces and comments before the call to structParser. This is needed because the token parsers only strip spaces/comments after the token, so the first thing any source parser should do is strip the initial spaces/comments:

using LanguageExt;
using LanguageExt.Parsec;
using static LanguageExt.Prelude;
using static LanguageExt.Parsec.Char;
using static LanguageExt.Parsec.Prim;
using static LanguageExt.Parsec.Token;

namespace ExampleCParser;

public record InterfaceIdListItem(string Name, InterfaceIdListItemComment Comment, bool IsIgnore);
public record InterfaceIdList(string Name, Seq<InterfaceIdListItem> Items, bool IsValid);
public record InterfaceIdListItemComment;

public class CParser
{
    static readonly Parser<InterfaceIdList> structParser;
    static readonly GenTokenParser lexer;

    static CParser()
    {
        var reservedNames = List("static", "struct", "enum");
        var def = Language.JavaStyle.With(ReservedNames: reservedNames);
        lexer = makeTokenParser(def);
        
        var equal = symbol("=");
        var staticKeyword = reserved("static").label("static");
        var noneOfCloseBracket = skipMany1(noneOf("}")).label("scope body");
        var type = ident.label("type");

        var structItemParser = 
            from i in ident
            from _s in spaces
            from _r in noneOfCloseBracket
            select i;
        
        var structItemsParser = 
            from si in braces(structItemParser)
            from _c in comma
            select new InterfaceIdListItem(si, new InterfaceIdListItemComment(), false);
        
        structParser = from _s1 in spaces
            from m in staticKeyword
            from t in type
            from i in ident
            from b in brackets(spaces)
            from e in equal
            from sip in braces(many(structItemsParser))
            from _sc in semi
            select new InterfaceIdList(i, sip.Where(ii => !ii.IsIgnore), true);
    }

    public static Either<string, InterfaceIdList> ParseStruct(string source) =>
        (from _ in whiteSpace
         from r in structParser
         from x in eof
         select r)
        .Parse(source)
        .ToEither();
    
    static Parser<string> ident => lexer.Identifier;
    static Parser<string> semi => lexer.Semi;
    static Parser<string> comma => lexer.Comma;
    static Parser<Unit> whiteSpace => lexer.WhiteSpace;
    static Parser<string> symbol(string sym) => lexer.Symbol(sym);
    static Parser<string> reserved(string sym) => lexer.Reserved(sym);
    static Parser<A> braces<A>(Parser<A> p) => lexer.Braces(p);
    static Parser<A> brackets<A>(Parser<A> p) => lexer.Brackets(p);
}

I can then call:

var test1 = @"static INTERFACE_t p_list [] =
{
    /*Comment1*/
    // Comment2
    {ID_E_P_P_M  , (void*)&e_p_p_m,   sizeof(e_p_p_m)   , ID_FLOAT32, PROD}, //TEXT=""E: IPM"" DU=""ppm""
    {ID_E_SC_E_S_NPM  , (void*)&s_mpm,  sizeof(s_mpm.speed_mpm)  , ID_FLOAT32, PROD}, //TEXT=""SE: Speed"" DU=""m/min""
	// Sp Det
	{ID_E_S_D_I_S , (void*)&s_i_s, sizeof(s_i_s)  , ID_UINT8,   PROD}, //TEXT=""SD: Input S""     DU=""pm""
};";

var result = CParser.ParseStruct(test1);

And it works.

pth14 Sep 19, 2022
Author

Thank you so much, I didn't expect such a detailed answer.

> Simply cut n paste that into your own project, and provide your alternative comment parsers. You must take the whole thing, because every other parser in the makeTokenParser will use the lexeme parsing which strips comments and spaces in their entirety. So, don't call makeTokenParser at all, just copy the source into your own project and modify

I tried a simpler way.
First, I tried this:

var def = Language.JavaStyle.With(ReservedNames: reservedNames, CommentStart: null, CommentEnd: null, CommentLine: null);

but it didn't work, because it seems the "With" method doesn't allow setting the comment parameters to null.
Then I tried this:

var opChars = "+-*/";
var def = GenLanguageDef.Empty.With(
                NestedComments: true,
                OpStart: oneOf(opChars),
                OpLetter: oneOf(opChars),
                IdentStart: letter,
                IdentLetter: either(alphaNum, ch('_')),
                ReservedNames: reservedNames,
                ReservedOpNames: List("+", "-", "*", "/")
);

It seems to work, and it seems simpler to me.

Now I need to implement a comment parser.
I will have a look at how you did it in makeTokenParser.
Thank you again.

pth14 Sep 20, 2022
Author

I implemented the comment parser successfully.
Thank you again for your quick support.
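The final comment parser was not posted in the thread. For readers with a similar need, here is one possible sketch, modelled on the InterfaceIdStructCommentParser from the original question and assuming the LanguageExt.Parsec package. The names quoted and metaComment are hypothetical, and this is not necessarily how the author solved it:

```csharp
using System;
using LanguageExt;
using LanguageExt.Parsec;
using static LanguageExt.Parsec.Char;
using static LanguageExt.Parsec.Prim;

// Hypothetical helper: a double-quoted value without escape sequences.
var quoted =
    from _open in ch('"')
    from s in asString(many1(noneOf("\"")))
    from _close in ch('"')
    select s;

// Hypothetical parser for comments of the form: //TEXT="..." DU="..."
var metaComment =
    from _start in str("//")
    from _t in str("TEXT=")
    from text in quoted
    from _sp in spaces
    from _d in str("DU=")
    from du in quoted
    select (Text: text, DisplayUnit: du);

var result = metaComment.Parse("//TEXT=\"E: IPM\" DU=\"ppm\"").ToEither();

Console.WriteLine(result.Match(
    Right: c => $"Text={c.Text}, DisplayUnit={c.DisplayUnit}",
    Left: err => $"failed: {err}"));
```

When combining a parser like this with an "ignore ordinary comments" alternative, wrapping it in attempt keeps a plain "// Comment2" line from failing after the "//" has already been consumed.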
