SyntaxError on use in expression of symbol with leading decimal digits #79

willwray · 2022-11-22T10:13:23Z

Here's a reduced reproducer:

#define Ox 0x
#if Ox
#endif

then pcpp test.h gives

test.h:3 error: Could not evaluate expression
 due to SyntaxError("around token 'x' type CPP_ID") (passed to evaluator: '0x')

It looks like leading decimal digits are eagerly stripped when parsed for the expression.

willwray · 2022-11-22T11:30:53Z

debugpy/launcher 37201 -- -m pcmd test.h

PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
test.h:3 error: Could not evaluate expression due to SyntaxError("around token 'x' type CPP_ID") (passed to evaluator: '0x')
PyInt_FromLong not found.

ned14 · 2022-11-22T15:03:18Z

That's invalid input, and it did give a fairly good hint as to what's invalid about it.

willwray · 2022-11-22T15:09:45Z

Oops, I was overzealous in reducing the reproducer to less-than minimal...
Here's a reproducer that actually preprocesses:

#define CAT_(A,B)A##B
#define CAT(A,B)CAT_(A,B)

#define Ox 0x
#if CAT(Ox,0)
#endif

willwray · 2022-11-22T15:12:08Z

It appears that (passed to evaluator: '0x0') is somehow lexed as CPP_INTEGER followed by CPP_ID
where it should remain a preprocessor token

willwray · 2022-11-22T15:16:58Z

FYI, the error was hit using pcpp to do codegen with this preprocessing library
https://github.com/willwray/IREPEAT
in processing 'vertical' repetitions - here's one of the many problematic lines
https://github.com/willwray/IREPEAT/blob/master/VREPEATx10.hpp#L11

(it works with gcc, clang, and the new conforming msvc preprocessor)

willwray · 2022-11-22T15:21:44Z

Also FYI, I'm looking at using pcpp to create an amalgamated header
(convenient for use on Compiler Explorer via a single #include<url>)

I'm also evaluating if it can create nicer codegen than the native cpp's.
It seems to create more empty lines than gcc and clang, but far fewer than msvc.

willwray · 2022-11-22T15:29:08Z

the PyInt_FromLong not found. spam seems to be coming from the debugger - a red herring

willwray · 2022-11-22T19:46:53Z

pcpp lacks a pp-number token (C++ link; same for C11 and C99)
so the tokenization is wrongly choosing CPP_INTEGER

> ppint = r'(((((0x)|(0X))[0-9a-fA-F]+)|(\d+))([uU][lL]|[lL][uU]|[uU]|[lL])?)'
> match = re.search(ppint,"0x")
> match.group()
: '0'

when it should choose pp-number as the max-munch

> ppnum = r".?[0-9]([A-Za-z_][\w_]*|[eEpP][-+]|'[a-zA-Z0-9_])*"
> match = re.search(ppnum,"0x")
> match.group()
: '0x'

In phase 3 input is decomposed into preprocessing tokens,
then phase 4 executes # directives and recurses back through 1,2,3...

Only in phase 7 are preprocessing tokens converted into tokens for translation.

pcpp only has one set of tokens (I'm trying to hack in a CPP_NUMBER token, no luck yet)
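
A minimal sketch of the max-munch point above, in Python. The `tokenize` helper, the rule lists, the identifier pattern and the simplified pp-number pattern are all written for this illustration and are assumptions, not pcpp's actual lexer (PLY orders and combines its rules differently). Only the integer pattern is the one quoted earlier in this comment.

```python
import re

# Integer pattern as quoted above (pcpp's CPP_INTEGER).
PPINT = re.compile(r'(((((0x)|(0X))[0-9a-fA-F]+)|(\d+))([uU][lL]|[lL][uU]|[uU]|[lL])?)')
# Identifier pattern: an assumption for this sketch.
IDENT = re.compile(r'[A-Za-z_]\w*')
# Simplified pp-number pattern written for this sketch (assumption).
# The exponent-sign alternative comes first so that "e-" is consumed as a unit.
PPNUM = re.compile(r"\.?\d(?:[eEpP][-+]|'\w|[\w.])*")

OLD_RULES = [("CPP_INTEGER", PPINT), ("CPP_ID", IDENT)]
NEW_RULES = [("PP_NUMBER", PPNUM), ("CPP_ID", IDENT)]

def tokenize(text, rules):
    """Split text into (type, lexeme) pairs, trying the rules in order."""
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and m.end() > pos:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise SyntaxError(f"no rule matches at offset {pos}: {text[pos:]!r}")
    return tokens

print(tokenize("0x", OLD_RULES))    # the fragment splits into two tokens
print(tokenize("0x", NEW_RULES))    # the fragment stays one token
print(tokenize("0x0", OLD_RULES))   # a complete hex literal is unaffected
```

With the integer-first rules the fragment 0x comes out as CPP_INTEGER '0' followed by CPP_ID 'x', which matches the error message. With a pp-number rule it stays a single token that can later be pasted with 0.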

willwray · 2022-11-22T23:57:03Z

Help! Can't work out how to hack it.

Do the lextab.py and parsetab.py tables have to be regenerated? If so, how?

There's a comment on the in_production variable:

in_production = 1  # Set to 0 if editing pcpp implementation!

Even when it is set to zero, my edits are still ignored. PLY introspects the new CPP_NUMBER token,
but it then seems to get lost at some point (maybe because the table files are still used).

willwray · 2022-11-23T10:46:43Z

Related issue #71 also notes the incorrect parse as glued CPP_INTEGER and CPP_ID.

willwray · 2022-11-23T12:45:29Z

This could be a straightforward fix (still can't work out how to test it).

The current gcc lex.cc only processes CPP_NUMBER.

This 2001 bugfix commit to the C preprocessor
c-lex.c (c_lex): Remove CPP_INT, CPP_FLOAT cases

Don't use CPP_INT, CPP_FLOAT; CPP_NUMBER is enough

shows pp-number is sufficient for preprocessor lexing.

Then, for evaluator.py processing of #if conditionals, CPP_INTEGER is needed
only "After all macro expansion and evaluation of ... .", when
"the expression is evaluated as an integral constant expression".

The current evaluator should correctly interpret any CPP_INTEGER.

In other words, CPP_INTEGER should be needed only for the evaluator
(and where the CPP_INTEGER##CPP_ID combo is a UDL user-defined literal)

Possible issues

pp-number is a broad superset that can also match invalid numeric tokens;
see lex.cc cpp_avoid_paste "avoid an accidental token paste"

ned14 · 2022-11-23T16:24:18Z

You may find the ply parser docs at https://www.dabeaz.com/ply/ of use on how it works and generates the precalculated table files.

willwray · 2022-11-23T21:37:54Z

Related issue in Boost.Wave 👋 BOOST_PP_CAT(1e, -1) pp-token bug fixed early 2006

A simple proof of concept change that fixes ned14#79. With it, pcpp can do codegen using the IREPEAT library.

I believe it's conceptually correct, but my Python may not be; please test this against your suite and review the method (hack) carefully. There's not much code! Mostly deletions.

The change removes CPP_INTEGER, effectively replacing it with PP_NUMBER, and entirely removes CPP_FLOAT as superfluous for preprocessing purposes. pp-number is sufficient for preprocessing to stage 4.

The pp-number regex in the issue is incorrect, lifted from unpublished WG21 https://isocpp.org/files/papers/D2180R0.html "pp-number makes cpp dumber" (best proposal title ever). Instead, I crafted a regex based on the latest C++ draft https://eel.is/c++draft/lex.ppnumber#ntref:pp-number which accepts character ' as digit separator:

r'\.?\d(\.|[\w_]|\'[\w_]|[eEpP][-+])*'

(also admits binary literals, with digit separator, of course, so they can now be added to the Value parsing code)

Only the conditional evaluator is required to interpret the numbers as integer constant expressions. This is achieved by hacky means:

def p_expression_number(p):
    'expression : PP_NUMBER'
    try:
        p[0] = Value(p[1])
    except:
        p[0] = p[1]

The idea is that if the parsed string p[1] can be interpreted as an integer constant-expression Value(p[1]) then do so, otherwise simply pass through the string for possible further pasting and processing.

A robust method might check p[1] against the CPP_INTEGER regex (removed in this commit) for a full match, consuming all input. On the other hand, relying on Value to validate the input while parsing and to raise an exception on failure may be Pythonic.

It seems that pp-number itself is a hack in the standard; I see no way to incorporate pp-number alongside INTEGER and FLOAT tokens meaningful in C; but then there's no need to.

Happy Thanksgiving!
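
The "robust method" mentioned above could look roughly like the following standalone sketch. It is not code from the pull request: `interpret_pp_number` is a hypothetical helper, a plain Python int stands in for pcpp's Value class, and the integer pattern is a compacted form of the CPP_INTEGER regex quoted earlier in the thread.

```python
import re

# Compacted form of the CPP_INTEGER pattern quoted earlier in the thread:
# a hex or decimal/octal digit sequence plus an optional u/l suffix.
CPP_INTEGER = re.compile(r"(0[xX][0-9a-fA-F]+|\d+)([uU][lL]|[lL][uU]|[uU]|[lL])?")

def interpret_pp_number(text):
    """Return an int if text is a complete integer literal, else the text unchanged.

    Hypothetical helper: a plain int stands in for pcpp's Value here.
    """
    m = CPP_INTEGER.fullmatch(text)      # must consume the whole lexeme
    if m is None:
        return text                      # e.g. "0x" or "1e-1": pass through
    digits = m.group(1)
    if digits[:2] in ("0x", "0X"):
        return int(digits, 16)           # int() accepts the 0x prefix for base 16
    if len(digits) > 1 and digits[0] == "0":
        try:
            return int(digits, 8)        # C-style octal literal
        except ValueError:
            return text                  # e.g. "09" is not valid octal
    return int(digits)

for lexeme in ["0x0", "0x", "42u", "017", "1e-1"]:
    print(lexeme, "->", repr(interpret_pp_number(lexeme)))
```

Compared with catching any exception from Value, the explicit full match makes it clear which lexemes are treated as integers and which are passed through for later pasting.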

willwray linked a pull request Nov 24, 2022 that will close this issue

Add PP_NUMBER, remove CPP_INTEGER, CPP_FLOAT #80

Open

ned14 added the bug label Apr 20, 2023
