SyntaxError on use in expression of symbol with leading decimal digits #79

Open
willwray opened this issue Nov 22, 2022 · 13 comments · May be fixed by #80
@willwray

Here's a reduced reproducer:

#define Ox 0x
#if Ox
#endif

then pcpp test.h gives

test.h:3 error: Could not evaluate expression
 due to SyntaxError("around token 'x' type CPP_ID") (passed to evaluator: '0x')

It looks like leading decimal digits are eagerly stripped when parsed for the expression.

@willwray
Author

debugpy/launcher 37201 -- -m pcmd test.h

PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
test.h:3 error: Could not evaluate expression due to SyntaxError("around token 'x' type CPP_ID") (passed to evaluator: '0x')
PyInt_FromLong not found.


@ned14
Owner

ned14 commented Nov 22, 2022

That's invalid input, and it did give a fairly good hint as to what's invalid about it.

@willwray
Author

Oops, I was overzealous in reducing the reproducer to less than minimal...
Here's a reproducer that actually preprocesses:

#define CAT_(A,B)A##B
#define CAT(A,B)CAT_(A,B)

#define Ox 0x
#if CAT(Ox,0)
#endif

@willwray
Author

It appears that the text in (passed to evaluator: '0x0') is somehow lexed as CPP_INTEGER followed by CPP_ID,
where it should remain a single preprocessing token.

@willwray
Author

FYI, the error was hit using pcpp to do codegen with this preprocessing library
https://github.com/willwray/IREPEAT
in processing 'vertical' repetitions - here's one of the many problematic lines
https://github.com/willwray/IREPEAT/blob/master/VREPEATx10.hpp#L11

(it works with gcc, clang, and the new conforming msvc preprocessor)

@willwray
Author

Also FYI, I'm looking at using pcpp to create an amalgamated header
(convenient for use on Compiler Explorer via a single #include<url>)

I'm also evaluating if it can create nicer codegen than the native cpp's.
It seems to create more empty lines than gcc and clang, but far fewer than msvc.

@willwray
Author

The "PyInt_FromLong not found." spam seems to be coming from the debugger; it is a red herring.

@willwray
Author

pcpp lacks a pp-number token (C++ link; same for C11 and C99)
so the tokenization is wrongly choosing CPP_INTEGER

> import re
> ppint = r'(((((0x)|(0X))[0-9a-fA-F]+)|(\d+))([uU][lL]|[lL][uU]|[uU]|[lL])?)'
> match = re.search(ppint,"0x")
> match.group()
: '0'

when it should choose pp-number as the max-munch

> ppnum = r".?[0-9]([A-Za-z_][\w_]*|[eEpP][-+]|'[a-zA-Z0-9_])*"
> match = re.search(ppnum,"0x")
> match.group()
: '0x'

In phase 3 input is decomposed into preprocessing tokens,
then phase 4 executes # directives and recurses back through 1,2,3...

Only in phase 7 are preprocessing tokens converted into tokens for translation.
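
A rough sketch of that phase split, not pcpp code: the helper names below are hypothetical, and the regex is my own ordering of the pp-number alternatives (Python's `re` tries alternatives left to right, so the exponent-sign pair is listed before the generic identifier character). It shows "0x" surviving as one preprocessing token, being pasted with "0", and only then being read as an integer.

```python
import re

# pp-number sketch; the exponent-sign pair is tried first because
# Python alternation is ordered and match() does not look for the longest.
PP_NUMBER = re.compile(r"\.?\d(?:[eEpP][-+]|'\w|\.|\w)*")

def lex_pp_number(text):
    """Phase 3 (sketch): return the pp-number prefix of text, or None."""
    m = PP_NUMBER.match(text)
    return m.group() if m else None

def paste(left, right):
    """Phase 4 (sketch): ## joins spellings; the result must be one token."""
    joined = left + right
    if lex_pp_number(joined) != joined:
        raise ValueError(f"pasting {left!r} and {right!r} is not a pp-number")
    return joined

def evaluate_integer(spelling):
    """#if evaluation (sketch): only now is the spelling read as an integer.

    int(..., 0) accepts 0x/0b/0o prefixes and decimal; it does not
    understand C-style octal such as 010.
    """
    return int(spelling, 0)

token = lex_pp_number("0x")       # stays a single token
pasted = paste(token, "0")        # spelling after CAT(Ox,0)
value = evaluate_integer(pasted)  # integer seen by the #if
```

The point of the design is that nothing before `evaluate_integer` needs to know whether a spelling is a valid integer.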

pcpp only has one set of tokens (I'm trying to hack in a CPP_NUMBER token, no luck yet)

@willwray
Author

Help! Can't work out how to hack it.

Do the lextab.py and parsetab.py tables have to be regenerated? If so, how?

There's a comment on the in_production variable:

in_production = 1  # Set to 0 if editing pcpp implementation!

With it set to zero, my edits are still ignored: PLY introspects the new CPP_NUMBER token,
but then it seems to get lost at some point (maybe because the table files are used).

@willwray
Author

Related issue #71 also notes the incorrect parse as glued CPP_INTEGER and CPP_ID.

@willwray
Author

This could be a straightforward fix (still can't work out how to test it).

The current gcc lex.cc only processes CPP_NUMBER.

This 2001 bugfix commit to the C preprocessor
c-lex.c (c_lex): Remove CPP_INT, CPP_FLOAT cases

Don't use CPP_INT, CPP_FLOAT; CPP_NUMBER is enough

shows pp-number is sufficient for preprocessor lexing.

Then, for evaluator.py processing of #if conditionals, integer interpretation is needed
only "After all macro expansion and evaluation of ... ."
"Then the expression is evaluated as an integral constant expression" (CPP_INTEGER).

The current evaluator should correctly interpret any CPP_INTEGER.

In other words, CPP_INTEGER should be needed only for the evaluator
(and where the CPP_INTEGER##CPP_ID combo is a UDL user-defined literal)

Possible issues

  • pp-number is a broad superset that also matches invalid numbers
  • see lex.cc cpp_avoid_paste "avoid an accidental token paste"
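
On the second point, here is a minimal sketch of what an avoid-paste check could look like when printing tokens. This is not gcc's cpp_avoid_paste logic and not pcpp code; `needs_space` and `emit` are hypothetical names, and the check only guards the case where the left token is a pp-number.

```python
import re

# Same pp-number sketch as a lexer would use (exponent-sign pair tried first).
PP_NUMBER = re.compile(r"\.?\d(?:[eEpP][-+]|'\w|\.|\w)*")

def needs_space(left, right):
    """True if printing left and right adjacently would re-lex as a
    longer pp-number than left alone (an accidental paste)."""
    if PP_NUMBER.fullmatch(left) is None:
        return False  # this sketch only guards a pp-number on the left
    m = PP_NUMBER.match(left + right)
    return m is not None and m.end() > len(left)

def emit(tokens):
    """Join token spellings, inserting a space only where needed."""
    out = []
    for i, tok in enumerate(tokens):
        if i and needs_space(tokens[i - 1], tok):
            out.append(" ")
        out.append(tok)
    return "".join(out)
```

For example, the tokens `0xe`, `+`, `1` must not be printed as `0xe+1`, because `e+` continues a pp-number; `0x`, `+`, `1` can be printed adjacently.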

@ned14
Owner

ned14 commented Nov 23, 2022

You may find the ply parser docs at https://www.dabeaz.com/ply/ of use on how it works and generates the precalculated table files.

@willwray
Author

Related issue in Boost.Wave 👋 BOOST_PP_CAT(1e, -1) pp-token bug fixed early 2006

willwray added a commit to willwray/pcpp that referenced this issue Nov 24, 2022
A simple proof of concept change that fixes ned14#79.
With it, pcpp can do codegen using the IREPEAT library.

I believe it's conceptually correct, but my Python may not be;
please test this against your suite and review the method
(hack) carefully. There's not much code! Mostly deletions.

The change removes CPP_INTEGER, effectively replacing it with
PP_NUMBER, and entirely removes CPP_FLOAT as superfluous for
preprocessing purposes.

pp-number is sufficient for preprocessing up to phase 4.

The pp-number regex in the issue is incorrect, lifted from
unpublished WG21 https://isocpp.org/files/papers/D2180R0.html
"pp-number makes cpp dumber" (best proposal title ever).

Instead, I crafted a regex based on the latest C++ draft
https://eel.is/c++draft/lex.ppnumber#ntref:pp-number
which accepts character ' as digit separator:

  regex string   r'\.?\d(\.|[\w_]|\'[\w_]|[eEpP][-+])*'

(also admits binary literals, with digit separator, of course,
 so they can now be added to the Value parsing code)
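
As an illustration of what that value parsing might involve (a sketch only, not pcpp's Value class; `parse_integer_literal` is a hypothetical name), the function below validates the whole spelling and then converts hex, binary, octal and decimal literals with optional ' separators and u/l/ll suffixes:

```python
import re

# Strict integer-literal grammar: one named group per base, then an
# optional suffix. Separators are only allowed between digits.
_INT = re.compile(
    r"(?:0[xX](?P<hex>[0-9a-fA-F]+(?:'[0-9a-fA-F]+)*)"
    r"|0[bB](?P<bin>[01]+(?:'[01]+)*)"
    r"|(?P<oct>0(?:'?[0-7]+)*)"
    r"|(?P<dec>[1-9][0-9]*(?:'[0-9]+)*))"
    r"(?:[uU](?:ll|LL|[lL])?|(?:ll|LL|[lL])[uU]?)?"
)
_BASES = {"hex": 16, "bin": 2, "oct": 8, "dec": 10}

def parse_integer_literal(spelling):
    """Return the int value of a C/C++ integer literal, else ValueError."""
    m = _INT.fullmatch(spelling)
    if m is None:
        raise ValueError(f"not an integer literal: {spelling!r}")
    for kind, base in _BASES.items():
        digits = m.group(kind)
        if digits is not None:
            # Drop digit separators before converting.
            return int(digits.replace("'", ""), base)
    raise ValueError(f"not an integer literal: {spelling!r}")
```

Using `fullmatch` means a spelling such as "0x" or "089" is rejected rather than partially consumed.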

Only the conditional evaluator is required to interpret the
numbers as integer constant expressions.

This is achieved by hacky means:

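    def p_expression_number(p):
        'expression : PP_NUMBER'
        try:
            # Interpret the spelling as an integer constant where possible.
            p[0] = Value(p[1])
        except Exception:
            # Not an integer: pass the spelling through unchanged.
            p[0] = p[1]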

The idea is that if the parsed string p[1] can be interpreted as
an integer constant-expression Value(p[1]) then do so, otherwise
simply pass through the string for possible further pasting and
processing.

A robust method might check p[1] against the CPP_INTEGER regex
(removed in this commit) for a full match, consuming all input.
On the other hand, relying on Value to validate the input while
parsing and to raise an exception on failure may be Pythonic.
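
A minimal sketch of the "robust method" alternative, assuming the integer regex quoted earlier in this thread is the removed CPP_INTEGER pattern. `to_value` is a hypothetical stand-in for pcpp's Value, and `number_or_passthrough` is a made-up name; neither is pcpp API.

```python
import re

# Integer regex as quoted earlier in this thread.
CPP_INTEGER = re.compile(
    r"(((((0x)|(0X))[0-9a-fA-F]+)|(\d+))([uU][lL]|[lL][uU]|[uU]|[lL])?)"
)

def to_value(spelling):
    """Hypothetical stand-in for Value: plain int conversion."""
    digits = spelling.rstrip("uUlL")      # drop any integer suffix
    if digits[:2] in ("0x", "0X"):
        return int(digits, 16)
    if len(digits) > 1 and digits[0] == "0":
        return int(digits, 8)             # C-style octal
    return int(digits, 10)

def number_or_passthrough(spelling):
    """Convert only when the whole spelling is an integer literal;
    otherwise return the spelling unchanged for later pasting."""
    if CPP_INTEGER.fullmatch(spelling):
        return to_value(spelling)
    return spelling
```

Compared with the try/except version, the decision no longer depends on which exceptions the converter happens to raise.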

It seems that pp-number itself is a hack in the standard; I see
no way to incorporate pp-number alongside INTEGER and FLOAT tokens
meaningful in C; but then there's no need to. Happy Thanksgiving!
@willwray willwray linked a pull request Nov 24, 2022 that will close this issue
@ned14 ned14 added the bug label Apr 20, 2023