Add PP_NUMBER, remove CPP_INTEGER, CPP_FLOAT #80

willwray · 2022-11-24T12:25:41Z

A simple proof of concept change that fixes #79.
With it, pcpp can do codegen using the IREPEAT library.

I believe it's conceptually correct, but my Python may not be;
please test this against your suite and review the method
(hack) carefully. There's not much code! Mostly deletions.

The change removes CPP_INTEGER, effectively replacing it with
PP_NUMBER, and entirely removes CPP_FLOAT as superfluous for
preprocessing purposes.

pp-number is sufficient for preprocessing to stage 4

The pp-number regex in the issue is incorrect, lifted from
unpublished WG21 https://isocpp.org/files/papers/D2180R0.html
"pp-number makes cpp dumber" (best proposal title ever).

Instead, I crafted a regex based on the lastest C++ draft
https://eel.is/c++draft/lex.ppnumber#ntref:pp-number
which accepts character ' as digit separator:

regex string r'.?\d(.|[\w_]|'[\w_]|[eEpP][-+])*'

(also admits binary literals, with digit separator, of course,
so they can now be added to the Value parsing code)

Only the conditional evaluator is required to interpret the
numbers as integer constant expressions.

This is achieved by hacky means:

def p_expression_number(p):
    'expression : PP_NUMBER'
    try:
        p[0] = Value(p[1])
    except:
        p[0] = p[1]

The idea is that if the parsed string p[1] can be interpreted as
an integer constant-expression Value(p[1]) then do so, otherwise
simply pass through the string for possible further pasting and
processing.

A robust method might check p[1] against the CPP_INTEGER regex
(removed in this commit) for a full match, consuming all input.
On the other hand, relying on Value to validate the input while
parsing and to raise an exception on failure may be Pythonic.

It seems that pp-number itself is a hack in the standard; I see
no way to incorporate pp-number alongside INTEGER and FLOAT tokens
meaningful in C; but then there's no need to. Happy Thanksgiving!

A simple proof of concept change that fixes ned14#79. With it, pcpp can do codegen using the IREPEAT library. I believe it's conceptually correct, but my Python may not be; please test this against your suite and review the method (hack) carefully. There's not much code! Mostly deletions. The change removes CPP_INTEGER, effectively replacing it with PP_NUMBER, and entirely removes CPP_FLOAT as superfluous for preprocessing purposes. pp-number is sufficient for preprocessing to stage 4 The pp-number regex in the issue is incorrect, lifted from unpublished WG21 https://isocpp.org/files/papers/D2180R0.html "pp-number makes cpp dumber" (best proposal title ever). Instead, I crafted a regex based on the lastest C++ draft https://eel.is/c++draft/lex.ppnumber#ntref:pp-number which accepts character ' as digit separator: regex string r'\.?\d(\.|[\w_]|\'[\w_]|[eEpP][-+])*' (also admits binary literals, with digit separator, of course, so they can now be added to the Value parsing code) Only the conditional evaluator is required to interpret the numbers as integer constant expressions. This is achieved by hacky means: def p_expression_number(p): 'expression : PP_NUMBER' try: p[0] = Value(p[1]) except: p[0] = p[1] The idea is that if the parsed string p[1] can be interpreted as an integer constant-expression Value(p[1]) then do so, otherwise simply pass through the string for possible further pasting and processing. A robust method might check p[1] against the CPP_INTEGER regex (removed in this commit) for a full match, consuming all input. On the other hand, relying on Value to validate the input while parsing and to raise an exception on failure may be Pythonic. It seems that pp-number itself is a hack in the standard; I see no way to incorporate pp-number alongside INTEGER and FLOAT tokens meaningful in C; but then there's no need to. Happy Thanksgiving!

willwray · 2022-11-27T21:37:55Z

Added tests/issue0079.py
FAILs pre-patch as expected

FAIL: runTest (tests.issue0079.pp_number_pasting)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/will/repos/pcpp/tests/issue0079.py", line 21, in runTest
    self.assertEqual(p.return_code, 0)
AssertionError: 1 != 0

All python setup.py test tests pass post-patch.

Are there other tests to be run?

Should I commit the new generated lextab.py and parsetab.py?

ned14 · 2022-11-28T08:06:52Z

Should I commit the new generated lextab.py and parsetab.py?

If you wish.

It'll be next Spring at least before I can look at this properly.

Not sure if its right to commit generated tables in the PR so here they are in a separate commit to pick or drop.

willwray · 2022-11-28T13:05:57Z

ok, here are the generated tables, to pick or drop if and when merged.

Maybe @assarbad can review as it's closely related to his #71 (comment)
also suggests tweaking the CPP_INTEGER token that this PR eliminates.

I may be missing something obvious...
I see no need for CPP_INTEGER or CPP_FLOAT in a pure preprocessor.
Am I missing some use-case that requires phase-7 processing, perhaps.
The test-c stuff, and all the c- tests there...

If you authorise the PR action then it should confirm that the tests pass
or flag up any issue I may have missed.

willwray · 2022-11-28T13:13:24Z

Thanks for coding, publishing and maintaining this project.
I'd come across it before, forgotten, then found again.

For my use case, I'm actually more interested in a fix for issue #53 ...
How easy do you gauge that to be?
Maybe I learn some Python file system handling

assarbad · 2022-11-29T23:11:57Z

Thanks for coding, publishing and maintaining this project. I'd come across it before, forgotten, then found again.

For my use case, I'm actually more interested in a fix for issue #53 ... How easy do you gauge that to be? Maybe I learn some Python file system handling

I am also looking at issue #53 myself. And I'll review your patch, but not tonight.

assarbad · 2022-12-01T20:04:59Z

FYI: I looked at your willwray/pcpp:pp_number branch, @willwray, and tested it against the repro from #71. And lo and behold the issue is gone.

I'll now take a closer look at the actual changes over what's on master here.

assarbad

I think PP_NUMBER should be named CPP_NUMBER throughout, to stick with the established CPP_ prefix.

Otherwise this PR looks fine.

There is one minor issue that we may want to postpone for a separate ticket or not, mostly depends on @ned14. That regular expression doesn't quite capture what the pp-number is defined as. I'll comment on this separately, though. It's probably going to be bigger than what should go into such a final comment here 😉

pcpp/lextab.py

tests/issue0079.py

pcpp/evaluator.py

pcpp/parser.py

assarbad · 2022-12-01T21:38:22Z

So we have the following bits (correct me if I got it wrong):

pp-number:
- digit
- . digit [ == \.\d ]
- pp-number identifier-continue
- pp-number ' digit [ += '\d ]
- pp-number ' nondigit [ += [a-zA-Z0-9_] ]
- pp-number e sign
- pp-number E sign
- pp-number p sign
- pp-number P sign
- pp-number .

... where:

digit corresponds to \d
identifier-continue corresponds to [\w_], which would include non-ASCII letter code points (at least the description seems to include other Unicode letter code points, as per XID_Continue and this)
nondigit corresponds to pp-number + [a-zA-Z0-9_]
NB: interestingly as per the description this does not include other letters than a-z in lower and upper case

The rest is relatively self-explanatory.

The description you referenced already gives us a lot of leeway. And it may be appropriate to keep it that greedy and that broad. However, with the above remarks in mind I'd like to propose changing it to:

\.?\d(?:\.|[\w_]|'[\w_]|[eEpP][-+])*

This:

makes it non-capturing
removes the escaping of ' (was unnecessary)

We should probably also define what we expect to be matched here. Admittedly my proposed regex from #71 may be too strict.

My change does not address:

the difference between nondigit and identifier-continue
it doesn't make it "crispier", i.e. more to the point, such as really matching '(?:\d|[a-zA-Z0-9_]) which in the above is merely '[\w_]

What say you, @ned14 and @willwray?

- Renames PP_NUMBER from the PR to CPP_NUMBER - Test case for ned14#71 which will only work after this PR gets merged - Adds a stub to run pcpp without first installing it - Amends .gitignore to ignore items created by the tests

willwray · 2022-12-01T22:19:27Z

Thanks @assarbad for your thorough detailed review.
I'll aim to resolve all comments at the weekend.

ned14 · 2022-12-02T09:28:45Z

For me, if GCC and Wave's preprocessor use the same technique to solve a conformance problem, then it seems appropriate that pcpp can do the same.

Whether the technique is implemented correctly is another matter, but I'm sure we'll get there.

The previous regex escaped the digit separator ' as \' - necessary only because the string itself was '' single-quoted. This commit switches to use double-quote "" for the string and removes the now unneeded escape.

willwray · 2022-12-14T17:21:10Z

is the pcpp/evaluator.py change to p_expression_number kosher or a horrid hack?

The hack is highlighted in the top PR message and it's right up top of the listed Files changed tab view.
Please take a quick look. It's a few lines diff but doing it differently might take much more careful coding.

It effectively relies on Python exceptions for validating the Value(p[1]) evaluation.

p_expression_number originally used CPP_INTEGER as its input match then delegated direct to Value():

def p_expression_number(p):
    'expression : CPP_INTEGER'
    p[0] = Value(p[1])

p_expression_number now uses CPP_NUMBER as input match instead:

def p_expression_number(p):
    'expression : PP_NUMBER'
    try:
        p[0] = Value(p[1])
    except:
        p[0] = p[1]

Is it right and Pythonic to rely on exceptions to do the validation like this?

By the fact that the tests all pass, it appears that Value() is quite robust in parsing from CPP_NUMBER instead, and raising an exception if there's a failure to parse out the complete token character string. Perhaps there are more stress-test cases to run against this change.

I don't see a clean alternative way to do this that meshes properly with the grammar and parser., and still allows for pasting to form numeric literals from smaller pp-token parts.

willwray · 2022-12-14T17:45:21Z

the unicode question

The CPP_NUMBER regex definition invokes the specification of the pp-number regular expression that requires XID_Continue.

This raises further questions:

to what extent is pcpp already unicode-enabled?
what other token regex or code might call for unicodification?

Out of my depth, I asked a couple of questions on WG21 sg16 unicode study group mailing list.

I guess the regex should be extended to parse XID_Continue.
Does anyone here have a clue how to do that?

A regex engine that supports UTS#18 (Unicode Regular Expressions) rule RL2.7 would allow matching a character with the XID_Continue property with the syntax \p{XID_Continue}. Unicode Utilities example. Note that the example matches the identifier (but not the mathematical character nor the clown faces).
I don't think the Python regex engine supports that though, so you'll have to find another regex engine or another method to perform the match.
Clang is also going to allow the mathematical profile specified in L2/22-230 in identifiers. C++ may eventually allow them as well.

Any other pp-token that should be considered for unicodification?

Some other tokens incorporate an identifier user-defined-literal does.

The Python standard library re regular expression module notes:
https://docs.python.org/3/library/re.html

See also The third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and a more thorough Unicode support.

assarbad · 2022-12-14T23:01:06Z

def p_expression_number(p):
    'expression : PP_NUMBER'
    try:
        p[0] = Value(p[1])
    except:
        p[0] = p[1]

Is it right and Pythonic to rely on exceptions to do the validation like this?

The only un-Pythonic thing I see here is that the exception isn't caught by listing concrete exceptions or at least a base exception type. This is considered sort of an anti-pattern. As an example, this particular code should probably never raise a KeyError, but if it does it probably signals an actual logic error rather than something you may want to catch. I think in this case probably (but not 100% sure) you are looking for TypeError. If you have an example with this ambiguity a test could be used to validate your assumptions.

See https://docs.python.org/3/library/exceptions.html#base-classes

Otherwise I don't see much of an issue trying to treat something one way and, failing that, falling back to the alternative. AFAIK nothing un-Pythonic about that.

This raises further questions:

* to what extent is `pcpp` already unicode-enabled?
* what other token regex or code might call for unicodification?

As long as you stick to character classes such as \w/\W or \s/\S and so on, you're fairly safe with Python. It already considers Unicode to some extent. But probably not specifically to the extent of supporting Unicode character category XID_Continue. At least from what I found \p{categoryname} isn't supported by re. The case of t_CPP_ID for example seems to be approximated well enough.

See also The third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and a more thorough Unicode support.

I don't know. As an option this would probably be neat, but outright depending on it would not be too nice, given currently the set of external dependencies is quite limited. It's true that this may eventually become a concern, but as we have learned compilers already had to backpedal on allowed Unicode characters to some extent (changing direction and such), because otherwise editors would outright hide code. And as you noted yourself the debate is far from finished, so I think that bridge ought to be crossed when we get there, not necessarily right now.

Also keep in mind that other than for measuring header heft, pcpp probably is used for cases such as amalgamating libraries into single-header. This means latest if the compiler see the results of what pcpp spat out, you'll get the feedback whether something was wrong.

ned14 · 2022-12-15T11:49:11Z

I believe pcpp on Python 3 makes a reasonable attempt at Unicode. C++ has been evolving there though, as has C, and there are differences which are being reconciled. TBH I think a reasonable attempt is reasonable :)

willwray · 2022-12-15T11:57:14Z

I spent too long yesterday following up on Tom H's unicode links - my conclusion was that the security and stability concerns for languages, around unicode identifiers and exploits, are very unlikely to be concerns for pcpp as a pure preprocessor - it seems ok for pcpp to be lax in what it accepts as identifiers.

(p.s. It looks like that CI failure is on Python 2 due to the added StringIO context manager - I can revert that)

This reverts commit 8a5b9d0.

willwray · 2022-12-15T12:10:57Z

In particular, I like the idea that pcpp accept the mathematical profile identifiers that gcc currently accepts (rejects in -pedantic), clang 14 accepted and now rejects in 15 (but is adding an extension flag to support it) and that look likely to be incorporated in C++ once there's consensus.

ned14 · 2022-12-15T13:50:04Z

I go on annual vacation next week, I'm hoping some time will present itself to review all the open PRs everywhere across my open source. No promises, it's a hectic three weeks coming for me.

willwray · 2022-12-15T15:59:42Z

Thanks Niall; that'd be a nice surprise - don't stress it though.

I'll aim to understand more of the Value code and learn from @assarbad's review comment.

This PR I'm only really wanting in order to replace MSVC preprocessor with nicer output (fewer bogus empty lines).

It's the has_include feature I'm more interested in... 🎄 🎁 🎅

willwray marked this pull request as ready for review November 24, 2022 12:50

Add tests/issue0079.py

ad71ecd

Commit generated lextab and parsetab

4207a31

Not sure if its right to commit generated tables in the PR so here they are in a separate commit to pick or drop.

assarbad mentioned this pull request Dec 1, 2022

Incomplete macro reolving #71

Open

assarbad suggested changes Dec 1, 2022

View reviewed changes

pcpp/lextab.py Show resolved Hide resolved

tests/issue0079.py Show resolved Hide resolved

tests/issue0079.py Show resolved Hide resolved

tests/issue0079.py Show resolved Hide resolved

pcpp/evaluator.py Show resolved Hide resolved

pcpp/parser.py Outdated Show resolved Hide resolved

willwray added 2 commits December 14, 2022 12:01

Update PP_NUMBER regex definition

fcafc37

The previous regex escaped the digit separator ' as \' - necessary only because the string itself was '' single-quoted. This commit switches to use double-quote "" for the string and removes the now unneeded escape.

Update test to use context manager for StringIO

8a5b9d0

Revert "Update test to use context manager for StringIO"

9c702df

This reverts commit 8a5b9d0.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add PP_NUMBER, remove CPP_INTEGER, CPP_FLOAT #80

Add PP_NUMBER, remove CPP_INTEGER, CPP_FLOAT #80

willwray commented Nov 24, 2022

willwray commented Nov 27, 2022

ned14 commented Nov 28, 2022

willwray commented Nov 28, 2022

willwray commented Nov 28, 2022

assarbad commented Nov 29, 2022

assarbad commented Dec 1, 2022

assarbad left a comment

assarbad commented Dec 1, 2022

willwray commented Dec 1, 2022

ned14 commented Dec 2, 2022

willwray commented Dec 14, 2022

willwray commented Dec 14, 2022

assarbad commented Dec 14, 2022

ned14 commented Dec 15, 2022

willwray commented Dec 15, 2022

willwray commented Dec 15, 2022

ned14 commented Dec 15, 2022

willwray commented Dec 15, 2022

Add PP_NUMBER, remove CPP_INTEGER, CPP_FLOAT #80

Are you sure you want to change the base?

Add PP_NUMBER, remove CPP_INTEGER, CPP_FLOAT #80

Conversation

willwray commented Nov 24, 2022

willwray commented Nov 27, 2022

ned14 commented Nov 28, 2022

willwray commented Nov 28, 2022

willwray commented Nov 28, 2022

assarbad commented Nov 29, 2022

assarbad commented Dec 1, 2022

assarbad left a comment

Choose a reason for hiding this comment

assarbad commented Dec 1, 2022

willwray commented Dec 1, 2022

ned14 commented Dec 2, 2022

willwray commented Dec 14, 2022

willwray commented Dec 14, 2022

assarbad commented Dec 14, 2022

ned14 commented Dec 15, 2022

willwray commented Dec 15, 2022

willwray commented Dec 15, 2022

ned14 commented Dec 15, 2022

willwray commented Dec 15, 2022