Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PP_NUMBER, remove CPP_INTEGER, CPP_FLOAT #80

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

willwray
Copy link

A simple proof of concept change that fixes #79.
With it, pcpp can do codegen using the IREPEAT library.

I believe it's conceptually correct, but my Python may not be;
please test this against your suite and review the method
(hack) carefully. There's not much code! Mostly deletions.

The change removes CPP_INTEGER, effectively replacing it with
PP_NUMBER, and entirely removes CPP_FLOAT as superfluous for
preprocessing purposes.

pp-number is sufficient for preprocessing to stage 4

The pp-number regex in the issue is incorrect, lifted from
unpublished WG21 https://isocpp.org/files/papers/D2180R0.html
"pp-number makes cpp dumber" (best proposal title ever).

Instead, I crafted a regex based on the lastest C++ draft
https://eel.is/c++draft/lex.ppnumber#ntref:pp-number
which accepts character ' as digit separator:

regex string r'.?\d(.|[\w_]|'[\w_]|[eEpP][-+])*'

(also admits binary literals, with digit separator, of course,
so they can now be added to the Value parsing code)

Only the conditional evaluator is required to interpret the
numbers as integer constant expressions.

This is achieved by hacky means:

def p_expression_number(p):
    'expression : PP_NUMBER'
    try:
        p[0] = Value(p[1])
    except:
        p[0] = p[1]

The idea is that if the parsed string p[1] can be interpreted as
an integer constant-expression Value(p[1]) then do so, otherwise
simply pass through the string for possible further pasting and
processing.

A robust method might check p[1] against the CPP_INTEGER regex
(removed in this commit) for a full match, consuming all input.
On the other hand, relying on Value to validate the input while
parsing and to raise an exception on failure may be Pythonic.

It seems that pp-number itself is a hack in the standard; I see
no way to incorporate pp-number alongside INTEGER and FLOAT tokens
meaningful in C; but then there's no need to. Happy Thanksgiving!

A simple proof of concept change that fixes ned14#79.
With it, pcpp can do codegen using the IREPEAT library.

I believe it's conceptually correct, but my Python may not be;
please test this against your suite and review the method
(hack) carefully. There's not much code! Mostly deletions.

The change removes CPP_INTEGER, effectively replacing it with
PP_NUMBER, and entirely removes CPP_FLOAT as superfluous for
preprocessing purposes.

pp-number is sufficient for preprocessing to stage 4

The pp-number regex in the issue is incorrect, lifted from
unpublished WG21 https://isocpp.org/files/papers/D2180R0.html
"pp-number makes cpp dumber" (best proposal title ever).

Instead, I crafted a regex based on the lastest C++ draft
https://eel.is/c++draft/lex.ppnumber#ntref:pp-number
which accepts character ' as digit separator:

  regex string   r'\.?\d(\.|[\w_]|\'[\w_]|[eEpP][-+])*'

(also admits binary literals, with digit separator, of course,
 so they can now be added to the Value parsing code)

Only the conditional evaluator is required to interpret the
numbers as integer constant expressions.

This is achieved by hacky means:

    def p_expression_number(p):
        'expression : PP_NUMBER'
        try:
            p[0] = Value(p[1])
        except:
            p[0] = p[1]

The idea is that if the parsed string p[1] can be interpreted as
an integer constant-expression Value(p[1]) then do so, otherwise
simply pass through the string for possible further pasting and
processing.

A robust method might check p[1] against the CPP_INTEGER regex
(removed in this commit) for a full match, consuming all input.
On the other hand, relying on Value to validate the input while
parsing and to raise an exception on failure may be Pythonic.

It seems that pp-number itself is a hack in the standard; I see
no way to incorporate pp-number alongside INTEGER and FLOAT tokens
meaningful in C; but then there's no need to. Happy Thanksgiving!
@willwray willwray marked this pull request as ready for review November 24, 2022 12:50
@willwray
Copy link
Author

Added tests/issue0079.py
FAILs pre-patch as expected

FAIL: runTest (tests.issue0079.pp_number_pasting)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/will/repos/pcpp/tests/issue0079.py", line 21, in runTest
    self.assertEqual(p.return_code, 0)
AssertionError: 1 != 0

All python setup.py test tests pass post-patch.

Are there other tests to be run?

Should I commit the new generated lextab.py and parsetab.py?

@ned14
Copy link
Owner

ned14 commented Nov 28, 2022

Should I commit the new generated lextab.py and parsetab.py?

If you wish.

It'll be next Spring at least before I can look at this properly.

Not sure if its right to commit generated tables in the PR
so here they are in a separate commit to pick or drop.
@willwray
Copy link
Author

ok, here are the generated tables, to pick or drop if and when merged.

Maybe @assarbad can review as it's closely related to his #71 (comment)
also suggests tweaking the CPP_INTEGER token that this PR eliminates.

I may be missing something obvious...
I see no need for CPP_INTEGER or CPP_FLOAT in a pure preprocessor.
Am I missing some use-case that requires phase-7 processing, perhaps.
The test-c stuff, and all the c- tests there...

If you authorise the PR action then it should confirm that the tests pass
or flag up any issue I may have missed.

@willwray
Copy link
Author

Thanks for coding, publishing and maintaining this project.
I'd come across it before, forgotten, then found again.

For my use case, I'm actually more interested in a fix for issue #53 ...
How easy do you gauge that to be?
Maybe I learn some Python file system handling

@assarbad
Copy link
Contributor

Thanks for coding, publishing and maintaining this project. I'd come across it before, forgotten, then found again.

For my use case, I'm actually more interested in a fix for issue #53 ... How easy do you gauge that to be? Maybe I learn some Python file system handling

I am also looking at issue #53 myself. And I'll review your patch, but not tonight.

@assarbad
Copy link
Contributor

assarbad commented Dec 1, 2022

FYI: I looked at your willwray/pcpp:pp_number branch, @willwray, and tested it against the repro from #71. And lo and behold the issue is gone.

I'll now take a closer look at the actual changes over what's on master here.

Copy link
Contributor

@assarbad assarbad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think PP_NUMBER should be named CPP_NUMBER throughout, to stick with the established CPP_ prefix.

Otherwise this PR looks fine.

There is one minor issue that we may want to postpone for a separate ticket or not, mostly depends on @ned14. That regular expression doesn't quite capture what the pp-number is defined as. I'll comment on this separately, though. It's probably going to be bigger than what should go into such a final comment here 😉

pcpp/lextab.py Show resolved Hide resolved
tests/issue0079.py Show resolved Hide resolved
tests/issue0079.py Show resolved Hide resolved
tests/issue0079.py Show resolved Hide resolved
pcpp/evaluator.py Show resolved Hide resolved
pcpp/parser.py Outdated Show resolved Hide resolved
@assarbad
Copy link
Contributor

assarbad commented Dec 1, 2022

So we have the following bits (correct me if I got it wrong):

  • pp-number:
    • digit
    • . digit [ == \.\d ]
    • pp-number identifier-continue
    • pp-number ' digit [ += '\d ]
    • pp-number ' nondigit [ += [a-zA-Z0-9_] ]
    • pp-number e sign
    • pp-number E sign
    • pp-number p sign
    • pp-number P sign
    • pp-number .

... where:

  • digit corresponds to \d
  • identifier-continue corresponds to [\w_], which would include non-ASCII letter code points (at least the description seems to include other Unicode letter code points, as per XID_Continue and this)
  • nondigit corresponds to pp-number + [a-zA-Z0-9_]
    NB: interestingly as per the description this does not include other letters than a-z in lower and upper case

The rest is relatively self-explanatory.

The description you referenced already gives us a lot of leeway. And it may be appropriate to keep it that greedy and that broad. However, with the above remarks in mind I'd like to propose changing it to:

\.?\d(?:\.|[\w_]|'[\w_]|[eEpP][-+])*

This:

  • makes it non-capturing
  • removes the escaping of ' (was unnecessary)

We should probably also define what we expect to be matched here. Admittedly my proposed regex from #71 may be too strict.

My change does not address:

  • the difference between nondigit and identifier-continue
  • it doesn't make it "crispier", i.e. more to the point, such as really matching '(?:\d|[a-zA-Z0-9_]) which in the above is merely '[\w_]

What say you, @ned14 and @willwray?

assarbad added a commit to assarbad/pcpp that referenced this pull request Dec 1, 2022
- Renames PP_NUMBER from the PR to CPP_NUMBER
- Test case for ned14#71 which will only work after this PR gets
  merged
- Adds a stub to run pcpp without first installing it
- Amends .gitignore to ignore items created by the tests
@willwray
Copy link
Author

willwray commented Dec 1, 2022

Thanks @assarbad for your thorough detailed review.
I'll aim to resolve all comments at the weekend.

@ned14
Copy link
Owner

ned14 commented Dec 2, 2022

For me, if GCC and Wave's preprocessor use the same technique to solve a conformance problem, then it seems appropriate that pcpp can do the same.

Whether the technique is implemented correctly is another matter, but I'm sure we'll get there.

The previous regex escaped the digit separator ' as \' -
necessary only because the string itself was '' single-quoted.

This commit switches to use double-quote "" for the string
and removes the now unneeded escape.
@willwray
Copy link
Author

  • is the pcpp/evaluator.py change to p_expression_number kosher or a horrid hack?

The hack is highlighted in the top PR message and it's right up top of the listed Files changed tab view.
Please take a quick look. It's a few lines diff but doing it differently might take much more careful coding.

It effectively relies on Python exceptions for validating the Value(p[1]) evaluation.

p_expression_number originally used CPP_INTEGER as its input match then delegated direct to Value():

def p_expression_number(p):
    'expression : CPP_INTEGER'
    p[0] = Value(p[1])

p_expression_number now uses CPP_NUMBER as input match instead:

def p_expression_number(p):
    'expression : PP_NUMBER'
    try:
        p[0] = Value(p[1])
    except:
        p[0] = p[1]

Is it right and Pythonic to rely on exceptions to do the validation like this?

By the fact that the tests all pass, it appears that Value() is quite robust in parsing from CPP_NUMBER instead, and raising an exception if there's a failure to parse out the complete token character string. Perhaps there are more stress-test cases to run against this change.

I don't see a clean alternative way to do this that meshes properly with the grammar and parser., and still allows for pasting to form numeric literals from smaller pp-token parts.

@willwray
Copy link
Author

  • the unicode question

The CPP_NUMBER regex definition invokes the specification of the pp-number regular expression that requires XID_Continue.

This raises further questions:

  • to what extent is pcpp already unicode-enabled?
  • what other token regex or code might call for unicodification?

Out of my depth, I asked a couple of questions on WG21 sg16 unicode study group mailing list.

I guess the regex should be extended to parse XID_Continue.
Does anyone here have a clue how to do that?

A regex engine that supports UTS#18 (Unicode Regular Expressions) rule RL2.7 would allow matching a character with the XID_Continue property with the syntax \p{XID_Continue}. Unicode Utilities example. Note that the example matches the identifier (but not the mathematical character nor the clown faces).
I don't think the Python regex engine supports that though, so you'll have to find another regex engine or another method to perform the match.
Clang is also going to allow the mathematical profile specified in L2/22-230 in identifiers. C++ may eventually allow them as well.

Any other pp-token that should be considered for unicodification?

Some other tokens incorporate an identifier user-defined-literal does.

The Python standard library re regular expression module notes:
https://docs.python.org/3/library/re.html

See also The third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and a more thorough Unicode support.

@assarbad
Copy link
Contributor

def p_expression_number(p):
    'expression : PP_NUMBER'
    try:
        p[0] = Value(p[1])
    except:
        p[0] = p[1]

Is it right and Pythonic to rely on exceptions to do the validation like this?

The only un-Pythonic thing I see here is that the exception isn't caught by listing concrete exceptions or at least a base exception type. This is considered sort of an anti-pattern. As an example, this particular code should probably never raise a KeyError, but if it does it probably signals an actual logic error rather than something you may want to catch. I think in this case probably (but not 100% sure) you are looking for TypeError. If you have an example with this ambiguity a test could be used to validate your assumptions.

See https://docs.python.org/3/library/exceptions.html#base-classes

Otherwise I don't see much of an issue trying to treat something one way and, failing that, falling back to the alternative. AFAIK nothing un-Pythonic about that.

This raises further questions:

* to what extent is `pcpp` already unicode-enabled?
* what other token regex or code might call for unicodification?

As long as you stick to character classes such as \w/\W or \s/\S and so on, you're fairly safe with Python. It already considers Unicode to some extent. But probably not specifically to the extent of supporting Unicode character category XID_Continue. At least from what I found \p{categoryname} isn't supported by re. The case of t_CPP_ID for example seems to be approximated well enough.

See also The third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and a more thorough Unicode support.

I don't know. As an option this would probably be neat, but outright depending on it would not be too nice, given currently the set of external dependencies is quite limited. It's true that this may eventually become a concern, but as we have learned compilers already had to backpedal on allowed Unicode characters to some extent (changing direction and such), because otherwise editors would outright hide code. And as you noted yourself the debate is far from finished, so I think that bridge ought to be crossed when we get there, not necessarily right now.

Also keep in mind that other than for measuring header heft, pcpp probably is used for cases such as amalgamating libraries into single-header. This means latest if the compiler see the results of what pcpp spat out, you'll get the feedback whether something was wrong.

@ned14
Copy link
Owner

ned14 commented Dec 15, 2022

I believe pcpp on Python 3 makes a reasonable attempt at Unicode. C++ has been evolving there though, as has C, and there are differences which are being reconciled. TBH I think a reasonable attempt is reasonable :)

@willwray
Copy link
Author

I spent too long yesterday following up on Tom H's unicode links - my conclusion was that the security and stability concerns for languages, around unicode identifiers and exploits, are very unlikely to be concerns for pcpp as a pure preprocessor - it seems ok for pcpp to be lax in what it accepts as identifiers.

(p.s. It looks like that CI failure is on Python 2 due to the added StringIO context manager - I can revert that)

@willwray
Copy link
Author

In particular, I like the idea that pcpp accept the mathematical profile identifiers that gcc currently accepts (rejects in -pedantic), clang 14 accepted and now rejects in 15 (but is adding an extension flag to support it) and that look likely to be incorporated in C++ once there's consensus.

@ned14
Copy link
Owner

ned14 commented Dec 15, 2022

I go on annual vacation next week, I'm hoping some time will present itself to review all the open PRs everywhere across my open source. No promises, it's a hectic three weeks coming for me.

@willwray
Copy link
Author

Thanks Niall; that'd be a nice surprise - don't stress it though.

I'll aim to understand more of the Value code and learn from @assarbad's review comment.

This PR I'm only really wanting in order to replace MSVC preprocessor with nicer output (fewer bogus empty lines).

It's the has_include feature I'm more interested in... 🎄 🎁 🎅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SyntaxError on use in expression of symbol with leading decimal digits
3 participants