Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incomplete macro reolving #71

Open
Rot127 opened this issue Jul 7, 2022 · 7 comments
Open

Incomplete macro reolving #71

Rot127 opened this issue Jul 7, 2022 · 7 comments
Labels

Comments

@Rot127
Copy link

Rot127 commented Jul 7, 2022

In the following example not all macros get resolved:

#define fCAST4_8s(A) ((int64_t)((int32_t)(A)))
#define fBIDIR_SHIFTR(SRC, SHAMT, REGSTYPE)    (((SHAMT) < 0) ? ((fCAST##REGSTYPE(SRC) << ((-(SHAMT)) - 1)) << 1)                   : (fCAST##REGSTYPE(SRC) >> (SHAMT)))
#define fBIDIR_ASHIFTR(SRC, SHAMT, REGSTYPE)    fBIDIR_SHIFTR(SRC, SHAMT, REGSTYPE##s)
#define fSXTN(N, M, VAL) (((N) != 0) ? sextract64((VAL), 0, (N)) : 0LL)
#define fHIDE(A) A
#define fECHO(A) (A)
#define DEF_SHORTCODE(TAG,SHORTCODE)    insn(TAG, SHORTCODE)

DEF_SHORTCODE(S2_asr_r_r_acc, { fHIDE(size4s_t) shamt=fSXTN(7,32,RtV); RxV = fECHO(RxV + fBIDIR_ASHIFTR(RsV,shamt,4_8)); })

Running pcpp test.h returns:

insn(S2_asr_r_r_acc, { size4s_t shamt=(((7) != 0) ? sextract64((RtV), 0, (7)) : 0LL); RxV = (RxV + (((shamt) < 0) ? ((fCAST4_8s(RsV) << ((-(shamt)) - 1)) << 1)                   : (fCAST4_8s(RsV) >> (shamt)))); })

Note that the fCAST4_8s macro was not resolved.

It seems to be a problem with the ## operator.

If fCAST##REGSTYPE is replaced with fCAST4_8s in macro fBIDIR_SHIFTR it works fine.

pcpp version: 1.30

Rot127 added a commit to rizinorg/rz-rzilcompiler that referenced this issue Jul 7, 2022
@ned14 ned14 added the bug label Jul 26, 2022
@ned14
Copy link
Owner

ned14 commented Jul 26, 2022

Confirmed with this repro:

class test29(unittest.TestCase, runner):
    input = r"""
#define fCAST4_8s(A) ((int64_t)((int32_t)(A)))
#define fBIDIR_SHIFTR(SRC, SHAMT, REGSTYPE)    (((SHAMT) < 0) ? ((fCAST##REGSTYPE(SRC) << ((-(SHAMT)) - 1)) << 1)                   : (fCAST##REGSTYPE(SRC) >> (SHAMT)))
#define fBIDIR_ASHIFTR(SRC, SHAMT, REGSTYPE)    fBIDIR_SHIFTR(SRC, SHAMT, REGSTYPE##s)
#define fSXTN(N, M, VAL) (((N) != 0) ? sextract64((VAL), 0, (N)) : 0LL)
#define fHIDE(A) A
#define fECHO(A) (A)
#define DEF_SHORTCODE(TAG,SHORTCODE)    insn(TAG, SHORTCODE)

DEF_SHORTCODE(S2_asr_r_r_acc, { fHIDE(size4s_t) shamt=fSXTN(7,32,RtV); RxV = fECHO(RxV + fBIDIR_ASHIFTR(RsV,shamt,4_8)); })
"""
# Getting instead:
# insn(S2_asr_r_r_acc, { size4s_t shamt=(((7) != 0) ? sextract64((RtV), 0, (7)) : 0LL); RxV = (RxV + (((shamt) < 0) ? ((fCAST4_8s(RsV) << ((-(shamt)) - 1)) << 1)                   : (fCAST4_8s(RsV) >> (shamt)))); })
# fCAST4_8s isn't being macro expanded
    output = r"""#line 10
insn(S2_asr_r_r_acc, { size4s_t shamt=(((7) != 0) ? sextract64((RtV), 0, (7)) : 0LL); RxV = (RxV + (((shamt) < 0) ? ((((int64_t)((int32_t)(RsV))) << ((-(shamt)) - 1)) << 1) : (((int64_t)((int32_t)(RsV))) >> (shamt)))); })
"""

Thanks for the bug report!

@assarbad
Copy link
Contributor

assarbad commented Nov 15, 2022

@ned14 I distilled the repro further and tried it in pdb. By now I am convinced that the issue is the two levels of pasting tokens together and the way the prescan works. However, I need to take a closer look and understand the code better to propose a fix.

Distilled version:

#define fCAST4_8s(A) A
#define SECOND_LVL_CONCAT(SRC, REGSTYPE) fCAST##REGSTYPE(SRC)
#define FIRST_LVL_CONCAT(SRC, REGSTYPE)  SECOND_LVL_CONCAT(SRC, REGSTYPE##s)

FIRST_LVL_CONCAT(src,4_8)

I renamed the function-style macros for brevity.

When I run the C/C++ preprocessor on my system against it I get (the grep invocations exist to filter out empty lines and those starting with # followed by a number):

$ cpp smaller.h |grep -vP '^\s*$'|grep -vP '^# \d+'
src

and pcpp gives:

$ pcpp smaller.h |grep -vP '^\s*$'|grep -vP '^#line'
fCAST4_8s(src)

So we arrive at the name fCAST4_8s only after two rounds of pasting tokens. With REGSTYPE == 4_8

  1. REGSTYPE##s -> 4_8s inside FIRST_LVL_CONCAT
  2. fCAST##REGSTYPE -> fCAST4_8s inside SECOND_LVL_CONCAT

... and only then do we see that fCAST4_8s needs expansion, too.

@assarbad
Copy link
Contributor

assarbad commented Nov 15, 2022

Alright, in pdb I can see that when we are in expand_macros() we have the following tokens:

[LexToken(CPP_WS,'\n',4,162), LexToken(CPP_ID,'FIRST_LVL_CONCAT',5,163), LexToken(CPP_LPAREN,'(',5,179), LexToken(CPP_ID,'src',5,180), LexToken(CPP_COMMA,',',5,183), LexToken(CPP_INTEGER,'4',5,184), LexToken(CPP_ID,'_8',5,185), LexToken(CPP_RPAREN,')',5,187), LexToken(CPP_WS,'\n',5,187)]

I think in particular the assessment of the lexer that 4_8 "dissolves" into an integer ('4') and an identifier ('_8') may be part of the problem: [..., LexToken(CPP_INTEGER,'4',5,184), LexToken(CPP_ID,'_8',5,185), ...] ...

Currently I think that this may cause the ##-logic to fall apart, because instead of a single token preceded or succeeded by ## now becomes two tokens in a row, preceded or succeeded by ##.

And a short test confirms that this suspicion may indeed be true. Replacing 4_8 by r_8 thereby making it a single (identifier type) token, we can even convince pcpp to behave nicely.


Temporary repro (to force the lexer to give us a single token):

#define fCASTr_8s(A) A
#define SECOND_LVL_CONCAT(SRC, REGSTYPE) fCAST##REGSTYPE(SRC)
#define FIRST_LVL_CONCAT(SRC, REGSTYPE)  SECOND_LVL_CONCAT(SRC, REGSTYPE##s)

FIRST_LVL_CONCAT(src,r_8)

Output for cpp remains correct (and the same), but for pcpp we now get:

$ pcpp --debug smaller.h |grep -vP '^\s*$'|grep -vP '^#line'
src

Unfortunately the pcpp_debug.log doesn't contain the output of how the last line is seen.

Just a thought: perhaps a notion like "debug levels" (--debug=n, defaulting to 1 when given) could help to turn this really verbose, which would be impractical for real-world preprocessing but invaluable for tracking down defects?!

Anyway, pdb to the rescue. This looks a lot better already ([..., LexToken(CPP_ID,'r_8',5,184), ...]):

(Pdb) p tokens
[LexToken(CPP_WS,'\n',4,162), LexToken(CPP_ID,'FIRST_LVL_CONCAT',5,163), LexToken(CPP_LPAREN,'(',5,179), LexToken(CPP_ID,'src',5,180), LexToken(CPP_COMMA,',',5,183), LexToken(CPP_ID,'r_8',5,184), LexToken(CPP_RPAREN,')',5,187), LexToken(CPP_WS,'\n',5,187)]

So the lexer parser.py is what needs fixing.


For reference this is the conditional breakpoint at preprocess.py:558 I set:

(Pdb) b
Num Type         Disp Enb   Where
1   breakpoint   keep yes   at /home/oliver/PCPP/pcpp/pcpp/preprocessor.py:558
        stop only if len(tokens) > 0
        breakpoint already hit 4 times

@assarbad
Copy link
Contributor

Side note: Looking at def CPP_INTEGER(t) in parser.py I think this regex:

(((((0x)|(0X))[0-9a-fA-F]+)|(\d+))([uU][lL]|[lL][uU]|[uU]|[lL])?)

can be vastly simplified and at the same time expanded to support 0b prefixed binary integer literals and grouping with ':

(?i)((?:0b[01]+|0x[0-9A-F]+|[\d']+)(?:UL{0,2}|L{0,2}U)?)

We only appear to use a single capture group, so make the two remaining groups non-capturing. Also set mode to case-insensitive (works both on Python 2.7 and 3.x).

What do you think, @ned14?

@ned14
Copy link
Owner

ned14 commented Nov 16, 2022

Seems plausible.

The hard part is coming up with a fix which doesn't break other things and is maintainable.

@assarbad
Copy link
Contributor

The hard part is coming up with a fix which doesn't break other things and is maintainable.

Indeed, which is the reason why I didn't rush to fixing it. The code I saw during debugging has to sink in first and I need to get more acquainted with the code in general. Also I think it helps to aim for highest possible test coverage before attempting to tackle this (which is why I did this tox thingamy).

@assarbad
Copy link
Contributor

assarbad commented Dec 1, 2022

FYI: The latest repro from this comment gets properly resolved with the changes from PR #80.

assarbad added a commit to assarbad/pcpp that referenced this issue Dec 1, 2022
- Renames PP_NUMBER from the PR to CPP_NUMBER
- Test case for ned14#71 which will only work after this PR gets
  merged
- Adds a stub to run pcpp without first installing it
- Amends .gitignore to ignore items created by the tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants