Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft for (complex) hash labeling scheme #151

Closed
kito-cheng opened this issue Aug 30, 2023 · 29 comments
Closed

Draft for (complex) hash labeling scheme #151

kito-cheng opened this issue Aug 30, 2023 · 29 comments

Comments

@kito-cheng
Copy link
Member

Dump some content from my offlist discussion with @ved-rivos :P

Introduce one more function in asm syntax: %lpad_label("<signature-string>").
e.g. lpad %lpad_label("v(i,i)"), lui t2, %lpad_label("v(i,i)") for void foo(int, int)

The signature string sytax:

SIGNATURE-STRING := RETURN-TYPE '(' ARGS ')'
                  | RETURN-TYPE '(' ')'
RETURN-TYPE := TYPE-NAME

ARGS := ARGS ',' ARG
      | ARG

ARG  := TYPE-NAME

TYPE-NAME is using the mangling name in itanium c++ abi[1] and RISC-V psABI[2].
[1] https://itanium-cxx-abi.github.io/cxx-abi/abi.html#mangling-builtin
[2] https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-elf.adoc#c-name-mangling

The value is the hash result of the signature string, we may start from md5.

void foo(int, int) = %lpad_label("v(i,i)") = low20(md5("v(i,i)")) = low20(3a7a68c073dcdbd6ee282a598c2dc44e) = 0xdc44e

Why md5: no strong reason for using md5, but one reason is it's available on binutils and llvm already.
Why using mangling name defined in itanium c++ abi: Prevent us to define complicate rule for mangling name.

More example:
void bar(): v()
void *memset(void *, int, size_t): Pv(Pv,i,m)

@yetingk
Copy link

yetingk commented Aug 30, 2023

I have two questions about using function signatures as landing label on function prologue.

  1. Landing label for indirect branches within a function. A simple proposal is also to use the function signature as the landing pad. The benefit is it allows indirect branch to its function entry. But it also increase the risk of indirect calls which are not landed to a function entry.
  2. We may need to have a landing label for the next pc of dual return functions like setjmp.

@kito-cheng
Copy link
Member Author

Landing label for indirect branches within a function. A simple proposal is also to use the function signature as the landing pad. The benefit is it allows indirect branch to its function entry. But it also increase the risk of indirect calls which are not landed to a function entry.

A branch within a function is not expose to other module, so IMO that could be implementation defined.

We may need to have a landing label for the next pc of dual return functions like setjmp.

Generally those function are using ret to return function, so they don't need landing pad, however I found x86 has some special function attribute called indirect_return, we might take a look for that to see if we need to define a same one as well.

@samitolvanen
Copy link

Why md5: no strong reason for using md5, but one reason is it's available on binutils and llvm already.

I would prefer not to add more uses of MD5, even if in this application there aren't any actual security concerns. Since xxHash is also readily available, why not use that instead?

Why using mangling name defined in itanium c++ abi: Prevent us to define complicate rule for mangling name.

This name mangling scheme matches LLVM's existing forward-edge CFI implementations and I agree that it's a reasonable choice.

A simple proposal is also to use the function signature as the landing pad.

If you mean using the same label for the function entry and all the indirect branch targets in the function, I don't think that's acceptable. We should ensure that indirect calls are allowed only to function entry points.

A branch within a function is not expose to other module, so IMO that could be implementation defined.

I would suggest at least providing a recommendation for the indirect branch labeling scheme as well, so we won't end up with implementations that accidentally make CFI weaker.

@kito-cheng
Copy link
Member Author

I would prefer not to add more uses of MD5, even if in this application there aren't any actual security concerns. Since xxHash is also readily available, why not use that instead?

Honestly I am not security guy, so that's my first time to heard xxHash, and I am open mind for other algorithm :)

A simple proposal is also to use the function signature as the landing pad.

If you mean using the same label for the function entry and all the indirect branch targets in the function, I don't think that's acceptable. We should ensure that indirect calls are allowed only to function entry points.

A branch within a function is not expose to other module, so IMO that could be implementation defined.

I would suggest at least providing a recommendation for the indirect branch labeling scheme as well, so we won't end up with implementations that accidentally make CFI weaker.

Yeah, sounds good idea.

@ved-rivos
Copy link
Collaborator

xxHash is not a cryptographic hash. xxHash is faster to compute. MD5 is also used by LLVM for CFI. For this purpose I would consider MD5 and xxHash to be equivalent.

For a branch within a function we could pick the convention of using the :<source_line_number>. That would restrict its scope.

@samitolvanen
Copy link

xxHash is not a cryptographic hash. xxHash is faster to compute. MD5 is also used by LLVM for CFI. For this purpose I would consider MD5 and xxHash to be equivalent.

The original CFI implementation in Clang uses MD5, but the newer ones (-fsanitize=kcfi and -fsanitize=function) have adopted xxHash:

https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CodeGenFunction.cpp#L572
https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CodeGenModule.cpp#L1995

For a branch within a function we could pick the convention of using the :<source_line_number>. That would restrict its scope.

Agreed, that's better. Ideally each branch target should have a unique label, so I would recommend including the file path in addition to the line number, for example.

@kito-cheng
Copy link
Member Author

I suspect encoding line number may not work for all situation, one example is GNU C extension support "Labels as Values", but I do agree that should have different label than function, so I think we may add some note about that "NOTE: Targets of local indirect jumps that are not used for jump tables should have a landing pad. Additionally, the labeling value for these targets is recommended to be different from the function label value."

    void *ptr;

    if(...)
        ptr = &&foo;
    else
        ptr = &&bar;

    /* ... */
    goto *ptr;

foo:
    /* ... */

bar:
    /* ... */

[1] https://stackoverflow.com/questions/56316820/what-is-an-indirect-goto-statement
[2] https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html

@samitolvanen
Copy link

I suspect encoding line number may not work for all situation, one example is GNU C extension support "Labels as Values"

Good point. Basically, all address-taken labels in a function must have the same landing pad label. We don't have to tie the label to the line number if that's confusing. It's enough that the label is unique within the program. E.g., HASH(path + counter).

NOTE: Targets of local indirect jumps that are not used for jump tables should have a landing pad.

Unless the compiler uses a software-guarded branch, the targets must have a landing pad. Do we want to include a recommendation for when it's safe to use a software-guarded branch?

Additionally, the labeling value for these targets is recommended to be different from the function label value.

This doesn't sound strong enough, because setting all indirect branch target labels to the same value would be enough in this case. How about recommending that each indirect branch target must have a landing pad label that's unique within the program, with the exception that all address-taken labels in the same function must have the same landing pad label?

@Yangff
Copy link

Yangff commented Oct 4, 2023

all address-taken labels in the same function must have the same landing pad label

what about the catch block which has a landing pad used by the runtime?

@samitolvanen
Copy link

what about the catch block which has a landing pad used by the runtime?

Please correct me if I'm wrong, but shouldn't C++ exceptions behave similarly to a function return, which means the exception landing pad won't need an lpad instruction?

@Yangff
Copy link

Yangff commented Oct 4, 2023

what about the catch block which has a landing pad used by the runtime?

Please correct me if I'm wrong, but shouldn't C++ exceptions behave similarly to a function return, which means the exception landing pad won't need an lpad instruction?

My impression is they use forward indirect jump to restore context instead of ret. This might be wrong though..

this vary by the implementation.. and for GCC and llvm it's just a ret.

@samitolvanen
Copy link

My impression is they use forward indirect jump to restore context instead of ret.

It depends on the architecture, but the RISC-V implementation in LLVM's libunwind seems to use ret at least:

https://github.com/llvm/llvm-project/blob/main/libunwind/src/UnwindRegistersRestore.S#L1159

Are you aware of other implementations that behave differently?

@Yangff
Copy link

Yangff commented Oct 5, 2023

My impression is they use forward indirect jump to restore context instead of ret.

It depends on the architecture, but the RISC-V implementation in LLVM's libunwind seems to use ret at least:

https://github.com/llvm/llvm-project/blob/main/libunwind/src/UnwindRegistersRestore.S#L1159

Are you aware of other implementations that behave differently?

You are right, it depends on the implementation. nongnu libunwind seems to use jr with non ra reg and GCC uses __builtin_eh_return which should be a ret.

With that said, it should be possible to assign a landingpad for it? Because that ret cannot be enforced by sspop since the address is not in stack anyway..

@samitolvanen
Copy link

Thanks for looking into this. It sounds like we should either require compatible implementations to use a return, or prepare a recommendation for exception handler landing pad labeling as well since there are implementations that use indirect jumps with a non-ra register.

Because that ret cannot be enforced by sspop since the address is not in stack anyway..

That's no different from setjmp etc. though. How much of a concern is this?

@Yangff
Copy link

Yangff commented Oct 6, 2023

setjmp

I think on x86 CET compiler emits an endbr64 after setjmp call so that the longjmp actually uses jmp *%rcx to go back to the context, instead of return.
Security wise it prevents the exploitable arbitrary jump gadget in the program.. but for compatibility this may cause problem..

@mylai-mtk
Copy link

mylai-mtk commented Jan 31, 2024

I have a question about the %lpad_label("<signature-string>") syntax: This works obviously for assembling, but what about disassembling? Do we display the labels as numbers when disassembling, or we have to craft a (standard?) method to map the numbers back to signature strings?

@ved-rivos
Copy link
Collaborator

The disassembly should only show the number. Since the label is generated by hashing the string and then truncating the hash to 20 bits it will not be possible to revere the process and regenerate the string from the truncated hash.

@kito-cheng
Copy link
Member Author

kito-cheng commented Feb 1, 2024

I have a question about the %lpad_label("") syntax: This works obviously for assembling, but what about disassembling? Do we display the labels as numbers when disassembling, or we have to craft a (standard?) method to map the numbers back to signature strings?

It's technical possible* to keep the signature string but I am not sure it's necessary or not?

* e.g. keep that via mapping symbol or customized section to record that.

@ved-rivos
Copy link
Collaborator

I am not sure its necessary. If we do add it we may want it to be strippable.

@mylai-mtk
Copy link

mylai-mtk commented Feb 2, 2024

I think if we're considering the necessity, this disassembling display rule can be considered "unnecessary", since it does not affect runtime behavior.

If we leave this part unspecified in the spec, I believe the mainstream toolchains would choose their favorable formats when the needs arise, and other late comers may just follow suit. This may lead to some interoperabilty issue in disassemblers, but it should not be a big problem, because they can/should always fallback to number when the string is unavailable.

@mylai-mtk
Copy link

mylai-mtk commented Feb 20, 2024

Question: It looks like that TYPE-NAME is a terminal in the grammar. What if an ARG contains a function prototype, e.g. a function pointer parameter? Do we encode the function prototype recursively, or we just throw it to the Itanium mangler?

@gcchri
Copy link
Collaborator

gcchri commented Feb 20, 2024

From my understanding it seems that recursive encoding will lead to finer label distribution.

@sorear
Copy link

sorear commented Feb 27, 2024

We should mangle the entire possibly qualified function type, not just the arguments and returns. U9vector_ccFvvE is a different function type from FvvE and confusing them should be caught. I'd rather not make up our own list syntax, as simple as it is, when Itanium provides F types we can use.

Providing the unhashed type in strippable debug information for disassembly sounds like a useful feature. I don't know enough about debug information to say whether that should be a symbol or something in DWARF.

ret does not check for a landing pad because Zicfilp is intended to be used in conjunction with Zicfiss. After sspopchk, ret is a software guarded branch and is treated as such. The CFI attacker model can corrupt the stack information which controls DWARF unwinding as well as the contents of jmp_buf structs, so some form of shadow stack or landing pad protection, but not both, is needed for longjmp and C++ exceptions.

The landing pad approach defines a label (singular, since there is no meaningful type information) to use for longjmp as well as a label (singular, since the unwinder can be corrupted we cannot trust any label it outputs) to use for __builtin_eh_return.

The shadow stack approach requires adding additional return addresses to the shadow stack, and checking in longjmp or __builtin_eh_return if the shadow stack pointer in the return context is inside the current shadow stack and points at the address returned to. For setjmp, there is an additional complication since setjmp calls are not necessarily well nested, see below. A similar approach with a different token could be used for __builtin_eh_return but this may be controversial since it violates the "zero cost on the non-exceptional path" property DWARF advertises (in practice, exceptions are far from zero cost due to lost optimization opportunities, and this approach provides far better protection than a single type of landing pad).

void foo() {
   ...
   r = setjmp(&jb);
   ...
   r = setjmp(&jb);
   ...
}

foo:
   sspush ra
   la t1, 1f
   sspush t1
   la t1, 2f
   sspush t1
   li t1, SETJMP_TOKEN(2)
   sspush t1
   ... create a stack frame ...
   ...
   call setjmp
   ...
   call setjmp
   ...
   ... stack frame teardown ...
   ssrdp t1
   addi t1, t1, 3*(XLEN/8)
   csrw ssp, t1 # Possible ABI issue, see #206
   sspopchk ra
   ret

longjmp pseudocode:
   if (!IS_SETJMP_TOKEN(*ssp)) crash();
   for (int i = 0; i < SETJMP_TOKEN_COUNT(*ssp), i++)
      if (*(ssp+i) == ra) goto good;
   crash();
  good:

@mylai-mtk
Copy link

Also, we may have to consider:

  • K&R-style functions, where the function signature is not fixed, and may not even be known at the call site. It can happen that a call to a non-K&R-style function is made through a K&R-style function pointer.
  • Variadic functions
  • Class methods where the pointer to class object itself is passed as the first argument --> Do we encode the type of this pointer as the first argument?
  • Virtual methods --> How do we encode the type of this pointer, both at the caller and callee site? Do we use the base class in which the virtual function is first declared?

@sorear
Copy link

sorear commented Mar 11, 2024

K&R-style functions

Signature is inferred at the call site on the basis of the promoted actual arguments. This is a general phenomenon; LLVM doesn't even have a concept of calling a function without a signature.

Variadic functions

Represented in the type system and handled fine with a mangled function type, although not with the original proposal.

Class methods where the pointer to class object itself is passed as the first argument

Virtual methods

These can only be indirectly called through a pointer to member function. Use the mangling rules for those.

@mylai-mtk
Copy link

mylai-mtk commented Mar 11, 2024

Signature is inferred at the call site on the basis of the promoted actual arguments.

I agree we can always obtain a function signature when calling a function from the the call site's view. However, from the view of the callee, it assumes the types of the passed arguments at compile time. By calling through K&R-style function pointers, we can pass arguments not matching the callee's assumption, so even if we have a "signature" at the call site, it would not match what's expected by the callee, thus the calculated landing pad labels would not match.

--> I later found out this is mostly like an undefined behavior in the C standard, so I guess we may be able to rule this case out, but an interesting exception in the (C11 draft) standard caught my attention:

If the function is defined with a type that does not include a prototype, and the types of
the arguments after promotion are not compatible with those of the parameters after
promotion, the behavior is undefined, except for the following cases:
— one promoted type is a signed integer type, the other promoted type is the
corresponding unsigned integer type, and the value is representable in both types;

Does this mean that if we distinguish signed and unsigned integer types in label calculation, some weird cases can be made?

kito-cheng added a commit to riscv-non-isa/riscv-asm-manual that referenced this issue Apr 18, 2024
Zicfilp has provided two labeling schemes: simple and complex (also known as
function signature-based). The simple scheme uses an lpad with a constant 0,
which does not require any hashing mechanism. In contrast, the complex
labeling scheme computes the MD5 hash from the signature string.

Filling up an MD5 hash value is straightforward for compilers, but it is
non-trivial work for humans to maintain. Therefore, we have added new assembler
modifiers to compute this value.

See also: riscv/riscv-cfi#151
@kito-cheng
Copy link
Member Author

@sorear

We should mangle the entire possibly qualified function type, not just the arguments and returns. U9vector_ccFvvE is a different function type from FvvE and confusing them should be caught. I'd rather not make up our own list syntax, as simple as it is, when Itanium provides F types we can use.

Spend time on reading C++ Itanium ABI again, and I tend to agree with you that not make our own list syntax if possible...we don't really want to deal with those edge/complex casees in C++.

I am gonna to moving this forward recently since the simple labeling scheme PoC is pass most of GCC testsuite :)

@kito-cheng
Copy link
Member Author

@ved-rivos
Copy link
Collaborator

Closing this issue as discussion and development is continuing in the psabi and asm manual PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants