joelpx edited this page Apr 17, 2016 · 14 revisions

Add support for a new architecture

TODO : each architecture requires a C file for the analyzer.

Architecture-specific files live in the folder plasma/lib/arch/<NEW_ARCH>. Four files are mandatory to add a new architecture :

  • utils.py : defines functions to detect jump/return/call/compare instructions and how instruction symbols must be printed (for example, add on x86 is printed as "+=").
  • output.py : implements the abstract class plasma.lib.output.OutputAbs.
  • process_ast.py : defines functions that modify the AST after a decompilation.
  • __init__.py : contains the list of all functions defined in process_ast.py.

utils.py

Define two global variables :

OP_IMM = <ARCH>_OP_IMM
OP_MEM = <ARCH>_OP_MEM

Define a list of known function prologs. Due to a limitation in plasma.lib.analyzer.has_analyzer, a single instruction cannot be longer than 4 bytes.

PROLOGS = [
    [b"\x12\x34\x56"], # inst1, inst2, ...
    ...
]
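Because of the 4-byte limit, a quick sanity check on the list can catch mistakes early. The sketch below is only an illustration: check_prologs is a hypothetical helper that is not part of plasma, and the byte strings are illustrative values.

```python
# Illustrative prologs: each inner list is one prolog,
# each bytes object is one instruction of that prolog.
PROLOGS = [
    [b"\x12\x34\x56"],
    [b"\x55", b"\x89\xe5"],
]

MAX_INST_SIZE = 4  # limit stated above


def check_prologs(prologs, max_size=MAX_INST_SIZE):
    """Return the (prolog_index, inst_index) pairs whose instruction is too long."""
    too_long = []
    for p_idx, prolog in enumerate(prologs):
        for i_idx, inst in enumerate(prolog):
            if len(inst) > max_size:
                too_long.append((p_idx, i_idx))
    return too_long
```

An empty result means that every instruction fits within the limit.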

Define a list pairing each condition id with its opposite.

OPPOSITES = [
    [X86_INS_JE, X86_INS_JNE],
    ...
]
OPPOSITES = dict(OPPOSITES + [i[::-1] for i in OPPOSITES])
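The last line turns the list of pairs into a dictionary that can be looked up in both directions. A minimal sketch, using small integers as stand-ins for the Capstone instruction ids (the values are assumptions chosen only so the example runs on its own):

```python
# Stand-in condition ids (assumed values, not the real Capstone constants).
JE, JNE, JL, JGE = 1, 2, 3, 4

OPPOSITES = [
    [JE, JNE],
    [JL, JGE],
]
# Append every pair reversed, then build the dictionary:
# each condition now maps to its opposite, whichever side it was listed on.
OPPOSITES = dict(OPPOSITES + [i[::-1] for i in OPPOSITES])
```

With this construction you only have to list each pair once.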

Define a dictionary containing a string for each instruction you want to print differently.

INST_SYMB = {
    X86_INS_JE: "==",
    X86_INS_JNE: "!=",
    ...
    X86_INS_XOR: "^=",
    X86_INS_OR: "|=",
    ...
}

Then implement all these functions :

def is_cmp(i):
    return i.id == <COMPARE_ID_INSTRUCTION>

def is_jump(i):
    return i.group(CS_GRP_JUMP)

def is_cond_jump(i):
    return i.group(CS_GRP_JUMP) and i.id != <UNCONDITIONAL_JUMP>

def is_uncond_jump(i):
    return i.id == <UNCONDITIONAL_JUMP>

def is_ret(i):
    return i.group(CS_GRP_RET)

def is_call(i):
    return i.group(CS_GRP_CALL)

def cond_symbol(ty):
    return INST_SYMB.get(ty, "UNKNOWN")

def inst_symbol(i):
    return INST_SYMB.get(i.id, "UNKNOWN")
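To check the jump predicates without a real binary, you can exercise them with a stub instruction. This is a sketch under assumptions: FakeInsn is a hypothetical class that mimics only the id attribute and the group() method of a Capstone instruction, and the numeric constants are stand-ins.

```python
# Stand-ins for Capstone group constants and instruction ids (assumed values).
CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET = 1, 2, 3
INS_JMP, INS_JNE, INS_CMP = 10, 11, 12


class FakeInsn:
    """Hypothetical stub exposing the two members the predicates rely on."""

    def __init__(self, insn_id, groups=()):
        self.id = insn_id
        self._groups = set(groups)

    def group(self, group_id):
        return group_id in self._groups


def is_jump(i):
    return i.group(CS_GRP_JUMP)


def is_uncond_jump(i):
    return i.id == INS_JMP


def is_cond_jump(i):
    # Any jump that is not the unconditional one is conditional.
    return i.group(CS_GRP_JUMP) and i.id != INS_JMP


jmp = FakeInsn(INS_JMP, [CS_GRP_JUMP])
jne = FakeInsn(INS_JNE, [CS_GRP_JUMP])
cmp_ = FakeInsn(INS_CMP)
```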

Generally the condition is the same as the instruction id. On ARM, however, a condition can be set on any instruction; in that case use i.cc.

def invert_cond(i):
    return OPPOSITES.get(i.id, -1)

def get_cond(i):
    return i.id
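For an architecture such as ARM, where the condition is carried by i.cc, the two functions could look like the sketch below. The condition codes are stand-ins for Capstone's ARM constants, and FakeInsn is a hypothetical stub used only to make the example self-contained.

```python
from collections import namedtuple

# Stand-in condition codes (assumed values).
CC_EQ, CC_NE, CC_AL = 1, 2, 15

OPPOSITES = [[CC_EQ, CC_NE]]
OPPOSITES = dict(OPPOSITES + [i[::-1] for i in OPPOSITES])

# Hypothetical stub: only the fields used below.
FakeInsn = namedtuple("FakeInsn", ["id", "cc"])


def get_cond(i):
    # The condition is stored on the instruction, not encoded in its id.
    return i.cc


def invert_cond(i):
    # -1 signals that no opposite is known for this condition.
    return OPPOSITES.get(i.cc, -1)


addeq = FakeInsn(id=100, cc=CC_EQ)
```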

output.py

Two functions from plasma.lib.output may be useful : _imm and _add. The first prints an immediate value and the second prints a string. For RISC architectures you can get the operand size from self.gctx.dis.mode & CS_MODE_32.

COND_ADD_ZERO is a list of condition ids. After an instruction with one of these condition ids, a 0 must be appended to the printed condition. Example for mips : beqz $t1, label -> if == 0.

ASSIGNMENT_OPS is a list of instruction ids indicating which instructions can be fused with a conditional instruction. Each of them must be an assignment, not a comparison (for example add, and, ...). The fusion must be implemented in <NEW_ARCH>.process_ast; if it is not, you can leave this list empty.

from capstone import CS_MODE_32
from capstone.<ARCH> import ...
from plasma.lib.output import OutputAbs
from plasma.lib.arch.<NEW_ARCH>.utils import (inst_symbol, is_call, is_jump, is_ret, is_uncond_jump, cond_symbol)

COND_ADD_ZERO = [ ... ]
ASSIGNMENT_OPS = [ ... ]

class Output(OutputAbs):

In the function _sub_asm_inst you can define a specific display for each instruction. Returns, calls, and jumps are printed later, in the function _asm_inst, so you cannot override them here.

def _sub_asm_inst(self, i, tab=0):
    modified = False

    if self.gctx.capstone_string == 0:
        if i.id == <INSTRUCTION_ID>:
            # do something ...
            modified = True
        ...

    if not modified:
        if len(i.operands) > 0:
            self._add("%s " % i.mnemonic)
            self._operand(i, 0)
            k = 1
            while k < len(i.operands):
                self._add(", ")
                self._operand(i, k)
                k += 1
        else:
            self._add(i.mnemonic)

For a first test you can leave the function _operand empty and just do :

def _sub_asm_inst(self, i, tab=0):
    self._add(self.get_inst_str(i))

The function _operand is called on each operand of each instruction.

  • i : capstone instruction
  • num_op : index in i.operands of the operand to print
  • hexa : True if the operand is an immediate that must be printed in hexadecimal
  • show_deref : used with memory accesses; it indicates whether *() should be printed. For example, the x86 lea instruction sets show_deref to False.
  • force_dont_print_data : if False and the operand is an immediate pointing to a string, the string is printed next to it. It is set to True for calls and jumps, so that a string is never printed.


def _operand(self, i, num_op, hexa=False, show_deref=True, force_dont_print_data=False):
    def inv(n):
        return n == CS_OP_INVALID

    op = i.operands[num_op]

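    if op.type == CS_OP_IMM:
        # op_size is not defined in this template: set it to the operand
        # size of your architecture (see the note on operand size above).
        self._imm(op.value.imm, op_size, hexa, force_dont_print_data=force_dont_print_data)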

    elif op.type == CS_OP_REG:
        self._add(i.reg_name(op.value.reg))

    elif op.type == <ARCH>_OP_MEM:  # for example MIPS_OP_MEM
        mm = op.mem
        printed = False

        # Does the access contain a register with a known value ?
        # example : for x86 we can compute any access [eip + DISP]
        # We should call `self.deref_if_offset` for any known address.

        # This code is more or less generic, you just need to adapt it to the
        # architecture (a memory access can have a base, segment, index, disp,
        # shift (for arm), ...).
        if show_deref:
            self._add("*(")
        if not inv(mm.base):
            self._add("%s" % i.reg_name(mm.base))
            printed = True
        if mm.disp != 0:
            section = self._binary.get_section(mm.disp)
            if self.is_label(mm.disp) or section is not None:
                if printed:
                    self._add(" + ")
                self._imm(mm.disp, 0, True, section=section, print_data=False,
                          force_dont_print_data=force_dont_print_data)
            else:
                if printed:
                    if mm.disp < 0:
                        self._add(" - %d" % (-mm.disp))
                    else:
                        self._add(" + %d" % mm.disp)
                else:
                    self._add("%d" % mm.disp)
        if show_deref:
            self._add(")")

# Are there any other op.type values in the architecture ?
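The base + displacement formatting used for memory operands can be tried on its own. The following sketch keeps only that part of the logic: format_mem is a hypothetical helper, base is a register name or None (a simplification of mm.base), and the label/section branch that goes through _imm is left out.

```python
def format_mem(base, disp, show_deref=True):
    """Format a memory access made of an optional base register and a displacement."""
    parts = []
    if show_deref:
        parts.append("*(")
    printed = False
    if base is not None:
        parts.append(base)
        printed = True
    if disp != 0:
        if printed:
            # A base was printed: show the displacement as a signed offset.
            if disp < 0:
                parts.append(" - %d" % (-disp))
            else:
                parts.append(" + %d" % disp)
        else:
            parts.append("%d" % disp)
    if show_deref:
        parts.append(")")
    return "".join(parts)
```

For instance, format_mem("sp", 8) gives "*(sp + 8)", and passing show_deref=False drops the surrounding *( ).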

The function _if_cond prints the statement if (...) in the decompilation mode. cond is the condition id (returned by <NEW_ARCH>.utils.get_cond). fused_inst is the instruction fused with the jump (for example cmp with jne). It is None if no fusion was done. If the fusion is not implemented, you can ignore this parameter.

For the moment, each architecture must reimplement this function, because x86 has a special case for the instruction test : only the form test reg1, reg1 is used.

def _if_cond(self, cond, fused_inst):
    if fused_inst is None:
        self._add(cond_symbol(cond))
        if cond in COND_ADD_ZERO:
            self._add(" 0")
        return

    assignment = fused_inst.id in ASSIGNMENT_OPS
    if assignment:
        self._add("(")

    self._add("(")
    self._operand(fused_inst, 0)
    self._add(" ")

    if assignment:
        self._add(inst_symbol(fused_inst))
        self._add(" ")
        self._operand(fused_inst, 1)
        self._add(") ")
        self._add(cond_symbol(cond))
    else:
        self._add(cond_symbol(cond))
        self._add(" ")
        self._operand(fused_inst, 1)

    if (fused_inst.id != <CMP_INSTRUCTION> and \
            (cond in COND_ADD_ZERO or assignment)):
        self._add(" 0")

    self._add(")")
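To see what this function produces, the sketch below runs the same logic against stubs. Everything outside the body of _if_cond is an assumption made for illustration: the ids are stand-in integers, DemoOutput and FakeInsn are hypothetical classes, operands are plain strings, and the symbol lookup is inlined instead of calling cond_symbol and inst_symbol.

```python
# Stand-in instruction and condition ids (assumed values).
INS_CMP, INS_AND, COND_EQ, COND_NE, COND_EQZ = 1, 2, 10, 11, 12

INST_SYMB = {COND_EQ: "==", COND_NE: "!=", COND_EQZ: "==", INS_AND: "&="}
COND_ADD_ZERO = [COND_EQZ]
ASSIGNMENT_OPS = [INS_AND]


class FakeInsn:
    def __init__(self, insn_id, operands):
        self.id = insn_id
        self.operands = operands


class DemoOutput:
    def __init__(self):
        self.tokens = []

    def _add(self, s):
        self.tokens.append(s)

    def _operand(self, i, num_op):
        self._add(i.operands[num_op])

    def _if_cond(self, cond, fused_inst):
        if fused_inst is None:
            self._add(INST_SYMB.get(cond, "UNKNOWN"))
            if cond in COND_ADD_ZERO:
                self._add(" 0")
            return

        assignment = fused_inst.id in ASSIGNMENT_OPS
        if assignment:
            self._add("(")
        self._add("(")
        self._operand(fused_inst, 0)
        self._add(" ")

        if assignment:
            # Print the assignment first, then compare its result.
            self._add(INST_SYMB.get(fused_inst.id, "UNKNOWN"))
            self._add(" ")
            self._operand(fused_inst, 1)
            self._add(") ")
            self._add(INST_SYMB.get(cond, "UNKNOWN"))
        else:
            self._add(INST_SYMB.get(cond, "UNKNOWN"))
            self._add(" ")
            self._operand(fused_inst, 1)

        if fused_inst.id != INS_CMP and (cond in COND_ADD_ZERO or assignment):
            self._add(" 0")
        self._add(")")


def render(cond, fused_inst):
    out = DemoOutput()
    out._if_cond(cond, fused_inst)
    return "".join(out.tokens)
```

With these stubs, a fused compare renders as "(eax != ebx)", a fused assignment as "((eax &= 1) == 0)", and an unfused condition listed in COND_ADD_ZERO as "== 0".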

process_ast.py

Define all functions that process the AST after a decompilation. You can fuse instructions here.

__init__.py

import plasma.lib.arch.<NEW_ARCH>.output
import plasma.lib.arch.<NEW_ARCH>.utils
import plasma.lib.arch.<NEW_ARCH>.process_ast

registered = [
    process_ast.function_1,
    ...
]
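The registered list is meant to be walked by the decompiler, which calls each function in order. The sketch below only illustrates that idea: the pass names, the (ctx, ast) signature, and the run_passes driver are all assumptions made for this example, not plasma's actual API.

```python
# Hypothetical AST passes; in a real port these would live in process_ast.py.
# The (ctx, ast) signature is an assumption made for this sketch.
def fuse_compare_with_jump(ctx, ast):
    ctx["log"].append("fuse_compare_with_jump")


def simplify_conditions(ctx, ast):
    ctx["log"].append("simplify_conditions")


registered = [
    fuse_compare_with_jump,
    simplify_conditions,
]


def run_passes(ctx, ast, passes=registered):
    """Call every registered pass, in list order."""
    for func in passes:
        func(ctx, ast)


ctx = {"log": []}
run_passes(ctx, ast=None)
```

The order of the list matters whenever one pass depends on the result of another.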

Integration

  • lib.disassembler : update the function load_arch_module.
  • lib.fileformat.[elf, raw] : update the variables arch_lookup and arch_mode_lookup. Also check that the functions load_static_sym and load_dyn_sym in elf are correct for the new architecture.
  • lib.ui.console : update the function __exec_info.
  • lib.analyzer : update the function set. Search for the word is_x86 to see where the code is architecture-dependent.
  • lib.__init__.py : update the help text (remember the --raw option).