This project serves as the data collection process for training neural decompilers, such as CMUSTRUDEL/DIRE.
The code for compilation is adapted from bvasiles/decompilationRenaming. The code for decompilation is adapted from CMUSTRUDEL/DIRE.
You can use scripts/system_setup.sh to set up an Ubuntu machine to run this project:
git clone https://github.com/mov0xdecafe/ghcc.git
cd ghcc
./scripts/system_setup.sh
This script assumes you are running it from the ghcc folder; calling it from any other directory will result in an error.
You can then replace the contents of the database-config.json file with the values below:
{
"host": "localhost",
"port": 27017,
"auth_db_name": "admin",
"db_name": "ghcc",
"username": "gchh_crawler",
"password": "securepassword"
}
These are based on the values used in the MongoDB setup in scripts/system_setup.sh.
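As a rough illustration of how these fields fit together, the sketch below turns the configuration into a standard MongoDB connection URI. The helper name mongo_uri_from_config is hypothetical and GHCC's own connection code may assemble things differently; only the field names and values come from the configuration above.

```python
import json
from urllib.parse import quote_plus


def mongo_uri_from_config(config: dict) -> str:
    """Build a MongoDB connection URI from a database-config.json style dict.

    Hypothetical helper for illustration; not part of GHCC.
    """
    # Credentials must be percent-encoded in case they contain ':' or '@'.
    user = quote_plus(config["username"])
    password = quote_plus(config["password"])
    return (
        f"mongodb://{user}:{password}@{config['host']}:{config['port']}"
        f"/{config['db_name']}?authSource={config['auth_db_name']}"
    )


example_config = json.loads("""
{
  "host": "localhost",
  "port": 27017,
  "auth_db_name": "admin",
  "db_name": "ghcc",
  "username": "gchh_crawler",
  "password": "securepassword"
}
""")

print(mongo_uri_from_config(example_config))
# mongodb://gchh_crawler:securepassword@localhost:27017/ghcc?authSource=admin
```

The auth_db_name field maps to the authSource option, which tells MongoDB which database holds the user's credentials.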
- Install Docker and MongoDB.
- Install required Python packages by:
pip install -r requirements.txt
- Rename database-config-example.json to database-config.json, and fill in appropriate values. This will be used to connect to your MongoDB server.
- Build the Docker image used for compiling programs by:
docker build -t gcc-custom .
You will need a list of GitHub repository URLs to run the code. The current code expects one URL per line, for example:
https://github.com/huzecong/ghcc.git
https://www.github.com/torvalds/linux
FFmpeg/FFmpeg
https://api.github.com/repos/pytorch/pytorch
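The four lines above all identify a repository by owner and name. The sketch below shows one way such lines could be normalized; parse_repo is a hypothetical helper written for this example, and the parsing rules GHCC actually applies may be stricter or looser.

```python
import re
from typing import Optional, Tuple


def parse_repo(line: str) -> Optional[Tuple[str, str]]:
    """Return (owner, name) for a repository line, or None if unrecognized.

    Hypothetical helper for illustration; not GHCC's actual parser.
    """
    text = line.strip()
    # Drop the scheme, then any of the GitHub host prefixes shown above.
    text = re.sub(r"^https?://", "", text)
    text = re.sub(r"^(www\.)?github\.com/", "", text)
    text = re.sub(r"^api\.github\.com/repos/", "", text)
    text = text.rstrip("/")
    if text.endswith(".git"):
        text = text[: -len(".git")]
    parts = text.split("/")
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]


for line in [
    "https://github.com/huzecong/ghcc.git",
    "https://www.github.com/torvalds/linux",
    "FFmpeg/FFmpeg",
    "https://api.github.com/repos/pytorch/pytorch",
]:
    print(parse_repo(line))
```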
To run, simply execute:
python main.py --repo-list-file path/to/your/list [arguments...]
The following arguments are supported:
- --repo-list-file [path]: Path to the list of repository URLs.
- --clone-folder [path]: The temporary directory to store cloned repository files. Defaults to repos/.
- --binary-folder [path]: The directory to store compiled binaries. Defaults to binaries/.
- --archive-folder [path]: The directory to store archived repository files. Defaults to archives/.
- --n-procs [int]: Number of worker processes to spawn. Defaults to 0 (single-process execution).
- --log-file [path]: Path to the log file. Defaults to log.txt.
- --clone-timeout [int]: Maximum cloning time (seconds) for one repository. Defaults to 600 (10 minutes).
- --force-reclone: If specified, all repositories are cloned regardless of whether they have been processed before or whether an archived version exists.
- --compile-timeout [int]: Maximum compilation time (seconds) for all Makefiles under a repository. Defaults to 900 (15 minutes).
- --force-recompile: If specified, all repositories are compiled regardless of whether they have been processed before.
- --docker-batch-compile: Batch compile all Makefiles in one repository using one Docker invocation. This is on by default, and you almost always want this. Use the --no-docker-batch-compile flag to disable it.
- --compression-type [str]: Format of the repository archive. Available options are gzip (faster) and xz (smaller). Defaults to gzip.
- --max-archive-size [int]: Maximum size (bytes) of repositories to archive. Repositories with greater sizes will not be archived. Defaults to 104,857,600 (100 MiB).
- --record-libraries [path]: If specified, a list of libraries used during failed compilations will be written to the specified path. See Collecting and Installing Libraries for details.
- --logging-level [str]: The logging level. Defaults to info.
- --max-repos [int]: If specified, only the first max_repos repositories from the list will be processed.
- --recursive-clone: If specified, submodules in the repository will also be cloned if any exist. This is on by default. Use the --no-recursive-clone flag to disable it.
- --write-db: If specified, compilation results will be written to the database. This is on by default. Use the --no-write-db flag to disable it.
- --record-metainfo: If specified, additional statistics will be recorded.
- --gcc-override-flags: If specified, these are passed as compiler flags to GCC. By default -O0 is used.
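Several of the flags above come in on/off pairs (for example --write-db and --no-write-db). The sketch below shows how such pairs behave using Python's standard argparse.BooleanOptionalAction (Python 3.9+). It covers only a handful of the arguments and is an assumption about one possible way to express them, not a copy of the parser in main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser covering a subset of the documented arguments."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--repo-list-file", required=True)
    parser.add_argument("--n-procs", type=int, default=0)
    parser.add_argument("--compression-type", choices=["gzip", "xz"], default="gzip")
    parser.add_argument("--max-archive-size", type=int, default=100 * 1024 * 1024)
    # BooleanOptionalAction registers both --flag and --no-flag.
    parser.add_argument("--docker-batch-compile",
                        action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--write-db",
                        action=argparse.BooleanOptionalAction, default=True)
    return parser


args = build_parser().parse_args(
    ["--repo-list-file", "repos.txt", "--n-procs", "4", "--no-write-db"]
)
print(args.write_db, args.docker_batch_compile, args.max_archive_size)
# False True 104857600
```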
- If compilation is interrupted, there may be leftovers that cannot be removed due to privilege issues. Purge them by:
./purge_folder.py /path/to/clone/folder
This is because intermediate files are created under different permissions, and we need root privileges (sneakily obtained via Docker) to purge those files. This is also performed at the beginning of the main.py script.
- If something goes seriously wrong, drop the database by:
python -m ghcc.database clear
- If the code is modified, remember to rebuild the image since the batch_make.py script (executed inside Docker to compile Makefiles) depends on the library code. If you don't do so, well, GHCC will remind you and refuse to proceed.
Decompilation requires an active installation of IDA with the Hex-Rays plugin. To run, simply execute:
python run_decompiler.py --ida path/to/idat64 [arguments...]
The following arguments are supported:
- --ida [path]: Path to the idat64 executable found under the IDA installation folder.
- --binaries-dir [path]: The directory where binaries are stored, i.e. the same value for --binary-folder in the compilation arguments. Defaults to binaries/.
- --output-dir [path]: The directory to store decompiled code. Defaults to decompile_output/.
- --log-file [path]: Path to the log file. Defaults to decompile-log.txt.
- --timeout [int]: Maximum decompilation time (seconds) for one binary. Defaults to 30.
- --n-procs [int]: Number of worker processes to spawn. Defaults to 0 (single-process execution).
The following procedure happens when compiling a Makefile:
- Check if directory is "make"-able: A directory is marked as "make"-able if it contains (case-insensitively) at least one set of files among the following:
  - (Make) Makefile
  - (automake) Makefile.am
  If the directory is not "make"-able, skip the following steps.
- Clean Git repository:
  git reset --hard   # reset modified files
  git clean -xffd    # clean unversioned files
  # do the same for submodules
  git submodule foreach --recursive git reset --hard
  git submodule foreach --recursive git clean -xffd
If any command fails, ignore it and continue executing the rest.
- Build:
  - If a file named Makefile.am exists, run automake:
    autoreconf && automake --add-missing
  - If a file named configure exists, run the configuration script:
    chmod +x ./configure && ./configure --disable-werror
    The --disable-werror flag prevents warnings from being treated as errors in cases where -Werror is specified.
    If the command fails within 2 seconds, try again without --disable-werror.
  - Run make:
    make --always-make --keep-going -j1
    The --always-make flag rebuilds all dependent targets even if they exist. The --keep-going flag allows Make to continue building other targets when errors occur in targets they do not depend on.
    If the command fails within 2 seconds and the output contains "Missing separator", try again with bmake (BSD Make).
    Note: We override certain programs with our "wrapped" versions by modifying the PATH variable. The wrapped programs are:
    - GCC (gcc, cc, clang): Swallows unnecessary and/or error-prone flags (-Werror, -march, -mlittle-endian), records libraries used (-l), overrides the optimization level (-O0), adds override flags specified in the arguments, and calls the real GCC. If the real GCC fails, writes the libraries to a predefined path.
    - sudo: Does not prompt for the password, but instead just tries to execute the command without privileges.
    - pkg-config: Records libraries used, and calls the real pkg-config. If it fails (meaning packages cannot be resolved), writes the libraries to a predefined path.
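To make the GCC wrapper's argument handling concrete, here is a minimal sketch of the filtering described above. The function name filter_gcc_args and the exact matching rules (for instance, treating any argument that starts with -march as swallowed, and only recognizing the joined -lfoo form) are assumptions made for this example; the real wrapper may differ in details.

```python
from typing import List, Sequence, Tuple

# Flags named in the description above as swallowed by the wrapper.
SWALLOWED_EXACT = {"-Werror", "-mlittle-endian"}
SWALLOWED_PREFIXES = ("-march",)


def filter_gcc_args(
    args: Sequence[str], override_flags: Sequence[str] = ()
) -> Tuple[List[str], List[str]]:
    """Return (rewritten_args, libraries) for one compiler invocation.

    Hypothetical helper for illustration; not GHCC's actual wrapper.
    """
    kept: List[str] = []
    libraries: List[str] = []
    for arg in args:
        if arg in SWALLOWED_EXACT or arg.startswith(SWALLOWED_PREFIXES):
            continue  # swallow error-prone flags
        if arg.startswith("-O"):
            continue  # drop the original optimization level
        if arg.startswith("-l") and len(arg) > 2:
            libraries.append(arg[2:])  # record "-lfoo" as library "foo"
        kept.append(arg)
    # Force -O0 first, then append user overrides so they take precedence.
    return kept + ["-O0"] + list(override_flags), libraries


new_args, libs = filter_gcc_args(
    ["-O2", "-Werror", "-march=native", "-lm", "-o", "a.out", "main.c"],
    override_flags=["-g"],
)
print(new_args)  # ['-lm', '-o', 'a.out', 'main.c', '-O0', '-g']
print(libs)      # ['m']
```

Placing the override flags last relies on GCC honoring the last of several conflicting options, so an override such as -O2 would win over the forced -O0.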
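Two decision points in the procedure above can be sketched as small predicates: the "make"-able check and the bmake fallback condition. Both function names are hypothetical, and the case-insensitive match on the error text is an assumption (GNU Make prints the message in lowercase).

```python
import os
import tempfile

# File names (lowercased) that mark a directory as "make"-able.
MAKEABLE_FILES = {"makefile", "makefile.am"}


def is_makeable(directory: str) -> bool:
    """Case-insensitively check for a Makefile or Makefile.am."""
    names = {name.lower() for name in os.listdir(directory)}
    return bool(names & MAKEABLE_FILES)


def should_retry_with_bmake(returncode: int, elapsed_seconds: float, output: str) -> bool:
    """Retry with BSD Make only on a quick failure caused by a syntax mismatch."""
    return (
        returncode != 0
        and elapsed_seconds < 2.0
        and "missing separator" in output.lower()
    )


with tempfile.TemporaryDirectory() as tmp:
    print(is_makeable(tmp))  # False: the directory is empty
    open(os.path.join(tmp, "MAKEFILE.AM"), "w").close()
    print(is_makeable(tmp))  # True: the name matches case-insensitively

print(should_retry_with_bmake(2, 0.3, "Makefile:3: *** missing separator.  Stop."))
# True
```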
Most repositories require linking to external libraries. To collect libraries that are linked to in Makefiles, run the script with the flag --record-libraries path/to/library_log.txt. Only libraries in commands that failed to execute (GCC return code is non-zero) are recorded in the log file.
After gathering the library log, run install_libraries.py path/to/library_log.txt to resolve libraries to package names (based on apt-cache). This step requires actually installing packages, so it's recommended to run it in a Docker environment:
docker run --rm \
-v /absolute/path/to/directory/:/usr/src/ \
gcc-custom \
"install_libraries.py /usr/src/library_log.txt"
This gives a list of packages to install. Add the list of packages to Dockerfile (the command that begins with RUN apt-get install -y --no-install-recommends) and rebuild the image to apply changes.
Compiling random code from GitHub is basically equivalent to running curl | bash, and doing so in Docker would be like curl | sudo bash, as Docker (by default) doesn't protect you against kernel panics and fork bombs. The following notes describe what is done to (partly) ensure safety of the host machine when compiling code.
- Never run Docker as root. This means two things: 1) don't use sudo docker run ..., and 2) don't execute commands in Docker as the root user (default). The first goal can be achieved by creating a docker user group, and the second can be achieved using a special entry-point: create a non-privileged user and use gosu to switch to that user and run commands.
  Caveats: When creating the non-privileged user, assign the same UID (user ID) or GID (group ID) as the host user, so files created inside the container can be accessed/modified by the host user.
- Limit the number of processes. This is to prevent things like fork bombs or badly written recursive Makefiles from taking up kernel memory. A simple solution is to use ulimit -u <nprocs> to set the maximum allowed number of processes, but such limits are on a per-user basis instead of a per-container or per-process-tree basis.
  What we can do is: for each container we spawn, create a user that has the same GID as the host user, but with a distinct UID, and call ulimit for that user. This serves as a workaround for per-container limits.
  Don't forget to chmod g+w for files that need to be accessed from host.
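For readers who want to experiment with the process limit outside a shell, the sketch below applies the same limit that ulimit -u sets (RLIMIT_NPROC) to a child process using Python's standard resource module. It is Unix-only, the helper name run_with_process_limit is hypothetical, and it is not how GHCC's entry-point is implemented. As with ulimit -u, the kernel counts processes per user, which is why the distinct-UID workaround above is still needed.

```python
import resource
import subprocess
import sys


def run_with_process_limit(cmd, max_procs):
    """Run cmd with RLIMIT_NPROC capped at max_procs (or the hard limit, if lower).

    Hypothetical helper for illustration; Unix-only.
    """
    def lower_limit():
        # Runs in the child between fork and exec, so the parent is unaffected.
        _, hard = resource.getrlimit(resource.RLIMIT_NPROC)
        limit = max_procs if hard == resource.RLIM_INFINITY else min(max_procs, hard)
        # Lower both soft and hard limits so the child cannot raise them again.
        resource.setrlimit(resource.RLIMIT_NPROC, (limit, limit))

    return subprocess.run(cmd, preexec_fn=lower_limit, capture_output=True, text=True)


# The child reports the soft limit it actually received.
result = run_with_process_limit(
    [sys.executable, "-c",
     "import resource; print(resource.getrlimit(resource.RLIMIT_NPROC)[0])"],
    max_procs=256,
)
print(result.stdout.strip())
```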
If you have specific requirements for the branch, commit_id, or tag for a repo, you should use a .json URL file:
{
"repos": [
{
"url": "https://github.com/example/test.git",
"branch": "master",
"commit": "123456789abcdef",
"tag": "v0.1"
}
]
}
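A file in this shape can be read with a few lines of Python. In the sketch below, RepoSpec, load_repo_specs, and the precedence used by checkout_ref (commit, then tag, then branch) are all assumptions made for illustration; the source does not state how GHCC resolves an entry that sets more than one of these fields.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RepoSpec:
    """One entry of the "repos" array (hypothetical container type)."""
    url: str
    branch: Optional[str] = None
    commit: Optional[str] = None
    tag: Optional[str] = None

    def checkout_ref(self) -> Optional[str]:
        # Assumed precedence: most specific reference wins.
        return self.commit or self.tag or self.branch


def load_repo_specs(text: str) -> List[RepoSpec]:
    """Parse the JSON text and return one RepoSpec per listed repository."""
    data = json.loads(text)
    return [
        RepoSpec(
            url=entry["url"],
            branch=entry.get("branch"),
            commit=entry.get("commit"),
            tag=entry.get("tag"),
        )
        for entry in data["repos"]
    ]


specs = load_repo_specs("""
{
  "repos": [
    {"url": "https://github.com/example/test.git", "branch": "master", "tag": "v0.1"}
  ]
}
""")
print(specs[0].url, specs[0].checkout_ref())
# https://github.com/example/test.git v0.1
```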
- Ability to checkout certain commits for compilation
- Update DB to store more repo details (branch, tag, commit_id)
- Update filesystem structure to capture tags/branches/commit_ids
- Compiler matrix for different versions and configurations
- Enumerate over all tags
- Add more build systems
  - CMake
- Add more compilers
  - Clang / LLVM
  - MSVC