Add BOLT Makefile (#54107)
This uses LLVM's BOLT to optimize libLLVM, libjulia-internal and
libjulia-codegen.

This improves the allinference benchmarks by about 10% largely due to
the optimization of libjulia-internal.
The example in issue #45395,
which stresses LLVM significantly more, also sees a ~10% improvement.
We see a 20% improvement on the following loop:
```julia
@time for i in 1:100000000
    string(i)
end
```

When building corecompiler.ji:
- BOLT gives about a 16% improvement
- PGO+LTO gives about a 21% improvement
- PGO+LTO+BOLT gives about a 23% improvement
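
The commit does not say exactly how these percentages were measured. As a rough sketch, assuming they are relative reductions in wall-clock time between a baseline build and an optimized build, the arithmetic could look like this (`percent_improvement` is a hypothetical helper, not part of the Julia build system):

```bash
# Hypothetical helper: relative improvement of an optimized timing over a baseline.
# Assumes both arguments are wall-clock times in the same unit and baseline > 0.
percent_improvement() {
    local baseline="$1" optimized="$2"
    awk -v b="$baseline" -v o="$optimized" \
        'BEGIN { printf "%.1f\n", (b - o) / b * 100 }'
}
```

For example, `percent_improvement 10.0 8.0` prints `20.0`, i.e. a run that drops from 10 s to 8 s counts as a 20% improvement under this definition.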

This only requires a single build of LLVM, compared to the 2 in PGO, and
theoretically none if we change the binary builder script (i.e. we build
with relocations and `-fno-reorder-blocks-and-partition`, then use BOLT to
get binaries with no relocations and reordered blocks, and then ship both
binaries?). Also, this can theoretically improve the performance of a
PGO+LTO build by a couple of percent.
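
Whether a library was actually linked with relocations can be checked before handing it to BOLT. A minimal sketch, assuming that `-Wl,--emit-relocs` leaves a `.rela.text` section visible in `readelf -S` output (typical for x86_64 and aarch64 ELF files); `has_emit_relocs` is a hypothetical helper name:

```bash
# Reads `readelf -S` output on stdin and succeeds if a .rela.text section is listed.
has_emit_relocs() {
    grep -q '\.rela\.text'
}

# Example (the library path is illustrative):
#   readelf -S usr/lib/libjulia-internal.so | has_emit_relocs && echo "relocations present"
```

Reading the section listing from stdin keeps the helper independent of where the library lives and of which `readelf` binary is used.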

The only reproducible test problem I see is that the BOLT, PGO+LTO and
PGO+LTO+BOLT builds all cause `readelf` to emit warnings as part of the
`osutils` tests.

```
readelf: Warning: Unrecognised form: 0x22
readelf: Warning: DIE has locviews without loclist
readelf: Warning: Unrecognised form: 0x23
readelf: Warning: DIE at offset 0x227399 refers to abbreviation number 14754 which does not exist
readelf: Warning: Bogus end-of-siblings marker detected at offset 212aa9 in .debug_info section
readelf: Warning: Bogus end-of-siblings marker detected at offset 212ab0 in .debug_info section
readelf: Warning: Further warnings about bogus end-of-sibling markers suppressed
```

The unrecognised form warnings seem to be a bug in binutils,
https://sourceware.org/bugzilla/show_bug.cgi?id=28981.
The `DIE at offset` warning was, I believe, fixed in binutils 2.36,
https://sourceware.org/bugzilla/show_bug.cgi?id=26808, but `ld -v` says
I have 2.38.
I assume these are all benign. I also don't see them on CI here
https://buildkite.com/julialang/julia-buildkite/builds/1507#018f00e7-0737-4a42-bcd9-d4061dc8c93e
so it could just be a local issue.
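
To narrow down a local issue like this, it may help to check the version that `readelf` itself reports rather than relying on `ld -v` alone. The following is only a sketch: `version_ge` and `first_version` are hypothetical helpers, and the comparison relies on GNU `sort -V`.

```bash
# Succeeds if version $1 >= version $2 (relies on GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Extracts the first dotted version number from a version banner on stdin.
first_version() {
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n1
}

# Example:
#   version_ge "$(readelf --version | first_version)" 2.36 && echo "readelf is at least 2.36"
```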
Zentrik authored Jul 26, 2024
1 parent a07031a commit 1dee000
Showing 14 changed files with 525 additions and 7 deletions.
13 changes: 10 additions & 3 deletions Make.inc
@@ -516,6 +516,11 @@ SHIPFLAGS_COMMON := -O3
SHIPFLAGS_CLANG := $(SHIPFLAGS_COMMON) -g
SHIPFLAGS_GCC := $(SHIPFLAGS_COMMON) -ggdb2 -falign-functions

+BOLT_LDFLAGS :=
+
+BOLT_CFLAGS_GCC :=
+BOLT_CFLAGS_CLANG :=
+
ifeq ($(OS), Darwin)
JCPPFLAGS_CLANG += -D_LARGEFILE_SOURCE -D_DARWIN_USE_64_BIT_INODE=1
endif
@@ -532,7 +537,8 @@ JCFLAGS := $(JCFLAGS_GCC)
JCPPFLAGS := $(JCPPFLAGS_GCC)
JCXXFLAGS := $(JCXXFLAGS_GCC)
DEBUGFLAGS := $(DEBUGFLAGS_GCC)
-SHIPFLAGS := $(SHIPFLAGS_GCC)
+SHIPFLAGS := $(SHIPFLAGS_GCC) $(BOLT_CFLAGS_GCC)
+BOLT_CFLAGS := $(BOLT_CFLAGS_GCC)
endif

ifeq ($(USECLANG),1)
@@ -542,7 +548,8 @@ JCFLAGS := $(JCFLAGS_CLANG)
JCPPFLAGS := $(JCPPFLAGS_CLANG)
JCXXFLAGS := $(JCXXFLAGS_CLANG)
DEBUGFLAGS := $(DEBUGFLAGS_CLANG)
-SHIPFLAGS := $(SHIPFLAGS_CLANG)
+SHIPFLAGS := $(SHIPFLAGS_CLANG) $(BOLT_CFLAGS_CLANG)
+BOLT_CFLAGS := $(BOLT_CFLAGS_CLANG)

ifeq ($(OS), Darwin)
CC += -mmacosx-version-min=$(MACOSX_VERSION_MIN)
@@ -1295,7 +1302,7 @@ CSL_NEXT_GLIBCXX_VERSION=GLIBCXX_3\.4\.33|GLIBCXX_3\.5\.|GLIBCXX_4\.
# Note: we explicitly _do not_ define `CSL` here, since it requires some more
# advanced techniques to decide whether it should be installed from a BB source
# or not. See `deps/csl.mk` for more detail.
-BB_PROJECTS := BLASTRAMPOLINE OPENBLAS LLVM LIBSUITESPARSE OPENLIBM GMP MBEDTLS LIBSSH2 NGHTTP2 MPFR CURL LIBGIT2 PCRE LIBUV LIBUNWIND DSFMT OBJCONV ZLIB P7ZIP LLD LIBTRACYCLIENT
+BB_PROJECTS := BLASTRAMPOLINE OPENBLAS LLVM LIBSUITESPARSE OPENLIBM GMP MBEDTLS LIBSSH2 NGHTTP2 MPFR CURL LIBGIT2 PCRE LIBUV LIBUNWIND DSFMT OBJCONV ZLIB P7ZIP LLD LIBTRACYCLIENT BOLT
define SET_BB_DEFAULT
# First, check to see if BB is disabled on a global setting
ifeq ($$(USE_BINARYBUILDER),0)
10 changes: 10 additions & 0 deletions contrib/bolt/.gitignore
@@ -0,0 +1,10 @@
profiles-bolt*
optimized.build
toolchain

bolt
bolt_instrument
merge_data
copy_originals
stage0
stage1
134 changes: 134 additions & 0 deletions contrib/bolt/Makefile
@@ -0,0 +1,134 @@
.PHONY: clean clean_profiles restore_originals

# Settings taken from https://github.com/rust-lang/rust/blob/master/src/tools/opt-dist/src/bolt.rs
BOLT_ARGS :=
# Reorder basic blocks within functions
BOLT_ARGS += -reorder-blocks=ext-tsp
# Reorder functions within the binary
BOLT_ARGS += -reorder-functions=cdsort
# Split function code into hot and cold regions
BOLT_ARGS += -split-functions
# Split as many basic blocks as possible
BOLT_ARGS += -split-all-cold
# Move jump tables to a separate section
BOLT_ARGS += -jump-tables=move
# Use regular size pages for code alignment
BOLT_ARGS += -no-huge-pages
# Fold functions with identical code
BOLT_ARGS += -icf=1
# Split using best available strategy (three-way splitting, Cache-Directed Sort)
# Disabled for libjulia-internal till https://github.com/llvm/llvm-project/issues/89508 is fixed
# BOLT_ARGS += -split-strategy=cdsplit
# Update DWARF debug info in the final binary
BOLT_ARGS += -update-debug-sections
# Print optimization statistics
BOLT_ARGS += -dyno-stats
# BOLT doesn't fully support computed gotos, https://github.com/llvm/llvm-project/issues/89117
# Use escaped regex as the name BOLT recognises is often a bit different, e.g. apply_cl/1(*2)
# This doesn't actually seem to do anything, the actual mitigation is not using --use-old-text
# which we do in the bolt target
BOLT_ARGS += -skip-funcs=.\*apply_cl.\*

# -fno-reorder-blocks-and-partition is needed on gcc >= 8.
BOLT_FLAGS := $\
"BOLT_CFLAGS_GCC+=-fno-reorder-blocks-and-partition" $\
"BOLT_LDFLAGS=-Wl,--emit-relocs"

STAGE0_BUILD:=$(CURDIR)/toolchain
STAGE1_BUILD:=$(CURDIR)/optimized.build

STAGE0_BINARIES:=$(STAGE0_BUILD)/usr/bin/

PROFILE_DIR:=$(CURDIR)/profiles-bolt
JULIA_ROOT:=$(CURDIR)/../..

LLVM_BOLT:=$(STAGE0_BINARIES)llvm-bolt
LLVM_MERGEFDATA:=$(STAGE0_BINARIES)merge-fdata

# If you add new files to optimize, you need to add BOLT_LDFLAGS and BOLT_CFLAGS to the build of your new file.
SYMLINKS_TO_OPTIMIZE := libLLVM.so libjulia-internal.so libjulia-codegen.so
FILES_TO_OPTIMIZE := $(shell for file in $(SYMLINKS_TO_OPTIMIZE); do readlink $(STAGE1_BUILD)/usr/lib/$$file; done)

AFTER_INSTRUMENT_MESSAGE:='Run `make finish_stage1` to finish off the build. $\
You can now optionally collect more profiling data by running Julia with an appropriate workload, $\
if you wish, run `make clean_profiles` before doing so to remove any profiling data generated by `make finish_stage1`. $\
You should end up with some data in $(PROFILE_DIR). Afterwards run `make merge_data && make bolt`.'

$(STAGE0_BUILD) $(STAGE1_BUILD):
	$(MAKE) -C $(JULIA_ROOT) O=$@ configure

stage0: | $(STAGE0_BUILD)
	$(MAKE) -C $(STAGE0_BUILD)/deps install-BOLT && \
	touch $@

# Build with our custom flags, binary builder doesn't use them so we need to build LLVM for now.
# We manually skip package image creation so that we can profile it
$(STAGE1_BUILD): stage0
stage1: export USE_BINARYBUILDER_LLVM=0
stage1: | $(STAGE1_BUILD)
	$(MAKE) -C $(STAGE1_BUILD) $(BOLT_FLAGS) julia-src-release julia-symlink julia-libccalltest \
	julia-libccalllazyfoo julia-libccalllazybar julia-libllvmcalltest && \
	touch $@

copy_originals: stage1
	for file in $(FILES_TO_OPTIMIZE); do \
		abs_file=$(STAGE1_BUILD)/usr/lib/$$file; \
		cp $$abs_file "$$abs_file.original"; \
	done && \
	touch $@

# I don't think there's any particular reason to have -no-huge-pages here, perhaps slightly more accurate profile data
# as the final build uses -no-huge-pages
bolt_instrument: copy_originals
	for file in $(FILES_TO_OPTIMIZE); do \
		abs_file=$(STAGE1_BUILD)/usr/lib/$$file; \
		$(LLVM_BOLT) "$$abs_file.original" -o $$abs_file --instrument --instrumentation-file-append-pid --instrumentation-file="$(PROFILE_DIR)/$$file-prof" -no-huge-pages; \
		mkdir -p $$(dirname "$(PROFILE_DIR)/$$file-prof"); \
		printf "\n"; \
	done && \
	touch $@
	@echo $(AFTER_INSTRUMENT_MESSAGE)

# We don't want to rebuild julia-src as then we lose the bolt instrumentation
# So we have to manually build the sysimage and package image
finish_stage1: stage1
	$(MAKE) -C $(STAGE1_BUILD) julia-base-cache && \
	$(MAKE) -C $(STAGE1_BUILD) -f sysimage.mk sysimg-release && \
	$(MAKE) -C $(STAGE1_BUILD) -f pkgimage.mk release

merge_data: bolt_instrument
	for file in $(FILES_TO_OPTIMIZE); do \
		profiles=$(PROFILE_DIR)/$$file-prof.*.fdata; \
		$(LLVM_MERGEFDATA) $$profiles > "$(PROFILE_DIR)/$$file-prof.merged.fdata"; \
	done && \
	touch $@

# The --use-old-text saves about 16 MiB of libLLVM.so size.
# However, the rust folk found it succeeds very non-deterministically for them.
# It tries to reuse old text segments to reduce binary size
# BOLT doesn't fully support computed gotos https://github.com/llvm/llvm-project/issues/89117, so we cannot use --use-old-text on libjulia-internal
# That flag saves less than 1 MiB for libjulia-internal so oh well.
bolt: merge_data
	for file in $(FILES_TO_OPTIMIZE); do \
		abs_file=$(STAGE1_BUILD)/usr/lib/$$file; \
		$(LLVM_BOLT) "$$abs_file.original" -data "$(PROFILE_DIR)/$$file-prof.merged.fdata" -o $$abs_file $(BOLT_ARGS) $$(if [ "$$file" != $(shell readlink $(STAGE1_BUILD)/usr/lib/libjulia-internal.so) ]; then echo "--use-old-text -split-strategy=cdsplit"; fi); \
	done && \
	touch $@

clean_profiles:
	rm -rf $(PROFILE_DIR)

clean:
	rm -f stage0 stage1 bolt copy_originals merge_data bolt_instrument

restore_originals: copy_originals
	for file in $(FILES_TO_OPTIMIZE); do \
		abs_file=$(STAGE1_BUILD)/usr/lib/$$file; \
		cp -P "$$abs_file.original" $$abs_file; \
	done

delete_originals: copy_originals
	for file in $(FILES_TO_OPTIMIZE); do \
		abs_file=$(STAGE1_BUILD)/usr/lib/$$file; \
		rm "$$abs_file.original"; \
	done
17 changes: 17 additions & 0 deletions contrib/bolt/README.md
@@ -0,0 +1,17 @@
BOLT only works on x86_64 and aarch64 on Linux.

DO NOT STRIP THE RESULTING .so FILES; see https://github.com/llvm/llvm-project/issues/56738.
If you really need to, try adding `-use-gnu-stack` to `BOLT_ARGS`.

To build a BOLT-optimized version of Julia, run the following commands (`cd` into this directory first):
```bash
make stage1
make copy_originals
make bolt_instrument
make finish_stage1
make merge_data
make bolt
```
After these commands finish, the optimized version of Julia will be built in the `optimized.build` directory.
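
The same sequence can be wrapped in a small script that stops at the first failing step. This is only a sketch: `run_bolt_pipeline` and the `MAKE_CMD` override are hypothetical names, not part of this Makefile.

```bash
# Hypothetical wrapper around the targets listed above.
# MAKE_CMD can be overridden, e.g. MAKE_CMD=echo for a dry run.
run_bolt_pipeline() {
    local make_cmd="${MAKE_CMD:-make}"
    local target
    for target in stage1 copy_originals bolt_instrument finish_stage1 merge_data bolt; do
        echo "==> make $target"
        # Intentionally unquoted so MAKE_CMD may carry extra arguments.
        $make_cmd "$target" || return 1
    done
}
```

Run `MAKE_CMD=echo run_bolt_pipeline` first to see the order of targets without building anything. If you want to collect extra profiling data with your own workload, run the targets by hand instead, since that step belongs between `finish_stage1` and `merge_data`.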

This doesn't align the code to support huge pages, as it doesn't seem that we do that currently. Skipping the alignment decreases the size of the .so files by 2-4 MB.
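
To see the size effect on your own build, you could compare two versions of the same library, for instance one produced with `-no-huge-pages` and one without. A minimal sketch, where `size_saving_mib` is a hypothetical helper and both file paths are supplied by you:

```bash
# Prints how much smaller the second file is than the first, in MiB (one decimal).
size_saving_mib() {
    local larger="$1" smaller="$2"
    local a b
    a=$(wc -c < "$larger") || return 1
    b=$(wc -c < "$smaller") || return 1
    awk -v a="$a" -v b="$b" 'BEGIN { printf "%.1f\n", (a - b) / (1024 * 1024) }'
}
```

A negative result means the second file is actually the larger one.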
14 changes: 14 additions & 0 deletions contrib/pgo-lto-bolt/.gitignore
@@ -0,0 +1,14 @@
stage0*
stage1*
stage2*
bolt
bolt_instrument
merge_data
copy_originals

profiles
profiles-bolt

toolchain
pgo-instrumented.build
optimized.build