Shared object support #15

Merged: 83 commits, Jul 19, 2023
66642d8
add perf outputs to gitignore
n-eiling Feb 15, 2023
403ec5f
fix various errors in the Makefiles that lead to building on a non-cl…
n-eiling Feb 15, 2023
0e13cbf
add test program for cuda code loaded using libdl
n-eiling Feb 15, 2023
cb391b3
when the client dlopens libraries containing cuda kernels, also open …
n-eiling Feb 15, 2023
905fefe
add decoding of fatbinary data
n-eiling Feb 16, 2023
0997d44
add decoding of embedded fatbinaries
n-eiling Feb 17, 2023
4ff4b5c
add temporary test code that launches a kernel on the server from an …
n-eiling Feb 17, 2023
e72c11c
add registry for transferred cubins and kernel functions so Cricket is…
n-eiling Feb 18, 2023
15eb759
fix segfault on cleanup because CUDA accesses nonexisting fatcubinHandle
n-eiling Feb 21, 2023
ff493f6
code cleanup. fix wrong passing of dimensions
n-eiling Feb 21, 2023
98832e2
use an infinite timeout for kernel calls
n-eiling Feb 21, 2023
6eeef6c
remove timeout for cudaDeviceSynchronize
n-eiling Feb 21, 2023
74179ef
make cpu_utils_contains_kernel return the right value
n-eiling Mar 9, 2023
36ead03
add cudaRegisterVar client function
n-eiling Mar 9, 2023
babb70c
add gdb commands file for debugging client apps
n-eiling Mar 9, 2023
d1f6173
fix cpu_utils_contains_kernel and cpu_utils_parameter_info returning …
n-eiling Mar 10, 2023
9cb6aaf
make cpu_utils_launch_child also redirect stderr of child processes t…
n-eiling Mar 10, 2023
ac36e85
reduce debugging output verbosity and add some NULL checks
n-eiling Mar 10, 2023
07ed931
make dlopen return a handle to the main program if it is called with …
n-eiling Mar 10, 2023
e9b2c1c
fix ci error by making tests/cpu/cubin/main.cpp compile
n-eiling Mar 10, 2023
dec25d1
parse kernel parameter infos from in-memory elf using libbfd
n-eiling Mar 24, 2023
09b34f6
fix cpu-server not using the new name of elf_symbol_address
n-eiling Mar 24, 2023
701d4bd
add possibility to dump elfs
n-eiling Mar 24, 2023
89f78e6
make higher log levels configurable from makefile
n-eiling Mar 27, 2023
4d7dc55
add comments and additional error handling
n-eiling Mar 30, 2023
d9870e0
add elf_init function to avoid multiple initializations of libbfd
n-eiling Mar 30, 2023
45e7e18
use libelf instead of libbfd for elf manipulation because of better …
n-eiling Apr 11, 2023
8de247f
add colors to log.c
n-eiling Apr 11, 2023
6acdf43
migrate to new elf handling. add decompression support for cuda fatbi…
n-eiling May 4, 2023
975cd31
port to CUDA 12.1
n-eiling May 10, 2023
eeb8e48
fix elf handling to work with a wider variety of CUDA kernels
n-eiling May 10, 2023
66eb961
fix memory leaks identified by gcc sanitizer
n-eiling May 11, 2023
b230687
clean up of unneeded code paths relating to old LD_PRELOADing of serve…
n-eiling May 12, 2023
6e46154
fix cudaMemcpy using correct shm index references
n-eiling May 12, 2023
4a4bd02
fix resource manager add_sorted function inserting and wrong location…
n-eiling May 15, 2023
bde6500
fix wrong decoding of compressed kernels
n-eiling May 16, 2023
33e0fe4
update dockerfiles so they install cuda profiler api and add new Dock…
n-eiling May 16, 2023
dcd9009
if a binary does not contain any kernel cricket should not show any e…
n-eiling May 16, 2023
d944cd9
cricket supports binaries with debug symbols so we should not throw a…
n-eiling May 16, 2023
a0473ac
implement cudaRegisterVar API so that we support cudaMemcpyToSymbol
n-eiling May 18, 2023
0df2fd3
add some driver apis, fix shadowing CUDA functions not working when t…
n-eiling Jun 1, 2023
0641ccc
add nvml support
n-eiling Jun 2, 2023
c9b9726
add nvml to Dockerfiles
n-eiling Jun 2, 2023
3b541b3
add license to pytorch_minimal.py
n-eiling Jun 2, 2023
433930b
add nvml library to dockerfiles
n-eiling Jun 2, 2023
f249f8f
exclude some nvml definitions when compiling with an old CUDA version…
n-eiling Jun 2, 2023
6860540
add cpu-server-nvml head
n-eiling Jun 6, 2023
c849bd7
change c standard to gnu11, improve logging
n-eiling Jun 6, 2023
4c78904
add documentation on how to use pytorch to docs/pytorch.md
n-eiling Jun 7, 2023
5c64748
fix elf decompression handling padding wrong in some circumstances
n-eiling Jun 7, 2023
1c7d39f
fix decompression not working for long uncompressed lz4 segments
n-eiling Jun 7, 2023
c9f09b9
fix potential segfault because of missing variadic parameter in logging
n-eiling Jun 12, 2023
c709acf
use uint64_t for decompressions to fix overflowing of range and lengt…
n-eiling Jun 12, 2023
56ce060
update docs to not deactivate compression as we now support compresse…
n-eiling Jun 12, 2023
da4682e
add v2 implementation of cudaGetDeviceProperties
n-eiling Jun 13, 2023
9f4e797
add libgl dependency to pytorch documentation
n-eiling Jun 13, 2023
8de9fb8
improve support for cuGetProcAddress
n-eiling Jun 13, 2023
d41d195
add cuDNN tests to tests/samples
n-eiling Jun 14, 2023
523d86e
use fixed size rpc array instead of opaque variable length array for …
n-eiling Jun 14, 2023
d786d9c
add cuDNN API stubs
n-eiling Jun 16, 2023
2de0e2e
add cuDNN implementation
n-eiling Jun 20, 2023
0a01b07
use resource managers for cudnn api
n-eiling Jun 20, 2023
9de7292
add more cuDNN APIs
n-eiling Jun 20, 2023
32796d7
add cudnn activation and pooling apis
n-eiling Jun 20, 2023
e180369
implement cudaMemset Async APIs
n-eiling Jun 21, 2023
14838e6
add cudnn dependency to Dockerfiles
n-eiling Jun 21, 2023
b392420
add cudnn LRN api
n-eiling Jun 21, 2023
26e19bd
add server side cudnn lrn implementations, fix some function names
n-eiling Jun 21, 2023
15fc3a2
add basic cuBLAS support
n-eiling Jun 21, 2023
762cada
implement cudnn tensor functions
n-eiling Jun 21, 2023
5d381a7
implement three more cudnn tensor APIs
n-eiling Jun 22, 2023
6da2f8d
add cublas and cudnn functions to support mnistCUDNN sample
n-eiling Jun 26, 2023
8a911ca
fix faulty if statement when intercepting dlopen calls
n-eiling Jun 26, 2023
122b721
improve logging for unloading of modules
n-eiling Jun 26, 2023
e8813ea
improve docs/pytorch.md
n-eiling Jun 26, 2023
e5dbebf
improve cublas implementation, add cudnnBackend implementation
n-eiling Jun 29, 2023
ce21d8a
improve debug output for cuModuleLoad
n-eiling Jul 13, 2023
481dec9
add support for cuModuleLoadData
n-eiling Jul 13, 2023
fbf7dad
cublas: remove usage of new APIs if we compile for CUDA 10
n-eiling Jul 13, 2023
bf3a15e
fix using logger function before initialization
n-eiling Jul 17, 2023
f30d9b0
fix no output on weird shells, e.g. ssh
n-eiling Jul 17, 2023
07db2ba
remove md5
n-eiling Jul 18, 2023
088b6fc
remove cuda 10 support, add cudnn CI test
n-eiling Jul 18, 2023
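
Several commits above concern decompression of LZ4-compressed fatbinary segments ("fix decompression not working for long uncompressed lz4 segments", "use uint64_t for decompressions to fix overflowing of range and lengt…"). The decoder itself is not part of the visible diff, so the following is only a sketch of why a wide accumulator matters. It assumes the length-extension scheme of the standard LZ4 block format: a 4-bit field equal to 15 is followed by extra bytes that are summed until one of them is below 255. `lz4_read_length` and its parameters are hypothetical names, not Cricket functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not taken from the Cricket sources).
 * Decodes an LZ4-style length: `nibble` is the 4-bit field from the token,
 * `in` points at the bytes following the token.
 * On success returns 0, stores the length in *out_len and the number of
 * extension bytes read in *consumed. Returns -1 if the input is truncated. */
int lz4_read_length(const uint8_t *in, size_t in_len, uint8_t nibble,
                    uint64_t *out_len, size_t *consumed)
{
    uint64_t len = nibble;   /* 64-bit accumulator so long runs do not wrap */
    size_t pos = 0;

    if (nibble == 15) {
        uint8_t b;
        do {
            if (pos >= in_len) {
                return -1;   /* input ended in the middle of the length */
            }
            b = in[pos++];
            len += b;
        } while (b == 255);  /* 255 means "another extension byte follows" */
    }
    *out_len = len;
    *consumed = pos;
    return 0;
}
```

With the extension bytes 255, 255, 10 after a nibble of 15, the decoded length is 535, which would already overflow an 8-bit counter.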
5 changes: 5 additions & 0 deletions .gitignore
@@ -4,6 +4,7 @@ build/
.clangd
.project
.cproject
*.code-workspace
.settings/
.vscode/
.directory
@@ -39,3 +40,7 @@ core.*
compile_commands.json
tags
.gdb_history

# perf data
perf.data
main
47 changes: 28 additions & 19 deletions .gitlab-ci.yml
@@ -21,7 +21,7 @@ stages:
##############################################################################

# Build docker image
prepare:centos8:docker-dev:
prepare:rocky9:docker-dev:
stage: prepare
script:
- docker build
@@ -31,13 +31,13 @@ prepare:centos8:docker-dev:
tags:
- docker

prepare:centos8:cuda10:
prepare:centos8:cuda11:
stage: prepare
script:
- docker build
--file utils/Dockerfile.cuda10
--tag ${DOCKER_IMAGE_DEV}_cuda10:${DOCKER_TAG}
--tag ${DOCKER_IMAGE_DEV}_cuda10:latest .
--file utils/Dockerfile.cuda11
--tag ${DOCKER_IMAGE_DEV}_cuda11:${DOCKER_TAG}
--tag ${DOCKER_IMAGE_DEV}_cuda11:latest .
tags:
- docker

@@ -57,7 +57,7 @@ prepare:centos8:cuda10:

build:
stage: build
needs: ["prepare:centos8:docker-dev"]
needs: ["prepare:rocky9:docker-dev"]
script:
- make -j 32 libtirpc
- make -j 32 cuda-gdb
@@ -68,6 +68,7 @@ build:
paths:
- bin
- tests/bin
- tests/samples/samples-bin
image: ${DOCKER_IMAGE_DEV}:${DOCKER_TAG}
cache:
paths:
@@ -82,7 +83,7 @@ build:

build:ib:
stage: build
needs: ["prepare:centos8:docker-dev"]
needs: ["prepare:rocky9:docker-dev"]
script:
- make -j 32 libtirpc
- make -j 32 cuda-gdb
@@ -108,19 +109,19 @@ build:ib:
tags:
- docker

build:cuda10:
build:cuda11:
stage: build
needs: ["prepare:centos8:cuda10"]
needs: ["prepare:centos8:cuda11"]
script:
- make -j 32 libtirpc
- make -j 32 cuda-gdb
- make -j 1 LOG=INFO
- make -j 1 LOG=INFO NOSAMPLES=yes
artifacts:
expire_in: 1 week
paths:
- bin
- tests/bin
image: ${DOCKER_IMAGE_DEV}_cuda10:${DOCKER_TAG}
image: ${DOCKER_IMAGE_DEV}_cuda11:${DOCKER_TAG}
cache:
paths:
- gpu/build
@@ -130,13 +131,13 @@ build:cuda10:
- submodules/libtirpc
- submodules/cuda-gdb
- submodules/cuda-gdb-src.rpm
key: build_cuda10
key: build_cuda11
tags:
- docker

build:debug:
stage: build
needs: ["prepare:centos8:docker-dev"]
needs: ["prepare:rocky9:docker-dev"]
script:
- make -j 32 libtirpc
- make -j 32 cuda-gdb
@@ -170,6 +171,7 @@ build:debug:
LDIR: '$CI_BUILDS_DIR/$CI_PROJECT_PATH/bin'
SAMPLES_PATH: '/usr/local/cuda/samples'
PARAMETER: ''
CHDIR: 'tests'
script:
- mkdir ~/.ssh &&
echo "-----BEGIN OPENSSH PRIVATE KEY-----" > ~/.ssh/id_rsa &&
@@ -179,9 +181,10 @@ build:debug:
echo $KNOWN_HOSTS > ~/.ssh/known_hosts && chmod 600 ~/.ssh/id_rsa
- ssh $GPU_TARGET mkdir -p $RDIR
- scp -r $LDIR/* $GPU_TARGET:$RDIR/
- ssh $GPU_TARGET "LD_PRELOAD=$RDIR/libtirpc.so.3:$RDIR/cricket-server.so $RDIR/$TEST_BINARY" &
- ssh $GPU_TARGET "LD_PRELOAD=$RDIR/libtirpc.so.3 $RDIR/cricket-rpc-server 255" &
- sleep 2
- REMOTE_GPU_ADDRESS="ghost.acs-lab.eonerc.rwth-aachen.de" PATH=$LDIR:$PATH LD_PRELOAD=$LDIR/libtirpc.so.3:$LDIR/cricket-client.so $LDIR/$TEST_BINARY $PARAMETER
- cd $LDIR/$CHDIR
- CRICKET_RPCID=255 REMOTE_GPU_ADDRESS="ghost.acs-lab.eonerc.rwth-aachen.de" PATH=$LDIR:$PATH LD_PRELOAD=$LDIR/libtirpc.so.3:$LDIR/cricket-client.so ./$TEST_BINARY $PARAMETER
after_script:
- ssh $GPU_TARGET rm -rf $RDIR
- ssh $GPU_TARGET pkill -fe -2 $RDIR/test_kernel
@@ -216,21 +219,27 @@ test:test_programs(2/2):
test:test_kernel:
extends: .remote-gpu
variables:
TEST_BINARY: 'tests/kernel.testapp'
TEST_BINARY: 'kernel.testapp'

test:samples:matrixMul:
extends: .remote-gpu
variables:
TEST_BINARY: 'tests/matrixMul'
TEST_BINARY: 'matrixMul.compressed.sample'

test:samples:bandwidthTest:
extends: .remote-gpu
variables:
TEST_BINARY: 'tests/bandwidthTest'
TEST_BINARY: 'bandwidthTest.sample'

test:samples:nbody:
extends: .remote-gpu
variables:
TEST_BINARY: 'tests/nbody'
TEST_BINARY: 'nbody.uncompressed.sample'
PARAMETER: '-benchmark'

test:samples:mnistCUDNN:
extends: .remote-gpu
variables:
CHDIR: '../tests/samples/samples-bin'
TEST_BINARY: 'mnistCUDNN.sample'

7 changes: 4 additions & 3 deletions Makefile
@@ -19,7 +19,7 @@ cuda-gdb:

libtirpc:
@echo -e "\033[36m----> Building libtirpc\033[0m"
$(MAKE) -C submodules libtirpc
$(MAKE) -C submodules libtirpc/install

gpu: cuda-gdb
@echo -e "\033[36m----> Building gpu\033[0m"
@@ -33,7 +33,7 @@ tests:
@echo -e "\033[36m----> Building test kernels\033[0m"
$(MAKE) -C tests

install-cpu: bin/cricket-client.so bin/cricket-server.so bin/libtirpc.so bin/libtirpc.so.3 bin/tests
install-cpu: bin/cricket-client.so bin/cricket-rpc-server bin/libtirpc.so bin/libtirpc.so.3 bin/tests
@echo -e "\033[36m----> Copying cpu binaries to build/bin\033[0m"

install: install-cpu bin/cricket
@@ -51,7 +51,8 @@ bin/cricket-client.so: bin

bin/cricket-server.so: bin
$(MAKE) -C cpu cricket-server.so
cp cpu/cricket-server.so bin
mv cpu/cricket-server.so bin/cricket-server.so


bin/cricket-rpc-server: bin
$(MAKE) -C cpu cricket-rpc-server
55 changes: 35 additions & 20 deletions cpu/Makefile
@@ -1,12 +1,12 @@
#RPC server library
SERVER = cricket-server.so
#Standalone RPC Server
SERVER_BIN = cricket-rpc-server
SERVER = cricket-rpc-server
SERVER_LIB = cricket-server.so
#RPC client library
CLIENT = cricket-client.so

CUDA_SRC = /usr/local/cuda
LIBTIRPC_PREFIX = ../submodules/libtirpc/install
SUBMODULE_LIBS = ../submodules/lib

CC = gcc
LD = gcc
@@ -39,7 +39,10 @@ SRC_SERVER = $(RPC_XDR) \
cr.c \
gsched_none.c \
oob.c \
mt-memcpy.c
mt-memcpy.c \
cpu-elf2.c \
cpu-server-nvml.c \
cpu-server-cudnn.c

SRC_SERVER_LIB = server-library.c
SRC_SERVER_EXE = server-exe.c
@@ -55,7 +58,11 @@ SRC_CLIENT = $(RPC_XDR) \
cpu-libwrap.c \
cpu-client-cusolver.c \
oob.c \
mt-memcpy.c
mt-memcpy.c \
cpu-elf2.c \
cpu-client-nvml.c \
cpu-client-cudnn.c \
cpu-client-cublas.c

# cpu-client-driver-hidden.c \

@@ -72,15 +79,17 @@ RPCGEN_FLAGS = -C -M -N
INC_FLAGS += -I$(LIBTIRPC_PREFIX)/include/tirpc
INC_FLAGS += -I$(CUDA_SRC)/include

LIB_FLAGS += -L$(LIBTIRPC_PREFIX)/lib -L$(CUDA_SRC)/lib64
CC_FLAGS += -std=gnu99 $(INC_FLAGS) -O2
LIB_FLAGS += -L$(LIBTIRPC_PREFIX)/lib
LIB_FLAGS += -L$(CUDA_SRC)/lib64
LIB_FLAGS += -L$(CUDA_SRC)/lib64/stubs
CC_FLAGS += -std=gnu11 $(INC_FLAGS) #-O2
# TODO: use extern in header files instead of direct definition e.g. in cpu-common.h to remove -fcommon flag
CC_FLAGS += -fcommon
LD_FLAGS = $(LIB_FLAGS) -ltirpc -ldl -lcrypto
LD_FLAGS = $(LIB_FLAGS) -ltirpc -ldl -lcrypto -lelf

ifdef WITH_DEBUG
# use ASAN_OPTIONS=protect_shadow_gap=0 LSAN_OPTIONS=fast_unwind_on_malloc=0 when running
CC_FLAGS += -g -ggdb #-fsanitize=address -fsanitize=pointer-compare -fsanitize=pointer-subtract -fsanitize-address-use-after-scope
CC_FLAGS += -g -ggdb #-static-libasan -fsanitize=address -fsanitize=pointer-compare -fsanitize=pointer-subtract -fsanitize-address-use-after-scope
endif

ifdef WITH_IB
@@ -90,48 +99,54 @@ endif
ifdef LOG
CC_FLAGS += -DLOG_LEVEL=LOG_$(LOG)
endif

ifdef LOGN
CC_FLAGS += -DLOG_LEVEL=$(LOGN)
endif

ifdef WITH_IB
CC_FLAGS += -DWITH_IB=$(WITH_IB)
endif

SERVER_LD_FLAGS = $(LD_FLAGS) -lcudart -lcusolver -lcuda -lcublas -lbfd -lrt -lpthread
SERVER_LD_FLAGS = $(LD_FLAGS) -lcudart -lcusolver -lcuda -lcublas -lrt -lpthread -lnvidia-ml -lcudnn
SERVER_BIN_LD_FLAGS = $(SERVER_LD_FLAGS) -Wl,--unresolved-symbols=ignore-in-object-files
CLIENT_LD_FLAGS = $(LD_FLAGS) -lbfd
CLIENT_LD_FLAGS = $(LD_FLAGS)

# Targets
.PHONY: all clean

all : $(SERVER) $(SERVER_BIN) $(CLIENT)
all : $(SERVER) $(CLIENT)

$(CLIENT) : $(OBJ_CLIENT)
$(LD) $(CC_FLAGS) -shared -o $@ $^ $(CLIENT_LD_FLAGS)

$(SERVER) : $(OBJ_SERVER) $(SRC_SERVER_LIB:%.c=%.o)
$(LD) $(CC_FLAGS) -shared -o $@ $^ $(SERVER_LD_FLAGS)
$(SERVER_LIB) : $(OBJ_SERVER) $(SRC_SERVER_EXE:%.c=%.o)
$(LD) $(CC_FLAGS) -shared -o $@ $^ $(SERVER_BIN_LD_FLAGS)

$(SERVER_BIN) : $(OBJ_SERVER) $(SRC_SERVER_EXE:%.c=%.o)
$(SERVER) : $(OBJ_SERVER) $(SRC_SERVER_EXE:%.c=%.o)
$(LD) $(CC_FLAGS) -o $@ $^ $(SERVER_BIN_LD_FLAGS)

$(RPC_H) : $(RPC_DEF)
$(RPCGEN) $(RPCGEN_FLAGS) -h -o $@ $<
rm -f $@ && $(RPCGEN) $(RPCGEN_FLAGS) -h -o $@ $<

$(RPC_CLIENT) : $(RPC_DEF)
$(RPCGEN) $(RPCGEN_FLAGS) -l -o $@ $<
rm -f $@ && $(RPCGEN) $(RPCGEN_FLAGS) -l -o $@ $<

$(RPC_SERVER) : $(RPC_DEF)
$(RPCGEN) $(RPCGEN_FLAGS) -m -o $@ $<
rm -f $@ && $(RPCGEN) $(RPCGEN_FLAGS) -m -o $@ $<

$(RPC_SERVER_MOD) : $(RPC_SERVER)
./generate_dispatch.sh

$(RPC_XDR) : $(RPC_DEF)
$(RPCGEN) $(RPCGEN_FLAGS) -c -o $@ $<
rm -f $@ && $(RPCGEN) $(RPCGEN_FLAGS) -c -o $@ $<

%.o : %.c $(RPC_H)
$(CC) $(CC_FLAGS) -c -fpic -o $@ $< $(LD_FLAGS)

clean:
rm -f $(RPC_H) $(RPC_CLIENT) $(RPC_SERVER) $(RPC_SERVER_BIN) $(RPC_SERVER_MOD) $(RPC_XDR) $(OBJ_CLIENT) $(OBJ_SERVER) $(SERVER) $(CLIENT)
rm -f $(RPC_H) $(RPC_CLIENT) $(RPC_SERVER) $(RPC_SERVER_MOD) $(RPC_XDR) $(OBJ_CLIENT) $(OBJ_SERVER) $(SERVER) $(SERVER_LIB) $(CLIENT) $(SRC_SERVER_EXE:%.c=%.o)
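
The `LOG` and `LOGN` switches in this Makefile only pass `-DLOG_LEVEL=...` to the compiler; how `log.h` consumes the macro is not shown in this diff. The following is a minimal sketch of one way such a compile-time level could gate output. The numeric values, the `LOG_DEBUG` name, the default level, and the `demo_*` functions are assumptions made for illustration.

```c
#include <stdio.h>

/* Assumed numeric levels; the real values live in log.h, which is not shown. */
#define LOG_ERROR 0
#define LOG_INFO  2
#define LOG_DEBUG 3

/* Passing LOG=INFO to the Makefile adds -DLOG_LEVEL=LOG_INFO,
 * passing LOGN=3 adds -DLOG_LEVEL=3. */
#ifndef LOG_LEVEL
#define LOG_LEVEL LOG_INFO   /* assumed default when neither switch is given */
#endif

/* Returns 1 when a message of the given level passes the threshold. */
int demo_log_enabled(int level)
{
    return level <= LOG_LEVEL;
}

/* Prints msg to stderr only if its level passes the compile-time threshold. */
void demo_log(int level, const char *msg)
{
    if (demo_log_enabled(level)) {
        fprintf(stderr, "[%d] %s\n", level, msg);
    }
}
```

Under these assumptions, `LOG` selects a level by name and `LOGN` by number, and both end up in the same `LOG_LEVEL` macro.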




25 changes: 24 additions & 1 deletion cpu/api-recorder.c
@@ -4,11 +4,13 @@

#include "api-recorder.h"
#include "log.h"
#include "list.h"


list api_records;

void api_records_free_args(void)

static void api_records_free_args(void)
{
api_record_t *record;
for (size_t i = 0; i < api_records.length; i++) {
@@ -22,6 +24,27 @@ void api_records_free_args(void)

}

static void api_records_free_data(void)
{
api_record_t *record;
for (size_t i = 0; i < api_records.length; i++) {
if (list_at(&api_records, i, (void**)&record) != 0) {
LOGE(LOG_ERROR, "list_at %zu returned an error.", i);
continue;
}
free(record->data);
record->data = NULL;
}
}


void api_records_free(void)
{
api_records_free_args();
api_records_free_data();
list_free(&api_records);
}

size_t api_records_malloc_get_size(void *ptr)
{
api_record_t *record;
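
The new `api_records_free_data` frees each record's `data` buffer and resets the pointer to `NULL`. A minimal sketch of that pattern is given here, using a plain array in place of the project's `list` type, whose API beyond `list_at` and `list_free` is not visible in this diff. `demo_record_t` and the `demo_*` functions are hypothetical, and resetting `data_size` is an addition that the diff does not make.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for api_record_t, reduced to the two data fields. */
typedef struct {
    void *data;
    size_t data_size;
} demo_record_t;

/* Stores a heap copy of src in the record. Returns 0 on success, -1 on
 * allocation failure. */
int demo_record_set_data(demo_record_t *r, const void *src, size_t size)
{
    r->data = malloc(size);
    if (r->data == NULL) {
        return -1;
    }
    memcpy(r->data, src, size);
    r->data_size = size;
    return 0;
}

/* Frees every record's buffer and clears the pointer so that a later
 * cleanup pass cannot free the same buffer twice. */
void demo_records_free_data(demo_record_t *records, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        free(records[i].data);      /* free(NULL) is a no-op */
        records[i].data = NULL;
        records[i].data_size = 0;   /* not done in the diff; added here */
    }
}
```

Because `free(NULL)` does nothing, calling the cleanup a second time is harmless once the pointers have been cleared.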
5 changes: 4 additions & 1 deletion cpu/api-recorder.h
@@ -35,6 +35,8 @@
*arguments = ARG
#define RECORD_ARG(NUM, ARG) \
arguments->arg##NUM = ARG
#define RECORD_NARG(ARG) \
arguments->ARG = ARG
#define RECORD_DATA(SIZE, PTR) \
record->data_size = SIZE; \
record->data = malloc(SIZE); \
@@ -58,14 +60,15 @@ typedef struct api_record {
void* ptr;
int integer;
ptr_result ptr_result_u;
sz_result sz_result_u;
} result;
void *data;
size_t data_size;
} api_record_t;
extern list api_records;


void api_records_free_args(void);
void api_records_free(void);
void api_records_print(void);
void api_records_print_records(api_record_t *record);
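
The added `RECORD_NARG` macro differs from `RECORD_ARG` in that it addresses the argument struct field by name rather than by number. A small sketch of how the two expand is given here; the macro bodies are copied from the diff, while `demo_args_t` and `demo_record` are hypothetical and only serve to make the expansion visible.

```c
#include <stddef.h>

/* Macro bodies as they appear in cpu/api-recorder.h. */
#define RECORD_ARG(NUM, ARG) \
    arguments->arg##NUM = ARG
#define RECORD_NARG(ARG) \
    arguments->ARG = ARG

/* Hypothetical argument struct with one numbered and one named field. */
typedef struct {
    int arg1;
    size_t count;
} demo_args_t;

/* Both macros expect a pointer called `arguments` to be in scope. */
void demo_record(demo_args_t *arguments, int device, size_t count)
{
    RECORD_ARG(1, device);   /* expands to: arguments->arg1 = device */
    RECORD_NARG(count);      /* expands to: arguments->count = count */
}
```

`RECORD_NARG` therefore only works when the local variable and the struct field share the same name.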
