GNNDrive is a disk-based GNN training framework designed to optimize training on large-scale graphs using a single machine with ordinary hardware components, such as a CPU, a GPU, and limited memory. GNNDrive minimizes the memory footprint of feature extraction to reduce memory contention between sampling and extraction. It also introduces asynchronous feature extraction to mitigate the I/O congestion caused by massive data movement.
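The idea behind asynchronous feature extraction can be sketched as follows. This is a minimal illustrative sketch under our own assumptions, not GNNDrive's actual implementation: a background thread prefetches feature batches into a bounded queue so that disk I/O overlaps with computation, while the queue's bound caps the memory footprint. All function names here (`extract_features`, `train`, etc.) are hypothetical stand-ins.

```python
# Conceptual sketch only (not GNNDrive's code): overlap feature I/O
# with computation using a background prefetch thread.
import threading
import queue
import time

def extract_features(batch_ids):
    # Stand-in for reading node features from disk.
    time.sleep(0.01)
    return [i * 2 for i in batch_ids]

def prefetch_worker(batches, out_q):
    for batch in batches:
        out_q.put(extract_features(batch))  # I/O runs off the main thread
    out_q.put(None)  # sentinel: no more batches

def train(batches):
    out_q = queue.Queue(maxsize=4)  # bounded queue caps memory footprint
    t = threading.Thread(target=prefetch_worker, args=(batches, out_q))
    t.start()
    results = []
    while (feats := out_q.get()) is not None:
        results.append(sum(feats))  # stand-in for one training step
    t.join()
    return results

losses = train([[0, 1], [2, 3]])
```

With a bounded queue, the extractor blocks once it gets too far ahead of the trainer, which is one way to keep sampling and extraction from competing for the same limited memory.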
Follow the instructions below to install the requirements and run an example using the ogbn_papers100M dataset.
- Clone our library

  ```bash
  git clone
  ```
- Run Docker

  - Build the docker image

    ```bash
    cd docker
    docker build -t GNN:gpu .
    ```

  - Install nvidia-container-runtime for docker

    ```bash
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
      sudo apt-key add -
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
    sudo apt-get update
    sudo apt-get install nvidia-container-runtime
    ```

    Please refer to the following links for more details.
  - Run the container with limited memory

    ```bash
    docker run --gpus all -it --ipc=host \
      --name GNN-16g --memory 16G --memory-swap 32G \
      -v /path-to-file:/working_dir/ GNN:gpu bash
    ```

    Note: `--memory` limits the maximum amount of memory the container can use. Please refer to the following link for more details.
- Install the necessary libraries

  - liburing

    ```bash
    # download
    wget https://github.com/axboe/liburing/archive/refs/tags/liburing-2.1.zip
    unzip liburing-2.1.zip
    # install
    cd liburing-2.1
    ./configure --cc=gcc --cxx=g++
    make -j$(nproc)
    make install
    ```

    Please refer to the following link for more details.

  - Ninja

    ```bash
    wget https://github.com/ninja-build/ninja/releases/download/v1.8.2/ninja-linux.zip
    unzip ninja-linux.zip -d /usr/local/bin/
    update-alternatives --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force
    ```
- Prepare the dataset

  ```bash
  python3 prepare_dataset_ogbn.py
  ```
- Preprocess for the baseline, i.e., Ginex

  ```bash
  python3 create_neigh_cache.py --neigh-cache-size 6000000000
  ```
- Run the baselines

  ```bash
  # run PyG+
  python3 run_baseline.py
  # run Ginex
  python3 run_ginex.py --neigh-cache-size 6000000000 \
      --feature-cache-size 6000000000 --sb-size 1500
  ```
- Run GNNDrive

  ```bash
  # run without data parallelism on the GPU
  python3 run_async.py --compute-type gpu
  # run without data parallelism on the CPU
  python3 run_async.py --compute-type cpu
  # run with data parallelism using 2 subprocesses on the GPU
  python3 run_async_multi.py --compute-type gpu \
      --world-size 2
  # run with data parallelism using 2 subprocesses on the CPU
  python3 run_async_multi.py --compute-type cpu \
      --world-size 2
  ```

  Note: `--compute-type` selects whether training runs on the GPU or the CPU. `--world-size` sets the number of subprocesses used for training.
- Qisheng Jiang ([email protected])
- Lei Jia ([email protected])
We thank the authors of Ginex for providing the source code of Ginex and PyG+. Our implementation uses some functions of Ginex.