[Feature] (La)yer-Neigh(bor) sampling implementation (dmlc#4668)
* adding LABOR sampling

* add ladies and pladies samplers

* fix compile error after rebase

* add reference for ladies sampler

* Improve ladies implementation.

* weighted labor sampling initial implementation draft
fix indentation and small bug in ladies script

* importance_sampling currently doesn't work with weights

* fix weighted importance sampling

* move labor example into its own folder

* lint fixes

* Improve documentation

* remove examples from the main PR

* fix linting by not using c++17 features

* fix documentation of labor_sampler.py

* update documentation for labor.py

* reformat the labor.py file with black

* fix linting errors

* replace exception use with if

* fix typo in error comment

* fixing win64 build for ci

* fixing weighted implementation, works now.

* fix bug in the weighted case and importance_sampling==0

* address part of the reviews

* remove unused code paths from cuda

* remove unused code path from cpu side

* remove extra features of labor making use of random seed.

* fix exclude_edges bug

* remove pcg and seed logic from cpu implementation, seed logic should still work for cuda.

* minor style change

* refactor CPU implementation, take out the importance_sampling probability computation into a function.

* improve CUDAWorkspaceAllocator

* refactor importance_sampling part out to a function

* minor optimization

* fix linting issue

* Revert "remove pcg and seed logic from cpu implementation, seed logic should still work for cuda."

This reverts commit c250e07.

* Revert "remove extra features of labor making use of random seed."

This reverts commit 7f99034.

* fix the documentation

* disable NIDs

* improve the documentation in the code

* use the stream argument in pcg32 instead of skipping ahead t times, can discard the use of hashmap now since it is faster this way.

* fix linting issue

* address another round of reviews

* further optimize CPU LABOR sampling implementation

* fix linting error

* update the comment

* reformat

* rename and rephrase comment

* fix formatting according to new linting specs

* fix compile error due to renaming, fix linting.

* lint

* rename DGLHeteroGraph to DGLGraph to match master

* replace other occurrences of DGLHeteroGraph to DGLGraph

Co-authored-by: Muhammed Fatih BALIN
Co-authored-by: Kaan Sancak
Co-authored-by: Quan Gan
3 people authored Nov 22, 2022
1 parent 59f3d6e commit bf264d0
Showing 22 changed files with 2,139 additions and 37 deletions.
3 changes: 3 additions & 0 deletions .gitmodules
@@ -34,3 +34,6 @@
[submodule "third_party/thrust"]
path = third_party/thrust
url = https://github.com/NVIDIA/thrust.git
[submodule "third_party/pcg"]
path = third_party/pcg
url = https://github.com/imneme/pcg-cpp.git
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -204,7 +204,7 @@ target_include_directories(dgl PRIVATE "third_party/METIS/include/")
target_include_directories(dgl PRIVATE "tensoradapter/include")
target_include_directories(dgl PRIVATE "third_party/nanoflann/include")
target_include_directories(dgl PRIVATE "third_party/libxsmm/include")

target_include_directories(dgl PRIVATE "third_party/pcg/include")

# For serialization
if (USE_HDFS)
1 change: 1 addition & 0 deletions CONTRIBUTORS.md
@@ -64,3 +64,4 @@ Contributors
* [Jiahui Liu](https://github.com/paoxiaode) from Nvidia
* [Neil Dickson](https://github.com/ndickson-nvidia) from Nvidia
* [Chang Liu](https://github.com/chang-l) from Nvidia
* [Muhammed Fatih Balin](https://github.com/mfbalin) from Nvidia and Georgia Tech
11 changes: 11 additions & 0 deletions include/dgl/aten/array_ops.h
@@ -47,6 +47,17 @@ IdArray NewIdArray(
int64_t length, DGLContext ctx = DGLContext{kDGLCPU, 0},
uint8_t nbits = 64);

/**
* @brief Create a new float array with given length
* @param length The array length
* @param ctx The array context
* @param nbits The number of bits of each float element
* @return float array
*/
FloatArray NewFloatArray(int64_t length,
DGLContext ctx = DGLContext{kDGLCPU, 0},
uint8_t nbits = 32);

/**
* @brief Create a new id array using the given vector data
* @param vec The vector data
62 changes: 61 additions & 1 deletion include/dgl/aten/coo.h
@@ -1,6 +1,6 @@

/**
* Copyright (c) 2020 by Contributors
* Copyright (c) 2020-2022 by Contributors
* @file dgl/aten/coo.h
* @brief Common COO operations required by DGL.
*/
@@ -379,6 +379,66 @@ COOMatrix COORemove(COOMatrix coo, IdArray entries);
COOMatrix COOReorder(
COOMatrix coo, runtime::NDArray new_row_ids, runtime::NDArray new_col_ids);

/**
* @brief Randomly select a fixed number of non-zero entries along each given
* row using arXiv:2210.13339, Labor sampling.
*
* The picked indices are returned in the form of a COO matrix.
*
* The passed random_seed ensures that, for any seed vertex s and its
* neighbor t, the rolled random variate r_t is the same for every call to this
* function with the same random seed. When sampling as part of the same batch,
* one would want identical seeds so that LABOR can sample globally. For
* example, for heterogeneous graphs, a single random seed is shared by all
* edge types. This samples far fewer vertices than having
* unique random seeds for each edge type. If one called this function
* individually for each edge type of a heterogeneous graph with different
* random seeds, it would run LABOR locally for each edge type, resulting
* in a larger number of vertices being sampled.
*
* If this function is called without a random_seed, the random seed is
* obtained by drawing a random number from DGL.
*
*
* Examples:
*
* // coo.num_rows = 4;
* // coo.num_cols = 4;
* // coo.rows = [0, 0, 1, 3, 3]
* // coo.cols = [0, 1, 1, 2, 3]
* // coo.data = [2, 3, 0, 1, 4]
* COOMatrix coo = ...;
* IdArray rows = ... ; // [1, 3]
* COOMatrix sampled = COOLaborSampling(coo, rows, 2, NullArray(), 0,
* NullArray(), NullArray()).first;
* // possible sampled coo matrix:
* // sampled.num_rows = 4
* // sampled.num_cols = 4
* // sampled.rows = [1, 3, 3]
* // sampled.cols = [1, 2, 3]
* // sampled.data = [0, 1, 4]
*
* @param mat Input coo matrix.
* @param rows Rows to sample from.
* @param num_samples Number of samples using labor sampling
* @param prob Probability array for nonuniform sampling
* @param importance_sampling Whether to enable importance sampling
* @param random_seed The random seed for the sampler
* @param NIDs global nids if sampling from a subgraph
* @return A pair of a COOMatrix storing the picked row and col indices, and
* the edge weights if importance_sampling != 0 or the prob argument was
* passed. The data field of the COOMatrix stores the index of the picked
* elements in the value array.
*/
std::pair<COOMatrix, FloatArray> COOLaborSampling(
COOMatrix mat,
IdArray rows,
int64_t num_samples,
FloatArray prob = NullArray(),
int importance_sampling = 0,
IdArray random_seed = NullArray(),
IdArray NIDs = NullArray());

/**
* @brief Randomly select a fixed number of non-zero entries along each given
* row independently.
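
The random_seed paragraph in the COOLaborSampling comment above can be illustrated with a small Python sketch. It is not the DGL implementation: `variate` and `labor0_sample` are hypothetical helpers, the hash-based variate stands in for the pcg32 generator the commit uses, and the acceptance rule `r_t <= fanout / degree(s)` is a simplified reading of the uniform case (`importance_sampling == 0`, no `prob`). The point it shows is that r_t depends only on the random seed and the neighbor t, never on the seed vertex s.

```python
import hashlib


def variate(random_seed: int, vertex: int) -> float:
    """Pseudo-random value in [0, 1] that depends only on (random_seed, vertex)."""
    digest = hashlib.blake2b(
        f"{random_seed}:{vertex}".encode(), digest_size=8
    ).digest()
    return int.from_bytes(digest, "big") / 2**64


def labor0_sample(neighbors, seeds, fanout, random_seed):
    """Keep edge (s, t) when r_t <= fanout / degree(s).

    `neighbors` maps each seed vertex to the list of its neighbor ids.
    Assumed simplification of the uniform LABOR case, for illustration only.
    """
    picked = []
    for s in seeds:
        nbrs = neighbors.get(s, [])
        if not nbrs:
            continue
        threshold = fanout / len(nbrs)
        for t in nbrs:
            # r_t is shared by every seed vertex that has t as a neighbor.
            if variate(random_seed, t) <= threshold:
                picked.append((s, t))
    return picked


# Adjacency of the 4x4 example matrix from the comment: rows 1 and 3 have
# at most 2 neighbors, so a fanout of 2 keeps all of their edges.
example = {0: [0, 1], 1: [1], 2: [], 3: [2, 3]}
print(labor0_sample(example, [1, 3], fanout=2, random_seed=0))
# [(1, 1), (3, 2), (3, 3)]

# Two seed vertices with the same neighborhood pick the same neighbors,
# because both compare the same r_t against the same threshold.
shared = {0: [10, 11, 12, 13], 1: [10, 11, 12, 13]}
same_seed = labor0_sample(shared, [0, 1], fanout=1, random_seed=7)
print(sorted({t for _, t in same_seed}))
```

Calling `labor0_sample` once per seed vertex with a different `random_seed` each time would make the picks independent, which is the "local" behavior the comment warns about: the union of picked neighbors then tends to be larger.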
63 changes: 61 additions & 2 deletions include/dgl/aten/csr.h
@@ -1,5 +1,5 @@
/**
* Copyright (c) 2020 by Contributors
* Copyright (c) 2020-2022 by Contributors
* @file dgl/aten/csr.h
* @brief Common CSR operations required by DGL.
*/
@@ -409,7 +409,66 @@ CSRMatrix CSRRemove(CSRMatrix csr, IdArray entries);

/**
* @brief Randomly select a fixed number of non-zero entries along each given
* row independently.
* row using arXiv:2210.13339, Labor sampling.
*
* The picked indices are returned in the form of a COO matrix.
*
* The passed random_seed ensures that, for any seed vertex s and its
* neighbor t, the rolled random variate r_t is the same for every call to this
* function with the same random seed. When sampling as part of the same batch,
* one would want identical seeds so that LABOR can sample globally. For
* example, for heterogeneous graphs, a single random seed is shared by all
* edge types. This samples far fewer vertices than having
* unique random seeds for each edge type. If one called this function
* individually for each edge type of a heterogeneous graph with different
* random seeds, it would run LABOR locally for each edge type, resulting
* in a larger number of vertices being sampled.
*
* If this function is called without a random_seed, the random seed is
* obtained by drawing a random number from DGL.
*
*
* Examples:
*
* // csr.num_rows = 4;
* // csr.num_cols = 4;
* // csr.indptr = [0, 2, 3, 3, 5]
* // csr.indices = [0, 1, 1, 2, 3]
* // csr.data = [2, 3, 0, 1, 4]
* CSRMatrix csr = ...;
* IdArray rows = ... ; // [1, 3]
* COOMatrix sampled = CSRLaborSampling(csr, rows, 2, NullArray(), 0,
* NullArray(), NullArray()).first;
* // possible sampled coo matrix:
* // sampled.num_rows = 4
* // sampled.num_cols = 4
* // sampled.rows = [1, 3, 3]
* // sampled.cols = [1, 2, 3]
* // sampled.data = [0, 1, 4]
*
* @param mat Input CSR matrix.
* @param rows Rows to sample from.
* @param num_samples Number of samples using labor sampling
* @param prob Probability array for nonuniform sampling
* @param importance_sampling Whether to enable importance sampling
* @param random_seed The random seed for the sampler
* @param NIDs global nids if sampling from a subgraph
* @return A pair of a COOMatrix storing the picked row and col indices, and
* the edge weights if importance_sampling != 0 or the prob argument was
* passed. The data field of the COOMatrix stores the index of the picked
* elements in the value array.
*/
std::pair<COOMatrix, FloatArray> CSRLaborSampling(
CSRMatrix mat,
IdArray rows,
int64_t num_samples,
FloatArray prob = NullArray(),
int importance_sampling = 0,
IdArray random_seed = NullArray(),
IdArray NIDs = NullArray());

/*!
* @brief Randomly select a fixed number of non-zero entries along each given row independently.
*
* The function performs random choices along each row independently.
* The picked indices are returned in the form of a COO matrix.
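
The CSR example in the CSRLaborSampling comment describes the same 4x4 matrix as the COO example in coo.h. The Python sketch below expands the `indptr`/`indices`/`data` arrays into COO form and lists every stored entry of the requested rows. `csr_to_coo` and `take_rows` are hypothetical helpers written for this illustration, not DGL functions. Since neither requested row has more than two stored entries, these entries are also the only ones a call with `num_samples = 2` could pick.

```python
def csr_to_coo(indptr, indices, data):
    """Expand CSR row pointers into one explicit row id per stored entry."""
    rows = []
    for r in range(len(indptr) - 1):
        rows.extend([r] * (indptr[r + 1] - indptr[r]))
    return rows, list(indices), list(data)


def take_rows(indptr, indices, data, rows):
    """Return every stored entry of the requested rows as COO arrays."""
    out_rows, out_cols, out_data = [], [], []
    for r in rows:
        for pos in range(indptr[r], indptr[r + 1]):
            out_rows.append(r)
            out_cols.append(indices[pos])
            out_data.append(data[pos])
    return out_rows, out_cols, out_data


# Arrays from the comment's example.
indptr = [0, 2, 3, 3, 5]
indices = [0, 1, 1, 2, 3]
data = [2, 3, 0, 1, 4]

print(csr_to_coo(indptr, indices, data))
# ([0, 0, 1, 3, 3], [0, 1, 1, 2, 3], [2, 3, 0, 1, 4])

print(take_rows(indptr, indices, data, [1, 3]))
# ([1, 3, 3], [1, 2, 3], [0, 1, 4])
```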
4 changes: 2 additions & 2 deletions include/dgl/sampler.h
@@ -24,7 +24,7 @@ class SamplerOp {
* @brief Sample a graph from the seed vertices with neighbor sampling.
* The neighbors are sampled with a uniform distribution.
*
* @param graphs A graph for sampling.
* @param graph A graph for sampling.
* @param seeds the nodes where we should start to sample.
* @param edge_type the type of edges we should sample neighbors.
* @param num_hops the number of hops to sample neighbors.
@@ -43,7 +43,7 @@ class SamplerOp {
* @brief Sample a graph from the seed vertices with layer sampling.
* The layers are sampled with a uniform distribution.
*
* @param graphs A graph for sampling.
* @param graph A graph for sampling.
* @param seeds the nodes where we should start to sample.
* @param edge_type the type of edges we should sample neighbors.
* @param layer_sizes The size of layers.
1 change: 1 addition & 0 deletions python/dgl/dataloading/__init__.py
@@ -4,6 +4,7 @@
from .base import *
from .cluster_gcn import *
from .graphsaint import *
from .labor_sampler import *
from .neighbor_sampler import *
from .shadow import *
