doc/modules/ROOT/pages/machine-learning/node-embeddings/hashgnn.adoc (12 additions, 15 deletions)
@@ -29,10 +29,6 @@ Moreover, the heterogeneous generalization also gives comparable results when co
 
 The execution does not require GPUs, as GNNs typically do, and parallelizes well across many CPU cores.
 
-For more information on this algorithm, see:
-
-* https://arxiv.org/pdf/2105.14280.pdf[W.Wu, B.Li, C.Luo and W.Nejdl "Hashing-Accelerated Graph Neural Networks for Link Prediction"^]
-
 === The algorithm
 
 To clarify how HashGNN works, we will walk through a virtual example <<algorithms-embeddings-hashgnn-virtual-example, below>> of a three-node graph for the reader who is curious about the details of the feature selection and prefers to learn from examples.
@@ -41,7 +37,7 @@ The HashGNN algorithm can only run on binary features.
 Therefore, there is an optional first step to transform (possibly non-binary) input features into binary features as part of the algorithm.
 
 For a number of iterations, a new binary embedding is computed for each node using the embeddings of the previous iteration.
-In the first iteration, the previous embeddings are the binary feature vectors.
+In the first iteration, the previous embeddings are the input feature vectors or the binarized input vectors.
 
 During one iteration, each node embedding vector is constructed by taking `K` random samples.
 The random sampling is carried out by successively selecting features with lowest min-hash values.
@@ -50,7 +46,7 @@ Features of each node itself and of its neighbours are both considered.
 There are three types of hash functions involved: 1) a function applied to a node's own features, 2) a function applied to a subset of neighbors' features, and 3) a function applied to all neighbors' features to select the subset for hash function 2).
 For each iteration and sampling round `k<K`, new hash functions are used, and the third function also varies depending on the relationship type connecting to the neighbor it is being applied on.
 
-The sampling is consistent in the sense that if nodes `a` and `b` have identical or similar local graphs, the samples for `a` and `b` are also identical or similar.
+The sampling is consistent in the sense that if nodes (`a`) and (`b`) have identical or similar local graphs, the samples for (`a`) and (`b`) are also identical or similar.
 By local graph, we mean the subgraph with features and relationship types, containing all nodes at most `iterations` hops away.
 
 The number `K` is called `embeddingDensity` in the configuration of the algorithm.
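As a rough illustration of the sampling described in this hunk, the following Python sketch performs a single sampling round for one node. It is not the GDS implementation: the helper names (`make_hash`, `sample_round`), the universal-hash construction, and the way `neighbor_influence` rescales neighbor hash values are invented for the example, and the third hash function that selects the neighbor subset is omitted.

[source, python]
----
import random

def make_hash(seed):
    """A simple stand-in hash function over feature ids."""
    rng = random.Random(seed)
    a, b, p = rng.randrange(1, 2**31 - 1), rng.randrange(0, 2**31 - 1), 2**31 - 1
    return lambda feature: (a * hash(feature) + b) % p

def sample_round(own_features, neighbor_features, seed, neighbor_influence=1.0):
    """Select one feature for this round: the node's own or a neighbor's,
    whichever achieves the lower (scaled) min-hash value."""
    h_own, h_nbr = make_hash(2 * seed), make_hash(2 * seed + 1)
    best_own = min(((h_own(f), f) for f in own_features), default=(float("inf"), None))
    # Dividing neighbor hash values by neighbor_influence makes neighbor features
    # win more often when neighbor_influence is large (a simplification).
    best_nbr = min(((h_nbr(f) / neighbor_influence, f) for f in neighbor_features),
                   default=(float("inf"), None))
    return best_own[1] if best_own[0] <= best_nbr[0] else best_nbr[1]

# Consistency: two nodes with identical local graphs draw identical samples.
assert sample_round({1, 5}, {2, 7}, seed=0) == sample_round({1, 5}, {2, 7}, seed=0)
----

Running `sample_round` `K` times with different seeds and taking the union of the selected features corresponds to building one node embedding for one iteration.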
@@ -59,8 +55,8 @@ The algorithm ends with another optional step that maps the binary embeddings to
 
 === Features
 
-The HashGNN algorithm assumes that nodes have binary features as input, and produces binary embedding vectors as output.
-Since this is not always the case for real-world graphs, the algorithm also comes with an option to binarize node properties.
+The original HashGNN algorithm assumes that nodes have binary features as input, and produces binary embedding vectors as output (unless output densification is opted for).
+Since this is not always the case for real-world graphs, our algorithm also comes with an option to binarize node properties.
 
 This is done using a type of hyperplane rounding and is configured via a map parameter `binarizeFeatures` containing `densityLevel` and `dimension`.
 The hyperplane rounding uses hyperplanes defined by vectors that are potentially sparse.
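To make the hyperplane rounding more concrete, here is a sketch that binarizes a dense feature matrix. It assumes, for illustration only, that `densityLevel` controls how many +1 and -1 entries each sparse hyperplane gets; the exact construction used by GDS may differ.

[source, python]
----
import numpy as np

def binarize(features, dimension, density_level, seed=0):
    """Map (num_nodes, num_props) floats to (num_nodes, dimension) binary features.

    Bit d of a node is set when its feature vector has a positive dot product
    with the d-th sparse random hyperplane."""
    rng = np.random.default_rng(seed)
    num_props = features.shape[1]
    hyperplanes = np.zeros((dimension, num_props))
    for d in range(dimension):
        nonzero = rng.choice(num_props, size=min(2 * density_level, num_props), replace=False)
        half = len(nonzero) // 2
        hyperplanes[d, nonzero[:half]] = 1.0
        hyperplanes[d, nonzero[half:]] = -1.0
    return (features @ hyperplanes.T > 0).astype(np.uint8)

# Three nodes with 4-dimensional float properties become 8 binary features each.
X = np.array([[0.2, 1.0, -0.5, 3.0],
              [0.1, 0.9, -0.4, 2.5],
              [-2.0, 0.0, 4.0, -1.0]])
bits = binarize(X, dimension=8, density_level=2)
----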
@@ -84,13 +80,13 @@ Increasing the value leads to neighbors being selected more often.
 The probability of selecting a feature from the neighbors as a function of `neighborInfluence` has a hockey-stick-like shape, somewhat similar to the shape of `y=log(x)` or `y=C - 1/x`.
 This implies that the probability is more sensitive for low values of `neighborInfluence`.
 
-=== Heterogeneous HashGNN
+=== Heterogeneity support
 
 The GDS implementation of HashGNN provides a new generalization to heterogeneous graphs in that it can distinguish between different relationship types.
 To enable the heterogeneous support, set `heterogeneous` to true.
-The generalization works as the original HashGNN algorithm, but whenever a hash function is applied to a feature of a neighbor node, the algorithm uses a hash function that depends not only on the iteration and on a number `k<embeddingDensity`, but also on the relationship type connecting to the neighbor.
+The generalization works as the original HashGNN algorithm, but whenever a hash function is applied to a feature of a neighbor node, the algorithm uses a hash function that depends not only on the iteration and on a number `k<embeddingDensity`, but also on the type of the relationship connecting to the neighbor.
 Consider an example where HashGNN is run with one iteration, and we have `(a)-[:R]->(x), (b)-[:R]->(x)` and `(c)-[:S]->(x)`.
-Assume that a feature `f` is selected for `(a)` and the hash value is very low.
+Assume that a feature `f` of `(x)` is selected for `(a)` and the hash value is very small.
 This will make it very likely that the feature is also selected for `(b)`.
 There will however be no correlation to `f` being selected for `(c)` when considering the relationship `(c)-[:S]->(x)`, because a different hash function is used for `S`.
 We can conclude that nodes with similar neighborhoods (including node properties and relationship types) get similar embeddings, while nodes that have less similar neighborhoods get less similar embeddings.
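The effect of using a separate hash function per relationship type can be sketched as follows; the hashing scheme is made up for the example and only illustrates why samples correlate across `:R` but not across `:S`.

[source, python]
----
import random

def type_hash(iteration, k, rel_type):
    """One hash function per (iteration, sampling round, relationship type)."""
    rng = random.Random(hash((iteration, k, rel_type)))
    a, b, p = rng.randrange(1, 2**31 - 1), rng.randrange(0, 2**31 - 1), 2**31 - 1
    return lambda feature: (a * hash(feature) + b) % p

# A feature of (x) hashes identically whether (x) is reached via :R from (a) or (b)...
assert type_hash(0, 0, "R")("f") == type_hash(0, 0, "R")("f")
# ...but independently when reached via :S from (c), since :S gets its own function.
h_s = type_hash(0, 0, "S")
----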
@@ -120,7 +116,7 @@ For example, in a citation network, a paper citing another paper is very differe
 Since binary embeddings need to be of higher dimension than dense floating point embeddings to encode the same amount of information, binary embeddings require more memory and longer training time for downstream models.
 The output embeddings can optionally be densified by using random projection, similar to what is done to initialize FastRP with node properties.
 This behavior is activated by specifying `outputDimension`.
-Output densification can improve runtime and memory at the cost of introducing approximation error due to the random nature of the projection.
+Output densification can improve runtime and memory of downstream tasks at the cost of introducing approximation error due to the random nature of the projection.
 The larger the `outputDimension`, the lower the approximation error, but also the smaller the performance savings.
 
 === Usage in machine learning pipelines
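The random-projection densification described just above can be sketched like this, assuming a simple +1/-1 projection matrix; the projection actually used by GDS may differ.

[source, python]
----
import numpy as np

def densify(binary_embeddings, output_dimension, seed=0):
    """Project (num_nodes, num_bits) binary embeddings to (num_nodes, output_dimension) floats."""
    rng = np.random.default_rng(seed)
    num_bits = binary_embeddings.shape[1]
    projection = rng.choice([-1.0, 1.0], size=(num_bits, output_dimension))
    # Scaling keeps inner products roughly comparable across output dimensions.
    return (binary_embeddings @ projection) / np.sqrt(output_dimension)

binary = np.random.default_rng(1).integers(0, 2, size=(5, 1024))
dense = densify(binary, output_dimension=64)
----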
@@ -140,7 +136,7 @@ We will go through each of the configuration parameters and explain how they beh
 === Iterations
 The maximum number of hops between a node and other nodes that affect its embedding is equal to the number of iterations of HashGNN, which is configured with `iterations`.
 This is analogous to the number of layers in a GNN or the number of iterations in FastRP.
-Often a value of `2` to `4` is sufficient.
+Often a value of `2` to `4` is sufficient, but sometimes more iterations are useful.
 
 === Embedding density
 
@@ -164,11 +160,11 @@ The sparsity of the raw features and the input dimension can also affect the bes
 
 As explained above, the default value is a reasonable starting point.
 If using a hyperparameter tuning library, this parameter may favorably be transformed by a function with increasing derivative such as the exponential function, or a function of the type `a/(b - x)`.
-The probability of selecting (and keeping throughout the iterations) a feature from different node depends on `neighborInfluence` the number of hops to the node.
+The probability of selecting (and keeping throughout the iterations) a feature from a different node depends on `neighborInfluence` and the number of hops to that node.
 Therefore, `neighborInfluence` should be re-tuned when `iterations` is changed.
 
 === Heterogeneous
-In general, there is a large amount of information to store about typed paths in a heterogeneous graph, so with many iterations and relationship types the information will become blurred unless the embedding dimension is very large.
+In general, there is a large amount of information to store about paths containing multiple relationship types in a heterogeneous graph, so with many iterations and relationship types, a very high embedding dimension may be necessary.
 This is especially true for unsupervised embedding algorithms such as HashGNN.
 Therefore, caution should be taken when using many iterations in the heterogeneous mode.
 
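A hypothetical way to act on the tuning advice above: sample `neighborInfluence` uniformly in log space, so the sensitive low end of the range is explored more densely. The bounds are arbitrary example values.

[source, python]
----
import math
import random

def sample_neighbor_influence(rng, low=0.25, high=4.0):
    # Uniform in log space, then mapped through exp (an increasing-derivative transform).
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
candidates = sorted(sample_neighbor_influence(rng) for _ in range(5))
----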
@@ -557,6 +553,7 @@ Perhaps the below example is best enjoyed with a pen and paper.
 Let's say we have a node `a` with feature `f1`, a node `b` with feature `f2`, and a node `c` with features `f1` and `f3`.
 The graph structure is `a--b--c`.
 We imagine running HashGNN for one iteration with `embeddingDensity=2`.
+For simplicity, we will assume that the hash functions return some made-up numbers as we go.
 
 During the first iteration and `k=0`, we compute an embedding for `(a)`.
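For readers who prefer code to pen and paper, here is a simplified, self-contained rerun of the same setup. The hash functions are arbitrary stand-ins, `neighborInfluence` and the neighbor-subset hash are ignored, and the features it ends up selecting will not match the made-up numbers used in the walk-through; only the mechanics are the same.

[source, python]
----
import random

def make_hash(seed):
    rng = random.Random(seed)
    a, b, p = rng.randrange(1, 2**31 - 1), rng.randrange(0, 2**31 - 1), 2**31 - 1
    return lambda feature: (a * hash(feature) + b) % p

features = {"a": {"f1"}, "b": {"f2"}, "c": {"f1", "f3"}}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
K = 2  # embeddingDensity

new_embeddings = {}
for node in features:
    selected = set()
    for k in range(K):
        h_own, h_nbr = make_hash(2 * k), make_hash(2 * k + 1)
        own_best = min(features[node], key=h_own)
        nbr_feats = {f for n in neighbors[node] for f in features[n]}
        nbr_best = min(nbr_feats, key=h_nbr)
        # Keep whichever candidate has the smaller hash value in this round.
        selected.add(own_best if h_own(own_best) <= h_nbr(nbr_best) else nbr_best)
    new_embeddings[node] = selected
----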
 | featureProperties | List of String | [] | yes | The names of the node properties that should be used as input features. All property names must exist in the projected graph and be of type Float or List of Float.
-| iterations | Integer | n/a | no | The number of iterations to run HashGNN.
-| embeddingDensity | Integer | n/a | no | The number of features to sample per node in each iteration. Called `K` in the original paper.
+| iterations | Integer | n/a | no | The number of iterations to run HashGNN. Must be at least 1.
+| embeddingDensity | Integer | n/a | no | The number of features to sample per node in each iteration. Called `K` in the original paper. Must be at least 1.
 | heterogeneous | Boolean | false | yes | Whether different relationship types should be treated differently.
-| neighborInfluence | Float | 1.0 | yes | Controls how often neighbors' features are sampled in each iteration relative to sampling the node's own features
-| binarizeFeatures | Map | n/a | yes | A map with keys `dimension` and `densityLevel`. If given, features are transformed into `dimension` binary features via hyperplane rounding.
-| outputDimension | Integer | n/a | yes | If given, the embeddings are projected randomly into `outputDimension` dense features.
+| neighborInfluence | Float | 1.0 | yes | Controls how often neighbors' features are sampled in each iteration relative to sampling the node's own features. Must be non-negative.
+| binarizeFeatures | Map | n/a | yes | A map with keys `dimension` and `densityLevel`. If given, features are transformed into `dimension` binary features via hyperplane rounding. Both must be at least 1 and `densityLevel` at most `dimension / 2`.
+| outputDimension | Integer | n/a | yes | If given, the embeddings are projected randomly into `outputDimension` dense features. Must be at least 1.
 | randomSeed | Integer | n/a | yes | A random seed which is used for all randomness in computing the embeddings.
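Putting the table together, a hypothetical configuration map that satisfies the stated constraints; the property names and all values are arbitrary examples, not recommendations.

[source, python]
----
config = {
    "featureProperties": ["age", "scores"],  # must exist as Float or List of Float properties
    "iterations": 3,                         # at least 1
    "embeddingDensity": 128,                 # at least 1; K in the original paper
    "heterogeneous": True,
    "neighborInfluence": 1.0,                # non-negative
    "binarizeFeatures": {"dimension": 1024, "densityLevel": 4},  # densityLevel <= dimension / 2
    "outputDimension": 256,                  # at least 1
    "randomSeed": 42,
}
----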