- Apply different weights to different nodes in the neighborhood without prior knowledge of the graph's structure
- Spectral vs non-spectral approaches to graph convolutions
- Spectral filters are learned in the eigenbasis of the graph Laplacian, so they depend on the specific graph structure: a filter trained on one graph does not transfer to another (sketched below)
- Non-spectral filters apply convolution directly on the graph, operating on spatially close neighbors
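
To make the structure-dependence concrete, here is a minimal NumPy sketch (the adjacency matrix `A` and the function name are illustrative assumptions): a spectral filter is defined on the eigenbasis of one particular graph's normalized Laplacian, so a different graph yields a different basis.

```python
import numpy as np

def laplacian_eigenbasis(A: np.ndarray):
    """Eigenbasis of the symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return np.linalg.eigh(L)

# 3-node path graph; a spectral filter g acts as U @ g(eigvals) @ U.T @ x,
# so the learned filter is tied to this specific graph's eigenbasis U.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
eigvals, U = laplacian_eigenbasis(A)
```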
- Sampling a fixed-size neighborhood (weighted for attention, self-link for self-attention) and aggregating

Attention
- input: a set of node features h = {h_1, h_2, ..., h_N}, each h_i a vector of F features
- output: a new set of node features h' = {h'_1, h'_2, ..., h'_N}, each h'_i a vector of F' features (possibly of a different dimension)
- At least one learnable linear transformation is required to obtain higher-level features; a shared weight matrix W is applied to every node (Wh_i)
- e_ij = a(Wh_i, Wh_j), where e_ij is the importance of node j's features to node i and a is a shared attention mechanism
- Masked attention: we only compute e_ij for nodes j in i's neighborhood N_i
- || is concatenation: the attention mechanism computes e_ij = LeakyReLU(a^T [Wh_i || Wh_j]), and the e_ij are normalized into coefficients α_ij with a softmax over N_i
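
Putting the pieces above together, a minimal NumPy sketch of a single attention head (function and variable names are assumptions; the 0.2 LeakyReLU slope follows the GAT paper):

```python
import numpy as np

def leaky_relu(x, alpha=0.2):  # negative slope 0.2, as in the GAT paper
    return np.where(x > 0, x, alpha * x)

def gat_head(h, adj, W, a):
    """One attention head.
    h:   (N, F)  input node features
    adj: (N, N)  adjacency with self-links (1 if j is in N_i, else 0)
    W:   (F, F') shared linear transformation
    a:   (2F',)  attention vector applied to [Wh_i || Wh_j]
    """
    Wh = h @ W                                   # shared linear transform: (N, F')
    # a^T [Wh_i || Wh_j] splits into a_1^T Wh_i + a_2^T Wh_j
    src = Wh @ a[: W.shape[1]]                   # contribution of node i, (N,)
    dst = Wh @ a[W.shape[1]:]                    # contribution of node j, (N,)
    e = leaky_relu(src[:, None] + dst[None, :])  # e_ij for all pairs, (N, N)
    # masked attention: keep only j in N_i, then normalize with softmax
    e = np.where(adj > 0, e, -np.inf)
    alpha_ij = np.exp(e - e.max(axis=1, keepdims=True))
    alpha_ij = alpha_ij / alpha_ij.sum(axis=1, keepdims=True)
    return alpha_ij @ Wh                         # weighted neighbor aggregation, (N, F')
```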
- To stabilize the learning process of self-attention, the mechanism is extended to multi-head attention: K independent heads are computed and their outputs concatenated (or averaged on the final layer); see the sketch at the end of this list
- Highly efficient: attention can be computed in parallel across edges; a single head costs O(|V|FF' + |E|F'), where F is the number of input features per node and F' the number of output features
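
A sketch of the multi-head extension, reusing the hypothetical `gat_head` from the sketch above: K heads are run independently (hence parallelizable) and concatenated, or averaged on the final layer. The dense (N, N) formulation trades the O(|E|F') edge-parallel computation for readability.

```python
import numpy as np

def multi_head_gat(h, adj, Ws, a_s, average=False):
    """K independent heads; concatenate (hidden layers) or average (output layer)."""
    outs = [gat_head(h, adj, W, a) for W, a in zip(Ws, a_s)]
    return np.mean(outs, axis=0) if average else np.concatenate(outs, axis=1)

# Example: N=4 nodes, F=5 input features, F'=8 per head, K=3 heads
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 5))
adj = np.eye(4) + np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)  # path graph + self-links
Ws = [rng.normal(size=(5, 8)) for _ in range(3)]
a_s = [rng.normal(size=16) for _ in range(3)]
out = multi_head_gat(h, adj, Ws, a_s)  # shape (4, 24): 3 heads * 8 features concatenated
```

A nonlinearity (e.g. ELU) would normally follow each hidden layer's output; it is omitted here for brevity.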