The task of searching over large keyword graphs aims to identify a subgraph where the nodes collectively cover the input query keywords. Although finding an exact solution to this problem is NP-hard, we address it by proposing a novel graph neural network representation learning technique specifically tailored for graphs with missing information. We propose a novel keyword graph representation learning method that incorporates complementary aspects of graphs: global, local, adjusted, and feature semantics. Considering these multiple aspects, our approach remains robust and resilient to missing information. We adopt and fine-tune a transformer-based model to aggregate the various features of a graph to generate rich representations, recognizing the pivotal role of keywords in this task. We show through experiments on real-world data that our method outperforms the state-of-the-art approaches and is particularly robust in the face of missing values, underscoring its ability to effectively handle incomplete graphs.
These are the most important libraries used to implement the methods. You need to install them using the pip or conda command before running the codes:
Numpy
Math
Pytorch 1.9.x +
Transformers
Torch Geometric
Tqdm
Please take note that some miscellaneous libraries may be needed to run the codes.
We use CiteSeer, Amazon Video, Amazon Toy and DBLP datasets for evaluation. Datasets, along its preprocessed version, are available in dataset directory.
After preparing the input data, the proposed method and baselines can be run together. You can then simply run the following command in the terminal:
`python <train.py>`
* Take note that parameters are initiated in the args.py file. You can customize your configuration there.
After running each model, the average performance for 20 runs will be saved and the results will be shown at the end when all methods are evaluated.
Here are the sample results that you will get by running the code for Hit@100 without missing edges:
Datasets | 30% Missing Keywords | 50% Missing Keywords | 70% Missing Keywords | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 5 | 7 | 9 | 3 | 5 | 7 | 9 | 3 | 5 | 7 | 9 | ||
CiteSeer | GCN | 15.72 | 16.05 | 16.33 | 13.59 | 12.56 | 14.01 | 20.18 | 19.86 | 12.79 | 16.57 | 15.46 | 15.12 |
GAT | 19.02 | 17.06 | 22.7 | 22.46 | 15.51 | 22.53 | 21.53 | 18.62 | 16.82 | 14.9 | 19.66 | 23.04 | |
ChebConv | 17.74 | 20.24 | 23.04 | 13.98 | 15.59 | 23.2 | 20.63 | 20.38 | 16.92 | 22.7 | 18.99 | 23.24 | |
GraphSAGE | 0.74 | 0.59 | 1.23 | 1.16 | 3.39 | 2.36 | 2.53 | 1.87 | 10.45 | 7.14 | 8.57 | 7.74 | |
BLINK+SAT | 8.81 | 7.23 | 5.65 | 4.42 | 9.14 | 5.97 | 5.47 | 4.88 | 9.82 | 8.34 | 8.75 | 5.49 | |
PCA | 11.2 | 8.42 | 7.05 | 6.05 | 9.87 | 7.33 | 6.8 | 5.66 | 8.4 | 6.83 | 6.07 | 5.46 | |
Conv-PCA | 11.56 | 11.11 | 9.08 | 10.32 | 8.14 | 10.05 | 10.76 | 9.25 | 12.69 | 11.06 | 10.83 | 7.3 | |
KS-PCA | 25.7 | 30.06 | 35.01 | 36.38 | 26.08 | 27.56 | 30.84 | 33.82 | 26.08 | 28.52 | 32.08 | 36 | |
KS-GNN | 29.52 | 36.37 | 37.92 | 39 | 30.07 | 36.41 | 38.85 | 39 | 27.08 | 37.48 | 37.66 | 36.02 | |
Proposed Method | 34.84 | 43.3 | 45.05 | 46.49 | 34.67 | 42.68 | 44.95 | 46.94 | 34.79 | 43.85 | 45.73 | 47.03 | |
Video | GCN | 13.66 | 16.06 | 19.56 | 25.28 | 16.14 | 19.46 | 19.99 | 23.85 | 6.49 | 6.79 | 11.31 | 12.06 |
GAT | 10.47 | 13.4 | 9.15 | 11.35 | 11.29 | 12.16 | 14.5 | 19.12 | 8.84 | 12.26 | 13.46 | 13.84 | |
ChebConv | 4.45 | 4.3 | 4.57 | 8.49 | 8.19 | 12.91 | 8.4 | 7.05 | 5.77 | 5.96 | 9.44 | 13.92 | |
GraphSAGE | 0.49 | 0.30 | 0.06 | 0.03 | 0.44 | 0.14 | 0.00 | 0.05 | 0.34 | 0.35 | 0.21 | 0.16 | |
BLINK+SAT | 10.21 | 9.86 | 10.99 | 14.87 | 8.55 | 6.92 | 8.63 | 5.82 | 1.18 | 1.15 | 4.38 | 3.35 | |
PCA | 1.54 | 0.91 | 0.55 | 0.61 | 1.71 | 0.72 | 0.71 | 0.57 | 1.66 | 0.95 | 0.66 | 0.55 | |
Conv-PCA | 1.81 | 2.46 | 1.58 | 2.38 | 2.43 | 1.49 | 1.66 | 2.54 | 2.42 | 2.37 | 2.80 | 3.37 | |
KS-PCA | 10.19 | 12.23 | 16.15 | 21.37 | 11.26 | 15.62 | 19.51 | 23.87 | 10.66 | 15.57 | 19.17 | 25.36 | |
KS-GNN | 21.43 | 23.36 | 22.92 | 26.79 | 22.54 | 22.57 | 30.41 | 33.41 | 21.01 | 16.48 | 22.01 | 28.47 | |
Proposed Method | 23.33 | 30.13 | 34.89 | 38.71 | 22.76 | 29.56 | 34.27 | 38.11 | 21.6 | 27.71 | 32.73 | 36.18 | |
Toy | GCN | 18.97 | 21.27 | 19.13 | 21.44 | 18.47 | 20.26 | 19.07 | 22.34 | 18.03 | 19.40 | 18.91 | 20.73 |
GAT | 20.84 | 22.41 | 22.6 | 19.65 | 21.39 | 27.43 | 29.82 | 31.09 | 19.16 | 20.87 | 21.36 | 21.80 | |
ChebConv | 15.5 | 18.08 | 22.63 | 12.0 | 14.94 | 21.01 | 24.27 | 19.85 | 16.3 | 17.48 | 19.97 | 12.35 | |
GraphSAGE | 0.74 | 0.88 | 5.11 | 4.21 | 4.24 | 12.49 | 6.50 | 6.10 | 1.45 | 4.39 | 6.71 | 1.09 | |
BLINK+SAT | 6.47 | 8.17 | 8.12 | 10.54 | 6.79 | 9.59 | 10.99 | 11.93 | 6.44 | 8.56 | 11.55 | 8.15 | |
PCA | 1.15 | 0.68 | 0.63 | 0.51 | 1.01 | 0.69 | 0.51 | 0.44 | 0.67 | 0.47 | 0.44 | 0.30 | |
Conv-PCA | 21.34 | 21.99 | 23.76 | 25.40 | 19.17 | 19.72 | 20.21 | 25.22 | 16.99 | 21.61 | 23.95 | 24.90 | |
KS-PCA | 27.23 | 27.73 | 31.58 | 33.79 | 25.78 | 28.94 | 31.04 | 32.82 | 18.35 | 22.03 | 25.50 | 26.25 | |
KS-GNN | 28.56 | 29.85 | 29.55 | 34.28 | 24.65 | 29.16 | 31.27 | 33.25 | 21.78 | 27.41 | 25.55 | 30.17 | |
Proposed Method | 31.82 | 40.5 | 43.87 | 45.31 | 32.19 | 41.41 | 43.49 | 45.18 | 24.09 | 36.31 | 40.75 | 41.92 | |
DBLP | GCN | 2.75 | 5.27 | 3.05 | 8.04 | 7.27 | 2.44 | 7.17 | 7.6 | 5.86 | 3.85 | 9.12 | 11.31 |
GAT | 4.41 | 4.77 | 10.09 | 2.35 | 5.93 | 5.22 | 10.74 | 3.68 | 5.26 | 12.97 | 8.72 | 9.63 | |
ChebConv | 4.56 | 9.41 | 5.68 | 8.96 | 6.4 | 4.17 | 8.5 | 8.68 | 7.54 | 11.03 | 8.63 | 17.36 | |
GraphSAGE | 0.42 | 0.49 | 0.36 | 0.82 | 0.29 | 0.43 | 0.53 | 0.05 | 0.07 | 0.01 | 0.01 | 0.00 | |
BLINK+SAT | 3.26 | 6.09 | 4.01 | 6.65 | 3.49 | 1.66 | 5.42 | 3.95 | 4.25 | 2.42 | 3.12 | 4.24 | |
PCA | 3.78 | 2.57 | 2.25 | 2.38 | 3.06 | 2.55 | 2.21 | 2.15 | 2.97 | 2.51 | 2.02 | 1.88 | |
Conv-PCA | 9.00 | 13.24 | 13.29 | 16.87 | 5.93 | 6.56 | 7.64 | 10.62 | 7.00 | 10.52 | 10.03 | 15.68 | |
KS-PCA | 15.28 | 21.41 | 25.61 | 31.64 | 14.98 | 20.73 | 23.21 | 31.72 | 12.49 | 19.49 | 21.23 | 28.63 | |
KS-GNN | 16.21 | 24.94 | 29.55 | 33.51 | 16.52 | 22.73 | 26.85 | 30.69 | 15.57 | 24.15 | 27.12 | 29.06 | |
Proposed Method | 23.11 | 32.29 | 39.38 | 40.54 | 24.01 | 33.46 | 38.01 | 42.13 | 23.33 | 32.46 | 39.16 | 40.86 |
Here are the results that you will get by running the code for Hit@100 with 30% missing edges:
Datasets | 30% Missing Keywords | 50% Missing Keywords | 70% Missing Keywords | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 5 | 7 | 9 | 3 | 5 | 7 | 9 | 3 | 5 | 7 | 9 | ||
CiteSeer | GCN | 19.00 | 17.86 | 21.19 | 22.44 | 13.7 | 15.18 | 15.82 | 11.46 | 11.27 | 19.34 | 22.43 | 21.33 |
GAT | 10.58 | 8.88 | 9.9 | 12.44 | 12.2 | 18.52 | 14.36 | 21.24 | 12.96 | 15.98 | 14.83 | 12.41 | |
ChebConv | 18.33 | 28.27 | 21.06 | 17.54 | 17.89 | 16.54 | 28.33 | 26.67 | 18.99 | 22.42 | 29.54 | 28.77 | |
GraphSAGE | 1.25 | 1.03 | 0.75 | 1.03 | 3.26 | 2.45 | 3.34 | 1.62 | 9.44 | 7.86 | 7.59 | 5.45 | |
BLINK+SAT | 9.19 | 8.14 | 5.68 | 3.71 | 8.26 | 6.79 | 6.04 | 8.57 | 7.88 | 8.33 | 7.69 | 8.3 | |
PCA | 11.2 | 8.42 | 7.05 | 6.05 | 9.87 | 7.33 | 6.8 | 5.66 | 8.4 | 6.83 | 6.07 | 5.46 | |
Conv-PCA | 11.6 | 10.97 | 9.05 | 10.56 | 9.77 | 10.61 | 12.73 | 9.31 | 12.92 | 11.83 | 12.25 | 8.45 | |
KS-PCA | 26.08 | 30.06 | 32.26 | 34.65 | 23.8 | 26.45 | 29.11 | 31.81 | 26.06 | 26.57 | 31.42 | 34.53 | |
KS-GNN | 28.96 | 35.75 | 39.44 | 39.4 | 26.48 | 34.92 | 36.31 | 37.3 | 24.2 | 27.84 | 28.19 | 33.41 | |
Proposed Method | 32.79 | 41.72 | 42.45 | 43.63 | 33.28 | 40.74 | 41.59 | 42.48 | 33.17 | 41.72 | 41.72 | 43.16 | |
Video | GCN | 2.83 | 2.95 | 4.5 | 4.69 | 1.62 | 2.8 | 4.14 | 2.94 | 2.19 | 2.19 | 3.01 | 3.04 |
GAT | 3.01 | 2.13 | 3.01 | 2.52 | 2.51 | 3.99 | 3.76 | 3.84 | 4.27 | 3.0 | 4.31 | 3.64 | |
ChebConv | 3.35 | 4.08 | 4.85 | 4.76 | 2.65 | 3.03 | 4.46 | 4.54 | 1.96 | 2.55 | 5.67 | 2.9 | |
GraphSAGE | 0.09 | 0.11 | 0.02 | 0.00 | 0.26 | 0.05 | 0.00 | 0.04 | 1.65 | 1.65 | 2.22 | 1.29 | |
BLINK+SAT | 1.67 | 1.85 | 2.48 | 1.44 | 0.08 | 0.99 | 4.96 | 2.97 | 2.19 | 1.77 | 0.78 | 1.21 | |
PCA | 1.54 | 0.91 | 0.55 | 0.61 | 1.71 | 0.72 | 0.71 | 0.57 | 1.66 | 0.95 | 0.66 | 0.55 | |
Conv-PCA | 1.05 | 1.59 | 0.83 | 2.01 | 1.43 | 0.81 | 0.87 | 1.23 | 1.25 | 1.25 | 1.31 | 1.38 | |
KS-PCA | 3.82 | 4.13 | 4.88 | 7.13 | 3.96 | 4.52 | 5.33 | 6.11 | 3.64 | 4.58 | 5.17 | 6.55 | |
KS-GNN | 8.08 | 8.34 | 12.88 | 11.82 | 6.84 | 7.68 | 4.18 | 11.12 | 6.37 | 10.31 | 13.92 | 10.07 | |
Proposed Method | 15.31 | 18.74 | 21.8 | 24.95 | 14.74 | 18.98 | 22.09 | 25.03 | 13.87 | 17.17 | 20.02 | 22.39 | |
Toy | GCN | 2.64 | 2.37 | 1.78 | 2.99 | 2.45 | 2.55 | 2.04 | 2.93 | 2.18 | 2.21 | 2.11 | 2.30 |
GAT | 2.22 | 1.45 | 2.77 | 0.92 | 1.01 | 0.87 | 1.11 | 0.64 | 1.16 | 0.93 | 0.74 | 0.88 | |
ChebConv | 6.93 | 6.15 | 11.69 | 7.82 | 7.25 | 8.25 | 9.03 | 8.91 | 3.09 | 3.83 | 4.46 | 5.17 | |
GraphSAGE | 0.04 | 0.02 | 0.01 | 0.00 | 0.28 | 0.00 | 0.01 | 0.02 | 0.02 | 0.18 | 0.29 | 0.23 | |
BLINK+SAT | 3.69 | 3.03 | 4.85 | 5.66 | 2.56 | 1.34 | 1.86 | 5.46 | 1.76 | 1.39 | 4.73 | 4.45 | |
PCA | 1.15 | 0.68 | 0.63 | 0.51 | 1.01 | 0.69 | 0.51 | 0.44 | 0.67 | 0.47 | 0.44 | 0.30 | |
Conv-PCA | 11.64 | 10.46 | 11.23 | 12.29 | 9.31 | 9.00 | 9.08 | 11.61 | 6.74 | 8.87 | 8.99 | 9.56 | |
KS-PCA | 13.80 | 13.03 | 13.55 | 15.67 | 11.35 | 11.03 | 11.09 | 13.02 | 6.84 | 8.37 | 9.50 | 10.43 | |
KS-GNN | 13.82 | 13.28 | 14.38 | 15.22 | 12.51 | 12.54 | 12.68 | 12.59 | 9.41 | 8.99 | 12.83 | 11.15 | |
Proposed Method | 22.72 | 28.62 | 30.26 | 31.13 | 22.09 | 28.57 | 30.19 | 31.85 | 17.32 | 22.96 | 23.49 | 24.03 | |
DBLP | GCN | 4.85 | 6.71 | 5.89 | 3.62 | 2.91 | 10.47 | 9.53 | 8.5 | 2.21 | 2.79 | 1.41 | 4.56 |
GAT | 3.27 | 3.74 | 7.37 | 7.11 | 3.71 | 4.49 | 5.55 | 8.04 | 5.63 | 7.73 | 8.34 | 7.63 | |
ChebConv | 5.51 | 8.51 | 12.55 | 4.24 | 7.66 | 12.24 | 11.94 | 13.09 | 6.3 | 6.44 | 10.21 | 14.2 | |
GraphSAGE | 0.51 | 0.20 | 0.42 | 0.34 | 0.67 | 0.25 | 0.40 | 0.90 | 0.40 | 0.05 | 0.04 | 0.02 | |
BLINK+SAT | 4.78 | 3.45 | 6.74 | 4.58 | 3.94 | 3.66 | 5.05 | 3.97 | 2.96 | 2.75 | 2.83 | 7.29 | |
PCA | 3.78 | 2.57 | 2.25 | 2.38 | 3.06 | 2.55 | 2.21 | 2.15 | 2.97 | 2.51 | 2.02 | 1.88 | |
Conv-PCA | 8.35 | 11.8 | 12.51 | 14.51 | 5.25 | 6.77 | 7.03 | 9.91 | 6.46 | 9.59 | 9.47 | 14.33 | |
KS-PCA | 15.05 | 19.99 | 22.61 | 28.57 | 13.68 | 19.19 | 21.59 | 28.96 | 11.67 | 17.371 | 18.38 | 24.51 | |
KS-GNN | 15.91 | 20.49 | 25.09 | 29.04 | 15.79 | 19.86 | 22.15 | 29.71 | 12.07 | 17.18 | 19.23 | 24.19 | |
Proposed Method | 22.29 | 29.88 | 35.22 | 37.36 | 19.22 | 25.86 | 29.46 | 30.90 | 19.79 | 26.36 | 29.65 | 31.81 |
To study the impact of each component of our approach on its overall performance. The results of the study are as follows:
Datasets | 50% Missing Edges | ||||
---|---|---|---|---|---|
50% Missing Keywords | |||||
3 | 5 | 7 | 9 | ||
CiteSeer | 25.48 | 30.74 | 33.24 | 34.17 | |
25.27 | 30.41 | 32.95 | 33.83 | ||
25.35 | 30.05 | 33.79 | 34.01 | ||
25.95 | 31.65 | 34.11 | 34.75 | ||
26.21 | 32.71 | 34.17 | 35.32 | ||
26.19 | 32.80 | 34.94 | 35.36 | ||
Proposed Method (All) | 27.3 | 34.57 | 35.48 | 36.04 | |
Video | 11.21 | 14.37 | 16.97 | 18.77 | |
11.43 | 14.89 | 17.08 | 19.58 | ||
11.84 | 14.90 | 17.10 | 19.71 | ||
12.31 | 15.35 | 18.56 | 20.19 | ||
12.49 | 15.58 | 18.63 | 20.34 | ||
12.09 | 15.12 | 18.07 | 20.11 | ||
Proposed Method (All) | 13.71 | 16.63 | 19.09 | 21.45 | |
Toy | 17.39 | 23.42 | 24.74 | 25.19 | |
17.60 | 23.90 | 24.83 | 25.67 | ||
17.95 | 23.98 | 24.89 | 25.79 | ||
18.07 | 24.07 | 25.41 | 26.14 | ||
18.31 | 24.44 | 25.39 | 26.85 | ||
18.60 | 24.82 | 25.58 | 26.92 | ||
Proposed Method (All) | 19.9 | 25.54 | 27.03 | 28.34 | |
DBLP | 16.06 | 23.20 | 27.59 | 28.33 | |
16.47 | 23.72 | 27.62 | 28.40 | ||
16.19 | 23.93 | 27.48 | 28.67 | ||
18.38 | 24.37 | 28.69 | 29.14 | ||
17.84 | 24.02 | 28.13 | 29.01 | ||
18.29 | 24.49 | 28.72 | 29.21 | ||
Proposed Method (All) | 18.86 | 25.02 | 29.33 | 30.65 |