This repository provides scripts for reproducing the benchmarking results for several implementations of node2vec, as presented in the PecanPy paper.
Note: all test scripts provided use the SLURM workload manager and will NOT run on a personal computer. This package uses Anaconda to manage environments via conda, so that software packages with mutually incompatible dependency requirements can all be run.
To get started and perform all tests, run the following commands, which will first set up the working environment and then submit all job scripts for testing.
WARNING: the whole dataset takes up 11GB of space; make sure you have enough disk space before setting up.
sh setup.sh # set up the working environment
sh SLURM/submit_all.sh # submit all test job scripts
Note: it might take up to 30 minutes to fully set up, mainly due to downloading large data files.
After all jobs are done (one might encounter memory errors due to the limitations of some implementations), two files will be created under the `result/` directory:

- `stat_summary.tsv` - summary table of runtime statistics for the different implementations on the different networks (see section 3 for more details about the measurements)
- `evaluation_summary.tsv` - summary table of the classification evaluation of the resulting embeddings from the different implementations (see section 4 for more details about the evaluation setup)
To generate the runtime stats and evaluation figures from the collected test data, open and run the two Jupyter notebooks (`runtime_stats.ipynb` and `classification_evaluation.ipynb`) under the `jup/` directory. The figures are produced at the end of each notebook.
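Assuming Jupyter is installed in the active environment, each notebook can also be opened from the command line, e.g. `jupyter notebook jup/runtime_stats.ipynb`.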
This section describes in more detail the steps performed when a user runs `setup.sh`, which include:

- setting up the directory structure for saving results and history files
- building environments for the different implementations
- downloading and processing data
The `script/init_setup/setup_dir.sh` script will set up the `SLURM_history/` directory for holding SLURM job history files, and the `result/` directory with the following structure:
result
|-- emb
| |-- nodevectors
| |-- orig-cpp
| |-- orig-py
| |-- pecanpy-DenseOTF
| |-- pecanpy-PreComp
| `-- pecanpy-SparseOTF
`-- stat
|-- nodevectors
|-- orig-cpp
|-- orig-py
|-- pecanpy-DenseOTF
|-- pecanpy-PreComp
`-- pecanpy-SparseOTF
where `emb/` holds the embedding files generated by the different implementations, and `stat/` holds the runtime stat data (execution time and physical memory usage) profiled by the GNU `time` command. Each of the directories above is further subdivided into one folder per implementation.
We tested 6 different implementations of the node2vec algorithm: 3 from PecanPy and 3 from alternative software packages.

- Original node2vec (Python) - `orig-py`
- Original node2vec (C++) - `orig-cpp`
- PecanPy
  - PreComp - `pecanpy-PreComp`
  - SparseOTF - `pecanpy-SparseOTF`
  - DenseOTF - `pecanpy-DenseOTF`
- NodeVectors - `nodevectors`
Since the different libraries have dependencies that are not compatible with each other, we set up a conda environment for each library (except for `orig-cpp`, which instead requires building the C++ source code). Information about the conda environments is provided in the `env/` directory as `.yml` files. The `script/init_setup/setup_envs.sh` script will use these files to set up the three conda environments used later by the different libraries:

- `pecanpy-bench_node2vec`
- `pecanpy-bench_nodevectors`
- `pecanpy-bench_pecanpy`
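If an environment needs to be inspected or used interactively after setup, it can be activated manually, e.g. `conda activate pecanpy-bench_pecanpy`.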
Various networks with a wide range of sizes and densities are used to benchmark the different implementations of the node2vec algorithm. The relatively small networks (BlogCatalog, PPI, Wikipedia) are provided in this repository along with node labels. They were originally downloaded from the node2vec webpage and converted from `.mat` files to `.txt` files so that they are easier to load with Python. The rest of the networks are downloaded from other repositories, which is done automatically by the `script/init_setup/setup_data.sh` script. The following table summarizes some information about the networks tested.
Network | Weighted | # nodes | # edges | Density (unweighted) | File size |
---|---|---|---|---|---|
BioGRID | No | 20,558 | 238,474 | 1.13E-03 | 2.5M |
STRING | Yes | 17,352 | 3,640,737 | 2.42E-02 | 60M |
GIANT-TN-c01 | Yes | 25,689 | 38,904,929 | 1.18E-01 | 1.1G |
GIANT-TN | Yes | 25,825 | 333,452,400 | 1.00E+00 | 7.2G |
SSN200 | Yes | 814,731 | 72,618,574 | 2.19E-04 | 2.0G |
BlogCatalog | No | 10,312 | 333,983 | 6.28E-03 | 3.2M |
PPI | No | 3,852 | 38,273 | 5.16E-03 | 707K |
Wikipedia | Yes | 4,777 | 92,406 | 8.10E-03 | 2.0M |
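For reference, the benchmarks consume networks as edgelist (`.edg`) files (see the contribution guidelines below). Assuming a tab-delimited layout (the exact delimiter may differ), a weighted edgelist lists one `source target weight` triple per line, with the weight column omitted for unweighted networks:

```
node1	node2	0.8
node1	node3	1.5
node2	node3	0.3
```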
Each implementation is tested using two different resource configurations (multi and single). The multi setup tests the capability of an implementation to make use of large computational resources, while the single setup tests its ability to run on a smaller amount of computational resources (e.g. what might be found on a laptop). The configuration details are as follows:
Configuration | Core count | Memory (GB) | Time limit (hr) |
---|---|---|---|
Multi | 28 | 200 | 24 |
Single | 1 | 32 | 8 |
As mentioned earlier, `SLURM/submit_all.sh` will submit all test jobs at once. Each implementation and configuration has its own test script, e.g. `SLURM/test_pecanpy-PreComp_single.sb` is the batch script for testing PecanPy-PreComp with the single-core resource configuration.
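To rerun a single test rather than resubmitting everything, the corresponding batch script can be submitted directly, e.g. `sbatch SLURM/test_pecanpy-PreComp_single.sb`.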
A note on testing the `nodevectors` implementation: for all implementations besides `nodevectors`, the `p` and `q` parameters are set to `1` by default, but for `nodevectors` they are set to `1.001`. If `p` and `q` are set to `1`, `nodevectors` will automatically perform 1st-order random walks as a shortcut (link to source) rather than the 2nd-order walks required by node2vec. Performing 1st-order walks is significantly faster than performing the 2nd-order random walks that all other implementations do, and would hence bias the test results. Setting `p` and `q` to `1.001` disables the automatic fallback to 1st-order random walks while keeping the results close to those obtained with `p` and `q` set to `1`.
After all test jobs are finished, the `script/get_test_info.py` script will be executed to extract test information from the logging files and summarize it into a single table at `result/stat_summary/summary.txt`. The summary file can be renamed to record specific benchmark conditions if needed. The table consists of the following columns:
- Network - name of the network embedded
- Method - name of the node2vec implementation
- Setup - computational resource setting of the test
- Loading time - (stage 1) time used to load the network into memory in the desired format
- Preprocessing time - (stage 2) time used to pre-compute the transition probability table
- Walking time - (stage 3) time used to generate random walks
- Training time - (stage 4) time used to train the word2vec model on the random walks
- Total time - total run time of the program (from starting Python; includes time to load packages, etc.)
- Total time in second - same as total time, but converted to seconds
- Maximum resident size - maximum physical memory used
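As a minimal sketch of how the summary might be inspected afterwards (the file path and column names below are assumptions based on the description above):

```python
# Sketch only: compare total runtimes across implementations and setups,
# assuming the TSV headers exactly match the column names listed above.
import pandas as pd

df = pd.read_csv("result/stat_summary.tsv", sep="\t")
pivot = df.pivot_table(index="Network", columns=["Method", "Setup"],
                       values="Total time in second")
print(pivot)
```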
To ensure the quality of the embeddings, we perform node classification tasks, as presented in the node2vec paper, for BlogCatalog, PPI, and Wikipedia. The labels for PPI were slightly modified to remove label sets with fewer than 10 positives, to ensure meaningful evaluation metrics. There are 38 node classes in BlogCatalog, 50 in PPI, and 21 in Wikipedia. For each node class in a network, a one-vs-rest l2-regularized logistic regression model is trained and evaluated through 5-fold cross validation. Each test fold is scored using auROC separately, and the mean auROC across the 5 folds is reported. This evaluation is repeated 10 times, and the mean of the reported scores is taken as the final evaluation score for the class.
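A minimal sketch of that per-class evaluation; the embeddings `X` and binary labels `y` below are random placeholders, and the actual pipeline in this repository may differ in detail:

```python
# Sketch: one-vs-rest l2-regularized logistic regression, scored by auROC
# over 5-fold cross validation (placeholder data, not the real embeddings).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))     # placeholder node embeddings
y = rng.integers(0, 2, size=1000)    # placeholder labels for one node class

clf = LogisticRegression(penalty="l2", max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(scores.mean())  # mean auROC across the 5 folds
```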
After the evaluation, for each network, there is a list of scores per implementation, where each entry is the evaluation score for a particular node class. A paired Wilcoxon test is then applied to each implementation against the original Python implementation to check whether there is any significant performance drop with the new implementation.
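A hedged sketch of that paired test, with synthetic per-class scores standing in for the real evaluation output:

```python
# Sketch: paired Wilcoxon (signed-rank) test between per-class auROC scores
# of the original Python implementation and another implementation.
# The score arrays below are synthetic placeholders.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
orig_py = rng.uniform(0.6, 0.9, size=50)            # e.g. 50 PPI node classes
candidate = orig_py + rng.normal(0, 0.01, size=50)  # implementation under test

stat, pval = wilcoxon(orig_py, candidate)
print(pval)  # a small p-value suggests a systematic performance difference
```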
We welcome researchers to use this repository to benchmark their own networks of interest, or even their own embedding programs. This section provides guidelines for creating new tests, as well as requirements for contributing additional implementations to the repository.
- Add the new implementation name to `data/implementation_list.txt` (see section 5.3 for more requirements on implementations).
- Create test job scripts using the templates provided (`SLURM/_test_template.sb` and `SLURM/_test_template_single.sb`), and follow the `### MODIFY` tags to modify the lines as needed:
  - `### MODIFY1` - sbatch job name (ends with `_s` for the single-core configuration setup)
  - `### MODIFY2` - name of the implementation; must match the name added to `data/implementation_list.txt`
  - `### MODIFY3` - change directory to the source code of the new implementation if needed (not required if set up through an environment)
  - `### MODIFY4` - activate the virtual environment for the new implementation if needed
  - `### MODIFY5` - modify the command for calling the program to embed a network
- Add the new network as an edgelist file (`.edg`) and add the network name (without file extension) to `data/networks.txt`.
- Add `true` (or `false`) to `data/weighted.txt` depending on whether the new network is weighted (or not). The order of the network names in `data/networks.txt` and of the corresponding weighted flags in `data/weighted.txt` must match.
- Prepare the `.yml` environment file following the naming convention `pecanpy-bench_new-method`, where `new-method` should be replaced by the method's name, and modify `script/init_setup/setup_envs.sh` to set up the environment for the new implementation.
- Provide an interfacing script for bash to communicate with Python if needed (see the sketch after this list); the following inputs are required:
  - `--input` - input graph path
  - `--output` - output embedding path
  - `--dimension` - embedding dimension
  - `--walk-length` - length of each walk
  - `--num-walks` - number of walks per node
  - `--workers` - number of workers
  - `--p` - return parameter
  - `--q` - in-out parameter
- For proper runtime stat retrieval, add the following print statements to report the execution time of each corresponding stage (the sketch after this list includes a helper that computes `hrs`, `mins`, and `secs`):

```python
print("Took %02d:%02d:%05.2f to load graph" % (hrs, mins, secs))
print("Took %02d:%02d:%05.2f to pre-compute transition probabilities" % (hrs, mins, secs))
print("Took %02d:%02d:%05.2f to generate walks" % (hrs, mins, secs))
print("Took %02d:%02d:%05.2f to train embeddings" % (hrs, mins, secs))
```
- Add data retrieval scripts and any necessary preprocessing scripts for new networks (as edgelist files, with the `.edg` extension).
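The sketch below ties the last few requirements together: a hypothetical interfacing script exposing the required flags and printing the stage timings in the expected format. Everything here (argument defaults, the `report` helper, the placeholder stages) is an illustrative assumption, not the repository's actual code:

```python
# Hypothetical interfacing-script skeleton (sketch only).
import argparse
import time

def report(stage, elapsed):
    # Convert elapsed seconds into the hrs/mins/secs used by the
    # "Took %02d:%02d:%05.2f to ..." format expected for stat retrieval.
    hrs, rem = divmod(elapsed, 3600)
    mins, secs = divmod(rem, 60)
    print("Took %02d:%02d:%05.2f to %s" % (hrs, mins, secs, stage))

def parse_args():
    parser = argparse.ArgumentParser(description="node2vec interfacing script")
    parser.add_argument("--input", required=True, help="input graph path")
    parser.add_argument("--output", required=True, help="output embedding path")
    parser.add_argument("--dimension", type=int, default=128, help="embedding dimension")
    parser.add_argument("--walk-length", type=int, default=80, help="length of each walk")
    parser.add_argument("--num-walks", type=int, default=10, help="number of walks per node")
    parser.add_argument("--workers", type=int, default=1, help="number of workers")
    parser.add_argument("--p", type=float, default=1.0, help="return parameter")
    parser.add_argument("--q", type=float, default=1.0, help="in-out parameter")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()

    t = time.time()
    # ... load the graph from args.input ...
    report("load graph", time.time() - t)

    t = time.time()
    # ... pre-compute transition probabilities ...
    report("pre-compute transition probabilities", time.time() - t)

    t = time.time()
    # ... generate random walks ...
    report("generate walks", time.time() - t)

    t = time.time()
    # ... train word2vec and write embeddings to args.output ...
    report("train embeddings", time.time() - t)
```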