This repo contains the scripts that we used to recreate the WDBench benchmark, as well as the results that we gathered. The benchmark was performed on AWS instances of the following specification:
Here you can find the instructions on how we imported the data into both tested databases. The Wikidata data that we used may be found here.
This part assumes you are connected to an AWS instance with at least 24 GB of RAM and 350 GB of free disk space. Some of the commands may require root access if you are using Amazon Linux, as we did.
- Download the Neo4J Community Edition. We used version 4.3.5, the same version that was used in the original run.
- Extract the downloaded file
tar -xf neo4j-community-4.*.*-unix.tar.gz
- Enter the folder:
cd neo4j-community-4.*.*/
- Set the variable `$NEO4J_HOME` to point to the Neo4J folder (see the sketch after the configuration steps below).
Edit the text file `conf/neo4j.conf`:
- Set `dbms.default_database=wikidata`
- Uncomment the line `dbms.security.auth_enabled=false`
- Add the line `dbms.transaction.timeout=10m`
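As a concrete sketch of the two steps above (assuming you are inside the extracted Neo4J folder and that the stock `conf/neo4j.conf` still contains these settings commented out, which may differ between versions):

```bash
# Point NEO4J_HOME at the extracted Neo4J folder (run this from inside that folder).
export NEO4J_HOME="$(pwd)"

# Apply the three configuration changes to conf/neo4j.conf.
sed -i 's/^#dbms.default_database=.*/dbms.default_database=wikidata/' "$NEO4J_HOME/conf/neo4j.conf"
sed -i 's/^#dbms.security.auth_enabled=false/dbms.security.auth_enabled=false/' "$NEO4J_HOME/conf/neo4j.conf"
echo "dbms.transaction.timeout=10m" >> "$NEO4J_HOME/conf/neo4j.conf"
```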
Use the script `nt_to_neo4j.py` to generate the .csv files `entities.csv`, `literals.csv` and `edges.csv`.
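Before running the import, it can be worth a quick sanity check of the generated files (a sketch; `wikidata_csv` is the output directory used in the import command below):

```bash
# Show the header row and line count of each generated CSV file.
for f in wikidata_csv/entities.csv wikidata_csv/literals.csv wikidata_csv/edges.csv; do
    echo "== $f =="
    head -n 1 "$f"   # header row read by neo4j-admin import
    wc -l < "$f"     # number of lines, including the header
done
```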
Execute the data import
bin/neo4j-admin import --database wikidata \
--nodes=Entity=wikidata_csv/entities.csv \
--nodes wikidata_csv/literals.csv \
--relationships wikidata_csv/edges.csv \
--delimiter "," --array-delimiter ";" --skip-bad-relationships true
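Since the import runs with `--skip-bad-relationships true`, it is worth checking how many edges were skipped. To our knowledge `neo4j-admin import` writes a report file, by default `import.report` in the working directory; the check below assumes that default:

```bash
# Inspect the import report for skipped relationships (if the file exists).
if [ -f import.report ]; then
    wc -l < import.report      # number of reported problems
    head import.report         # first few entries
else
    echo "No import.report found (nothing skipped, or a different report path was used)."
fi
```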
Now we have to create the index for entities:
- Start the server:
bin/neo4j console
- Open the Cypher console (in another terminal) with `bin/cypher-shell`, and inside the console run the command `CREATE INDEX ON :Entity(id);`
- Even though the above command returns immediately, you have to wait until the index is built before interrupting the server. You can see the status of the index with the command `SHOW INDEXES;` (a scripted variant is sketched below).
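If you prefer to script this step, `cypher-shell` also reads statements from standard input, so a sketch like the following (the 10-second polling interval is an arbitrary choice) creates the index and waits until it is online before you stop the server:

```bash
# Create the index non-interactively and wait until no index is still populating.
echo 'CREATE INDEX ON :Entity(id);' | bin/cypher-shell

while echo 'SHOW INDEXES;' | bin/cypher-shell | grep -q 'POPULATING'; do
    echo "Index still populating, waiting..."
    sleep 10
done
echo "All indexes are online."
```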
Blazegraph can't load big files in a reasonable time, so we need to split the .nt file into smaller files (1M lines each):
mkdir splitted_nt
cd splitted_nt
split -l 1000000 -a 4 -d --additional-suffix=.nt [path_to_nt]
cd ..
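With these options `split` names the chunks `x0000.nt`, `x0001.nt`, and so on inside `splitted_nt`. A quick check that no lines were lost (just a sketch; `[path_to_nt]` is the same placeholder as above):

```bash
ls splitted_nt | head -n 3          # first few chunk names
cat splitted_nt/*.nt | wc -l        # total lines across all chunks...
wc -l < [path_to_nt]                # ...should match the original file
```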
git clone --recurse-submodules https://gerrit.wikimedia.org/r/wikidata/query/rdf wikidata-query-rdf
cd wikidata-query-rdf
mvn package
cd dist/target
tar xvzf service-*-dist.tar.gz
cd service-*/
mkdir logs
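The `mvn package` step above also runs the project's test suite, which can take a long time on a benchmark instance. If you only need the distribution archive, Maven's standard flag for skipping tests is an option (a shortcut we did not use for the reported results); run it from the `wikidata-query-rdf` root:

```bash
# Optional: build the distribution without running the test suite.
# Run from the wikidata-query-rdf checkout, not from dist/target/service-*/.
mvn package -DskipTests
```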
- Edit the script file `runBlazegraph.sh` (a sketch of these edits follows this list):
  - Configure the main memory here: `HEAP_SIZE=${HEAP_SIZE:-"64g"}` (you may use another value depending on how much RAM your machine has).
  - Set the log folder: `LOG_DIR=${LOG_DIR:-"/path/to/logs"}`, replacing `/path/to/logs` with the absolute path of the `logs` dir you created in the previous step.
  - Add `-Dorg.wikidata.query.rdf.tool.rdf.RdfRepository.timeout=60` to the `exec java` command to specify the timeout (value is in seconds).
  - Also change `-Dcom.bigdata.rdf.sparql.ast.QueryHints.analyticMaxMemoryPerQuery=0`, which removes per-query memory limits.
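Note that both `HEAP_SIZE` and `LOG_DIR` use the `${VAR:-default}` shell idiom, so instead of editing those two lines you can also override them from the environment; the two `-D...` Java options still have to be added to the `exec java` line by hand. A sketch of the environment approach (values are examples):

```bash
# Values picked up by runBlazegraph.sh via its ${VAR:-default} defaults.
export HEAP_SIZE="64g"
export LOG_DIR="$(pwd)/logs"   # absolute path to the logs dir created earlier
```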
- Start the server:
./runBlazegraph.sh
- This process won't end until you interrupt it (Ctrl+C). Leave it running until the import finishes, and run the next command in another terminal.
- Start the import:
./loadRestAPI.sh -n wdq -d [path_of_splitted_nt_folder]
This step may take a while; on the AWS instance that we used, it took around 4 days to finish.
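Once the load has finished, a quick way to confirm that the data is queryable is to send a trivial SPARQL query to the local endpoint. The URL below is the one the Wikidata query service distribution normally exposes; adjust it if your setup differs:

```bash
# Smoke test: ask for a single triple and expect a JSON result back.
curl -s -G 'http://localhost:9999/bigdata/namespace/wdq/sparql' \
     --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 1' \
     -H 'Accept: application/sparql-results+json'
```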