Graphs and grammars for experimental analysis of context-free path querying algorithms.
- GCC
- Python 3
Just install requirements and run init.py
:
pip3 install -r requirements.txt
python3 init.py
This script downloads data (real-world RDF files) and GTgraph, a suite of graph generators. After that, some synthetic graphs are generated with it.
In order to download/update one specific part of the dataset run:
python3 init.py --update [GroupName]
Options for [GroupName]
are rdf, scalefree, full, worstcase, sparse, memoryaliases
We provide a set of scripts to simplify data loading into some popular graph databases.
The dataset can be loaded to RedisGraph with tools/redis-rdf
, for example:
cd ./tools/redis-rdf
python3 main.py --port [PORT] dir ../../data/Matrices/ScaleFree/
python3 main.py --port [PORT] file ../../data/Matrices/RDF/foaf.rdf foaf
To load RDF data into Neo4j database one can use Neosemantix plugin for Neo4j.
Set contains both real-world data and synthetic graphs for several specific cases. All graphs are represented in RDF format to unify loading process.
Queries are represented as context-free grammars in the following format:
-
Line 0: a set of variable symbols delimited by spaces, the first one is the starting symbol
-
Line 1: a set of terminal symbols delimited by spaces
-
The rest of the lines are productions in the form:
head -> body | body | ... | body
where each body can contain basic regular expression, allowed operators:
- The concatenation, the default operator, which can by represented either by a space or a dot (
.
) - The union, represented by
|
- The
?
quantifier - The kleene star, represented by
*
Epsilon symbol should be represented by
eps
- The concatenation, the default operator, which can by represented either by a space or a dot (
Example query in such format:
S
a b
S -> a S b S
S -> eps
Grammar can be converted to CNF with tools/gramar_to_cnf
, which uses pyformlang library to perform context-free grammar modifications.
Graphs and grammars can be found in data
— all graphs are divided into groups, which are placed in different directories. Each data/[GroupName]/Matrices
contains graph descriptions and data/[GroupName]/Grammars
— descriptions of queries.
RDF
— fixed versions of real-world RDF files (links are provided for updating purposes only!):
-
Small graphs is a set of popular semantic web ontologies. This set is introduced by Xiaowang Zhang in "Context-Free Path Queries on RDF Graphs" :
-
Bigger graphs:
- geospecies – graph related to taxonomic hierarchy and geographical information of animal species, download here: https://old.datahub.io/dataset/geospecies. Introduced in "An Experimental Study ofContext-Free Path Query Evaluation Methods"
- a set of graphs from the Uniprot protein sequences database, download here: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf
- go
- go-hierarchy
- pathways
- taxonomy
- taxonomy-hierarchy
- core
- enzime
- eclass_514en
Grammars list contains the following variants of same-generation query over different relation tipes.
- g1 — same-generation query over type and subclass-of relations. Introduced in "Context-Free Path Queries on RDF Graphs"
- g2 — same-generation query over type and subclass-of relations. Introduced in "Context-Free Path Queries on RDF Graphs"
- geo — same-generation query over broader-transitive relation.
MemoryAliases
— real-world data for points-to analysis of C code.
- First part is a dataset form Graspan tool. The original data is placed here. This part is placed in
Graspan
folder. - Second part is a part of dataset form "Demand-driven alias analysis for C". This part is placed in
small
folder.
Both grammars g1 and g2 specify the same language which is described in related papers. These two grammars were written in a different way in order to investigate dependencies on query specification format.
WorstCase
— graphs with two cylces; the query Brackets is a grammar for the language of correct bracket sequences.
SparseGraph
— graphs generated with GTgraph to emulate sparse data. The grammar provided is a variant of the same-generation query.
ScaleFree
— graphs generated with GTgraph by using the Barab'asi-Albert model of scale-free networks. Use with grammar an_bm_cm_dn, which is a query for AnBmCmDn language.
FullGraph
— cycle graphs, all edges are labeled with the same token. Use with A_star queries, which produce full graph on that dataset.
Reference values for algorithms correctnes checking, which can be found in reference_values.csv, are in the following format: graph, grammar, control_sum
, where control_sum
is the number of paths generated by non-terminals of the grammars.
Example of the FullGraph
dataset reference values in such format:
fullgraph_10,A_star0,"{'s': 100}"
fullgraph_100,A_star0,"{'s': 10000}"
...
List of CFPQ-related works. The list is not full, work in progress.
- M. Yannakakis. "Graph-theoretic methods in database theory".
- P. Sevon and L. Eronen. "Subgraph queries by context-free grammars".
- J. Kuijpers, G. Fletcher, N. Yakovets, and T. Lindaaker. "An experimental study of context-free path query evaluation methods".
- H. Miao and A. Deshpande. "Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization".
- TBD
- T. Reps. "Program analysis via graph reachability".
- X. Zheng and R. Rugina. "Demand-driven alias analysis for C".
- TBD