Gradoop DataSources

Christopher Rost edited this page Oct 4, 2019 · 15 revisions

This section provides an overview of the Gradoop-specific data sources.

Gradoop Data Sources
GraphDataSource
EdgeListDataSource
VertexLabeledEdgeListDataSource
HBaseDataSource
AccumuloDataSource
CSVDataSource
IndexedCSVDataSource
JSONDataSource (deprecated since v0.5.0)
TLFDataSource
FlinkAsciiGraphLoader

GraphDataSource

A DataSource that transforms an external graph into a LogicalGraph, with or without lineage information stored in the resulting EPGM graph. The external graph must be represented by data sets of the generic classes ImportVertex<K> and ImportEdge<K>, where K is the type of the external identifier.

GraphDataSource example

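Map<String, Object> propertiesBob = Maps.newHashMap();
propertiesBob.put("name", "Bob");
propertiesBob.put("age", 24);

// properties of the second vertex (illustrative values)
Map<String, Object> propertiesAlice = Maps.newHashMap();
propertiesAlice.put("name", "Alice");
propertiesAlice.put("age", 30);

// edge properties, left empty for this example
Map<String, Object> propertiesEdge1 = Maps.newHashMap();
Map<String, Object> propertiesEdge2 = Maps.newHashMap();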

String bobId = "myOwnDomainSpecificIdForBob";
String aliceId = "myOwnDomainSpecificIdForAlice";

DataSet<ImportVertex<String>> importVertices = env.fromElements(
        new ImportVertex<>(bobId, "Person", Properties.createFromMap(propertiesBob)),
        new ImportVertex<>(aliceId, "Person", Properties.createFromMap(propertiesAlice))
    );

DataSet<ImportEdge<String>> importEdges = env.fromElements(
        new ImportEdge<>("myOwnEdgeId1", bobId, aliceId, "friendsWith", Properties.createFromMap(propertiesEdge1)),
        new ImportEdge<>("myOwnEdgeId2", aliceId, bobId, "friendsWith", Properties.createFromMap(propertiesEdge2))
    );

GraphDataSource<String> graphDataSource = new GraphDataSource<>(importVertices, importEdges, config);

To store the lineage identifier in a property, use the respective constructor:

... new GraphDataSource<>(importVertices, importEdges, "myID", config);

EdgeListDataSource

A DataSource to create a LogicalGraph from an edge list file. Paths can be local (file://) or HDFS (hdfs://).

EdgeListDataSource example

DataSource dataSource = new EdgeListDataSource("/path/to/edgelist", ",", config);

Example edgelist file
This example denotes two edges between three vertices: 2-->0-->1

0,1
2,0
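
To make the file format concrete, here is a minimal plain-Java sketch of how such a file could be read. It is not Gradoop's implementation: the class and method names (EdgeListSketch, parseEdges) are hypothetical, and it assumes that vertex identifiers are numeric, as in the example file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class EdgeListSketch {

    /** Splits each non-empty line into a {sourceId, targetId} pair. */
    static List<long[]> parseEdges(List<String> lines, String separator) {
        List<long[]> edges = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            // Pattern.quote treats the separator literally, not as a regex
            String[] tokens = trimmed.split(Pattern.quote(separator));
            if (tokens.length != 2) {
                throw new IllegalArgumentException("Expected 2 tokens: " + line);
            }
            edges.add(new long[] {
                Long.parseLong(tokens[0].trim()), Long.parseLong(tokens[1].trim())});
        }
        return edges;
    }

    public static void main(String[] args) {
        // the two lines of the example file, separated by ","
        for (long[] edge : parseEdges(Arrays.asList("0,1", "2,0"), ",")) {
            System.out.println(edge[0] + "-->" + edge[1]);
        }
    }
}
```

For the example file, the sketch yields the edges 0-->1 and 2-->0, which together form the path 2-->0-->1.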

VertexLabeledEdgeListDataSource

A special case of the EdgeListDataSource: a DataSource to create a LogicalGraph from an edge list file in which each vertex is annotated with a string label/value. Paths can be local (file://) or HDFS (hdfs://).

VertexLabeledEdgeListDataSource example

DataSource dataSource = new VertexLabeledEdgeListDataSource("/path/to/edgelist", " ", "lang", config);

In this example, the token separator is a single space, and the vertex value is stored under the property key lang.

Example vertex labeled edgelist file
This example denotes two edges between three vertices: FR-->EN-->ZH

0 EN 1 ZH
2 FR 0 EN
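
The line layout (source id, source value, target id, target value) can be illustrated with a small plain-Java sketch. It is not Gradoop's implementation; the class name VertexLabeledEdgeListSketch and its fields are hypothetical, and numeric vertex identifiers are assumed, as in the example file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class VertexLabeledEdgeListSketch {

    /** Vertex id mapped to its value (stored under "lang" in the example above). */
    final Map<Long, String> vertexValues = new LinkedHashMap<>();
    /** Edges as {sourceId, targetId} pairs. */
    final List<long[]> edges = new ArrayList<>();

    static VertexLabeledEdgeListSketch parse(List<String> lines, String separator) {
        VertexLabeledEdgeListSketch result = new VertexLabeledEdgeListSketch();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] t = trimmed.split(Pattern.quote(separator));
            if (t.length != 4) {
                throw new IllegalArgumentException("Expected 4 tokens: " + line);
            }
            long source = Long.parseLong(t[0]);
            long target = Long.parseLong(t[2]);
            // a vertex may occur on several lines; it is kept only once
            result.vertexValues.put(source, t[1]);
            result.vertexValues.put(target, t[3]);
            result.edges.add(new long[] {source, target});
        }
        return result;
    }

    public static void main(String[] args) {
        VertexLabeledEdgeListSketch g = parse(Arrays.asList("0 EN 1 ZH", "2 FR 0 EN"), " ");
        for (long[] e : g.edges) {
            System.out.println(g.vertexValues.get(e[0]) + "-->" + g.vertexValues.get(e[1]));
        }
    }
}
```

Vertex 0 appears on both lines of the example file but ends up as a single vertex, so the result has three vertices and two edges.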

HBaseDataSource

A DataSource to create an EPGM instance from an HBase source. By default, graphs are stored in three tables: graph_heads, vertices and edges. See HBaseDataSink for an example of the table structures.

HBaseDataSource example

// create hbase and gradoop-hbase configuration
Configuration hBaseConfiguration = HBaseConfiguration.create();
GradoopHBaseConfig gradoopConfig = GradoopHBaseConfig.getDefaultConfig(env);
// get the EPGM store
HBaseEPGMStore store = HBaseEPGMStoreFactory.createOrOpenEPGMStore(hBaseConfiguration, gradoopConfig);
// get GraphCollection from DataSource
// config is the Gradoop Flink configuration, as in the other examples on this page
HBaseDataSource hBaseDataSource = new HBaseDataSource(store, config);
GraphCollection collection = hBaseDataSource.getGraphCollection();

AccumuloDataSource

A DataSource to create an EPGM instance from an Apache Accumulo® store. By default, graphs are stored in three tables: graph, vertex and edge. See AccumuloDataSink for an example of the table structures.

AccumuloDataSource example

// flink execution env
ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();
// create gradoop accumulo configuration
GradoopAccumuloConfig config = GradoopAccumuloConfig.create(env)  
  .set(GradoopAccumuloConfig.ACCUMULO_USER, {user})
  .set(GradoopAccumuloConfig.ACCUMULO_INSTANCE, {instance})
  .set(GradoopAccumuloConfig.ZOOKEEPER_HOSTS, {comma separated zookeeper host list})
  .set(GradoopAccumuloConfig.ACCUMULO_PASSWD, {password})
  .set(GradoopAccumuloConfig.ACCUMULO_TABLE_PREFIX, {table prefix});
// create store
AccumuloStore graphStore = new AccumuloStore(config);
// create dataSource
AccumuloDataSource dataSource = new AccumuloDataSource(graphStore);

GraphCollection collection = dataSource.getGraphCollection();

CSVDataSource

A DataSource to create an EPGM instance from CSV files. It expects files separated by graphs, vertices, edges and metadata in the following directory structure:

 csvRoot
   |- graphs.csv   # all graph head data
   |- vertices.csv # all vertex data
   |- edges.csv    # all edge data
   |- metadata.csv # metadata for all data contained in the graph

CSVDataSource example

DataSource dataSource = new CSVDataSource("/path/to/csv/input", config);

Each line of graphs.csv should contain a graph identifier (a 12-byte ID written as a 24-character hexadecimal string), a label, and a list of property values separated by '|'. The fields themselves are separated by ';'. The property types are defined in metadata.csv.

Example graphs.csv

000000000000000000000000;g1;graph1|2.75
000000000000000000000001;g2;graph2|4

Each line of vertices.csv should contain a vertex identifier (a 12-byte ID written as a 24-character hexadecimal string), a list of the graphs this vertex belongs to, a label, and a list of property values separated by '|'. The property types are defined in metadata.csv.

Example vertices.csv

000000000000000000000000;[000000000000000000000001];A;foo|42|13.37|NULL
000000000000000000000001;[000000000000000000000001];A;bar|23|19.84|
000000000000000000000002;[000000000000000000000000,000000000000000000000001];B;1234|true|0.123
000000000000000000000003;[000000000000000000000000,000000000000000000000001];B;5678|false|4.123
000000000000000000000004;[000000000000000000000001];B;2342||19.84

Each line of edges.csv should contain an edge identifier (a 12-byte ID written as a 24-character hexadecimal string), a list of the graphs this edge belongs to, the source vertex identifier, the target vertex identifier, a label, and a list of property values separated by '|'. The property types are defined in metadata.csv.

Example edges.csv

000000000000000000000000;[000000000000000000000001];000000000000000000000000;000000000000000000000001;a;1234|13.37
000000000000000000000001;[000000000000000000000001];000000000000000000000001;000000000000000000000000;a;5678|23.42
000000000000000000000002;[000000000000000000000001];000000000000000000000001;000000000000000000000002;b;3141
000000000000000000000003;[000000000000000000000000,000000000000000000000001];000000000000000000000002;000000000000000000000003;b;2718
000000000000000000000004;[000000000000000000000001];000000000000000000000004;000000000000000000000000;a;|19.84
000000000000000000000005;[000000000000000000000001];000000000000000000000004;000000000000000000000000;b;

Each line of metadata.csv should contain the entity type (g, v or e), the label, and a comma-separated list of key:type pairs that define the data types of the properties of the graph heads, vertices and edges with that label.

Example metadata.csv

g;g1;a:string,b:double
g;g2;a:string,b:int
v;A;a:string,b:int,c:float
v;B;a:long,b:boolean,c:double
e;a;a:int,b:float
e;b;a:long
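
How the metadata and the '|'-separated property lists fit together can be sketched in plain Java. This is an illustration, not Gradoop's parser. The class and method names are hypothetical, only the types used in the example metadata are handled, and two things are assumed from the sample files: values appear in the order declared in metadata.csv, and an empty value means the property is absent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CsvPropertySketch {

    /** Parses e.g. "a:long,b:boolean,c:double" into an ordered key -> type map. */
    static Map<String, String> parseMetadata(String propertyList) {
        Map<String, String> types = new LinkedHashMap<>();
        if (propertyList.isEmpty()) {
            return types;
        }
        for (String entry : propertyList.split(",")) {
            String[] keyAndType = entry.split(":", 2);
            if (keyAndType.length != 2) {
                throw new IllegalArgumentException("Expected key:type but got " + entry);
            }
            types.put(keyAndType[0], keyAndType[1]);
        }
        return types;
    }

    /** Parses e.g. "1234|true|0.123" using the declared types, by position. */
    static Map<String, Object> parseProperties(String values, Map<String, String> types) {
        Map<String, Object> properties = new LinkedHashMap<>();
        String[] tokens = values.split("\\|", -1); // -1 keeps trailing empty values
        int i = 0;
        for (Map.Entry<String, String> declared : types.entrySet()) {
            if (i >= tokens.length) {
                break;
            }
            String token = tokens[i++];
            if (token.isEmpty()) {
                continue; // assumption: an empty value means "property not set"
            }
            properties.put(declared.getKey(), convert(token, declared.getValue()));
        }
        return properties;
    }

    static Object convert(String token, String type) {
        switch (type) {
            case "string": return token;
            case "int": return Integer.parseInt(token);
            case "long": return Long.parseLong(token);
            case "float": return Float.parseFloat(token);
            case "double": return Double.parseDouble(token);
            case "boolean": return Boolean.parseBoolean(token);
            default: throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    public static void main(String[] args) {
        // metadata line "v;B;a:long,b:boolean,c:double" and a vertex of label B
        Map<String, String> types = parseMetadata("a:long,b:boolean,c:double");
        System.out.println(parseProperties("1234|true|0.123", types));
    }
}
```

With the metadata of label B, the value list 2342||19.84 from the example vertices.csv would produce only the properties a and c, because the value for b is empty.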

IndexedCSVDataSource

A DataSource to create an EPGM instance from CSV files indexed by label. It expects files separated by label, e.g. in the following directory structure:

 csvRoot
    |- metadata.csv         # Meta data for all data contained in the graph
    |- graphs/
        |- g1
            |- data.csv     # contains all graph heads with label g1
        |- g2
            |- data.csv     # contains all graph heads with label g2
    |- vertices/
        |- Person
            |- data.csv     # contains all vertices with label 'Person'
        |- University
            |- data.csv     # contains all vertices with label 'University'
    |- edges/
        |- knows
            |- data.csv     # contains all edges with label 'knows'
        |- studyat
            |- data.csv     # contains all edges with label 'studyAt' 

The file format and the constructor call are the same as for CSVDataSource.


JSONDataSource (deprecated since v0.5.0)

A DataSource to create an EPGM instance from JSON files: one for vertex declaration, one for edge declaration and one for the graph declaration. Paths can be local (file://) or HDFS (hdfs://). The exact format is documented in the classes JSONToGraphHead, JSONToVertex, JSONToEdge.

JSONDataSource example

String graphFile = "/path/to/graphs.json";
String vertexFile = "/path/to/nodes.json";
String edgeFile = "/path/to/edges.json";

DataSource dataSource = new JSONDataSource(graphFile, vertexFile, edgeFile, config);

Example nodes.json

{"id":"000000000000000000000000","data":{"name":"Dave","gender":"m","city":"Dresden","age":40},"meta":{"label":"Person","graphs":["000000000000000002000000","000000000000000002000001","000000000000000002000002"]}}
{"id":"000000000000000000000001","data":{"name":"Hadoop"},"meta":{"label":"Tag","graphs":[]}}

This example denotes two vertices, labeled Person and Tag, with different properties. The first one is part of three graphs, while the second one does not belong to any graph.


TLFDataSource

A DataSource to create an EPGM instance from a TLF file. See The vertex-transitive TLF-planar graphs for further details about TLF. Paths can be local (file://) or HDFS (hdfs://). The exact format is documented in TLFFileFormat.

TLFDataSource example

DataSource dataSource = new TLFDataSource("/path/to/file.tlf", config);

Example file.tlf (with comments)

t # 0	// graph head with graph id 0
v 0 A	// vertex with id 0 and label A
v 1 B
e 0 1 a //edge with source id 0, target id 1 and edge label a
e 0 1 b
t # 1
v 0 A
v 1 B
e 0 1 a
e 1 0 b
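
The record types of the example (t for a graph head, v for a vertex, e for an edge) can be read with a few lines of plain Java. The sketch below is not Gradoop's TLF reader. The class names TlfSketch and Graph are hypothetical, and stripping everything after // is an assumption made only so that the annotated example above can be parsed as written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TlfSketch {

    static class Graph {
        final long id;
        final Map<Long, String> vertices = new LinkedHashMap<>(); // vertex id -> label
        final List<String[]> edges = new ArrayList<>();           // {source, target, label}

        Graph(long id) {
            this.id = id;
        }
    }

    static List<Graph> parse(List<String> lines) {
        List<Graph> graphs = new ArrayList<>();
        Graph current = null;
        for (String raw : lines) {
            int comment = raw.indexOf("//"); // assumption: comments as in the example
            String line = (comment >= 0 ? raw.substring(0, comment) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] t = line.split("\\s+");
            if (!t[0].equals("t") && current == null) {
                throw new IllegalStateException("Record before first graph head: " + raw);
            }
            switch (t[0]) {
                case "t": // "t # <graphId>" starts a new graph
                    current = new Graph(Long.parseLong(t[2]));
                    graphs.add(current);
                    break;
                case "v": // "v <vertexId> <label>"
                    current.vertices.put(Long.parseLong(t[1]), t[2]);
                    break;
                case "e": // "e <sourceId> <targetId> <label>"
                    current.edges.add(new String[] {t[1], t[2], t[3]});
                    break;
                default:
                    throw new IllegalArgumentException("Unknown record: " + raw);
            }
        }
        return graphs;
    }

    public static void main(String[] args) {
        List<Graph> graphs = parse(Arrays.asList("t # 0", "v 0 A", "v 1 B", "e 0 1 a"));
        System.out.println(graphs.size() + " graph(s), "
            + graphs.get(0).vertices.size() + " vertices, "
            + graphs.get(0).edges.size() + " edge(s)");
    }
}
```

Vertex identifiers are local to a graph in this format, which is why the sketch keeps one vertex map per graph head: both graphs in the example file use the ids 0 and 1.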

FlinkAsciiGraphLoader

Uses the AsciiGraphLoader to generate instances of LogicalGraph and GraphCollection from the Graph Definition Language (GDL, see GDL on GitHub). The GDL data can be loaded from a string or a file.

FlinkAsciiGraphLoader example

// create a sample graph
String graph = "g1:graph[" +
        "(p1:Person {name: \"Bob\", age: 24})-[:friendsWith]->" +
        "(p2:Person{name: \"Alice\", age: 30})-[:friendsWith]->(p1) " +
        "(p2)-[:friendsWith]->(p3:Person {name: \"Jacob\", age: 27})-[:friendsWith]->(p2) " +
        "(p3)-[:friendsWith]->(p4:Person{name: \"Marc\", age: 40})-[:friendsWith]->(p3) " +
        "(p4)-[:friendsWith]->(p5:Person{name: \"Sara\", age: 33})-[:friendsWith]->(p4) " +
        "(c1:Company {name: \"Acme Corp\"}) " +
        "(c2:Company {name: \"Globex Inc.\"}) " +
        "(p2)-[:worksAt]->(c1) " +
        "(p4)-[:worksAt]->(c1) " +
        "(p5)-[:worksAt]->(c1) " +
        "(p1)-[:worksAt]->(c2) " +
        "(p3)-[:worksAt]->(c2) " + "] " +
        "g2:graph[" +
        "(p4)-[:friendsWith]->(p6:Person {name: \"Paul\", age: 37})-[:friendsWith]->(p4) " +
        "(p6)-[:friendsWith]->(p7:Person {name: \"Mike\", age: 23})-[:friendsWith]->(p6) " +
        "(p8:Person {name: \"Jil\", age: 32})-[:friendsWith]->(p7)-[:friendsWith]->(p8) " +
        "(p6)-[:worksAt]->(c2) " +
        "(p7)-[:worksAt]->(c2) " +
        "(p8)-[:worksAt]->(c1) " + "]";

FlinkAsciiGraphLoader loader = new FlinkAsciiGraphLoader(config);
loader.initDatabaseFromString(graph);
GraphCollection c1 = loader.getGraphCollectionByVariables("g1", "g2");