s2graph is a GraphDB that stores big data using edges and vertices. It also provides a REST API for querying information on edges and vertices. s2graph provides fully asynchronous, non-blocking APIs to manipulate and traverse (breadth-first search) large graphs. This document defines the terminology and concepts used in s2graph and describes its REST API.
- Getting Started
- The Data Model
- REST API Glossary
- 0. Create a Service - POST /graphs/createService
- 1. Create a Label - POST /graphs/createLabel
- 2. Create ServiceColumn (Optional) - POST /graphs/createServiceColumn
- 3. Insert and Manipulate Edges
- 4. (Optionally) Insert and Manipulate Vertices
- 5. Query
- 6. Bulk Loading
- 7. Benchmark
- 8. Resources
S2Graph consists of several projects.
- S2Core: the core library, containing the common classes for storing and retrieving data as edges/vertices.
- root project: the Play REST server that provides the REST APIs.
- spark: common classes related to Spark.
- loader: Spark jobs that use the S2Core library to consume events from Kafka into HBase. It also contains a kit for migrating data from HDFS to s2graph.
- asynchbase: a fork of https://github.com/OpenTSDB/asynchbase. We added a few features to GetRequest. All of them depend on pull requests that have not yet been merged into the original project:
  - rpcTimeout
  - setFilter
  - column pagination
  - retryAttempCount
  - timestamp filtering
Set up the following requirements, then run s2graph.
- Apache HBase setup.
  - If you are a Mac user, run brew install hadoop and brew install hbase.
  - If you are not a Mac user, see this reference on how to install HBase.
  - The current build targets the latest stable releases, apache hbase 1.0.1 with apache hadoop 2.7.0. If you use cdh, check out feature/cdh5.3.0. We are working on providing profiles for hbase/hadoop versions soon.
- s2graph currently stores its metadata in mysql. Run the script 's2core/migrate/mysql/schema.sql' to create the required tables in mysql.
Then compile the remaining projects.
sbt compile
Now run s2graph.
sbt run
We provide a simple script that checks all the configuration properties.
sh script/test.sh
Finally, join the mailing list.
The s2graph data model is defined by four important abstractions: services, columns, labels, and properties.
Services, the top-level abstraction, are like databases in a traditional RDBMS, in which all data is contained. A service usually represents one of the company's actual services and is named accordingly, e.g. "KakaoTalk", "KakaoStory".
Columns define the type of vertices, and a service can have multiple columns. For example, the "KakaoMusic" service can have "user_id" and "track_id" columns. While a column can be compared to a table in a traditional RDBMS, labels are the primary abstraction for representing schemas, and columns are usually referenced only by their names.
Labels represent relations between two columns and define the type of edges. The two columns can be the same: for a label representing friendships in an SNS, both columns would be "user_id" of the service. A label can also connect two columns from two different services. For example, one can create a label that stores all events where KakaoStory posts are shared to KakaoTalk.
Properties are metadata attached to vertices or edges that can be queried later. For vertices representing KakaoTalk users, estimated_birth_year is a possible property, and for edges representing similar KakaoMusic songs, cosine_similarity can be a property.
Using these abstractions, a unique vertex is identified by (service, column, vertex id), and a unique edge is identified by (service, label, source vertex id, target vertex id). Additional information on edges and vertices is stored in their properties.
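The identifying tuples above can be sketched as plain value types. This is only an illustration of the data model, not s2graph's internal representation; the class names and the `similar_song` label are hypothetical.

```python
from typing import NamedTuple, Union

VertexId = Union[int, str]


class VertexKey(NamedTuple):
    """(service, column, vertex id) identifies one vertex."""
    service: str
    column: str
    vertex_id: VertexId


class EdgeKey(NamedTuple):
    """(service, label, source vertex id, target vertex id) identifies one edge."""
    service: str
    label: str
    src_vertex_id: VertexId
    tgt_vertex_id: VertexId


# The same id in two different columns refers to two different vertices.
user = VertexKey("KakaoMusic", "user_id", 1)
track = VertexKey("KakaoMusic", "track_id", 1)

# "similar_song" is a hypothetical label name used only for this sketch.
similar = EdgeKey("KakaoMusic", "similar_song", 1, 2)
```

Because the keys are tuples, they compare by value and can be used directly as dictionary keys when reasoning about uniqueness.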
The following is a list of commonly used s2graph APIs with examples. The full list of the latest REST APIs can be found in the routes file.
Next, let's look at how to set things up through the API.
To create a service, send a request that defines the following fields.
Field name | Definition | Data type | Example | Note |
---|---|---|---|---|
serviceName | user-defined namespace name | string | "talk_friendship" | required. |
cluster | zookeeper quorum address of the cluster | string | "abc.com:2181,abd.com:2181" | optional. Defaults to the value of "hbase.zookeeper.quorum" in application.conf. If "hbase.zookeeper.quorum" is not set in application.conf, "localhost" is used. |
hTableName | physical HBase table name | string | "test" | optional. Default is serviceName-#{phase}, where phase is one of dev/real/alpha/sandbox. |
hTableTTL | global time to live for the data | integer | 86000 | optional. Default is no time limit. |
preSplitSize | pre-split factor for the HBase table. The number of pre-split regions is numOfRegionServers x this value. | integer | 1 | optional. Default is 0 (no pre-split). If set to 1, s2graph pre-splits your table into 1 x numOfRegionServers regions. |
A service is the top-level abstraction and can be thought of as a database in an RDBMS. You can create a service using this API.
curl -XPOST localhost:9000/graphs/createService -H 'Content-Type: Application/json' -d '
{"serviceName": "s2graph", "cluster": "address for zookeeper", "hTableName": "hbase table name", "hTableTTL": 86000, "preSplitSize": # of pre split}
'
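As a rough sketch, the request body for createService can be assembled programmatically so that optional fields are simply left out and the server defaults apply. The helper names below are hypothetical and not part of s2graph. The field names come from the table above.

```python
import json
from typing import Optional


def build_create_service_payload(
    service_name: str,
    cluster: Optional[str] = None,
    h_table_name: Optional[str] = None,
    h_table_ttl: Optional[int] = None,
    pre_split_size: Optional[int] = None,
) -> dict:
    """Build the JSON body for POST /graphs/createService.

    Only serviceName is required; unset optional fields are omitted.
    """
    if not service_name:
        raise ValueError("serviceName is required")
    payload = {"serviceName": service_name}
    optional = {
        "cluster": cluster,
        "hTableName": h_table_name,
        "hTableTTL": h_table_ttl,
        "preSplitSize": pre_split_size,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


def expected_pre_split(pre_split_size: int, num_region_servers: int) -> int:
    """Pre-split count as described in the table: preSplitSize x numOfRegionServers."""
    return pre_split_size * num_region_servers


body = json.dumps(build_create_service_payload("s2graph", pre_split_size=1))
```

The resulting `body` string can be passed as the `-d` argument of the curl command above.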
Note that the optional values for a service are intended for advanced users only. If you are not sure what to do, stick with the defaults.
You can also retrieve all labels that belong to a service.
curl -XGET localhost:9000/graphs/getLabels/:serviceName
A label represents a relation between two columns and plays a role like a table in an RDBMS, since labels contain the schema information, i.e. what type of data will be collected and which of it needs to be indexed for efficient retrieval. In most scenarios, defining a schema on vertices is straightforward, but defining a schema on edges requires a little more effort. Think about the queries you will need first, and then model users' actions/relations as edges to design a label.
To create a label, send a request that defines the following fields.
Field name | Definition | Data type | Example | Note |
---|---|---|---|---|
label | name of this relation; it should be specific. | string | "talk_friendship" | required. |
srcServiceName | service of the source column | string | "kakaotalk" | required. |
srcColumnName | name of the source column | string | "user_id" | required. |
srcColumnType | data type of the source column | long/integer/string | "string" | required. |
tgtServiceName | service of the target column | string | "kakaotalk"/"kakaoagit" | optional. Defaults to srcServiceName. |
tgtColumnName | name of the target column | string | "item_id" | required. |
tgtColumnType | data type of the target column | long/integer/string | "long" | required. |
isDirected | whether this label is directed or undirected | true/false | true/false | default is true |
serviceName | the service this label belongs to | string | s2graph | default is tgtServiceName |
hTableName | if this label has a special use case (such as batch upload), its own HBase table name can be used. | string | s2graph-batch | optional. Defaults to the service's hTableName. |
hTableTTL | time to live for the data. | integer | 86000 | optional. Defaults to the service's hTableTTL. |
consistencyLevel | with strong, only one edge can exist between from and to; with weak, multiple edges can exist between from and to. (Most labels use weak.) | string | strong/weak | default is weak |
indexProps | see below | | | |
props | see below | | | |
The most important elements of a label are indexProps and props.
Every element of the graph, including vertices and edges, has its own properties. A property is a simple key-value map on an edge or a vertex, and a single property is defined as follows.
{
"name": "name of property",
"dataType": "data type of property value",
"defaultValue": "default value in string"
}
Here is a simple example of props.
[
{"name": "play_count", "defaultValue": 0, "dataType": "integer"},
{"name": "is_hidden","defaultValue": false,"dataType": "boolean"},
{"name": "category","defaultValue": "jazz","dataType": "string"},
{"name": "score","defaultValue": 0,"dataType": "float"}
]
Note that property value types must be numeric (byte, short, integer, float, double), boolean, or string.
indexProps defines the primary index of this label (like PRIMARY INDEX idx_xxx(p1, p2) in an RDBMS).
s2graph automatically keeps edges sorted according to these indexProps. Additional indexes can be defined later when edges need multiple orderings.
props defines metadata that does not affect the ordering of edges.
One last thing to note is that s2graph reserves the following property names. Users cannot create properties with these names, but they can use them as provided by default.
- _timestamp is the system timestamp of the edge. It can be interpreted as last_modified_at.
- _from is the start vertex of the edge.
- _to is the end vertex of the edge.
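The constraints on user-defined properties described above (the three definition fields, the allowed value types, and the reserved names) could be checked on the client before calling createLabel. The following is a minimal sketch with a hypothetical `validate_prop` helper. The type list is taken from the note above, and treating all three fields as mandatory is an assumption. The examples in this document also use `long` for `_timestamp` in indexProps, which this check on user-defined props does not cover.

```python
# Value types listed in this document for user-defined properties.
ALLOWED_DATA_TYPES = {"byte", "short", "integer", "float", "double", "boolean", "string"}

# Names reserved by s2graph; users cannot define properties with these names.
RESERVED_NAMES = {"_timestamp", "_from", "_to"}


def validate_prop(prop: dict) -> list:
    """Return a list of problems found in one user-defined property definition."""
    problems = []
    # Assumption: name, dataType and defaultValue are all expected to be present.
    for key in ("name", "dataType", "defaultValue"):
        if key not in prop:
            problems.append(f"missing field: {key}")
    name = prop.get("name")
    if name in RESERVED_NAMES:
        problems.append(f"reserved property name: {name}")
    data_type = prop.get("dataType")
    if data_type is not None and data_type not in ALLOWED_DATA_TYPES:
        problems.append(f"unsupported dataType: {data_type}")
    return problems
```

An empty list means the definition passed these client-side checks. The server remains the authority on what is accepted.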
Here is an example that creates a label named user_article_liked between the user_id column of the s2graph service and the article_id column of the s2graph_news service. Because the indexProps field is empty, _timestamp is used as the default indexed property.
curl -XPOST localhost:9000/graphs/createLabel -H 'Content-Type: Application/json' -d '
{
"label": "user_article_liked",
"srcServiceName": "s2graph",
"srcColumnName": "user_id",
"srcColumnType": "long",
"tgtServiceName": "s2graph_news",
"tgtColumnName": "article_id",
"tgtColumnType": "string",
"indexProps": {}, // _timestamp will be used as default
"props": {}
}
'
This label keeps edges ordered according to each edge's indexProps values; in this case, the latest like comes first. The default ordering is latest first, which is what many applications naturally want.
Here is another example that creates a label named friends, which represents the friend relation between users in the s2graph service. In this case, edges with a higher affinity_score come first, and if affinity_score ties, the latest friend comes first. The friends label will belong to the s2graph service.
curl -XPOST localhost:9000/graphs/createLabel -H 'Content-Type: Application/json' -d '
{
"label": "friends",
"srcServiceName": "s2graph",
"srcColumnName": "user_id",
"srcColumnType": "long",
"tgtServiceName": "s2graph",
"tgtColumnName": "user_id",
"tgtColumnType": "long",
"indexProps": [
{"name": "affinity_score", "dataType": "float", "defaultValue": 0.0},
{"name": "_timestamp", "dataType": "long", "defaultValue": 0}
],
"props": [
{"name": "is_hidden", "dataType": "boolean", "defaultValue": false},
{"name": "is_blocked", "dataType": "boolean", "defaultValue": true},
{"name": "error_code", "dataType": "integer", "defaultValue": 500}
],
"serviceName": "s2graph",
"consistencyLevel": "strong"
}
'
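The ordering that the friends label describes (higher affinity_score first, ties broken by latest _timestamp) can be illustrated with an ordinary sort. This only sketches the resulting order on a small in-memory list; it says nothing about how s2graph lays out the index in HBase, and the sample edges are made up.

```python
def friends_sort_key(edge: dict) -> tuple:
    """Sort key matching indexProps [affinity_score, _timestamp], both descending."""
    props = edge["props"]
    # Negate both values so Python's ascending sort yields the highest first.
    return (-props["affinity_score"], -props["_timestamp"])


# Made-up sample edges for illustration only.
edges = [
    {"to": 2, "props": {"affinity_score": 0.5, "_timestamp": 100}},
    {"to": 3, "props": {"affinity_score": 0.9, "_timestamp": 50}},
    {"to": 4, "props": {"affinity_score": 0.5, "_timestamp": 200}},
]

ordered = sorted(edges, key=friends_sort_key)
ordered_targets = [edge["to"] for edge in ordered]
```

Vertex 3 comes first because of its higher score; vertices 4 and 2 tie on score, so the more recent one (4) precedes the older one (2).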
s2graph supports multiple indexes on a label, which means we can add other ordering options for edges with this label.
curl -XPOST localhost:9000/graphs/addIndex -H 'Content-Type: Application/json' -d '
{
"label": "friends",
"indexProps": [
{"name": "is_blocked","dataType": "boolean", "defaultValue": "false"},
{"name": "_timestamp","dataType": "long","defaultValue": 0}
]
}
'
To get information on a label, use the following.
curl -XGET localhost:9000/graphs/getLabel/friends
You can also delete a label using the following API.
curl -XPUT localhost:9000/graphs/deleteLabel/friends
Update is not supported, so delete the label and re-create it.
To add a new property, use the following API:
curl -XPOST localhost:9000/graphs/addProp/graph_test -H 'Content-Type: Application/json' -d '
{"name": "is_blocked", "defaultValue": false, "dataType": "boolean"}
'
One last important constraint on a label is its consistency level.
It defines how edges are stored at the storage level. Note that queries are completely independent of it.
To explain consistency: s2graph identifies an edge uniquely by its (from, label, to) triple, which s2graph calls the unique edge key.
The following example is used to explain the differences between the strong and weak consistency levels.
1418950524721 insert e 1 101 graph_test {"weight": 10} = (1, graph_test, 101)
1418950524723 insert e 1 101 graph_test {"weight": 20} = (1, graph_test, 101)
Currently there are two consistency levels.
1. strong
Makes sure that only one edge is stored in storage for the same edge key ((1, graph_test, 101) above). With the strong consistency level, the last command overwrites the previous command.
2. weak
No consistency check is made on the unique edge key. The above example yields two different edges stored in storage, with different timestamps and weight values.
For example, with each configuration, the following edges will be stored.
Assume that only timestamp is used as indexProps and the user inserts the following.
u1 -> (t1, v1)
u1 -> (t2, v2)
u1 -> (t3, v2)
u1 -> (t4, v1)
With the strong consistencyLevel, the following is stored.
u1 -> (t4, v1), (t3, v2)
Note that u1 -> (t1, v1), (t2, v2) do not exist.
With the weak consistencyLevel, the following is stored.
u1 -> (t4, v1), (t3, v2), (t2, v2), (t1, v1)
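The difference between the two levels in the example above can be reproduced with a small in-memory model. This is a sketch of the observable result only, assuming edges are keyed by (source, target) within one label; the `store_edges` function is hypothetical and does not reflect how s2graph writes to HBase.

```python
def store_edges(inserts, consistency_level):
    """Model what ends up stored for a list of (source, timestamp, target) inserts.

    Results are returned latest first, matching a timestamp-only index.
    """
    if consistency_level == "weak":
        # No uniqueness check: every insert becomes its own stored edge.
        stored = list(inserts)
    elif consistency_level == "strong":
        # One edge per (source, target): keep only the latest timestamp.
        latest = {}
        for src, ts, tgt in inserts:
            key = (src, tgt)
            if key not in latest or ts > latest[key][1]:
                latest[key] = (src, ts, tgt)
        stored = list(latest.values())
    else:
        raise ValueError(f"unknown consistency level: {consistency_level}")
    return sorted(stored, key=lambda edge: edge[1], reverse=True)


# t1..t4 from the example are represented as the integers 1..4.
inserts = [("u1", 1, "v1"), ("u1", 2, "v2"), ("u1", 3, "v2"), ("u1", 4, "v1")]
strong_result = store_edges(inserts, "strong")
weak_result = store_edges(inserts, "weak")
```

With the sample inserts, `strong_result` holds the two latest edges and `weak_result` holds all four, in the same order as the listings above.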
Why weak consistency is the default.
In most cases, edges related to users' activity should use the weak consistencyLevel, since there will be no concurrent updates on the same edges. The strong consistencyLevel is only for edges expecting many concurrent updates.
The consistency level also determines how edges are stored in storage when commands are delivered out of timestamp order.
With the strong consistencyLevel, the following is guaranteed.
The natural sequence of events on the (1, graph_test, 101) unique edge key is as follows.
1418950524721 insert e 1 101 graph_test {"is_blocked": false}
1418950524722 delete e 1 101 graph_test
1418950524723 insert e 1 101 graph_test {"is_hidden": false, "weight": 10}
1418950524724 update e 1 101 graph_test {"time": 1, "weight": -10}
1418950524726 update e 1 101 graph_test {"is_blocked": true}
Even if the above commands arrive out of order, strong consistency ensures the same eventual state on (1, graph_test, 101).
1418950524726 update e 1 101 graph_test {"is_blocked": true}
1418950524723 insert e 1 101 graph_test {"is_hidden": false, "weight": 10}
1418950524722 delete e 1 101 graph_test
1418950524721 insert e 1 101 graph_test {"is_blocked": false}
1418950524724 update e 1 101 graph_test {"time": 1, "weight": -10}
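One way to picture this guarantee is to replay the commands in timestamp order regardless of arrival order. The sketch below is a simplified mental model, not s2graph's implementation (which processes commands one at a time against stored state). It assumes that insert and update both merge props into the current state and that delete clears it.

```python
def eventual_state(commands):
    """Replay (timestamp, operation, props) commands in timestamp order.

    Returns the final props of the edge, or None if the edge ends up deleted.
    """
    state = None
    for _timestamp, operation, props in sorted(commands, key=lambda c: c[0]):
        if operation == "delete":
            state = None
        elif operation in ("insert", "update"):
            # Assumption: both operations merge the given props into the edge.
            state = {**(state or {}), **props}
        else:
            raise ValueError(f"unknown operation: {operation}")
    return state


natural_order = [
    (1418950524721, "insert", {"is_blocked": False}),
    (1418950524722, "delete", {}),
    (1418950524723, "insert", {"is_hidden": False, "weight": 10}),
    (1418950524724, "update", {"time": 1, "weight": -10}),
    (1418950524726, "update", {"is_blocked": True}),
]
# Same commands in the shuffled arrival order shown above.
arrival_order = [natural_order[i] for i in (4, 2, 1, 0, 3)]

same = eventual_state(natural_order) == eventual_state(arrival_order)
```

Because the replay sorts by timestamp first, both sequences converge to the same final props under this model.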
There are many cases in which commands arrive out of order.
- client servers are distributed and each client issues commands asynchronously.
- client servers are distributed and group their commands.
- with a kafka queue, global ordering of messages is not guaranteed.
The following is what s2graph does to provide the strong consistency level.
complexity = O(one read) + O(one delete) + O(2 put)
fetchedEdge = fetch edge with (1, graph_test, 101) from lookup table.
if fetchedEdge does not exist:
    create new edge same as current insert operation
    update lookup table as current insert operation
else:
    valid = compare fetchedEdge vs current insert operation.
    if valid:
        delete fetchedEdge
        create new edge after comparing fetchedEdge and current insert.
        update lookup table
Limitation: since we write our data to HBase asynchronously, there is no consistency guarantee on the same edge within our flushInterval (1 second).
A label can have multiple indexed properties, or (for brevity) indexes. When queried, the order of the returned edges is determined by the indexes; indexes essentially define what will be included in the topK query.
Edge retrieval queries in s2graph return the topK edges by default. Clients must issue another query to fetch the next K edges, i.e., topK ~ 2topK.
Internally, s2graph stores edges sorted according to the indexes in order to limit the number of edges to fetch in one query. If no ordering is given, s2graph uses the timestamp as the index, thus returning the most recent data.
It is impossible to fetch millions of edges and sort them online to get the topK in less than a second. s2graph uses vertex-centric indexes to avoid this.
With a vertex-centric index, having millions of edges is fine as long as the topK value is reasonable (~ 1K). Note that indexes must be created before putting any data on this label (just like in an RDBMS).
New indexes can be added dynamically, but they will not be applied to existing data (planned for future versions). The number of indexes on a label is currently limited to 8.
The following is an example of adding indexes play_count and pay_amount to a label named graph_test.
curl -XPOST localhost:9000/graphs/addIndex -H 'Content-Type: Application/json' -d '
{
"label": "graph_test",
"indexProps": [
{ "name": "play_count", "defaultValue": 0, "dataType": "integer" }
]
}
'
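The topK behaviour described in this section (edges kept in index order, each query returning one slice, and a follow-up query fetching the next K) can be sketched as paging over an already-sorted list. The `fetch_edges` function is hypothetical. Its `offset` and `limit` arguments mirror the query params of the same names documented in the Query section.

```python
def fetch_edges(index_sorted_edges, offset=0, limit=10):
    """Return one page from a list that is already kept in index order.

    No sorting happens at query time; the cost depends on limit, not on
    the total number of edges.
    """
    return index_sorted_edges[offset:offset + limit]


# Edges are represented by their timestamps; the default index is latest first.
timestamps = [5, 9, 1, 7, 3, 8]
index_sorted = sorted(timestamps, reverse=True)  # done once, at write time

first_page = fetch_edges(index_sorted, offset=0, limit=2)   # topK
second_page = fetch_edges(index_sorted, offset=2, limit=2)  # topK ~ 2topK
```

The sort in this sketch stands in for the ordering that s2graph maintains as edges are written, which is why indexes need to exist before data is loaded.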
A ServiceColumn represents an object and plays a role like a single table in an RDBMS.
Note: if you only need vertex ids, you don't need to create vertices explicitly. When you create a label, s2graph creates vertices with empty properties according to the label schema.
field name | definition | data type | note | example |
---|---|---|---|---|
serviceName | which service this vertex belongs to | string | required. think of this as a database in an RDBMS | kakaotalk |
columnName | what this vertex's id is | string | required. think of this as the primary key in an RDBMS table | talk_user_id |
props | optional properties on vertex | json array of json dictionary | optional. think of these as columns in an RDBMS table | see examples |
This is a simple example that shows how to define a vertex schema and insert/select vertices.
curl -XPOST localhost:9000/graphs/createServiceColumn -H 'Content-Type: Application/json' -d '
{
"serviceName": "s2graph",
"columnName": "user_id",
"columnType": "long",
"props": [
{"name": "is_active", "dataType": "boolean", "defaultValue": true},
{"name": "phone_number", "dataType": "string", "defaultValue": "-"},
{"name": "nickname", "dataType": "string", "defaultValue": ".."},
{"name": "activity_score", "dataType": "float", "defaultValue": 0.0},
{"name": "age", "dataType": "integer", "defaultValue": 0}
]
}
'
You can get information on the vertex schema as follows.
curl -XGET localhost:9000/graphs/getServiceColumn/s2graph/user_id
This returns all properties of the serviceColumn with serviceName s2graph and columnName user_id.
You can also add more properties to a vertex as follows.
curl -XPOST localhost:9000/graphs/addServiceColumnProps/s2graph/user_id -H 'Content-Type: Application/json' -d '
[
{"name": "home_address", "defaultValue": "korea", "dataType": "string"}
]
'
You can insert vertex data as follows.
curl -XPOST localhost:9000/graphs/vertices/insert/s2graph/user_id -H 'Content-Type: Application/json' -d '
[
{"id":1,"props":{"is_active":true}, "timestamp":1417616431},
{"id":2,"props":{},"timestamp":1417616431}
]
'
Finally, you can query your vertices as follows.
curl -XPOST localhost:9000/graphs/getVertices -H 'Content-Type: Application/json' -d '
[
{"serviceName": "s2graph", "columnName": "user_id", "ids": [1, 2, 3]}
]
'
##3. Insert and Manipulate Edges ##
An edge represents a relation between two vertices, with properties according to the schema defined in its label. The following fields need to be specified when inserting an edge, and are returned when queried on edges.
field name | definition | data type | note | example |
---|---|---|---|---|
timestamp | when this request is issued. | long | required. in millis since the epoch. It is important to use millis, since TTL support is in millis. | 1430116731156 |
operation | insert/delete/update/increment | string | required only for bulk operation; aliases are insert: i, delete:d, update: u, increment: in, default is insert. | "i", "insert" |
from | Id of start vertex. | long/string | required. prefer long if possible. maximum string bytes length < 249 | 1 |
to | Id of end vertex. | long/string | required. prefer long if possible. maximum string bytes length < 249 | 101 |
label | name the corresponding label | string | required. | "graph_test" |
direction | direction of this relation, one of out/in/undirected | string | required. alias are out: o, in: i, undirected: u | "out" |
props | extra properties of this edge. | json dictionary | required. all indexed properties should be present, otherwise the default values will be added. Non-indexed properties can also be present | {"timestamp": 1417616431, "affinity_score":10, "is_hidden": false, "is_valid": true} |
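The short aliases listed in the table can be expanded on the client side before building a request. The sketch below uses hypothetical helper names. Note that the alias `i` means insert for `operation` but in for `direction`, so the two fields need separate lookup tables.

```python
OPERATION_ALIASES = {"i": "insert", "d": "delete", "u": "update", "in": "increment"}
DIRECTION_ALIASES = {"o": "out", "i": "in", "u": "undirected"}


def _normalize(value, aliases):
    """Expand an alias to its full name; full names pass through unchanged."""
    if value in aliases.values():
        return value
    if value in aliases:
        return aliases[value]
    raise ValueError(f"unknown value: {value!r}")


def normalize_operation(value=None):
    """The table gives insert as the default operation."""
    if value is None:
        return "insert"
    return _normalize(value, OPERATION_ALIASES)


def normalize_direction(value):
    return _normalize(value, DIRECTION_ALIASES)
```

For example, `normalize_operation("in")` yields `"increment"`, while `normalize_direction("in")` is already a full name and is returned unchanged.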
s2graph provides five operations on edges.
- insert: create a new edge.
- delete: delete an existing edge.
- update: update an existing edge's state.
- increment: increment an existing edge's state.
- deleteAll: delete all adjacent edges from a certain starting vertex.
Edge operations behave differently depending on their label's consistencyLevel.
For explanation, consider the following test cases.
Create 2 different labels, one each for the strong and weak consistencyLevel.
- s2graph_label_test(strong)
- s2graph_label_test_weak(weak)
Then insert the 2 test cases.
strong consistency
curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: Application/json' -d '
[
{"timestamp": 1, "from": 101, "to": 10, "label": "s2graph_label_test", "props": {"time": 0}},
{"timestamp": 2, "from": 101, "to": 10, "label": "s2graph_label_test", "props": {"time": -10}},
{"timestamp": 3, "from": 101, "to": 10, "label": "s2graph_label_test", "props": {"time": -30}}
]
'
Note that only one edge exists for (101, 10, s2graph_label_test, out), since the label s2graph_label_test has strong consistency.
{
"size": 1,
"degrees": [
{
"from": 101,
"label": "s2graph_label_test",
"direction": "out",
"_degree": 1
}
],
"results": [
{
"cacheRemain": -20,
"from": 101,
"to": 10,
"label": "s2graph_label_test",
"direction": "out",
"_timestamp": 3,
"timestamp": 3,
"score": 1,
"props": {
"_timestamp": 3,
"time": -30,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
}
],
"impressionId": -1650835965
}
weak consistency
curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: Application/json' -d '
[
{"timestamp": 1, "from": 101, "to": 10, "label": "s2graph_label_test_weak", "props": {"time": 0}},
{"timestamp": 2, "from": 101, "to": 10, "label": "s2graph_label_test_weak", "props": {"time": -10}},
{"timestamp": 3, "from": 101, "to": 10, "label": "s2graph_label_test_weak", "props": {"time": -30}}
]
'
Note that there are 3 edges for (101, 10, s2graph_label_test_weak, out).
{
"size": 3,
"degrees": [
{
"from": 101,
"label": "s2graph_label_test_weak",
"direction": "out",
"_degree": 3
}
],
"results": [
{
"cacheRemain": -148,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 3,
"timestamp": 3,
"score": 1,
"props": {
"_timestamp": 3,
"time": -30,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -148,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 2,
"timestamp": 2,
"score": 1,
"props": {
"_timestamp": 2,
"time": -10,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -148,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 1,
"timestamp": 1,
"score": 1,
"props": {
"_timestamp": 1,
"time": 0,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
}
],
"impressionId": 1972178414
}
A unique edge is defined by (from, to, label, direction). For an insert operation, s2graph first checks whether an edge with the same (from, to, label, direction) exists; if it does, the insert works as an update. See the example above.
For a delete operation, s2graph looks for the unique edge with (from, to, label, direction) and deletes it if the delete request is valid. Note that a delete request must have a strictly larger timestamp than the existing edge: since s2graph validates delete requests, a delete request is ignored if the existing edge has a larger timestamp than the request. Also note that props on a delete request are not necessary and will be ignored when the label has strong consistency, since it is safe to assume there is only one edge with the edge's unique id (from, to, label, direction).
curl -XPOST localhost:9000/graphs/edges/delete -H 'Content-Type: Application/json' -d '
[
{"timestamp": 10, "from": 101, "to": 10, "label": "s2graph_label_test"}
]
'
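The timestamp rule for deletes on a strong-consistency label can be written as a small predicate. This is a sketch of the rule as stated in this section, with a hypothetical function name. Treating an equal timestamp as invalid follows from the "strictly larger" wording.

```python
def should_apply_delete(existing_timestamp, delete_timestamp):
    """Decide whether a delete request takes effect on a strong-consistency label.

    existing_timestamp is None when no edge is stored for the unique edge key.
    """
    if existing_timestamp is None:
        return False  # nothing to delete
    # The delete must be strictly newer than the stored edge.
    return delete_timestamp > existing_timestamp


# The stored edge from the earlier example has timestamp 3; the request uses 10.
applies = should_apply_delete(existing_timestamp=3, delete_timestamp=10)
```

A stale delete, such as one with timestamp 2 against the stored timestamp 3, would be ignored under this rule.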
Update works like insert with the strong consistency level.
curl -XPOST localhost:9000/graphs/edges/update -H 'Content-Type: Application/json' -d '
[
{"timestamp": 10, "from": 101, "to": 10, "label": "s2graph_label_test", "props": {"time": 100, "weight": -10}}
]
'
Increment works like update. The only difference is that an increment call returns the incremented value rather than the old value.
curl -XPOST localhost:9000/graphs/edges/increment -H 'Content-Type: Application/json' -d '
[
{"timestamp": 10, "from": 101, "to": 10, "label": "s2graph_label_test", "props": {"time": 100, "weight": -10}}
]
'
deleteAll deletes all adjacent edges that start from the starting vertex. The following operation first fetches the edges starting from 101 and then deletes all of them. Note that not only all outgoing edges from 101 but also incoming edges to 101 will be deleted.
curl -XPOST localhost:9000/graphs/edges/deleteAll -H 'Content-Type: Application/json' -d '
[
{"ids" : [101], "label":"s2graph_label_test", "direction": "out", "timestamp":1417616441}
]
'
With weak consistency, for an insert s2graph does not look for the unique edge defined by (from, to, label, direction). It simply creates a new edge for the user's request, with no read and no consistency check. Note that this difference allows multiple edges with the same (from, to, label, direction) to exist. See the example above.
To delete edges with weak consistency, first fetch the existing edges. From the getEdges query result json, extract the "results" part, then call /graphs/edges/delete with it. Note that the request body json is copied from the "results" part of /graphs/getEdges.
curl -XPOST localhost:9000/graphs/edges/delete -H 'Content-Type: Application/json' -d '
[
{
"cacheRemain": -148,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 3,
"timestamp": 3,
"score": 1,
"props": {
"_timestamp": 3,
"time": -30,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -148,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 2,
"timestamp": 2,
"score": 1,
"props": {
"_timestamp": 2,
"time": -10,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -148,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 1,
"timestamp": 1,
"score": 1,
"props": {
"_timestamp": 1,
"time": 0,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
}
]
'
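The fetch-then-delete flow for weak consistency amounts to copying the "results" array of a getEdges response into the body of a delete request. A minimal sketch of that extraction step follows. The function name is hypothetical, and the HTTP calls themselves are left out.

```python
import json


def delete_body_from_get_edges(response_json: str) -> str:
    """Turn a /graphs/getEdges response into a /graphs/edges/delete request body.

    The "results" array is passed through unchanged; a response without
    results produces an empty array.
    """
    response = json.loads(response_json)
    return json.dumps(response.get("results", []))


# Trimmed-down response in the shape shown earlier in this section.
sample_response = json.dumps({
    "size": 1,
    "results": [
        {"from": 101, "to": "10", "label": "s2graph_label_test_weak",
         "direction": "out", "timestamp": 3, "props": {"time": -30}}
    ],
    "impressionId": 1972178414,
})
delete_body = delete_body_from_get_edges(sample_response)
```

The returned string is what would be sent with `-d` to /graphs/edges/delete.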
Update with weak consistency: like insert, s2graph does not check uniqueness; all responsibility is on users. Like delete, update requires fetching the existing edges first. After fetching the edges, update their props and then issue an update request.
Increment with weak consistency: like update, s2graph does not check uniqueness; all responsibility is on users. Like delete, increment requires fetching the existing edges first. After fetching the edges, update their props and then issue an increment request.
deleteAll is not supported with weak consistency. Like update, fetch all edges and then issue a delete individually on the fetched edges.
Vertices are the two ends that an edge is connecting, and correspond to a column defined for a service. In case you need to store some metadata corresponding to vertices and make queries regarding them, you can insert and manipulate vertices rather than edges.
Unlike edges and their labels, properties on vertices are not indexed and do not require a predefined schema nor default values. The following fields are used when operating on vertices.
field name | definition | data type | note | example |
---|---|---|---|---|
timestamp | when this request is issued | long | required. in seconds since the epoch | 1417616431 |
operation | the operation to perform; one of insert, delete, update, increment | string | required only for bulk operations; aliases are insert: i, delete: d, update: u, increment: in; default is insert. | "i", "insert" |
serviceName | corresponding service's name | string | required. | "kakaotalk"/"kakaogroup" |
columnName | corresponding column's name | string | required. | "xxx_service_user_id" |
id | a unique identifier of this vertex | long/string | required. prefer long if possible. | 101 |
props | extra properties of this vertex. | json dictionary | required. | {"is_active_user": true, "age":10, "gender": "F", "country_iso": "kr"} |
curl -XPOST localhost:9000/graphs/vertices/insert/s2graph/account_id -H 'Content-Type: Application/json' -d '
[
{"id":1,"props":{"is_active":true, "talk_user_id":10},"timestamp":1417616431},
{"id":2,"props":{"is_active":true, "talk_user_id":12},"timestamp":1417616431},
{"id":3,"props":{"is_active":false, "talk_user_id":13},"timestamp":1417616431},
{"id":4,"props":{"is_active":true, "talk_user_id":14},"timestamp":1417616431},
{"id":5,"props":{"is_active":true, "talk_user_id":15},"timestamp":1417616431}
]
'
The delete operation deletes only the vertex data of a specified column and does not delete the edges connected to those vertices.
Important notes
This means that edges returned by a query can contain deleted vertices. Clients need to check whether those vertices are valid.
The deleteAll operation deletes all vertex data of a specified column and also deletes all edges that are connected to those vertices. Example:
curl -XPOST localhost:9000/graphs/vertices/deleteAll/s2graph/account_id -H 'Content-Type: Application/json' -d '
[{"id": 1, "timestamp": 193829198}]
'
This is an extremely expensive operation. The following pseudocode shows how it works:
vertices = vertex list to delete
for vertex in vertices
    labels = fetch all labels that this vertex is included.
    for label in labels
        for index in label.indices
            edges = G.read with limit 50K
            for edge in edges
                edge.delete
The total complexity is O(L * L.I) reads + O(L * L.I * 50K) writes in the worst case. If a vertex to delete has more than 50K edges, the delete operation will not be consistent.
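To get a feel for the bound above, the worst-case number of reads and writes for one vertex can be computed from the number of indexes on each label the vertex belongs to. This is a back-of-the-envelope sketch with hypothetical function names. The 50K read limit comes from the pseudocode.

```python
READ_LIMIT = 50_000  # edges fetched per read, as in the pseudocode


def delete_all_worst_case(indexes_per_label):
    """Worst-case (reads, writes) for deleting one vertex.

    indexes_per_label holds, for each label the vertex is included in,
    the number of indexes defined on that label.
    """
    reads = sum(indexes_per_label)   # one read per (label, index) pair
    writes = reads * READ_LIMIT      # every read may return READ_LIMIT edges
    return reads, writes


def may_be_inconsistent(edge_count):
    """Per the note above, more than 50K edges breaks consistency of the delete."""
    return edge_count > READ_LIMIT


# A vertex included in three labels that have 2, 1 and 3 indexes respectively.
reads, writes = delete_all_worst_case([2, 1, 3])
```

Even this small example allows for six reads and up to 300,000 edge deletes, which is why the operation should be used sparingly.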
The update operation on vertices uses the same parameters as the insert operation.
The increment operation on vertices is not yet implemented; stay tuned.
Once you have your graph data uploaded to s2graph, you can traverse your graph using our REST APIs. Queries contain the vertices to start traversing from, and a list of labels paired with the filters and scoring weights used during the traversal. Query requests are structured as follows:
field name | definition | data type | note | example |
---|---|---|---|---|
srcVertices | vertices to start traversing. | json array of json dictionary specifying each vertex, with "serviceName", "columnName", "id" fields. | required. | [{"serviceName": "kakao", "columnName": "account_id", "id":1}] |
steps | list of steps for traversing. | json array of steps | explained below | [[{"label": "graph_test", "direction": "out", "limit": 100, "scoring":{"time": 0, "weight": 1}}]] |
removeCycle | when traversing to the next step, don't traverse already visited vertices | true/false | default is true. "already visited" is defined by (label, vertex), so if the steps are friend -> friend, second-depth friends are removed if they exist in the first-depth friends | |
select | which fields on the edge to include in the result json | json array of string | | ["label", "to", "from"] |
groupBy | how to group the results | json array of string | | ["to"] |
filterOut | another query that is run concurrently; its results are filtered out of this query's results | another query json | | |
step: Each step defines what to traverse in a single hop on the graph. The first step has to be a direct neighbor of the starting vertices, the second step is a direct neighbor of the vertices from the first step, and so on. A step is specified with a list of query params, hence the steps field of a query request is an array of arrays of dictionaries.
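The nesting of the steps field (an array of steps, each step an array of query-param dictionaries) is easy to get wrong by hand, so a request body can be assembled and sanity-checked in code. The `build_query` helper below is hypothetical, and the sample values reuse the examples from the table above.

```python
import json


def build_query(src_vertices, steps):
    """Assemble a query body and check the array-of-arrays shape of steps."""
    if not src_vertices:
        raise ValueError("srcVertices is required")
    for step in steps:
        if not isinstance(step, list):
            raise TypeError("each step must be a list of query params")
        for param in step:
            if "label" not in param:
                raise ValueError("each query param needs a label")
    return {"srcVertices": src_vertices, "steps": steps}


# Two hops over the same label: neighbors first, then neighbors of neighbors.
query = build_query(
    [{"serviceName": "kakao", "columnName": "account_id", "id": 1}],
    [
        [{"label": "graph_test", "direction": "out", "limit": 100}],
        [{"label": "graph_test", "direction": "out", "limit": 10}],
    ],
)
body = json.dumps(query)
```

Each inner list is one hop; putting several dictionaries into the same inner list would traverse several labels within that hop.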
step param:
field name | definition | data type | note | example |
---|---|---|---|---|
weights | weight constant to multiply for labels in this step | json dictionary | optional | {"graph_test": 0.3, "graph_test2": 0.2} |
nextStepThreshold | score threshold for current step edges to pass to next step | double | | |
nextStepLimit | number of edges in current step to pass to next step | double | if this parameter is given, the current step is sorted by score and the topK is taken before traversing the next step | |
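The effect of nextStepThreshold and nextStepLimit can be sketched as a filter followed by a sort-and-truncate over the scored edges of the current step. The function below is hypothetical. Whether the threshold comparison is inclusive, and the order in which the two parameters are applied, are assumptions made for this sketch.

```python
def edges_for_next_step(scored_edges, next_step_threshold=None, next_step_limit=None):
    """Select which (edge_id, score) pairs of the current step feed the next step."""
    kept = list(scored_edges)
    if next_step_threshold is not None:
        # Assumption: edges scoring exactly the threshold are kept.
        kept = [edge for edge in kept if edge[1] >= next_step_threshold]
    if next_step_limit is not None:
        # Sort by score, highest first, then take the top K.
        kept = sorted(kept, key=lambda edge: edge[1], reverse=True)[:next_step_limit]
    return kept


current_step = [("a", 0.2), ("b", 0.9), ("c", 0.5)]
passed = edges_for_next_step(current_step, next_step_threshold=0.3, next_step_limit=1)
```

With only the threshold set, the original order of the surviving edges is preserved; sorting happens only when a limit is given.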
query param:
field name | definition | data type | note | example |
---|---|---|---|---|
label | name of label to traverse. | string | required. must be an existing label. | "graph_test" |
direction | in/out direction to traverse | string | optional, default out | "out" |
limit | how many edges to fetch | int | optional, default 10 | 10 |
offset | start position on this index | int | optional, default 0 | 50 |
interval | the range to filter on indexed properties | json dict | optional | {"from": {"time": 0, "weight": 1}, "to": {"time": 1, "weight": 15}} |
duration | time range | json dict | optional | {"from": 1407616431, "to": 1417616431} |
scoring | a mapping from indexed property names to their weights; the weighted sum of the property values becomes the final score | json dict | optional | {"time": 1, "weight": 2} |
where | filter condition (like SQL's WHERE clause). Logical operators (and/or) are supported, and each condition can be an exact match (=), a set (in), or a range (between x and y). Do not quote string values | string | optional | ex) "((_from = 123 and _to = abcd) or gender = M) and is_hidden = false and weight between 1 and 10 or time in (1, 2, 3)". Note that only long/string/boolean types are supported |
outputField | replace the edge's to field with this field from props | string | optional | "outputField": "service_user_id" replaces the to field with props['service_user_id'] |
exclude | whether vertices that appear on both this label and other labels in this step should be filtered out | boolean | optional, default false | true: vertices that appear on this label and on other labels in this step are filtered out |
include | whether vertices that appear on both this label and other labels in this step should remain in the result | boolean | optional, default false | |
duplicate | policy for handling duplicate edges, i.e. edges with the same (from, to, label, direction) | string, one of "first", "sum", "countSum", "raw" | optional, default "first" | "first" keeps only the first occurrence of an edge. "sum" adds up the scores of identical edges and keeps one edge. "countSum" counts the occurrences of identical edges and keeps one edge. "raw" keeps all identical edges as they are. |
rpcTimeout | timeout for this request | integer | optional, default 100ms | note: maximum value should be less than 1000ms |
maxAttempt | how many times the client will try to fetch the result from HBase | integer | optional, default 1 | note: maximum value should be less than 5 |
_to | to vertex id | string | optional | note: use this to get an edge to a certain vertex |
threshold | score threshold for filtering out result edges | double | optional, default 0.0 | |
transform | rules that define how to transform the _to field value on an edge | json array of json array | optional, default [ ["_to"]] | |
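To make the scoring parameter concrete, here is a small worked example of the weighted sum in Python. The function name is hypothetical, and treating a missing property as 0 is an assumption rather than documented behavior.

```python
def weighted_score(props, scoring):
    """Weighted sum of the property values named in `scoring` (sketch)."""
    # Assumption: a property missing from the edge contributes 0.
    return sum(weight * props.get(name, 0) for name, weight in scoring.items())


# An edge with time=1 and weight=15, scored with {"time": 1, "weight": 2}:
# 1 * 1 + 2 * 15 = 31
print(weighted_score({"time": 1, "weight": 15}, {"time": 1, "weight": 2}))
```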
s2graph provides a query DSL similar to SQL. s2graph uses getEdges to fetch data and traverse multiple steps, just as users use a select query in MySQL.
The query DSL can get complex and users have struggled with it, so if the example queries in this section do not cover your use case, please open an issue describing it.
select all edges that match the given query.
return the edge for a given vertex pair only if the edge exists.
this is a very basic query that fetches all adjacent edges from the starting vertex.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
"srcVertices": [
{
"serviceName": "s2graph",
"columnName": "user_id_test",
"id": 101
}
],
"steps": [
{
"step": [
{
"label": "s2graph_label_test_weak",
"direction": "out",
"offset": 0,
"limit": 10,
"duplicate": "raw"
}
]
}
]
}
'
note the "duplicate" field. when the consistency level is weak and multiple edges exist with the same (from, to, label, direction), the query needs a duplicate policy. s2graph provides 4 duplicate policies on edges.
- raw: return all edges.
- first: return only the first edge if multiple edges exist. this is the default.
- countSum: return only one edge, with the number of times the same edge occurs.
- scoreSum: return only one edge, with the sum of their scores.
with the "raw" duplicate policy, you can see that there are actually 3 edges with the same (from, to, label, direction).
{
"size": 3,
"degrees": [
{
"from": 101,
"label": "s2graph_label_test_weak",
"direction": "out",
"_degree": 3
}
],
"results": [
{
"cacheRemain": -29,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 6,
"timestamp": 6,
"score": 1,
"props": {
"_timestamp": 6,
"time": -30,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -29,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 5,
"timestamp": 5,
"score": 1,
"props": {
"_timestamp": 5,
"time": -10,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -29,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 4,
"timestamp": 4,
"score": 1,
"props": {
"_timestamp": 4,
"time": 0,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
}
],
"impressionId": 1972178414
}
with the "countSum" policy, the query returns only one edge, but with score 3.
{
"size": 1,
"degrees": [
{
"from": 101,
"label": "s2graph_label_test_weak",
"direction": "out",
"_degree": 3
}
],
"results": [
{
"cacheRemain": -135,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 4,
"timestamp": 4,
"score": 3,
"props": {
"_timestamp": 4,
"time": 0,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
}
],
"impressionId": 1972178414
}
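The duplicate policies can be mimicked on the client side with the sketch below. It only illustrates the semantics described above and is not how s2graph implements them. The function name is hypothetical, both "sum" and "scoreSum" are accepted because this document uses both spellings, and which duplicate survives (here, the first one seen) is an assumption.

```python
def apply_duplicate_policy(edges, policy="first"):
    """Collapse edges sharing (from, to, label, direction) per policy (sketch)."""
    if policy == "raw":
        return list(edges)  # keep every edge as it is
    grouped = {}
    for edge in edges:
        key = (edge["from"], edge["to"], edge["label"], edge["direction"])
        if key not in grouped:
            grouped[key] = dict(edge)  # copy so the input is not mutated
            if policy == "countSum":
                grouped[key]["score"] = 1  # start counting occurrences
        elif policy == "countSum":
            grouped[key]["score"] += 1
        elif policy in ("sum", "scoreSum"):
            grouped[key]["score"] += edge["score"]
        # "first": later duplicates are simply ignored
    return list(grouped.values())


edges = [
    {"from": 101, "to": "10", "label": "s2graph_label_test_weak",
     "direction": "out", "timestamp": ts, "score": 1}
    for ts in (6, 5, 4)
]
print(len(apply_duplicate_policy(edges, "raw")))
print(apply_duplicate_policy(edges, "countSum")[0]["score"])
```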
continuing from example 1, sometimes users want only a few fields of each edge. this can be done with select.
{
"select": ["from", "to", "label"],
"srcVertices": [
{
"serviceName": "s2graph",
"columnName": "user_id_test",
"id": 101
}
],
"steps": [
{
"step": [
{
"label": "s2graph_label_test_weak",
"direction": "out",
"offset": 0,
"limit": 10,
"duplicate": "raw"
}
]
}
]
}
now the user tells s2graph that only the ["from", "to", "label"] fields are necessary, so s2graph returns only those fields in the result json.
{
"size": 3,
"degrees": [
{
"from": 101,
"label": "s2graph_label_test_weak",
"direction": "out",
"_degree": 3
}
],
"results": [
{
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak"
},
{
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak"
},
{
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak"
}
],
"impressionId": 1972178414
}
the select option defaults to an empty array; if it is empty, all edge fields are returned.
sometimes users want to group the result edges by some of their fields.
{
"select": ["from", "to", "label", "direction", "timestamp", "score", "time", "weight", "is_hidden", "is_blocked"],
"groupBy": ["from", "to", "label"],
"srcVertices": [
{
"serviceName": "s2graph",
"columnName": "user_id_test",
"id": 101
}
],
"steps": [
{
"step": [
{
"label": "s2graph_label_test_weak",
"direction": "out",
"offset": 0,
"limit": 10,
"duplicate": "raw"
}
]
}
]
}
now the result json groups all edges by their ["from", "to", "label"] fields.
{
"size": 1,
"results": [
{
"groupBy": {
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak"
},
"agg": [
{
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"timestamp": 6,
"score": 1,
"props": {
"time": -30,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"timestamp": 5,
"score": 1,
"props": {
"time": -10,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"timestamp": 4,
"score": 1,
"props": {
"time": 0,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
}
]
}
],
"impressionId": 1972178414
}
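The shape of the grouped result can be reproduced locally with the following sketch, which groups a list of edge dicts by the requested fields. The function name is hypothetical, and the code only mirrors the "groupBy"/"agg" layout shown above; it is not the server-side implementation.

```python
def group_edges(edges, group_by):
    """Group edges by the given top-level fields (illustrative sketch)."""
    groups = {}
    for edge in edges:
        key = tuple(edge[field] for field in group_by)
        groups.setdefault(key, []).append(edge)
    return [
        {"groupBy": dict(zip(group_by, key)), "agg": members}
        for key, members in groups.items()
    ]


edges = [
    {"from": 101, "to": "10", "label": "s2graph_label_test_weak", "timestamp": ts}
    for ts in (6, 5, 4)
]
grouped = group_edges(edges, ["from", "to", "label"])
print(len(grouped), len(grouped[0]["agg"]))
```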
sometimes it is necessary to run 2 queries concurrently, then filter the result of one with the result of the other.
{
"filterOut": {
"srcVertices": [
{
"serviceName": "s2graph",
"columnName": "user_id_test",
"id": 100
}
],
"steps": [
{
"step": [
{
"label": "s2graph_label_test_weak",
"direction": "out",
"offset": 0,
"limit": 10,
"duplicate": "raw"
}
]
}
]
},
"srcVertices": [
{
"serviceName": "s2graph",
"columnName": "user_id_test",
"id": 101
}
],
"steps": [
{
"step": [
{
"label": "s2graph_label_test_weak",
"direction": "out",
"offset": 0,
"limit": 10,
"duplicate": "raw"
}
]
}
]
}
s2graph runs 2 queries concurrently: one is the main query and the other is the filterOut query. the example above only shows the syntax, so here is a more concrete example.
{
"filterOut": {
"srcVertices": [
{
"columnName": "uuid",
"id": "alec",
"serviceName": "nachu"
}
],
"steps": [
{
"step": [
{
"direction": "out",
"label": "nachu_user_view_news",
"limit": 100,
"offset": 0
}
]
}
]
},
"srcVertices": [
{
"columnName": "uuid",
"id": "alec",
"serviceName": "nachu"
}
],
"steps": [
{
"nextStepLimit": 10,
"step": [
{
"direction": "out",
"duplicate": "scoreSum",
"label": "nachu_user_view_news",
"limit": 100,
"offset": 0,
"timeDecay": {
"decayRate": 0.1,
"initial": 1,
"timeUnit": 86000000
}
}
]
},
{
"nextStepLimit": 10,
"step": [
{
"label": "nachu_news_belongto_category",
"limit": 1
}
]
},
{
"step": [
{
"direction": "in",
"label": "nachu_news_belongto_category",
"limit": 10
}
]
}
]
}
the main query above traverses the news graph as follows.
- find the list of news that user alec has read.
- find the categories of the news from step 1.
- find other news published in the same categories.
at the same time, alec does not want recommendations for news he has already read. by using filterOut, the client can apply an extra filtering step to the result.
news that alec read
{
"size": 5,
"degrees": [
{
"from": "alec",
"label": "nachu_user_view_news",
"direction": "out",
"_degree": 6
}
],
"results": [
{
"cacheRemain": -19,
"from": "alec",
"to": 20150803143507760,
"label": "nachu_user_view_news",
"direction": "out",
"_timestamp": 1438591888454,
"timestamp": 1438591888454,
"score": 0.9342237306639056,
"props": {
"_timestamp": 1438591888454
}
},
{
"cacheRemain": -19,
"from": "alec",
"to": 20150803150406010,
"label": "nachu_user_view_news",
"direction": "out",
"_timestamp": 1438591143640,
"timestamp": 1438591143640,
"score": 0.9333716513280771,
"props": {
"_timestamp": 1438591143640
}
},
{
"cacheRemain": -19,
"from": "alec",
"to": 20150803144908340,
"label": "nachu_user_view_news",
"direction": "out",
"_timestamp": 1438581933262,
"timestamp": 1438581933262,
"score": 0.922898833570944,
"props": {
"_timestamp": 1438581933262
}
},
{
"cacheRemain": -19,
"from": "alec",
"to": 20150803124627492,
"label": "nachu_user_view_news",
"direction": "out",
"_timestamp": 1438581485765,
"timestamp": 1438581485765,
"score": 0.9223930035297659,
"props": {
"_timestamp": 1438581485765
}
},
{
"cacheRemain": -19,
"from": "alec",
"to": 20150803113311090,
"label": "nachu_user_view_news",
"direction": "out",
"_timestamp": 1438580536376,
"timestamp": 1438580536376,
"score": 0.9213207756669546,
"props": {
"_timestamp": 1438580536376
}
}
],
"impressionId": 354266627
}
without filterOut
{
"size": 2,
"degrees": [
{
"from": 1028,
"label": "nachu_news_belongto_category",
"direction": "in",
"_degree": 2
}
],
"results": [
{
"cacheRemain": -33,
"from": 1028,
"to": 20150803105805092,
"label": "nachu_news_belongto_category",
"direction": "in",
"_timestamp": 1438590169146,
"timestamp": 1438590169146,
"score": 0.9342777143725886,
"props": {
"updateTime": 20150803172249144,
"_timestamp": 1438590169146
}
},
{
"cacheRemain": -33,
"from": 1028,
"to": 20150803143507760,
"label": "nachu_news_belongto_category",
"direction": "in",
"_timestamp": 1438581548486,
"timestamp": 1438581548486,
"score": 0.9342777143725886,
"props": {
"updateTime": 20150803145908490,
"_timestamp": 1438581548486
}
}
],
"impressionId": -14034523
}
with filterOut
{
"size": 1,
"degrees": [],
"results": [
{
"cacheRemain": 85957406,
"from": 1028,
"to": 20150803105805092,
"label": "nachu_news_belongto_category",
"direction": "in",
"_timestamp": 1438590169146,
"timestamp": 1438590169146,
"score": 0.9343106784173475,
"props": {
"updateTime": 20150803172249144,
"_timestamp": 1438590169146
}
}
],
"impressionId": -14034523
}
note that 20150803143507760 has been filtered out.
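Conceptually, filterOut behaves like the set difference sketched below, using the vertex ids from the results above. Matching edges on their `to` vertex id is an assumption made for this illustration, and the function name is hypothetical.

```python
def filter_out(main_results, filter_results):
    """Drop main-query edges whose `to` also appears in the filterOut result."""
    # Assumption: edges are matched on their `to` vertex id.
    seen = {edge["to"] for edge in filter_results}
    return [edge for edge in main_results if edge["to"] not in seen]


# `to` ids taken from "news that alec read" and "without filterOut" above.
already_read = [
    {"to": 20150803143507760},
    {"to": 20150803150406010},
    {"to": 20150803144908340},
    {"to": 20150803124627492},
    {"to": 20150803113311090},
]
recommended = [{"to": 20150803105805092}, {"to": 20150803143507760}]
print([edge["to"] for edge in filter_out(recommended, already_read)])
```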
s2graph provides step-level aggregation, and users can choose the top K of the aggregated results. back to s2graph_label_test_weak label, user may want to
for traversing, s2graph uses the to id (vertex id) of the current step's output edges as the start vertex id of the next step. sometimes users want to keep traversing with a property value of the current step's output edges instead.
below is the result from example 1.
{
"size": 3,
"degrees": [
{
"from": 101,
"label": "s2graph_label_test_weak",
"direction": "out",
"_degree": 3
}
],
"results": [
{
"cacheRemain": -147,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 6,
"timestamp": 6,
"score": 1,
"props": {
"_timestamp": 6,
"time": -30,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -147,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 5,
"timestamp": 5,
"score": 1,
"props": {
"_timestamp": 5,
"time": -10,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -147,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 4,
"timestamp": 4,
"score": 1,
"props": {
"_timestamp": 4,
"time": 0,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
}
],
"impressionId": 1972178414
}
Here is how to use transform.
{
"select": [],
"srcVertices": [
{
"serviceName": "s2graph",
"columnName": "user_id_test",
"id": 101
}
],
"steps": [
{
"step": [
{
"label": "s2graph_label_test_weak",
"direction": "out",
"offset": 0,
"limit": 10,
"duplicate": "raw",
"transform": [
["_to"],
["time.$", "time"]
]
}
]
}
]
}
note that 3 edges become 6 edges since there are two transform rules. the first rule simply emits the original edge, and the second rule replaces the to value using string interpolation.
{
"size": 6,
"degrees": [
{
"from": 101,
"label": "s2graph_label_test_weak",
"direction": "out",
"_degree": 3
},
{
"from": 101,
"label": "s2graph_label_test_weak",
"direction": "out",
"_degree": 3
}
],
"results": [
{
"cacheRemain": -8,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 6,
"timestamp": 6,
"score": 1,
"props": {
"_timestamp": 6,
"time": -30,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -8,
"from": 101,
"to": "time.-30",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 6,
"timestamp": 6,
"score": 1,
"props": {
"_timestamp": 6,
"time": -30,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -8,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 5,
"timestamp": 5,
"score": 1,
"props": {
"_timestamp": 5,
"time": -10,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -8,
"from": 101,
"to": "time.-10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 5,
"timestamp": 5,
"score": 1,
"props": {
"_timestamp": 5,
"time": -10,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -8,
"from": 101,
"to": "10",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 4,
"timestamp": 4,
"score": 1,
"props": {
"_timestamp": 4,
"time": 0,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
},
{
"cacheRemain": -8,
"from": 101,
"to": "time.0",
"label": "s2graph_label_test_weak",
"direction": "out",
"_timestamp": 4,
"timestamp": 4,
"score": 1,
"props": {
"_timestamp": 4,
"time": 0,
"weight": 0,
"is_hidden": false,
"is_blocked": false
}
}
],
"impressionId": 1972178414
}
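The transform rules used above can be approximated with the sketch below: `["_to"]` keeps the edge unchanged, and `["time.$", "time"]` substitutes the value of the `time` property for `$`. The function name is hypothetical, and filling several `$` placeholders left to right is an assumption; the sketch is not the server-side implementation.

```python
def apply_transform(edge, rules):
    """Expand one edge into one edge per transform rule (sketch)."""
    out = []
    for rule in rules:
        if rule == ["_to"]:
            out.append(dict(edge))  # keep the original `to` value
            continue
        template, *fields = rule
        value = template
        for field in fields:
            # Assumption: each "$" is replaced, left to right, by a property value.
            value = value.replace("$", str(edge["props"][field]), 1)
        out.append({**edge, "to": value})
    return out


rules = [["_to"], ["time.$", "time"]]
edges = [{"from": 101, "to": "10", "props": {"time": t}} for t in (-30, -10, 0)]
transformed = [e for edge in edges for e in apply_transform(edge, rules)]
print([e["to"] for e in transformed])
```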
chaining multiple steps yields a traversal query. the following query finds friends of friends.
{
"srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
"steps": [
{
"step": [
{"label": "friends", "direction": "out", "limit": 100}
]
},
{
"step": [
{"label": "friends", "direction": "out", "limit": 10}
]
}
]
}
just as with 2 steps, add more steps as needed. watch the limit on each step, since the search space equals the product of the max limits of all steps.
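As a quick worked example of that multiplication, the snippet below computes the upper bound on the number of edges a query can touch from its per-step limits. The function name is hypothetical, and `math.prod` requires Python 3.8 or later.

```python
from math import prod


def max_search_space(step_limits):
    """Upper bound on the number of edges reached: the product of step limits."""
    return prod(step_limits)


# Friends of friends with limits 100 and 10: 100 * 10 = 1000 edges at most.
print(max_search_space([100, 10]))
# Adding a third step with limit 10 multiplies the bound again: 10000.
print(max_search_space([100, 10, 10]))
```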
Example 1. Selecting the first 100 edges of label graph_test, which start from the vertex with account_id=1, sorted using the default index of graph_test.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
"srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
"steps": [
[{"label": "graph_test", "direction": "out", "offset": 0, "limit": 100
}]
]
}
'
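The same request can be prepared from Python using only the standard library, as sketched below. The helper name and the `base_url` default are assumptions that mirror the curl command above; the request is only built here, and the commented lines show how it could be sent to a running server.

```python
import json
import urllib.request


def build_get_edges_request(query, base_url="http://localhost:9000"):
    """Build (but do not send) a POST request for /graphs/getEdges."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        base_url + "/graphs/getEdges",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query = {
    "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id": 1}],
    "steps": [
        [{"label": "graph_test", "direction": "out", "offset": 0, "limit": 100}]
    ],
}
request = build_get_edges_request(query)

# To send it against a running server:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```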
Example 2. Selecting the 50th ~ 100th edges from the same vertex.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
"srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
"steps": [
[{"label": "graph_test", "direction": "in", "offset": 50, "limit": 50}]
]
}
'
Example 3. Selecting the 50th ~ 100th edges from the same vertex, now with a time range filter.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
"srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
"steps": [
[{"label": "graph_test", "direction": "in", "offset": 50, "limit": 50, "duration": {"from": 1416214118, "to": 1416214218}}]
]
}
'
Example 4. Selecting the 50th ~ 100th edges from the same vertex, sorted using the indexed properties time and weight, with the same time range filter, and applying a weighted sum using time: 1.5, weight: 10.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
"srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
"steps": [
[{"label": "graph_test", "direction": "in", "offset": 50, "limit": 50, "duration": {"from": 1416214118, "to": 1416214218}, "scoring": {"time": 1.5, "weight": 10}}]
]
}
'
Example 5. Selecting 100 edges representing friends, from the vertex with account_id=1, and again selecting their 10 friends, therefore selecting at most 1,000 "friends of friends".
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
"srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
"steps": [
[{"label": "friends", "direction": "out", "limit": 100}],
[{"label": "friends", "direction": "out", "limit": 10}]
]
}
'
Example 6. Selecting 100 edges representing friends and their 10 listened_music edges, to get "music that my friends have listened to".
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
"srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
"steps": [
[{"label": "talk_friend", "direction": "out", "limit": 100}],
[{"label": "play_music", "direction": "out", "limit": 10}]
]
}
'
Example 7. Selecting my friends who played music id 200.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
"srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
"steps": [
[{"label": "talk_friend", "direction": "out", "limit": 100}],
[{"label": "play_music", "direction": "out", "_to": 200}]
]
}
'
Example 8. A more general way to check whether a list of edges exists.
curl -XPOST localhost:9000/graphs/checkEdges -H 'Content-Type: Application/json' -d '
[
{"label": "talk_friend", "direction": "out", "from": 1, "to": 100},
{"label": "talk_friend", "direction": "out", "from": 1, "to": 101}
]
'
Selecting all vertices from column account_id of a service s2graph.
curl -XPOST localhost:9000/graphs/getVertices -H 'Content-Type: Application/json' -d '
[
{"serviceName": "s2graph", "columnName": "account_id", "ids": [1, 2, 3]},
{"serviceName": "agit", "columnName": "user_id", "ids": [1, 2, 3]}
]
'
In many cases, the first step to start using s2graph is to migrate a large dataset into s2graph. s2graph provides a bulk loading script for importing the initial dataset.
To use bulk load, you need a running Spark cluster and a TSV file in the bulk load format.
Note that if you don't need extra properties on vertices(i.e., you only need vertex id), you only need to publish the edges and not the vertices. Publishing edges will effectively create vertices with empty properties.
timestamp | operation | logType | from | to | label | props |
---|---|---|---|---|---|---|
1416236400 | insert | edge | 56493 | 26071316 | talk_friend_long_term_agg_by_account_id | {"timestamp":1416236400,"score":0} |
timestamp | operation | logType | id | serviceName | columnName | props |
---|---|---|---|---|---|---|
1416236400 | insert | vertex | 56493 | kakaotalk | account_id | {"is_active":true, "country_iso": "kr"} |
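A bulk load file in this format could be generated with a sketch like the one below. It assumes one record per line with tab-separated columns in the order shown in the tables above, and props serialized as compact JSON; the function names are hypothetical.

```python
import json


def edge_line(timestamp, operation, src, dst, label, props):
    """Format one edge record: timestamp, operation, logType, from, to, label, props."""
    fields = [timestamp, operation, "edge", src, dst, label,
              json.dumps(props, separators=(",", ":"))]
    return "\t".join(str(field) for field in fields)


def vertex_line(timestamp, operation, vertex_id, service_name, column_name, props):
    """Format one vertex record: timestamp, operation, logType, id, serviceName, columnName, props."""
    fields = [timestamp, operation, "vertex", vertex_id, service_name, column_name,
              json.dumps(props, separators=(",", ":"))]
    return "\t".join(str(field) for field in fields)


print(edge_line(1416236400, "insert", 56493, 26071316,
                "talk_friend_long_term_agg_by_account_id",
                {"timestamp": 1416236400, "score": 0}))
print(vertex_line(1416236400, "insert", 56493, "kakaotalk", "account_id",
                  {"is_active": True, "country_iso": "kr"}))
```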
to build the bulk loader, you need to build the loader project. just run the following command.
sbt "project loader" "clean" "assembly"
you will see s2graph-loader-assembly-0.0.4-SNAPSHOT.jar under loader/target/scala-2.xx/
For bulk loading, the source data can be either in HDFS or in a Kafka queue.
- run subscriber.GraphSubscriber to bulk upload an HDFS TSV file into s2graph.
- check how many edges are parsed/stored by looking at the Spark UI.
this assumes that the data is in the bulk load format and is constantly coming into Kafka MQ.
- run subscriber.GraphSubscriberStreaming to extract events from the Kafka topic and load them into s2graph.
- check how many edges are parsed/stored by looking at the Spark UI.
the following is how we do online migration from an RDBMS to s2graph. this assumes that the client sends the same events to both the primary storage (RDBMS) and s2graph.
- mark the label as isAsync true. this queues all events into the Kafka queue.
- dump the RDBMS and build a bulk load file in TSV.
- upload the TSV file with subscriber.GraphSubscriber.
- mark the label as isAsync false. this stops queuing events into the Kafka queue and applies changes to s2graph directly.
- since s2graph is idempotent, it is safe to replay queued messages during the bulk load process, so just use subscriber.GraphSubscriberStreaming to apply the queued events.
- kakao talk full graph(8.8 billion edges)
- a sample of 10 million user ids that have more than 100 friends.
- number of region servers for HBase = 20
find 50 talk friends then find 20 talk friends
{
"srcVertices": [{"serviceName": "kakaotalk", "columnName": "talk_user_id", "id":$id}],
"steps": [
[{"label": "talk_friend", "direction": "out", "limit": 50}],
[{"label": "talk_friend", "direction": "out", "limit": 20}]
]
}
total vuser = 980
| number of rest server | tps | mean test time |
|:------- | --- |:----: |
| 10 | 5,981.5 | 151.36 ms |
| 20 | 10,589 | 86.45 ms |
| 30 | 16,295.4 | 56.43 ms |
find 100 talk friends
{
"srcVertices": [{"serviceName": "kakaotalk", "columnName": "talk_user_id", "id":$id}],
"steps": [
[{"label": "talk_friend", "direction": "out", "limit": 100}]
]
}
total vuser = 2,072
| number of rest server | tps | mean test time |
|:------- | --- |:----: |
| 20 | 53,713.4 | 37.31 ms |
{
"srcVertices": [
{
"serviceName": "kakaotalk",
"columnName": "talk_user_id",
"id": %s
}
],
"steps": [
[
{
"label": "talk_friend_long_term_agg",
"direction": "out",
"offset": 0,
"limit": %d
}
]
]
}
| number of rest server | vuser | offset | first step limit | tps | latency |
|:------- | --- |:----: | --- | --- | --- |
| 1 | 30 | 0 | 10 | 9790TPS | 3ms |
| 1 | 30 | 80 | 10 | 9,958.2TPS | 2.91ms |
| 1 | 30 | 0 | 20 | 7,418.1TPS | 3.92ms |
| 1 | 30 | 0 | 40 | 5,118.5TPS | 5.72ms |
| 1 | 30 | 0 | 60 | 3,966.9TPS | 7.38ms |
| 1 | 30 | 0 | 80 | 3,408.4TPS | 8.58ms |
| 1 | 30 | 0 | 100 | 3,048.1TPS | 9.76ms |
| 2 | 60 | 0 | 100 | 5,869.4TPS | 10.04ms |
| 4 | 120 | 0 | 100 | 11,473.1TPS | 10.27ms |
{
"srcVertices": [
{
"serviceName": "kakaotalk",
"columnName": "talk_user_id",
"id": %s
}
],
"steps": [
[
{
"label": "talk_friend_long_term_agg",
"direction": "out",
"offset": 0,
"limit": %d
}
],
[
{
"label": "talk_friend_long_term_agg",
"direction": "out",
"offset": 0,
"limit": %d
}
]
]
}
| number of rest server | vuser | first step limit | second step limit | tps | latency |
|:------- | --- |:----: | --- | --- | --- |
| 1 | 30 | 10 | 10 | 2,008.2TPS | 14.7ms |
| 1 | 30 | 10 | 20 | 1,221.3TPS | 24.13ms |
| 1 | 30 | 10 | 40 | 678TPS | 43.92ms |
| 1 | 30 | 10 | 60 | 488.2TPS | 60.72ms |
| 1 | 30 | 10 | 80 | 360.2TPS | 82.55ms |
| 1 | 30 | 10 | 100 | 312.1TPS | 94.7ms |
| 1 | 20 | 10 | 100 | 297TPS | 66.73ms |
| 1 | 10 | 10 | 100 | 302TPS | 32.86ms |
| 1 | 30 | 20 | 10 | 1163.3TPS | 25.5ms |
| 1 | 30 | 20 | 20 | 645.9TPS | 45.79ms |
| 1 | 30 | 40 | 10 | 618.4TPS | 47.96ms |
| 1 | 30 | 60 | 10 | 448.9TPS | 66.16ms |
| 1 | 30 | 80 | 10 | 339.3TPS | 87.82ms |
| 1 | 30 | 100 | 10 | 272.5TPS | 108.65ms |
| 1 | 20 | 100 | 10 | 288.5TPS | 68.34ms |
| 1 | 10 | 100 | 10 | 261.4TPS | 37.49ms |
| 2 | 60 | 100 | 10 | 412.9TPS | 143.83ms |
| 4 | 120 | 100 | 10 | 791.7TPS | 150.06ms |
{
"srcVertices": [
{
"serviceName": "kakaotalk",
"columnName": "talk_user_id",
"id": %s
}
],
"steps": [
[
{
"label": "talk_friend_long_term_agg",
"direction": "out",
"offset": 0,
"limit": %d
}
],
[
{
"label": "talk_friend_long_term_agg",
"direction": "out",
"offset": 0,
"limit": %d
}
],
[
{
"label": "talk_friend_long_term_agg",
"direction": "out",
"offset": 0,
"limit": %d
}
]
]
}
| number of rest server | vuser | first step limit | second step limit | third step limit | tps | latency |
|:------- | --- |:----: | --- | --- | --- | --- |
| 1 | 30 | 10 | 10 | 10 | 250.2TPS | 118.86ms |
| 1 | 30 | 10 | 10 | 20 | 90.4TPS | 329.46ms |
| 1 | 20 | 10 | 10 | 20 | 83.2TPS | 238.42ms |
| 1 | 10 | 10 | 10 | 20 | 82.6TPS | 120.16ms |
- hbaseconf: presentation is not published yet, but you can find our keynote
- mailing list: use the google group or file issues on this repo.
- contact: [email protected]