Commit

Added new gensim node embedder and refactored similarity to support different backends (#91)

* Added gensim embedder and new sim backends
* Updated notebooks and docs
* Bugfixes
* Updated readme
eugeniashurko authored Sep 28, 2021
1 parent bcbf453 commit fece005
Showing 43 changed files with 142,527 additions and 2,758 deletions.
5 changes: 4 additions & 1 deletion README.rst
@@ -28,6 +28,7 @@ Using the built-in :code:`PGFrame` data structure (currently, `pandas <https://p
 - `graph-tool <https://graph-tool.skewed.de/>`_ (for the analytics API)
 - `Neo4j <https://neo4j.com/>`_ (for the analytics and representation learning API);
 - `StellarGraph <https://stellargraph.readthedocs.io/en/stable/>`_ (for the representation learning API).
+- `gensim <https://radimrehurek.com/gensim/>`_ (for the representation learning API).

This repository originated from the Blue Brain effort on building a COVID-19-related knowledge graph from the `CORD-19 <https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge>`_ dataset and analysing the generated graph to perform literature review of the role of glucose metabolism deregulations in the progression of COVID-19. For more details on how the knowledge graph is built, explored and analysed, see `COVID-19 co-occurrence graph generation and analysis <https://github.com/BlueBrain/BlueGraph/tree/master/cord19kg#readme>`__.

@@ -156,7 +157,9 @@ To get familiar with the ideas behind the co-occurrence analysis and the graph a
 - `Literature exploration (PGFrames + in-memory analytics tutorial) <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Literature%20exploration%20(PGFrames%20%2B%20in-memory%20analytics%20tutorial).ipynb>`_ illustrates how to use BlueGraph's analytics API for in-memory graph backends based on the :code:`NetworkX` and the :code:`graph-tool` libraries.
 - `NASA keywords (PGFrames + Neo4j analytics tutorial) <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/NASA%20keywords%20(PGFrames%20%2B%20Neo4j%20analytics%20tutorial).ipynb>`_ illustrates how to use the Neo4j-based analytics API for persistent property graphs.

-`Embedding and downstream tasks tutorial <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Embedding%20and%20downstream%20tasks%20tutorial.ipynb>`_ starts from the co-occurrence graph generation example and guides the user through the graph representation learning and all its downstream tasks including node similarity queries, node classification, edge prediction and embedding pipeline building.
+`Embedding and downstream tasks tutorial <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Embedding%20and%20downstream%20tasks%20tutorial.ipynb>`_ starts from the co-occurrence graph generation example and guides the user through the graph representation learning and all its downstream tasks including node similarity queries, node classification and edge prediction.
+
+`Create and run embedding pipelines <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Create%20and%20run%20embedding%20pipelines.ipynb>`_ illustrates how embedding pipelines can be built and executed using BlueGraph.

 Finally, `Create and push embedding pipeline into Nexus.ipynb <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Create%20and%20push%20embedding%20pipeline%20into%20Nexus.ipynb>`_ illustrates how embedding pipelines can be created and pushed to `Nexus <https://bluebrainnexus.io/>`_ and
 `Embedding service API <https://github.com/BlueBrain/BlueGraph/blob/master/services/embedder/examples/notebooks/Embedding%20service%20API.ipynb>`_ shows how the embedding service that retrieves embedding pipelines from Nexus can be used.
16 changes: 16 additions & 0 deletions bluegraph/backends/gensim/__init__.py
@@ -0,0 +1,16 @@
# BlueGraph: unifying Python framework for graph analytics and co-occurrence analysis.

# Copyright 2020-2021 Blue Brain Project / EPFL

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .embed.embedders import GensimNodeEmbedder
15 changes: 15 additions & 0 deletions bluegraph/backends/gensim/embed/__init__.py
@@ -0,0 +1,15 @@
# BlueGraph: unifying Python framework for graph analytics and co-occurrence analysis.

# Copyright 2020-2021 Blue Brain Project / EPFL

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
113 changes: 113 additions & 0 deletions bluegraph/backends/gensim/embed/embedders.py
@@ -0,0 +1,113 @@
# BlueGraph: unifying Python framework for graph analytics and co-occurrence analysis.

# Copyright 2020-2021 Blue Brain Project / EPFL

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from collections import namedtuple
import warnings
import pandas as pd

from gensim.models.poincare import PoincareModel

from bluegraph.core.embed.embedders import GraphElementEmbedder
from bluegraph.backends.params import (GENSIM_PARAMS,
                                       DEFAULT_GENSIM_PARAMS)


GensimGraph = namedtuple('GensimGraph', 'graph graph_configs')


class GensimNodeEmbedder(GraphElementEmbedder):

    _transductive_models = [
        "poincare",
        "word2vec"
    ]

    def __init__(self, model_name, directed=True, include_type=False,
                 feature_props=None, feature_vector_prop=None,
                 edge_weight=None, **model_params):
        if directed is False and model_name == "poincare":
            raise GraphElementEmbedder.FittingException(
                "Poincare embedding can be performed only on directed graphs: "
                "undirected graph was provided")
        super().__init__(
            model_name=model_name, directed=directed,
            include_type=include_type,
            feature_props=feature_props,
            feature_vector_prop=feature_vector_prop,
            edge_weight=edge_weight, **model_params)

    @staticmethod
    def _generate_graph(pgframe, graph_configs):
        """Generate backend-specific graph object."""
        return GensimGraph(pgframe, graph_configs)

    def _dispatch_model_params(self, **kwargs):
        """Dispatch training parameters."""
        params = {}
        for k, v in kwargs.items():
            if k not in GENSIM_PARAMS[self.model_name]:
                warnings.warn(
                    f"GensimNodeEmbedder's model '{self.model_name}' "
                    f"does not support the training parameter '{k}', "
                    "the parameter will be ignored",
                    GraphElementEmbedder.FittingWarning)
            else:
                params[k] = v

        for k, v in DEFAULT_GENSIM_PARAMS.items():
            if k not in params:
                params[k] = v
        return params

    def _fit_transductive_embedder(self, train_graph):
        """Fit transductive embedder (no model, just embeddings)."""

        model_params = {**self.params}
        del model_params["epochs"]

        if self.model_name == "poincare":
            model = PoincareModel(
                train_graph.graph.edges(), **model_params)

        model.train(epochs=self.params["epochs"])

        embedding = pd.DataFrame(
            [
                (n, model.kv.get_vector(n))
                for n in train_graph.graph.nodes()
            ],
            columns=["@id", "embedding"]
        ).set_index("@id")
        return embedding

    def _fit_inductive_embedder(self, train_graph):
        """Fit inductive embedder (predictive model and embeddings)."""
        raise NotImplementedError(
            "Inductive models are not implemented for gensim-based "
            "node embedders")

    def _predict_embeddings(self, graph, nodes=None):
        """Predict embeddings (not supported by gensim-based embedders)."""
        raise NotImplementedError(
            "Inductive models are not implemented for gensim-based "
            "node embedders")

    @staticmethod
    def _save_predictive_model(model, path):
        pass

    @staticmethod
    def _load_predictive_model(path):
        pass
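
For intuition about what the "poincare" model learns: Poincaré embeddings place nodes inside the open unit ball and compare them with the hyperbolic distance d(u, v) = arcosh(1 + 2·|u − v|² / ((1 − |u|²)(1 − |v|²))). The following is a minimal sketch of that distance in plain Python; the function name `poincare_distance` is illustrative and is not part of BlueGraph or gensim.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points of the open unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


origin = [0.0, 0.0]
near_center = [0.1, 0.0]
near_boundary = [0.9, 0.0]

# Distances grow quickly as a point approaches the boundary of the ball.
d_center = poincare_distance(origin, near_center)
d_boundary = poincare_distance(origin, near_boundary)
```

Because distances blow up near the boundary, hierarchical structures can be laid out with general nodes near the origin and specific ones near the edge, which is consistent with the embedder above requiring a directed graph for this model.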
2 changes: 1 addition & 1 deletion bluegraph/backends/neo4j/analyse/paths.py
@@ -104,7 +104,7 @@ def _compute_yen_shortest_paths(graph, source, target, n,
             graph._generate_st_match_query(source, target) +
             Neo4jPathFinder._generate_path_search_call(
                 graph, source, target,
-                "gds.beta.shortestPath.yens.stream",
+                "gds.shortestPath.yens.stream",
                 distance, exclude_edge,
                 extra_params={"k": n}) +
             "YIELD nodeIds\n"
13 changes: 10 additions & 3 deletions bluegraph/backends/neo4j/embed/embedders.py
@@ -48,11 +48,16 @@ class Neo4jNodeEmbedder(GraphElementEmbedder):
     @staticmethod
     def _generate_graph(pgframe=None, uri=None, username=None,
                         password=None, driver=None,
-                        node_label=None, edge_label=None):
+                        node_label=None, edge_label=None,
+                        graph_configs=None):
         """Generate backend-specific graph object."""
+        if graph_configs is None:
+            graph_configs = {"directed": True}
+
         return pgframe_to_neo4j(
             pgframe=pgframe, uri=uri, username=username, password=password,
-            driver=driver, node_label=node_label, edge_label=edge_label)
+            driver=driver, node_label=node_label, edge_label=edge_label,
+            directed=graph_configs["directed"])

     def _dispatch_model_params(self, **kwargs):
         """Dispatch training parameters."""
@@ -223,7 +228,9 @@ def fit_model(self, pgframe=None, uri=None, username=None, password=None,
             train_graph = self._generate_graph(
                 pgframe=pgframe, uri=uri, username=username,
                 password=password, driver=driver,
-                node_label=node_label, edge_label=edge_label)
+                node_label=node_label, edge_label=edge_label,
+                graph_configs=self.graph_configs)
+            # self.graph_configs
         else:
             train_graph = graph_view

13 changes: 7 additions & 6 deletions bluegraph/backends/neo4j/io.py
@@ -162,12 +162,12 @@ def pgframe_to_neo4j(pgframe=None, uri=None, username=None, password=None,
         node_label_repr = f":{node_label}" if node_label else ""

         query = (
-        f"""
-        WITH [{", ".join(node_repr)}] AS batch
-        UNWIND batch as individual
-        CREATE (n{node_label_repr})
-        SET n += individual
-        """)
+        f"""
+        WITH [{", ".join(node_repr)}] AS batch
+        UNWIND batch as individual
+        CREATE (n{node_label_repr})
+        SET n += individual
+        """)
         execute(driver, query)

         # Add node types to the Neo4j node labels
@@ -189,6 +189,7 @@ def pgframe_to_neo4j(pgframe=None, uri=None, username=None, password=None,
         edge_labels = [edge_label]

     for edge_label in edge_labels:
+
         # Select edges of a given type, if applicable
         edges = pgframe.edges(
             raw_frame=True,
24 changes: 24 additions & 0 deletions bluegraph/backends/params.py
@@ -84,3 +84,27 @@
     "clusters_q": 1,
     "num_powers": 10
 }
+
+
+GENSIM_PARAMS = {
+    "poincare": [
+        "epochs",
+        "size",
+        "alpha",
+        "negative",
+        "workers",
+        "epsilon",
+        "regularization_coeff",
+        "burn_in",
+        "burn_in_alpha",
+        "init_range",
+        "dtype",
+        "seed"
+    ]
+}
+
+
+DEFAULT_GENSIM_PARAMS = {
+    "size": 64,
+    "epochs": 50
+}
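
The two tables above drive `_dispatch_model_params` in the gensim embedder: unsupported keyword arguments are dropped with a warning, and missing ones fall back to the defaults. A minimal standalone sketch of that filtering pattern follows; `ALLOWED_PARAMS`, `DEFAULT_PARAMS` and `dispatch_model_params` are shortened stand-ins written for illustration, and `UserWarning` replaces BlueGraph's own `FittingWarning`.

```python
import warnings

# Assumed, shortened stand-ins for GENSIM_PARAMS / DEFAULT_GENSIM_PARAMS.
ALLOWED_PARAMS = {"poincare": ["epochs", "size", "negative", "seed"]}
DEFAULT_PARAMS = {"size": 64, "epochs": 50}


def dispatch_model_params(model_name, **kwargs):
    """Keep supported parameters, warn about the rest, then apply defaults."""
    params = {}
    for key, value in kwargs.items():
        if key in ALLOWED_PARAMS[model_name]:
            params[key] = value
        else:
            warnings.warn(
                f"model '{model_name}' does not support the training "
                f"parameter '{key}', the parameter will be ignored",
                UserWarning)
    # Fill in defaults only for parameters the caller did not set.
    for key, value in DEFAULT_PARAMS.items():
        params.setdefault(key, value)
    return params


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # "window" is not a supported parameter here, so it is dropped.
    result = dispatch_model_params("poincare", size=32, window=5)
```

Note that the lookup `ALLOWED_PARAMS[model_name]` raises `KeyError` for a model name without an entry; the same would hold in the commit for any model name other than "poincare", since `GENSIM_PARAMS` defines only that key.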
12 changes: 8 additions & 4 deletions bluegraph/core/embed/embedders.py
@@ -59,7 +59,7 @@ def _inductive_models(self):

     @staticmethod
     @abstractmethod
-    def _generate_graph(self, pgframe):
+    def _generate_graph(pgframe, graph_configs):
         """Generate backend-specific graph object."""
         pass

@@ -167,7 +167,7 @@ def fit_model(self, pgframe):
             if not isinstance(embeddings, pd.DataFrame):
                 embeddings = pd.DataFrame(
                     {"embedding": embeddings.tolist()},
-                    index=train_graph.nodes())
+                    index=pgframe.nodes())
         elif self.model_name in self._inductive_models:
             self._embedding_model = self._fit_inductive_embedder(train_graph)
             embeddings = self._predict_embeddings(train_graph)
@@ -234,8 +234,12 @@ def load(path):

         with open(os.path.join(path, "emb.pkl"), "rb") as f:
             embedder = pickle.load(f)
-        embedder._embedding_model = embedder._load_predictive_model(
-            os.path.join(path, "model"))
+
+        embedder._embedding_model = None
+        if os.path.isfile(os.path.join(path, "model")):
+            embedder._embedding_model = embedder._load_predictive_model(
+                os.path.join(path, "model"))
+
         if decompressed:
             shutil.rmtree(path)

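
The change to `load` makes the predictive model optional, which matters for transductive embedders that store embeddings only. The sketch below reproduces the pattern with the standard library; `load_with_optional_model` and `read_bytes` are hypothetical helpers, and a plain dictionary stands in for the pickled embedder.

```python
import os
import pickle
import tempfile


def load_with_optional_model(path, load_model):
    """Load pickled state and, only if present, the predictive model."""
    with open(os.path.join(path, "emb.pkl"), "rb") as f:
        state = pickle.load(f)
    model = None
    model_path = os.path.join(path, "model")
    if os.path.isfile(model_path):
        model = load_model(model_path)
    return state, model


def read_bytes(path):
    """Stand-in for a backend-specific model loader."""
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "emb.pkl"), "wb") as f:
        pickle.dump({"model_name": "poincare"}, f)
    # No "model" file yet, as for an embedder without a predictive model.
    state, model = load_with_optional_model(tmp, read_bytes)

    with open(os.path.join(tmp, "model"), "wb") as f:
        f.write(b"weights")
    state_2, model_2 = load_with_optional_model(tmp, read_bytes)
```

One design point to keep in mind: `os.path.isfile` returns `False` for directories, so a backend that saves its model as a directory rather than a single file would be skipped by this check and would need `os.path.exists` instead.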
5 changes: 4 additions & 1 deletion bluegraph/core/io.py
@@ -954,6 +954,8 @@ def edge_types(self, flatten=False):
         """Return a list of edge types."""
         if flatten:
             types = _aggregate_values(self._edges["@type"])
+            if isinstance(types, str):
+                types = [types]
         else:
             types = []
             for el in self._edges["@type"]:
@@ -1112,9 +1114,10 @@ def get_edge_typing(self):
     def aggregate_properties(frame, func, into="aggregation_result"):
         if "@type" in frame.columns:
             df = frame.drop("@type", axis=1)
+            aggregated = df.aggregate(func, axis=1).values.tolist()
             frame = pd.DataFrame(
                 {
-                    into: df.aggregate(func, axis=1),
+                    into: aggregated,
                     "@type": frame["@type"]
                 },
                 index=frame.index)
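
The `isinstance(types, str)` guard added to `edge_types` addresses a common Python pitfall: a lone string is itself iterable, so code expecting a list of labels would otherwise walk over its characters. A minimal sketch of the same normalisation, assuming the aggregated value is either a single string or a collection of strings (the helper name `as_type_list` is hypothetical):

```python
def as_type_list(types):
    """Return type labels as a list, wrapping a lone string."""
    if isinstance(types, str):
        return [types]
    return list(types)


single = as_type_list("CoOccurrence")
several = as_type_list(("CoOccurrence", "Similarity"))

# Without the guard, a string would be split into characters.
naive = list("CoOccurrence")
```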
5 changes: 3 additions & 2 deletions bluegraph/downstream/__init__.py
@@ -13,6 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from .data_structures import (ElementClassifier,
-                              EmbeddingPipeline)
+from .data_structures import ElementClassifier
+from .pipelines import EmbeddingPipeline
+
 from .utils import *