Added new gensim node embedder and refactored similarity to support different backends (#91)

* Added gensim embedder and new sim backends
* Updated notebooks and docs
* Bugfixes
* Updated readme
BlueBrain · Sep 28, 2021 · commit fece005 · 1 parent bcbf453

Showing 43 changed files with 142,527 additions and 2,758 deletions.
diff --git a/README.rst b/README.rst
@@ -28,6 +28,7 @@ Using the built-in :code:`PGFrame` data structure (currently, `pandas <https://p
 - `graph-tool <https://graph-tool.skewed.de/>`_ (for the analytics API)
 - `Neo4j <https://neo4j.com/>`_ (for the analytics and representation learning API);
 - `StellarGraph <https://stellargraph.readthedocs.io/en/stable/>`_ (for the representation learning API).
+- `gensim <https://radimrehurek.com/gensim/>`_ (for the representation learning API).
 
 This repository originated from the Blue Brain effort on building a COVID-19-related knowledge graph from the `CORD-19 <https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge>`_ dataset and analysing the generated graph to perform literature review of the role of glucose metabolism deregulations in the progression of COVID-19. For more details on how the knowledge graph is built, explored and analysed, see `COVID-19 co-occurrence graph generation and analysis <https://github.com/BlueBrain/BlueGraph/tree/master/cord19kg#readme>`__.
 
@@ -156,7 +157,9 @@ To get familiar with the ideas behind the co-occurrence analysis and the graph a
 - `Literature exploration (PGFrames + in-memory analytics tutorial) <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Literature%20exploration%20(PGFrames%20%2B%20in-memory%20analytics%20tutorial).ipynb>`_  illustrates how to use BlueGraphs's analytics API for in-memory graph backends based on the :code:`NetworkX` and the :code:`graph-tool` libraries.
 - `NASA keywords (PGFrames + Neo4j analytics tutorial) <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/NASA%20keywords%20(PGFrames%20%2B%20Neo4j%20analytics%20tutorial).ipynb>`_ illustrates how to use the Neo4j-based analytics API for persistent property graphs.
 
-`Embedding and downstream tasks tutorial <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Embedding%20and%20downstream%20tasks%20tutorial.ipynb>`_ starts from the co-occurrence graph generation example and guides the user through the graph representation learning and all it's downstream tasks including node similarity queries, node classification, edge prediction and embedding pipeline building.
+`Embedding and downstream tasks tutorial <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Embedding%20and%20downstream%20tasks%20tutorial.ipynb>`_ starts from the co-occurrence graph generation example and guides the user through the graph representation learning and all it's downstream tasks including node similarity queries, node classification and edge prediction.
+
+`Create and run embedding pipelines <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Create%20and%20run%20embedding%20pipelines.ipynb>`_ illustrates how embedding pipelines can be built and executed using BlueGraph.
 
 Finally, `Create and push embedding pipeline into Nexus.ipynb <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Create%20and%20push%20embedding%20pipeline%20into%20Nexus.ipynb>`_ illustrates how embedding pipelines can be created and pushed to `Nexus <https://bluebrainnexus.io/>`_ and
 `Embedding service API <https://github.com/BlueBrain/BlueGraph/blob/master/services/embedder/examples/notebooks/Embedding%20service%20API.ipynb>`_ shows how embedding service that retrieves the embedding pipelines from Nexus can be used.

diff --git a/bluegraph/backends/gensim/__init__.py b/bluegraph/backends/gensim/__init__.py
@@ -0,0 +1,16 @@
+# BlueGraph: unifying Python framework for graph analytics and co-occurrence analysis. 
+
+# Copyright 2020-2021 Blue Brain Project / EPFL
+
+#    Licensed under the Apache License, Version 2.0 (the "License");
+#    you may not use this file except in compliance with the License.
+#    You may obtain a copy of the License at
+
+#        http://www.apache.org/licenses/LICENSE-2.0
+
+#    Unless required by applicable law or agreed to in writing, software
+#    distributed under the License is distributed on an "AS IS" BASIS,
+#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#    See the License for the specific language governing permissions and
+#    limitations under the License.
+from .embed.embedders import GensimNodeEmbedder
diff --git a/bluegraph/backends/gensim/embed/__init__.py b/bluegraph/backends/gensim/embed/__init__.py
@@ -0,0 +1,15 @@
+# BlueGraph: unifying Python framework for graph analytics and co-occurrence analysis. 
+
+# Copyright 2020-2021 Blue Brain Project / EPFL
+
+#    Licensed under the Apache License, Version 2.0 (the "License");
+#    you may not use this file except in compliance with the License.
+#    You may obtain a copy of the License at
+
+#        http://www.apache.org/licenses/LICENSE-2.0
+
+#    Unless required by applicable law or agreed to in writing, software
+#    distributed under the License is distributed on an "AS IS" BASIS,
+#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#    See the License for the specific language governing permissions and
+#    limitations under the License.
diff --git a/bluegraph/backends/gensim/embed/embedders.py b/bluegraph/backends/gensim/embed/embedders.py
@@ -0,0 +1,113 @@
+# BlueGraph: unifying Python framework for graph analytics and co-occurrence analysis. 
+
+# Copyright 2020-2021 Blue Brain Project / EPFL
+
+#    Licensed under the Apache License, Version 2.0 (the "License");
+#    you may not use this file except in compliance with the License.
+#    You may obtain a copy of the License at
+
+#        http://www.apache.org/licenses/LICENSE-2.0
+
+#    Unless required by applicable law or agreed to in writing, software
+#    distributed under the License is distributed on an "AS IS" BASIS,
+#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#    See the License for the specific language governing permissions and
+#    limitations under the License.
+from collections import namedtuple
+import warnings
+import pandas as pd
+
+from gensim.models.poincare import PoincareModel
+
+from bluegraph.core.embed.embedders import GraphElementEmbedder
+from bluegraph.backends.params import (GENSIM_PARAMS,
+                                       DEFAULT_GENSIM_PARAMS)
+
+
+GensimGraph = namedtuple('GensimGraph', 'graph graph_configs')
+
+
+class GensimNodeEmbedder(GraphElementEmbedder):
+
+    _transductive_models = [
+        "poincare",
+        "word2vec"
+    ]
+
+    def __init__(self, model_name, directed=True, include_type=False,
+                 feature_props=None, feature_vector_prop=None,
+                 edge_weight=None, **model_params):
+        if directed is False and model_name == "poincare":
+            raise GraphElementEmbedder.FittingException(
+                "Poincare embedding can be performed only on directed graphs: "
+                "undirected graph was provided")
+        super().__init__(
+            model_name=model_name, directed=directed,
+            include_type=include_type,
+            feature_props=feature_props,
+            feature_vector_prop=feature_vector_prop,
+            edge_weight=edge_weight, **model_params)
+
+    @staticmethod
+    def _generate_graph(pgframe, graph_configs):
+        """Generate backend-specific graph object."""
+        return GensimGraph(pgframe, graph_configs)
+
+    def _dispatch_model_params(self, **kwargs):
+        """Dispatch training parameters."""
+        params = {}
+        for k, v in kwargs.items():
+            if k not in GENSIM_PARAMS[self.model_name]:
+                warnings.warn(
+                    f"GensimNodeEmbedder's model '{self.model_name}' "
+                    f"does not support the training parameter '{k}', "
+                    "the parameter will be ignored",
+                    GraphElementEmbedder.FittingWarning)
+            else:
+                params[k] = v
+
+        for k, v in DEFAULT_GENSIM_PARAMS.items():
+            if k not in params:
+                params[k] = v
+        return params
+
+    def _fit_transductive_embedder(self, train_graph):
+        """Fit transductive embedder (no model, just embeddings)."""
+
+        model_params = {**self.params}
+        del model_params["epochs"]
+
+        if self.model_name == "poincare":
+            model = PoincareModel(
+                train_graph.graph.edges(), **model_params)
+
+        model.train(epochs=self.params["epochs"])
+
+        embedding = pd.DataFrame(
+            [
+                (n, model.kv.get_vector(n))
+                for n in train_graph.graph.nodes()
+            ],
+            columns=["@id", "embedding"]
+        ).set_index("@id")
+        return embedding
+
+    def _fit_inductive_embedder(self, train_graph):
+        """Fit inductive embedder (predictive model and embeddings)."""
+        raise NotImplementedError(
+            "Inductive models are not implemented for gensim-based "
+            "node embedders")
+
+    def _predict_embeddings(self, graph, nodes=None):
+        """Predict node embeddings using a fitted inductive model."""
+        raise NotImplementedError(
+            "Inductive models are not implemented for gensim-based "
+            "node embedders")
+
+    @staticmethod
+    def _save_predictive_model(model, path):
+        pass
+
+    @staticmethod
+    def _load_predictive_model(path):
+        pass
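
Note on `_fit_transductive_embedder` above: `_transductive_models` lists both "poincare" and "word2vec", but only the "poincare" branch assigns `model`, and `GENSIM_PARAMS` has no "word2vec" entry. One possible way to make unsupported names fail early is a small builder registry. The sketch below is only an illustration: `MODEL_BUILDERS`, `build_model` and `FakePoincare` are hypothetical names and are not part of BlueGraph or gensim.

```python
class FakePoincare:
    """Hypothetical stand-in for a trainable model (not gensim's PoincareModel)."""

    def __init__(self, edges, **params):
        self.edges = list(edges)
        self.params = params


# Hypothetical registry: model name -> callable that builds the model
MODEL_BUILDERS = {
    "poincare": FakePoincare,
}


def build_model(model_name, edges, **params):
    """Build a model by name, raising a clear error for unknown names."""
    try:
        builder = MODEL_BUILDERS[model_name]
    except KeyError:
        raise ValueError(
            f"Unsupported model '{model_name}', "
            f"expected one of {sorted(MODEL_BUILDERS)}") from None
    return builder(edges, **params)


model = build_model("poincare", [("a", "b"), ("b", "c")], size=64)
```

With a lookup like this, a name such as "word2vec" raises a descriptive `ValueError` instead of reaching `model.train(...)` with `model` unassigned.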
diff --git a/bluegraph/backends/neo4j/analyse/paths.py b/bluegraph/backends/neo4j/analyse/paths.py
@@ -104,7 +104,7 @@ def _compute_yen_shortest_paths(graph, source, target, n,
             graph._generate_st_match_query(source, target) +
             Neo4jPathFinder._generate_path_search_call(
                 graph, source, target,
-                "gds.beta.shortestPath.yens.stream",
+                "gds.shortestPath.yens.stream",
                 distance, exclude_edge,
                 extra_params={"k": n}) +
             "YIELD nodeIds\n"

diff --git a/bluegraph/backends/neo4j/embed/embedders.py b/bluegraph/backends/neo4j/embed/embedders.py
@@ -48,11 +48,16 @@ class Neo4jNodeEmbedder(GraphElementEmbedder):
     @staticmethod
     def _generate_graph(pgframe=None, uri=None, username=None,
                         password=None, driver=None,
-                        node_label=None, edge_label=None):
+                        node_label=None, edge_label=None,
+                        graph_configs=None):
         """Generate backend-specific graph object."""
+        if graph_configs is None:
+            graph_configs = {"directed": True}
+
         return pgframe_to_neo4j(
             pgframe=pgframe, uri=uri, username=username, password=password,
-            driver=driver, node_label=node_label, edge_label=edge_label)
+            driver=driver, node_label=node_label, edge_label=edge_label,
+            directed=graph_configs["directed"])
 
     def _dispatch_model_params(self, **kwargs):
         """Dispatch training parameters."""
@@ -223,7 +228,9 @@ def fit_model(self, pgframe=None, uri=None, username=None, password=None,
             train_graph = self._generate_graph(
                 pgframe=pgframe, uri=uri, username=username,
                 password=password, driver=driver,
-                node_label=node_label, edge_label=edge_label)
+                node_label=node_label, edge_label=edge_label,
+                graph_configs=self.graph_configs)
+            # self.graph_configs
         else:
             train_graph = graph_view
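
Note on the `graph_configs=None` default in `_generate_graph` above: the `None` sentinel followed by an in-body default is the usual Python idiom for optional dict arguments. A minimal standalone sketch, where `generate_graph` and its return value are hypothetical simplifications rather than the Neo4j code path:

```python
def generate_graph(pgframe, graph_configs=None):
    """Return a toy graph description; directed unless configured otherwise."""
    # A None sentinel avoids sharing one mutable dict across calls
    if graph_configs is None:
        graph_configs = {"directed": True}
    return {"frame": pgframe, "directed": graph_configs["directed"]}


default_graph = generate_graph("frame")
undirected_graph = generate_graph("frame", {"directed": False})
```

As in the diff, a caller that passes a dict without a "directed" key would get a `KeyError`; `graph_configs.get("directed", True)` would be one way to tolerate that, if it is desired.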
 

diff --git a/bluegraph/backends/neo4j/io.py b/bluegraph/backends/neo4j/io.py
@@ -162,12 +162,12 @@ def pgframe_to_neo4j(pgframe=None, uri=None, username=None, password=None,
         node_label_repr = f":{node_label}" if node_label else ""
 
         query = (
-        f"""
-        WITH [{", ".join(node_repr)}] AS batch
-        UNWIND batch as individual
-        CREATE (n{node_label_repr})
-        SET n += individual
-        """)
+            f"""
+            WITH [{", ".join(node_repr)}] AS batch
+            UNWIND batch as individual
+            CREATE (n{node_label_repr})
+            SET n += individual
+            """)
         execute(driver, query)
 
     # Add node types to the Neo4j node labels
@@ -189,6 +189,7 @@ def pgframe_to_neo4j(pgframe=None, uri=None, username=None, password=None,
         edge_labels = [edge_label]
 
     for edge_label in edge_labels:
+
         # Select edges of a given type, if applicable
         edges = pgframe.edges(
             raw_frame=True,

diff --git a/bluegraph/backends/params.py b/bluegraph/backends/params.py
@@ -84,3 +84,27 @@
     "clusters_q": 1,
     "num_powers": 10
 }
+
+
+GENSIM_PARAMS = {
+    "poincare": [
+        "epochs",
+        "size",
+        "alpha",
+        "negative",
+        "workers",
+        "epsilon",
+        "regularization_coeff",
+        "burn_in",
+        "burn_in_alpha",
+        "init_range",
+        "dtype",
+        "seed"
+    ]
+}
+
+
+DEFAULT_GENSIM_PARAMS = {
+    "size": 64,
+    "epochs": 50
+}
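
Note on how these tables are consumed: `GensimNodeEmbedder._dispatch_model_params` keeps only the keyword arguments listed for the chosen model, warns about the rest, and then fills in the defaults. A minimal standalone sketch of that logic follows. The parameter list is abridged, `dispatch_model_params` is a hypothetical free function, and `UserWarning` stands in for `GraphElementEmbedder.FittingWarning`.

```python
import warnings

# Values copied from the diff above (parameter list abridged)
GENSIM_PARAMS = {"poincare": ["epochs", "size", "alpha", "negative", "seed"]}
DEFAULT_GENSIM_PARAMS = {"size": 64, "epochs": 50}


def dispatch_model_params(model_name, **kwargs):
    """Keep supported parameters, warn about the others, then add defaults."""
    supported = GENSIM_PARAMS[model_name]
    params = {}
    for key, value in kwargs.items():
        if key in supported:
            params[key] = value
        else:
            warnings.warn(
                f"Model '{model_name}' does not support the training "
                f"parameter '{key}', the parameter will be ignored",
                UserWarning)
    # Fill in defaults without overriding values given by the caller
    for key, value in DEFAULT_GENSIM_PARAMS.items():
        params.setdefault(key, value)
    return params


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    params = dispatch_model_params("poincare", size=32, walk_length=10)
```

Here `size=32` overrides the default, `epochs` falls back to 50, and the unsupported `walk_length` is dropped with a single warning.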
diff --git a/bluegraph/core/embed/embedders.py b/bluegraph/core/embed/embedders.py
@@ -59,7 +59,7 @@ def _inductive_models(self):
 
     @staticmethod
     @abstractmethod
-    def _generate_graph(self, pgframe):
+    def _generate_graph(pgframe, graph_configs):
         """Generate backend-specific graph object."""
         pass
 
@@ -167,7 +167,7 @@ def fit_model(self, pgframe):
             if not isinstance(embeddings, pd.DataFrame):
                 embeddings = pd.DataFrame(
                     {"embedding": embeddings.tolist()},
-                    index=train_graph.nodes())
+                    index=pgframe.nodes())
         elif self.model_name in self._inductive_models:
             self._embedding_model = self._fit_inductive_embedder(train_graph)
             embeddings = self._predict_embeddings(train_graph)
@@ -234,8 +234,12 @@ def load(path):
 
         with open(os.path.join(path, "emb.pkl"), "rb") as f:
             embedder = pickle.load(f)
-        embedder._embedding_model = embedder._load_predictive_model(
-            os.path.join(path, "model"))
+
+        embedder._embedding_model = None
+        if os.path.isfile(os.path.join(path, "model")):
+            embedder._embedding_model = embedder._load_predictive_model(
+                os.path.join(path, "model"))
+
         if decompressed:
             shutil.rmtree(path)
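
Note on the `load` change above: transductive embedders such as the gensim one have no predictive model to save, so the loader now treats the "model" file as optional. A minimal standalone sketch of the same pattern, assuming a pickled state dict; `load_embedder` and its `load_model` argument are hypothetical and not the BlueGraph API:

```python
import os
import pickle
import tempfile


def load_embedder(path, load_model=pickle.load):
    """Load a pickled state and, if present, an optional model file."""
    with open(os.path.join(path, "emb.pkl"), "rb") as f:
        state = pickle.load(f)

    # The model is optional: leave it as None when no file was saved
    model = None
    model_path = os.path.join(path, "model")
    if os.path.isfile(model_path):
        with open(model_path, "rb") as f:
            model = load_model(f)
    return state, model


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "emb.pkl"), "wb") as f:
        pickle.dump({"model_name": "poincare"}, f)
    state, model = load_embedder(tmp)
```

Note that `os.path.isfile` returns False for directories, so a backend that stores its model as a directory would not be picked up by this check. Whether that applies to any BlueGraph backend is not shown in this diff.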
 

diff --git a/bluegraph/core/io.py b/bluegraph/core/io.py
@@ -954,6 +954,8 @@ def edge_types(self, flatten=False):
         """Return a list of edges types."""
         if flatten:
             types = _aggregate_values(self._edges["@type"])
+            if isinstance(types, str):
+                types = [types]
         else:
             types = []
             for el in self._edges["@type"]:
@@ -1112,9 +1114,10 @@ def get_edge_typing(self):
     def aggregate_properties(frame, func, into="aggregation_result"):
         if "@type" in frame.columns:
             df = frame.drop("@type", axis=1)
+            aggregated = df.aggregate(func, axis=1).values.tolist()
             frame = pd.DataFrame(
                 {
-                    into: df.aggregate(func, axis=1),
+                    into: aggregated,
                     "@type": frame["@type"]
                 },
                 index=frame.index)

diff --git a/bluegraph/downstream/__init__.py b/bluegraph/downstream/__init__.py
@@ -13,6 +13,7 @@
 #    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 #    See the License for the specific language governing permissions and
 #    limitations under the License.
-from .data_structures import (ElementClassifier,
-                              EmbeddingPipeline)
+from .data_structures import ElementClassifier
+from .pipelines import EmbeddingPipeline
+
 from .utils import *