Integration with scipy and scikit-learn #261

Open
Mec-iS opened this issue Aug 30, 2022 · 4 comments
@Mec-iS
Contributor

Mec-iS commented Aug 30, 2022

One of the integrations we are going to work on is with scikit-learn.

This conversation is for collecting requirements and features for calling scikit-learn through kglab's abstraction layer.

After taking a look at the APIs provided by popular data science libraries, I think these are the interesting scikit-learn and scipy functionalities that we could start with:

  1. Allow converting kglab's KnowledgeGraph data structures to an observation matrix (to be defined), an adjacency matrix, and a condensed distance matrix as defined by scipy. This will allow building further flows (or "pipelines", chains of function calls) that users can assemble to go from a KnowledgeGraph representation to a graph algebra representation. This is critical because we need to either pick first principles or provide different alternatives according to the type of graph and the tasks users want to accomplish.
  2. After 1, let's start with an example flow in kglab for SciPy's hierarchical clustering. It would be nice to have a flow that allows simple clustering. This implies providing switches for:
    1. Linkage procedures
    2. Tree building like sklearn.cluster.ward_tree
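
To make the two points above more concrete, here is a minimal sketch of the scipy side only. It assumes an observation matrix has already been extracted from the graph; the `observations` values and the `cluster_nodes` helper are hypothetical and not part of kglab. Only `pdist`, `squareform`, `linkage`, and `fcluster` are actual scipy functions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical observation matrix: one row per graph node, one column per feature.
observations = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.1],
    [5.0, 5.0],
    [5.1, 5.0],
])

# Condensed distance matrix as defined by scipy: the upper triangle of the
# pairwise distances, flattened to length n * (n - 1) / 2.
condensed = pdist(observations, metric="euclidean")

# squareform converts between the condensed form and the square (n x n) form.
square = squareform(condensed)


def cluster_nodes(condensed_distances, method="average", n_clusters=2):
    """Hypothetical helper: `method` acts as the linkage-procedure switch."""
    tree = linkage(condensed_distances, method=method)
    # Cut the tree so that at most `n_clusters` flat clusters remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


labels = cluster_nodes(condensed, method="average", n_clusters=2)
```

Passing `method="single"`, `"complete"`, or `"ward"` to the same helper would switch the linkage procedure without changing the rest of the flow.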

Other possible examples:

These are currently in no particular order. It will take some time to figure out which principles to import from scikit-learn and scipy in order to build proper user flows from knowledge graphs as represented in RDF/kglab to graph algebra representations.

Please provide feedback and suggestions. I will create a GitHub project around this effort.

cc: @tomaarsen @SultanOrazbayev

@Mec-iS Mec-iS self-assigned this Aug 30, 2022
@Mec-iS Mec-iS added the enhancement New feature or request label Aug 30, 2022
@ceteri ceteri added this to the Machine Learning integration milestone Aug 31, 2022
@ceteri
Collaborator

ceteri commented Sep 1, 2022

Wonderful! This is super helpful.
The nearest neighbor parts would have some immediate use cases.

BTW, there's already the SubgraphMatrix class in subg.py which handles the transform/inverse_transform from an RDF graph to:

  • pandas.DataFrame
  • iGraph (adjacency matrix, slightly odd/tangled format)
  • NetworkX (adjacency matrix, as an edge list)
  • cuGraph (adjacency matrix, as an edge list for cuDF)

@Mec-iS
Contributor Author

Mec-iS commented Sep 1, 2022

We probably want some methods that return a numpy.array. I will certainly reuse what is already there.
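
As a rough sketch of what a numpy-returning method could look like, the following builds a dense adjacency matrix from an edge list. The `adjacency_matrix` function and the sample `edges` are hypothetical; this is not the existing SubgraphMatrix API, only an illustration of the target output shape.

```python
import numpy as np


def adjacency_matrix(edges, directed=True):
    """Hypothetical helper, not part of kglab: edge list -> (nodes, matrix)."""
    # Fix a stable node ordering so that row/column indices are reproducible.
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}

    matrix = np.zeros((len(nodes), len(nodes)), dtype=np.int64)
    for source, target in edges:
        matrix[index[source], index[target]] = 1
        if not directed:
            # Mirror the entry for undirected graphs.
            matrix[index[target], index[source]] = 1
    return nodes, matrix


# Assumed input: (subject, object) pairs with the predicate already dropped.
edges = [("a", "b"), ("b", "c"), ("a", "c")]
nodes, matrix = adjacency_matrix(edges)
```

Returning the node ordering alongside the matrix keeps the mapping back to graph nodes available, which an inverse transform would need.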

@Mec-iS
Contributor Author

Mec-iS commented Sep 6, 2022

@SultanOrazbayev mentioned the importance of having a descriptive summary of general metrics about a graph, something like pandas.DataFrame.describe(). These are the metrics that could be useful in a hypothetical SubgraphMatrix.describe():
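
For illustration only, a describe-style summary might be sketched as below. The metrics chosen here (node count, edge count, density, degree statistics) are assumptions made for the example, not the list proposed in this thread, and `describe_graph` is a hypothetical function rather than an existing kglab method.

```python
def describe_graph(edges):
    """Hypothetical summary of a directed graph given as (source, target) pairs.

    Assumes the edge list contains no duplicate edges.
    """
    nodes = {node for edge in edges for node in edge}

    # Total degree (in + out) per node.
    degree = {node: 0 for node in nodes}
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1

    n, m = len(nodes), len(edges)
    return {
        "nodes": n,
        "edges": m,
        # Directed density: edges present over ordered node pairs possible.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        "min_degree": min(degree.values(), default=0),
        "max_degree": max(degree.values(), default=0),
        "mean_degree": 2 * m / n if n else 0.0,
    }


summary = describe_graph([("a", "b"), ("b", "c"), ("a", "c")])
```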

@tomaarsen
Collaborator

tomaarsen commented Sep 8, 2022

Agreed, sometimes it's hard to actually understand what kind of graph you're using.


3 participants