Draft how-to for getting nodes and edges tables from network #23

Open
wants to merge 1 commit into
base: main

caro401
Collaborator

@caro401 caro401 commented Dec 11, 2023

This is aimed at technical users (Jupyter users of the Python API and/or me developing UI code).

Looking for review from @makkus that the content is technically correct, and from an end user, @CBurge95, that it's clear enough (have I provided enough context, for example?) and that it addresses her original questions from #20.

@caro401 caro401 requested review from makkus and CBurge95 December 11, 2023 16:13
@caro401 caro401 linked an issue Dec 11, 2023 that may be closed by this pull request
Collaborator

@makkus makkus left a comment

Added my comments; I will update here once I have decided on the interface and added the `network_graph.pick.table` module.

@@ -0,0 +1,66 @@
# How to view the data in a NetworkData type
Collaborator

One comment in advance: personally, I'd probably have a section in the docs that deals with tabular data, and only explain here how to get to the tables, and then link to the more generic documentation on querying and other things to do with it.


Quite often, you'll want to inspect the raw contents of the nodes and/or edges tables which contain the data behind a `NetworkData` value. This might be to get an overview of what's in your network, or to look at the values of centrality measures you've just calculated and applied to the network.

The nodes and edges tables can be accessed from a `NetworkData` value by calling its `get_table` method, passing the appropriate table name, `"nodes"` or `"edges"`, as the argument. The resulting value is a `KiaraTable`, which in turn is backed by a `pyarrow.Table` from [Apache Arrow](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html). The Arrow table contains the raw data, and can be accessed via the `arrow_table` property on a `KiaraTable`.
Collaborator

After the refactoring we talked about, the data type is now called `NetworkGraph`, but I tried to keep the interface the same as much as possible. `get_table` would still work, but a user could also just access the `edges` and `nodes` attributes and get the same `KiaraTable` instance they would get with `get_table`.


In order to view the data contained in the Arrow table, you'll need to convert it into a different data format. The `pyarrow.Table` data type provides a few options for this, for example `to_pandas()` to get a pandas DataFrame, and `to_pydict()` and `to_pylist()` to get plain Python data types, which you can then manipulate as you choose.

Be aware that doing any of these data transformations means your whole nodes or edges table will be loaded into memory on your computer. If your tables are really big, this could cause your code to run slowly and use a lot of memory (RAM).
Collaborator

I guess here it would make sense to point out that using the arrow data directly is considered best practice overall, unless you are writing custom code that is not going to get re-used, or where you know for sure you won't have to deal with unexpectedly large amounts of data.

For frontend developers that would mean using the Arrow JS library, and ideally either sending/receiving 'unserialized' Arrow format or, even better, trying to get a pointer to the data in memory for zero-copy style access (not always possible). For Jupyter users it would mean using polars or duckdb (or any of the modules that use it internally, like the `query.table` one you point out below).

# if you're in a jupyter context, printing edges_kiara_table will give you a preview of the data

# get all the data via the underlying Arrow table
edges_data = edges_kiara_table.arrow_table.to_pylist()
Collaborator

Personally, I have never needed to use the `to_pylist` or `to_pydict` methods. I think a much more common use case (at least for Jupyter users) would be the pandas export, since there is a high likelihood they are using pandas anyway. If there is indeed a valid use case for frontend devs to use this over 'pure' Arrow access, I'd say we can probably assume frontend devs have more programming background, and can figure things out themselves with a few links we could provide. Long story short, I would tend to document the pandas code, and not `to_pylist`.

# let's call it my_network_data

# get the nodes table for the network, as a `KiaraTable`
nodes_kiara_table = my_network_data.get_table("nodes")
Collaborator

This is probably not a good idea, because if you do it like this you break the lineage of the result value. Whether that matters depends on your particular circumstances, of course, but I guess it's better not to confuse people by documenting a practice that would only make sense for some sort of frontend-preview scenario, but would be ill-advised within a Jupyter/Python research workflow.

Up until now, in all the network analysis examples where there was a use case like this, the querying always happened on the source tables (before they became network_data/network_graph). We can easily support this scenario too; all it takes is adding a module `network_graph.pick.table` (or something like that) that takes a network graph and either an 'edges' or 'nodes' string as input, and returns a table as result. I can easily add that, and will have it ready in the 'tropy' plugin in the next few days.

Anyway, the result (of type 'table') can subsequently be used in the code below, and lineage will be intact in the result of that.

@makkus
Collaborator

makkus commented Dec 29, 2023

Ok, I've written a small Jupyter notebook that contains a version where the nodes table is picked via a kiara operation so as not to break lineage (attached to this comment; I had to zip it, otherwise GitHub wouldn't let me add it). This should work with an updated environment that has the tropy plugin installed in the most recent version (pip install -e kiara_plugin.tropy).

notebook_example.zip

Development

Successfully merging this pull request may close these issues.

Accessing/querying network_data & module outputs
2 participants