
Fixing the issue with conflicting property dtypes #186

Closed
wants to merge 3 commits

Conversation

joni-herttuainen (Contributor)

Initial commit to have arbitrary test cases for mismatching dtypes on the same property

mgeplf (Contributor) commented Mar 8, 2023

@GianlucaFicarelli can you have a look at this? thanks.

GianlucaFicarelli (Contributor) left a comment

LGTM to test the current status.
Then we should define whether there are cases where a conversion should happen. A few points for discussion (here or offline):

  1. should we return the most generic type if possible? How to handle signed/unsigned numbers?
  2. should we let pandas do an automatic conversion when concatenating different dtypes? In what cases?
  3. should we consider the known dtypes defined in the spec, accessible in the schema from "Teach schemas to return datatypes for attributes for the nodes and edges" (#184), and use them for conversion when possible?
  4. should we convert strings to category, when at least one dtype is category?
  5. should we merge the categories, when they are different?

add_test_field(node_path, population, data, dtype)

res = Circuit(config_path).nodes.property_dtypes
assert isinstance(res, pd.Series)
GianlucaFicarelli (Contributor) commented Mar 8, 2023

We could also add:

        assert res.at[TEST_FIELD] == MAP_DTYPE[dtype]
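
Putting the suggestion together with the hunk above, the test body would then read (TEST_FIELD and MAP_DTYPE are simply the names already used above; nothing new is introduced here):

    add_test_field(node_path, population, data, dtype)

    res = Circuit(config_path).nodes.property_dtypes
    assert isinstance(res, pd.Series)
    assert res.at[TEST_FIELD] == MAP_DTYPE[dtype]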

@pytest.mark.parametrize(
    "dtypes",
    (
        ("categorical", "categorical"),

We could also test:

        ("categorical", "str"),
        ("categorical", "int"),
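
For context, a quick check with plain pandas (outside the test helpers) of what those two extra cases exercise; the expected dtypes noted in the comments are an assumption based on pandas falling back to object when a categorical is concatenated with a non-categorical:

    import pandas as pd

    cat_series = pd.Series(["a", "b"], dtype="category")
    str_series = pd.Series(["c", "d"])  # object dtype
    int_series = pd.Series([1, 2])      # int64 dtype

    # Concatenating a categorical with a plain str or int Series falls back to object.
    print(pd.concat([cat_series, str_series]).dtype)  # object
    print(pd.concat([cat_series, int_series]).dtype)  # object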

mgeplf (Contributor) commented Mar 9, 2023

@GianlucaFicarelli >

A few points for discussion (here or offline):

Here are my 2c.

  1. should we return the most generic type if possible? How to handle signed/unsigned numbers?
  2. should we let pandas do an automatic conversion when concatenating different dtypes? In what cases?

I think we should see what pandas does by default, and only intervene when it does something strange.
My hope is it does the following:

float32 | float64 -> float64
int8 | uint8 -> int8 if everything fits, otherwise int16
int16 | uint8 -> int16 if everything fits, otherwise int32
int16 | uint16 -> int16 if everything fits, otherwise int32
etc
  3. should we consider the known dtypes defined in the spec, accessible in the schema from "Teach schemas to return datatypes for attributes for the nodes and edges" (#184), and use them for conversion when possible?

I don't think so; my suspicion is that sometimes the dtypes are set on purpose so that things fit in memory (i.e. int8 is used when the spec says int64, because it's possible).

  4. should we convert strings to category, when at least one dtype is category?

I think so.

  5. should we merge the categories, when they are different?

I think so, too. I don't think pandas has functionality to do this, so we'll have to make sure we do it right.

joni-herttuainen (Contributor, Author)

What pandas does in the following cases is:

float32 | float64 -> float64
int8 | uint8 -> int16 (always)
int16 | uint8 -> int16
int16 | uint16 -> int32 (always)
int32 | uint32 -> int64 (always)

and my absolute favorite:

int64 | uint64 -> float64

which, I guess, comes from numpy behavior that is an open issue:
numpy/numpy#20905

If we want to have the "conditional" coercion to intN if possible, otherwise intM (M=2*N), we need to do it manually.
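
These results match numpy's promotion table, which can be inspected directly with np.promote_types. As for the manual route, a rough sketch of what it could look like (a hypothetical helper, not what this PR implements): concatenate first, then shrink back to the smallest signed integer dtype that still holds the combined values:

    import numpy as np
    import pandas as pd

    # The default promotion applied when concatenating mismatched numeric dtypes:
    np.promote_types("int8", "uint8")    # int16
    np.promote_types("int16", "uint16")  # int32
    np.promote_types("int64", "uint64")  # float64

    def concat_smallest_int(series_list):
        # Hypothetical helper: concatenate, then downcast to the smallest
        # signed integer dtype whose range covers the combined min/max.
        merged = pd.concat(series_list, ignore_index=True)
        if pd.api.types.is_integer_dtype(merged.dtype):
            lo, hi = merged.min(), merged.max()
            for candidate in (np.int8, np.int16, np.int32, np.int64):
                info = np.iinfo(candidate)
                if info.min <= lo and hi <= info.max:
                    return merged.astype(candidate)
        # Note: int64 | uint64 has already become float64 at this point,
        # so it is returned unchanged here.
        return merged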

  4. should we convert strings to category, when at least one dtype is category?

I think this is reasonable.

  5. should we merge the categories, when they are different?

    I think so, too. I don't think pandas has functionality to do this, so we'll have to make sure we do it right.

Agreed

joni-herttuainen (Contributor, Author) commented Mar 14, 2023

Hmm, I think the issue we have here is of the chicken and egg kind:

The only time we need to use the combined property_dtypes of multiple node/edge populations is when we are trying to output combined data in NetworkObject.get. The reason we use it is to improve the performance of get (02da3ed).

However, having a sophisticated way of deciding which int type to use would mean that we would need to go through the data first and see whether we can cast it to another dtype (if we are not going with the default pandas rules).

Combining categories should be okay, but concatenating strings (objects) to categoricals would mean finding the unique strings in the data (to build the categories), which would likely save memory but would certainly slow things down.

What I'm saying is that it seems to me we would be programmatically "combining the data" just to get the combined dtypes that let us combine the data more efficiently.

Or is there another reason why we would like to have explicit dtypes for the combined data?

joni-herttuainen (Contributor, Author)

We discussed the issue with @GianlucaFicarelli, and we more or less came to the conclusion that the property_dtypes attribute seems redundant. The per-population property_dtypes is only used to create the global (nodes/edges) property_dtypes, which in turn is only used in snap when plotting simulations (_plotting.py) and when concatenating results from multiple populations in NetworkObject.get, which was only introduced for efficiency. Both use cases can be covered without property_dtypes.

We were also wondering what the use case for Nodes.get and Edges.get is to begin with, and whether they are actually needed; but I can see them having more use cases than property_dtypes, so there is a bigger motivation for keeping them.

Joni Herttuainen added 2 commits March 24, 2023 15:51
* fix the functionality to fetch and merge different properties from multiple populations
@joni-herttuainen joni-herttuainen changed the title Added generation of dtype test cases (both mismatch and matching dtypes) Fixing the issue with conflicting property dtypes Mar 24, 2023
@joni-herttuainen joni-herttuainen marked this pull request as ready for review March 24, 2023 15:05
joni-herttuainen (Contributor, Author)

I went with the path of letting pandas decide the dtypes, except for category.

That effectively means that merging uintX, intX ends up as:
  • float64, if X=64
  • intY (Y=2X), otherwise

Merging intX with intX, or floatX with floatX, would still give intX and floatX, respectively. The reasoning here is that I would rather not complicate the code by going through all the dtypes and data with np.can_cast unless it is really necessary.
I could make an exception for the int64, uint64 -> float case, since in the case of indices those are converted to int64 anyway.

Even for category, I went with the "first merge, then convert to category" approach, after spending a reasonable amount of time and effort that resulted in an adequate solution at best.

Reasoning:

  • When creating a categorical dtype, each of the categories needs to be unique
  • creating a sophisticated merge of categories would mean
    • making sure each category is unique across all the categories of all the populations being merged
    • if not unique
      • finding all the data indices sharing categories between populations
      • making sure the indices are the same
        • category_A might have a code/index of 0 in one population but 1 in another
      • offsetting the rest of the indices when merging categories (to make sure we have n_categories = n_indices)
      • e.g. doing something like this for all the categories in all the populations:
        dataset_1 = {'categories': ['a', 'b'], 'indices': [0, 1, 0, 1]}
        dataset_2 = {'categories': ['c', 'a'], 'indices': [1, 0, 0, 1]}  # need to convert indices to [0, 2, 2, 0]
        merged = {'categories': ['a', 'b', 'c'], 'indices': [0, 1, 0, 1, 0, 2, 2, 0]}
  • In some cases, such as morphologies, the ratio of categories to length is high, which means that computing the above could be slow and inefficient, especially if the number of indices and populations to consider is high.
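
For illustration, a minimal sketch of the "first merge, then convert to category" approach with plain pandas (the behaviour relied on here is that concatenating categoricals with different categories falls back to object, so the categories and codes are rebuilt once over the combined data):

    import pandas as pd

    s1 = pd.Series(["a", "b", "a", "b"], dtype="category")  # categories: ['a', 'b']
    s2 = pd.Series(["a", "c", "c", "a"], dtype="category")  # categories: ['a', 'c']

    # Mismatched categoricals concatenate to object dtype...
    merged = pd.concat([s1, s2], ignore_index=True)
    # ...so the result is converted back to category in a single pass,
    # which rebuilds consistent codes over the union of categories.
    merged = merged.astype("category")

    print(merged.dtype)                    # category
    print(merged.cat.categories.tolist())  # ['a', 'b', 'c']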

joni-herttuainen (Contributor, Author)

Fixed in #217
