
Fixing the issue with conflicting property dtypes #186

Closed
wants to merge 3 commits

Conversation

joni-herttuainen (Contributor)

Initial commit to have arbitrary test cases for mismatching dtypes on the same property

mgeplf (Contributor) commented Mar 8, 2023

@GianlucaFicarelli can you have a look at this? thanks.

GianlucaFicarelli (Contributor) left a comment

LGTM to test the current status.
Then we should define whether there are cases where a conversion should happen. A few points for discussion (here or offline):

  1. should we return the most generic type if possible? How to handle signed/unsigned numbers?
  2. should we let pandas do an automatic conversion when concatenating different dtypes? In what cases?
  3. should we consider the known dtypes defined in the spec, accessible in the schema from "Teach schemas to return datatypes for attributes for the nodes and edges" (#184), and use them for conversion when possible?
  4. should we convert strings to category, when at least one dtype is category?
  5. should we merge the categories, when they are different?

add_test_field(node_path, population, data, dtype)

res = Circuit(config_path).nodes.property_dtypes
assert isinstance(res, pd.Series)
GianlucaFicarelli (Contributor) commented Mar 8, 2023

We could also add:

        assert res.at[TEST_FIELD] == MAP_DTYPE[dtype]
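
Putting the suggestion together with the hunk above, the test body would then read (TEST_FIELD and MAP_DTYPE are simply the names already used above; nothing new is introduced here):

    add_test_field(node_path, population, data, dtype)

    res = Circuit(config_path).nodes.property_dtypes
    assert isinstance(res, pd.Series)
    assert res.at[TEST_FIELD] == MAP_DTYPE[dtype]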

@pytest.mark.parametrize(
    "dtypes",
    (
        ("categorical", "categorical"),

We could also test:

        ("categorical", "str"),
        ("categorical", "int"),
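
For context, a quick check with plain pandas (outside the test helpers) of what those two extra cases exercise; the expected dtypes noted in the comments are an assumption based on pandas falling back to object when a categorical is concatenated with a non-categorical:

    import pandas as pd

    cat_series = pd.Series(["a", "b"], dtype="category")
    str_series = pd.Series(["c", "d"])  # object dtype
    int_series = pd.Series([1, 2])      # int64 dtype

    # Concatenating a categorical with a plain str or int Series falls back to object.
    print(pd.concat([cat_series, str_series]).dtype)  # object
    print(pd.concat([cat_series, int_series]).dtype)  # object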

mgeplf (Contributor) commented Mar 9, 2023

@GianlucaFicarelli >

A few points for discussion (here or offline):

Here are my 2c.

  1. should we return the most generic type if possible? How to handle signed/unsigned numbers?
  2. should we let pandas do an automatic conversion when concatenating different dtypes? In what cases?

I think we should see what pandas does by default, and only intervene when it does something strange.
My hope is it does the following:

float32 | float64 -> float64
int8 | uint8 -> int8 if everything fits, otherwise int16
int16 | uint8 -> int16 if everything fits, otherwise int32
int16 | uint16 -> int16 if everything fits, otherwise int32
etc
  3. should we consider the known dtypes defined in the spec, accessible in the schema from "Teach schemas to return datatypes for attributes for the nodes and edges" (#184), and use them for conversion when possible?

I don't think so; my suspicion is that sometimes the dtypes are set on purpose so that things fit in memory (i.e. int8 is used when the spec says int64, because it's possible).

  4. should we convert strings to category, when at least one dtype is category?

I think so.

  5. should we merge the categories, when they are different?

I think so, too. I don't think pandas has functionality to do this, so we'll have to make sure we do it right.

joni-herttuainen (Contributor, Author)

What pandas does in the following cases is:

float32 | float64 -> float64
int8 | uint8 -> int16 (always)
int16 | uint8 -> int16
int16 | uint16 -> int32 (always)
int32 | uint32 -> int64 (always)

and my absolute favorite:

int64 | uint64 -> float64

which, I guess, comes from numpy behavior that is an open issue:
numpy/numpy#20905

If we want to have the "conditional" coercion to intN if possible, otherwise intM (M=2*N), we need to do it manually.
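
These results match numpy's promotion table, which can be inspected directly with np.promote_types. As for the manual route, a rough sketch of what it could look like (a hypothetical helper, not what this PR implements): concatenate first, then shrink back to the smallest signed integer dtype that still holds the combined values:

    import numpy as np
    import pandas as pd

    # The default promotion applied when concatenating mismatched numeric dtypes:
    np.promote_types("int8", "uint8")    # int16
    np.promote_types("int16", "uint16")  # int32
    np.promote_types("int64", "uint64")  # float64

    def concat_smallest_int(series_list):
        # Hypothetical helper: concatenate, then downcast to the smallest
        # signed integer dtype whose range covers the combined min/max.
        merged = pd.concat(series_list, ignore_index=True)
        if pd.api.types.is_integer_dtype(merged.dtype):
            lo, hi = merged.min(), merged.max()
            for candidate in (np.int8, np.int16, np.int32, np.int64):
                info = np.iinfo(candidate)
                if info.min <= lo and hi <= info.max:
                    return merged.astype(candidate)
        # Note: int64 | uint64 has already become float64 at this point,
        # so it is returned unchanged here.
        return merged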

  4. should we convert strings to category, when at least one dtype is category?

I think this is reasonable.

  5. should we merge the categories, when they are different?

    I think so, too. I don't think pandas has functionality to do this, so we'll have to make sure we do it right.

Agreed

joni-herttuainen (Contributor, Author) commented Mar 14, 2023

Hmm, I think the issue we have here is of the chicken and egg kind:

The only time we need to use the combined property_dtypes of multiple node/edge populations is when we are trying to output combined data in NetworkObject.get. The reason we use it is to improve the performance of get (02da3ed).

However, having a sophisticated way of deciding which int type to use would mean that we would need to go through the data first and see whether we can cast it to another dtype (if we are not going with the default pandas rules).

Combining categories should be okay, but concatenating strings (objects) to categoricals would mean finding the unique strings in the data (to build the categories), which would likely save memory but would certainly slow things down.

What I'm saying is that it seems to me we would be programmatically "combining the data" just to get the combined dtypes that let us combine the data more efficiently.

Or is there another reason why we would like to have explicit dtypes for the combined data?

joni-herttuainen (Contributor, Author)

We discussed the issue with @GianlucaFicarelli, and we more or less came to the conclusion that the property_dtypes attribute seems redundant. The per-population property_dtypes is only used to create the global (nodes/edges) property_dtypes, which in turn is only used in snap when plotting simulations (_plotting.py) and when concatenating results from multiple populations in NetworkObject.get, which was only introduced for efficiency. Both use cases can be covered without property_dtypes.

We were also wondering what the use case for Nodes.get and Edges.get is to begin with, and whether they are actually needed; but I can see them having more use cases than property_dtypes, so there is a bigger motivation for keeping them.

Joni Herttuainen added 2 commits March 24, 2023 15:51
* fix the functionality to fetch and merge different properties from multiple populations
@joni-herttuainen joni-herttuainen changed the title Added generation of dtype test cases (both mismatch and matching dtypes) Fixing the issue with conflicting property dtypes Mar 24, 2023
@joni-herttuainen joni-herttuainen marked this pull request as ready for review March 24, 2023 15:05
joni-herttuainen (Contributor, Author)

I went with the path of letting pandas decide the dtypes, except for category.

That effectively means that merging uintX, intX ends up as:
  • float64, if X=64
  • intY (Y=2X), otherwise

Merging intX with intX, or floatX with floatX, would still give intX and floatX, respectively. The reasoning here is that I would rather not complicate the code by going through all the dtypes and data with np.can_cast unless it is really necessary.
I could make an exception for the int64, uint64 -> float case, since in the case of indices those are converted to int64 anyway.

Even for category, I went with the "first merge, then convert to category" approach, after spending a reasonable amount of time and effort that resulted in an adequate solution at best.

Reasoning:

  • When creating a categorical dtype, each of the categories needs to be unique
  • creating a sophisticated merge of categories would mean
    • making sure each category is unique across all the categories of all the populations being merged
    • if not unique
      • finding all the data indices sharing categories between populations
      • making sure the indices are the same
        • category_A might have a code/index of 0 in one population but 1 in another
      • offsetting the rest of the indices when merging categories (to make sure we have n_categories = n_indices)
      • e.g. doing something like this for all the categories in all the populations:
        dataset_1 = {'categories': ['a', 'b'], 'indices': [0, 1, 0, 1]}
        dataset_2 = {'categories': ['c', 'a'], 'indices': [1, 0, 0, 1]}  # need to convert indices to [0, 2, 2, 0]
        merged = {'categories': ['a', 'b', 'c'], 'indices': [0, 1, 0, 1, 0, 2, 2, 0]}
  • In some cases, such as morphologies, the ratio of categories to length is high, which means that computing the above could be slow and inefficient, especially if the number of indices and populations to consider is high.
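
For illustration, a minimal sketch of the "first merge, then convert to category" approach with plain pandas (the behaviour relied on here is that concatenating categoricals with different categories falls back to object, so the categories and codes are rebuilt once over the combined data):

    import pandas as pd

    s1 = pd.Series(["a", "b", "a", "b"], dtype="category")  # categories: ['a', 'b']
    s2 = pd.Series(["a", "c", "c", "a"], dtype="category")  # categories: ['a', 'c']

    # Mismatched categoricals concatenate to object dtype...
    merged = pd.concat([s1, s2], ignore_index=True)
    # ...so the result is converted back to category in a single pass,
    # which rebuilds consistent codes over the union of categories.
    merged = merged.astype("category")

    print(merged.dtype)                    # category
    print(merged.cat.categories.tolist())  # ['a', 'b', 'c']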

joni-herttuainen (Contributor, Author)

Fixed in #217
