-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing the issue with conflicting property dtypes #186
Conversation
@GianlucaFicarelli can you have a look at this? thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM to test the current status.
Then we should define if there are cases where a conversion should happen. A few points for discussion (here or offline):
- should we return the most generic type if possible? How to handle signed/unsigned numbers?
- should we let pandas do an automatic conversion when concatenating different dtype? in what cases?
- should we consider the known dtypes defined in the spec, accessible in the schema from Teach schemas to return datatypes for attributes for the nodes and edges #184, and use them for conversion when possible?
- should we convert strings to category, when at least one dtype is category?
- should we merge the categories, when they are different?
tests/test_dtype_mismatch.py
Outdated
add_test_field(node_path, population, data, dtype) | ||
|
||
res = Circuit(config_path).nodes.property_dtypes | ||
assert isinstance(res, pd.Series) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add also:
assert res.at[TEST_FIELD] == MAP_DTYPE[dtype]
tests/test_dtype_mismatch.py
Outdated
@pytest.mark.parametrize( | ||
"dtypes", | ||
( | ||
("categorical", "categorical"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could test also:
("categorical", "str"),
("categorical", "int"),
Here are my 2c.
I think we should see what pandas does by default, and only intervene when it does something strange.
I don't think so, my suspicion is that sometimes the dtypes are set on purpose so that things fit in memory (ie: int8 is used when the spec says int64, because it's possible).
I think so.
I think so, too. I don't think pandas has functionality to do this, so we'll have to make sure we do it right. |
What pandas does in the following cases is:
and my absolute favorite:
which, I guess, from numpy behavior that Is an open issue: If we want to have the "conditional" coercion to intN if possible, otherwise intM (M=2*N), we need to do it manually.
I think this is reasonable.
Agreed |
Hmm, I think the issue we have here is of the chicken and egg kind: The only time we need to use the combined However, to have a sophistical way of deciding which int type to use, would mean that we would need to go through the data and see if we can cast it to another dtype first (if we are not going with the default pandas rules). Combining categories should be okay but concatenating What I'm saying is that it seems to me that we are programmatically "combining the data" so that we get the combined dtypes to more efficiently combine the data. Or is there another reason why we would like to have explicit dtypes for the combined data? |
We discussed the issue w/ @GianlucaFicarelli, and we sort of came into the conclusion that the We were also wondering, what is the use case for |
* fix the functionaltity to fetch and merge different properties from multiple populations
f2b0f37
to
2d2095f
Compare
I went with the path of letting pandas decide the dtypes other than for That effectively means that merging Merging Even for the category, I went with "first merge and then convert to category" approach after a reasonable amount of time and effort that resulted in adequate solution at best. Reasoning:
|
Fixed in #217 |
Initial commit to have arbitrary test cases for mismatching dtypes on same property