-
-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hash function for AffineScalarFunc class #189
Conversation
Add __hash__ so it works with Pandas, Pint, and Pint-Pandas Signed-off-by: Michael Tiemann <[email protected]>
Tweak hash function so that when x==y, hash(x)==hash(y) Test case: ``` import uncertainties from uncertainties import ufloat u = ufloat(1.23, 2.34) v = ufloat(1.23, 2.34) print(f"u{u} == v{v}: {u==v}") print(f"hash(u){hash(u)} == hash(v){hash(v)}: {hash(u)==hash(v)}") print(f"u{u} == (u+u)/2{(u+u)/2}: {u==(u+u)/2}") print(f"hash(u){hash(u)} == hash((u+u)/2){hash((u+u)/2)}: {hash(u)==hash((u+u)/2)}") ``` Signed-off-by: Michael Tiemann <[email protected]>
Ensure that we have a linear_combo attribute before trying to look up its keys. Also re-indent so our conditional test is not too long, and add test case to test_uncertainties. Now test_uncertainties passes all 31 tests original tests plus the new one, and test_umath passes all 9 tests. Signed-off-by: Michael Tiemann <[email protected]>
Call `expand` directly if self._linear_part is not yet expanded. Signed-off-by: Michael Tiemann <[email protected]>
Attempt to work around old version of nose that doesn't work with later versions of Python. Will it work? Signed-off-by: Michael Tiemann <[email protected]>
This reverts commit 7cca18c.
Attempt replacing `nose` (which has not been regularly supported for years) with `pytest` for better CI/CD experience. Signed-off-by: Michael Tiemann <[email protected]>
This reverts commit f3cb615.
Try running nosetests directly, rather than via setuptools, which trips up nose-1.3.7 for python >= 3.6. See nose-devs/nose#873 Signed-off-by: Michael Tiemann <[email protected]>
Install `future` to satisfy `builtins` for Python 2.7. Signed-off-by: Michael Tiemann <[email protected]>
Duplicate of #170 |
…sh-for-AffineScalarFunc
…ct, but simplified method.
The fact that the changes pass the tests is obviously a great first step. Looks like the next step is ensuring they work for all the versions of Pandas that |
Could you get this running again? It would be good to get this into a future release. |
Hi @andrewgsavage, nice to hear from you. I already were afraid, that this library is dead. Glad to see that a new maintainer was found :) I merged the master branch into this one and hope, that this will fix the pipeline. |
The Python data model requires that objects that compare equal have the same value for
but I have not thought through edge cases carefully ( |
Hi @wshanks, yes yo are right, your solution will lead to same hashes for equal objects. However, it is more or less the same as the solution of this PR but in one line :) Btw, the |
I checked the unittests and I really do not understand what's wrong. I just analyzed the first failing test so far. fixed_deriv_value = func_approx.derivatives[arg] This line is executed many time before, however for the function
When comparing >>> print(arg == list(func_approx.derivatives.keys())[0])
True We can also see, that those are identical objects: >>> print(arg is list(func_approx.derivatives.keys())[0])
True Accordingly, when checking if >>> print(arg in list(func_approx.derivatives.keys()))
True What's really strange is, when we do not convert the >>> print(arg in func_approx.derivatives.keys())
False I also compared the hashes, of >>> print(hash(arg))
3674758567180263205
>>> print(hash(list(func_approx.derivatives.keys())[0]))
3674758567180263205 So, why does |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my earlier comment was misguided because I was thinking about Variable
like a data container and not considering that ufloat(1, 1) != ufloat(1, 1)
because the two objects represent independent variables. We should keep using id
in the hash and not _nominal_value
for the Variable
s.
uncertainties/core.py
Outdated
if hasattr(self, '_linear_part') and hasattr(self, '_nominal_value'): | ||
return super().__hash__() | ||
else: | ||
return id(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test failure is coming from this part. __hash__
has to give the same value every time. To avoid the error, one option is to change __hash__
here to:
return hash((self._nominal_value, (id(self),), (1.,)))
and update __init__
to set self._nominal_value
before the {self: 1.}
dict is created. __setstate__
also needs to be updated to make sure _nominal_value
is set before it creates the self-referential dict (but see my other comment about wanting to avoid tuples).
It is tempting to try to remove the self-reference but I have seen another case in a symbolic algebra library where a set up similar to this one was the best option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are totally right, the hash must be the same every time. Not sure, how I could miss that.
uncertainties/core.py
Outdated
|
||
ids = [id(d) for d in self.derivatives.keys()] | ||
values_hash = tuple(self.derivatives.values()) | ||
return hash((self._nominal_value, tuple(ids), values_hash)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add a case to test_hash
like
z_xy = x + y
z_yx = y + x
assert z_xy == z_yx
assert hash(z_xy) == hash(z_yx)
That case fails for me with this implementation but the implementation is dependent on the iteration order of derivatives
which depends on in the insertion order, which can vary depending on the order in which operations are performed. Changing the tuples to frozenset
or sorting the values before casting to tuple
could remove the order dependence.
Also, we could define __hash__
on LinearCombination
and use that here instead of constructing it from derivatives
directly. Perhaps there is no use for that though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth trying implementations that cover obvious use cases (such as z_xy == z_yx) and using assertions to raise if we've gone past the limit of our assumptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm the problem of a forzenset might be, that the elements of a set must be unique right?
The ids
are unique, but the values might not.
I will try to create a test for that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the example:
a = ufloat(1.23, 2.34)
b = ufloat(1.23, 2.34)
c = ufloat(1.23, 2.34)
x = AffineScalarFunc(1, LinearCombination({a:1, b:2, c:1}))
y = AffineScalarFunc(1, LinearCombination({a:1, b:2, c:2}))
assert x != y # Valid
assert hash(x) != hash(y) # Both hashes are identical :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am using sorted tuples now, to avoid this problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm the problem of a forzenset might be, that the elements of a set must be unique right?
This is not a blocking problem for the implementation of the __hash__
. It is okay if unequal objects hash to the same value. Using frozenset
would make it a little more likely to get hash collisions, but only in the case of multiple error terms with exactly the same value. Still, sorting should work just as well for the purpose of producing a deterministic hash. In the case of giant calculations with many error terms maybe sorting is slower than keeping a set, but in practice usually only a handful of variables are involved and sorting should be about as quick as creating a set.
When |
I agree, the only problem would be, that If it is a problem, we could stick to |
As stated above, the Python data model requires that This was my solution from last year--an injection of hash functions into
|
Yes, that's true, the data model has two conditions for hashes:
However, as long as the object is mutable, and the hash depends on any of the mutable properties, we cannot guarantee, that the hash does not change during lifetime right? Summarized, I think we have to violate one of the conditions. PS: The data model recommends to not implement a the hash function for mutable objects at all:
Currently, I think we should violate the second condition for I pushed a new proposal. |
pandas ExtensionArray. An array of objects for use with pandas. Does not necessarily need to use a numpy array to hold data. |
Also, I don't know that the example adds too much here, but I want to note that I had a similar discussion several months ago about a very similar data structure in another project here. |
@MichaelTiemannOSC Thanks for the clarification. I am not very fluent in pandas. I think propagating uncertainties on complex numbers is a fine idea. I'm not sure I would call it "easy", but it should be doable. I have a problem domain where it would very useful. |
Nothing like a new feature to help motivate continued efforts on maintenance ;-) |
So what is the story with the python data model. My understanding is that an object should be immutable and have an It sounds like we are planning to break the recommendations of the data model. I would prefer not to do that unless there is an awfully good reason to do so. Again, I haven't dug into this as much as I'd like, or followed this discussion as closely as I'd like, but it sounds like the issue is that the |
@jagerber48 I think, that is not completely correct. The data model defines the following:
That is exactly, what we did in this PR.
Yes, in general they recommend to not implement |
So, yeah why isn't
Like, this:
|
I am repeating what @NelDav said a little but just to re-emphasize: the data model requires that objects that compare equal have the same hash and that the hash never changes. I think considering the case of dict keys is helpful since that is the main use case. When inserting into a dictionary, an object is hashed and then the dict is checked to see if that hash (or a truncated part of it dependent on the size of the dict) is already present. If the hash is already present, then the two matching objects are checked for equality to see if the object is already present or if it is just a hash collision. So both
uncertainties/uncertainties/core.py Lines 806 to 814 in ea1d664
which casts both quantities to It is hard to point to how subtraction is defined because there is a lot of machinery for setting up the basic math methods, but basically I had not thought about the edge cases closely before writing this last paragraph. There are a couple:
I am not sure what to do about the |
Thinking about the other similar case I mentioned, one difference is that instead of using the object import pickle
from uncertainties import ufloat
b = ufloat(1, 0.1)
x = 2 * b
x2, b2 = pickle.loads(pickle.dumps((x, b)))
b3 = pickle.loads(pickle.dumps(b))
x3 = pickle.loads(pickle.dumps(x))
x == 2 * b # True
x2 == 2 * b2 # True
x3 == 2 * b3 # False! The last line is false because |
TLDR:
That's true, however the hash will also be unequal. Which means, that it is conform with the python data model right? In general, I think there are two questions, that we have to clarify:
If we decide, that two unpickled versions of the same object must be unequal, it follows, that if you unpickle a I personally think, the |
@wshanks Thank you very much for your detailed explanation. It is very informative.
From my perspective, this is not an issue of the hash, but of the
This is a little strange to me. Probably it is because of my lack of knowledge about uncertainties. |
Thanks, I totally agree that this should be discussed in in another issue. |
That's true of
|
That was also my first idea. And I think it would be a nice solution. |
I just tested the new assert op(ufloat(-1, 0), 9) == ref_op(-1, 9) In this case, the new definition will return false, because the derivatives of the left side will be So, I think there is a problem as soon as we are checking for uncertainties with a But maybe we can move this discussion to #218 as well. It seems to be a more general problem of |
Comparing the above (uncertainties with error of zero) with complex vs. real behavior:
I think that an uncertain number that happens to be certain (like a complex number that happens to be real) should honor equality relationships. Can a variable's value or an affine scalar function's value ever change from certain (error == 0) to uncertain? I can see the error term going from undefined to defined--that's not a problem. |
Yes, I think your comparison with complex vs. real fit quite well. |
@MichaelTiemannOSC I was going to raise a similar point. Note the following:
So current
Besides these options and others, we need to consider what level of interoperability there should be between |
@jagerber48 I meant to mention earlier that I think this blog post does a nice job exploring some of the edge cases that come up when a class does not follow the Python data model. |
-1 on merging. Too many overlapping changes and conflicting opinions. Please start over. |
I do not understand. There are changes in two files, with in total only 117 new lines and 11 removed lines. |
@NelDav I agree that this PR should be closed for now. It would be better to come to a resolution at least among @MichaelTiemannOSC, myself, and you (just selecting those three as the ones that seem most interested seeing something like the current PR merged) before presenting a new proposed change in a new pull request. Given that in the recent course of discussion we have discovered violations of the Python data model with the code currently in the |
Where should we discuss such things, if we do not do that in a PR?
Yes, we have discovered some violations of the Python data model inside the current master branch.
Is there any disagreement between us at the moment? - For me, it feels like all disagreements between @MichaelTiemannOSC, you and me have been resolved. |
I will close this PR now, for further questions or propositions regarding the hash, please create an issue at my fork: https://github.com/NelDav/uncertainties |
closes #186