Replies: 7 comments 23 replies
-
I am not sure what behavior I would expect. But considering multiprocessing is a good point. The uuid would also have the benefit that if you copy a |
-
@wshanks @NelDav I don't disagree that the conversation at #189 has sort of gone in many directions, and it could be helpful to start over. There is also a lot of overlap with #217, which is only a few days old. Discussions are inherently challenging when different interested parties are participating at very different rates. Like, honestly, you two started a discussion here, went back to #189, and continued exchanging multiple messages on a Saturday morning. These conversations are extremely hard to follow. Adding yet another conversation thread could help, but only if you slowed way, way, way down and explained the issues at hand. Communication is hard, but the goal of these discussions, issues, and pull requests is to make sure that the maintainers (and there are sort of a lot of us) all understand and agree with the expected behavior and the changes requested.

#189 should be closed. There are a) too many commits in that PR, spanning nearly a year, b) too many asides and discussions, and c) no consensus on what the behavior should be. Here, you sort of start with
To which I am going to say, as politely as possible: What?? You did not define
Please close #189 and #217, and start over by discussing and explaining the problem(s) to be solved with the current code. |
-
It seems like this discussion is totally motivated by the pickling problem. But I think we need to step back and have a discussion about what the requirements are and what we're trying to accomplish with these changes. I've also thrown in the idea that a refactor of the data objects in I'm going to say here that this is a very strange question. If I'm not sure how to design such that this is the case though. Perhaps we don't want to try to support correlations between variables that live in separate threads... |
-
I just noticed that the title is kind of confusing. It mentions two options to define
But it does not clarify that the first option is equivalent to the current implementation. |
-
Has this issue been hashed out here? I think this is a straight-up inconsistency that indicates the code needs to be reworked.
In the first example, copy breaks equality; in the second, it maintains it. I think copy should consistently keep or break equality in both cases. I think the current behavior is a bug, and the presence of this bug gives us the leeway to decide one way or the other whether copying keeps or breaks equality. I think copying should definitely keep equality. Suppose we decide copying breaks equality. Then above we have |
-
A solution to fix that would be to have one "reference" object for each UUID. The "reference" object would only have a nominal value and a std_dev. Store the assignment between UUID and "reference" variable in a static dict. Inside the linear part of
When updating the std_dev of |
-
Seems we're learning a lot quickly. Is there any need for
Stepping back and thinking conceptually, here's what makes sense to me.
One glaring issue here is that the proposal for It seems tricky to have both lazy evaluation and immutability. Especially if you want the object that the user has, I'm stumped here for the moment. edit: Actually I do think By contrast, |
-
`AffineScalarFunc` is defined as an object with a `nominal_value` and a `derivatives` dictionary with `[Variable, float]` pairs mapping `Variable`s to coefficients of derivatives with respect to those variables. `Variable` is a special subclass of `AffineScalarFunc` with `derivatives` set to `{self: 1.}`. Since `Variable` is used as a dictionary key, it must be hashable. `Variable.__hash__` is defined using `id(self)`:

uncertainties/uncertainties/core.py, lines 2779 to 2784 in ea1d664

`AffineScalarFunc.__eq__` is defined by `(self - other).nominal_value == 0 and (self - other).std_dev == 0`. The way error propagation works gives `(self - other).std_dev` like (this is not the exact form of the code) `sqrt(sum((d[var] * var.std_dev)**2 for var in d))`, where `d` is the elementwise difference of the two `derivatives` dicts.

This
`std_dev` can only be 0 when the `Variable`s in `self.derivatives` and `other.derivatives` are the same (or there are 0's for `Variable`s' `std_dev` or derivative coefficients) because of the `**2`. Since `Variable`s are defined with only themselves in their derivatives, they can only be equal to themselves or the `AffineScalarFunc` that comes from multiplying them by 1 (again ignoring the `std_dev == 0` case). (More generally, barring the `std_dev`/derivatives equal to 0 edge cases, `AffineScalarFunc`s are equal only when the nominal values are equal and the `derivatives` dict of `[Variable, coefficient]` pairs is equal.)

There is something a bit tricky happening here.
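The data model and `id()`-based hashing just described can be condensed into a small runnable sketch. This is a simplification for illustration only, not the actual `uncertainties` implementation; the class and attribute names follow the description above, but the operator support and the propagation formula are pared down:

```python
import math

class AffineScalarFunc:
    def __init__(self, nominal_value, derivatives):
        self.nominal_value = nominal_value
        self.derivatives = derivatives  # {Variable: derivative coefficient}

    @property
    def std_dev(self):
        # Errors combine in quadrature; the **2 is why the sum can only
        # vanish when every coefficient or std_dev involved is zero.
        return math.sqrt(sum((coef * var.std_dev) ** 2
                             for var, coef in self.derivatives.items()))

    def __sub__(self, other):
        derivs = dict(self.derivatives)
        for var, coef in other.derivatives.items():
            derivs[var] = derivs.get(var, 0.0) - coef
        return AffineScalarFunc(self.nominal_value - other.nominal_value,
                                derivs)

    def __rmul__(self, factor):
        return AffineScalarFunc(factor * self.nominal_value,
                                {v: factor * c
                                 for v, c in self.derivatives.items()})

    def __eq__(self, other):
        diff = self - other
        return diff.nominal_value == 0 and diff.std_dev == 0


class Variable(AffineScalarFunc):
    def __init__(self, nominal_value, std_dev):
        self._std_dev = std_dev
        # The Variable is its own (only) derivatives key, so it must be hashable.
        super().__init__(nominal_value, {self: 1.0})

    @property
    def std_dev(self):
        return self._std_dev

    def __hash__(self):
        # Identity-based, like the real Variable.__hash__.
        return id(self)


x = Variable(1.0, 0.1)
y = Variable(1.0, 0.1)
print(x == x)          # True: (x - x) has zero nominal value and std_dev
print(x == y)          # False: distinct Variables never cancel in quadrature
print(2 * x == 2 * x)  # True: same derivatives contents
```

Even in this toy version, two `Variable`s with identical nominal values and standard deviations are unequal, because their difference still carries nonzero uncertainty.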
`Variable.__eq__` ends up calling a method that calls `self.derivatives[self]`. That lookup in `self.derivatives` is performed by calling `Variable.__hash__` and checking the dict's hash table. If a matching hash is found, Python checks the corresponding matched key against the lookup key for equality. Since we are already inside of `Variable.__eq__`, that would trigger an infinite recursion... fortunately, Python checks `id()` of the two keys first and skips `__eq__` if they match. So this lookup is fine as long as there is no hash collision. Since `Variable.__hash__` is based on `id()`, and `uncertainties` controls the construction of `AffineScalarFunc.derivatives` and only allows `Variable` keys there, there is no possibility for such a hash collision (also, on x86_64 at least, memory addresses usually stay below `2**48` while the Python hash modulus is `2**61`).

So
`Variable.__eq__` is tied to `id()`. A `Variable` will be equal to itself, and there is no way to construct a second `Variable` instance that will be equal to the first (ignoring zero `std_dev`).

An alternative to using
`id()` would be to assign `Variable`s a random id using `uuid.uuid4()` from the standard library during `Variable.__init__`. With this method, each call to `Variable()` will still create a new independent variable instance, but the identity of that variable is not tied to the current Python process. (For reference, in Qiskit, another project I work with, there is a class called `Parameter` which works this way.)

Here is an example of how it might be desirable to split
`Variable` identity from Python identity:

Here, we create a
`Variable` named `a`, pass it to a subprocess for calculation (multiply by 2), and receive the result. We might expect that the result we got back would be equal to `2 * a`. However, because `multiprocessing` uses `pickle` to send data between processes, the result `AffineScalarFunc` and the `Variable` inside of it are created as new objects with new `id()` values, and that `Variable` is no longer the same as the original `a`, so the correlation is lost.

A similar example would be to calculate
`a2 = 2 * a` in a single process and then serialize `a` and `a2` with `pickle`. If they are serialized and deserialized together in the same object, like `b, b2 = pickle.loads(pickle.dumps([a, a2]))`, then `b2 == 2 * b`. If instead they are processed separately, like `b = pickle.loads(pickle.dumps(a)); b2 = pickle.loads(pickle.dumps(a2))`, then `b2 != 2 * b`. `pickle` preserves the correlation only when the objects are serialized together.

Another question that was raised was the behavior of
`copy`. I do not have much intuition for how `copy` should work. With `id()`-based equality, it has to produce a `Variable` that is not equal to the original. With `uuid`, it could work either way, depending on the implementation.

Another issue that comes up for the
`uuid` approach is that `Variable` is currently documented as being intentionally mutable. The documentation suggests changing the value of `Variable.std_dev` to see how the overall error on an `AffineScalarFunc` changes. That doesn't work with the `uuid` method, since with that method you can have different `Variable` instances representing the same independent variable, and changing the `std_dev` on one instance might not change the value on the instance actually used inside a particular `AffineScalarFunc`.

Regarding this last point, I wonder if it would not be better to deprecate that mutable
`std_dev` feature in favor of a helper function that could calculate the error on an `AffineScalarFunc` with a modified `std_dev` for a `Variable`, so that we could treat `Variable` as immutable. Then we could handle the 0 `std_dev` edge case of `__eq__` more cleanly.

Here is the original opening post, which provided less context:

I wanted to branch off the discussion started here and here into a separate thread, since #189 is already long enough and I don't want to make it harder to follow. I might edit this post later to give more of an introduction, but for now see the preceding linked comments. Here I respond to the second comment:
I agree that the current behavior matches the Python data model. That is why I said that switching to a uuid for `Variable` could be a follow-up after #189. The question is what behavior a user would want and expect. If I create a set of `AffineScalarFunc`s and serialize them, I would expect (naively, without looking at the current implementation) that deserializing them would preserve their correlations. With the current implementation, that is only true if they are serialized together within a single payload. I don't think it is unreasonable to imagine serializing `AffineScalarFunc`s in multiple payloads, though -- for example, using `multiprocessing` to generate some `AffineScalarFunc`s, send them to a subprocess, and then send the results back to the main process (`multiprocessing` uses pickle to serialize data sent between processes). The returned results would have lost correlations with the original values, which I think is surprising.

Here is an example:
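The code example did not survive rendering here. The mechanism can be sketched with a toy stand-in class (a hypothetical `Var`, not the `uncertainties` API): with no `__eq__` defined, equality falls back to object identity, just as `Variable`'s hash is tied to `id()`:

```python
import pickle

class Var:
    """Toy stand-in for Variable: no __eq__, so equality falls back to id()."""
    def __init__(self, nominal, std_dev):
        self.nominal = nominal
        self.std_dev = std_dev

a = Var(1.0, 0.1)
# One payload: pickle's memo preserves shared identity within it.
b, b2 = pickle.loads(pickle.dumps([a, a]))
print(b is b2)   # True
# Separate payload: deserialization creates a brand-new object.
c = pickle.loads(pickle.dumps(a))
print(c == a)    # False: a new object with a new id()
```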
I think it would be reasonable to have guessed that the last line would be True.
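For illustration, the `uuid`-based identity discussed above could look roughly like this. This is a minimal hypothetical sketch, not the current `uncertainties` implementation; a real change would also have to thread the uuid through `derivatives`, `copy`, and so on:

```python
import pickle
import uuid

class Variable:
    """Sketch of uuid-based identity (hypothetical, not the real class)."""
    def __init__(self, nominal_value, std_dev):
        self.nominal_value = nominal_value
        self.std_dev = std_dev
        # Ordinary instance state, so it survives pickling, unlike id().
        self._uuid = uuid.uuid4()

    def __hash__(self):
        return hash(self._uuid)

    def __eq__(self, other):
        return isinstance(other, Variable) and self._uuid == other._uuid

a = Variable(1.0, 0.1)
b = pickle.loads(pickle.dumps(a))
print(b == a)                   # True: identity is carried by the uuid
print(Variable(1.0, 0.1) == a)  # False: each call still makes a new variable
```

With this design, a pickle round trip through a separate payload no longer breaks equality, while each `Variable()` call still produces a new independent variable.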