
Unequal treatment of np.NaNs with operators #54

Open
fkroeber opened this issue Sep 5, 2024 · 3 comments
Labels
bug 🐛 Something isn't working

Comments

fkroeber (Collaborator) commented Sep 5, 2024

Description

When evaluating the operators, np.NaNs are treated inconsistently depending on whether they occur in the array x or in the comparison array y. This is due to the use of np.where(pd.notnull(x)...) without the complementary np.where(pd.notnull(y)...) in operators.py: one-sided NaNs are retained if they occur in x, but not if they occur in y. One consequence of this is, for example, a violation of the commutativity of Boolean operators (see the MRE below and the standalone sketch that follows).
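For illustration, here is a minimal standalone sketch of the pattern described above (assumptions: plain numpy/pandas outside semantique, with utils.null_as_zero approximated by np.nan_to_num):

import numpy as np
import pandas as pd

def f(x, y):
    # current one-sided pattern: only NaNs in x survive into the result
    y = np.nan_to_num(y)
    return np.where(pd.notnull(x), np.logical_or(x, y), np.nan)

x = np.array([1.0, np.nan])
y = np.array([np.nan, 1.0])

print(f(x, y))  # [ 1. nan]
print(f(y, x))  # [nan  1.] -> "or" is no longer commutative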

Reproducible example

import json
import geopandas as gpd
import semantique as sq
from semantique.processor.core import QueryProcessor

# define recipe
recipe = sq.QueryRecipe()

recipe["res_a"] = sq.reflectance("s2_band02").\
    filter_time("before", sq.time_instant("2019-12-31")).\
    evaluate("or", sq.reflectance("s2_band02"))

recipe["res_b"] = sq.reflectance("s2_band02").\
    evaluate("or", sq.reflectance("s2_band02").filter_time("before", sq.time_instant("2019-12-31")))

# create context
with open("files/mapping.json", "r") as file:
    mapping = sq.mapping.Semantique(json.load(file))

with open("files/layout_gtiff.json", "r") as file:
    dc = sq.datacube.GeotiffArchive(json.load(file), src = "files/layers_gtiff.zip")

space = sq.SpatialExtent(gpd.read_file("files/footprint.geojson"))
time = sq.TemporalExtent("2019-01-01", "2020-12-31")

context = {
    "datacube": dc, 
    "mapping": mapping,
    "space": space,
    "time": time,
    "crs": 3035, 
    "tz": "UTC", 
    "spatial_resolution": [-10, 10],
    "track_types": False
}

# execute recipe
fp = QueryProcessor.parse(recipe, **context)
response = fp.optimize().execute()

# evaluate equivalence of both results -> leads to False
response["res_a"].equals(response["res_b"])

Expected behavior

The correctness of the operator algebra should be ensured by consistent handling of NaNs regardless of whether they occur in x or y.

Proposed solution

Replacing the following operator definitions

def f(x, y):
    y = utils.null_as_zero(y)
    return np.where(pd.notnull(x), np.logical_or(x, y), np.nan)

with

def f(x, y):
    return np.where(pd.notnull(x) & pd.notnull(y), np.logical_or(x, y), np.nan)
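As a quick standalone check (same assumptions as in the sketch above, i.e. plain numpy/pandas outside semantique), the symmetric variant propagates one-sided NaNs from either operand and is therefore order-independent:

import numpy as np
import pandas as pd

def f_symmetric(x, y):
    # proposed pattern: mask the result wherever either operand is NaN
    return np.where(pd.notnull(x) & pd.notnull(y), np.logical_or(x, y), np.nan)

x = np.array([1.0, np.nan])
y = np.array([np.nan, 1.0])

print(f_symmetric(x, y))  # [nan nan]
print(f_symmetric(y, x))  # [nan nan] -> identical in both argument orders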
fkroeber added the bug 🐛 Something isn't working label on Sep 5, 2024
A comment by luukvdmeer was marked as outdated.

luukvdmeer (Member) commented
Sorry, I replied a bit too fast, saying a lot of stuff that does not make sense at all ;) As semantique treats it now, it does indeed seem to be a bug in any case. We do have the issue of nan values having different meanings, see #15. If a True observation in x has no corresponding observation in y, the result of the AND operator should intuitively be something like "unknown". But we do not have a way yet to distinguish between things like "the value of this observation is unknown" and "there was no observation done at this location at all". Thoughts?

fkroeber (Collaborator, Author) commented

Even though I do see the issue of NaN values having multiple semantics (they could represent missing, unknown or invalid values), I'm not sure whether this is related to the current problem or should rather be seen as something independent. I don't see how the unequal treatment of NaNs, depending on whether they occur in x or y, currently contributes to resolving the different NaN semantics. I also doubt that there is any way of leveraging the NaN treatment within the operators module to resolve the NaN semantics issue, even if we restructured it. Two considerations that justify my point of view:

  1. If we wanted to use differing NaN treatment to express the semantic differences, we would first have to decide which NaN semantics should result in which operator behaviour. In concrete terms for the current handling of NaN values: what meaning must NaNs in the arrays x and y have so that propagating the NaNs from x while suppressing the NaNs from y, as currently practised, actually corresponds to the meaning of those NaNs and their combination? Should NaN values in array y, for instance, always be "unknown" values that may therefore be overwritten? If a coherent consensus were reached on such (or another) evaluation behaviour, this semantics-specific handling of NaN values would have to be clearly documented and communicated to the user.

  2. Building on point 1: even if there were a consensus on the meaning of the NaN values in the arrays x and y, the current implementation and any changes to it would be unsuitable for implementing such a consensus correctly.

  • Since NaN values are always evaluated as Boolean true in y, they inevitably take on a different semantic meaning depending on the operator (cf. the Boolean bivariate operators or and xor). A semantically desired evaluation behaviour such as "ignore the values in y" can therefore not be achieved with a blanket np.where(pd.notnull(x), <operator>, np.nan).
  • Across multiple operations, control over the meaning of the NaN values cannot be maintained consistently in any case. Specifically: after applying a certain operator, the result array contains NaN values with different meanings (e.g. missing vs. invalid values). For all subsequent operator applications in the recipe that use this result array, an unambiguous assignment of NaN meanings is therefore no longer possible. A semantic specification of the NaN definition for the arrays x and y is thus only meaningful and consistent if different NaN representations (something along the lines of the NA/NaN distinction in R) can be introduced.

I therefore regard the current implementation as a bug in any case, and I also believe that the problematic semantics of NaN cannot be solved this way - at least not in a simple way ;)

For the given case, I suggest modifying the code so that correct operator algebra is guaranteed. This is already the case for the algebraic bivariate operators (add, subtract, multiply, ...), as NaNs in x and y are propagated equally to the result. The groups of functions that would have to be changed are...

a) relational bivariate operators (less, greater, equal, ...)
b) boolean bivariate operators (and, or, xor) and
c) membership bivariate operators (in, not_in)

The modification would be a replacement by np.where(pd.notnull(x) & pd.notnull(y), <operator>, np.nan) as mentioned at the beginning (see the sketch below). This would mean that NaN values are consistently propagated, regardless of the operator and regardless of whether they occur in x or y, which in my opinion corresponds to the intuitive expectation (also from the user's point of view). I am of course open to hearing your views on this :)
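For illustration only, a sketch of what the proposed guard could look like for one of the relational bivariate operators (this is not the actual operators.py definition; plain numpy/pandas are assumed and the function name is hypothetical):

import numpy as np
import pandas as pd

# hypothetical "less" operator with the symmetric NaN guard applied
def less_(x, y):
    return np.where(pd.notnull(x) & pd.notnull(y), np.less(x, y), np.nan)

x = np.array([1.0, np.nan, 3.0])
y = np.array([2.0, 2.0, np.nan])

print(less_(x, y))  # [ 1. nan nan] -> NaNs propagated from either operand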
