Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spatial binning doesn't properly account for floating point precision #255

Open
Tracked by #269
mherrmann3 opened this issue May 31, 2024 · 3 comments
Open
Tracked by #269
Assignees

Comments

@mherrmann3
Copy link
Contributor

mherrmann3 commented May 31, 2024

I tested forecast models with a catalog whose events sometimes occur exactly on a cell boundary.
I found that region.get_index_of generally does not assign them the spatial bin according to what is expected from

"inclusive on the lower bound and exclusive on the upper bound." [utils.calc.bin1d_vec docstring]

but the opposite (i.e., the neighboring bin). Although this function considers a tolerance (numpy.finfo(numpy.float64).eps) to (supposedly) account for numerical precision, I found that it is not sufficient to get the desired binning behavior, even after the revision in commit 949d342c.

I believe the corresponding unit tests don't catch these literal "corner cases", as those tests are too constructed/synthetic and not based on real data such as a loaded CSEP region. Specifically, they do not consider issues due to double precision (float64).

Demo & Reasoning

1. The problem

Let's do a quick self check:

region = csep.core.regions.california_relm_region()

idx = 3
origin = region.origins()[idx]
print(idx, origin)

idx2 = region.get_index_of(*origin)
print(idx2, region.origins()[idx2])  # not the same as `idx` and `origin`

outputs:

3 [-125.4   40.4]
2 [-125.4   40.3]

Ooops. Not even a bin's origin is considered to be inside its own bin (but the neighboring one)!

Btw: the above output may represent the coordinates as more precise than they numerically are:

print(idx, [x for x in origin])

gives:

3 [-125.39999999999999, 40.400000000000006]

which highlights that our unit tests are currently too optimistic/simplistic.

2. The correction

Only when adding a tiny amount to the Latitude and Longitude of the event coordinates gives the desired behavior:

tinyamount = 1e-12

idx3 = region.get_index_of(
    origin[0] + tinyamount,
    origin[1] + tinyamount
)
print(idx3, region.origins()[idx3])  # same as `idx` and `origin`

outputs:

3 [-125.4   40.4]

Now we can do this self check for all origins in the region:

# tinyamount = 0  # uncomment for raising an Error
for i, origin in enumerate(region.origins()):

    assert i == region.get_index_of(
        origin[0] + tinyamount,
        origin[1] + tinyamount
    )

Only if (at least) tinyamount = 1e-12 is added, the loop runs without Error.

Side note: Actually, if tinyamount = 0, it wouldn't raise an AssertionError, but this ValueError in get_index_of for the first bin, because get_index_of tries to assign a bin that lies East of it (which doesn't exist and is outside the region). You can get an AssertionError like so:

tinyamount = 0
lons = set()
for i, origin in enumerate(region.origins()):

    if origin[0] not in lons:
        lons.add(origin[0])
        continue
    
    assert i == region.get_index_of(
        origin[0] + tinyamount,
        origin[1] + tinyamount
    )

3. The complexity

But this numerical issue gets slightly more complicated than that:

a) Influence of the region

The region influences (i) the tinyamount that is just sufficient to bin correctly, and (ii) which coordinate component(s) must be corrected. For instance for Italy, this is sufficient:

region = csep.core.regions.italy_csep_region()
tinyamount = 2e-13  # Note #1: smaller than for California region
for i, origin in enumerate(region.origins()):

    assert i == region.get_index_of(
        origin[0],  # Note #2: adding `tinyamount` to the Longitude NOT needed (!)
        origin[1] + tinyamount
    )

I don't know why, but it's certainly related to numerical representations of values.

b) Influence of the floating-point format/precision

The just-sufficient tinyamount also depends on the floating-point format (i.e., precision). For instance, assuming the events have single precision (float32), this is necessary:

tinyamount = 1.6e-6  # for Italy region
tinyamount = 3.1e-6  # for California region
for i, origin in enumerate(region.origins()):

    assert i == region.get_index_of(
        origin[0].astype(np.float32) + tinyamount,  # (adding `tinyamount` not necessary for Italy region)
        origin[1].astype(np.float32) + tinyamount
    )

Summary

I tried to find out the just-sufficient threshold also as function of the corresponding .eps and .resolution property of np.finfo for the used precision (e.g., np.finfo(np.float64).eps):

Region Precision tinyamount .resolution * .eps *
California float64 1.0e-12 1000.0 ~4400.0
Italy float64 1.4e-13 140.0 ~630.0
California float32 3.1e-6 3.1 ~26.0
Italy float32 1.6e-6 1.6 ~13.0

Apparently, tinyamount not only depends on the region, but it also doesn't scale with the precision.

Suggested solution(s)

Some options (with increasing personal preference):

  1. add the required tinyamount to event coordinates;
  2. use bin1d_vec's tol argument, which is not used by the get_index_of wrapper;
  3. revise the tolerance section of bin1d_vec;
  4. simplify the tolerance section of bin1d_vec by only adding a tinyamount.

Option 1 is only a temporary (quick-)fix.

For Option 2, note that tol = tinyamount is not sufficient due to the follow-up 'absolute tolerance' section. It needs to be slightly higher, e.g., tol = tinyamount * 1.1 in our cases.

For Option 3, we could do:

-    a0_tol = numpy.abs(a0) * numpy.finfo(numpy.float64).eps
-    h_tol = numpy.abs(h) * numpy.finfo(numpy.float64).eps
-    p_tol = numpy.abs(p) * numpy.finfo(numpy.float64).eps
+#    tol_fact = numpy.finfo(numpy.float64).eps  # (current)
+    tol_fact = # NEW THRESHOLD NEEDED
+    a0_tol = numpy.abs(a0) * tol_fact
+    h_tol = numpy.abs(h) * tol_fact
+    p_tol = numpy.abs(p) * tol_fact

Compared to tol, tol_fact is relative; I tried to find out the necessary thresholds again:

Region Precision tol_fact .resolution * .eps *
California float64 4.14e-15 4.14 18.7
Italy float64 1.74e-15 1.74 7.9
California float32 2.38e-8 0.0238 0.2
Italy float32 2.35e-8 0.0235 0.2

It also doesn't scale with the precision of the floating-point format, so I don't see the benefit of this option.

For Option 4, we could revert the whole current tolerance section that was introduced in commit 949d342c to a prior state and extend it (keeping the tol argument):

-    a0_tol = numpy.abs(a0) * numpy.finfo(numpy.float64).eps
-    h_tol = numpy.abs(h) * numpy.finfo(numpy.float64).eps
-    p_tol = numpy.abs(p) * numpy.finfo(numpy.float64).eps
-
-
-    # absolute tolerance
-    if tol is None:
-        idx = numpy.floor((p + (p_tol + a0_tol) - a0) / (h - h_tol))
-    else:
-        idx = numpy.floor((p + (tol + a0_tol) - a0) / (h - h_tol))
+    tol = tol or 0
+
+    dtype = np.asarray(p).dtype
+    if dtype.name in ('float64', 'float32'):
+        float_tol = numpy.finfo(dtype).resolution * {
+            'float64': 2000,
+            'float32': 10,
+        }[dtype.name]
+        tol = max(tol, float_tol)
+
+    idx = numpy.floor((p + tol - a0) / h)

I added a safety factor of 2-3 compared to Table 1 just to be more on the safe side.

Btw: We could even always add numpy.finfo(numpy.float32).resolution * 10 = 1e-5 independent of the floating-point format, which is on the geographical order of ~1 meter. So instead of the above change, this one:

+    float_min_tol = 1e-5  # conservative (minimum) tolerance
+    tol = max(tol or 0, float_min_tol)
+    idx = numpy.floor((p + tol - a0) / h)

This would be the same as (and replace!) setting/fixing tol=0.00001 when binning the magnitude across various places in pyCSEP code.

The implication

If a catalog contains such "corner case" events, fixing this issue will change the test statistic – but only slightly.
Nevertheless, it would lead to irreproducible previous results (e.g., in my case for IGPE, ~at the 3 to 5th decimal digit).
Exactly such slight difference (compared to an independent binning implementation) was the reason for spotting this issue.



A personal note

Considering that event locations are sometimes very rough, it might have been a better idea to define grid cells so that their centers lie on rough coordinates (e.g., 13.5°), rather than their boundaries (13.45° / 13.55°); i.e., shift the grid by half a bin size in each direction.


P.S.

@khawajasim: Could QuadtreeGrid2D._find_location be affected by the same issue?

@pabloitu pabloitu self-assigned this Aug 23, 2024
@pabloitu pabloitu mentioned this issue Nov 25, 2024
22 tasks
@wsavran
Copy link
Collaborator

wsavran commented Dec 2, 2024

Hi @mherrmann3, thanks for flagging this issue. This is something that came up and the reason for writing the bin1d_vec function. Sadly, the tests that I wrote surrounding this function were not sufficient to spot the condition that you've identified in this comment.

I think the first thing we should do is add in some test functions that recreate this issue and prevent builds from passing until this has been corrected.

We might also want to consider the uncertainty in the event locations, because in that case there might be some average rate contribution from the adjacent cells. For these 'border' events, the error ellipse for the location would likely intersect both neighboring cells. I would imagine that this would translate into smoother models scoring higher, which might not reflect nature.

@wsavran
Copy link
Collaborator

wsavran commented Dec 3, 2024

Hi again @mherrmann3, I'm looking at the Decimal package to see if there are any solutions in here. It's a python module that allows for the precise representation of floating point arithmetic.

@mherrmann3
Copy link
Contributor Author

I identified the root of the issue 🎉. No need to add tinyamount.

tl;dr: just round h to the intended bin size with

-        h = bins[1] - bins[0]
+        h = np.round(bins[1] - bins[0], 6)

in this line. This change passes the "self-check loop" over all origins (see first comment) without errors.

This change is sufficient because the numerical precision of h was the culprit: it is imprecise beyond floating point resolution from the intended 0.1, which originates from the floating-point operation in this very line – the simple subtraction of two floats – even though the bins, i.e., region.xs/.ys, are indeed "perfectly precise" (within the resolution of floats) thanks to creating them with csep.utils.calc.cleaner_range. So perhaps some sort of Catastrophic cancellation.
Eventually, h_tol doesn't sufficiently account for this imprecision.

  • I will create a PR with a more elaborate fix (not hardcoding 6) + further considerations/improvements/optimizations (currently working on it).

@wsavran: I also tried out the decimal package, but the performance is abysmal (at least 50x slower due to mapping via strings, which is the way the package works to circumvent numerical precision).

How I got there / Aha-moment

I investigated the first origin in the self-check closer (lon: -125.39999999999999, lat: 40.1), which – as mentioned in the original comment – raises "this ValueError in get_index_of".

Independently determining the bin indices and corresponding locations with

def bin_closest(p, bins):
    ind = np.argmin(np.abs(bins - p))
    print(f"ind: {ind} [{bins[ind]}]")

lon, lat = region.origins()[0]
bin_closest(lon, region.xs)
bin_closest(lat, region.ys)

yields

ind: 0 [-125.4]
ind: 86 [40.1]

which is the actual location of this origin (as expected).

Logging idx and idy inside get_index_of as calculated with the current bin1d_vec gives:

idx: 0 [-125.4]
idy: 85 [40.0]

🤔

As a reminder, the core of bin1d_vec is:

a0 = numpy.min(bins)
h = bins[1] - bins[0]
idx = numpy.floor((p + (p_tol + a0_tol) - a0) / (h - h_tol))

Specifically, (p + (p_tol + a0_tol) - a0) / (h - h_tol) yields for the lat coordinate:

85.99999999999896

which numpy.floor rounds down to 85. But why, considering all tolerances? Let's log all terms:

                        p: 40.1
                       a0: 31.5
                   p - a0: 8.600000000000001
p + (p_tol + a0_tol) - a0: 8.600000000000016

                        h: 0.10000000000000142
                    h_tol: 0.00000000000000002
      (float64 resolution: 0.00000000000000100 - just for comparison)
                h - h_tol: 0.1000000000000014

Aha! This means h - h_tol is still too large! (numerically, it should be exactly or just "below" the actual binning of 0.1 to get a "p/h ratio" that is numerically >=86.0 (so that when rounded down yields the desired 86.0).

Apparently, the h_tolerance consideration doesn't account for numerical precision/resolution issues, especially when they derive from an additional preceding floating-point operation. The deviation of h from 0.1 is even greater than the float64 resolution (logged above for comparison), which is yet 50x greater than h_tol.

Btw: tinyamount just sufficiently nudged p to avoid these numerical effects of a tiny bit smaller "p/h ratio".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants