Description
I tested forecast models with a catalog whose events sometimes occur exactly on a cell boundary.
I found that `region.get_index_of` generally does not assign them to the spatial bin expected from "inclusive on the lower bound and exclusive on the upper bound" (see the `utils.calc.bin1d_vec` docstring), but to the opposite, i.e., the neighboring bin. Although this function applies a tolerance (`numpy.finfo(numpy.float64).eps`) that is supposed to account for numerical precision, it is not sufficient to obtain the desired binning behavior, even after the revision in commit 949d342c.
I believe the corresponding unit tests don't catch these literal "corner cases" because they are too constructed/synthetic and not based on real data such as a loaded CSEP region. In particular, they do not exercise issues caused by double precision (float64).
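A regression test along the following lines, based on real regions rather than synthetic bins, would catch it; with the current code it fails, as the demo below shows. This is only a sketch using the public region API demonstrated in this issue:

```python
import csep

def test_origins_bin_into_own_cell():
    """Every cell origin must be assigned to its own cell index."""
    for make_region in (csep.core.regions.california_relm_region,
                        csep.core.regions.italy_csep_region):
        region = make_region()
        for i, (lon, lat) in enumerate(region.origins()):
            assert i == region.get_index_of(lon, lat)
```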
Demo & Reasoning
1. The problem
Let's do a quick self check:
```python
import csep

region = csep.core.regions.california_relm_region()

idx = 3
origin = region.origins()[idx]
print(idx, origin)

idx2 = region.get_index_of(*origin)
print(idx2, region.origins()[idx2])  # not the same as `idx` and `origin`
```
outputs:
```
3 [-125.4 40.4]
2 [-125.4 40.3]
```
Oops. Not even a bin's own origin is considered to be inside that bin (it lands in the neighboring one)!
Btw: the above output makes the coordinates look "cleaner" than they numerically are:
```python
print(idx, [x for x in origin])
```
gives:
```
3 [-125.39999999999999, 40.400000000000006]
```
which highlights that our unit tests are currently too optimistic/simplistic.
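The underlying mechanism can be reproduced without any region at all. Here is a synthetic sketch (the grid origin and bin width are made up, not taken from pyCSEP) showing that a one-ULP perturbation of a point sitting exactly on a bin edge is enough to flip `numpy.floor((p - a0) / h)` into the neighboring bin:

```python
import numpy as np

a0, h = 0.0, 0.1                     # assumed grid origin and bin width
edge = 3 * h                         # the lower edge of bin 3, as produced by arithmetic
p = np.nextafter(edge, -np.inf)      # the "same" edge, one ULP lower

print(np.floor((edge - a0) / h))     # 3.0 -> correct bin
print(np.floor((p - a0) / h))        # 2.0 -> neighboring bin
```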
2. The correction
Only adding a tiny amount to the latitude and longitude of the event coordinates gives the desired behavior:
```python
tinyamount = 1e-12

idx3 = region.get_index_of(
    origin[0] + tinyamount,
    origin[1] + tinyamount
)
print(idx3, region.origins()[idx3])  # same as `idx` and `origin`
```
outputs:
```
3 [-125.4 40.4]
```
Now we can do this self check for all origins in the region:
```python
# tinyamount = 0  # uncomment to raise an error
for i, origin in enumerate(region.origins()):
    assert i == region.get_index_of(
        origin[0] + tinyamount,
        origin[1] + tinyamount
    )
```
The loop only runs through without an error if (at least) `tinyamount = 1e-12` is added.
Side note: actually, with `tinyamount = 0`, it wouldn't raise an `AssertionError` but the `ValueError` in `get_index_of`, already for the first bin, because `get_index_of` tries to assign a bin that lies East of it (which doesn't exist and is outside the region). You can get an `AssertionError` like so:
```python
tinyamount = 0
lons = set()
for i, origin in enumerate(region.origins()):
    if origin[0] not in lons:
        # skip the first origin for each new longitude value
        lons.add(origin[0])
        continue
    assert i == region.get_index_of(
        origin[0] + tinyamount,
        origin[1] + tinyamount
    )
```
3. The complexity
But this numerical issue gets slightly more complicated than that:
a) Influence of the region
The region influences (i) the `tinyamount` that is just sufficient to bin correctly, and (ii) which coordinate component(s) must be corrected. For instance, for Italy this is sufficient:
```python
region = csep.core.regions.italy_csep_region()

tinyamount = 2e-13  # Note #1: smaller than for the California region
for i, origin in enumerate(region.origins()):
    assert i == region.get_index_of(
        origin[0],  # Note #2: adding `tinyamount` to the longitude is NOT needed (!)
        origin[1] + tinyamount
    )
```
I don't know why, but it's certainly related to numerical representations of values.
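A plausible contributing factor (only a sketch of an explanation, not a verified one): the spacing between adjacent float64 values (one ULP) grows with the magnitude of the coordinate, and California longitudes (magnitude ~125°) live on a coarser float64 grid than Italian ones (roughly an order of magnitude smaller). The ratio of these spacings is of the same order as the ratio of the two just-sufficient `tinyamount` values:

```python
import numpy as np

# ULP (distance to the next representable double) at typical coordinate magnitudes
print(np.spacing(np.float64(125.4)))   # ~1.4e-14 (California longitudes)
print(np.spacing(np.float64(15.0)))    # ~1.8e-15 (Italy longitudes)
```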
b) Influence of the floating-point format/precision
The just-sufficient `tinyamount` also depends on the floating-point format (i.e., the precision). For instance, assuming the events have single precision (float32), this is necessary:
```python
import numpy as np

tinyamount = 1.6e-6  # for the Italy region
tinyamount = 3.1e-6  # for the California region
for i, origin in enumerate(region.origins()):
    assert i == region.get_index_of(
        origin[0].astype(np.float32) + tinyamount,  # (adding `tinyamount` not necessary for the Italy region)
        origin[1].astype(np.float32) + tinyamount
    )
```
Summary
I also tried to determine the just-sufficient threshold as a function of the `.eps` and `.resolution` properties of `np.finfo` for the precision used (e.g., `numpy.finfo(numpy.float64).eps`):
| Region     | Precision | tinyamount | tinyamount / `.resolution` | tinyamount / `.eps` |
|------------|-----------|------------|----------------------------|---------------------|
| California | float64   | 1.0e-12    | 1000.0                     | ~4400.0             |
| Italy      | float64   | 1.4e-13    | 140.0                      | ~630.0              |
| California | float32   | 3.1e-6     | 3.1                        | ~26.0               |
| Italy      | float32   | 1.6e-6     | 1.6                        | ~13.0               |
Apparently, `tinyamount` not only depends on the region, it also doesn't scale with the precision.
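For reference, the table values could be reproduced with a brute-force scan along these lines (a sketch; the candidate grid and the `ValueError` handling for out-of-region bins are my additions):

```python
import numpy as np

def smallest_sufficient_tinyamount(region, candidates=np.logspace(-16, -5, 400)):
    """Return the smallest candidate offset for which every origin bins into its own cell."""
    for tinyamount in candidates:
        try:
            if all(i == region.get_index_of(o[0] + tinyamount, o[1] + tinyamount)
                   for i, o in enumerate(region.origins())):
                return tinyamount
        except ValueError:  # get_index_of may place a point outside the region
            continue
    return None
```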
Suggested solution(s)
Some options (in order of increasing personal preference):

1. add the required `tinyamount` to the event coordinates;
2. use `bin1d_vec`'s `tol` argument, which is not used by the `get_index_of` wrapper;
3. revise the tolerance section of `bin1d_vec`;
4. simplify the tolerance section of `bin1d_vec` by only adding a `tinyamount`.
Option 1 is only a temporary (quick-)fix.
For Option 2, note that `tol = tinyamount` is not sufficient due to the follow-up "absolute tolerance" section; it needs to be slightly higher, e.g., `tol = tinyamount * 1.1` in our cases.
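For illustration, a minimal 1D usage sketch of Option 2, calling `bin1d_vec` directly with an explicit tolerance. The bin edges below are made up (in the wrapper they would come from the region object), and I'm assuming the positional signature `bin1d_vec(p, bins, tol=...)`:

```python
import numpy
from csep.utils.calc import bin1d_vec

lon_edges = numpy.arange(-125.4, -113.0, 0.1)   # hypothetical bin edges
tinyamount = 1e-12
# per the note above, tol must be slightly larger than tinyamount
idx = bin1d_vec(numpy.array([-125.4, -120.0]), lon_edges, tol=tinyamount * 1.1)
```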
For Option 3, we could do:
```diff
- a0_tol = numpy.abs(a0) * numpy.finfo(numpy.float64).eps
- h_tol = numpy.abs(h) * numpy.finfo(numpy.float64).eps
- p_tol = numpy.abs(p) * numpy.finfo(numpy.float64).eps
+# tol_fact = numpy.finfo(numpy.float64).eps  # (current)
+ tol_fact = ...  # NEW THRESHOLD NEEDED
+ a0_tol = numpy.abs(a0) * tol_fact
+ h_tol = numpy.abs(h) * tol_fact
+ p_tol = numpy.abs(p) * tol_fact
```
Compared to `tol`, `tol_fact` is relative; I tried to determine the necessary thresholds again:
| Region     | Precision | tol_fact | tol_fact / `.resolution` | tol_fact / `.eps` |
|------------|-----------|----------|--------------------------|-------------------|
| California | float64   | 4.14e-15 | 4.14                     | 18.7              |
| Italy      | float64   | 1.74e-15 | 1.74                     | 7.9               |
| California | float32   | 2.38e-8  | 0.0238                   | 0.2               |
| Italy      | float32   | 2.35e-8  | 0.0235                   | 0.2               |
It also doesn't scale with the precision of the floating-point format, so I don't see the benefit of this option.
For Option 4, we could revert the whole current tolerance section that was introduced in commit 949d342c to a prior state and extend it (keeping the `tol` argument):
```diff
- a0_tol = numpy.abs(a0) * numpy.finfo(numpy.float64).eps
- h_tol = numpy.abs(h) * numpy.finfo(numpy.float64).eps
- p_tol = numpy.abs(p) * numpy.finfo(numpy.float64).eps
-
-
- # absolute tolerance
- if tol is None:
-     idx = numpy.floor((p + (p_tol + a0_tol) - a0) / (h - h_tol))
- else:
-     idx = numpy.floor((p + (tol + a0_tol) - a0) / (h - h_tol))
+ tol = tol or 0
+
+ dtype = numpy.asarray(p).dtype
+ if dtype.name in ('float64', 'float32'):
+     float_tol = numpy.finfo(dtype).resolution * {
+         'float64': 2000,
+         'float32': 10,
+     }[dtype.name]
+     tol = max(tol, float_tol)
+
+ idx = numpy.floor((p + tol - a0) / h)
```
I added a safety factor of 2-3 compared to the first table above, just to be more on the safe side.
Btw: we could even always add `numpy.finfo(numpy.float32).resolution * 10 = 1e-5`, independent of the floating-point format, which is on the geographical order of ~1 meter. So instead of the above change, this one:
```diff
+ float_min_tol = 1e-5  # conservative (minimum) tolerance
+ tol = max(tol or 0, float_min_tol)
+ idx = numpy.floor((p + tol - a0) / h)
```
This would be the same as (and would replace!) the `tol=0.00001` that is set/hard-coded when binning the magnitude in various places of the pyCSEP code.
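A rough back-of-the-envelope check of the "~1 meter" statement, assuming ~111 km per degree of latitude:

```python
print(1e-5 * 111.32e3)   # ~1.1 m per 1e-5 degrees of latitude
```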
The implication
If a catalog contains such "corner case" events, fixing this issue will change the test statistic, but only slightly.
Nevertheless, it would make previously obtained results not exactly reproducible (e.g., in my case for IGPE, at roughly the 3rd to 5th decimal digit).
Exactly such a slight difference (compared to an independent binning implementation) was how I spotted this issue in the first place.
A personal note
Considering that event locations are sometimes very rough, it might have been a better idea to define grid cells such that their centers lie on round coordinates (e.g., 13.5°) rather than their boundaries (13.45° / 13.55°); i.e., shift the grid by half a bin size in each direction.
P.S.
@khawajasim: Could `QuadtreeGrid2D._find_location` be affected by the same issue?