-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spatial binning doesn't properly account for floating point precision #255
Comments
Hi @mherrmann3, thanks for flagging this issue. This is something that came up and the reason for writing the I think the first thing we should do is add in some test functions that recreate this issue and prevent builds from passing until this has been corrected. We might also want to consider the uncertainty in the event locations, because in that case there might be some average rate contribution from the adjacent cells. For these 'border' events, the error ellipse for the location would likely intersect both neighboring cells. I would imagine that this would translate into smoother models scoring higher, which might not reflect nature. |
Hi again @mherrmann3, I'm looking at the Decimal package to see if there are any solutions in here. It's a python module that allows for the precise representation of floating point arithmetic. |
I identified the root of the issue 🎉. No need to add tl;dr: just round - h = bins[1] - bins[0]
+ h = np.round(bins[1] - bins[0], 6) in this line. This change passes the "self-check loop" over all origins (see first comment) without errors. This change is sufficient because the numerical precision of
@wsavran: I also tried out the How I got there / Aha-momentI investigated the first origin in the self-check closer ( Independently determining the bin indices and corresponding locations with def bin_closest(p, bins):
ind = np.argmin(np.abs(bins - p))
print(f"ind: {ind} [{bins[ind]}]")
lon, lat = region.origins()[0]
bin_closest(lon, region.xs)
bin_closest(lat, region.ys) yields
which is the actual location of this origin (as expected). Logging
🤔 As a reminder, the core of
Specifically,
which
Aha! This means Apparently, the
|
I tested forecast models with a catalog whose events sometimes occur exactly on a cell boundary.
I found that
region.get_index_of
generally does not assign them the spatial bin according to what is expected frombut the opposite (i.e., the neighboring bin). Although this function considers a tolerance (
numpy.finfo(numpy.float64).eps
) to (supposedly) account for numerical precision, I found that it is not sufficient to get the desired binning behavior, even after the revision in commit 949d342c.I believe the corresponding unit tests don't catch these literal "corner cases", as those tests are too constructed/synthetic and not based on real data such as a loaded CSEP region. Specifically, they do not consider issues due to double precision (float64).
Demo & Reasoning
1. The problem
Let's do a quick self check:
outputs:
Ooops. Not even a bin's origin is considered to be inside its own bin (but the neighboring one)!
2. The correction
Only when adding a tiny amount to the Latitude and Longitude of the event coordinates gives the desired behavior:
outputs:
Now we can do this self check for all origins in the region:
Only if (at least)
tinyamount = 1e-12
is added, the loop runs without Error.3. The complexity
But this numerical issue gets slightly more complicated than that:
a) Influence of the region
The region influences (i) the
tinyamount
that is just sufficient to bin correctly, and (ii) which coordinate component(s) must be corrected. For instance for Italy, this is sufficient:I don't know why, but it's certainly related to numerical representations of values.
b) Influence of the floating-point format/precision
The just-sufficient
tinyamount
also depends on the floating-point format (i.e., precision). For instance, assuming the events have single precision (float32), this is necessary:Summary
I tried to find out the just-sufficient threshold also as function of the corresponding
.eps
and.resolution
property ofnp.finfo
for the used precision (e.g.,np.finfo(np.float64).eps
):tinyamount
.resolution *
.eps *
Apparently,
tinyamount
not only depends on the region, but it also doesn't scale with the precision.Suggested solution(s)
Some options (with increasing personal preference):
tinyamount
to event coordinates;bin1d_vec
'stol
argument, which is not used by theget_index_of
wrapper;bin1d_vec
;bin1d_vec
by only adding atinyamount
.Option 1 is only a temporary (quick-)fix.
For Option 2, note that
tol = tinyamount
is not sufficient due to the follow-up 'absolute tolerance' section. It needs to be slightly higher, e.g.,tol = tinyamount * 1.1
in our cases.For Option 3, we could do:
Compared to
tol
,tol_fact
is relative; I tried to find out the necessary thresholds again:tol_fact
.resolution *
.eps *
It also doesn't scale with the precision of the floating-point format, so I don't see the benefit of this option.
For Option 4, we could revert the whole current tolerance section that was introduced in commit 949d342c to a prior state and extend it (keeping the
tol
argument):I added a safety factor of 2-3 compared to Table 1 just to be more on the safe side.
Btw: We could even always add
numpy.finfo(numpy.float32).resolution * 10 = 1e-5
independent of the floating-point format, which is on the geographical order of ~1 meter. So instead of the above change, this one:This would be the same as (and replace!) setting/fixing
tol=0.00001
when binning the magnitude across various places in pyCSEP code.The implication
If a catalog contains such "corner case" events, fixing this issue will change the test statistic – but only slightly.
Nevertheless, it would lead to irreproducible previous results (e.g., in my case for IGPE, ~at the 3 to 5th decimal digit).
Exactly such slight difference (compared to an independent binning implementation) was the reason for spotting this issue.
A personal note
Considering that event locations are sometimes very rough, it might have been a better idea to define grid cells so that their centers lie on rough coordinates (e.g., 13.5°), rather than their boundaries (13.45° / 13.55°); i.e., shift the grid by half a bin size in each direction.
P.S.
@khawajasim: Could
QuadtreeGrid2D._find_location
be affected by the same issue?The text was updated successfully, but these errors were encountered: