Wrong results from `sum(skipmissing(raster))` #812

tiemvanderdeure · 2024-11-04T10:53:57Z

I was getting some mysterious results from summing over a raster of gridded population counts from WorldPop (which can be downloaded here)

This is a big raster, with Float32 data and a Float32 missingval. The output of sum(skipmissing(x)) is not identical to the sum after aggregating, or the sum of the sum of the slices, or the sum of collect(skipmissing(x))

In the worldpop data, the result is off by like 5 percent, but when I try to reproduce I can only get it off by like 0.1%.
MWE:

n = 1000000
mval = -Inf32
ras = Raster(fill(mval, X(1:10000), Y(1:10000)); missingval = mval)
vals = exp.(rand(Float32, n) .* Float32(10))
ras[1:length(vals)] .= vals
sum(skipmissing(ras)) ≈ sum(vals) # false!
sum(skipmissing(replace_missing(ras))) ≈
sum(sum.(skipmissing.(eachslice(ras; dims = X)))) ≈
sum(vals) # true

What is going on here?

The text was updated successfully, but these errors were encountered:

lazarusA · 2024-11-04T13:13:21Z

just to add to this a little bit more, I also noticed that doing (maybe a totally different issue, but I mention it because is also a sum 😄 ).

sum(resample(ras,...., method="sum")) and sum(ras) will give slightly different results.

rafaqz · 2024-11-04T14:39:01Z

Think of the sheer number of floating point operations happening in summing your 10_000 * 10_000 array. Theres a lot of room for error to creep in

There is a correctness optimisation to reduce the error in that process:
https://en.wikipedia.org/wiki/Kahan_summation_algorithm

Maybe we don't hit that path with the Rasters inbuilt SkipMissingVal iterator. Or something like that.

There could be specialised methods of sum that we need to add? If someone could track down what the difference is, we can add them.

rafaqz · 2024-11-04T14:45:06Z

But seems we don't actually use kahan summation, but pairwise. And not on a regular iteration:

https://discourse.julialang.org/t/kahan-summation-in-sum/102723

tiemvanderdeure · 2024-11-04T15:32:59Z

Yes it only affects really big rasters, an error of more than 5% seemed crazy, but I also haven't been able to reproduce it.

asinghvi17 · 2024-11-04T15:37:11Z

Could you try reproducing this on the original data using AccurateArithmetic.jl? Or convert the values to Float64 before summing?

rafaqz · 2024-11-04T15:37:53Z

For me it was a fraction of a percent, but still big. But its a real problem, thus all the algorithms to fix it.

Try looking in Base for what sum does on SkipMissing as it seems to be better

tiemvanderdeure · 2024-11-04T16:22:14Z

Base has all of this machinery implemented for mapreduce on SkipMissing, which breaks the iterator up into chunks (of size 1024 by default). There actually isn't a specialization for sum.
https://github.com/JuliaLang/julia/blob/50713ee4a82eb1b5613647cd74b027315f665080/base/missing.jl#L273-L362

With Rasters.SkipMissing we hit the fallback, which is Base.mapfoldl_impl, and just adds all the values in a while loop.

It's quite a lot of machinery to add and maintain in Rasters.jl, though.

To clarify: with the example above, it's 0.1% off or so, with the population data it was way more, maybe because a few pixels have very high values. Anyway it seems important that these things are accurate.

tiemvanderdeure · 2024-11-04T16:33:33Z

Rasters.SkipMissing hits the same fallback as any generator, which there is already this issue about in base JuliaLang/julia#30421

asinghvi17 · 2024-11-04T19:19:38Z

Hmm it's hard to do this with lazy iteration without essentially copying the Base implementation. One can always define Base.sum(sm::Rasters.SkipMissing) = Base.sum(collect(sm)), but that materializes sm which may not be desireable.

It also seems that sum on Base skipmissing has about the same performance as Rasters' internal skipmissingval, and generates the correct result as well. But you have to call replace_missing which is not ideal.

julia> @be sum($(Base.SkipMissing(replace_missing(ras))))
Benchmark: 3 samples with 1 evaluation
        47.097 ms (5 allocs: 128 bytes)
        47.157 ms (5 allocs: 128 bytes)
        47.242 ms (5 allocs: 128 bytes)

julia> @be sum($(skipmissing(replace_missing(ras))))
Benchmark: 3 samples with 1 evaluation
        46.943 ms (5 allocs: 128 bytes)
        47.107 ms (5 allocs: 128 bytes)
        47.122 ms (5 allocs: 128 bytes)

As far as I remember, in the cf branch, replace_missing will no longer be needed (@rafaqz?). If so it would probably be best to define Base.sum(sm::Rasters.SkipMissingVal) = Base.sum(Base.SkipMissing(sm.parent)), or maybe a broader dispatch on the mapreduce code.

What is the value of Rasters skipmissing (in a post-cf world) to your perspective @rafaqz @tiemvanderdeure?

rafaqz · 2024-11-04T20:28:40Z

I imagine a lot of people still won't use missing after that PR, like if you want Bool or NaN missingval because they're much faster in modelling. Really anything is faster than using missing... SkipMissingval is very fast for a lot of things (Your benchmark is timing the same thing twice there so we can't see that).

We cant collect, we just need to see what SkipMissing does in sum and add the method.

(We already fixed this in DiskArrays too, doing nested block sum. So it only applies to in memory data)

tiemvanderdeure · 2024-11-04T21:34:26Z

julia> @be sum($(skipmissing(replace_missing(ras))))

This method falls back to the base implementation. If a raster has missing as its missingval, then calling skipmissing on it returns a Base.SkipMissing, otherwise a Rasters.SkipMissing.

We cant collect, we just need to see what SkipMissing does in sum and add the method.

SkipMissing has a method for mapreduce, which is called by sum - it's in the lines I linked earlier. We could totally copy-paste that code and replace all the ismissing by _missing Of course it's not ideal, but I don't see any way to get around duplicating code.

rafaqz · 2024-11-04T22:09:03Z

I'm happy to duplicate.

But we would also need to do chunked DiskArrays handling inside that (will also be much faster as well as more correct on disk data).

tiemvanderdeure added the bug Something isn't working label Nov 4, 2024

tiemvanderdeure mentioned this issue Nov 5, 2024

implement mapreduce for SkipMissingVal #813

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong results from `sum(skipmissing(raster))` #812

Wrong results from `sum(skipmissing(raster))` #812

tiemvanderdeure commented Nov 4, 2024

lazarusA commented Nov 4, 2024 •

edited

Loading

rafaqz commented Nov 4, 2024

rafaqz commented Nov 4, 2024 •

edited

Loading

tiemvanderdeure commented Nov 4, 2024

asinghvi17 commented Nov 4, 2024 •

edited

Loading

rafaqz commented Nov 4, 2024 •

edited

Loading

tiemvanderdeure commented Nov 4, 2024

tiemvanderdeure commented Nov 4, 2024

asinghvi17 commented Nov 4, 2024

rafaqz commented Nov 4, 2024 •

edited

Loading

tiemvanderdeure commented Nov 4, 2024

rafaqz commented Nov 4, 2024 •

edited

Loading

Wrong results from sum(skipmissing(raster)) #812

Wrong results from sum(skipmissing(raster)) #812

Comments

tiemvanderdeure commented Nov 4, 2024

lazarusA commented Nov 4, 2024 • edited Loading

rafaqz commented Nov 4, 2024

rafaqz commented Nov 4, 2024 • edited Loading

tiemvanderdeure commented Nov 4, 2024

asinghvi17 commented Nov 4, 2024 • edited Loading

rafaqz commented Nov 4, 2024 • edited Loading

tiemvanderdeure commented Nov 4, 2024

tiemvanderdeure commented Nov 4, 2024

asinghvi17 commented Nov 4, 2024

rafaqz commented Nov 4, 2024 • edited Loading

tiemvanderdeure commented Nov 4, 2024

rafaqz commented Nov 4, 2024 • edited Loading

Wrong results from `sum(skipmissing(raster))` #812

Wrong results from `sum(skipmissing(raster))` #812

lazarusA commented Nov 4, 2024 •

edited

Loading

rafaqz commented Nov 4, 2024 •

edited

Loading

asinghvi17 commented Nov 4, 2024 •

edited

Loading

rafaqz commented Nov 4, 2024 •

edited

Loading

rafaqz commented Nov 4, 2024 •

edited

Loading

rafaqz commented Nov 4, 2024 •

edited

Loading