Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong results from sum(skipmissing(raster)) #812

Open
tiemvanderdeure opened this issue Nov 4, 2024 · 12 comments
Open

Wrong results from sum(skipmissing(raster)) #812

tiemvanderdeure opened this issue Nov 4, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@tiemvanderdeure
Copy link
Contributor

I was getting some mysterious results from summing over a raster of gridded population counts from WorldPop (which can be downloaded here)

This is a big raster, with Float32 data and a Float32 missingval. The output of sum(skipmissing(x)) is not identical to the sum after aggregating, or the sum of the sum of the slices, or the sum of collect(skipmissing(x))

In the worldpop data, the result is off by like 5 percent, but when I try to reproduce I can only get it off by like 0.1%.
MWE:

n = 1000000
mval = -Inf32
ras = Raster(fill(mval, X(1:10000), Y(1:10000)); missingval = mval)
vals = exp.(rand(Float32, n) .* Float32(10))
ras[1:length(vals)] .= vals
sum(skipmissing(ras)) ≈ sum(vals) # false!
sum(skipmissing(replace_missing(ras))) ≈
sum(sum.(skipmissing.(eachslice(ras; dims = X)))) ≈
sum(vals) # true

What is going on here?

@tiemvanderdeure tiemvanderdeure added the bug Something isn't working label Nov 4, 2024
@lazarusA
Copy link
Collaborator

lazarusA commented Nov 4, 2024

just to add to this a little bit more, I also noticed that doing (maybe a totally different issue, but I mention it because is also a sum 😄 ).

sum(resample(ras,...., method="sum")) and sum(ras) will give slightly different results.

@rafaqz
Copy link
Owner

rafaqz commented Nov 4, 2024

Think of the sheer number of floating point operations happening in summing your 10_000 * 10_000 array. Theres a lot of room for error to creep in

There is a correctness optimisation to reduce the error in that process:
https://en.wikipedia.org/wiki/Kahan_summation_algorithm

Maybe we don't hit that path with the Rasters inbuilt SkipMissingVal iterator. Or something like that.

There could be specialised methods of sum that we need to add? If someone could track down what the difference is, we can add them.

@rafaqz
Copy link
Owner

rafaqz commented Nov 4, 2024

But seems we don't actually use kahan summation, but pairwise. And not on a regular iteration:

https://discourse.julialang.org/t/kahan-summation-in-sum/102723

@tiemvanderdeure
Copy link
Contributor Author

Yes it only affects really big rasters, an error of more than 5% seemed crazy, but I also haven't been able to reproduce it.

@asinghvi17
Copy link
Collaborator

asinghvi17 commented Nov 4, 2024

Could you try reproducing this on the original data using AccurateArithmetic.jl? Or convert the values to Float64 before summing?

@rafaqz
Copy link
Owner

rafaqz commented Nov 4, 2024

For me it was a fraction of a percent, but still big. But its a real problem, thus all the algorithms to fix it.

Try looking in Base for what sum does on SkipMissing as it seems to be better

@tiemvanderdeure
Copy link
Contributor Author

Base has all of this machinery implemented for mapreduce on SkipMissing, which breaks the iterator up into chunks (of size 1024 by default). There actually isn't a specialization for sum.
https://github.com/JuliaLang/julia/blob/50713ee4a82eb1b5613647cd74b027315f665080/base/missing.jl#L273-L362

With Rasters.SkipMissing we hit the fallback, which is Base.mapfoldl_impl, and just adds all the values in a while loop.

It's quite a lot of machinery to add and maintain in Rasters.jl, though.

To clarify: with the example above, it's 0.1% off or so, with the population data it was way more, maybe because a few pixels have very high values. Anyway it seems important that these things are accurate.

@tiemvanderdeure
Copy link
Contributor Author

Rasters.SkipMissing hits the same fallback as any generator, which there is already this issue about in base JuliaLang/julia#30421

@asinghvi17
Copy link
Collaborator

Hmm it's hard to do this with lazy iteration without essentially copying the Base implementation. One can always define Base.sum(sm::Rasters.SkipMissing) = Base.sum(collect(sm)), but that materializes sm which may not be desireable.

It also seems that sum on Base skipmissing has about the same performance as Rasters' internal skipmissingval, and generates the correct result as well. But you have to call replace_missing which is not ideal.

julia> @be sum($(Base.SkipMissing(replace_missing(ras))))
Benchmark: 3 samples with 1 evaluation
        47.097 ms (5 allocs: 128 bytes)
        47.157 ms (5 allocs: 128 bytes)
        47.242 ms (5 allocs: 128 bytes)

julia> @be sum($(skipmissing(replace_missing(ras))))
Benchmark: 3 samples with 1 evaluation
        46.943 ms (5 allocs: 128 bytes)
        47.107 ms (5 allocs: 128 bytes)
        47.122 ms (5 allocs: 128 bytes)

As far as I remember, in the cf branch, replace_missing will no longer be needed (@rafaqz?). If so it would probably be best to define Base.sum(sm::Rasters.SkipMissingVal) = Base.sum(Base.SkipMissing(sm.parent)), or maybe a broader dispatch on the mapreduce code.

What is the value of Rasters skipmissing (in a post-cf world) to your perspective @rafaqz @tiemvanderdeure?

@rafaqz
Copy link
Owner

rafaqz commented Nov 4, 2024

I imagine a lot of people still won't use missing after that PR, like if you want Bool or NaN missingval because they're much faster in modelling. Really anything is faster than using missing... SkipMissingval is very fast for a lot of things (Your benchmark is timing the same thing twice there so we can't see that).

We cant collect, we just need to see what SkipMissing does in sum and add the method.

(We already fixed this in DiskArrays too, doing nested block sum. So it only applies to in memory data)

@tiemvanderdeure
Copy link
Contributor Author

julia> @be sum($(skipmissing(replace_missing(ras))))

This method falls back to the base implementation. If a raster has missing as its missingval, then calling skipmissing on it returns a Base.SkipMissing, otherwise a Rasters.SkipMissing.

We cant collect, we just need to see what SkipMissing does in sum and add the method.

SkipMissing has a method for mapreduce, which is called by sum - it's in the lines I linked earlier. We could totally copy-paste that code and replace all the ismissing by _missing Of course it's not ideal, but I don't see any way to get around duplicating code.

@rafaqz
Copy link
Owner

rafaqz commented Nov 4, 2024

I'm happy to duplicate.

But we would also need to do chunked DiskArrays handling inside that (will also be much faster as well as more correct on disk data).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants