Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does sorting actually sort? #2

Open
raphlinus opened this issue Dec 22, 2023 · 3 comments
Open

Does sorting actually sort? #2

raphlinus opened this issue Dec 22, 2023 · 3 comments

Comments

@raphlinus
Copy link

Apologies for the drive-by issue, feel free to direct me to a more appropriate place to ask the question.

I'm evaluating parallel sorting algorithms that can run on GPU, and the Onesweep implementation in this repo is one of the most promising. However, on trying to understand the code I came across what I think is a serious flaw. In particular, the ranking section just uses atomicAdd() for all invocations in the workgroup with no additional synchronization. Consider for simplification the case where all keys have the same digit. The allowable results are all permutations such that the values delivered to a single invocation are ascending. Thus, what is supposed to be a stable multi-split appears not to actually be stable.

It is easy to believe that on particular hardware, failure of stability might be hard to observe, in particular I wouldn't be surprised at all if atomics from a single subgroup were resolved in order. Additionally, for this application an approximate sort may be tolerable.

Looking at the history, I wonder if commit 4aa3754 might have traded a correct but slower implementation for a faster but approximate one. Obviously, if we had subgroups (which I'm tracking closely), we could do proper WLMS, but I am interested in how well we can do with basic WGSL.

In any case, thanks for the implementation, it's an impressive and valuable resource even if we don't end up going that direction.

@Lichtso
Copy link
Owner

Lichtso commented Dec 25, 2023

Yes, the commit traded sorting stability for performance. The sorting order of the keys is still correct though (not just an approximation) and value order is irrelevant to the application in this repository here. I know that it is not really Onesweep, but without WLMS it isn't yet anyway. Still, my goal is to one day have an actual Onesweep implementation in WGSL.

That is why together with exrook I implemented the subgroup-ops transpilation for all front- and backends in wgpu-rs which subsequently lead to the current proposal for subgroup-ops in the spec. But you already know about that and also that it will take a long time to land and become available in Browsers. Though, in wgpu-rs we might be able to get it sooner behind a feature gate.

Independent of that issue, one major problem with Onesweep is that it builds on top of decoupled-lookback to accumulate the ranks globally. The scheduling, fairness and progress guarantees are very different between manufacturers with Apple in particular (both A and M series GPUs) that decoupled-lookback approaches do not reliably work on, as you have already discussed on your blog. It is so bad that the entire system crashes regularly and the device gets into a unrecoverable state, which can only be solved by a power cycle restart.

@raphlinus
Copy link
Author

raphlinus commented Dec 25, 2023

Thanks for the response! I'm not sure I understand what you mean by the order of the keys being correct. It sounds like you're asserting that the sort isn't stable, but what I'm saying is that if an individual radix pass is not stable, then it can undo the sort order of previous passes. Only the most significant digit will be guaranteed to be sorted. Even so, if it's "mostly stable" when running on real hardware, I can see that being a reasonable tradeoff for your application. Sadly, it is not sufficient for applications such as the sort phase in Li et al scanline path rendering.

I have been following WebGPU subgroups but did not know that you were one of the authors of the wgpu prototype implementation. Thanks for that.

The one-pass story on Apple is complicated. Until recently, I believed that the missing piece was device-scoped barriers. Inspired by reading your implementation, I realized that it is possible to do single pass scan without barriers. Performance is mediocre on Apple, for reasons I haven't fully analyzed, but I suspect have to do with variations in execution speed for the local aggregation.

As you point out, Apple (along with a handful of other GPUs) lack a forward progress guarantee. Thus, correct implementation of the single pass scan requires a scalar fallback so the spinlock on the status counter can't loop forever. If that happens often, performance suffers, though. I believe it is possible for Onesweep, basically on iteration each thread reads one value from input, compares it to local_id, and increments its histogram bin on match. I believe this would upgrade the situation from "needs power cycle" to "performance is disappointing," but the latter is still not a good place to be.

I've got an almost-done straightforward port of FidelityFx sort to WebGPU, just using shared memory instead of wave (subgroup) intrinsics. This doesn't play any games with the execution model, but just does multiple dispatches to compute the prefix sums of the histograms. Initial experiments show mediocre performance, probably much slower than what you've done but maybe enough for some applications. Partly that's because it has a radix of 4 bits rather than 8, partly it's because shared memory is slower than subgroups. I suspect that it'll be possible to improve it even within the constraints of WebGPU 1.0 and lack of fairness / progress guarantees, possibly by hybridizing with ideas in your implementation. And of course when we do get subgroups, that changes the performance game. If you want to follow that, I'll post updates in the Sorting revisited topic.

Again thanks for your work, it's been very useful even if we can't simply adopt your implementation. With luck, we'll have a good story for portable GPU sorting before long.

@raphlinus
Copy link
Author

See significant updates to the Sorting revisited thread, including a pointer to my adaptation of WLMS to vanilla WGSL (faking the waves with shared memory).

I think it would be very interesting to try various hybrids between the two codebases; I think somewhere in that space is a correct and reasonably performant implementation in vanilla WebGPU, and when subgroups land I think we can get reasonably close to CUB Onesweep.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants