Adding parallel implementations of (some?) quasisep algorithms #210

dfm · 2024-04-03T22:21:02Z

The quasisep solver is fast on CPU, but it performs very poorly on GPU (and probably TPU) because of its extensive use of lax.scan. It's possible to rewrite at least some of these operations using lax.associative_scan, which is (at least in principle) more accelerator friendly. This approach is similar in spirit to the algorithms derived in https://arxiv.org/abs/1905.13002
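To illustrate the idea, here is a minimal sketch of how a sequential linear recurrence can be recast as an associative scan. It is not the code in this PR: the function names, the recurrence `f_n = A_n f_{n-1} + b_n` with a zero initial state, and the test data are all assumptions chosen for illustration. The trick is that each step is an affine map, and composing affine maps is associative, so the prefix compositions can be computed in parallel.

```python
import jax
import jax.numpy as jnp
from jax import lax


def sequential_recurrence(A, b):
    """f_n = A_n @ f_{n-1} + b_n with f_{-1} = 0, computed with lax.scan."""

    def step(f, inputs):
        A_n, b_n = inputs
        f = A_n @ f + b_n
        return f, f

    init = jnp.zeros_like(b[0])
    _, fs = lax.scan(step, init, (A, b))
    return fs


def parallel_recurrence(A, b):
    """The same recurrence, computed with lax.associative_scan."""

    def combine(left, right):
        # Each element (A, b) stands for the affine map f -> A @ f + b.
        # Applying `left` first and then `right` gives
        #   f -> (A_r @ A_l) @ f + (A_r @ b_l + b_r).
        # The arguments carry a leading batch axis, hence the einsum.
        A_l, b_l = left
        A_r, b_r = right
        A_out = A_r @ A_l
        b_out = jnp.einsum("...ij,...j->...i", A_r, b_l) + b_r
        return A_out, b_out

    # With a zero initial state, the accumulated offset is f_n itself.
    _, fs = lax.associative_scan(combine, (A, b))
    return fs


# Hypothetical test data: N steps of a J-dimensional state.
N, J = 1000, 3
key_A, key_b = jax.random.split(jax.random.PRNGKey(0))
A = 0.9 * jnp.eye(J) + 0.01 * jax.random.normal(key_A, (N, J, J))
b = jax.random.normal(key_b, (N, J))

fs_seq = sequential_recurrence(A, b)
fs_par = parallel_recurrence(A, b)
```

The two results should agree up to floating point rounding. The associative version does more arithmetic overall (it multiplies J x J matrices rather than applying them to vectors), which is one plausible reason for it to be slower on CPU while scaling better on an accelerator.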

This PR is a WIP to add some of these operations. So far, I've just implemented a parallel matrix multiplication. There are still some precision issues to work out, but the initial performance looks good:

On CPU, the scan and associative_scan matmuls take 1.65 ms and 3.59 ms respectively, for a J = 3 lower triangular matrix with N = 50,000 data points. On the GPU, the same computations take 685 ms and 1.32 ms respectively. In other words, the scan version is several hundred times slower on GPU than on CPU (685 ms vs. 1.65 ms), whereas the associative_scan version is somewhat faster on GPU than on CPU (1.32 ms vs. 3.59 ms). These GPU results are not impressive in absolute terms, since the best GPU time is only slightly better than the best CPU time, but it might be worth investigating further in case someone wants to use this solver as part of a larger model that benefits from hardware acceleration.
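For anyone wanting to run a similar comparison, the following is a rough timing sketch, not the benchmark used for the numbers above. It uses a cumulative sum as a stand-in for the matmul, and `time_fn`, `cumsum_scan`, and `cumsum_associative` are hypothetical names. The two points it is meant to show are the warm-up call, which keeps compilation out of the timing, and `block_until_ready()`, which is needed because JAX dispatches work asynchronously.

```python
import time

import jax
import jax.numpy as jnp
from jax import lax


@jax.jit
def cumsum_scan(x):
    # Sequential cumulative sum: one step per element.
    def step(carry, x_n):
        carry = carry + x_n
        return carry, carry

    _, out = lax.scan(step, jnp.zeros((), x.dtype), x)
    return out


@jax.jit
def cumsum_associative(x):
    # Parallel prefix sum: addition is associative.
    return lax.associative_scan(jnp.add, x)


def time_fn(fn, x, repeats=10):
    """Average wall-clock seconds per call, excluding compilation."""
    fn(x).block_until_ready()  # warm-up call triggers compilation
    start = time.perf_counter()
    for _ in range(repeats):
        fn(x).block_until_ready()  # wait for the device to finish
    return (time.perf_counter() - start) / repeats


x = jnp.ones(50_000)
t_scan = time_fn(cumsum_scan, x)
t_assoc = time_fn(cumsum_associative, x)
print(f"scan: {1e3 * t_scan:.3f} ms, associative_scan: {1e3 * t_assoc:.3f} ms")
```

The measured times depend on the device JAX selects, so the same script can be run on CPU and GPU to compare the two backends.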

dfm added 6 commits April 3, 2024 17:51

Adding parallel matmuls for strict tri quasisep matrices

5e15922

propagate parallel matmuls

2c8bce9

force general implementation of parallel matmul

93f7db5

revert checking devices

56503a4

fixing transpose bug

c447bca

fixing new matern test

1fca6b5
