Numerically stable parallel cumsum-based WKV + jax/tf/keras implementations #189

jackd · 2023-10-02T07:34:58Z

Hi there, love the work. Having dug through the paper in more detail recently, I realized the WKV implementation has some similarities with ongoing work I'm involved in, so I hacked up a proof of concept using keras / keras-nlp here. I've included a theory page, but to summarise:

- by scaling the numerator and denominator of the WKV expression by w**(t-1), it can be expressed in terms of cumulative sums;
- evaluating these scaled terms directly and adding them up is numerically unstable, but we can use a numerically stable implementation that works on exponentially weighted parameterizations (v, t), where the actual value z is represented as z = exp(t) * v;
- the resulting cumulative sum can be implemented with any associative scan / prefix sum implementation, which is work-efficient and highly parallelizable; and
- the gradient can also be expressed in terms of cumulative sums.
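
To make the (v, t) idea concrete, here is a minimal pure-Python sketch; it is not the code from the linked proof of concept. It assumes a simplified WKV with no bonus term, and it assumes the decay is applied as exp(-w) per step, so the scaling factor becomes exp(i * w) at step i. The names `stable_add`, `wkv_cumsum` and `wkv_recurrent` are made up for illustration.

```python
import math
from itertools import accumulate


def stable_add(a, b):
    """Add two values stored as (v, t) pairs, each standing for exp(t) * v."""
    (va, ta), (vb, tb) = a, b
    t = max(ta, tb)
    # Both exponents below are <= 0, so exp() can underflow to 0 but never overflow.
    return (va * math.exp(ta - t) + vb * math.exp(tb - t), t)


def wkv_cumsum(k, v, w):
    """Simplified WKV (assumption: no bonus term) as a ratio of two cumulative sums."""
    # Log-weight of step i after scaling numerator and denominator by exp(i * w).
    logw = [ki + i * w for i, ki in enumerate(k)]
    # accumulate() is a sequential scan; it stands in for a parallel one here.
    num = accumulate(zip(v, logw), stable_add)
    den = accumulate(((1.0, t) for t in logw), stable_add)
    return [vn * math.exp(tn - td) / vd for (vn, tn), (vd, td) in zip(num, den)]


def wkv_recurrent(k, v, w):
    """Reference: the same quantity computed with the usual step-by-step recurrence."""
    num = den = 0.0
    out = []
    for ki, vi in zip(k, v):
        num = math.exp(-w) * num + math.exp(ki) * vi
        den = math.exp(-w) * den + math.exp(ki)
        out.append(num / den)
    return out


# Usage: the scaled terms exp(k[i] + i * w) would overflow a float for large i,
# but the (v, t) representation never materialises them.
n = 3000
k = [math.sin(i) for i in range(n)]
v = [math.cos(i) for i in range(n)]
out = wkv_cumsum(k, v, w=1.0)
```

`stable_add` is associative up to floating-point rounding, which is the property a parallel scan needs. The same operator, written with array ops, could presumably be handed to something like `jax.lax.associative_scan` in place of the sequential `accumulate` used here.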

I've included a very rough performance summary which shows promise. That said:

- I'm not a torch expert, and the triton implementations I hacked together perform considerably worse than the tensorflow / jax versions;
- I don't have access to the computational resources to test whether these results are indicative of results on even a modest scale; and
- due to the trivial parallelism possible on the batch/embedding dimensions, parallelism over the time dimension may prove less beneficial than one might otherwise think, at least with the sequence lengths typically used at the moment.

This is a side-project of a side-project for me, so while I've enjoyed doing it, I can't afford to spend much longer fine-tuning a backend I understand little about. I'm pretty confident a CUDA implementation based on thrust's inclusive_scan would be straightforward and perform considerably better than my triton implementation, but having never written custom pytorch bindings, that's a project I'm going to pass on (if anyone decides to take that up, I can offer a basic sketch).

Hope this helps someone :)
