Memory-efficient attention #7

Open
justheuristic opened this issue Mar 15, 2022 · 6 comments

@justheuristic
Member

justheuristic commented Mar 15, 2022

This is a discussion of how to minimize memory usage of attention.

Current state: investigating apex's scaled_masked_softmax to check how it operates

@krunt
Contributor

krunt commented Mar 16, 2022

Regarding scaled_masked_softmax_cuda:
scaled_masked_softmax_cuda from apex/csrc/megatron behaves the same as the PyTorch softmax on the forward pass, but its backward pass runs in place, saving two buffers (the temporary and the return value).

It supports seq_len <= 2048 (which I think is easy to extend) and float16 only.

Regarding the next iteration of memory saving: it is implemented here
https://github.com/krunt/mytorchcudamodules/blob/master/modules/mine_self_multihead_attn_func.py
as a loop over the batch dimension, based on the Python version of multihead attention from apex (see the sketch below).

This code and its tests still need to be committed to this repo.
The logic should be enabled by an input argument flag.
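
For illustration only, a hedged sketch of the batch-loop idea in plain PyTorch (this is not the code from the fork above; the function and its arguments are hypothetical):

import torch

def batch_chunked_attention(q, k, v, chunk_size=4):
    # q, k, v: [batch, heads, seq_len, head_dim]; only chunk_size sequences'
    # [seq_len, seq_len] score matrices are materialized at any one time
    outputs = []
    scale = q.shape[-1] ** -0.5
    for start in range(0, q.shape[0], chunk_size):
        qc = q[start:start + chunk_size]
        kc = k[start:start + chunk_size]
        vc = v[start:start + chunk_size]
        scores = qc @ kc.transpose(-2, -1) * scale   # [chunk, heads, seq, seq]
        weights = torch.softmax(scores, dim=-1)
        outputs.append(weights @ vc)                 # [chunk, heads, seq, head_dim]
    return torch.cat(outputs, dim=0)

Note that under autograd every chunk's softmax output is still saved for the backward pass, so by itself this mainly bounds the peak memory of forward temporaries; it would need to be combined with gradient checkpointing to also reduce activation memory during training.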

@justheuristic
Member Author

Summary based on @krunt's recent talk about the FMHA design:

  • there are two ways to implement attention:
    -- the naive way:
    - 1. compute the query-key dot products tile by tile and store the dot products in global memory,
    - 2. then compute the attention weights (softmax of the dot products) and store the results in global memory,
    - 3. then compute the weighted sum of values with the attention weights and store the sums in global memory.
    -- the shmemory way:
    - load a subset of queries and all keys/values from global to shared memory,
    - compute the dot products and keep them in shared memory without offloading to global memory,
    - compute the attention weights via an in-place softmax in shared memory,
    - then compute the weighted sum of values with the attention weights, and only then store the results in global memory.

The shmemory way is significantly faster (~10x on the fmha benchmark #8), but requires that all keys/values fit into shared memory. As a result, both FMHA and FasterTransformer are limited to head dimension ≤ 64 and sequence length ≤ 512.

In turn, the naive way supports arbitrary head size and sequence length, but is significantly slower because it needs to store/load intermediate values in global memory.
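
For context, the naive way corresponds to a plain framework-level implementation in which every intermediate is materialized as a full [queries x keys] tensor (which, inside a kernel, means round-trips through global memory). A minimal PyTorch illustration:

import torch

def naive_attention(q, k, v):
    # q, k, v: [..., seq_len, head_dim]; scaling and masking omitted for brevity
    scores = q @ k.transpose(-2, -1)         # step 1: all query-key dot products at once
    weights = torch.softmax(scores, dim=-1)  # step 2: all attention weights at once
    return weights @ v                       # step 3: weighted sum of values

Both scores and weights are [..., seq_len, seq_len] tensors, which is exactly the intermediate memory this issue is trying to avoid.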

@justheuristic
Member Author

justheuristic commented Mar 22, 2022

Based on these two solutions, we can produce a middle-of-the-road implementation that combines the flexibility of the naive strategy with most of the performance of the shmemory-based strategy.

Stage 1: compute log-sum-exps

for each query, compute a scalar log-sum-exp of its dot products with all keys, i.e.
result[i] = log(sum_over_j(exp(<query_i, key_j>)))

Log-sum-exps can be computed incrementally in chunks of tile_size tokens.
The second, third, etc. tiles do the following:

# for all tiles i = 0...num_queries/tile_size, j = 0...num_keys/tile_size
logaddexp_accumulators_i = load_logsumexp_outputs_from_previous_part()  # initially a 1d [tile_size] vector of -inf
new_log_add_exps_ij = compute_dotproduct_logsumexp(query_tiles[i], key_tiles[j])
logaddexp_accumulators_i[:] = safe_logaddexp_pair(logaddexp_accumulators_i, new_log_add_exps_ij)

Here, compute_dotproduct_logsumexp stands for computing the dot products of queries with keys followed by a reduce_logsumexp over all keys, in parallel for each query; safe_logaddexp_pair is an element-wise log-sum-exp of two arguments, equivalent to torch.logaddexp.

i/o: load queries and keys, 2x [tile_size x head_size], and store the log-sum-exps, which are small [tile_size] vectors
flops: roughly half of fusedMHA's forward pass, since there is no need to compute the weighted sum of values at this stage
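
A hedged PyTorch sketch of Stage 1 with illustrative names (the tile loop is plain Python here, queries are kept whole for clarity, and scaling/masking are omitted, matching the formula above):

import torch

def stage1_logsumexp(q, k, tile_size=128):
    # result[i] = log(sum_over_j(exp(<q_i, k_j>))), accumulated over key tiles
    acc = torch.full((q.shape[0],), float("-inf"), dtype=q.dtype, device=q.device)
    for j in range(0, k.shape[0], tile_size):
        dots = q @ k[j:j + tile_size].T                            # dot products against one key tile
        acc = torch.logaddexp(acc, torch.logsumexp(dots, dim=-1))  # safe_logaddexp_pair
    return acc

This matches torch.logsumexp(q @ k.T, dim=-1) up to floating-point error while only ever materializing a [num_queries, tile_size] slice of the dot products.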

Stage 2: forward (given logsumexp)

Once we know the log-sum-exps, we no longer need to fit the entire set of keys and values into shared memory at once.

Instead, we can load one chunk at a time, compute partial attention outputs from that chunk, add them to the accumulator, then load the next chunk, and so on.

# for all tiles i = 0...num_queries/tile_size, j = 0...num_keys/tile_size
query_tiles[i], key_tiles[j], value_tiles[j] = load_into_shmemory()
attention_accumulators_i = load_partial_results_from_previous_part()  # initially 2d[num_queries, head_dim] of zeros
logsumexp_accumulator_i = load_from_stage_1_for_queries_i()

dot_product_ij = dot_product(query_tiles[i], key_tiles[j])
softmax_tile_ij = exp(dot_product_ij - logsumexp_accumulator_i)
attention_output_tile_ij = dot_product(softmax_tile_ij, value_tiles[j])
attention_accumulators_i[:] = attention_accumulators_i + attention_output_tile_ij

i/o: same as shmemory-based MHA, but with one extra tensor loaded
flops: slightly less than shmemory-based MHA, since the softmax denominator is pre-computed
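
A hedged PyTorch sketch of Stage 2 in the same style (illustrative names, the key/value tile loop in plain Python, queries kept whole, scaling and masking omitted as above):

import torch

def stage2_forward(q, k, v, lse, tile_size=128):
    # lse: per-query log-sum-exps from Stage 1; only one key/value tile is live at a time
    out = torch.zeros(q.shape[0], v.shape[1], dtype=q.dtype, device=q.device)
    for j in range(0, k.shape[0], tile_size):
        k_tile, v_tile = k[j:j + tile_size], v[j:j + tile_size]
        dots = q @ k_tile.T                       # dot_product_ij
        weights = torch.exp(dots - lse[:, None])  # softmax_tile_ij: denominator comes from Stage 1
        out += weights @ v_tile                   # accumulate attention_output_tile_ij
    return out

# sanity check against the naive reference (unscaled, unmasked)
q, k, v = (torch.randn(512, 64) for _ in range(3))
lse = torch.logsumexp(q @ k.T, dim=-1)  # what Stage 1 computes tile by tile
assert torch.allclose(stage2_forward(q, k, v, lse),
                      torch.softmax(q @ k.T, dim=-1) @ v, atol=1e-4)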

Stage 3: backward

Use the same backward logic as in the shmemory way, but this time reuse the log-sum-exps saved from the forward pass and accumulate gradients by tiles.
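
One way this could look, sketched in the same style (this is my own illustration of tiled gradient accumulation with a saved log-sum-exp, not the kernel's actual code; scaling and masking are again omitted):

import torch

def stage3_backward(q, k, v, out, d_out, lse, tile_size=128):
    # Gradients of out = softmax(q @ k.T) @ v, recomputing attention weights per tile from lse.
    dq, dk, dv = torch.zeros_like(q), torch.zeros_like(k), torch.zeros_like(v)
    # delta[i] = <d_out[i], out[i]> equals the row sum of (weights * d_weights), needed for softmax backward
    delta = (d_out * out).sum(dim=-1)
    for j in range(0, k.shape[0], tile_size):
        sl = slice(j, j + tile_size)
        weights = torch.exp(q @ k[sl].T - lse[:, None])  # recomputed attention weights for this key tile
        dv[sl] += weights.T @ d_out                      # gradient w.r.t. values
        d_weights = d_out @ v[sl].T                      # gradient w.r.t. attention weights
        d_dots = weights * (d_weights - delta[:, None])  # softmax backward using the precomputed delta
        dq += d_dots @ k[sl]                             # accumulate gradient w.r.t. queries
        dk[sl] += d_dots.T @ q                           # gradient w.r.t. keys
    return dq, dk, dv

These tiled gradients can be checked against autograd applied to the naive softmax(q @ k.T) @ v reference.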

Notes:

  • compute log-sum-exps during the forward pass and reuse them for the backward pass
  • if it works well enough, maybe contribute this to apex in order to avoid compilation here?

@krunt
Contributor

krunt commented Apr 12, 2022

[image attachment]

Forward fmha for longer sequences is implemented in this fork: https://github.com/krunt/apex

K and V always stay in shared memory (no offload (!) to global memory during the iteration over Q).

  1. The forward pass is 2x-2.5x faster than lean (and memory-efficient too!).
  2. head_dim > 64 is supported, though not optimally (fmha itself does not support it at all); hopefully the implementation is correct - the results suggest so.

TODO:

  1. support initialization of cacc_max, cacc_sum, vacc
  2. measure the slowdown from offloading cacc_max, cacc_sum to global memory
  3. support different sequence lengths (via a mask; currently not fixed in fmha, but easy to do)
  4. backward pass
  5. switch the forward accumulators to float (need to check whether this is necessary)

@krunt
Contributor

krunt commented Apr 18, 2022

bwd is ported:

[image attachment]

@krunt
Contributor

krunt commented Apr 18, 2022

fwd+bwd results:

[three image attachments with the results]
