Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion for new level-1v/-1m-like operations #762

Open
11 tasks
fgvanzee opened this issue Jul 31, 2023 · 27 comments
Open
11 tasks

Discussion for new level-1v/-1m-like operations #762

fgvanzee opened this issue Jul 31, 2023 · 27 comments
Assignees

Comments

@fgvanzee
Copy link
Member

fgvanzee commented Jul 31, 2023

Let's discuss new operations that we might like to add to BLIS, specifically those that would fall into level-1v or level-1m families (and perhaps level-2):

  • element-wise vector/matrix multiplication
    • fused element-wise vector ops
  • element-wise vector/matrix division
  • vector reduction
  • partial matrix reductions
    • reduce rows to vector
    • reduce columns to vector
  • "matrix-times-diagonal-matrix", e.g. C_ij = A_ij * b_j
  • c = diag(A*trans?(B)) (c_i = A_ij*B_ji or A_ij*B_ij)
  • mixed-precision/mixed-domain level1/2
  • parallelization for level1/2

We will add to this list as the discussion unfolds!

@ct-clmsn
Copy link
Contributor

ct-clmsn commented Aug 1, 2023

Not sure if these are in scope but, a function like einsum would be useful; ensum is a dependency function for gaussian mixture modeling (and adjacent algorithms).

Generally, any functionality in BLIS that could reflect, or be utilized to implement, functionality found in numpy would be especially valuable: https://numpy.org/doc/stable/reference/routines.linalg.html

@devinamatthews
Copy link
Member

@ct-clmsn some of the things I have in mind can be used as kernels for various combinations of einsum parameters. Providing einsum itself is almost certainly out-of-scope, but yes we'd definitely like to be able to provide all of the basic operations that einsum would use internally. Is there some taxonomy/list of these?

@ct-clmsn
Copy link
Contributor

ct-clmsn commented Aug 1, 2023

@devinamatthews I'll have to dig through the source to find them, shouldn't take too much time. Will get back with a list soon.

Source implementation is here.

Looks like dot product is the only thing happening under the hood get_sum_of_products_function ; the indexing mechanisms, which work a little bit like a simple domain specific language for describing looping mechanics/loads/stores, is where the 'magic' happens. So it's out of scope.

There's an optimization effort for the functionality that's worth noting

@realab-ai
Copy link

realab-ai commented Aug 2, 2023

I found the performance of operators in 1f and 2 (axpyf, gemv, etc.) are lower than expected at SKX. It seems the current kernels are not designed under the context of AVX512. Shall we take AVX512 feature in the the planning level-1v/-1m?
And, how about multi-threading in the planning work? I'm not sure if it is a good feature for level-1f/-1m but very interested in this issue.

@fgvanzee
Copy link
Member Author

fgvanzee commented Aug 2, 2023

I found the performance of operators in 1f and 2 (axpyf, gemv, etc.) are lower than expected at SKX. It seems the current kernels are not designed under the context of AVX512. Shall we take AVX512 feature in the the planning level-1v/-1m?

Fair question. I think for now we're only gathering candidate operations to add to BLIS. Microarchitectures and instruction sets for which to optimize those (or existing) operations is mostly orthogonal IMO.

And, how about multi-threading in the planning work? I'm not sure if it is a good feature for level-1f/-1m but very interested in this issue.

Multithreading for non-level-3 operations is one of those action items we've wanted for a long time, but not badly enough relative to everything else we want. 😂 It will happen someday!

@hominhquan
Copy link
Contributor

Good initiative, I would add to the original list:

  • a, b, c, d, e are vectors (real or complex)
  • element-wise dual-multiply dual-add/sub:
    • e = (conj?(a)*b) + (conj?(c)*d)
    • e = (conj?(a)*b) - (conj?(c)*d)
    • e = -(conj?(a)*b) - (conj?(c)*d)
  • element-wise dual-multiply dual-add/sub accumulate:
    • e += (conj?(a)*b) + (conj?(c)*d)
    • e += (conj?(a)*b) - (conj?(c)*d)
    • e += -(conj?(a)*b) - (conj?(c)*d)
  • element-wise vector type-conversion (narrow, widen). Does copyv kernel already support type-conversion ?
    • a = type(conj?(b)) where type() can be S, D, C, Z and Half (H) and Complex-half (CH)
  • Ideally, I would like to see fp16 Half (H) and Complex-half (CH) in the BLAS/BLIS API as well.

@devinamatthews
Copy link
Member

@hominhquan for 2./3., are these building blocks for something like e = \sum_i conj?(a_i)*b_i?

@devinamatthews
Copy link
Member

@hominhquan 4./5. are on the radar!

@devinamatthews
Copy link
Member

@realab-ai thanks for letting us know about this! Would you mind opening an issue for this with some basic performance comparison data? I think that actually using the AVX2 level1v/f kernels from Zen could help a lot. I've heard conflicting things about the benefits of AVX512 for such operations due to the more severe throttling.

@devinamatthews
Copy link
Member

@ct-clmsn I created a taxonomy of "feasible" unary and binary tensor operations in TBLIS, and these reduce naturally to matrix ops. That actually is sort of what prompted this issue. Hopefully those same operations would be those that slot into opt_einsum as basic kernels.

@hominhquan
Copy link
Contributor

@hominhquan for 2./3., are these building blocks for something like e = \sum_i conj?(a_i)*b_i?

Kind of. I've seen some DSP algorithms on which they can be useful. They are like the next-gen of FMA operations with more flops.

@realab-ai
Copy link

@realab-ai thanks for letting us know about this! Would you mind opening an issue for this with some basic performance comparison data? I think that actually using the AVX2 level1v/f kernels from Zen could help a lot. I've heard conflicting things about the benefits of AVX512 for such operations due to the more severe throttling.

Ok. I'll do this recently.

@devinamatthews
Copy link
Member

@hominhquan Cool. I ask because if the goal is N simultaneous element-wise multiplications where N > 1 but not necessary just 2 then this is more like a level1f kernel than level1. I'll add to the list.

@jeffhammond
Copy link
Member

Native support for einsum would be awesome. Don't limit yourselves just because the creators of BLAS stopped at two indices in the 1980s. Why shouldn't BLIS try to solve directly one of the most widely used linear algebra APIs in all of computing?

@hominhquan
Copy link
Contributor

hominhquan commented Aug 3, 2023

Another feature I'd like to see: for element-wise operation, since there is no reduction nor (backward) spatio-temporal dependency in memory access like level-3, to support user-defined custom ukernels to be applied on each (vectorized) chunk of inputs (like what @devinamatthews has done in level-3):

for ichunk in 0 .. (VEC_LENGTH/SIMD_LENGTH -1)
   simd_t  achunk = *((simd_t *) &a[ichunk*SIMD_LENGTH]);
   simd_t  bchunk = *((simd_t *) &b[ichunk*SIMD_LENGTH]);
   simd_t  cchunk = *((simd_t *) &c[ichunk*SIMD_LENGTH]);

   simd_t  rchunk = user_ukr_3inputs_simd(achunk, bchunk, cchunk);

   *((simd_t *) &res[ichunk*SIMD_LENGTH]) = rchunk;
end

// trailing
for i in ...
   ctype r = user_ukr_3inputs(a[i], b[i], c[i]);
   ...
end
  • User are then able to implement whatever they want to do in user_ukr_3inputs_simd(), whenever the fusion gain is worthy compared to the function-call cost.

@rvdg
Copy link
Collaborator

rvdg commented Aug 3, 2023

I introduced invscal in libflame, and that operation is now in BLIS as well.

It performance x := 1/alpha x. Much cleaner than passing 1/alpha into scal.

About 10 years ago while reading lapack code, I ran across some specialized code in an edge case. It is possible that 1/alpha underflows (meaning 1/alpha x yields the zero vector) even though dividing all elements of x by alpha would give you the expected result. What I saw them do was:

Call a routine, SLAMCH, that returns the largest value that underflows when inverted.

Check alpha against this value.

If it is ok, then scal is called with 1/alpha

Otherwise, a loop is executed to do individual divides.

This suggests:

  • If not already in BLIS, we need "level 0" routines that return things like the value where underflow happens.

  • a invscalx routines that does what is described above.

Now if only I could remember in what lapack routine this happens...
UPDATE: an example is in sgetrf: https://netlib.org/lapack/explore-html/d8/ddc/group__real_g_ecomputational_gab9b698b35b884b4d17b9713e76fabe37.html

@ct-clmsn
Copy link
Contributor

ct-clmsn commented Aug 3, 2023

Would sparse operations be on the table? NIST has an existing implementation available here that could be used as a template.

@realab-ai
Copy link

I found all the band relevant operations (GB,SB,HB,TB) belonging to level-1/2 are only in f2c version instead of a blis implemented. What's the reason behind this situation?

@devinamatthews
Copy link
Member

The banded operations, like the other level2 operations, are memory bandwidth limited. Also, because of the band structure they are somewhat more difficult to optimize (you get some flavor of triangular operations e.g.). So, they basically never became a target of more in-depth optimization and inclusion in the BLIS interface. Are you interested in BLIS-style banded operations (with separate row and column strides, conjugation without transposition, etc.)?

@devinamatthews
Copy link
Member

  • User are then able to implement whatever they want to do in user_ukr_3inputs_simd(), whenever the fusion gain is worthy compared to the function-call cost.

Take a look at the attached file, in particular the bit at the end.
vec_util.h.txt

I use this to vectorize chunks of "level1-like" code in C++ using lambdas, even quite complicated code with function calls etc.

@nick-knight
Copy link

nick-knight commented Aug 22, 2023

The banded operations, like the other level2 operations, are memory bandwidth limited.

Successive band reduction algorithms based on two-sided orthogonal transformations can be reorganized to expose higher, "level3-like" arithmetic intensity.

For Householder-based approaches, the resulting kernels can be efficiently expressed as a sequence of TRMM and SYMM/GEMM calls (compressed band layouts handled by stride manipulation) --- see Fig. 1. The last reduction step (to bidiagonal, tridiagonal, or Hessenberg form) will of course be level2-like. As a complementary optimization, you can chase a train of bulges to improve locality --- see Fig. 2. All of this can be handled at the libflame/LAPACK level. I don't really think there is much to be done within BLIS for this. (EDIT: Actually, IIRC, BLIS is missing "native TRMM", so maybe this is a gap?)

For Givens-based approaches, it's a different story, and custom kernels would be beneficial. See, e.g.,
https://www.cs.utexas.edu/users/flame/pubs/RestructuredQRTOMS.pdf
(This is in the context of QR iteration on bidiagonal/tridiagonal matrices, but it's a similar flavor to the successive band reduction case. I can provide more references if desired.)

This comment is, of course, completely off-topic for the present discussion on level1 stuff.

@fgvanzee
Copy link
Member Author

For Givens-based approaches, it's a different story, and custom kernels would be beneficial.

Definitely. Givens kernels (including fused Givens kernels) were always something that I anticipated we would revisit once BLIS "settled down." Of course, here we are, 10+ years later, and it's still settling. 😂

Thanks for reminding me of this, @nick-knight!

@realab-ai
Copy link

realab-ai commented Aug 23, 2023

For Givens-based approaches, it's a different story, and custom kernels would be beneficial. See, e.g., https://www.cs.utexas.edu/users/flame/pubs/RestructuredQRTOMS.pdf (This is in the context of QR iteration on bidiagonal/tridiagonal matrices, but it's a similar flavor to the successive band reduction case. I can provide more references if desired.)

Now I’m looking for a solution of custom band-like kernel which is hoped to bring benefits to Givens-based successive band reduction case. I would be appreciate if there's more references about this, @nick-knight.

Sorry that the request is off-topic. Would you please share further information to : [email protected]

@chillenb
Copy link

Well, it seems a bit late to add to this discussion, but I'd have a lot of use for a level 3 routine that calculates A diag(D) A.H. This is a syrk / herk fused with column scaling. It can also be viewed as a special case of syr2k. I need to do a lot of these with the same A but different D, and usually A is short and wide. If I could do this without modifying or copying A, it would save a lot of memory and make life easier.

@devinamatthews
Copy link
Member

@chillenb that's great to know. There's a high likelihood that we'll add this in a near future version.

@rvdg
Copy link
Collaborator

rvdg commented Nov 11, 2024

@chillenb You may want to watch Devin's talk at the most recent BLIS retreat: https://www.cs.utexas.edu/~flame/BLISRetreat2024/Talks.html#DevinTalk1

@chillenb
Copy link

@devinamatthews @rvdg
Wonderful! I'm looking forward to it, as well as parallel L1/L2 capability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants