tested on a single NVIDIA A10 (22G) and a 3060 (6G)
examples: https://triton-lang.org/main/getting-started/tutorials/index.html
reference: https://zhuanlan.zhihu.com/p/684473453
local tests of the official examples
3 examples read so far
Triton puzzles from https://github.com/srush/Triton-Puzzles and https://github.com/SiriusNEO/Triton-Puzzles-Lite
7 puzzles finished so far
WIP: flash attention