Custom kernels in Triton language for accelerating LLMs
First kernel added - a Triton Fused Softmax.
Currently same speed and numerics as PyTorch Softmax in E2E training, hopefully better tuning will accelerate past PyTorch.
Next up - RMSNorm. Fwd working, bwd in progress.