Commit 32b9b15 successfully runs the GEMM code generated by CuTe on InfiniGen, and its performance has been compared against cuBLAS. However, code generation currently works by copying a template directly. The next steps are:
Configuring matrix blocking through tiling parameters.
Attempting fusion within the GEMM kernel, such as fusing ReLU and other activation functions.
Merging the GEMM Graph and BinaryUnaryGraph into a single graph, and designing a graph that can accommodate multiple operators working together.
Designing a standardized testing framework, since the current one is fairly simple and lacks standardization.
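To make the tiling step concrete, here is a minimal CPU-side sketch of a GEMM whose blocking is driven by a configuration object instead of a hard-coded template. `TileConfig`, its fields `bm`/`bn`/`bk`, and `tiled_gemm` are hypothetical names chosen for illustration; they are not InfiniGen or CuTe APIs, and the loop nest only mirrors the block structure a generated kernel might use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical tiling parameters (illustrative, not InfiniGen's config)."""
    bm: int  # rows of C computed per block
    bn: int  # columns of C computed per block
    bk: int  # depth of the K slice consumed per step


def tiled_gemm(a, b, cfg):
    """Compute C = A @ B on nested lists, iterating block by block."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, cfg.bm):          # block row of C
        for j0 in range(0, n, cfg.bn):      # block column of C
            for k0 in range(0, k, cfg.bk):  # slice along K
                # min(...) handles edge tiles when sizes do not divide evenly
                for i in range(i0, min(i0 + cfg.bm, m)):
                    for j in range(j0, min(j0 + cfg.bn, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + cfg.bk, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = tiled_gemm(a, b, TileConfig(bm=2, bn=1, bk=2))
```

Changing the `TileConfig` values changes only the traversal order, not the result, which is the property a tiling-driven generator would need to preserve.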
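1. Dig into the computation of the matrix multiplication operator and explore the following potential performance points:
1.1 Parallelism
1.2 Efficient IO
1.3 Efficient computation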
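One way to reason about the IO point is to estimate how tile size changes memory traffic. The sketch below uses a simplified model that is an assumption, not a measurement: each `bm` x `bn` output block loads a `bm` x `K` panel of A and a `K` x `bn` panel of B exactly once, C is written once, and no caching across blocks is counted. The function name `gemm_tile_traffic` and its parameters are hypothetical.

```python
def gemm_tile_traffic(m, n, k, bm, bn, bytes_per_elem=4):
    """Estimate FLOPs, bytes moved, and arithmetic intensity for a tiled GEMM.

    Simplified model (assumption): no reuse between output blocks, and edge
    tiles are counted as full tiles, so the byte count is an upper bound when
    m or n is not a multiple of the tile size.
    """
    blocks = (-(-m // bm)) * (-(-n // bn))  # ceil(m/bm) * ceil(n/bn)
    loads = blocks * (bm * k + k * bn)      # A panel + B panel per block
    stores = m * n                          # each element of C written once
    flops = 2 * m * n * k                   # one multiply and one add per term
    bytes_moved = (loads + stores) * bytes_per_elem
    return flops, bytes_moved, flops / bytes_moved


small = gemm_tile_traffic(1024, 1024, 1024, bm=32, bn=32)
large = gemm_tile_traffic(1024, 1024, 1024, bm=128, bn=128)
```

Under this model, larger tiles raise the FLOPs-per-byte ratio because each loaded panel is reused across more output elements, which is why the tile shape matters for IO efficiency.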
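2. Consider the following features:
2.1 Post-fusion: fuse a following activation or the next operator
2.2 Pre-fusion: fuse the preceding operator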
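The two fusion directions above can be sketched as hooks around the accumulation loop. This is a plain-Python illustration under stated assumptions, not the project's design: `fused_gemm`, `prologue`, and `epilogue` are hypothetical names, the prologue is applied elementwise to A as it is read (standing in for a fused preceding operator), and the epilogue is applied to each accumulator before it is stored (standing in for a fused activation).

```python
def relu(x):
    """Elementwise ReLU, used here as an example epilogue."""
    return x if x > 0 else 0.0


def fused_gemm(a, b, prologue=None, epilogue=None):
    """Compute epilogue(prologue(A) @ B) without materializing intermediates."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = 0.0
            for kk in range(k):
                x = a[i][kk]
                if prologue is not None:
                    x = prologue(x)   # pre-fusion: transform the input on load
                acc += x * b[kk][j]
            if epilogue is not None:
                acc = epilogue(acc)   # post-fusion: transform before the store
            row.append(acc)
        c.append(row)
    return c


a = [[1, -2], [3, 4]]
identity = [[1, 0], [0, 1]]
post_fused = fused_gemm(a, identity, epilogue=relu)
both_fused = fused_gemm(a, identity, prologue=lambda x: -x, epilogue=relu)
```

In a real kernel the same idea would apply at the point where tiles are loaded into registers and where accumulators are written back, so neither fused operator needs its own pass over global memory.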
3. Provide a choice of compute kernels, for example CUDA core / Tensor Core on the CUDA platform, and tensor core / convolution core on the BANG platform.
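A small sketch of how such a choice might be exposed to callers follows. The table `KERNEL_OPTIONS`, the option strings, and `select_kernel` are all hypothetical, made up for illustration; the issue does not specify an interface, and defaulting to the first listed option is an arbitrary choice.

```python
# Hypothetical option table: platform name -> allowed compute kernel kinds.
KERNEL_OPTIONS = {
    "cuda": ("cuda_core", "tensor_core"),
    "bang": ("tensor_core", "conv_core"),
}


def select_kernel(platform, kernel=None):
    """Validate a (platform, kernel) choice; default to the first option."""
    options = KERNEL_OPTIONS.get(platform)
    if options is None:
        raise ValueError(f"unknown platform: {platform!r}")
    if kernel is None:
        return options[0]
    if kernel not in options:
        raise ValueError(
            f"kernel {kernel!r} is not available on {platform!r}; "
            f"choose one of {options}"
        )
    return kernel


default_cuda = select_kernel("cuda")
bang_choice = select_kernel("bang", "conv_core")
```

Validating the pair up front keeps an unsupported combination, such as a CUDA-only option requested on BANG, from reaching code generation.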