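- Install the turingas compiler (run `setup.py` from inside the cloned `turingas` directory):

      git clone git@github.com:daadaada/turingas.git
      python setup.py install

- Build the benchmarks, then compile the SASS kernels with the `-arch` value (`70`, `75`, or `80`) that matches your GPU:

      mkdir build && cd build
      cmake .. && make
      python ../compile_sass.py -arch=<70|75|80>
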
Latency | Unit | Turing RTX-2070 |
---|---|---|
Global Latency | cycle | TBD |
L2 Latency | cycle | 236 |
L1 Latency | cycle | 32 |
Shared Latency | cycle | 23 |
Constant Latency | cycle | 448 |
Constant L2 Latency | cycle | 62 |
Constant L1 Latency | cycle | 4 |
- The constant L1 cache (4 cycles) is about as fast as a register access.
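
To read the cycle counts above as wall-clock time, divide by the clock frequency. The following is a minimal sketch, not part of the benchmark suite: the helper name `cycles_to_ns` is illustrative, and the 1.62 GHz clock is an assumed value that should be replaced with the clock your card actually runs at.

```python
# Latencies in cycles, copied from the table above (Global is still TBD).
LATENCY_CYCLES = {
    "L2": 236,
    "L1": 32,
    "Shared": 23,
    "Constant": 448,
    "Constant L2": 62,
    "Constant L1": 4,
}

# Assumption: SM clock in GHz. This is NOT a measured value; adjust it.
ASSUMED_CLOCK_GHZ = 1.62


def cycles_to_ns(cycles, clock_ghz=ASSUMED_CLOCK_GHZ):
    """Convert a cycle count to nanoseconds (ns = cycles / GHz)."""
    return cycles / clock_ghz


if __name__ == "__main__":
    for name, cycles in LATENCY_CYCLES.items():
        print(f"{name:12s} {cycles:4d} cycles ~ {cycles_to_ns(cycles):6.1f} ns")
```

Because the clock is a parameter, the same helper can be reused for other devices once their latencies are measured.
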
Cache | Unit | Turing RTX-2070 |
---|---|---|
L2 Linesize | bytes | 64 |
L1 Linesize | bytes | 32 |
Constant L2 Linesize | bytes | 256 |
Constant L1 Linesize | bytes | 32 |
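
The line size determines how many cache lines one warp-wide access touches. As a rough illustration, the sketch below counts the distinct lines covered when each thread of a warp loads one word. The access pattern (32 threads, 4-byte aligned words, optional stride) and the helper name `lines_touched` are assumptions made for the example, not something the benchmarks prescribe.

```python
# Line sizes in bytes, copied from the table above.
LINE_SIZE_BYTES = {"L2": 64, "L1": 32, "Constant L2": 256, "Constant L1": 32}


def lines_touched(line_size, num_threads=32, bytes_per_thread=4,
                  stride_bytes=None, base=0):
    """Count distinct cache lines covered when each thread loads one word.

    Thread t reads bytes [base + t*stride, base + t*stride + bytes_per_thread).
    A stride of None means the words are packed contiguously.
    """
    if stride_bytes is None:
        stride_bytes = bytes_per_thread
    lines = set()
    for t in range(num_threads):
        start = base + t * stride_bytes
        end = start + bytes_per_thread - 1
        # A word may straddle a line boundary, so add every line it overlaps.
        lines.update(range(start // line_size, end // line_size + 1))
    return len(lines)


if __name__ == "__main__":
    for level in ("L1", "L2"):
        size = LINE_SIZE_BYTES[level]
        print(level, "contiguous:", lines_touched(size),
              "| 32-byte stride:", lines_touched(size, stride_bytes=32))
```

Under these assumptions a contiguous 128-byte warp access spans four 32-byte L1 lines and two 64-byte L2 lines, while a 32-byte stride spreads the same 32 words over 32 L1 lines.
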
Instruction | Unit | With conflict | Without conflict |
---|---|---|---|
FFMA | CPI | 1.758 | 1.484 |
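
The two CPI figures can be compared directly: their ratio is the slowdown attributable to the conflict, and the reciprocal of CPI is the instruction rate per cycle. A small worked sketch using only the numbers in the table (the function names are illustrative, and the scope over which CPI was measured is not restated here):

```python
# FFMA cycles per instruction, copied from the table above.
CPI_WITH_CONFLICT = 1.758
CPI_WITHOUT_CONFLICT = 1.484


def slowdown(cpi_slow, cpi_fast):
    """Ratio of the two CPI values; 1.0 means no penalty."""
    return cpi_slow / cpi_fast


def ipc(cpi):
    """Instructions per cycle is the reciprocal of cycles per instruction."""
    return 1.0 / cpi


if __name__ == "__main__":
    ratio = slowdown(CPI_WITH_CONFLICT, CPI_WITHOUT_CONFLICT)
    print(f"slowdown with conflict: {ratio:.3f}x")
    print(f"IPC with conflict:      {ipc(CPI_WITH_CONFLICT):.3f}")
    print(f"IPC without conflict:   {ipc(CPI_WITHOUT_CONFLICT):.3f}")
```
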
Memory Load | Unit | Turing RTX-2070 |
---|---|---|
Single | cycle | 23 |
Vector2 X 2 | cycle | 27 |
Conflict Strided | cycle | 41 |
Conflict-Free Strided | cycle | 32 |
- Jia, Zhe, et al. "Dissecting the NVIDIA Volta GPU architecture via microbenchmarking." arXiv preprint arXiv:1804.06826 (2018).
- Jia, Zhe, et al. "Dissecting the NVidia Turing T4 GPU via microbenchmarking." arXiv preprint arXiv:1903.07486 (2019).
- Yan, Da, Wei Wang, and Xiaowen Chu. "Optimizing batched winograd convolution on GPUs." Proceedings of the 25th ACM SIGPLAN symposium on principles and practice of parallel programming. 2020. (turingas)