hip mfma tests #246 (Open)
CRobeck wants to merge 92 commits into develop from test/mfma

Changes from 88 commits:
1594a50  adding inital infrastructure for MFMA test (CRobeck)
e3d5aac  working hip mfma with matrix core builtins (CRobeck)
332067b  fixing incorrect default problem size (CRobeck)
8aa020b  fixing flop calc (CRobeck)
bc9770a  fixing some variable names (CRobeck)
774d742  setting up problem size infrastructure correctly (CRobeck)
83700d6  fixing results array storage (CRobeck)
1bbd473  finish multi matrix support (CRobeck)
5591849  updating algorithm description (CRobeck)
4653c23  adding a few more guard rails for fma builtins. kernel still only imp… (CRobeck)
bd47d35  adding some ifdefs to call a seperate mfma kernel if hardware support… (CRobeck)
fde4b65  adding reference basic mat-mat kernel for comparison when no mfma ins…
47925bb  Merge branch 'develop' into test/mfma
7174382  adding problem size support to base hip kernel
f5fe861  adding problem size support to base hip kernel (CRobeck)
23866ad  Merge branch 'test/mfma' of https://github.com/llnl/RAJAPerf into tes… (CRobeck)
5b7bd12  Merge branch 'test/mfma' of https://github.com/llnl/RAJAPerf into tes… (CRobeck)
5ad1472  Merge branch 'test/mfma' of https://github.com/llnl/RAJAPerf into tes… (CRobeck)
67c02c4  Merge branch 'test/mfma' of https://github.com/llnl/RAJAPerf into tes… (CRobeck)
f3b4488  Merge branch 'test/mfma' of https://github.com/llnl/RAJAPerf into tes… (CRobeck)
799e17c  updating top level cmake file for new mfma test (CRobeck)
451dee2  updating cmake file to add mfma test to basic list (CRobeck)
c015117  adding mfma test to raja perf suite infrastructure (CRobeck)
6a6ccd7  adding mfma base and header files (CRobeck)
43c1c6e  adding mfma seq variant skeleton (CRobeck)
2e75981  adding mfma omp and omp offload variant skeleton structure (CRobeck)
5181219  adding mfma cuda variant skeleton (CRobeck)
ef451af  add inital set of HIP mfma varaint with builtin matrix core instructions (CRobeck)
7eab287  Merge branch 'test/mfma' of https://github.com/llnl/RAJAPerf into tes… (CRobeck)
0bd36f4  add mat_fused_mul_add cuda variant kernel (CRobeck)
3c906a0  filling in seq MAT_FUSED_MUL_ADD seq variant (CRobeck)
464380d  cleaning up MAT_FUSED_MUL_ADD macros a bit (CRobeck)
ee3c695  making MAT_FUSED_MUL_ADD cuda and hip variants more consistent (CRobeck)
24279eb  fixing bug in MAT_FUSED_MUL_ADD Cuda kernel (CRobeck)
75c7a07  making MAT_FUSED_MUL_ADD naming consistent (CRobeck)
75875b9  updating mat fused body and ading raja lam variant (CRobeck)
2a46e30  add data set up to mat_fused_lam (CRobeck)
a4328c8  moving MAT_FUSED_MUL_ADD array init into macro (CRobeck)
be2430f  fixing missing ; (CRobeck)
1ff6f93  finish filling in RAJA_HIP MAT_FUSED_MUL_ADD variant (CRobeck)
9dcd3d8  finish filling in RAJA_CUDA MAT_FUSED_MUL_ADD variant (CRobeck)
66d9756  filling in MAT_FUSED_MUL_ADD OMP variants (CRobeck)
73d745a  add MAT_FUSED_MUL_ADD_DATA_INIT (CRobeck)
9c63d3b  add MAT_FUSED_MUL_ADD_DATA_INIT (CRobeck)
ca7b810  fixing some unused vars (CRobeck)
02633ea  fix indexing issue (CRobeck)
f1fb1ab  fix incorrect order of raja_hip segments (CRobeck)
faf046f  fix incorrect order of raja_cuda segments (CRobeck)
73bd85e  update data types (CRobeck)
4f7dac3  fix indexing issue (CRobeck)
8e119d7  fix indexing issue (CRobeck)
cde8323  fix merge conflict (CRobeck)
e7ef2b4  fix one more indexing issue (CRobeck)
78e2590  fix indexing issue (CRobeck)
feadc64  adding tunings for builtin version of mfma hip kernel (CRobeck)
900671a  fix one more indexing issue (CRobeck)
162a855  clean up some comments and unused vars (CRobeck)
fbf2820  hopefully fixing final indexing issue (CRobeck)
e1ebcdd  removing some unused vars (CRobeck)
f02da32  cleaning up block/grid defs (CRobeck)
59f6766  cleaning up block/grid defs further (CRobeck)
89a8c00  fix spacing (CRobeck)
1156a66  fix spacing (CRobeck)
fd9d80a  spacing (CRobeck)
c7b35a9  spacing (CRobeck)
2d591b2  moving function to unnamed namespace (CRobeck)
30cfca4  Merge branch 'test/mfma' of https://github.com/LLNL/RAJAPerf into tes… (CRobeck)
3ca73ff  Merge branch 'develop' into test/mfma (CRobeck)
18622fb  fixing undefined var (CRobeck)
23b9277  fixing omp looping issue for gcc (CRobeck)
5f46c0e  fixing omp looping issue for gcc (CRobeck)
8f0e512  fixing one more omp looping issue for gcc (CRobeck)
8f07996  omp requires canonical loop forms (CRobeck)
c94586f  update default problem size (CRobeck)
43e277b  update spacing (CRobeck)
44c36b4  fix spacing (CRobeck)
8e3fbf6  formating clean up (CRobeck)
bd2d44c  move getHipArch into common header file as a static free function (CRobeck)
49084ef  update out loop for raja hip/cuda variant (CRobeck)
5169e6e  update unroll pragma to RAJAPERF_UNROLL (CRobeck)
a400d93  fix RAJAPERF_UNROLL def (CRobeck)
5bc407f  fix type in cuda variant (CRobeck)
3186df1  Update MAT_FUSED_MUL_ADD.hpp (CRobeck)
9eb3ce4  cleaning up some variable naming and create a N_Elem var (CRobeck)
f8c1f2c  add raja_hiperrchk header (CRobeck)
1347434  update lone N_Elem var (CRobeck)
cde5c95  update lambda forall (CRobeck)
a4ee50d  Merge branch 'test/mfma' of github.com:LLNL/RAJAPerf into test/mfma (CRobeck)
5b93123  roll back lambda_hip_forall naming (CRobeck)
39fbf15  git rid of some unused static function warning (CRobeck)
831e61c  make hipArch more consistent with device naming (CRobeck)
4e6a7b3  fixing var naming issue (CRobeck)
@@ -0,0 +1,155 @@
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
// Copyright (c) 2017-20, Lawrence Livermore National Security, LLC
// and RAJA Performance Suite project contributors.
// See the RAJAPerf/LICENSE file for details.
//
// SPDX-License-Identifier: (BSD-3-Clause)
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//

#include "MAT_FUSED_MUL_ADD.hpp"

#include "RAJA/RAJA.hpp"

#if defined(RAJA_ENABLE_CUDA)

#include "common/CudaDataUtils.hpp"

#include <iostream>

namespace rajaperf {
namespace basic {

#define MAT_FUSED_MUL_ADD_DATA_SETUP_CUDA              \
  const Index_type N = m_N;                            \
  const Index_type Ne = m_Ne;                          \
  allocAndInitCudaDeviceData(A, m_A, N);               \
  allocAndInitCudaDeviceData(B, m_B, N);               \
  allocAndInitCudaDeviceData(D, m_D, N);

#define MAT_FUSED_MUL_ADD_DATA_TEARDOWN_CUDA           \
  getCudaDeviceData(m_A, A, N);                        \
  getCudaDeviceData(m_B, B, N);                        \
  getCudaDeviceData(m_D, D, N);                        \
  deallocCudaDeviceData(A);                            \
  deallocCudaDeviceData(B);                            \
  deallocCudaDeviceData(D);

template < Index_type block_size >
__launch_bounds__(block_size)
__global__ void mat_fused_mul_add(const Real_ptr A, const Real_ptr B, Real_ptr D,
                                  Index_type N){
  constexpr Index_type Ne = 16;
  const Index_type N_Elem = N/(Ne*Ne);
  for(Index_type ii = 0; ii != N_Elem; ++ii){
    Index_type col = threadIdx.x + blockIdx.x * blockDim.x;
    Index_type row = threadIdx.y + blockIdx.y * blockDim.y;
    MAT_FUSED_MUL_ADD_BODY;
  }
}

template < Index_type block_size, typename Lambda >
__launch_bounds__(block_size)
__global__ void mat_fused_lam(Index_type N, Lambda body)
{
  constexpr Index_type Ne = 16;
  const Index_type N_Elem = N/(Ne*Ne);
  for(Index_type ii = 0; ii != N_Elem; ++ii){
    Index_type col = threadIdx.x + blockIdx.x * blockDim.x;
    Index_type row = threadIdx.y + blockIdx.y * blockDim.y;
    body(ii,col,row);
  }
}
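The kernels above all expand MAT_FUSED_MUL_ADD_BODY once per (ii, row, col) triple. As a rough host-side sketch of the batched 16x16 matrix multiply-accumulate the benchmark performs (the row-major layout and helper name here are assumptions for illustration; the authoritative indexing is the MAT_FUSED_MUL_ADD_BODY macro in MAT_FUSED_MUL_ADD.hpp):

```cpp
#include <cassert>
#include <vector>

// Host-side sketch: n_elem independent Ne x Ne matrix products, D = A * B.
// Row-major storage per batch entry is an assumption, not the suite's layout.
constexpr int Ne = 16;

void mat_fused_mul_add_ref(const std::vector<double>& A,
                           const std::vector<double>& B,
                           std::vector<double>& D, int n_elem) {
  for (int ii = 0; ii < n_elem; ++ii) {
    const int base = ii * Ne * Ne;          // offset of this batch entry
    for (int row = 0; row < Ne; ++row) {
      for (int col = 0; col < Ne; ++col) {
        double sum = 0.0;
        for (int k = 0; k < Ne; ++k) {      // dot product over shared dimension
          sum += A[base + row * Ne + k] * B[base + k * Ne + col];
        }
        D[base + row * Ne + col] = sum;
      }
    }
  }
}
```

On AMD matrix-core hardware (the HIP variant this PR targets), each such 16x16 product maps onto wavefront-wide MFMA builtins instead of the scalar loop above.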
template < size_t block_size >
void MAT_FUSED_MUL_ADD::runCudaVariantImpl(VariantID vid)
{
  const Index_type run_reps = getRunReps();
  const Index_type N = m_N;
  constexpr Index_type Ne = m_Ne;
  const Index_type N_Elem = N/(Ne*Ne);

  constexpr Index_type block_x = gpu_block_size::sqrt(block_size);
  constexpr Index_type block_y = gpu_block_size::sqrt(block_size);
  dim3 blockDim(block_x, block_y);
  dim3 gridDim(static_cast<size_t>(RAJA_DIVIDE_CEILING_INT(Ne, block_size)),
               static_cast<size_t>(RAJA_DIVIDE_CEILING_INT(Ne, block_size)),
               static_cast<size_t>(1));

  MAT_FUSED_MUL_ADD_DATA_SETUP;

  MAT_FUSED_MUL_ADD_DATA_INIT;

  if (vid == Base_CUDA) {

    MAT_FUSED_MUL_ADD_DATA_SETUP_CUDA;

    startTimer();
    for (RepIndex_type irep = 0; irep < run_reps; ++irep) {
      mat_fused_mul_add<block_size><<<dim3(gridDim), dim3(blockDim)>>>(A, B, D, N);
    }
    stopTimer();

    MAT_FUSED_MUL_ADD_DATA_TEARDOWN_CUDA;

  } else if (vid == Lambda_CUDA) {

    MAT_FUSED_MUL_ADD_DATA_SETUP_CUDA;

    startTimer();
    for (RepIndex_type irep = 0; irep < run_reps; ++irep) {

      auto mat_fused_lamda =
        [=] __device__ (Index_type ii, Index_type row, Index_type col) {
          MAT_FUSED_MUL_ADD_BODY;
        };
      mat_fused_lam<block_size, decltype(mat_fused_lamda)>
        <<<dim3(gridDim), dim3(blockDim)>>>(N, mat_fused_lamda);
    }
    stopTimer();

    MAT_FUSED_MUL_ADD_DATA_TEARDOWN_CUDA;

  } else if (vid == RAJA_CUDA) {

    MAT_FUSED_MUL_ADD_DATA_SETUP_CUDA;

    RAJA::RangeSegment row_range(0, Ne);
    RAJA::RangeSegment col_range(0, Ne);
    RAJA::RangeSegment ii_range(0, N_Elem);
    using EXEC_POL =
      RAJA::KernelPolicy<
        RAJA::statement::CudaKernel<
          RAJA::statement::For<2, RAJA::cuda_block_z_loop,
            RAJA::statement::Tile<1, RAJA::tile_fixed<block_size>, RAJA::cuda_block_y_direct,
              RAJA::statement::Tile<0, RAJA::tile_fixed<block_size>, RAJA::cuda_block_x_direct,
                RAJA::statement::For<1, RAJA::cuda_thread_y_direct,
                  RAJA::statement::For<0, RAJA::cuda_thread_x_direct,
                    RAJA::statement::Lambda<0>
                  >
                >
              >
            >
          >
        >
      >;

    startTimer();
    for (RepIndex_type irep = 0; irep < run_reps; ++irep) {
      RAJA::kernel<EXEC_POL>(RAJA::make_tuple(row_range, col_range, ii_range),
        [=] RAJA_DEVICE (Index_type row, Index_type col, Index_type ii) {
          MAT_FUSED_MUL_ADD_BODY;
        });
    }
    stopTimer();

    MAT_FUSED_MUL_ADD_DATA_TEARDOWN_CUDA;

  } else {
    getCout() << "\n  MAT_FUSED_MUL_ADD : Unknown Cuda variant id = " << vid
              << std::endl;
  }
}

RAJAPERF_GPU_BLOCK_SIZE_TUNING_DEFINE_BIOLERPLATE(MAT_FUSED_MUL_ADD, Cuda)

} // end namespace basic
} // end namespace rajaperf

#endif // RAJA_ENABLE_CUDA
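runCudaVariantImpl above splits the tuning parameter block_size into a square block_x x block_y thread block via gpu_block_size::sqrt, then ceil-divides the Ne-wide tile into blocks with RAJA_DIVIDE_CEILING_INT. A minimal stand-alone sketch of that arithmetic (isqrt and divide_ceiling here are stand-ins I wrote for the RAJAPerf/RAJA utilities, not their actual implementations):

```cpp
#include <cassert>
#include <cstddef>

// Integer square root: largest r with r*r <= n. Used to turn a 1-D block
// size (e.g. 256) into a square 2-D block (16 x 16).
constexpr std::size_t isqrt(std::size_t n) {
  std::size_t r = 0;
  while ((r + 1) * (r + 1) <= n) ++r;
  return r;
}

// Ceiling division: number of blocks needed to cover `num` items with
// blocks of `den` items each.
constexpr std::size_t divide_ceiling(std::size_t num, std::size_t den) {
  return (num + den - 1) / den;
}
```

For the default tuning (block_size = 256, Ne = 16), this yields a 16x16 thread block and a 1x1 grid, so one block covers a full 16x16 matrix tile.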
Review comment: Should this update be in another PR, to keep this one focused on the kernel? I try to keep cuda and hip in sync.

Reply (CRobeck): I can revert it and/or move it to a new branch. It got pulled in from one of your review comments, and I foresee adding a lambda_hip_forall using 2D thread indexing, so I wanted to get in front of it.

Review comment: Let's not rename it here.