From 99dfb62362dd0e28ce1f873b1585adad4183d184 Mon Sep 17 00:00:00 2001 From: Arturo Vargas Date: Tue, 17 Sep 2024 14:41:55 -0700 Subject: [PATCH 01/16] add note about unit stride to layout --- .../user_guide/tutorial/view_layout.rst | 120 ++++++++++-------- 1 file changed, 65 insertions(+), 55 deletions(-) diff --git a/docs/sphinx/user_guide/tutorial/view_layout.rst b/docs/sphinx/user_guide/tutorial/view_layout.rst index 252a9fe8b1..2132cfa222 100644 --- a/docs/sphinx/user_guide/tutorial/view_layout.rst +++ b/docs/sphinx/user_guide/tutorial/view_layout.rst @@ -22,8 +22,8 @@ from the build directory. Key RAJA features shown in this section are: - * ``RAJA::View`` - * ``RAJA::Layout`` and ``RAJA::OffsetLayout`` constructs + * ``RAJA::View`` + * ``RAJA::Layout`` and ``RAJA::OffsetLayout`` constructs * Layout permutations The examples in this section illustrate RAJA View and Layout concepts @@ -40,11 +40,11 @@ operation, using :math:`N \times N` matrices: :end-before: _cstyle_matmult_end :language: C++ -As is commonly done for efficiency in C and C++, we have allocated the data -for the matrices as one-dimensional arrays. Thus, we need to manually compute +As is commonly done for efficiency in C and C++, we have allocated the data +for the matrices as one-dimensional arrays. Thus, we need to manually compute the data pointer offsets for the row and column indices in the kernel. Here, we use the array ``Cref`` to hold a reference solution matrix that -we use to compare with results generated by the examples below. +we use to compare with results generated by the examples below. To simplify the multi-dimensional indexing, we can use ``RAJA::View`` objects, which we define as: @@ -55,20 +55,31 @@ which we define as: :language: C++ Here we define three ``RAJA::View`` objects, 'Aview', 'Bview', and 'Cview', -that *wrap* the array data pointers, 'A', 'B', and 'C', respectively. 
We -pass a data pointer as the first argument to each view constructor and then +pass a data pointer as the first argument to each view constructor and then the extent of each matrix dimension as the second and third arguments. There are two extent arguments since we indicate two dimensions in the ``RAJA::Layout`` template -parameter list. The matrices are square and each extent is 'N'. Here, the -template parameters to ``RAJA::View`` are the array data type 'double' and +parameter list. The matrices are square and each extent is 'N'. Here, the +template parameters to ``RAJA::View`` are the array data type 'double' and a ``RAJA::Layout`` type. Specifically:: RAJA::Layout<2, int> -means that each View represents a two-dimensional default data layout, and -that we will use values of type 'int' to index into the arrays. +means that each View represents a two-dimensional default data layout, and +that we will use values of type 'int' to index into the arrays. -Using the ``RAJA::View`` objects, we can access the data entries for the rows +.. note:: A third argument in the Layout type can be used to specify the index + with unit stride:: + + RAJA::Layout<2, int, 1> + + In the example above, index 1 will be marked as having unit stride, making + multi-dimensional indexing more efficient by avoiding multiplication by + `1` when it is unnecessary. + + + +Using the ``RAJA::View`` objects, we can access the data entries for the rows and columns using a more natural, less error-prone syntax: .. literalinclude:: ../../../../exercises/view-layout_solution.cpp :start-after: _matmult_views_start :end-before: _matmult_views_end :language: C++ Default Layouts Use Row-major Ordering ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The default data layout ordering in RAJA is *row-major*, which is the -convention for multi-dimensional array indexing in C and C++.
This means that -the rightmost index will be stride-1, the index to the left of the rightmost +The default data layout ordering in RAJA is *row-major*, which is the +convention for multi-dimensional array indexing in C and C++. This means that +the rightmost index will be stride-1, the index to the left of the rightmost index will have stride equal to the extent of the rightmost dimension, and so on. @@ -90,8 +101,8 @@ so on. see :ref:`feat-view-label` for more details. To illustrate the default data layout striding, we next show simple -one-, two-, and three-dimensional examples where the for-loop ordering -for the different dimensions is such that all data access is stride-1. We +one-, two-, and three-dimensional examples where the for-loop ordering +for the different dimensions is such that all data access is stride-1. We begin by defining some dimensions, allocate and initialize arrays: .. literalinclude:: ../../../../exercises/view-layout_solution.cpp @@ -99,7 +110,7 @@ begin by defining some dimensions, allocate and initialize arrays: :end-before: _default_views_init_end :language: C++ -The version of the array initialization kernel using a one-dimensional +The version of the array initialization kernel using a one-dimensional ``RAJA::View`` is: .. literalinclude:: ../../../../exercises/view-layout_solution.cpp @@ -107,7 +118,7 @@ The version of the array initialization kernel using a one-dimensional :end-before: _default_view1D_end :language: C++ -The version of the array initialization using a two-dimensional +The version of the array initialization using a two-dimensional ``RAJA::View`` is: .. literalinclude:: ../../../../exercises/view-layout_solution.cpp @@ -115,7 +126,7 @@ The version of the array initialization using a two-dimensional :end-before: _default_view2D_end :language: C++ -The three-dimensional version is: +The three-dimensional version is: .. 
literalinclude:: ../../../../exercises/view-layout_solution.cpp :start-after: _default_view3D_start @@ -126,16 +137,16 @@ It's worth repeating that the data array access in all three variants shown here using ``RAJA::View`` objects is stride-1 since we order the for-loops in the loop nests to match the row-major ordering. -RAJA Layout types support other data access patterns with different striding -orders, offsets, and permutations. To this point, we have used the default -Layout constructor. RAJA provides methods to generate Layouts for different -indexing patterns. We describe these in the next several sections. Next, we +RAJA Layout types support other data access patterns with different striding +orders, offsets, and permutations. To this point, we have used the default +Layout constructor. RAJA provides methods to generate Layouts for different +indexing patterns. We describe these in the next several sections. Next, we show how to permute the data striding order using permuted Layouts. Permuted Layouts Change Data Striding Order ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Every ``RAJA::Layout`` object has a permutation. When a permutation is not +Every ``RAJA::Layout`` object has a permutation. When a permutation is not specified at creation, a Layout will use the identity permutation. Here are examples where the identity permutation is explicitly provided. First, in two dimensions: @@ -153,10 +164,10 @@ Then, in three dimensions: :language: C++ These two examples access the data with stride-1 ordering, the same as in -the earlier examples, which is shown by the nested loop ordering. +the earlier examples, which is shown by the nested loop ordering. The identity permutation in two dimensions is '{0, 1}' and is '{0, 1, 2}' -for three dimensions. The method ``RAJA::make_permuted_layout`` is used to -create a ``RAJA::Layout`` object with a permutation. The method takes two +for three dimensions. 
The method ``RAJA::make_permuted_layout`` is used to +create a ``RAJA::Layout`` object with a permutation. The method takes two arguments, the extents of each dimension and the permutation. .. note:: If a permuted Layout is created with the *identity permutation* @@ -170,8 +181,8 @@ Next, we permute the striding order for the two-dimensional example: :language: C++ Read from right to left, the permutation '{1, 0}' specifies that the first -(zero) index 'i' is stride-1 and the second index (one) 'j' has stride equal -to the extent of the first Layout dimension 'Nx'. This is evident in the +(zero) index 'i' is stride-1 and the second index (one) 'j' has stride equal +to the extent of the first Layout dimension 'Nx'. This is evident in the for-loop ordering. Here is the three-dimensional case, where we have reversed the striding order @@ -182,7 +193,7 @@ using the permutation '{2, 1, 0}': :end-before: _perma_view3D_end :language: C++ -The data access remains stride-1 due to the for-loop reordering. For fun, +The data access remains stride-1 due to the for-loop reordering. For fun, here is another three-dimensional permutation: .. literalinclude:: ../../../../exercises/view-layout_solution.cpp @@ -197,8 +208,8 @@ Multi-dimensional Indices and Linear Indices ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ``RAJA::Layout`` types provide methods to convert between linear indices and -multi-dimensional indices and vice versa. Recall the Layout 'perm3a_layout' -from above that was created with the permutation '{2, 1, 0}'. To get the +multi-dimensional indices and vice versa. Recall the Layout 'perm3a_layout' +from above that was created with the permutation '{2, 1, 0}'. To get the linear index corresponding to the index triple '(1, 2, 0)', you can do this:: @@ -210,12 +221,12 @@ for linear index 7, you can do:: int i, j, k; perm3a_layout.toIndices(7, i, j, k); -This sets 'i' to 1, 'j' to 2, and 'k' to 0. +This sets 'i' to 1, 'j' to 2, and 'k' to 0. 
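The index conversions above can be sanity-checked without RAJA. The following standalone C++ sketch mimics the stride arithmetic a permuted ``RAJA::Layout`` performs; the helper names (`perm_strides`, `to_linear`, `to_indices`) are hypothetical, not RAJA API, and the extents Nx = 3, Ny = 5, Nz = 2 are assumed so that the linear indices 7 and 13 discussed in this section are reproduced:

```cpp
#include <array>
#include <cassert>

// Hypothetical helpers (not RAJA API) that mimic the index arithmetic of a
// permuted RAJA::Layout. The permutation lists dimensions from slowest to
// fastest: its rightmost entry names the stride-1 dimension.
std::array<int, 3> perm_strides(std::array<int, 3> extents,
                                std::array<int, 3> perm)
{
  std::array<int, 3> stride {};
  int s = 1;
  for (int p = 2; p >= 0; --p) {  // walk the permutation right to left
    stride[perm[p]] = s;
    s *= extents[perm[p]];
  }
  return stride;
}

// Multi-dimensional indices -> linear index, like layout(i, j, k).
int to_linear(std::array<int, 3> stride, int i, int j, int k)
{
  return i * stride[0] + j * stride[1] + k * stride[2];
}

// Linear index -> multi-dimensional indices, like layout.toIndices(lin, ...).
void to_indices(std::array<int, 3> extents, std::array<int, 3> perm,
                int lin, int& i, int& j, int& k)
{
  std::array<int, 3> idx {};
  for (int p = 2; p >= 0; --p) {  // peel off the fastest dimension first
    int d = perm[p];
    idx[d] = lin % extents[d];
    lin /= extents[d];
  }
  i = idx[0]; j = idx[1]; k = idx[2];
}
```

For example, `to_linear(perm_strides({{3, 5, 2}}, {{2, 1, 0}}), 1, 2, 0)` evaluates to 7, and swapping in the permutation '{1, 2, 0}' yields 13 for the same index triple.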
-Similarly for the Layout 'perm3b_layout', which was created with the +Similarly for the Layout 'perm3b_layout', which was created with the permutation '{1, 2, 0}':: - lin = perm3b_layout(1, 2, 0); + lin = perm3b_layout(1, 2, 0); sets 'lin' to 13 = 1 + 0 * Nx + 2 * Nx * Nz and:: perm3b_layout.toIndices(13, i, j, k); sets 'i' to 1, 'j' to 2, and 'k' to 0. -There are more examples in the exercise file associated with this section. +There are more examples in the exercise file associated with this section. Feel free to experiment with them. One important item to note is that, by default, there is no bounds checking on indices passed to a ``RAJA::View`` data access method or ``RAJA::Layout`` -index computation methods. Therefore, it is the responsibility of a user -to ensure that indices passed to ``RAJA::View`` and ``RAJA::Layoout`` -methods are in bounds to avoid accessing data outside -of the View or computing invalid indices. +index computation methods. Therefore, it is the responsibility of a user +to ensure that indices passed to ``RAJA::View`` and ``RAJA::Layoout`` +methods are in bounds to avoid accessing data outside +of the View or computing invalid indices. -.. note:: RAJA provides a CMake variable ``RAJA_ENABLE_BOUNDS_CHECK`` to +.. note:: RAJA provides a CMake variable ``RAJA_ENABLE_BOUNDS_CHECK`` to turn run time bounds checking on or off when the code is compiled. Enabling bounds checking is useful for debugging and to ensure your code is correct. However, when enabled, bounds checking adds noticeable run time overhead. So it should not be enabled for - a production build of your code. - + a production build of your code. + Offset Layouts Apply Offsets to Indices ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We first illustrate the concept of an offset with a C-style for-loop: .. literalinclude:: ../../../../exercises/view-layout_solution.cpp :start-after: _cstyle_offlayout1D_start :end-before: _cstyle_offlayout1D_end :language: C++ -Here, the for-loop runs from 'imin' to 'imax-1' (i.e., -5 to 5).
To avoid -out-of-bounds negative indexing, we subtract 'imin' (i.e., -5) from the loop -index 'i'. +Here, the for-loop runs from 'imin' to 'imax-1' (i.e., -5 to 5). To avoid +out-of-bounds negative indexing, we subtract 'imin' (i.e., -5) from the loop +index 'i'. To do the same thing with RAJA, we create a ``RAJA::OffsetLayout`` object and use it to index into the array: @@ -264,7 +275,7 @@ and use it to index into the array: :language: C++ ``RAJA::OffsetLayout`` is a different type than ``RAJA::Layout`` because -it contains offset information. The arguments to the +it contains offset information. The arguments to the ``RAJA::make_offset_layout`` method are the index bounds. As expected, the two dimensional case is similar. First, a C-style loop: @@ -284,7 +295,7 @@ and then the same operation using a ``RAJA::OffsetLayout`` object: Note that the first argument passed to ``RAJA::make_offset_layout`` contains the lower bounds for 'i' and 'j' and the second argument contains the upper bounds. Also, the 'j' index is stride-1 by default since we did not pass -a permutation to the ``RAJA::make_offset_layout`` method, which is the same +a permutation to the ``RAJA::make_offset_layout`` method, which is the same as the non-offset Layout usage. Just like ``RAJA::Layout`` has a permutation, so does ``RAJA::OffsetLayout``. @@ -293,11 +304,10 @@ Here is an example where we permute the (i, j) index stride ordering: .. literalinclude:: ../../../../exercises/view-layout_solution.cpp :start-after: _raja_permofflayout2D_start :end-before: _raja_permofflayout2D_end - :language: C++ + :language: C++ -The permutation '{1, 0}' is passed as the third argument to -``RAJA::make_offset_layout``. From the ordering of the for-loops, we can see -that the 'i' index is stride-1 and the 'j' index has stride equal to the -extent of the 'i' dimension so the for-loop nest strides through +The permutation '{1, 0}' is passed as the third argument to +``RAJA::make_offset_layout``. 
From the ordering of the for-loops, we can see +that the 'i' index is stride-1 and the 'j' index has stride equal to the +extent of the 'i' dimension so the for-loop nest strides through the data with unit stride. - From 2b1ab410a67f794865ced2691a8e54431db7b9ab Mon Sep 17 00:00:00 2001 From: Arturo Vargas Date: Wed, 18 Sep 2024 13:29:18 -0700 Subject: [PATCH 02/16] Update docs/sphinx/user_guide/tutorial/view_layout.rst --- docs/sphinx/user_guide/tutorial/view_layout.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/sphinx/user_guide/tutorial/view_layout.rst b/docs/sphinx/user_guide/tutorial/view_layout.rst index 2132cfa222..8cd26f59b3 100644 --- a/docs/sphinx/user_guide/tutorial/view_layout.rst +++ b/docs/sphinx/user_guide/tutorial/view_layout.rst @@ -240,7 +240,7 @@ Feel free to experiment with them. One important item to note is that, by default, there is no bounds checking on indices passed to a ``RAJA::View`` data access method or ``RAJA::Layout`` index computation methods. Therefore, it is the responsibility of a user -to ensure that indices passed to ``RAJA::View`` and ``RAJA::Layoout`` +to ensure that indices passed to ``RAJA::View`` and ``RAJA::Layout`` methods are in bounds to avoid accessing data outside of the View or computing invalid indices. 
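The offset behavior documented in the patch above can be sketched in a few lines of plain C++. This is an illustration of the idea, not the RAJA implementation; `Offset1D` and `fill_offset_example` are hypothetical names, and the bounds imin = -5, imax = 6 (exclusive) are assumed so the loop matches the -5..5 range described in the text:

```cpp
#include <cassert>
#include <vector>

// Minimal sketch (not the RAJA implementation) of the shift a 1D
// RAJA::OffsetLayout applies: a logical index in [begin, end) is mapped to
// memory slot (index - begin), so negative logical indices become legal.
struct Offset1D {
  int begin;  // lowest valid logical index, e.g. imin = -5
  int operator()(int i) const { return i - begin; }
};

// Reproduce the C-style loop from the text: fill an 11-entry array using
// logical indices -5..5 (assumed bounds imin = -5, imax = 6, exclusive).
std::vector<int> fill_offset_example()
{
  const int imin = -5;
  const int imax = 6;
  Offset1D layout {imin};
  std::vector<int> ao(imax - imin, 0);
  for (int i = imin; i < imax; ++i) {
    ao[layout(i)] = i;  // layout(-5) == 0, ..., layout(5) == 10
  }
  return ao;
}
```

Calling `fill_offset_example()` returns an 11-entry array whose first slot holds the value for logical index -5, mirroring what ``RAJA::make_offset_layout`` arranges automatically.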
From d5d3b83757e25b0895ae2a07ddab80c80e2b60af Mon Sep 17 00:00:00 2001 From: Arturo Vargas Date: Wed, 9 Oct 2024 16:36:23 -0700 Subject: [PATCH 03/16] indicate unit stride in exercises --- exercises/view-layout.cpp | 18 +++++++++--------- exercises/view-layout_solution.cpp | 30 +++++++++++++++--------------- 2 files changed, 24 insertions(+), 24 deletions(-) diff --git a/exercises/view-layout.cpp b/exercises/view-layout.cpp index 0f9383e95e..6714fe66fb 100644 --- a/exercises/view-layout.cpp +++ b/exercises/view-layout.cpp @@ -105,9 +105,9 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) // Note: we use default Layout // // _matmult_views_start - RAJA::View< double, RAJA::Layout<2, int> > Aview(A, N, N); - RAJA::View< double, RAJA::Layout<2, int> > Bview(B, N, N); - RAJA::View< double, RAJA::Layout<2, int> > Cview(C, N, N); + RAJA::View< double, RAJA::Layout<2, int, 1> > Aview(A, N, N); + RAJA::View< double, RAJA::Layout<2, int, 1> > Bview(B, N, N); + RAJA::View< double, RAJA::Layout<2, int, 1> > Cview(C, N, N); // _matmult_views_end // _cstyle_matmult_views_start @@ -165,7 +165,7 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) std::memset(a, 0, Ntot * sizeof(int)); // _default_view1D_start - RAJA::View< int, RAJA::Layout<1, int> > view_1D(a, Ntot); + RAJA::View< int, RAJA::Layout<1, int, 0> > view_1D(a, Ntot); for (int i = 0; i < Ntot; ++i) { view_1D(i) = i; @@ -182,7 +182,7 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) std::memset(a, 0, Ntot * sizeof(int)); // _default_view2D_start - RAJA::View< int, RAJA::Layout<2, int> > view_2D(a, Nx, Ny); + RAJA::View< int, RAJA::Layout<2, int, 1> > view_2D(a, Nx, Ny); int iter{0}; for (int i = 0; i < Nx; ++i) { @@ -229,9 +229,9 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) // _default_perm_view2D_start std::array defperm2 {{0, 1}}; - RAJA::Layout< 2, int > defperm2_layout = + RAJA::Layout< 2, int> defperm2_layout = 
RAJA::make_permuted_layout( {{Nx, Ny}}, defperm2); - RAJA::View< int, RAJA::Layout<2, int> > defperm_view_2D(a, defperm2_layout); + RAJA::View< int, RAJA::Layout<2, int, 1> > defperm_view_2D(a, defperm2_layout); iter = 0; for (int i = 0; i < Nx; ++i) { @@ -272,7 +272,7 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) std::array perm2 {{1, 0}}; RAJA::Layout< 2, int > perm2_layout = RAJA::make_permuted_layout( {{Nx, Ny}}, perm2); - RAJA::View< int, RAJA::Layout<2, int> > perm_view_2D(a, perm2_layout); + RAJA::View< int, RAJA::Layout<2, int, 0> > perm_view_2D(a, perm2_layout); iter = 0; for (int j = 0; j < Ny; ++j) { @@ -318,7 +318,7 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) std::array perm3b {{1, 2, 0}}; RAJA::Layout< 3, int > perm3b_layout = RAJA::make_permuted_layout( {{Nx, Ny, Nz}}, perm3b); - RAJA::View< int, RAJA::Layout<3, int> > perm3b_view_3D(a, perm3b_layout); + RAJA::View< int, RAJA::Layout<3, int, 0> > perm3b_view_3D(a, perm3b_layout); iter = 0; for (int j = 0; j < Ny; ++j) { diff --git a/exercises/view-layout_solution.cpp b/exercises/view-layout_solution.cpp index 7614c993a8..e6a2788b6c 100644 --- a/exercises/view-layout_solution.cpp +++ b/exercises/view-layout_solution.cpp @@ -102,12 +102,12 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) // // Define RAJA View objects to simplify access to the matrix entries. 
// - // Note: we use default Layout + // Note: we use default Layout and specify unit stride // // _matmult_views_start - RAJA::View< double, RAJA::Layout<2, int> > Aview(A, N, N); - RAJA::View< double, RAJA::Layout<2, int> > Bview(B, N, N); - RAJA::View< double, RAJA::Layout<2, int> > Cview(C, N, N); + RAJA::View< double, RAJA::Layout<2, int, 1> > Aview(A, N, N); + RAJA::View< double, RAJA::Layout<2, int, 1> > Bview(B, N, N); + RAJA::View< double, RAJA::Layout<2, int, 1> > Cview(C, N, N); // _matmult_views_end // _cstyle_matmult_views_start @@ -165,7 +165,7 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) std::memset(a, 0, Ntot * sizeof(int)); // _default_view1D_start - RAJA::View< int, RAJA::Layout<1, int> > view_1D(a, Ntot); + RAJA::View< int, RAJA::Layout<1, int, 0> > view_1D(a, Ntot); for (int i = 0; i < Ntot; ++i) { view_1D(i) = i; @@ -182,7 +182,7 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) std::memset(a, 0, Ntot * sizeof(int)); // _default_view2D_start - RAJA::View< int, RAJA::Layout<2, int> > view_2D(a, Nx, Ny); + RAJA::View< int, RAJA::Layout<2, int, 1> > view_2D(a, Nx, Ny); int iter{0}; for (int i = 0; i < Nx; ++i) { @@ -203,7 +203,7 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) std::memset(a, 0, Ntot * sizeof(int)); // _default_view3D_start - RAJA::View< int, RAJA::Layout<3, int> > view_3D(a, Nx, Ny, Nz); + RAJA::View< int, RAJA::Layout<3, int, 2> > view_3D(a, Nx, Ny, Nz); iter = 0; for (int i = 0; i < Nx; ++i) { @@ -235,9 +235,9 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) // _default_perm_view2D_start std::array defperm2 {{0, 1}}; - RAJA::Layout< 2, int > defperm2_layout = + RAJA::Layout< 2, int> defperm2_layout = RAJA::make_permuted_layout( {{Nx, Ny}}, defperm2); - RAJA::View< int, RAJA::Layout<2, int> > defperm_view_2D(a, defperm2_layout); + RAJA::View< int, RAJA::Layout<2, int, 1> > defperm_view_2D(a, defperm2_layout); iter = 0; for (int i = 0; i < 
Nx; ++i) { @@ -261,7 +261,7 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) std::array defperm3 {{0, 1, 2}}; RAJA::Layout< 3, int > defperm3_layout = RAJA::make_permuted_layout( {{Nx, Ny, Nz}}, defperm3); - RAJA::View< int, RAJA::Layout<3, int> > defperm_view_3D(a, defperm3_layout); + RAJA::View< int, RAJA::Layout<3, int, 2> > defperm_view_3D(a, defperm3_layout); iter = 0; for (int i = 0; i < Nx; ++i) { @@ -286,9 +286,9 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) // _perm_2D_start std::array perm2 {{1, 0}}; - RAJA::Layout< 2, int > perm2_layout = + RAJA::Layout< 2, int> perm2_layout = RAJA::make_permuted_layout( {{Nx, Ny}}, perm2); - RAJA::View< int, RAJA::Layout<2, int> > perm_view_2D(a, perm2_layout); + RAJA::View< int, RAJA::Layout<2, int, 0> > perm_view_2D(a, perm2_layout); iter = 0; for (int j = 0; j < Ny; ++j) { @@ -310,9 +310,9 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) // _perma_view3D_start std::array perm3a {{2, 1, 0}}; - RAJA::Layout< 3, int > perm3a_layout = + RAJA::Layout< 3, int> perm3a_layout = RAJA::make_permuted_layout( {{Nx, Ny, Nz}}, perm3a); - RAJA::View< int, RAJA::Layout<3, int> > perm3a_view_3D(a, perm3a_layout); + RAJA::View< int, RAJA::Layout<3, int, 0> > perm3a_view_3D(a, perm3a_layout); iter = 0; for (int k = 0; k < Nz; ++k) { @@ -338,7 +338,7 @@ int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[])) std::array perm3b {{1, 2, 0}}; RAJA::Layout< 3, int > perm3b_layout = RAJA::make_permuted_layout( {{Nx, Ny, Nz}}, perm3b); - RAJA::View< int, RAJA::Layout<3, int> > perm3b_view_3D(a, perm3b_layout); + RAJA::View< int, RAJA::Layout<3, int, 0> > perm3b_view_3D(a, perm3b_layout); iter = 0; for (int j = 0; j < Ny; ++j) { From 5781c3f5957771ffc1fed356db0cf0311f4cfe24 Mon Sep 17 00:00:00 2001 From: Arturo Vargas Date: Wed, 9 Oct 2024 16:49:34 -0700 Subject: [PATCH 04/16] add 3D unit stride permuted layout example --- 
docs/sphinx/user_guide/tutorial/view_layout.rst | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/docs/sphinx/user_guide/tutorial/view_layout.rst b/docs/sphinx/user_guide/tutorial/view_layout.rst index 8cd26f59b3..88d62ad16b 100644 --- a/docs/sphinx/user_guide/tutorial/view_layout.rst +++ b/docs/sphinx/user_guide/tutorial/view_layout.rst @@ -181,9 +181,9 @@ Next, we permute the striding order for the two-dimensional example: :language: C++ Read from right to left, the permutation '{1, 0}' specifies that the first -(zero) index 'i' is stride-1 and the second index (one) 'j' has stride equal -to the extent of the first Layout dimension 'Nx'. This is evident in the -for-loop ordering. +(zero) index 'i' is stride-1, additionally captured in the 'RAJA::Layout', +and the second index (one) 'j' has stride equal to the extent of the first +Layout dimension 'Nx'. This is evident in the for-loop ordering. Here is the three-dimensional case, where we have reversed the striding order using the permutation '{2, 1, 0}': @@ -193,6 +193,15 @@ using the permutation '{2, 1, 0}': :end-before: _perma_view3D_end :language: C++ +.. note:: As the unit-stride index is now index 0, we adjust the Layout template + argument accordingly:: + + RAJA::Layout<3, int, 0> + + As before, index 0 will be marked as having unit stride, making + multi-dimensional indexing more efficient by avoiding multiplication by + `1` when it is unnecessary. + The data access remains stride-1 due to the for-loop reordering.
For fun, here is another three-dimensional permutation: From 421f25eeda6f484f82ab19cce0ad872e48246c99 Mon Sep 17 00:00:00 2001 From: Arturo Vargas Date: Wed, 9 Oct 2024 16:55:02 -0700 Subject: [PATCH 05/16] change mark up style --- docs/sphinx/user_guide/tutorial/view_layout.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/sphinx/user_guide/tutorial/view_layout.rst b/docs/sphinx/user_guide/tutorial/view_layout.rst index 88d62ad16b..4d778ad941 100644 --- a/docs/sphinx/user_guide/tutorial/view_layout.rst +++ b/docs/sphinx/user_guide/tutorial/view_layout.rst @@ -181,7 +181,7 @@ Next, we permute the striding order for the two-dimensional example: :language: C++ Read from right to left, the permutation '{1, 0}' specifies that the first -(zero) index 'i' is stride-1, additionally captured in the 'RAJA::Layout', +(zero) index 'i' is stride-1, additionally captured in the ``RAJA::Layout``, and the second index (one) 'j' has stride equal to the extent of the first Layout dimension 'Nx'. This is evident in the for-loop ordering. 
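The commits above thread a unit-stride template argument through the documentation and exercises. A dependency-free C++ sketch of why that argument matters (the function names are hypothetical, not RAJA internals): when the stride-1 index is known at compile time, the address computation can drop a multiplication by 1.

```cpp
#include <cassert>

// Sketch (not RAJA internals) of 2D addressing for permutation {1, 0} over
// extents {Nx, Ny}: index 'i' is the stride-1 index and 'j' has stride Nx.
int linear_generic(int i, int j, int stride_i, int stride_j)
{
  return i * stride_i + j * stride_j;  // multiplies even when stride_i == 1
}

int linear_unit_i(int i, int j, int stride_j)
{
  return i + j * stride_j;             // unit stride known: no multiply by 1
}
```

For permutation '{1, 0}' with Nx = 3, both forms agree, e.g. both return 5 for (i, j) = (2, 1); declaring ``RAJA::Layout<2, int, 0>`` gives the layout this knowledge statically.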
From 0edd83c1a94397d88d52ae3358ae92b4c8432093 Mon Sep 17 00:00:00 2001 From: Alan Dayton <6393677+adayton1@users.noreply.github.com> Date: Mon, 14 Oct 2024 13:40:46 -0700 Subject: [PATCH 06/16] Builtin atomic fixes for 32 bit windows (#1746) Co-authored-by: Rich Hornung --- include/RAJA/policy/atomic_builtin.hpp | 31 ++++++++++++++++++++++++-- 1 file changed, 29 insertions(+), 2 deletions(-) diff --git a/include/RAJA/policy/atomic_builtin.hpp b/include/RAJA/policy/atomic_builtin.hpp index 34755fa49d..e43bd71386 100644 --- a/include/RAJA/policy/atomic_builtin.hpp +++ b/include/RAJA/policy/atomic_builtin.hpp @@ -22,7 +22,7 @@ #include -#if defined(RAJA_COMPILER_MSVC) || (defined(_WIN32) && defined(__INTEL_COMPILER)) +#if defined(RAJA_COMPILER_MSVC) || ((defined(_WIN32) || defined(_WIN64)) && defined(__INTEL_COMPILER)) #include #endif @@ -48,7 +48,7 @@ struct builtin_atomic { namespace detail { -#if defined(RAJA_COMPILER_MSVC) || (defined(_WIN32) && defined(__INTEL_COMPILER)) +#if defined(RAJA_COMPILER_MSVC) || ((defined(_WIN32) || defined(_WIN64)) && defined(__INTEL_COMPILER)) /*! @@ -120,11 +120,14 @@ RAJA_INLINE long builtin_atomicOr(long *acc, long value) return _InterlockedOr(acc, value); } +#if defined(_WIN64) + RAJA_INLINE long long builtin_atomicOr(long long *acc, long long value) { return _InterlockedOr64(acc, value); } +#endif /*! * Atomic load using atomic or @@ -155,11 +158,15 @@ RAJA_INLINE long builtin_atomicExchange(long *acc, long value) return _InterlockedExchange(acc, value); } +#if defined(_WIN64) + RAJA_INLINE long long builtin_atomicExchange(long long *acc, long long value) { return _InterlockedExchange64(acc, value); } +#endif + /*! 
* Atomic store using atomic exchange @@ -190,11 +197,15 @@ RAJA_INLINE long builtin_atomicCAS(long *acc, long compare, long value) return _InterlockedCompareExchange(acc, value, compare); } +#if defined(_WIN64) + RAJA_INLINE long long builtin_atomicCAS(long long *acc, long long compare, long long value) { return _InterlockedCompareExchange64(acc, value, compare); } +#endif + /*! * Atomic addition using intrinsics @@ -214,11 +225,15 @@ RAJA_INLINE long builtin_atomicAdd(long *acc, long value) return _InterlockedExchangeAdd(acc, value); } +#if defined(_WIN64) + RAJA_INLINE long long builtin_atomicAdd(long long *acc, long long value) { return _InterlockedExchangeAdd64(acc, value); } +#endif + /*! * Atomic subtraction using intrinsics @@ -238,11 +253,15 @@ RAJA_INLINE long builtin_atomicSub(long *acc, long value) return _InterlockedExchangeAdd(acc, -value); } +#if defined(_WIN64) + RAJA_INLINE long long builtin_atomicSub(long long *acc, long long value) { return _InterlockedExchangeAdd64(acc, -value); } +#endif + /*! * Atomic and using intrinsics @@ -262,11 +281,15 @@ RAJA_INLINE long builtin_atomicAnd(long *acc, long value) return _InterlockedAnd(acc, value); } +#if defined(_WIN64) + RAJA_INLINE long long builtin_atomicAnd(long long *acc, long long value) { return _InterlockedAnd64(acc, value); } +#endif + /*! * Atomic xor using intrinsics @@ -286,11 +309,15 @@ RAJA_INLINE long builtin_atomicXor(long *acc, long value) return _InterlockedXor(acc, value); } +#if defined(_WIN64) + RAJA_INLINE long long builtin_atomicXor(long long *acc, long long value) { return _InterlockedXor64(acc, value); } +#endif + #else // RAJA_COMPILER_MSVC From 7de5378b4d341d45574ab346370f58523f9972bf Mon Sep 17 00:00:00 2001 From: Robert Chen Date: Tue, 15 Oct 2024 15:45:25 -0700 Subject: [PATCH 07/16] Fix typo in omp target new reducer. 
--- include/RAJA/policy/openmp_target/params/reduce.hpp | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/RAJA/policy/openmp_target/params/reduce.hpp b/include/RAJA/policy/openmp_target/params/reduce.hpp index 6127eef226..34c23fb5db 100644 --- a/include/RAJA/policy/openmp_target/params/reduce.hpp +++ b/include/RAJA/policy/openmp_target/params/reduce.hpp @@ -26,7 +26,7 @@ namespace detail { // Resolve template camp::concepts::enable_if< type_traits::is_target_openmp_policy > - resolve(Reducer& red) { + resolve(Reducer& red) { red.combineTarget(red.m_valop.val); } From 620ee492234b9a117fdace6fc77805942302622e Mon Sep 17 00:00:00 2001 From: Robert Chen Date: Tue, 15 Oct 2024 15:45:46 -0700 Subject: [PATCH 08/16] Turn off omp target multi reducer tests. --- test/unit/multi_reducer/CMakeLists.txt | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/test/unit/multi_reducer/CMakeLists.txt b/test/unit/multi_reducer/CMakeLists.txt index 6453fa66cb..94bbbc68d9 100644 --- a/test/unit/multi_reducer/CMakeLists.txt +++ b/test/unit/multi_reducer/CMakeLists.txt @@ -37,9 +37,10 @@ if(RAJA_ENABLE_OPENMP) list(APPEND BACKENDS OpenMP) endif() -if(RAJA_ENABLE_TARGET_OPENMP) - list(APPEND BACKENDS OpenMPTarget) -endif() +# Add this back in when OpenMP Target implementation exists for multi-reducer +#if(RAJA_ENABLE_TARGET_OPENMP) +# list(APPEND BACKENDS OpenMPTarget) +#endif() if(RAJA_ENABLE_CUDA) list(APPEND BACKENDS Cuda) From 13e24315e7299018cf0971b87506afa2d446e566 Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Thu, 17 Oct 2024 13:27:46 -0700 Subject: [PATCH 09/16] Modify build script and host-config file for Intel builds. 
--- host-configs/lc-builds/toss4/icpx_X.cmake | 4 ++-- scripts/lc-builds/toss4_icpx.sh | 28 ++++++++++++++++++++--- 2 files changed, 27 insertions(+), 5 deletions(-) diff --git a/host-configs/lc-builds/toss4/icpx_X.cmake b/host-configs/lc-builds/toss4/icpx_X.cmake index 2f5301bd22..a1499ce08d 100755 --- a/host-configs/lc-builds/toss4/icpx_X.cmake +++ b/host-configs/lc-builds/toss4/icpx_X.cmake @@ -8,8 +8,8 @@ set(RAJA_COMPILER "RAJA_COMPILER_ICC" CACHE STRING "") ##set(COMMON_FLAGS "--gcc-toolchain=/usr/tce/packages/gcc/gcc-10.3.1") -##set(COMMON_OPT_FLAGS "-march=native -finline-functions -fp-model=precise") -set(COMMON_OPT_FLAGS "-march=native -finline-functions") +set(COMMON_OPT_FLAGS "-march=native -finline-functions -fp-model=precise") +#set(COMMON_OPT_FLAGS "-march=native -finline-functions") ##set(CMAKE_CXX_FLAGS_RELEASE "${COMMON_FLAGS} -O3 ${COMMON_OPT_FLAGS}" CACHE STRING "") ##set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${COMMON_FLAGS} -O3 -g ${COMMON_OPT_FLAGS}" CACHE STRING "") diff --git a/scripts/lc-builds/toss4_icpx.sh b/scripts/lc-builds/toss4_icpx.sh index 88cc43d824..d7d7c7dd85 100755 --- a/scripts/lc-builds/toss4_icpx.sh +++ b/scripts/lc-builds/toss4_icpx.sh @@ -35,12 +35,17 @@ module load cmake/3.23.1 # times at a potential cost of slower 'forall' execution. ## -source /usr/tce/packages/intel/intel-${COMP_VER}/setvars.sh +if [[ ${COMP_VER} == 2024.2.1 ]] +then + source /collab/usr/global/tools/intel/toss_4_x86_64_ib/oneapi-2024.2.1/setvars.sh +else + source /usr/tce/packages/intel/intel-${COMP_VER}/setvars.sh +fi cmake \ -DCMAKE_BUILD_TYPE=Release \ - -DCMAKE_CXX_COMPILER=/usr/tce/packages/intel/intel-${COMP_VER}/bin/icpx \ - -DCMAKE_C_COMPILER=/usr/tce/packages/intel/intel-${COMP_VER}/bin/icx \ + -DCMAKE_CXX_COMPILER=icpx \ + -DCMAKE_C_COMPILER=icx \ -DBLT_CXX_STD=c++14 \ -C ../host-configs/lc-builds/toss4/icpx_X.cmake \ -DRAJA_ENABLE_FORCEINLINE_RECURSIVE=Off \ @@ -49,3 +54,20 @@ cmake \ -DCMAKE_INSTALL_PREFIX=../install_${BUILD_SUFFIX} \ "$@" \ .. 
+
+if [[ ${COMP_VER} == 2024.2.1 ]]
+then
+
+echo
+echo "***********************************************************************"
+echo
+echo "cd into directory build_${BUILD_SUFFIX} and run make to build RAJA"
+echo
+echo "To successfully build and run all tests, you may need to run the"
+echo "command to make sure your environment is set up properly:"
+echo
+echo " source /collab/usr/global/tools/intel/toss_4_x86_64_ib/oneapi-2024.2.1/setvars.sh"
+echo
+echo "***********************************************************************"
+
+fi

From ee749f97819aff5e4902ac19da2aab1fe920c7f6 Mon Sep 17 00:00:00 2001
From: "Adrien M. BERNEDE" <51493078+adrienbernede@users.noreply.github.com>
Date: Mon, 21 Oct 2024 11:13:54 +0200
Subject: [PATCH 10/16] Point at radiuss-spack-configs with new intel compiler and "fix" for intel 2023

---
 scripts/radiuss-spack-configs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/radiuss-spack-configs b/scripts/radiuss-spack-configs
index 00c06c2d02..e2f3fb4bdb 160000
--- a/scripts/radiuss-spack-configs
+++ b/scripts/radiuss-spack-configs
@@ -1 +1 @@
-Subproject commit 00c06c2d0258802fbf4a57ff987314d4acd9f629
+Subproject commit e2f3fb4bdb4e803ae32775c361b61b1c34eb3203

From de6a7831c91b21ac024be7bf7d2a870aa0358596 Mon Sep 17 00:00:00 2001
From: "Adrien M. BERNEDE" <51493078+adrienbernede@users.noreply.github.com>
Date: Mon, 21 Oct 2024 14:32:21 +0200
Subject: [PATCH 11/16] Point at new RSC main commit

---
 scripts/radiuss-spack-configs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/radiuss-spack-configs b/scripts/radiuss-spack-configs
index e2f3fb4bdb..522da3898f 160000
--- a/scripts/radiuss-spack-configs
+++ b/scripts/radiuss-spack-configs
@@ -1 +1 @@
-Subproject commit e2f3fb4bdb4e803ae32775c361b61b1c34eb3203
+Subproject commit 522da3898fcc8942f8eb3270f01c8212937589f5

From 9a936d3c39aa0a901ef54451f0b61871a371e776 Mon Sep 17 00:00:00 2001
From: "Adrien M. BERNEDE" <51493078+adrienbernede@users.noreply.github.com>
Date: Mon, 21 Oct 2024 17:26:33 +0200
Subject: [PATCH 12/16] use -O1 and fp-precise with intel 2023

---
 .gitlab/jobs/poodle.yml | 5 ++---
 .gitlab/jobs/ruby.yml   | 5 ++---
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/.gitlab/jobs/poodle.yml b/.gitlab/jobs/poodle.yml
index 3b2a174b97..8486266704 100644
--- a/.gitlab/jobs/poodle.yml
+++ b/.gitlab/jobs/poodle.yml
@@ -29,13 +29,12 @@ gcc_10_3_1:
     SPEC: " ~shared +openmp +omptask ~vectorization +tests %gcc@=10.3.1 ${PROJECT_POODLE_DEPS}"
   extends: .job_on_poodle
 
-# Known issue currently under investigation
+# custom variant
 # https://github.com/LLNL/RAJA/pull/1712#issuecomment-2292006843
 intel_2023_2_1:
   variables:
-    SPEC: "${PROJECT_POODLE_VARIANTS} %intel@=2023.2.1 ${PROJECT_POODLE_DEPS}"
+    SPEC: "${PROJECT_POODLE_VARIANTS} +lowopttest cxxflags==-fp-model=precise %intel@=2023.2.1 ${PROJECT_POODLE_DEPS}"
   extends: .job_on_poodle
-  allow_failure: true
 
 ############
 # Extra jobs

diff --git a/.gitlab/jobs/ruby.yml b/.gitlab/jobs/ruby.yml
index 2258878c3e..c745ca4a6c 100644
--- a/.gitlab/jobs/ruby.yml
+++ b/.gitlab/jobs/ruby.yml
@@ -29,13 +29,12 @@ gcc_10_3_1:
     SPEC: " ~shared +openmp +omptask ~vectorization +tests %gcc@=10.3.1 ${PROJECT_RUBY_DEPS}"
   extends: .job_on_ruby
 
-# Known issue currently under investigation
+# custom variant
 # https://github.com/LLNL/RAJA/pull/1712#issuecomment-2292006843
 intel_2023_2_1:
   variables:
-    SPEC: "${PROJECT_RUBY_VARIANTS} %intel@=2023.2.1 ${PROJECT_RUBY_DEPS}"
+    SPEC: "${PROJECT_RUBY_VARIANTS} +lowopttest cxxflags==-fp-model=precise %intel@=2023.2.1 ${PROJECT_RUBY_DEPS}"
   extends: .job_on_ruby
-  allow_failure: true
 
 ############
 # Extra jobs

From 4bcc2e3de9a70011ce244a7559b6e41a0014ace3 Mon Sep 17 00:00:00 2001
From: Rich Hornung
Date: Thu, 24 Oct 2024 10:45:05 -0700
Subject: [PATCH 13/16] Pull in branch with RAJAPerf package fix for SYCL CI job

---
 scripts/radiuss-spack-configs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/radiuss-spack-configs b/scripts/radiuss-spack-configs
index 522da3898f..26c2a9eb3f 160000
--- a/scripts/radiuss-spack-configs
+++ b/scripts/radiuss-spack-configs
@@ -1 +1 @@
-Subproject commit 522da3898fcc8942f8eb3270f01c8212937589f5
+Subproject commit 26c2a9eb3f1c4a83d07092c7e41d42ec1334a350

From e6bfd3ea86c62073a766f93991dfb0217efedba9 Mon Sep 17 00:00:00 2001
From: "Adrien M. BERNEDE" <51493078+adrienbernede@users.noreply.github.com>
Date: Fri, 25 Oct 2024 23:07:32 +0200
Subject: [PATCH 14/16] From RSC: Set both ENABLE_SYCL and RAJA_ENABLE_SYCL

---
 scripts/radiuss-spack-configs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/radiuss-spack-configs b/scripts/radiuss-spack-configs
index 26c2a9eb3f..30e4692779 160000
--- a/scripts/radiuss-spack-configs
+++ b/scripts/radiuss-spack-configs
@@ -1 +1 @@
-Subproject commit 26c2a9eb3f1c4a83d07092c7e41d42ec1334a350
+Subproject commit 30e46927790da4c64700d45d207121515537828f

From 06e9df69f014060f3184d4e111f746e839435ff2 Mon Sep 17 00:00:00 2001
From: Rich Hornung
Date: Tue, 29 Oct 2024 11:17:43 -0700
Subject: [PATCH 15/16] Pull in branch of radiuss-spack-configs with fixes for SYCL CI

---
 scripts/radiuss-spack-configs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/radiuss-spack-configs b/scripts/radiuss-spack-configs
index 30e4692779..1d430d0c79 160000
--- a/scripts/radiuss-spack-configs
+++ b/scripts/radiuss-spack-configs
@@ -1 +1 @@
-Subproject commit 30e46927790da4c64700d45d207121515537828f
+Subproject commit 1d430d0c798abbada6bc65aafbf4a7b3409987d4

From 40a0e9fbb549766ec26f6f0264303c97c83ca876 Mon Sep 17 00:00:00 2001
From: Rich Hornung
Date: Thu, 31 Oct 2024 11:40:45 -0700
Subject: [PATCH 16/16] Pull in new radiuss-spack-configs main which has fix in RAJAPerf packages

---
 scripts/radiuss-spack-configs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/radiuss-spack-configs b/scripts/radiuss-spack-configs
index 1d430d0c79..9634711c8b 160000
--- a/scripts/radiuss-spack-configs
+++ b/scripts/radiuss-spack-configs
@@ -1 +1 @@
-Subproject commit 1d430d0c798abbada6bc65aafbf4a7b3409987d4
+Subproject commit 9634711c8bc0e8cbc6a4ae4c4fe81161d48d5d12
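The version-conditional `setvars.sh` sourcing that patch 09 adds to `scripts/lc-builds/toss4_icpx.sh` can be sketched as a small helper. This is a minimal sketch, not the patch itself: `pick_setvars` is a hypothetical name used only for illustration, and it uses a portable `[ ... ]` test where the patch uses bash's `[[ ... ]]`; the two install paths are the ones the patch uses.

```shell
# Sketch of the compiler-version dispatch from patch 09 (toss4_icpx.sh).
# pick_setvars is a hypothetical helper name; the paths match the patch.
pick_setvars() {
  comp_ver="$1"
  if [ "${comp_ver}" = "2024.2.1" ]; then
    # oneAPI 2024.2.1 lives outside the usual /usr/tce TCE prefix.
    echo "/collab/usr/global/tools/intel/toss_4_x86_64_ib/oneapi-2024.2.1/setvars.sh"
  else
    echo "/usr/tce/packages/intel/intel-${comp_ver}/setvars.sh"
  fi
}

# A build script would then do:
#   source "$(pick_setvars "${COMP_VER}")"
pick_setvars "2024.2.1"
pick_setvars "2023.2.1"
```

Centralizing the dispatch in one helper keeps the special case for the out-of-prefix 2024.2.1 install in a single place, which is why the patch also prints a reminder to re-source that same path before running tests.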