case-study-kblpp.tex

% !TEX root = thesis.tex

\section{Multi-iteration Path Profiling}
\label{se:kblpp-case-studies}
In this section we consider examples of applications where $k$-iteration path profiling can reveal optimization opportunities or help developers comprehend relevant properties of a piece of software by identifying structured execution patterns that would be missed by an acyclic-path profiler. Our discussion is based on idealized examples found in real programs of the kind of behavior that can be exploited using multi-iteration path profiles. Since our methodology can be applied to different languages, we addressed both Java and C applications\footnote{We profiled Java programs using the k-BLPP tool described in \mysection\ref{ss:kblpp-implementation} and C programs with manual source code instrumentation based on a C implementation of the algorithms and data structures of \mysection\ref{ss:kblp-algorithms}, available at \url{http://www.dis.uniroma1.it/~demetres/kstream/}. }.

\subsection{Masked Convolution Filters in Image Processing}
\label{ss:convolution}

As a first example, we consider a classic application of convolution filters to image processing, addressing the problem of masked filtering that arises when the user applies a transformation to a collection of arbitrary-shaped subregions of the input image. A common scenario is face anonymization, illustrated in the example of \myfigure\ref{fig:CS-convolution-images}. The case study discussed in this section shows that $k$-iteration path profiling with large values of $k$ can identify regular patterns spanning multiple loop iterations that can be effectively exploited to speed up the code.

\begin{figure}[!hb]
\centering
\cprotect\fbox{
\begin{minipage}{10cm}
\begin{small}
\begin{verbatim}
 #define NEIGHBOR(m,i,dy,dx,w)                 \
     (*((m)+(i)+(dy)*(w)+(dx)))

 #define CONVOLUTION(i) do {                   \
     val  = NEIGHBOR(img_in, (i),              \
                     -2, -2, cols)*filter[0];  \
     val += NEIGHBOR(img_in, (i),              \
                     -2, -1, cols)*filter[1];  \
     ...
     val += NEIGHBOR(img_in, (i),
                     +2, +2, cols)*filter[24]; \
     val = val*factor+bias;                    \
     img_out[i] = (unsigned char)              \
        (val < 0 ? 0 : val > 255 ? 255 : val); \
 } while(0)

 void filter_conv(unsigned char* img_in,
                  unsigned char* img_out,
                  unsigned char* mask,
                  char filter[25],
                  double factor, double bias,
                  int rows, int cols) {
     int val;
     long n = rows*cols, i;

     for (i = 0; i < n; i++)
         if (mask[i]) img_out[i] = img_in[i];
         else CONVOLUTION(i);
 }
\end{verbatim}
\end{small}
\end{minipage}
}
\vspace{-2mm}
\caption{Masked image filtering code based on a convolution matrix.}
\label{fig:CS-convolution-orig}
\end{figure}

\noindent \myfigure\ref{fig:CS-convolution-orig} shows a C implementation of a masked image filtering algorithm based on a $5\times 5$ convolution matrix\footnote{The source code of our example is provided at \url{http://www.dis.uniroma1.it/~demetres/kstream/}.}. The function takes as input a grayscale input image (8-bit depth) and a black and white mask image that specifies the regions of the image to be filtered. \myfigure\ref{fig:CS-convolution-images} shows a sample input image (top), a mask image (center), and the output image (bottom) generated by the code of \myfigure\ref{fig:CS-convolution-orig} by applying a blur filter to the regions of the input image specified by the mask. Notice that the filter code iterates over all pixels of the input image and for each pixel checks whether the corresponding mask is black (zero) or white (non-zero, i.e., 255). If the mask is white, the original grayscale value is copied from the input image to the output image; otherwise, the grayscale value of the output pixel is computed by applying the convolution kernel to the neighborhood of the current pixel in the input image. To avoid expensive boundary checks in the convolution operation, the mask is preliminarily cropped so that all values near the border are white (this operation takes negligible time).

\myfigure\ref{fig:CS-convolution-kipf} shows a portion of the $10$-IPF forest containing the trees rooted at the BL path IDs that correspond to: the path entering the loop (ID=0), the copy branch taken in the loop body when the mask is non-zero (ID=1), and the convolution branch taken in the loop body when the mask is zero (ID=2). We generated the $10$-IPF on the workload of \myfigure\ref{fig:CS-convolution-images} and pruned it by removing all nodes whose counters are less than $0.01\%$ of the counter of their parents or less than $0.01\%$ of the counter of their roots. For each node $v$ in the forest, if $v$ has a counter that is $X\%$ of the counter of its parent and is $Y\%$ of the counter of the root, then the edge leading to $v$ is labeled with ``$X\% (Y\%)$''. A visual analysis of the forest shows that:
\begin{itemize}[itemsep=0pt,parsep=3pt]
\item the copy branch (1) is more frequent than the convolution branch (2);
\item 98.9\% of the times a copy branch (1) is taken, it is repeated consecutively at least 10 times, and only $0.1\%$ of the times is immediately followed by a convolution branch;
\item 95\% of the times a convolution branch (2) is taken, it is repeated consecutively at least 10 times, and only $0.6\%$ of the times is immediately followed by a copy branch.
\end{itemize}

\ifdefined\noauthorea
\begin{figure}[!t]
\centerline{\includegraphics[width=0.95\textwidth]{figures/CS-kipf-indiana/CS-kipf-indiana-top.eps}}
\vspace{6mm}
\centerline{\includegraphics[width=0.95\textwidth]{figures/CS-kipf-indiana/CS-kipf-indiana-bottom.eps}}
\caption{\protect\input{figures/CS-kipf-indiana/caption}}
\end{figure}
\fi

\ifdefined\noauthorea
\begin{figure}[!t]
\begingroup\setlength{\fboxsep}{0pt}
\centerline{\framebox{\includegraphics[width=0.6\textwidth]{figures/CS-indiana-pacers/CS-indiana-pacers-orig.png}}}
\vspace{2.5mm}
\centerline{\framebox{\includegraphics[width=0.6\textwidth]{figures/CS-indiana-pacers/CS-indiana-pacers-mask.png}}}
\vspace{2.5mm}
\centerline{\framebox{\includegraphics[width=0.6\textwidth]{figures/CS-indiana-pacers/CS-indiana-pacers-out.png}}}
\endgroup
\caption{\protect\input{figures/CS-indiana-pacers/caption}}
\end{figure}
\fi

\noindent This entails that both the copy and the convolution operations are repeated along long consecutive runs. The above properties are typical of masks used in face anonymization and other common image manipulations based on user-defined selections of portions of the image. The collected profiles suggest that consecutive iterations of the same branches may be selectively unrolled as shown in \myfigure\ref{fig:CS-convolution-opt}. Each iteration of the outer loop, designed for a 64-bit platform, works on 8 (rather than 1) pixels at a time. Three cases are possible:

\begin{enumerate}[itemsep=0pt,parsep=3pt]
\item the next 8 mask entries are all 255 (white): the 8 corresponding input pixel values are copied to the output image at once with a single assignment instruction;
\item the next 8 mask entries are all 0 (black): the kernel is applied sequentially to each of the next 8 input pixels;
\item the next 8 mask entries are mixed: an inner loop performs either copy or convolution on the corresponding pixels.
\end{enumerate}

\begin{figure}[!ht]
\centering
\cprotect\fbox{
\begin{minipage}{10cm}
\begin{small}
\begin{verbatim}
 for (i = 0; i < n-7; i += 8) {

    if (*(long*)(mask+i) == 0xFFFFFFFFFFFFFFFF)
       *(long*)(img_out+i) = *(long*)(img_in+i);

    else if (*(long*)(mask+i) == 0) {
       CONVOLUTION(i);
       CONVOLUTION(i+1);
       CONVOLUTION(i+2);
       CONVOLUTION(i+3);
       CONVOLUTION(i+4);
       CONVOLUTION(i+5);
       CONVOLUTION(i+6);
       CONVOLUTION(i+7);
    }

    else for (j = i; j < i+8; j++)
       if (mask[j]) img_out[j] = img_in[j];
       else CONVOLUTION(j);
 }
\end{verbatim}
\end{small}
\end{minipage}
}
\vspace{-2mm}
\caption{Optimized 64-bit version of the loop of \myfigure\ref{fig:CS-convolution-orig}.}
\label{fig:CS-convolution-opt}
\end{figure}

\subsubsection*{Performance Analysis}
To assess the benefits of the optimization performed in \myfigure\ref{fig:CS-convolution-opt}, we conducted several tests on recent commodity platforms (Intel Core 2 Duo, Intel Core i7, Linux and Mac OS X, 32 and 64 bits, {\tt gcc} -O3), considering a variety of sample images and masks with regions of different sizes and shapes. We obtained non-negligible speedups on all our tests, with a peak of about 21\% on the workload of \myfigure\ref{fig:CS-convolution-images} ($3114\times2376$ pixels) and about 30\% on a larger $9265\times7549$ image with a memory footprint of about 200 MB. In general, the higher the white entries in the mask, the faster the code, with larger speedups on more recent machines. As we expected, for entirely black masks the speedup was instead barely noticeable: this is due to the fact that the convolution operations are computationally demanding and tend to hide the benefits of loop unrolling.

\subsubsection*{Discussion}
The example discussed in this section shows that both the ability to profile paths across multiple iterations, and the possibility to handle large values of $k$ played a crucial role in optimizing the code. Indeed, acyclic-path profiling would count the number of times each branch is taken, but would not reveal that they appear in consecutive runs. Moreover, previous multi-iteration approaches that only handle very small values of $k$ would not capture the long runs that make the proposed optimization effective. 

For the example of \myfigure\ref{fig:CS-convolution-images}, an acyclic-path profile would indicate that the copy branch is taken 81.7\% of the time, but not how branches are interleaved. From this information, we would be able to deduce that the average length of a sequence of consecutive white values in the mask is $\ge$ 4. Our profile shows that 98.9\% of the time the actual length is at least 10, fully justifying our optimization that copies 8 bytes at a time. The advantage of $k$-iteration path profiling increases for masks with a more balanced ratio between white and black pixels: for a 50-50 ratio, an acyclic-path profile would indicate that the average length of consecutive white/black runs is $\ge$ 1, yielding no useful information for loop unrolling purposes.

The execution pattern where the same branches are repeatedly taken over consecutive loop iterations is common to several other applications, which may benefit from optimizations that take advantage of long repeated runs. For instance, the {\tt LBM\_performStreamCollide} function of the {\tt lbm} benchmark from the \speccpu\ suite~\cite{Henning06}  iterates over a 3D domain, simulating incompressible fluid dynamics based on the Lattice Boltzmann Method. An input geometry file specifies obstacles that determine a steady state solution. The loop contains branches that depend upon the currently scanned cell, which alternates between obstacles and void regions of the domain, producing a \kipf\ similar to that of \myfigure\ref{fig:CS-convolution-kipf} on typical workloads.

\subsection{Instruction Scheduling}
\label{ss:instr-scheduling}

Young and Smith~\cite{Young98} have shown that path profiles spanning multiple loop iterations can be used to improve the construction of superblocks in trace schedulers.

Global instruction scheduling groups and orders the instructions of a program in order to match the hardware resource constraints when they are fetched. In particular, trace schedulers rely on the identification of traces (i.e., sequences of basic blocks) that are frequently executed. These traces are then extended by appending extra copies of likely successors blocks, in order to form a larger pool of instructions for reordering. A trace that is likely to complete is clearly preferable, since instructions moved before an early exit point are wasted work.

{\em Superblocks} are defined as sequences of basic blocks with a single entry point and multiple exit points; they are useful for maintaining the original program semantics during a global code motion. Superblock formation is usually driven by edge profiles: however, path profiles usually provide better information to determine which traces are worthwhile to enlarge (i.e., those for which execution reaches the ending block most of the time). \myfigure\ref{fig:CS-sched-example} shows how superblock construction may benefit from path profiling information for two different behaviors, characterized by the same edge profile, of a {\tt do} $\ldots$ {\tt while} loop.

\ifdefined\noauthorea
\begin{figure}[!ht]
\begin{center}
\includegraphics[width=0.75\textwidth]{figures/CS-sched-example/CS-sched-example.eps}
\caption{\protect\input{figures/CS-sched-example/caption}}
\end{center}
\end{figure}
\fi

\noindent Path profiling techniques that do not span multiple loop iterations chop execution traces into pieces separated at back edges, hence the authors collect execution frequencies for {\em general paths}~\cite{Young98thesis}, which contain any contiguous sequences of CFG edges up to a limiting path length; they use a path length of 15 branches in the experiments.

\subsubsection*{Example}
Phased and alternating behaviors as in \myfigure\ref{fig:CS-sched-example} are quite common among many applications, thus offering interesting optimization opportunities. For instance, the convolution filter discussed in the previous section is a clear example of phased behavior. An alternating behavior is shown by the {\tt checkTaskTag} method of class {\tt Scanner} in the {\tt org.eclipse.jdt.internal.compiler.parser} package of the {\tt eclipse} benchmark included in the {\tt DaCapo} release 2006-10-MR2. In \myfigure\ref{fig:CS-sched-eclipse} we show a subtree of the $11$-IPF generated for this method; in the subtree, we pruned all nodes with counters less than 10\% of the counter of the root. Notice that after executing the BL path with ID $38$, in $66\%$ of the executions the program continues with $86$, and in $28\%$ of the executions with BL path $87$. When $86$ follows $38$, in $100\%$ of the executions the control flow takes the path $\langle 86, 86, 86, 755\rangle$, which spans four loop iterations and may be successfully unrolled to perform instruction scheduling. Interestingly, sequence $\langle 38, 86, 86, 86, 755, 38,$ $86, 86, 86, 755, 38\rangle$ of 11 BL path IDs, highlighted in \myfigure\ref{fig:CS-sched-eclipse}, accounts for more than $50\%$ of all executions of the first BL path in the sequence, showing that sequence $\langle 38, 86, 86, 86, 755\rangle$ is likely to be repeated consecutively more than once.

\ifdefined\noauthorea
\begin{figure}[!ht]
\begin{center}
\includegraphics[width=0.65\textwidth]{figures/CS-sched-eclipse/CS-sched-eclipse.eps}
\caption{\protect\input{figures/CS-sched-eclipse/caption}}
\end{center}
\end{figure}
\fi

\subsubsection*{Discussion}
The work presented in~\cite{Young98} focused on assessing the benefits of using general paths for global instruction scheduling, rather than on how to profile them. As we have seen in \mysection\ref{ss:kblpp-related}, compared to our approach the technique proposed by Young~\cite{Young98thesis} for profiling general paths scales poorly for increasing path lengths both in terms of space usage and running time. %In addition to making profile-guided optimizations more efficient in conventional compilers, 
We believe that our method, by substantially reducing the overhead of cyclic-path profiling, has the potential to provide a useful ingredient for making profile-guided global instruction scheduling more efficient in modern compilers.

%\missing\ JIT? trace stitching?