conclusions.tex

% !TEX root = thesis.tex

\chapter{Conclusions and Future Work}
\label{ch:conclusions}

We are confident that the ideas presented in this thesis can contribute to the advances in adaptive optimization technology for modern runtimes.

Collecting accurate profiling information with low overhead is a crucial factor for making online complex optimizations practical. In \mychapter\ref{ch:profiling} we have presented two analysis techniques that rely on elegant algorithmic solutions to profile data coming at a high rate from a large universe.

Our inter-procedural profiling technique enables the identification of most frequently encountered calling contexts without having to maintain the whole calling context tree in main memory - which we show can be impractical for real-world applications. We propose a compact data structure, the {\em Hot Calling Context Tree} (HCCT), that can be constructed in small space thanks to the adoption of efficient data streaming algorithms. These algorithms provide us with strong theoretical guarantees on the accuracy and the space requirements of the solution, and operate with a constant per-item processing time.

Our intra-procedural profiling technique extends the well-known Ball-Larus algorithm to cyclic-path profiles, in order to yield more optimization opportunities. We show that cyclic paths can be represented as concatenations of acyclic Ball-Larus paths, and that a prefix forest can compactly encode them. We then introduce an intermediate data structure, the $k$-slab forest (\ksf), that can be constructed online with a constant per-item processing time and converted to a prefix forest on demand.

The algorithms behind our two profiling techniques have been implemented in mainstream systems and evaluated against prominent benchmarks. Theoretical predictions are thus reinforced by promising experimental results, showing that our techniques can be used in practical scenarios where previous solutions failed.

In \mychapter\ref{ch:continuous} we have then focused our attention on a main player of adaptive optimization cycles, {\em On-Stack Replacement} (OSR), which enables runtimes to divert the execution to freshly generated optimized code using profiling information, or to deoptimize to a different version of the code when conditions change, e.g., when program behavior starts to diverge from the profile significantly.

OSR is not only a great engineering problem, but also an intellectually challenging endeavor. We thus tackle the problem from both a practical and theoretical perspective. We present a new framework for on-stack replacement that combines some of the best practices in the literature, such as platform independence and the generation of highly optimized continuation functions, with two novel features: the ability to perform OSR at any program location, and a compensation code abstraction to encode changes to the program state, thus increasing the flexibility the OSR mechanism. Experimental results collected on classic benchmarks for our \osrkit\ embodiment in the LLVM compiler toolchain suggest that encoding OSR at intermediate representation level allows the compiler to generate very efficient machinery with a hardly noticeable impact on performance. As the ideas behind our OSR framework are general, we do not foresee any limitation to its adoption in other runtime environments as well.

In the second part of \mychapter\ref{ch:continuous}, we make a first attempt to prove OSR transitions sound. To capture OSR in its full generality, we define a notion of multi-program, and let an oracle decide at each program step in which version of the multi-program execution should continue. We distill the essence of OSR to a simple imperative calculus with an operational semantics. Using program bisimulation, we prove that an OSR can correctly divert execution across program versions if they have been generated using live-variable equivalent transformations. We also present an algorithm that can relieve code optimizers from the burden of generating all the required glue machinery to realign the state during an OSR transition.

There is a trade-off between the number of points where OSR can be correctly fired and the price to pay in terms of space and time in order to support them. Our work lies at the performance-preserving end of the spectrum, supporting OSR transitions in constant time and space: we do not restrict optimizations, and do not require any state logging. To assess the practical impact of this design choice, we analyze experimentally the fraction of program locations where OSR can be efficiently fired in prominent benchmarks across several LLVM optimization passes. Our experiments suggest that bidirectional OSR transitions between rather different program versions can be supported almost everywhere in the code under several classic optimizations.

Finally, we present a number of case studies to investigate the end-to-end utility of the techniques described in this thesis. All of our code is publicly available and, for \kblpp\ and \osrkit, has been endorsed along with the associated experiments in the Artifact Evaluation process of known conferences on programming languages.

\section*{Future Work}
The methodologies and ideas presented in this thesis leave a number of interesting open questions that we hope to address in future work.

We believe that a careful use of data mining techniques has the potential benefit of enabling some previously impossible dynamic program analysis tasks, which would otherwise be too costly. In particular, our techniques could be applied to certain forms of path profiling: e.g., they could help leverage the scalability problems encountered when collecting performance metrics about interprocedural paths (i.e., acyclic paths that may cross procedure boundaries)~\cite{Melski99}. Also, it would be interesting to investigate whether streaming algorithms for weighted item sets might be useful to solve space issues arising in other performance profiling methodologies.

\noindent An interesting open question for our multi-iteration path profiling technique is how to use sophisticated sampling techniques (e.g., ~\cite{Arnold01,Zhuang06}) to further reduce the profiling overhead. We have seen that bursting is effective in a context-sensitive profiling scenario. To capture even longer paths, our technique might be extended with pruning heuristics that trade accuracy for a smaller memory footprint in the \ksf\ construction. We also remark that our approach, by decoupling path tracing from profiling, is amenable to multi-core implementations by letting the profiled code and the analysis algorithm run on separate cores using shared buffers. A promising line of research is to explore how to partition the data structures so that portions of the stream buffer can be processed in parallel.

We have seen that our approach to OSR mapping generation relies on the live-variable bisimilarity property for program versions. What are the limitations of our formalism in terms of existing compiler optimizations? What is involved in rethinking existing compiler optimizations in terms of the presented model? Transformations that deeply change the structure of a program, such as aggressive loop optimizations, are not supported at the moment. Indeed, such transformations typically require entire portions of state to be logged in order to support deoptimization, such as in the loop tiling case~\cite{Bhandari15}. Our work is just a scratch off the surface of the fascinating problem of how to dynamically morph one program into another. A deep understanding of the trade-offs between flexibility and time/space requirements of OSR remains a compelling goal for future work.

As a next step on the implementation side, we plan to extend \osrkit\ to generate continuation functions that can be shared by multiple deoptimization points. By adding a dispatcher in the entry block that examines the current OSR source location in order to properly compensate the state and jump to the associated landing pad, a single continuation function might serve more than a deoptimization point. Indeed, when transferring execution to less optimized code we should not worry about a possibly reduced peak performance for the modified continuation function: execution won't likely stay in it for long. Also, in the presence of frequent OSR transitions between pairs of versions, this solution would be very effective when deoptimization does not always occur at the same locations.

%We are aware that \osrkit\ is currently being used in a joint academic-industrial research project for the optimization of the runtime for the R language, and we hope to look at other tools that may use it in the future.

We are aware that \osrkit\ is currently being used in a joint academic-industrial research project for the optimization of the runtime for the R language~\cite{vitek16}, and we hope to look at other tools that may use it in the future.