\section{Implementation}\label{C:impl}
Our refactoring tool relies on the compiler for semantic information about the programs to be refactored. We obtain this from two sources: the compiler can be asked (with the `-Zsave-analysis' flag) to dump information from its analysis of a program; and where we needed more refactoring-specific information, we used a modified version of the compiler with callbacks into our refactoring tool from the name resolution pass.
\subsection{Renaming}
To evaluate a potential renaming, our tool starts with the save-analysis information. This allows the tool to identify the declaration of a variable (or other item) and all of its uses (in contrast with a syntactic search, the compiler can differentiate between different bindings with the same name). If the renaming is valid, then this is enough information to perform the rename. To check validity, we must re-run the compiler and try to resolve the new name at the declaration site. If the name does resolve, then the renaming would introduce a name conflict. This check prevents all super- and same-block conflicts. However, it is conservative and rejects some valid renamings (those where the shadowed name is never actually used after the declaration).
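
For instance, consider the following hypothetical program, where we attempt to rename \texttt{x} to \texttt{y}:

\begin{verbatim}
fn main() {
    let y = 1;
    println!("{}", y);
    let x = 2;          // rename `x` to `y`?
    println!("{}", y);  // after renaming, this use would bind
                        // to the renamed variable instead
}
\end{verbatim}

Since \texttt{y} resolves at \texttt{x}'s declaration site, the renaming is rejected. Note that the check would still reject the renaming if the final use of \texttt{y} were deleted, even though the renaming would then be valid.
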
To guard against sub-block conflicts, we must further try to resolve the new name at every use of the variable being renamed. Only if every attempt at resolution fails can we be sure that the renaming is safe.
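
For example, in the following hypothetical program, renaming \texttt{x} to \texttt{y} would cause a sub-block conflict:

\begin{verbatim}
fn main() {
    let x = 1;              // rename `x` to `y`?
    {
        let y = 2;
        println!("{}", x);  // after renaming, this use would
                            // bind to the inner `y`
    }
}
\end{verbatim}

Here the new name \texttt{y} resolves at the use site of \texttt{x}, so the renaming is rejected.
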
Whilst in theory all these checks could be done with a single pass of the compiler, in practice Rust's name resolution is not flexible enough to check an arbitrary name: it can only resolve a name that appears in the source text. Furthermore, at the time we could only observe the success or failure of name resolution, not the reason for it (the compiler has improved considerably since we implemented this tool). That means that, to be safe, we must re-compile once for each use of the variable being renamed. This is clearly expensive.
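
The resulting validation loop looks roughly like the following sketch, where \texttt{rename\_span} and \texttt{compile\_crate} are hypothetical helpers standing in for the tool's actual plumbing:

\begin{verbatim}
// Sketch only: both helpers are hypothetical stand-ins.
fn rename_span(_src: &str, _site: (usize, usize),
               _new: &str) -> String { unimplemented!() }
fn compile_crate(_src: &str) -> Result<(), ()> { unimplemented!() }

fn uses_are_safe(src: &str, uses: &[(usize, usize)],
                 new_name: &str) -> bool {
    uses.iter().all(|&site| {
        // Rename a single use site and recompile the crate.
        let candidate = rename_span(src, site, new_name);
        // The rename is safe at this site only if the trial
        // compilation fails (in the real tool, it must fail
        // with a name resolution error rather than some
        // unrelated error).
        compile_crate(&candidate).is_err()
    })
}
\end{verbatim}
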
A much better approach (but outside the scope of this project) would be to modify name resolution to allow checking for arbitrary names. This approach is taken by Gorename \cite{gorename15}.
\subsection{Inlining}
Again, inlining starts with the save-analysis data. This data allows finding the number of uses of the variable to be inlined and the mutability of its type. However, this is not enough to complete our analysis. In particular, Rust objects can have \emph{interior mutability}, which is not reflected in the object's type. It is, however, tracked by the compiler, so our tool can query this information. We also detect whether a `mut' annotation is extraneous by relying on the compiler's identification of such unnecessary annotations. Unfortunately, this requires running the compiler to a late stage of its analysis and is thus fairly time-consuming. Finally, we rely again on name resolution to ensure that any variables substituted in still resolve to their original bindings.
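
Interior mutability is why the type alone cannot be trusted. For instance, in this hypothetical snippet, \texttt{c} is not declared \texttt{mut} and yet its contents change, so naively inlining \texttt{x} would change the program's behaviour:

\begin{verbatim}
use std::cell::Cell;

fn main() {
    let c = Cell::new(1);
    let x = c.get();   // candidate for inlining
    c.set(2);          // interior mutation, invisible in
                       // `c`'s binding (no `mut` needed)
    println!("{}", x);          // prints 1
    // After naive inlining:
    // println!("{}", c.get()); // would print 2
}
\end{verbatim}
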
We perform the actual inlining on the AST. Following that change, we must ensure that the fragment of the AST is properly printed back into the source text. In particular, parentheses may need to be added to ensure the correct order of operations due to precedence. See Figure~\ref{Fig:exinline} for an example.
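
As a sketch of the kind of case involved (the concrete code in the figure may differ), consider inlining \texttt{x} below:

\begin{verbatim}
fn main() {
    let x = 1 + 1;
    let y = x * 3;  // inline `x` here
    // Pretty printing adds the required parentheses:
    //   let y = (1 + 1) * 3;   // y == 6
    // Naive textual substitution would instead yield:
    //   let y = 1 + 1 * 3;     // y == 4
    println!("{}", y);
}
\end{verbatim}
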
\begin{figure}[h]
\centering
\label{Fig:exinline}
\end{figure}
\subsection{Lifetime elision and reification}
Reification of lifetime parameters was based on the compiler's implementation of error reporting for missing lifetimes. This was somewhat complicated by the compiler's representation of lifetimes (a combination of explicit binder structures and de Bruijn indices); converting these into fresh lifetime variables again required interaction with name resolution (although name resolution for lifetimes is a simpler case in the Rust compiler and is handled by its own code).
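
The transformation itself follows the elision rules described in Section~\ref{C:back}; a simple hypothetical example:

\begin{verbatim}
// Before reification: the lifetime is elided.
fn before(s: &str) -> &str { s }

// After reification: the single input lifetime gets a fresh
// name, and the elision rules tie the output to it.
fn after<'a>(s: &'a str) -> &'a str { s }
\end{verbatim}
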
In contrast, the fundamentals of elision were more complex: there is no help from the compiler here, and we implemented elision only for very simple cases. However, since we are only removing lifetime parameters from the source code, there is no difficulty with names.
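
A simple case which the tool does handle, as a hypothetical example (the two functions are the before and after forms of the same signature):

\begin{verbatim}
// Before elision: the parameter carries no information
// beyond what the elision rules reconstruct.
fn head<'a>(s: &'a [u8]) -> &'a u8 { &s[0] }

// After elision: the same meaning, by the single input
// lifetime rule.
fn head_elided(s: &[u8]) -> &u8 { &s[0] }
\end{verbatim}
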
For a problematic example, see Figure~\ref{Fig:partial}. Here, \texttt{'b} can be elided, but \texttt{'a} cannot: if it were, the compiler would treat the types of \texttt{x} and \texttt{y} as having distinct lifetimes.
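
A hypothetical signature of this shape (the figure's exact code may differ):

\begin{verbatim}
// `'b` could be elided (it would be reconstructed as a fresh
// input lifetime), but `'a` cannot: eliding it would give
// `x` and `y` distinct lifetimes and change the signature.
fn choose<'a, 'b>(x: &'a str, y: &'a str,
                  flag: &'b bool) -> &'a str {
    if *flag { x } else { y }
}
\end{verbatim}
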