Code Base Investigator (CBI) is a tool designed to help developers reason about the use of specialization (i.e. code written specifically to provide support for or improve performance on some set of platforms) in a code base. Specialization is often necessary, but how a developer chooses to express it may impact code portability and future maintenance costs.
The definition of platform used by CBI is deliberately very flexible and completely user-defined; a platform can represent any execution environment for which code may be specialized. A platform could be a compiler, an operating system, a micro-architecture or some combination of these options.
CBI measures the amount of specialization in a code base using code divergence, which is defined as the arithmetic mean pair-wise distance between the code-paths used by each platform.
At the two extremes, a code divergence of 0 means that all of the platforms use exactly the same code, while a code divergence of 1 means that there is no code shared between any of the platforms. The code divergence of real codes will fall somewhere in between.
CBI tracks specialization in two forms: source files that are not compiled for all platforms; and regions of source files that are guarded by C preprocessor directives (e.g. #ifdef
). A typical run of CBI consists of a three step process:
- Extract source files and compilation commands from a configuration file or compilation database.
- Build an AST representing which source lines of code (LOC) are associated with each specialization.
- Record which specializations are used by each platform.
usage: codebasin.py [-h] [-c FILE] [-v] [-q] [-r DIR] [-R REPORT [REPORT ...]]
optional arguments:
-h, --help show this help message and exit
-c FILE, --config FILE configuration file
-v, --verbose increase verbosity level
-q, --quiet decrease verbosity level
-r DIR, --rootdir DIR working root directory
-R REPORT [REPORT ...] desired outout reports
The codebasin.py
script analyzes a code base described in a YAML configuration file and produces one or more output reports. Example configuration files can be found in the examples directory, and see the configuration file documentation for a detailed description of the configuration file format.
The summary report (-R summary
) gives a high-level summary of a code base, as shown below:
---------------------------------------------
Platform Set LOC % LOC
---------------------------------------------
{} 2 4.88
{GPU 1} 1 2.44
{GPU 2} 1 2.44
{CPU 2} 1 2.44
{CPU 1} 1 2.44
{FPGA} 14 34.15
{GPU 2, GPU 1} 6 14.63
{CPU 1, CPU 2} 6 14.63
{FPGA, CPU 1, GPU 2, GPU 1, CPU 2} 9 21.95
---------------------------------------------
Code Divergence: 0.55
Unused Code (%): 4.88
Total SLOC: 41
Each row in the table shows the amount of code that is unique to a given set of platforms. Listed below the table are the computed code divergence, the amount of code in the code base that was not compiled for any platform, and the total size of the code base.
The clustering report (-R clustering
) consists of a pair-wise distance matrix, showing the ratio of platform-specific code to code used by both platforms. These distances are the same as those used to compute code divergence.
Distance Matrix
-----------------------------------
FPGA CPU 1 GPU 2 GPU 1 CPU 2
-----------------------------------
FPGA 0.00 0.70 0.70 0.70 0.70
CPU 1 0.70 0.00 0.61 0.61 0.12
GPU 2 0.70 0.61 0.00 0.12 0.61
GPU 1 0.70 0.61 0.12 0.00 0.61
CPU 2 0.70 0.12 0.61 0.61 0.00
-----------------------------------
The distances can also be used to produce a dendrogram, showing the result of hierarchical clustering by platform similarity:
- NumPy
- Matplotlib
- Python 3
- PyYAML
- SciPy
CBI and its dependencies can be installed using setup.py
:
python3 setup.py install
The master branch of CBI is the development branch, and should not be used in production. Tagged releases are available here.
See the contribution guidelines for details.