Building classifier
First, at the data preprocessing stage, all log statements are matched against a regular expression.
Every log statement is stored in the following data structure:
class LogStatement:
Depending on the classification task, log statements are represented differently.
For every project, we calculate the percentage of files that contain at least one log statement. If this percentage is lower than a certain threshold, we assume that the project does not have proper logging. Such projects are excluded when generating a dataset for classification.
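The filtering rule above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names `has_proper_logging` and `filter_projects`, and the representation of a project as a list of per-file booleans, are assumptions made for the example.

```python
from typing import Dict, List


def has_proper_logging(files_with_logs: int, total_files: int,
                       threshold_percent: float = 1.0) -> bool:
    """Return True if the percentage of files with at least one log
    statement is not lower than the threshold (hypothetical helper)."""
    if total_files == 0:
        return False  # assumption: an empty project is treated as unlogged
    percentage = 100.0 * files_with_logs / total_files
    return percentage >= threshold_percent


def filter_projects(projects: Dict[str, List[bool]],
                    threshold_percent: float = 1.0) -> List[str]:
    """Keep the names of projects that pass the threshold.

    `projects` maps a project name to one flag per file: True if the
    file contains at least one log statement (assumed representation).
    """
    kept = []
    for name, file_flags in projects.items():
        if has_proper_logging(sum(file_flags), len(file_flags), threshold_percent):
            kept.append(name)
    return kept
```

For example, a project with 1 logging file out of 200 (0.5%) would be dropped at the 1.0% threshold, while 1 out of 100 would be kept.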
With the regex REGEX TBA and a threshold of 1.0%, the number of projects that are left is n out of N, and the number of files that contain at least one log statement in those N projects is k out of K.
Each example in a dataset consists of N_LIMIT words before the position considered to be the location of a log statement and N_LIMIT words after it. A context around a position where there should be a log statement is labeled `True`; a context around a position where there should be no log statement is labeled `False`.
In both cases, the context around the log statement (an existing one in the positive case, an artificially inserted one in the negative case) is taken. If the beginning or the end of the file is reached while capturing a context, padding tokens are added on that side. If there are other log statements in the context, they are removed with probability 0.5.
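A minimal sketch of the context extraction described above. Several details are assumptions made for illustration: the padding token `<PAD>`, the function name `extract_context`, the representation of other log statements as `(start, end)` token ranges, the choice to decide removal independently for each of them, and the convention that the target log statement itself is not part of `tokens`.

```python
import random
from typing import List, Optional, Sequence, Tuple

PAD = "<PAD>"  # hypothetical padding token


def extract_context(tokens: List[str], position: int, n_limit: int,
                    other_log_spans: Sequence[Tuple[int, int]] = (),
                    rng: Optional[random.Random] = None) -> List[str]:
    """Return n_limit words before and n_limit words after `position`,
    padded on either side if the file boundary is reached."""
    rng = rng or random.Random()

    # Assumption: each other log statement is dropped independently
    # with probability 0.5.
    dropped = set()
    for start, end in other_log_spans:
        if rng.random() < 0.5:
            dropped.update(range(start, end))

    before = [t for i, t in enumerate(tokens[:position]) if i not in dropped]
    after = [t for i, t in enumerate(tokens[position:], start=position)
             if i not in dropped]

    # Guard against n_limit == 0, because before[-0:] is the whole list.
    before = before[-n_limit:] if n_limit > 0 else []
    after = after[:n_limit]

    left_pad = [PAD] * (n_limit - len(before))
    right_pad = [PAD] * (n_limit - len(after))
    return left_pad + before + after + right_pad
```

The result always has `2 * n_limit` tokens, which keeps every example the same length regardless of where in the file the position falls.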
The first logging-related task is finding a position in the code where a log statement can be inserted. To achieve this, the trained model goes through a source file and, for each line, estimates the probability that a log statement should be inserted after it.
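The per-line scan could look roughly like the sketch below. The model is represented by a hypothetical callable `predict(lines, index)` that returns a probability; its signature, the function names, and the 0.5 decision threshold are all assumptions for illustration, not the project's API.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: (all lines, line index) -> probability
Predictor = Callable[[List[str], int], float]


def score_insertion_points(lines: List[str],
                           predict: Predictor) -> List[Tuple[int, float]]:
    """Pair each 1-based line number with the probability that a log
    statement should be inserted after that line."""
    return [(i + 1, predict(lines, i)) for i in range(len(lines))]


def best_insertion_points(lines: List[str], predict: Predictor,
                          threshold: float = 0.5) -> List[int]:
    """Line numbers whose probability reaches the (assumed) threshold."""
    return [n for n, p in score_insertion_points(lines, predict)
            if p >= threshold]
```

In practice `predict` would wrap the trained classifier applied to the context around the given line.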
The dataset is generated from the selected k files. For positive cases, one of the existing log statements is randomly selected from a file. For negative cases, blocks are identified where log statements would be syntactically possible (method and constructor bodies, static and dynamic initialization blocks). One of the positions after `{`, `}`, or `;` is then randomly selected, and a 'fake log statement' is inserted there.
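A simplified sketch of the negative-position sampling. It deliberately skips the block-identification step: it scans the whole text rather than only method, constructor, and initializer bodies, and it does not exclude braces or semicolons inside string literals or comments. A real implementation would need a parser for that. The function names are hypothetical.

```python
import random
from typing import List, Optional


def candidate_positions(source: str) -> List[int]:
    """Character offsets immediately after each '{', '}' or ';'."""
    return [i + 1 for i, ch in enumerate(source) if ch in "{};"]


def pick_negative_position(source: str,
                           rng: Optional[random.Random] = None) -> Optional[int]:
    """Randomly choose one candidate offset, or None if there is none."""
    rng = rng or random.Random()
    positions = candidate_positions(source)
    return rng.choice(positions) if positions else None
```

The returned offset marks where the fake log statement would be inserted before the surrounding context is captured.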
TRACE_OPTIONS = '[Tt]race|TRACE|v|t|logV|[Ff]inest|FINEST'
DEBUG_OPTIONS = '[Dd]ebug|DEBUG|d|logD|[Ff]iner|FINER|[Cc]onfig|CONFIG'
INFO_OPTIONS = '[Ii]nfo|INFO|i|logI|[Ff]ine|FINE'
WARN_OPTIONS = '[Ww]arn|WARN|[Ww]arning|WARNING|w|logW'
ERROR_OPTIONS = '[Ee]rror|ERROR|e|logE'
FATAL_OPTIONS = '[Ff]atal|FATAL|[Ss]evere|SEVERE|s|f'
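One way these option strings might be used is to map a logging method name to a canonical level. The sketch below is an assumption about their purpose, and the names `LEVEL_PATTERNS` and `normalize_level` are hypothetical. It relies on `re.fullmatch`, because short alternatives such as `v` or `e` would otherwise match inside unrelated names.

```python
import re
from typing import Optional

TRACE_OPTIONS = '[Tt]race|TRACE|v|t|logV|[Ff]inest|FINEST'
DEBUG_OPTIONS = '[Dd]ebug|DEBUG|d|logD|[Ff]iner|FINER|[Cc]onfig|CONFIG'
INFO_OPTIONS = '[Ii]nfo|INFO|i|logI|[Ff]ine|FINE'
WARN_OPTIONS = '[Ww]arn|WARN|[Ww]arning|WARNING|w|logW'
ERROR_OPTIONS = '[Ee]rror|ERROR|e|logE'
FATAL_OPTIONS = '[Ff]atal|FATAL|[Ss]evere|SEVERE|s|f'

# Compile once; order only matters if two levels could match the same name.
LEVEL_PATTERNS = [
    ("TRACE", re.compile(TRACE_OPTIONS)),
    ("DEBUG", re.compile(DEBUG_OPTIONS)),
    ("INFO", re.compile(INFO_OPTIONS)),
    ("WARN", re.compile(WARN_OPTIONS)),
    ("ERROR", re.compile(ERROR_OPTIONS)),
    ("FATAL", re.compile(FATAL_OPTIONS)),
]


def normalize_level(method_name: str) -> Optional[str]:
    """Return the canonical level for a logging method name, or None."""
    for level, pattern in LEVEL_PATTERNS:
        # fullmatch requires the whole name to match one alternative.
        if pattern.fullmatch(method_name):
            return level
    return None
```

With full matching, `fine`, `finer`, and `finest` resolve to INFO, DEBUG, and TRACE respectively instead of all being caught by the shortest alternative.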