Author: Guillaume Steveny; Year: 2023 - 2024
VOCABULARY:
- huggingface_model = name of the HuggingFace model in which the vocabulary is defined. (default = "Enoch/graphcodebert-py")
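For illustration, a minimal VOCABULARY section could look as follows (a sketch assuming the configuration file is plain JSON with one top-level object per section named after its header; the model name is the documented default):

    "VOCABULARY": {
        "huggingface_model": "Enoch/graphcodebert-py"
    }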
READER:
- huggingface_model = name of the HuggingFace model in which the tokenizer is stored. (default = "Enoch/graphcodebert-py")
- snippet_splitter = string to represent the pattern that indicates a separation between the different examples in the dataset. (default = "\n$$$\n")
- label_splitter = string to indicate the pattern to split the code snippet from the associated labels. (default = "\n$x$ ")
- multi_labels = string to specify the pattern used to separate the labels associated with a specific example. (default = None)
- part_graph = a list of two integers containing the number of tokens to keep for the code part and the number of tokens to keep for the dataflow part. (default = [256, 256])
- compiled_language = path to the file used to parse the Python code with TreeSitter. (default = "./my-language.so")
- kwargs_tokenizer = set of fields to be given to the tokenizer when creating a new instance. (default = None)
- kwargs_indexer = set of fields containing the additional parameters to give to the Indexer constructor. (default = None)
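As a sketch (same JSON assumption as above; all values are the documented defaults), a READER section could look like:

    "READER": {
        "huggingface_model": "Enoch/graphcodebert-py",
        "snippet_splitter": "\n$$$\n",
        "label_splitter": "\n$x$ ",
        "multi_labels": null,
        "part_graph": [256, 256],
        "compiled_language": "./my-language.so"
    }

With these splitters, a dataset file holding two examples would be laid out as follows (the snippets and labels are hypothetical; the labels reuse the default MODEL labels):

    print(1 / 0)
    $x$ failed
    $$$
    print("ok")
    $x$ success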
MODEL:
- labels = list of strings representing the different labels to be predicted by the model. The order of the provided entries is important. (default = ["success", "failed"])
- huggingface_model = name of the HuggingFace model containing the weights from the pretraining process. (default = "Enoch/graphcodebert-py")
- kwargs_embedder = additional fields to be used when creating the new model instance. (default = None)
- trainable = boolean to specify if the parameters of the pretrained model should be updated (fine-tuned) during training.
- embedding_size = number of dimensions of the embeddings generated by the pretrained model. (default = 768)
- encoder: the way the embeddings are combined into a single embedding representing the whole code snippet. (default = {"name": "bert_pooler", "arg": [huggingface_model]})
- name = either "cls_label" (take the embedding of the CLS token) or "bert_pooler" (take the CLS embedding after going through a dense layer)
- arg = the additional arguments (ordered) for the "bert_pooler" encoder
- kwargs = the additional keyword arguments for the "bert_pooler"
- classification_head: module to transform the encoded embedding into a sequence of len(labels) outputs (2 with the default labels). (default = {"name": "simple", "arg": [embedding_size, len(labels)]})
- name = either "simple" (a single dense layer) or "mult_dense" (a sequence of dense layers)
- arg = list of additional parameters for the classification_head (ordered)
- kwargs: additional keyword arguments to provide to the classification_head; the following parameters are specific to the "mult_dense" head.
- activation: the type of activation to be used between the layers
- name = either "gelu", "relu" or "leaky_relu". (default = "gelu")
- arg = ignored for the two possible names.
- kwargs = ignored for the two possible names.
- norm = a boolean to indicate if BatchNormalization should be performed between the layers. (default = False)
- accuracy: the type of accuracy metric to compute. This parameter should only be used in a multi-class single-label model. (default = None)
- name = "categorical_accuracy"
- arg = ignored
- kwargs = ignored
- loss: type of loss to be used to train the model
- name = either "cross_entropy" or "multilabel_soft_margin_loss" (default = "cross_entropy")
- arg = ignored
- kwargs = ignored
- multi_label = boolean to specify if the classification is multi-label (multiple labels associated with each example). The accuracy should not be activated if this parameter is set to True. (default = False)
- debug = boolean to indicate if the forward call should print its output to the user every 10 calls. (default = False)
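Putting the nested "name"/"arg"/"kwargs" convention together, a MODEL section using the "mult_dense" head could look like this (same JSON assumption; the arg list for "mult_dense" is hypothetical, assumed here to be input and output sizes by analogy with "simple", and the remaining values are the documented defaults or possible values listed above):

    "MODEL": {
        "labels": ["success", "failed"],
        "huggingface_model": "Enoch/graphcodebert-py",
        "trainable": true,
        "embedding_size": 768,
        "encoder": {"name": "bert_pooler", "arg": ["Enoch/graphcodebert-py"]},
        "classification_head": {
            "name": "mult_dense",
            "arg": [768, 2],
            "kwargs": {
                "activation": {"name": "gelu"},
                "norm": false
            }
        },
        "accuracy": {"name": "categorical_accuracy"},
        "loss": {"name": "cross_entropy"},
        "multi_label": false,
        "debug": false
    }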
TRAINER:
- validation_metric = either "-loss" or "+fscore" to indicate which metric should be used to select the best epoch and save the associated weights in the serialization directory. (default = "+fscore")
- learning_rate = a float number (can be in scientific notation, e.g. 1.e-5) representing the learning rate that will be used to train the model. (default = 1.e-5)
- patience = a positive integer representing the number of epochs to wait before doing early stopping if no improvement has been observed during these epochs. (default: early stopping is disabled)
- kwargs_optimizer = additional keyword arguments for the AdamW optimizer. (default = None)
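A TRAINER section could then be (same JSON assumption; the patience value and the weight_decay entry are hypothetical, weight_decay being a standard AdamW keyword argument):

    "TRAINER": {
        "validation_metric": "+fscore",
        "learning_rate": 1e-5,
        "patience": 5,
        "kwargs_optimizer": {"weight_decay": 0.01}
    }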
INTERPRETER:
- captum = a boolean to specify if the captum framework should be used to perform the interpretation of the predicted classes. (default = False)
- kwargs: these parameters are only valid if the captum parameter is activated.
- interpreter_name = either "LayerIntegratedGradients" or "LayerConductance" representing the type of attribution computed with the captum framework. (default = "LayerIntegratedGradients")
- layer = the name of the layer for layer-wise attribution methods, None if the selected attribution should not be layer-wise (possible parameter value: "bert_interpretable_layer")
- attribute_kwargs:
- n_steps = number of estimation steps for the interpreter chosen. (default = 2)
- internal_batch_size = the number of examples per batch (should be 1 at prediction time). (default = None)
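An INTERPRETER section enabling layer-wise attributions could look like this (same JSON assumption; all values are the documented defaults or the possible values listed above):

    "INTERPRETER": {
        "captum": true,
        "kwargs": {
            "interpreter_name": "LayerIntegratedGradients",
            "layer": "bert_interpretable_layer",
            "attribute_kwargs": {
                "n_steps": 2,
                "internal_batch_size": 1
            }
        }
    }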
CONFIG:
- predict = a boolean to launch the command-line interpreter. Mutually exclusive with the gui parameter. (default = False)
- no_loop = a boolean to skip the training loop. (default = False)
- no_eval = a boolean to skip the evaluation of the trained or loaded model. (default = False)
- gui = a boolean to launch the model in server mode to answer prediction requests. Can only be activated when training and evaluation are disabled. (default = False)
- model = string representing the name of the HuggingFace model to be used as base (default = "Enoch/graphcodebert-py")
- training = path to the training dataset. (default = "../output/codebert/train.txt")
- validation = path to the validation dataset. (default = "../output/codebert/validation.txt")
- evaluation = path to the evaluation dataset. (default = "../output/codebert/test.txt")
- random_seed = integer corresponding to the seed for the random number generator contained in the Python random module, NumPy and PyTorch. (default: no seeds are used)
- serialization_dir = path to the directory where the output of the model and the best saved weights will be stored. (default = "../output/cesres_codebert")
- no_loop_weight_file = name of the weight file inside the serialization dir when setting no_loop to true. (default = "best.th")
- load_model = name of the weight file in an arbitrary directory (overrides no_loop_weight_file) when setting no_loop to true. (default = serialization_dir + "/" + no_loop_weight_file)
- batch_size = the number of examples to be inside each batch for the training process. (default = 8)
- validation_batch_size = the number of examples to be inside each batch for the validation process. (default = same as batch_size)
- evaluation_batch_size = the number of examples to be inside each batch for the evaluation process. (default = same as batch_size)
- loops = the number of epochs to train the model. (default = 1)
- device = either "cpu" or "cuda" to select the device on which the complete model should be loaded, run, and trained. (default = "cpu")
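Finally, a CONFIG section for a short GPU training run could look like this (same JSON assumption; the random_seed and loops values are hypothetical, the other values are the documented defaults apart from the device):

    "CONFIG": {
        "predict": false,
        "no_loop": false,
        "no_eval": false,
        "gui": false,
        "model": "Enoch/graphcodebert-py",
        "training": "../output/codebert/train.txt",
        "validation": "../output/codebert/validation.txt",
        "evaluation": "../output/codebert/test.txt",
        "random_seed": 42,
        "serialization_dir": "../output/cesres_codebert",
        "batch_size": 8,
        "loops": 10,
        "device": "cuda"
    }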