parameters.md

Author: Guillaume Steveny; Year: 2023 - 2024

VOCABULARY:

  • huggingface_model = name of the HuggingFace model in which the vocabulary is defined. (default = "Enoch/graphcodebert-py")
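
To make this concrete, below is a minimal sketch of the VOCABULARY section written as a Python/JSON-style dict, mirroring the inline defaults quoted in this document. The surrounding structure is an assumption for illustration, not the project's verified config format.

```python
# Hypothetical VOCABULARY section (the dict layout is assumed, not project-verified).
VOCABULARY = {
    "huggingface_model": "Enoch/graphcodebert-py",  # model whose vocabulary is reused
}
```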

READER:

  • huggingface_model = name of the HuggingFace model in which the tokenizer is stored. (default = "Enoch/graphcodebert-py")
  • snippet_splitter = string pattern that marks the separation between the different examples in the dataset. (default = "\n$$$\n")
  • label_splitter = string pattern that splits a code snippet from its associated labels. (default = " $x$ ")
  • multi_labels = string pattern used to separate the labels associated with a specific example. (default = None)
  • part_graph = a list of two integers containing the number of tokens to keep for the code part and the number of tokens to keep for the dataflow part. (default = [256, 256])
  • compiled_language = path to the compiled language file used by TreeSitter to parse the Python code. (default = "./my-language.so")
  • kwargs_tokenizer = set of fields to be given to the tokenizer when creating a new instance. (default = None)
  • kwargs_indexer = set of fields containing the additional parameters to give to the Indexer constructor. (default = None)
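
To make the splitter parameters concrete, the sketch below shows how a dataset file could look under the defaults above and how the separators interact. The code content and label names are invented for illustration; only the separator strings come from the defaults documented here.

```python
# Hypothetical two-example dataset, using the default separators:
# "\n$$$\n" (snippet_splitter) between examples,
# " $x$ " (label_splitter) between a snippet and its label(s).
raw = "print('ok') $x$ success\n$$$\nraise ValueError() $x$ failed"

for example in raw.split("\n$$$\n"):           # snippet_splitter
    code, labels = example.split(" $x$ ")      # label_splitter
    # With multi_labels set (e.g. ","), labels would be split once more:
    # label_list = labels.split(",")
    print(repr(code), "->", labels)
```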

MODEL:

  • labels = list of strings representing the different labels to be predicted by the model. The order of the provided entries is important. (default = ["success", "failed"])
  • huggingface_model = name of the HuggingFace model containing the weights from the pretraining process. (default = "Enoch/graphcodebert-py")
  • kwargs_embedder = additional fields to be used when creating the new model instance. (default = None)
    • trainable = boolean to specify if the parameters of the pretrained model should be retrained.
  • embedding_size = number of dimensions used in the embeddings generated by the pretrained model. (default = 768)
  • encoder = the way the embeddings are combined into a single embedding representing the whole code snippet; see the sketch after this list. (default = {"name": "bert_pooler", "arg": [huggingface_model]})
    • name = either "cls_label" (take the embedding of the CLS token) or "bert_pooler" (take the CLS embedding after going through a dense layer)
    • arg = the additional argument (ordered) for the "bert_pooler" encoder
    • kwargs = the additional keyword arguments for the "bert_pooler"
  • classification_head = module to transform the encoded embedding into a sequence of len(labels)+2 outputs. (default = {"name": "simple", "arg": [embedding_size, len(labels)]})
    • name = either "simple" (a single dense layer) or "mult_dense" (a sequence of dense layers)
    • arg = list of additional parameters for the classification_head (ordered)
    • kwargs = additional keyword arguments to provide to the classification_head; the following parameters are specific to the "mult_dense" head.
      • activation = the type of activation function to be used between the layers
        • name = either "gelu", "relu" or "leaky_relu". (default = "gelu")
        • arg = ignored for the two possible names.
        • kwargs = ignored for the two possible names.
      • norm = a boolean to indicate if BatchNormalization should be performed between the layers. (default = False)
  • accuracy = the type of accuracy metric to compute. This parameter should only be used in a multi-class single-label model. (default = None)
    • name = "categorical_accuracy"
    • arg = ignored
    • kwargs = ignored
  • loss = type of loss to be used to train the model.
    • name = either "cross_entropy" or "multilabel_soft_margin_loss" (default = "cross_entropy")
    • arg = ignored
    • kwargs = ignored
  • multi_label = boolean to specify if the classification is multi-label (multiple labels associated with each example). The accuracy should not be activated if this parameter is set to True. (default = False)
  • debug = boolean to indicate if the forward call should print its output to the user each 10 calls. (default = False)
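
As a worked example, here is a plausible MODEL section combining the parameters above. The nested {"name", "arg", "kwargs"} layout mirrors the inline defaults quoted in this document, but the exact configuration syntax expected by the project is an assumption.

```python
labels = ["success", "failed"]
embedding_size = 768

# Hypothetical MODEL section with a frozen pretrained encoder and a
# multi-dense classification head (values are illustrative, not all defaults).
MODEL = {
    "labels": labels,
    "huggingface_model": "Enoch/graphcodebert-py",
    "kwargs_embedder": {"trainable": False},          # keep pretrained weights frozen
    "embedding_size": embedding_size,
    "encoder": {"name": "bert_pooler", "arg": ["Enoch/graphcodebert-py"]},
    "classification_head": {
        "name": "mult_dense",                          # sequence of dense layers
        "arg": [embedding_size, len(labels)],
        "kwargs": {"activation": {"name": "gelu"}, "norm": False},
    },
    "accuracy": {"name": "categorical_accuracy"},      # single-label setting only
    "loss": {"name": "cross_entropy"},
    "multi_label": False,
    "debug": False,
}
```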

TRAINER:

  • validation_metric = either "-loss" or "+fscore" to indicate which metric should be used to select the best epoch and save the associated weights in the serialization directory. (default = "+fscore")
  • learning_rate = a float number (can be in scientific notation, e.g. 1.e-5) representing the learning rate that will be used to train the model. (default = 1.e-5)
  • patience = a positive integer representing the number of epochs to wait before doing early stopping if no improvement has been observed during these epochs. (default: early stopping is disabled)
  • kwargs_optimizer = additional keyword arguments for the AdamW optimizer. (default = None)
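
A short sketch of a TRAINER section follows. The patience and weight_decay values are illustrative assumptions; kwargs_optimizer is simply forwarded to AdamW.

```python
# Hypothetical TRAINER section (values other than the documented defaults are assumed).
TRAINER = {
    "validation_metric": "+fscore",                # keep the best-fscore epoch's weights
    "learning_rate": 1.e-5,
    "patience": 3,                                 # early stopping after 3 flat epochs
    "kwargs_optimizer": {"weight_decay": 0.01},    # extra AdamW keyword argument
}
```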

INTERPRETER:

  • captum = a boolean to specify if the captum framework should be used to perform the interpretation of the predicted classes. (default = False)
  • kwargs = these parameters are only valid if the captum parameter is activated.
    • interpreter_name = either "LayerIntegratedGradients" or "LayerConductance", representing the type of attribution in the captum framework. (default = "LayerIntegratedGradients")
    • layer = the name of the layer for layer-wise attribution metrics; None if the selected attribution should not be layer-wise. (possible parameter value: "bert_interpretable_layer")
    • attribute_kwargs:
      • n_steps = number of estimation steps for the interpreter chosen. (default = 2)
      • internal_batch_size = the number of examples per batch (should be 1 at prediction time). (default = None)
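
For illustration, the sketch below enables captum with the layer-wise attribution mentioned above. The layout mirrors the parameter tree in this list; treat it as an assumption about the concrete config format.

```python
# Hypothetical INTERPRETER section enabling layer-wise attributions via captum.
INTERPRETER = {
    "captum": True,
    "kwargs": {
        "interpreter_name": "LayerIntegratedGradients",
        "layer": "bert_interpretable_layer",   # layer-wise attribution target
        "attribute_kwargs": {
            "n_steps": 2,                      # estimation steps for the interpreter
            "internal_batch_size": 1,          # should be 1 at prediction time
        },
    },
}
```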

CONFIG:

  • predict = a boolean to launch the command-line interpreter. Mutually exclusive with the gui parameter. (default = False)
  • no_loop = a boolean to skip the training loop. (default = False)
  • no_eval = a boolean to skip the evaluation of the trained or loaded model. (default = False)
  • gui = a boolean to launch the model in server mode to answer prediction requests. Can only be activated when the training and evaluation procedures are disabled. (default = False)
  • model = string representing the name of the HuggingFace model to be used as the base model. (default = "Enoch/graphcodebert-py")
  • training = path to the training dataset. (default = "../output/codebert/train.txt")
  • validation = path to the validation dataset. (default = "../output/codebert/validation.txt")
  • evaluation = path to the evaluation dataset. (default = "../output/codebert/test.txt")
  • random_seed = integer corresponding to the seed for the random number generator contained in the Python random module, NumPy and PyTorch. (default: no seeds are used)
  • serialization_dir = path to the directory where the output of the model and the best saved weights will be stored. (default = "../output/cesres_codebert")
  • no_loop_weight_file = name of the weight file inside the serialization dir when setting no_loop to true. (default = "best.th")
  • load_model = name of the weight file in an arbitrary directory (overrides no_loop_weight_file) when setting no_loop to true. (default = serialization_dir + "/" + no_loop_weight_file)
  • batch_size = the number of examples to be inside each batch for the training process. (default = 8)
  • validation_batch_size = the number of examples to be inside each batch for the validation process. (default = same as batch_size)
  • evaluation_batch_size = the number of examples to be inside each batch for the evaluation process. (default = same as batch_size)
  • loops = the number of epochs to train the model. (default = 1)
  • device = either "cpu" or "cuda" to select the device on which the complete model should be loaded, run, and trained. (default = "cpu")
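
Finally, a hedged sketch of a top-level CONFIG for a plain training-plus-evaluation run. The seed and epoch values are illustrative assumptions; the keys and remaining values come from the defaults documented above.

```python
# Hypothetical CONFIG section for a standard train + evaluate run on GPU.
CONFIG = {
    "predict": False,
    "no_loop": False,                          # keep the training loop
    "no_eval": False,
    "gui": False,
    "model": "Enoch/graphcodebert-py",
    "training": "../output/codebert/train.txt",
    "validation": "../output/codebert/validation.txt",
    "evaluation": "../output/codebert/test.txt",
    "random_seed": 42,                          # assumed; by default no seed is set
    "serialization_dir": "../output/cesres_codebert",
    "batch_size": 8,
    "loops": 10,                                # assumed; default is 1 epoch
    "device": "cuda",
}
```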