semantic-systems/discie

Training the mention recognizer

Run the following command to train the mention recognizer:

python src/mention_recognizer/mention_recognizer.py

Arguments

  • --model_name: Specify the pre-trained model name or path (default: "distilbert-base-cased").
  • --mode: Set the operational mode (choices: "train", "evaluate", "predict") (default: "train").
  • --dataset_path: Path to the training dataset (default: "data/rebel/en_train.jsonl").
  • --output_path: Specify the output directory or path (default: "bert-finetuned-ner").
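
The flags and defaults listed above could be declared with argparse roughly as follows. This is a minimal sketch, not the repository's actual parser: the function name build_parser and the help strings are assumptions, and the mode spellings follow the argument list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Mention recognizer (sketch)")
    parser.add_argument("--model_name", default="distilbert-base-cased",
                        help="Pre-trained model name or path")
    parser.add_argument("--mode", choices=["train", "evaluate", "predict"],
                        default="train", help="Operational mode")
    parser.add_argument("--dataset_path", default="data/rebel/en_train.jsonl",
                        help="Path to the training dataset")
    parser.add_argument("--output_path", default="bert-finetuned-ner",
                        help="Output directory or path")
    return parser


# Parsing an empty argument list yields the documented defaults.
defaults = build_parser().parse_args([])

# Overriding a flag works the same way as on the command line.
custom = build_parser().parse_args(["--output_path", "models/mention_recognizer"])
print(defaults.mode, custom.output_path)
```

Any flag that is not passed keeps its documented default, so only the values that differ need to appear on the command line.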

Training of entity linker / relation extractor

The main command is:

python src/candidate_generation/candidate_generator.py

Arguments

  • --mode: Choose the mode (choices: TRAIN, INDEX, CANDIDATES, TRAIN_CE) (default: TRAIN).
  • --train_dataset: Path to the training dataset (default: "data/rebel/en_train.jsonl").
  • --eval_dataset: Path to the evaluation dataset (default: "data/rebel/en_val.jsonl").
  • --output_path: Path to the output directory (default: "run_training_bi_encoder").
  • --model_directory: Specify the directory for the model (default: "models/small").
  • --checkpoint_path: Specify the checkpoint path (default: None).
  • --candidate_generation_dataset: Path to the dataset for candidate generation; important for mode CANDIDATES (default: "data/rebel/en_train.jsonl").
  • --training_candidate_set_path: Path to the training candidate set; important for mode TRAIN_CE (default: "data/rebel/en_train_mapped_candidate_set.json").
  • --eval_candidate_set_path: Path to the evaluation candidate set; important for mode TRAIN_CE (default: "data/rebel/en_val_mapped_candidate_set.json").
  • --model_name: Specify the model name (default: "sentence-transformers/all-MiniLM-L12-v2").
  • --batch_size: Set the batch size for training (default: 128).
  • --num_candidates: Number of candidates to consider during training of cross-encoder (default: 10).
  • --candidate_weight: Set the weight of the candidate loss for the cross-encoder (default: 1.0).
  • --normalize: Enable/disable embeddings normalization (default: True).
  • --exclude_types: Exclude types in relation extraction (default: False).
  • --types_index_path: Specify the types index path (default: None).
  • --filter_set_path: Specify the filter set path (default: None).
  • --type_dictionary_file: Specify the type dictionary file (default: "data/item_types_relation_extraction_alt.jsonl").

Training the bi-encoder

Run the following command to train the bi-encoder:

python src/candidate_generation/candidate_generator.py --mode TRAIN --train_dataset {train_dataset} --eval_dataset {eval_dataset}

Then we create an index for the bi-encoder:

python src/candidate_generation/candidate_generator.py --mode INDEX --model_directory {model_directory}

Training the cross-encoder with relation extraction

To train the cross-encoder, we need initial candidate sets. We can generate them with the following command:

python src/candidate_generation/candidate_generator.py --mode CANDIDATES --model_directory {model_directory} --candidate_generation_dataset {candidate_generation_dataset}

Run this command twice: once with the training dataset and once with the validation dataset.

Then we can train the cross-encoder with the following command:

python src/candidate_generation/candidate_generator.py --mode TRAIN_CE --train_dataset {train_dataset} --eval_dataset {eval_dataset} --training_candidate_set_path {training_candidate_set_path} --eval_candidate_set_path {eval_candidate_set_path}
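
The order of the steps (train the bi-encoder, build the index, generate candidates for both splits, train the cross-encoder) can be sketched as a small helper that assembles the documented commands. The helper name build_pipeline and the example paths are illustrative assumptions. Passing --output_path in the TRAIN step so that it matches --model_directory follows the usage in the reproduction section. Where CANDIDATES mode writes its output is not documented here, so the candidate set paths are supplied by the caller.

```python
import shlex

SCRIPT = "src/candidate_generation/candidate_generator.py"


def build_pipeline(train_dataset, eval_dataset, model_directory,
                   training_candidate_set_path, eval_candidate_set_path):
    """Return the documented commands, in execution order, as argument lists."""
    base = ["python", SCRIPT]
    return [
        # 1. Train the bi-encoder and write it to model_directory.
        base + ["--mode", "TRAIN", "--train_dataset", train_dataset,
                "--eval_dataset", eval_dataset,
                "--output_path", model_directory],
        # 2. Build the index for the trained bi-encoder.
        base + ["--mode", "INDEX", "--model_directory", model_directory],
        # 3. Generate candidates for the training split.
        base + ["--mode", "CANDIDATES", "--model_directory", model_directory,
                "--candidate_generation_dataset", train_dataset],
        # 4. Generate candidates for the validation split.
        base + ["--mode", "CANDIDATES", "--model_directory", model_directory,
                "--candidate_generation_dataset", eval_dataset],
        # 5. Train the cross-encoder on the generated candidate sets.
        base + ["--mode", "TRAIN_CE", "--train_dataset", train_dataset,
                "--eval_dataset", eval_dataset,
                "--training_candidate_set_path", training_candidate_set_path,
                "--eval_candidate_set_path", eval_candidate_set_path],
    ]


commands = build_pipeline(
    train_dataset="data/rebel/en_train.jsonl",
    eval_dataset="data/rebel/en_val.jsonl",
    model_directory="models/run_training_bi_encoder",
    training_candidate_set_path="data/rebel/en_train_mapped_candidate_set.json",
    eval_candidate_set_path="data/rebel/en_val_mapped_candidate_set.json",
)
for command in commands:
    print(shlex.join(command))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to subprocess.run one at a time.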

Training only the relation extractor

To train only the relation extractor, set the number of candidates to 0 and the candidate loss weight to 0.0:

python src/candidate_generation/candidate_generator.py --mode TRAIN_CE --num_candidates 0 --candidate_weight 0.0 --train_dataset {train_dataset} --eval_dataset {eval_dataset} --training_candidate_set_path {training_candidate_set_path} --eval_candidate_set_path {eval_candidate_set_path}
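
Why a weight of 0.0 removes the candidate loss can be illustrated with a weighted sum. The form below is an assumption for illustration only; the repository's actual loss is not shown in this README, and the function name combined_loss and the numeric loss values are made up.

```python
def combined_loss(relation_loss: float, candidate_loss: float,
                  candidate_weight: float = 1.0) -> float:
    """Assumed form: relation loss plus a weighted candidate loss."""
    return relation_loss + candidate_weight * candidate_loss


# Illustrative loss values, not measured results.
joint = combined_loss(0.5, 0.25)                               # default weight 1.0
relation_only = combined_loss(0.5, 0.25, candidate_weight=0.0)  # candidate term vanishes
print(joint, relation_only)
```

Under this assumption, only the relation term contributes to training when --candidate_weight is 0.0.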

Running DISCIE

Arguments

The script accepts the following command-line arguments:

  • --debug: Enable debugging mode (default: False).
  • --spoof_boundaries: Use provided boundaries instead of doing mention recognition (default: False).
  • --include_mention_scores: Include mention scores into the combined scores (default: False).
  • --include_property_scores: Include property scores into the combined scores (default: False).
  • --alternative_relation_extractor: Use an alternative relation extractor (default: False).
  • --alternative_relation_extractor_use_types: Use types with the alternative relation extractor (default: False).
  • --alternative_relation_extractor_deactivate_text: Deactivate text with the alternative relation extractor (default: False).
  • --disambiguation_mode: Set the disambiguation mode (choices: SIMPLE, ...) (default: SIMPLE).
  • --dataset_path: Specify the dataset path (default: "data/rebel_small/en_val_small_v2_filtered.jsonl").
  • --bi_encoder_path: Specify the path to the bi-encoder model (default: "models/run_training_bi_encoder_new").
  • --mention_recognizer_path: Specify the path to the mention recognizer model (default: "models/mention_recognizer_2023-07-22_18-10-13/model-epoch=06-val_f1=0.85_val_f1.ckpt").
  • --crossencoder_path: Specify the path to the crossencoder model (default: "models/crossencoder_checkpoints/model-epoch=13-val_triple_f1=0.85_triple_f1.ckpt").
  • --relation_extractor_path: Specify the path to a separate relation extractor model (default: "models/cross_encoder_2023-07-26_16-30-38/model-epoch=25-val_triple_f1=0.90_triple_f1.ckpt").
  • --entity_restrictions: Specify entity restrictions (default: None). Necessary when evaluating on restricted datasets.
  • --property_restrictions: Specify property restrictions (default: None). Necessary when evaluating on restricted datasets.
  • --mention_threshold: Set the mention threshold (default: 0.5).
  • --property_threshold: Set the property threshold (default: 0.5).
  • --combined_threshold: Set the combined threshold (default: 0.5).
  • --num_candidates: Specify the number of candidates (default: 10).
  • --mode: Set the evaluation mode (choices: ET, E) (default: ET). ET evaluates for several thresholds, E only for the specified thresholds.

Pass any of these arguments when running the script:

python src/discriminative_cie/discriminative_cie.py
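
As an example of combining the value-taking arguments, the sketch below assembles a command for mode E with explicit thresholds. The helper name build_discie_command is hypothetical. Boolean options such as --debug are left out because this README does not say how they are passed on the command line.

```python
import shlex

SCRIPT = "src/discriminative_cie/discriminative_cie.py"


def build_discie_command(dataset_path, mode="ET", mention_threshold=0.5,
                         property_threshold=0.5, combined_threshold=0.5,
                         num_candidates=10):
    """Assemble a DISCIE invocation from the documented value-taking flags."""
    if mode not in ("ET", "E"):
        raise ValueError(f"mode must be 'ET' or 'E', got {mode!r}")
    command = ["python", SCRIPT, "--mode", mode,
               "--dataset_path", dataset_path,
               "--num_candidates", str(num_candidates)]
    thresholds = {
        "mention_threshold": mention_threshold,
        "property_threshold": property_threshold,
        "combined_threshold": combined_threshold,
    }
    # Append each threshold as "--name value".
    for name, value in thresholds.items():
        command += [f"--{name}", str(value)]
    return command


# Mode E evaluates only at the thresholds given here.
command = build_discie_command("data/rebel/en_test.jsonl", mode="E",
                               combined_threshold=0.6)
print(shlex.join(command))
```

In mode ET the script evaluates several thresholds itself, so the explicit threshold values matter mainly for mode E.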

Reproduce results

Download datasets from GenIE paper

Download the datasets by following the instructions in the GenIE paper:

  • REBEL
  • WikipediaNRE
  • GeoNRE
  • FewRel

Train the mention recognizer on the Rebel dataset:

python src/mention_recognizer/mention_recognizer.py --mode TRAIN --dataset_path {rebel_train_dataset_path} --output_path {mention_recognizer_output_path}

  • {rebel_train_dataset_path}: Path to the Rebel training dataset.
  • {mention_recognizer_output_path}: Path to the mention recognizer output directory.
  • Example: python src/mention_recognizer/mention_recognizer.py --mode TRAIN --dataset_path data/rebel/en_train.jsonl --output_path models/mention_recognizer

Train the bi-encoder on the Rebel dataset:

python src/candidate_generation/candidate_generator.py --mode TRAIN --train_dataset {rebel_train_dataset_path} --eval_dataset {rebel_val_dataset_path} --output_path {bi_encoder_output_path}

  • {rebel_train_dataset_path}: Path to the Rebel training dataset.
  • {rebel_val_dataset_path}: Path to the Rebel validation dataset.
  • {bi_encoder_output_path}: Path to the bi-encoder output directory.
  • Example: python src/candidate_generation/candidate_generator.py --mode TRAIN --train_dataset data/rebel/en_train.jsonl --eval_dataset data/rebel/en_val.jsonl --output_path models/run_training_bi_encoder

Generate candidates for the Rebel dataset:

python src/candidate_generation/candidate_generator.py --mode CANDIDATES --model_directory {bi_encoder_output_path} --candidate_generation_dataset {rebel_train_dataset_path}

  • {bi_encoder_output_path}: Path to the bi-encoder output directory.
  • {rebel_train_dataset_path}: Path to the Rebel training dataset.
  • Example: python src/candidate_generation/candidate_generator.py --mode CANDIDATES --model_directory models/run_training_bi_encoder --candidate_generation_dataset data/rebel/en_train.jsonl

Train the cross-encoder on the Rebel dataset:

python src/candidate_generation/candidate_generator.py --mode TRAIN_CE --train_dataset {rebel_train_dataset_path} --eval_dataset {rebel_val_dataset_path} --training_candidate_set_path {training_candidate_set_path} --eval_candidate_set_path {eval_candidate_set_path}

  • {rebel_train_dataset_path}: Path to the Rebel training dataset.
  • {rebel_val_dataset_path}: Path to the Rebel validation dataset.
  • {training_candidate_set_path}: Path to the training candidate set.
  • {eval_candidate_set_path}: Path to the evaluation candidate set.
  • Example: python src/candidate_generation/candidate_generator.py --mode TRAIN_CE --train_dataset data/rebel/en_train.jsonl --eval_dataset data/rebel/en_val.jsonl --training_candidate_set_path data/rebel/en_train_candidates.jsonl --eval_candidate_set_path data/rebel/en_val_candidates.jsonl

Optionally, train a separate relation extractor on the Rebel dataset:

python src/candidate_generation/candidate_generator.py --mode TRAIN_CE --num_candidates 0 --candidate_weight 0.0 --train_dataset {rebel_train_dataset_path} --eval_dataset {rebel_val_dataset_path} --training_candidate_set_path {training_candidate_set_path} --eval_candidate_set_path {eval_candidate_set_path}

  • {rebel_train_dataset_path}: Path to the Rebel training dataset.
  • {rebel_val_dataset_path}: Path to the Rebel validation dataset.
  • {relation_extractor_output_path}: Path to the relation extractor output directory.
  • Example: python src/candidate_generation/candidate_generator.py --mode TRAIN_CE --num_candidates 0 --candidate_weight 0.0 --train_dataset data/rebel/en_train.jsonl --eval_dataset data/rebel/en_val.jsonl --training_candidate_set_path data/rebel/en_train_candidates.jsonl --eval_candidate_set_path data/rebel/en_val_candidates.jsonl

Run DISCIE on the Rebel dataset:

python src/discriminative_cie/discriminative_cie.py --mode ET --dataset_path {rebel_val_dataset_path} --bi_encoder_path {bi_encoder_output_path} --mention_recognizer_path {mention_recognizer_output_path} --crossencoder_path {cross_encoder_output_path} --relation_extractor_path {relation_extractor_output_path}

python src/discriminative_cie/discriminative_cie.py --mode E --dataset_path {rebel_test_dataset_path} --bi_encoder_path {bi_encoder_output_path} --mention_recognizer_path {mention_recognizer_output_path} --crossencoder_path {cross_encoder_output_path} --relation_extractor_path {relation_extractor_output_path}

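  • {rebel_val_dataset_path}: Path to the Rebel validation dataset (used with --mode ET).
  • {rebel_test_dataset_path}: Path to the Rebel test dataset (used with --mode E).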
  • {bi_encoder_output_path}: Path to the bi-encoder output directory.
  • {mention_recognizer_output_path}: Path to the mention recognizer output directory.
  • {cross_encoder_output_path}: Path to the cross-encoder output directory.
  • {relation_extractor_output_path}: Path to the relation extractor output directory.
  • Example: python src/discriminative_cie/discriminative_cie.py --mode ET --dataset_path data/rebel/en_val.jsonl --bi_encoder_path models/run_training_bi_encoder --mention_recognizer_path models/mention_recognizer_2023-07-22_18-10-13/model-epoch=06-val_f1=0.85_val_f1.ckpt --crossencoder_path models/crossencoder_checkpoints/model-epoch=13-val_triple_f1=0.85_triple_f1.ckpt --relation_extractor_path models/relation_extractor/model-epoch=15-val_triple_f1=0.90_triple_f1.ckpt
  • Example: python src/discriminative_cie/discriminative_cie.py --mode E --dataset_path data/rebel/en_test.jsonl --bi_encoder_path models/run_training_bi_encoder --mention_recognizer_path models/mention_recognizer_2023-07-22_18-10-13/model-epoch=06-val_f1=0.85_val_f1.ckpt --crossencoder_path models/crossencoder_checkpoints/model-epoch=13-val_triple_f1=0.85_triple_f1.ckpt --relation_extractor_path models/relation_extractor/model-epoch=15-val_triple_f1=0.90_triple_f1.ckpt

Repeat the above steps for the other datasets. Optionally, fine-tune the REBEL-trained models on the other datasets.
