Discern-and-Answer

Why So Gullible? Enhancing the Robustness of Retrieval-Augmented Models against Counterfactual Noise [Paper]
Giwon Hong*, Jeonghwan Kim*, Junmo Kang*, Sung-Hyon Myaeng, Joyce Jiyoung Whang

Install

Clone the current repository

git clone https://github.com/wjdghks950/Discern-and-Answer.git
cd Discern-and-Answer

Install the conda environment

conda env create -f environment.yaml

Activate the conda environment

conda activate DaC_env

Dataset

Download the datasets from the following link: MacNoise dataset
Place them under the DATA folder

e.g., Discern-and-Answer/DATA/corpus/NQ_train_titlemerged_longpre-corpus.json

Training dataset list

DATA/corpus/NQ_train_titlemerged_longpre-corpus.json : The NQ (Natural Questions) train set used for finetuning the FiD model. This dataset is perturbed following the method by Longpre et al. (2021) and includes the top 100 documents per question.
DATA/corpus/NQ_train_titlemerged_chatgpt_top20.json : The NQ (Natural Questions) train set used for finetuning the FiD model. This dataset is perturbed by our proposed method (MacNoise) and includes the top 20 documents per question.
DATA/corpus/NQ_train_titlemerged_joint_top20.json : The NQ (Natural Questions) train set used for finetuning the FiD model. This dataset is perturbed following the method by Longpre et al. (2021) and our proposed method (MacNoise), and includes the top 100 documents per question (where top 20 documents have both Longpre & MacNoise perturbation).

Evaluation dataset list

DATA/corpus/NQ_eval_longpre_dev_256_new_fix.json : The sampled NQ dev set with 256 instances used to evaluate the finetuned FiD model and GPT 3.5. This dataset is perturbed following the method by Longpre et al. (2021) and includes the top 5 documents per question.
DATA/corpus/NQ_eval_longpre_test_fix.json : The NQ test set used to evaluate the finetuned FiD model and GPT 3.5. This dataset is perturbed following the method by Longpre et al. (2021) and includes the top 5 documents per question.
DATA/corpus/NQ_eval_gpt4_dev_256_new_fix.json : The sampled NQ dev set with 256 instances used to evaluate the finetuned FiD model and GPT 3.5. This dataset is perturbed by our proposed method (MacNoise) and includes the top 5 documents per question.
DATA/corpus/TQA_eval_gpt4_dev_256_new_fix.json : The sampled TQA dev set with 256 instances used to evaluate the finetuned FiD model and GPT 3.5. This dataset is perturbed by our proposed method (MacNoise) and includes the top 5 documents per question.
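
Before training or evaluating, it can help to check that a downloaded file sits where the scripts expect it and parses cleanly. The following is a minimal sketch, assuming only that each file holds a single JSON value (a list or a dict). The helper name `summarize_dataset` is hypothetical and not part of this repository, and nothing is assumed about the field names inside each record.

```python
import json
from pathlib import Path


def summarize_dataset(path):
    """Return (number of top-level entries, sorted keys of the first entry).

    Hypothetical helper: it assumes only that the file holds one JSON value.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    # A dict is treated as a mapping from ids to records.
    records = list(data.values()) if isinstance(data, dict) else list(data)
    first = records[0] if records else None
    keys = sorted(first.keys()) if isinstance(first, dict) else []
    return len(records), keys
```

For example, `summarize_dataset("DATA/corpus/NQ_eval_longpre_test_fix.json")` reports how many entries the file contains and which fields the first entry exposes.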

Train

Configure training settings in Discern-and-Answer/codes/FiD_contra/train_reader.sh

python train_reader.py \
        --train_data ../../DATA/corpus/NQ_train_titlemerged_longpre-corpus.json \
        --eval_data ../../DATA/corpus/NQ_dev_titlemerged_longpre-corpus.json \
        --model_size base \
        --per_gpu_batch_size 1 \
        --n_context 50 \
        --total_steps 640000 \
        --accumulation_steps 64 \
        --eval_freq 800000 \
        --save_freq 160000 \
        --name nq_base_640k_semi_parametric_disc_p75 \
        --checkpoint_dir checkpoint \
        --perturb 0.75 \
        --model_setting semi_parametric_pert #[parametric, semi_parametric, semi_parametric_pert]

train_data / eval_data: paths to the train/eval datasets. The list of possible datasets can be found above.
name: model name to be saved in FiD_contra/checkpoint
perturb: perturbation probability. During training, each document with a perturbable answer is perturbed with this probability. During evaluation, --perturb instead denotes the proportion of perturbed documents among all documents. Training values of 0, 0.31, 0.50, and 0.75 therefore correspond to evaluation values of 0, 0.15, 0.25, and 0.35, respectively. (For more details, please refer to the paper.)
model_setting: one of "parametric", "semi_parametric", or "semi_parametric_pert"

parametric: Infers an answer using only the model's parametric knowledge, without any retrieved passages. Performance is therefore unaffected by the perturbation probability.

semi_parametric: The same setting as the original FiD; answers are inferred using retrieved passages.

semi_parametric_pert: Jointly trains a discriminator that determines whether each passage is perturbed, in order to mitigate the effect of perturbed documents (Discriminator_FiD).
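
The two meanings of --perturb can be illustrated with a small sketch. It is not the repository's perturbation code: the function names are hypothetical, the sampling scheme (an independent draw per perturbable document) is an assumption, and the correspondence table simply restates the values given above as fractions, matching the `--perturb 0.75` and `--perturb 0.35` values used in the commands.

```python
import random

# Training value -> evaluation value, as stated in this README (written as fractions).
TRAIN_TO_EVAL_PERTURB = {0.0: 0.0, 0.31: 0.15, 0.50: 0.25, 0.75: 0.35}


def choose_perturbed(perturbable, prob, seed=0):
    """Training-time meaning: each perturbable document is picked
    independently with probability `prob` (assumed sampling scheme)."""
    rng = random.Random(seed)
    return [i for i, ok in enumerate(perturbable) if ok and rng.random() < prob]


def perturbed_proportion(perturbable, chosen):
    """Evaluation-time meaning: share of ALL documents that are perturbed."""
    return len(chosen) / len(perturbable) if perturbable else 0.0
```

Because only some documents contain a perturbable answer, the proportion of perturbed documents is lower than the training probability, which is why the two scales differ.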

Evaluation

Configure evaluation settings in Discern-and-Answer/codes/FiD_contra/test_reader.sh

python test_reader.py \
        --model_path checkpoint/nq_base_640k_semi_parametric_disc_p75/checkpoint/step-640001 \
        --eval_data ../../DATA/corpus/NQ_eval_longpre_dev_256_new_fix.json \
        --per_gpu_batch_size 1 \
        --n_context 5 \
        --name nq_base_640k_semi_parametric_disc_p75 \
        --checkpoint_dir checkpoint \
        --perturb 0.35 \
        --model_setting semi_parametric_pert #[parametric, semi_parametric, semi_parametric_pert]

model_path: path to model checkpoint
eval_data: paths to the evaluation datasets with deterministic perturbation in DATA/Evaluation. The list of possible datasets can be found above.
perturb: proportion of perturbed documents. It must be one of 0.0, 0.15, 0.25, or 0.35; each value selects a pre-made perturbation for deterministic evaluation.
model_setting: same as in training
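
To evaluate one checkpoint at every pre-made perturbation level, the command above can be generated once per level. The sketch below only builds the argument lists and uses only flags that appear in this README. The function name `build_eval_commands` and the `_p<level>` suffix appended to --name are assumptions made for illustration, not conventions of the repository.

```python
EVAL_PERTURB_LEVELS = (0.0, 0.15, 0.25, 0.35)


def build_eval_commands(model_path, eval_data, name, model_setting):
    """Return one argv list per pre-made perturbation level."""
    commands = []
    for level in EVAL_PERTURB_LEVELS:
        # Hypothetical naming scheme so that runs do not share a name.
        run_name = f"{name}_p{int(round(level * 100))}"
        commands.append([
            "python", "test_reader.py",
            "--model_path", model_path,
            "--eval_data", eval_data,
            "--per_gpu_batch_size", "1",
            "--n_context", "5",
            "--name", run_name,
            "--checkpoint_dir", "checkpoint",
            "--perturb", str(level),
            "--model_setting", model_setting,
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from within Discern-and-Answer/codes/FiD_contra.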

GPT-3.5

To handle frequent failures of the GPT-3.5 API, the code is provided as Jupyter notebooks.

Answer Generation

To generate answers through in-context learning with GPT-3.5, use the notebooks in Discern-and-Answer/codes/GPT.

GPT_in-context_learning_nq_longpre.ipynb : A script for generating answers using the GPT-3.5 model on the NQ dev/test set perturbed following the method by Longpre et al. (2021). During this process, the classification results of the dev/test set by the FiD model, trained on the corresponding train set, can be used.
GPT_in-context_learning_nq_gpt4.ipynb : A script for generating answers using the GPT-3.5 model on the NQ dev set perturbed by our proposed method (MacNoise dataset). During this process, the classification results of the dev/test set by the FiD model, trained on the corresponding train set, can be used.
GPT_in-context_learning_tqa_gpt4.ipynb : A script for generating answers using the GPT-3.5 model on the TQA dev set perturbed by our proposed method (MacNoise dataset). During this process, the classification results of the dev/test set by the FiD model, trained on the perturbed NQ train set, can be used.

The configuration options are set in the third cell of each notebook:

GPT3_api_key: API key to use GPT-3.5. Refer to https://beta.openai.com/
is_dev: if True, it will generate answers on the dev dataset. Else, on the test dataset
use_parametric_only: If True, generate answers using only the parametric knowledge of GPT-3.5, without providing the retrieved passages
use_pert_aware_instruction: If True, add a perturbation-aware instruction to GPT-3.5's prompt so that it identifies perturbed passages and ignores them during answer generation. (Discriminator_inst)
use_discriminator_fid: If True, inject the perturbation classification results of the FiD discriminator (Discriminator_FiD) into prompts instead of relying on GPT-3.5's own identification of perturbed passages
pert_ratio: perturbation probability. It should be one of 0.0, 0.15, 0.25, 0.35. This is to use a pre-made perturbation for deterministic evaluation.
train_sample_idx: Decide which of the 5 train samples to use. An ensemble can be performed in Evaluation only when answers are derived for all 5 train samples.
train_sample_path: path to train sample json file in DATA
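
The options above can be grouped and validated before any API call is made. The following is a minimal sketch, not the contents of the notebook cell: the class name `NotebookConfig`, the default values, and the 0-based range for `train_sample_idx` are all assumptions. Only the option names and the allowed `pert_ratio` values come from this README.

```python
from dataclasses import dataclass

ALLOWED_PERT_RATIOS = (0.0, 0.15, 0.25, 0.35)
NUM_TRAIN_SAMPLES = 5  # the README mentions 5 train samples


@dataclass
class NotebookConfig:
    GPT3_api_key: str
    is_dev: bool = True                       # default is an assumption
    use_parametric_only: bool = False         # default is an assumption
    use_pert_aware_instruction: bool = False  # default is an assumption
    use_discriminator_fid: bool = False       # default is an assumption
    pert_ratio: float = 0.0
    train_sample_idx: int = 0                 # assumed to be 0-based
    train_sample_path: str = ""

    def __post_init__(self):
        # Reject values for which no pre-made perturbation exists.
        if self.pert_ratio not in ALLOWED_PERT_RATIOS:
            raise ValueError(f"pert_ratio must be one of {ALLOWED_PERT_RATIOS}")
        if not 0 <= self.train_sample_idx < NUM_TRAIN_SAMPLES:
            raise ValueError("train_sample_idx must be in 0..4")
```

Failing early on an unsupported `pert_ratio` avoids spending API calls on a setting that has no matching pre-made perturbation.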

Evaluation

To evaluate answers generated by GPT-3.5, use scripts from Discern-and-Answer/codes/GPT.

GPT_evaluation_nq_longpre.ipynb : A script to evaluate answers generated by GPT_in-context_learning_nq_longpre.ipynb
GPT_evaluation_nq_gpt4.ipynb : A script to evaluate answers generated by GPT_in-context_learning_nq_gpt4.ipynb
GPT_evaluation_tqa_gpt4.ipynb : A script to evaluate answers generated by GPT_in-context_learning_tqa_gpt4.ipynb

The configuration options are set in the third cell and are identical to those in the Answer Generation notebooks.

Running all cells reports the best, average, and worst results across the 5 train samples, along with the ensemble result, for each specified setting.
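
How the reported numbers relate to each other can be sketched as follows. This is an illustration under stated assumptions rather than the notebooks' implementation: scoring by exact match after lowercasing and whitespace normalization, and ensembling by majority vote over the per-sample answers, are both assumptions, and all function names are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction, gold_answers):
    return normalize(prediction) in {normalize(g) for g in gold_answers}


def accuracy(predictions, gold):
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold))
    return hits / len(gold)


def summarize_runs(runs, gold):
    """runs: one list of predictions per train sample, aligned with gold."""
    scores = [accuracy(run, gold) for run in runs]
    # Majority vote per question over the normalized answers (assumed ensemble).
    voted = [
        Counter(normalize(p) for p in answers).most_common(1)[0][0]
        for answers in zip(*runs)
    ]
    return {
        "best": max(scores),
        "average": sum(scores) / len(scores),
        "worst": min(scores),
        "ensemble": accuracy(voted, gold),
    }
```

With this scheme the ensemble needs one prediction list per train sample, which is consistent with the note that answers must be generated for all 5 train samples first.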
