A Condense-then-select Framework for Text Summarization

This repository contains the source code for our paper "A Condense-then-select Framework for Text Summarization", published in the Knowledge-Based Systems (KBS) journal.

Our source code is built on the code of Fast Abstractive Summarization-RL.

If you use our code, please cite our paper:

@article{condense_then_abstract_2021,
  title={A condense-then-select strategy for text summarization},
  author={Chan, Hou Pong and King, Irwin},
  journal={Knowledge-Based Systems},
  pages={107235},
  year={2021},
  publisher={Elsevier}
}

Model Architecture

Dependencies

Python 3.6
PyTorch 1.4.0
cytoolz
tensorboardX
pyrouge
sentence-transformers 0.3.3
transformers 3.0.2

Please refer to requirements.txt for the full list of dependencies.

Data

CNN/DM: you can download our preprocessed version of the CNN/DM dataset here.
DUC-2002: please sign the agreements and request the DUC-2002 dataset by following the instructions here. After you obtain their approval, please send an email to me ([email protected]) to request our preprocessed version of DUC-2002.
Pubmed: you can download our preprocessed version of the Pubmed dataset here.

Setup

Our method with 1to1 top-1 abstractor

Training on CNN/DM

Export the path of the CNN/DM dataset: export DATA=path/to/CNNDM
Export the path for storing the cache of pretrained models: export MODEL_CACHE=path/to/model_cache
make the pseudo-labels for abstractor

python make_extraction_labels.py --ROUGE_mode r

pretrain the word2vec word embeddings

python train_word2vec.py --path=[path/to/word2vec]

build vocab

python build_vocab_pubmed.py --data_dir path/to/CNNDM

train one-to-one abstractor using ML objective

python train_abstractor.py --path=[path/to/abstractor] --w2v=[path/to/word2vec/word2vec.128d.226k.bin]

generate candidates from one-to-one abstractor. Alternatively, you can download our generated candidates here, and move the extracted folders to path/to/data/.

python decode_candidates.py --path=[path/to/data/train_cand_top1_beam] --abs_dir=saved_models/abstractor --beam=5 --topk 1 --batch 16 --split train
python decode_candidates.py --path=[path/to/data/val_cand_top1_beam] --abs_dir=saved_models/abstractor --beam=5 --topk 1 --batch 12 --split val

make the pseudo-labels for extractor

python make_extraction_labels.py --folder_name train_cand_top1_beam --ROUGE_mode f
python make_extraction_labels.py --folder_name val_cand_top1_beam --ROUGE_mode f
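The labeling commands in this section use --ROUGE_mode r for the abstractor and --ROUGE_mode f for the extractor. The sketch below illustrates one common way such pseudo-labels can be built: each reference summary sentence is greedily matched to the unused candidate sentence with the highest ROUGE-L score. It assumes that r and f stand for recall and F-score. The function names (lcs_len, rouge_l, greedy_labels) are hypothetical, and this is not the code in make_extraction_labels.py.

```python
from typing import List


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: List[str], reference: List[str], mode: str = "f") -> float:
    """ROUGE-L recall (mode 'r') or F1 (mode 'f') of candidate against reference."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)
    if mode == "r":
        return recall
    precision = lcs / len(candidate)
    return 2 * precision * recall / (precision + recall)


def greedy_labels(doc_sents: List[List[str]],
                  summary_sents: List[List[str]],
                  mode: str = "f") -> List[int]:
    """For each summary sentence, pick the best-scoring unused document sentence."""
    labels = []
    available = list(range(len(doc_sents)))
    for ref in summary_sents:
        if not available:
            break
        best = max(available, key=lambda i: rouge_l(doc_sents[i], ref, mode))
        labels.append(best)
        available.remove(best)
    return labels
```

Recall rewards a candidate for covering the reference regardless of its length, while F-score also penalizes extra words, which is one plausible reason to prefer it when the candidates differ in how much they compress a sentence.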

train extractor using ML objective with Sentence-BERT

python train_extractor_ml.py --path=saved_models/extractor_ml_top1_beam --net-type rewritten_sent_word_bert_rnn --num_candidates 2 --train_set_folder train_cand_top1_beam --valid_set_folder val_cand_top1_beam --batch 64 --min_lr 1e-5 --lr 5e-4 --ckpt_freq 1500

train extractor using RL objective with Sentence-BERT

python train_full_rl.py --path=saved_models/extractor_rl_top1_beam --ext_dir=saved_models/extractor_ml_top1_beam --abs_dir=saved_models/abstractor --num_candidates 2 --ext_type rewritten_sent_word_bert_rnn --train_set_folder train_cand_top1_beam --valid_set_folder val_cand_top1_beam --min_lr 1e-5 --patience 9 --reward_type 2 --lr 5e-5
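The RL command above passes --patience 9. As a rough illustration of what a patience setting usually controls, the sketch below stops training after a fixed number of consecutive validation checks without improvement. This reading of the flag is an assumption, and EarlyStopper is a hypothetical class, not the trainer used in this repository.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive checks without improvement."""

    def __init__(self, patience, higher_is_better=True):
        self.patience = patience
        self.higher_is_better = higher_is_better
        self.best = None
        self.bad_checks = 0

    def update(self, metric):
        """Record one validation result; return True when training should stop."""
        improved = (
            self.best is None
            or (metric > self.best if self.higher_is_better else metric < self.best)
        )
        if improved:
            self.best = metric
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience
```

With a reward-style metric, higher_is_better stays True; with a validation loss, it would be set to False.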

(Optional) train extractor using ML objective without Sentence-BERT

python train_extractor_ml.py --path=saved_models/extractor_ml_top1_beam_no_BERT --w2v=pretrained_embedding/word2vec.128d.226k.bin --net-type rewritten_rnn --num_candidates 2 --train_set_folder train_cand_top1_beam --valid_set_folder val_cand_top1_beam

(Optional) train extractor using RL objective without Sentence-BERT

python train_full_rl.py --path=saved_models/extractor_rl_top1_beam_no_BERT --ext_dir=saved_models/extractor_ml_top1_beam_no_BERT --abs_dir=saved_models/abstractor --num_candidates 2 --ext_type rewritten_rnn --train_set_folder train_cand_top1_beam --valid_set_folder val_cand_top1_beam --min_lr 5e-5 --patience 6 --reward_type 2
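The training steps above are run one after another, and each step depends on the output of the previous one. A minimal sketch of a driver that renders the commands and, optionally, executes them in order, stopping at the first failure. run_steps and LABEL_STEPS are hypothetical names that are not part of this repository; the two commands are copied from the extractor pseudo-label step above.

```python
import shlex
import subprocess
from typing import List


def run_steps(steps: List[List[str]], dry_run: bool = True) -> List[str]:
    """Render each command as a shell line and, unless dry_run, execute it.

    subprocess.run(..., check=True) raises CalledProcessError on a non-zero
    exit code, so later steps never run on top of a failed one.
    """
    rendered = []
    for cmd in steps:
        rendered.append(" ".join(shlex.quote(arg) for arg in cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return rendered


# Pseudo-label steps for the extractor, copied from the commands above.
LABEL_STEPS = [
    ["python", "make_extraction_labels.py",
     "--folder_name", "train_cand_top1_beam", "--ROUGE_mode", "f"],
    ["python", "make_extraction_labels.py",
     "--folder_name", "val_cand_top1_beam", "--ROUGE_mode", "f"],
]
```

Calling run_steps(LABEL_STEPS) only returns the command lines for inspection. Passing dry_run=False executes them in the current environment, so DATA and MODEL_CACHE need to be exported beforehand.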

Testing on CNN/DM

Export the path of the CNN/DM dataset: export DATA=path/to/CNNDM
Download pyrouge, and save it to path/to/pyrouge.

git clone https://github.com/andersjo/pyrouge.git

Export ROUGE score environment variable

export ROUGE=path/to/pyrouge/tools/ROUGE-1.5.5
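Before running the remaining steps, it can help to confirm that the exported paths actually exist. The sketch below is a hypothetical helper (missing_settings is not part of this repository) that reports environment variables that are unset or do not point to a directory. The default variable names DATA and ROUGE are the ones exported in this section.

```python
import os
from typing import List, Mapping, Optional, Sequence


def missing_settings(env: Optional[Mapping[str, str]] = None,
                     required: Sequence[str] = ("DATA", "ROUGE")) -> List[str]:
    """Return human-readable problems with the expected environment variables."""
    env = os.environ if env is None else env
    problems = []
    for name in required:
        value = env.get(name)
        if not value:
            problems.append("{} is not set".format(name))
        elif not os.path.isdir(value):
            problems.append("{}={} is not a directory".format(name, value))
    return problems
```

An empty list means every required variable is set and points to an existing directory. For the training sections, MODEL_CACHE can be added to required in the same way.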

generate candidates from abstractor for the test set. You can skip this step if you downloaded our generated candidates.

python decode_candidates.py --path=[path/to/data/test_cand_top1_beam] --abs_dir=saved_models/abstractor --beam=5 --topk 1 --batch 12 --split test

Make the reference for evaluation

python make_eval_references.py --folder_name test_cand_top1_beam

Decode summaries from model

python decode_full_model_cand.py --path [path/to/save/decoded/files] --model_dir [path/to/extractor_rl] --num_candidates 2 --beam 5 --test_set_folder test_cand_top1_beam --abstracted

Run evaluation

python eval_full_model.py --rouge --decode_dir [path/to/save/decoded/files]

Test on DUC

Export the path of ROUGE
Export the path of the duc2002 dataset: export DATA=path/to/duc2002
generate candidates from one-to-one abstractor.

python decode_candidates.py --path=[path/to/data/test_cand_top1_beam] --abs_dir=saved_models/abstractor --beam=5 --topk 1 --batch 6 --split test

Make the reference for evaluation

python make_eval_references_duc.py --folder_name test_cand_top1_beam

Decode summaries from model

python3 -u decode_full_model_cand.py --path [path/to/save/decoded/files] --model_dir [path/to/extractor_rl] --num_candidates 2 --beam 5 --test_set_folder test_cand_top1_beam --abstracted

Run evaluation

python eval_full_model_duc.py --rouge --decode_dir=[path/to/save/decoded/files]

Train and test on Pubmed

Export the path of ROUGE
Export the path of the Pubmed dataset: export DATA=path/to/pubmed
Export the path for storing the cache of pretrained models: export MODEL_CACHE=path/to/model_cache
make the pseudo-labels for abstractor

python make_extraction_labels.py --ROUGE_mode r

build vocab

python build_vocab_pubmed.py --data_dir path/to/pubmed

pretrain word embedding

python train_word2vec.py --path=[path/to/word2vec_pubmed]

train one-to-one abstractor using ML objective

python train_abstractor.py --path=saved_models/abstractor_ml_pubmed_max_50 --w2v=pretrained_embedding_pubmed/word2vec.128d.405k.bin --max_abs 50

generate candidates from one-to-one abstractor, or you can download our generated candidates here, and move the extracted folders to path/to/data/.

python decode_candidates.py --path=[path/to/data/train_cand_top1_beam] --abs_dir=saved_models/abstractor_ml_pubmed_max_50 --beam=5 --topk 1 --batch 12 --split train
python decode_candidates.py --path=[path/to/data/val_cand_top1_beam] --abs_dir=saved_models/abstractor_ml_pubmed_max_50 --beam=5 --topk 1 --batch 12 --split val
python decode_candidates.py --path=[path/to/data/test_cand_top1_beam] --abs_dir=saved_models/abstractor_ml_pubmed_max_50 --beam=5 --topk 1 --batch 12 --split test

make the pseudo-labels for extractor

python make_extraction_labels.py --folder_name train_cand_top1_beam --ROUGE_mode f
python make_extraction_labels.py --folder_name val_cand_top1_beam --ROUGE_mode f

train extractor using ML objective without Sentence-BERT

python3 -u train_extractor_ml.py --path=saved_models/extractor_ml_pubmed --net-type rewritten_rnn --num_candidates 2 --train_set_folder train_cand_top1_beam --valid_set_folder val_cand_top1_beam --batch 64 --min_lr 1e-5 --lr 5e-4 --ckpt_freq 1500 --max_word 100 --max_sent 700 --w2v=pretrained_embedding_pubmed/word2vec.128d.405k.bin

train extractor using RL objective without Sentence-BERT

python3 -u train_full_rl.py --path=saved_models/extractor_rl_pubmed --ext_dir=saved_models/extractor_ml_pubmed --abs_dir=saved_models/abstractor_ml_pubmed_max_50 --num_candidates 2 --ext_type rewritten_rnn --train_set_folder train_cand_top1_beam --valid_set_folder val_cand_top1_beam --min_lr 1e-5 --patience 6 --reward_type 2 --max_word 100 --max_sent 700

Decode summaries from model

python3 decode_full_model_cand.py --path=[path/to/save/decoded/files] --model_dir=[path/to/extractor_rl] --num_candidates 2 --beam 1 --test_set_folder test_cand_top1_beam --abstracted

Make the reference for evaluation

python make_eval_references.py --folder_name test_cand_top1_beam

Run evaluation

python3 eval_full_model_pubmed.py --rouge --decode_dir=[path/to/save/decoded/files]

Our method with compression-controllable abstractor

Training on CNN/DM

Export the path of the CNN/DM dataset: export DATA=path/to/CNNDM
Export the path for storing the cache of pretrained models: export MODEL_CACHE=path/to/model_cache
make the pseudo-labels for abstractor, pretrain word embedding, and build vocab by following the instructions for the 1to1 top-1 abstractor. You only need to do this once.
make compression level labels for compression-controllable abstractor

python make_compression_label.py --split all
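This README does not spell out how a compression level is defined. Purely as an illustration, the sketch below assumes that a level is derived from the length ratio between a target sentence and its source sentence, bucketed at thresholds. The function names and the default threshold of 0.5 are hypothetical and are not taken from make_compression_label.py.

```python
from typing import List, Sequence


def compression_ratio(source: List[str], target: List[str]) -> float:
    """Target length divided by source length (smaller means more compression)."""
    if not source:
        raise ValueError("source sentence must not be empty")
    return len(target) / len(source)


def compression_level(ratio: float, thresholds: Sequence[float] = (0.5,)) -> int:
    """Map a ratio to a bucket index; there are len(thresholds) + 1 levels.

    Level 0 is the most compressed bucket. The default threshold is
    illustrative only.
    """
    level = 0
    for t in sorted(thresholds):
        if ratio > t:
            level += 1
    return level
```

Under this assumption, a single threshold yields two levels, which would match the --n_compression_levels 2 setting used in the decoding commands of this section.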

train compression-controllable abstractor using ML objective

python train_controllable_abstractor.py --path=[path/to/compression_controllable_abstractor] --w2v=[path/to/word2vec/word2vec.128d.226k.bin]

generate candidates from compression-controllable abstractor. Alternatively, you can download our generated candidates here and move the extracted folders to path/to/data/.

python decode_compression.py --path=[path/to/data/val_cand_control_abs_2] --abs_dir=[path/to/compression_controllable_abstractor] --beam=5 --topk 1 --batch 3 --split val --n_compression_levels 2
python decode_compression.py --path=[path/to/data/train_cand_control_abs_2] --abs_dir=[path/to/compression_controllable_abstractor] --beam=5 --topk 1 --batch 3 --split train --n_compression_levels 2

make the pseudo-labels for extractor

python make_extraction_labels.py --folder_name train_cand_control_abs_2 --ROUGE_mode f
python make_extraction_labels.py --folder_name val_cand_control_abs_2 --ROUGE_mode f

train extractor using ML objective with Sentence-BERT

python train_extractor_ml.py --path=[path/to/extractor_ml] --net-type rewritten_sent_word_bert_rnn --num_candidates 3 --train_set_folder train_cand_control_abs_2 --valid_set_folder val_cand_control_abs_2 --batch 64 --min_lr 1e-5 --lr 5e-4 --ckpt_freq 1500

train extractor using RL objective with Sentence-BERT

python train_full_rl.py --path=[path/to/extractor_rl] --ext_dir=[path/to/extractor_ml] --abs_dir=[path/to/compression_controllable_abstractor] --num_candidates 3 --ext_type rewritten_sent_word_bert_rnn --train_set_folder train_cand_control_abs_2 --valid_set_folder val_cand_control_abs_2 --min_lr 1e-5 --patience 9 --reward_type 2

(Optional) train extractor using ML objective without Sentence-BERT

python train_extractor_ml.py --path=[path/to/extractor_ml_no_BERT] --net-type rewritten_rnn --w2v=pretrained_embedding/word2vec.128d.226k.bin --num_candidates 3 --train_set_folder train_cand_control_abs_2 --valid_set_folder val_cand_control_abs_2

(Optional) train extractor using RL objective without Sentence-BERT

python train_full_rl.py --path=[path/to/extractor_rl_no_BERT] --ext_dir=[path/to/extractor_ml_no_BERT] --abs_dir=[path/to/compression_controllable_abstractor] --num_candidates 3 --ext_type rewritten_rnn --train_set_folder train_cand_control_abs_2 --valid_set_folder val_cand_control_abs_2 --min_lr 5e-5 --patience 6 --reward_type 2

Testing on CNN/DM

Export the path of the CNN/DM dataset: export DATA=path/to/CNNDM
Download and export the path of ROUGE following the testing procedure of 1to1 top-1 abstractor.
generate candidates from compression-controllable abstractor. You can skip this step if you downloaded our generated candidates.

python decode_compression.py --path=[path/to/data/test_cand_control_abs_2] --abs_dir=[path/to/compression_controllable_abstractor] --beam=5 --topk 1 --batch 3 --split test --n_compression_levels 2

Make the reference for evaluation

python make_eval_references.py --folder_name test_cand_control_abs_2

Decode summaries from model

python decode_full_model_cand.py --path [path/to/save/decoded/files] --model_dir [path/to/extractor_rl] --num_candidates 3 --beam 5 --test_set_folder test_cand_control_abs_2 --abstracted

Run evaluation

python eval_full_model.py --rouge --decode_dir [path/to/save/decoded/files]

Test on DUC

Export the path of ROUGE
Export the path of the duc2002 dataset: export DATA=path/to/duc2002
generate candidates from compression-controllable abstractor.

python decode_compression.py --path=[path/to/data/test_cand_control_abs_2] --abs_dir=[path/to/compression_controllable_abstractor] --beam=5 --topk 1 --batch 3 --split test --n_compression_levels 2

Make the reference for evaluation

python make_eval_references_duc.py --folder_name test_cand_control_abs_2

Decode summaries from model

python3 -u decode_full_model_cand.py --path [path/to/save/decoded/files] --model_dir [path/to/extractor_rl] --num_candidates 3 --beam 5 --test_set_folder test_cand_control_abs_2 --abstracted

Run evaluation

python eval_full_model_duc.py --rouge --decode_dir=[path/to/save/decoded/files]

Train and test on Pubmed

Export the path of ROUGE
Export the path of the Pubmed dataset: export DATA=path/to/pubmed
Export the path for storing the cache of pretrained models: export MODEL_CACHE=path/to/model_cache
make the pseudo-labels for abstractor, pretrain word embedding, and build vocab by following the instructions for the 1to1 top-1 abstractor. You only need to do this once.
make compression level labels for compression-controllable abstractor

python make_compression_label.py --split all

train compression-controllable abstractor using ML objective

python train_controllable_abstractor.py --path=saved_models/control_abstractor_pubmed_max_50 --w2v=pretrained_embedding_pubmed/word2vec.128d.405k.bin --max_abs 50

generate candidates from compression-controllable abstractor, or you can download our generated candidates here, and move the extracted folders to path/to/data/.

python decode_compression.py --path=[path/to/data/train_cand_control_abs_2] --abs_dir=saved_models/control_abstractor_pubmed_max_50 --beam=5 --topk 1 --batch 4 --split train --n_compression_levels 2 --max_dec_word 50
python decode_compression.py --path=[path/to/data/val_cand_control_abs_2] --abs_dir=saved_models/control_abstractor_pubmed_max_50 --beam=5 --topk 1 --batch 4 --split val --n_compression_levels 2 --max_dec_word 50
python decode_compression.py --path=[path/to/data/test_cand_control_abs_2] --abs_dir=saved_models/control_abstractor_pubmed_max_50 --beam=5 --topk 1 --batch 4 --split test --n_compression_levels 2 --max_dec_word 50

make the pseudo-labels for extractor

python make_extraction_labels.py --folder_name train_cand_control_abs_2 --ROUGE_mode f
python make_extraction_labels.py --folder_name val_cand_control_abs_2 --ROUGE_mode f

train extractor using ML objective without Sentence-BERT

python3 -u train_extractor_ml.py --path=[path/to/extractor_ml_pubmed] --net-type rewritten_rnn --num_candidates 3 --train_set_folder train_cand_control_abs_2 --valid_set_folder val_cand_control_abs_2 --batch 32 --min_lr 1e-5 --lr 5e-4 --ckpt_freq 3000 --max_word 100 --max_sent 1050 --w2v=pretrained_embedding_pubmed/word2vec.128d.405k.bin

train extractor using RL objective without Sentence-BERT

python3 -u train_full_rl.py --path=[path/to/extractor_rl_pubmed] --ext_dir=[path/to/extractor_ml_pubmed] --abs_dir=saved_models/abstractor_ml_pubmed_max_50 --num_candidates 3 --ext_type rewritten_rnn --train_set_folder train_cand_control_abs_2 --valid_set_folder val_cand_control_abs_2 --min_lr 1e-5 --patience 6 --reward_type 2 --max_word 100 --max_sent 1050

Decode summaries from model

python3 decode_full_model.py --path=[path/to/save/decoded/files] --model_dir=[path/to/extractor_rl_pubmed] --beam=5 --test

Make the reference for evaluation

python make_eval_references.py --folder_name test_cand_control_abs_2

Run evaluation

python3 eval_full_model_pubmed.py --rouge --decode_dir=[path/to/save/decoded/files]

