
A Hierarchical Encoding-Decoding Scheme for Abstractive Multi-document Summarization

Authors: Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You and Lidong Bing

This repository contains code and related resources of our paper "A Hierarchical Encoding-Decoding Scheme for Abstractive Multi-document Summarization".


If you find our paper and resources useful, please kindly leave a star and cite our paper. Thanks!

@inproceedings{shen2023hierencdec,
  title={A Hierarchical Encoding-Decoding Scheme for Abstractive Multi-document Summarization},
  author={Shen, Chenhui and Cheng, Liying and Nguyen, Xuan-Phi and Bing, Lidong and You, Yang},
  booktitle={Findings of EMNLP},
  url={https://arxiv.org/abs/2305.08503},
  year={2023}
}

Catalogue:

  • 1. Introduction
  • 2. Running our Code
    • 2.1. Pre-requisites
    • 2.2. Commands to reproduce our results
    • 2.3. Attention Analysis


1. Introduction: [Back to Top]

Pre-trained language models (PLMs) have achieved impressive results in abstractive single-document summarization (SDS). However, such benefits may not readily extend to multi-document summarization (MDS), where the interactions among documents are more complex. Previous works either design new architectures or new pre-training objectives for MDS, or apply PLMs to MDS without considering the complex document interactions. While the former does not make full use of previous pre-training efforts and may not generalize well across multiple domains, the latter cannot fully attend to the intricate relationships unique to MDS tasks. In this paper, we enforce hierarchy on both the encoder and decoder and seek to make better use of a PLM to facilitate multi-document interactions for the MDS task. We test our design on 10 MDS datasets across a wide range of domains. Extensive experiments show that our proposed method achieves consistent improvements on all these datasets, outperforming the previous best models, and even achieving better or competitive results compared to some models with additional MDS pre-training or larger model parameters.


2. Running our Code

The data can be downloaded at HierEncDec_data.zip. Unzip this folder to data/ and store it under the root directory. Alternatively, you may use your own dataset formatted as Doc 1 <REVBREAK> Doc 2 <REVBREAK> ... <REVBREAK> Doc n, where <REVBREAK> is the separator between documents. The exact locations where we downloaded the existing datasets are provided in Appendix B of our paper. Note that to reproduce the PRIMERA results, you need to use PRIMERA's own document separator token instead.
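
As an illustration only (not part of the released code), the following sketch joins each example's documents with <REVBREAK> and writes a CSV; the "text" and "summary" column names and the output path are assumptions and should match whatever columns your run_summarization.py invocation expects.

# Sketch: build a custom CSV with documents joined by <REVBREAK>.
# The "text"/"summary" column names and the output path are assumptions.
import csv
import os

SEP = " <REVBREAK> "

examples = [
    {"documents": ["First source document.", "Second source document."],
     "summary": "A short reference summary."},
]

os.makedirs("data/my_dataset", exist_ok=True)
with open("data/my_dataset/train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "summary"])
    writer.writeheader()
    for ex in examples:
        writer.writerow({"text": SEP.join(ex["documents"]),
                         "summary": ex["summary"]})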

For the downloaded data/, our new datasets are organized as follows:

  • the MReD+ data are under the mred/ folder, in files ending with _rebuttal.json.
  • the data for the 4 Wikipedia domains are stored under the Film/, MeanOfTransportation/, Software/, and Town/ folders respectively.

2.1. Pre-requisites: [Back to Top]

We use conda environments.

conda create --prefix <path_to_env> python=3.7
conda activate <path_to_env>
# install torch according to your cuda version, for instance:
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip install -r requirements.txt

2.2. Commands to reproduce our results: [Back to Top]

To quickly test the code for the following sections, add the following flags

--max_steps 10 --max_train_samples 10 --max_eval_samples 10 --max_predict_samples 10

NOTE:

  1. To replicate our results, you need to run on an A100 (80G) GPU. Alternatively, to run on a V100, truncate the source input further by setting a smaller value of --max_source_length (e.g. 1024 or 2048) to avoid OOM errors, but note that this differs from the setting of 4096 used in our paper.
  2. Currently our code only supports a batch size of 1 (a larger value would lead to OOM errors anyway), so it is important to set --per_device_train_batch_size=1 --per_device_eval_batch_size=1.

2.2.1. Reproduce results on BART: [Back to Top]

To reproduce our BART+HED on MReD, run the following command:

CUDA_VISIBLE_DEVICES=0 python run_summarization.py --output_dir results/bart_hed_mred --model_name_or_path facebook/bart-large --do_train --do_predict --train_file data/mred/train.csv --test_file data/mred/test.csv --overwrite_output_dir --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --predict_with_generate --seed 0 --max_source_length 4096 --max_target_length 1024 --save_steps 500 --save_strategy steps --save_total_limit 3 --num_train_epochs 3 --max_steps 10500 --enc_cross_doc --doc_dec

Specifically, the flag --enc_cross_doc enables the hierarchical encoder, whereas --doc_dec enables the hierarchical decoder.

For other datasets, set --max_steps to the following values, and use --per_passage_source_length_limit for the first 3 datasets (see more explanations in Section 5.3 of our paper). A helper sketch for assembling these commands is given after the list.

  • Multi-News: 130000, use additional flag --per_passage_source_length_limit
  • WCEP: 15500, use additional flag --per_passage_source_length_limit
  • Multi-Xscience: 90000, use additional flag --per_passage_source_length_limit
  • Rotten Tomatoes: 4500
  • MReD: 10500
  • MReD+: 10500
  • WikiDomains-Film: 85000
  • WikiDomains-MeanOfTransportation: 20000
  • WikiDomains-Town: 37000
  • WikiDomains-Software: 35000

For the ablation settings (see Table 4):

# using <s> components only
CUDA_VISIBLE_DEVICES=0 python run_summarization.py --output_dir results/bart_hed_mred --model_name_or_path facebook/bart-large --do_train --do_predict --train_file data/mred/train.csv --test_file data/mred/test.csv --overwrite_output_dir --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --predict_with_generate --seed 0 --max_source_length 4096 --max_target_length 1024 --save_steps 500 --save_strategy steps --save_total_limit 3 --num_train_epochs 3 --max_steps 10500

# NOTE: For all following settings, simply add the following flags on top of the above command

# run on the pre-trained BART without any structural modifications
--use_original_bart

# using <s> and HAE
--enc_cross_doc --no_posres_only

# using <s>, HAE and PR
--enc_cross_doc

# using <s>, HAE, and HAD
--enc_cross_doc --no_posres_only --doc_dec

# using <s>, HAE, HAD, and PR (This is basically our full HED setting)
--enc_cross_doc --doc_dec

2.2.2. Reproduce our baselines (Table 1 upper section): [Back to Top]

  • to reproduce the LED results, run:

    python finetune_led.py --output_dir results/led_mred --model_name_or_path allenai/led-large-16384 --do_train --do_predict --train_file data/mred/train.csv --test_file data/mred/test.csv --overwrite_output_dir --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --predict_with_generate --seed 0 --max_source_length 4096 --max_target_length 1024 --save_steps 500 --save_strategy steps --save_total_limit 3 --num_train_epochs 3 --max_steps 10500
  • to reproduce the LongT5 results, run:

    python finetune_longt5.py --output_dir results/longt5_base_mred --source_prefix 'summarize: ' --model_name_or_path google/long-t5-tglobal-base --do_train --do_predict --train_file data/mred/train.csv --test_file data/mred/test.csv --overwrite_output_dir --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --predict_with_generate --seed 0 --max_source_length 4096 --max_target_length 1024 --save_steps 500 --save_strategy steps --save_total_limit 3 --num_train_epochs 3 --max_steps 10500
  • to reproduce the PRIMERA results, kindly follow the official PRIMERA GitHub repo.

  • to reproduce the BigBird results, run:

    CUDA_VISIBLE_DEVICES=0 python finetune_bigbird.py --output_dir results/bigbird_mred --model_name_or_path google/bigbird-pegasus-large-arxiv --do_train --do_predict --train_file data/mred/train.csv --test_file data/mred/test.csv --overwrite_output_dir --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --predict_with_generate --seed 0 --max_source_length 4096 --max_target_length 1024 --save_steps 500 --save_strategy steps --save_total_limit 3 --num_train_epochs 3 --max_steps 10500

2.3. Attention Analysis: [Back to Top]

To conduct attention analysis, run

CUDA_VISIBLE_DEVICES=0 python run_summarization.py --model_name_or_path results/<your_trained_model_name> --output_dir results/<your_preferred_save_dir> --do_predict --test_file data/mred/test.csv --overwrite_output_dir --per_device_eval_batch_size=1 --predict_with_generate --max_source_length 4096 --max_target_length 1024 --max_predict_samples 200 --enc_cross_doc --doc_dec --model_analysis --analyze_self_attn --analyze_cross_attn --model_analysis_file mred_hed_attn_analysis.txt 

Specifically, --model_analysis must be enabled, while --analyze_self_attn and/or --analyze_cross_attn can be used (together or separately) to conduct the corresponding encoder self-attention and/or decoder cross-attention analysis.
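
For a rough feel of what such an analysis inspects, the sketch below (not the repo's --model_analysis implementation) extracts encoder self-attention and decoder cross-attention weights from a vanilla Hugging Face BART checkpoint.

# Rough illustration only (not the repo's --model_analysis code): inspect
# encoder self-attention and decoder cross-attention of a vanilla BART.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

source = "Doc 1 text <REVBREAK> Doc 2 text"
target = "A short summary."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(**inputs, labels=labels,
                output_attentions=True, return_dict=True)

# out.encoder_attentions: per-layer tensors of shape (batch, heads, src_len, src_len)
# out.cross_attentions:   per-layer tensors of shape (batch, heads, tgt_len, src_len)
enc_self = torch.stack(out.encoder_attentions)  # (layers, batch, heads, src, src)
cross = torch.stack(out.cross_attentions)       # (layers, batch, heads, tgt, src)
print(enc_self.shape, cross.shape)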
