Summarization Papers

Full List

Organized by Xiachong Feng.

Contributor

Yichong Huang, Haozheng Yang, Jiaan Wang

Summarization Learning Route

Summarization Learning Route (with link)

Presentations && Notes

Big Model Era

TempoSum: Evaluating the Temporal Generalization of Abstractive Summarization Chi Seng Cheang, Hou Pong Chan, Derek F. Wong, Xuebo Liu, Zhaocong Li, Yanming Sun, Shudong Liu, Lidia S. Chao [pdf] [data]

[Abs]
Recent pre-trained language models (PLMs) achieve promising results in existing abstractive summarization datasets. However, existing summarization benchmarks overlap in time with the standard pre-training corpora and finetuning datasets. Hence, the strong performance of PLMs may rely on the parametric knowledge that is memorized during pre-training and fine-tuning. Moreover, the knowledge memorized by PLMs may quickly become outdated, which affects the generalization performance of PLMs on future data. In this work, we propose TempoSum, a novel benchmark that contains data samples from 2010 to 2022, to understand the temporal generalization ability of abstractive summarization models. Through extensive human evaluation, we show that parametric knowledge stored in summarization models significantly affects the faithfulness of the generated summaries on future data. Moreover, existing faithfulness enhancement methods cannot reliably improve the faithfulness of summarization models on future data. Finally, we discuss several recommendations to the research community on how to evaluate and improve the temporal generalization capability of text summarization models.
RadAdapt: Radiology Report Summarization via Lightweight Domain Adaptation of Large Language Models Dave Van Veen, Cara Van Uden, Maayane Attias, Anuj Pareek, Christian Bluethgen, Malgorzata Polacin, Wah Chiu, Jean-Benoit Delbrouck, Juan Manuel Zambrano Chaves, Curtis P. Langlotz, Akshay S. Chaudhari, John Pauly [pdf]

[Abs]
We systematically investigate lightweight strategies to adapt large language models (LLMs) for the task of radiology report summarization (RRS). Specifically, we focus on domain adaptation via pretraining (on natural language, biomedical text, and clinical text) and via prompting (zero-shot, in-context learning) or parameter-efficient fine-tuning (prefix tuning, LoRA). Our results on the MIMIC-III dataset consistently demonstrate best performance by maximally adapting to the task via pretraining on clinical text and parameter-efficient fine-tuning on RRS examples. Importantly, this method fine-tunes a mere 0.32% of parameters throughout the model, in contrast to end-to-end fine-tuning (100% of parameters). Additionally, we study the effect of in-context examples and out-of-distribution (OOD) training before concluding with a radiologist reader study and qualitative analysis. Our findings highlight the importance of domain adaptation in RRS and provide valuable insights toward developing effective natural language processing solutions for clinical tasks.
ImpressionGPT: An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT Chong Ma, Zihao Wu, Jiaqi Wang, Shaochen Xu, Yaonai Wei, Zhengliang Liu, Lei Guo, Xiaoyan Cai, Shu Zhang, Tuo Zhang, Dajiang Zhu, Dinggang Shen, Tianming Liu, Xiang Li [pdf]

[Abs]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians, and it is typically written by radiologists based on the 'Findings' section. However, writing numerous impressions can be laborious and error-prone for radiologists. Although recent studies have achieved promising results in automatic impression generation using large-scale medical text data for pre-training and fine-tuning pre-trained language models, such models often require substantial amounts of medical text data and have poor generalization performance. While large language models (LLMs) like ChatGPT have shown strong generalization capabilities and performance, their performance in specific domains, such as radiology, remains under-investigated and potentially limited. To address this limitation, we propose ImpressionGPT, which leverages the in-context learning capability of LLMs by constructing dynamic contexts using domain-specific, individualized data. This dynamic prompt approach enables the model to learn contextual knowledge from semantically similar examples from existing data. Additionally, we design an iterative optimization algorithm that performs automatic evaluation on the generated impression results and composes the corresponding instruction prompts to further optimize the model. The proposed ImpressionGPT model achieves state-of-the-art performance on both MIMIC-CXR and OpenI datasets without requiring additional training data or fine-tuning the LLMs. This work presents a paradigm for localizing LLMs that can be applied in a wide range of similar application scenarios, bridging the gap between general-purpose LLMs and the specific language processing needs of various domains.
Extractive Summarization via ChatGPT for Faithful Summary Generation Haopeng Zhang, Xiao Liu, Jiawei Zhang [pdf]

[Abs]
Extractive summarization is a crucial task in natural language processing that aims to condense long documents into shorter versions by directly extracting sentences. The recent introduction of ChatGPT has attracted significant interest in the NLP community due to its remarkable performance on a wide range of downstream tasks. However, concerns regarding factuality and faithfulness have hindered its practical applications for summarization systems. This paper first presents a thorough evaluation of ChatGPT's performance on extractive summarization and compares it with traditional fine-tuning methods on various benchmark datasets. Our experimental analysis reveals that ChatGPT's extractive summarization performance is still inferior to existing supervised systems in terms of ROUGE scores. In addition, we explore the effectiveness of in-context learning and chain-of-thought reasoning for enhancing its performance. Furthermore, we find that applying an extract-then-generate pipeline with ChatGPT yields significant performance improvements over abstractive baselines in terms of summary faithfulness. These observations highlight potential directions for enhancing ChatGPT's capabilities for faithful text summarization tasks using two-stage approaches.
Human-like Summarization Evaluation with ChatGPT Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, Xiaojun Wan [pdf]

[Abs]
Evaluating text summarization is a challenging problem, and existing evaluation metrics are far from satisfactory. In this study, we explored ChatGPT's ability to perform human-like summarization evaluation using four human evaluation methods on five datasets. We found that ChatGPT was able to complete annotations relatively smoothly using Likert scale scoring, pairwise comparison, Pyramid, and binary factuality evaluation. Additionally, it outperformed commonly used automatic evaluation metrics on some datasets. Furthermore, we discussed the impact of different prompts, compared its performance with that of human evaluation, and analyzed the generated explanations and invalid responses.
Cross-Lingual Summarization via ChatGPT Jiaan Wang, Yunlong Liang, Fandong Meng, Zhixu Li, Jianfeng Qu, Jie Zhou [pdf]

[Abs]
Given a document in a source language, cross-lingual summarization (CLS) aims to generate a summary in a different target language. Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. However, the performance of ChatGPT on CLS is not yet known. In this report, we empirically use various prompts to guide ChatGPT to perform zero-shot CLS from different paradigms (i.e., end-to-end and pipeline), and provide a preliminary evaluation on its generated summaries. We find that ChatGPT originally prefers to produce lengthy summaries with more detailed information. But with the help of an interactive prompt, ChatGPT can balance between informativeness and conciseness, and significantly improve its CLS performance. Experimental results on three widely-used CLS datasets show that ChatGPT outperforms the advanced GPT 3.5 model (i.e., text-davinci-003). In addition, we provide qualitative case studies to show the superiority of ChatGPT on CLS.
Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen, Wei Cheng [pdf]

[Abs]
Text summarization has been a crucial problem in natural language processing (NLP) for several decades. It aims to condense lengthy documents into shorter versions while retaining the most critical information. Various methods have been proposed for text summarization, including extractive and abstractive summarization. The emergence of large language models (LLMs) like GPT3 and ChatGPT has recently created significant interest in using these models for text summarization tasks. Recent studies (Goyal et al., 2022; Zhang et al., 2023) have shown that LLMs-generated news summaries are already on par with humans. However, the performance of LLMs for more practical applications like aspect or query-based summaries is underexplored. To fill this gap, we conducted an evaluation of ChatGPT's performance on four widely used benchmark datasets, encompassing diverse summaries from Reddit posts, news articles, dialogue meetings, and stories. Our experiments reveal that ChatGPT's performance is comparable to traditional fine-tuning methods in terms of Rouge scores. Moreover, we highlight some unique differences between ChatGPT-generated summaries and human references, providing valuable insights into the superpower of ChatGPT for diverse text summarization tasks. Our findings call for new directions in this area, and we plan to conduct further research to systematically examine the characteristics of ChatGPT-generated summaries through extensive human evaluation.
News Summarization and Evaluation in the Era of GPT-3 Tanya Goyal, Junyi Jessy Li, Greg Durrett [pdf] [code]

[Abs]
The recent success of zero- and few-shot prompting with models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how zero-shot GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics, e.g. recently proposed QA- or entailment-based factuality approaches, cannot reliably evaluate zero-shot summaries. Finally, we discuss future research challenges beyond generic summarization, specifically, keyword- and aspect-based summarization, showing how dominant fine-tuning approaches compare to zero-shot prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and zero-shot models across 4 standard summarization benchmarks, (b) 1K human preference judgments and rationales comparing different systems for generic- and keyword-based summarization.
Benchmarking Large Language Models for News Summarization Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, Tatsunori B. Hashimoto [pdf]

[Abs]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human written summaries.
Is ChatGPT a General-Purpose Natural Language Processing Task Solver? Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, Diyi Yang [pdf]

[Abs]
Spurred by advancements in scale, large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot -- i.e., without adaptation on downstream data. Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community due to the fact that it can generate high-quality responses to human input and self-correct previous mistakes based on subsequent conversations. However, it is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot. In this work, we empirically analyze the zero-shot learning ability of ChatGPT by evaluating it on 20 popular NLP datasets covering 7 representative task categories. With extensive empirical studies, we demonstrate both the effectiveness and limitations of the current version of ChatGPT. We find that ChatGPT performs well on many tasks favoring reasoning capabilities (e.g., arithmetic reasoning) while it still faces challenges when solving specific tasks such as sequence tagging. We additionally provide in-depth analysis through qualitative case studies.
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung [pdf]

[Abs]
This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 21 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts, via an intermediate code generation step. Moreover, we find that ChatGPT is 64.33% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. ChatGPT suffers from hallucination problems like other LLMs and it generates more extrinsic hallucinations from its parametric memory as it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e., 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion.

Decomposed

Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees Swarnadeep Saha, Shiyue Zhang, Peter Hase, Mohit Bansal ICLR 2023 [pdf] [code]

[Abs]
Current abstractive summarization models either suffer from a lack of clear interpretability or provide incomplete rationales by only highlighting parts of the source document. To this end, we propose the Summarization Program (SP), an interpretable modular framework consisting of an (ordered) list of binary trees, each encoding the step-by-step generative process of an abstractive summary sentence from the source document. A Summarization Program contains one root node per summary sentence, and a distinct tree connects each summary sentence (root node) to the document sentences (leaf nodes) from which it is derived, with the connecting nodes containing intermediate generated sentences. Edges represent different modular operations involved in summarization such as sentence fusion, compression, and paraphrasing. We first propose an efficient best-first search method over neural modules, SP-Search that identifies SPs for human summaries by directly optimizing for ROUGE scores. Next, using these programs as automatic supervision, we propose seq2seq models that generate Summarization Programs, which are then executed to obtain final summaries. We demonstrate that SP-Search effectively represents the generative process behind human summaries using modules that are typically faithful to their intended behavior. We also conduct a simulation study to show that Summarization Programs improve the interpretability of summarization models by allowing humans to better simulate model reasoning. Summarization Programs constitute a promising step toward interpretable and modular abstractive summarization, a complex task previously addressed primarily through blackbox end-to-end neural systems. Supporting code available at this https URL

Benchmark

Benchmarking Large Language Models for News Summarization Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, Tatsunori B. Hashimoto [pdf]

[Abs]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human written summaries.
MuLD: The Multitask Long Document Benchmark G Thomas Hudson, Noura Al Moubayed [pdf] [data]
EXPLAINABOARD: An Explainable Leaderboard for NLP Pengfei Liu, Jinlan Fu, Yang Xiao, Weizhe Yuan, Shuaichen Chang, Junqi Dai, Yixin Liu, Zihuiwen Ye, Graham Neubig [pdf] [ExplainaBoard]
GLGE: A New General Language Generation Evaluation Benchmark Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu, Linjun Shou, Ming Gong, Pengcheng Wang, Jiusheng Chen, Daxin Jiang, Jiancheng Lv, Ruofei Zhang, Winnie Wu, Ming Zhou, Nan Duan [pdf] [benchmark]

Survey

A Survey on Biomedical Text Summarization with Pre-trained Language Model Qianqian Xie, Zheheng Luo, Benyou Wang, Sophia Ananiadou [pdf] [code]

[Abs]
The exponential growth of biomedical texts such as biomedical literature and electronic health records (EHRs), provides a big challenge for clinicians and researchers to access clinical information efficiently. To address the problem, biomedical text summarization has been proposed to support clinical information retrieval and management, aiming at generating concise summaries that distill key information from single or multiple biomedical documents. In recent years, pre-trained language models (PLMs) have been the de facto standard of various natural language processing tasks in the general domain. Most recently, PLMs have been further investigated in the biomedical field and brought new insights into the biomedical text summarization task. In this paper, we systematically summarize recent advances that explore PLMs for biomedical text summarization, to help understand recent progress, challenges, and future directions. We categorize PLMs-based approaches according to how they utilize PLMs and what PLMs they use. We then review available datasets, recent approaches and evaluation metrics of the task. We finally discuss existing challenges and promising future directions. To facilitate the research community, we line up open resources including available datasets, recent approaches, codes, evaluation metrics, and the leaderboard in a public project: this https URL.
A Survey on Medical Document Summarization Raghav Jain, Anubhav Jangra, Sriparna Saha, Adam Jatowt [pdf]

[Abs]
The internet has had a dramatic effect on the healthcare industry, allowing documents to be saved, shared, and managed digitally. This has made it easier to locate and share important data, improving patient care and providing more opportunities for medical studies. As there is so much data accessible to doctors and patients alike, summarizing it has become increasingly necessary - this has been supported through the introduction of deep learning and transformer-based networks, which have boosted the sector significantly in recent years. This paper gives a comprehensive survey of the current techniques and trends in medical summarization.
Taxonomy of Abstractive Dialogue Summarization: Scenarios, Approaches and Future Directions Qi Jia, Siyu Ren, Yizhu Liu, Kenny Q. Zhu [pdf]

[Abs]
Abstractive dialogue summarization is to generate a concise and fluent summary covering the salient information in a dialogue among two or more interlocutors. It has attracted great attention in recent years based on the massive emergence of social communication platforms and an urgent requirement for efficient dialogue information understanding and digestion. Different from news or articles in traditional document summarization, dialogues bring unique characteristics and additional challenges, including different language styles and formats, scattered information, flexible discourse structures and unclear topic boundaries. This survey provides a comprehensive investigation on existing work for abstractive dialogue summarization from scenarios, approaches to evaluations. It categorizes the task into two broad categories according to the type of input dialogues, i.e., open-domain and task-oriented, and presents a taxonomy of existing techniques in three directions, namely, injecting dialogue features, designing auxiliary training tasks and using additional data. A list of datasets under different scenarios and widely-accepted evaluation metrics are summarized for completeness. After that, the trends of scenarios and techniques are summarized, together with deep insights on correlations between extensively exploited features and different scenarios. Based on these analyses, we recommend future directions including more controlled and complicated scenarios, technical innovations and comparisons, publicly available datasets in special domains, etc.
A Survey of Automatic Text Summarization Using Graph Neural Networks Marco Ferdinand Salchner, Adam Jatowt COLING 2022 [pdf]

[Abs]
Although automatic text summarization (ATS) has been researched for several decades, the application of graph neural networks (GNNs) to this task started relatively recently. In this survey we provide an overview on the rapidly evolving approach of using GNNs for the task of automatic text summarization. In particular we provide detailed information on the functionality of GNNs in the context of ATS, and a comprehensive overview of models utilizing this approach.
A Survey on Cross-Lingual Summarization Jiaan Wang, Fandong Meng, Duo Zheng, Yunlong Liang, Zhixu Li, Jianfeng Qu, Jie Zhou TACL 2022 [pdf]

[Abs]
Cross-lingual summarization is the task of generating a summary in one language (e.g., English) for the given document(s) in a different language (e.g., Chinese). Under the globalization background, this task has attracted increasing attention of the computational linguistics community. Nevertheless, there still remains a lack of comprehensive review for this task. Therefore, we present the first systematic critical review on the datasets, approaches, and challenges in this field. Specifically, we carefully organize existing datasets and approaches according to different construction methods and solution paradigms, respectively. For each type of datasets or approaches, we thoroughly introduce and summarize previous efforts and further compare them with each other to provide deeper analyses. In the end, we also discuss promising directions and offer our thoughts to facilitate future research. This survey is for both beginners and experts in cross-lingual summarization, and we hope it will serve as a starting point as well as a source of new ideas for researchers and engineers interested in this area.
An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics Huan Yee Koh, Jiaxin Ju, Ming Liu, Shirui Pan ACM Computing Surveys [pdf]

[Abs]
Long documents such as academic articles and business reports have been the standard format to detail out important issues and complicated subjects that require extra attention. An automatic summarization system that can effectively condense long documents into short and concise texts to encapsulate the most important information would thus be significant in aiding the reader's comprehension. Recently, with the advent of neural architectures, significant research efforts have been made to advance automatic text summarization systems, and numerous studies on the challenges of extending these systems to the long document domain have emerged. In this survey, we provide a comprehensive overview of the research on long document summarization and a systematic evaluation across the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature within the context of long document summarization and conduct an empirical analysis to broaden the perspective on current research progress. The empirical analysis includes a study on the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of the summarization evaluation metrics. Based on the overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.
Multi-document Summarization via Deep Learning Techniques: A Survey Congbo Ma, Wei Emma Zhang, Mingyu Guo, Hu Wang, Quan Z. Sheng [pdf]
Embedding Knowledge for Document Summarization: A Survey Yutong Qu, Wei Emma Zhang, Jian Yang, Lingfei Wu, Jia Wu, Xindong Wu [pdf]
A Survey on Dialogue Summarization: Recent Advances and New Frontiers Xiachong Feng, Xiaocheng Feng, Bing Qin IJCAI 2022, Survey Track [pdf]
Automatic Text Summarization Methods: A Comprehensive Review Divakar Yadav, Jalpa Desai, Arun Kumar Yadav [pdf]
Faithfulness in Natural Language Generation: A Systematic Survey of Analysis, Evaluation and Optimization Methods Wei Li, Wenhao Wu, Moye Chen, Jiachen Liu, Xinyan Xiao, Hua Wu [pdf]
Recent Advances in Neural Text Generation: A Task-Agnostic Survey Chen Tang, Frank Guerin, Yucheng Li, Chenghua Lin [pdf]
Survey of Hallucination in Natural Language Generation Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, Pascale Fung [pdf]
A Survey on Retrieval-Augmented Text Generation Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu [pdf]
A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models Hanqing Zhang, Haolin Song, Shaoyu Li, Ming Zhou, Dawei Song [pdf]
A Survey of Pretrained Language Models Based Text Generation Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen [pdf]
A Comprehensive Review on Summarizing Financial News Using Deep Learning Saurabh Kamal, Sahil Sharma [pdf]
A Survey on Multi-modal Summarization Anubhav Jangra, Adam Jatowt, Sriparna Saha, Mohammad Hasanuzzaman [pdf]
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig [pdf]
Pretrained Language Models for Text Generation: A Survey Junyi Li, Tianyi Tang, Wayne Xin Zhao, Ji-Rong Wen IJCAI21 [pdf]
A Survey of Recent Abstract Summarization Techniques Diyah Puspitaningrum ICICT21 [pdf]
A Survey of the State-of-the-Art Models in Neural Abstractive Text Summarization Ayesha Ayub Syed, Ford Lumban Gaol, Tokuro Matsuo [pdf]
Automatic summarization of scientific articles: A survey Nouf Ibrahim Altmami, Mohamed El Bachir Menai Journal of King Saud University - Computer and Information Sciences [pdf]
Deep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation Measures, and Challenges Dima Suleiman, Arafat A. Awajan [pdf]
A Survey of Knowledge-Enhanced Text Generation Wenhao Yu, Chenguang Zhu, Zaitang Li, Zhiting Hu, Qingyun Wang, Heng Ji, Meng Jiang [pdf]
From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information Shen Gao, Xiuying Chen, Zhaochun Ren, Dongyan Zhao, Rui Yan IJCAI20 [pdf]
Neural Abstractive Text Summarization with Sequence-to-Sequence Models Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, Chandan K. Reddy [pdf]
A Survey on Neural Network-Based Summarization Methods Yue Dong [pdf]
Automated text summarisation and evidence-based medicine: A survey of two domains Abeed Sarker, Diego Molla, Cecile Paris [pdf]
Automatic Keyword Extraction for Text Summarization: A Survey Santosh Kumar Bharti, Korra Sathya Babu [pdf]
Text Summarization Techniques: A Brief Survey Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, Krys Kochut [pdf]
Recent automatic text summarization techniques: a survey Mahak Gambhir, Vishal Gupta [pdf]

Toolkit

Summary Workbench: Unifying Application and Evaluation of Text Summarization Models Shahbaz Syed, Dominik Schwabe, Martin Potthast EMNLP 2022 Demo [pdf] [demo]

[Abs]
This paper presents Summary Workbench, a new tool for developing and evaluating text summarization models. New models and evaluation measures can be easily integrated as Docker-based plugins, allowing to examine the quality of their summaries against any input and to evaluate them using various evaluation measures. Visual analyses combining multiple measures provide insights into the models' strengths and weaknesses. The tool is hosted at this https URL and also supports local deployment for private resources.
iFacetSum: Coreference-based Interactive Faceted Summarization for Multi-Document Exploration Eran Hirsch, Alon Eirew, Ori Shapira, Avi Caciularu, Arie Cattan, Ori Ernst, Ramakanth Pasunuru, Hadar Ronen, Mohit Bansal, Ido Dagan EMNLP 2021 [pdf] [demo]
SummerTime: Text Summarization Toolkit for Non-experts Ansong Ni, Zhangir Azerbayev, Mutethia Mutuma, Troy Feng, Yusen Zhang, Tao Yu, Ahmed Hassan Awadallah, Dragomir Radev EMNLP 2021 Demo Track [pdf] [Demo]
Summary Explorer: Visualizing the State of the Art in Text Summarization Shahbaz Syed, Tariq Yousef, Khalid Al-Khatib, Stefan Jänicke, Martin Potthast [pdf] [web]
fastnlp/fastSum [code]
Graph4NLP [code] [summarization]
CTRLsum: Towards Generic Controllable Text Summarization EMNLP 2022 [pdf] [code]

[Abs]
Current summarization systems yield generic summaries that are disconnected from users’ preferences and expectations. To address this limitation, we present CTRLsum, a generic framework to control generated summaries through a set of keywords. During training keywords are extracted automatically without requiring additional human annotations. At test time CTRLsum features a control function to map control signal to keywords; through engineering the control function, the same trained model is able to be applied to control summaries on various dimensions, while neither affecting the model training process nor the pretrained models. We additionally explore the combination of keywords and text prompts for more control tasks. Experiments demonstrate the effectiveness of CTRLsum on three domains of summarization datasets and five control tasks: (1) entity-centric and (2) length-controllable summarization, (3) contribution summarization on scientific papers, (4) invention purpose summarization on patent filings, and (5) question-guided summarization on news articles. Moreover, when used in a standard, unconstrained summarization setting, CTRLsum is comparable or better than strong pretrained systems.
OpenNMT-py: Open-Source Neural Machine Translation [pdf] [code]
Fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python. [code]
LeafNATS: An Open-Source Toolkit and Live Demo System for Neural Abstractive Text Summarization Tian Shi, Ping Wang, Chandan K. Reddy NAACL19 [pdf] [code]
TransformerSum [code]

Analysis

Human Guided Exploitation of Interpretable Attention Patterns in Summarization and Topic Segmentation Raymond Li, Wen Xiao, Linzi Xing, Lanjun Wang, Gabriel Murray, Giuseppe Carenini EMNLP 2022 [pdf] [code]

[Abs]
The multi-head self-attention mechanism of the transformer model has been thoroughly investigated recently. In one vein of study, researchers are interested in understanding why and how transformers work. In another vein, researchers propose new attention augmentation methods to make transformers more accurate, efficient and interpretable. In this paper, we combine these two lines of research in a human-in-the-loop pipeline to first discover important task-specific attention patterns. Then those patterns are injected, not only to smaller models, but also to the original model. The benefits of our pipeline and discovered patterns are demonstrated in two case studies with extractive summarization and topic segmentation. After discovering interpretable patterns in BERT-based models fine-tuned for the two downstream tasks, experiments indicate that when we inject the patterns into attention heads, the models show considerable improvements in accuracy and efficiency.
Analyzing Multi-Task Learning for Abstractive Text Summarization Frederic Kirstein, Jan Philip Wahle, Terry Ruas, Bela Gipp [pdf]

[Abs]
Despite the recent success of multi-task learning and pre-finetuning for natural language understanding, few works have studied the effects of task families on abstractive text summarization. Task families are a form of task grouping during the pre-finetuning stage to learn common skills, such as reading comprehension. To close this gap, we analyze the influence of multi-task learning strategies using task families for the English abstractive text summarization task. We group tasks into one of three strategies, i.e., sequential, simultaneous, and continual multi-task learning, and evaluate trained models through two downstream tasks. We find that certain combinations of task families (e.g., advanced reading comprehension and natural language inference) positively impact downstream performance. Further, we find that choice and combinations of task families influence downstream performance more than the training scheme, supporting the use of task families for abstractive text summarization.
On Decoding Strategies for Neural Text Generators Gian Wiher, Clara Meister, Ryan Cotterell [pdf]
Training Dynamics for Text Summarization Models Tanya Goyal, Jiacheng Xu, Junyi Jessy Li, Greg Durrett [https://arxiv.org/abs/2110.08370]
Does Summary Evaluation Survive Translation to Other Languages? Neslihan Iskender, Oleg Vasilyev, Tim Polzehl, John Bohannon, Sebastian Möller [pdf]
How well do you know your summarization datasets? Priyam Tejaswin, Dhruv Naik, Pengfei Liu Findings of ACL 2021 [pdf] [code]
Dissecting Generation Modes for Abstractive Summarization Models via Ablation and Attribution Jiacheng Xu, Greg Durrett ACL2021 [pdf] [code]
To Point or Not to Point: Understanding How Abstractive Summarizers Paraphrase Text Matt Wilber, William Timkey, Marten Van Schijndel Findings of ACL 2021 [pdf] [code]
What Makes a Good Summary? Reconsidering the Focus of Automatic Summarization Maartje ter Hoeve, Julia Kiseleva, Maarten de Rijke [pdf]
Intrinsic Evaluation of Summarization Datasets Rishi Bommasani, Claire Cardie EMNLP20 [pdf]
Metrics also Disagree in the Low Scoring Range: Revisiting Summarization Evaluation Metrics Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu COLING20 Short [pdf] [code]
At Which Level Should We Extract? An Empirical Analysis on Extractive Document Summarization Qingyu Zhou, Furu Wei, Ming Zhou COLING20 [pdf]
Corpora Evaluation and System Bias detection in Multi Document Summarization Alvin Dey, Tanya Chowdhury, Yash Kumar, Tanmoy Chakraborty Findings of EMNLP [pdf]
Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries Daniel Deutsch, Dan Roth [pdf] [code]
Understanding Neural Abstractive Summarization Models via Uncertainty Jiacheng Xu, Shrey Desai, Greg Durrett EMNLP20 Short [pdf] [code]
Re-evaluating Evaluation in Text Summarization Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu, Graham Neubig EMNLP20 [pdf] [code]
CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems Yiran Chen, Pengfei Liu, Ming Zhong, Zi-Yi Dou, Danqing Wang, Xipeng Qiu, Xuanjing Huang EMNLP20 [pdf] [code]
What Have We Achieved on Text Summarization? Dandan Huang, Leyang Cui, Sen Yang, Guangsheng Bao, Kun Wang, Jun Xie, Yue Zhang EMNLP20 [pdf]
Conditional Neural Generation using Sub-Aspect Functions for Extractive News Summarization Zhengyuan Liu, Ke Shi, Nancy F. Chen Findings of EMNLP20 [pdf]
Extractive Summarization as Text Matching Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, Xuanjing Huang ACL20 [pdf] [code]
Neural Text Summarization: A Critical Evaluation Wojciech Kryściński, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, Richard Socher EMNLP19 [pdf]
Earlier Isn’t Always Better: Sub-aspect Analysis on Corpus and System Biases in Summarization Taehee Jung, Dongyeop Kang, Lucas Mentch, Eduard Hovy EMNLP19 [pdf] [code]
A Closer Look at Data Bias in Neural Extractive Summarization Models Ming Zhong, Danqing Wang, Pengfei Liu, Xipeng Qiu, Xuanjing Huang EMNLP19 Workshop [pdf]
Countering the Effects of Lead Bias in News Summarization via Multi-Stage Training and Auxiliary Losses Matt Grenander, Yue Dong, Jackie Chi Kit Cheung, Annie Louis EMNLP19 Short [pdf]
Searching for Effective Neural Extractive Summarization: What Works and What's Next Ming Zhong, Pengfei Liu, Danqing Wang, Xipeng Qiu, Xuanjing Huang ACL19 [pdf] [code]
Content Selection in Deep Learning Models of Summarization Chris Kedzie, Kathleen McKeown, Hal Daumé III EMNLP18 [pdf] [code]

Thesis

Principled Approaches to Automatic Text Summarization Maxime Peyrard [pdf]
Neural Text Summarization and Generation Piji Li [pdf]

Theory

Bayesian Active Summarization Alexios Gidiotis, Grigorios Tsoumakas [pdf]
RefSum: Refactoring Neural Summarization Yixin Liu, Zi-Yi Dou, Pengfei Liu NAACL21 [pdf] [code]
Principled Approaches to Automatic Text Summarization Maxime Peyrard [pdf]
KLearn: Background Knowledge Inference from Summarization Data Maxime Peyrard, Robert West Findings of EMNLP20 [pdf] [code]
A Simple Theoretical Model of Importance for Summarization Maxime Peyrard ACL19 [pdf]
BottleSum: Unsupervised and Self-supervised Sentence Summarization using the Information Bottleneck Principle Peter West, Ari Holtzman, Jan Buys, Yejin Choi EMNLP19 [pdf] [code]

Dataset

ID	Name	Description	Paper	Conference
1	CNN-DailyMail	News	Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond	SIGNLL16
2	New York Times	News	The New York Times Annotated Corpus
3	DUC	News	The Effects Of Human Variation In DUC Summarization Evaluation
4	Gigaword	News	A Neural Attention Model For Abstractive Sentence Summarization	EMNLP15
5	Newsroom	News	Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies	NAACL18
6	XSum	News	Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization	EMNLP18
7	Multi-News	Multi-document News	Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model	ACL19
8	SAMSum	Multi-party conversation	SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization	EMNLP19
9	AMI	Meeting	The AMI Meeting Corpus: A pre-announcement.
10	ICSI	Meeting	The ICSI Meeting Corpus
11	MSMO	Multi-modal	MSMO: Multimodal Summarization with Multimodal Output	EMNLP18
12	How2	Multi-modal	How2: A Large-scale Dataset for Multimodal Language Understanding	NIPS18
13	ScisummNet	Scientific paper	ScisummNet: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks	AAAI19
14	PubMed, ArXiv	Scientific paper	A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents	NAACL18
15	TALKSUMM	Scientific paper	TALKSUMM: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks	ACL19
16	BillSum	Legal	BillSum: A Corpus for Automatic Summarization of US Legislation	EMNLP19
17	LCSTS	Chinese Weibo	LCSTS: A Large Scale Chinese Short Text Summarization Dataset	EMNLP15
18	WikiHow	Online Knowledge Base	WikiHow: A Large Scale Text Summarization Dataset
19	Concept-map-based MDS Corpus	Educational Multi-document	Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps	EMNLP17
20	WikiSum	Wikipedia Multi-document	Generating Wikipedia by Summarizing Long Sequences	ICLR18
21	GameWikiSum	Game Multi-document	GameWikiSum: a Novel Large Multi-Document Summarization Dataset	LREC20
22	En2Zh CLS, Zh2En CLS	Cross-Lingual	NCLS: Neural Cross-Lingual Summarization	EMNLP19
23	Timeline Summarization Dataset	Baidu timeline	Learning towards Abstractive Timeline Summarization	IJCAI19
24	Reddit TIFU	online discussion	Abstractive Summarization of Reddit Posts with Multi-level Memory Networks	NAACL19
25	TripAtt	Review	Attribute-aware Sequence Network for Review Summarization	EMNLP19
26	Reader Comments Summarization Corpus	Comments-based Weibo	Abstractive Text Summarization by Incorporating Reader Comments	AAAI19
27	BIGPATENT	Patent	BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization	ACL19
28	Curation Corpus	News	Curation Corpus for Abstractive Text Summarisation
29	MATINF	Multi-task	MATINF: A Jointly Labeled Large-Scale Dataset for Classification, Question Answering and Summarization	ACL20
30	MLSUM	Multi-Lingual Summarization Dataset	MLSUM: The Multilingual Summarization Corpus	EMNLP20
31	Dialogue(Debate)	Argumentative Dialogue Summary Corpus	Using Summarization to Discover Argument Facets in Online Idealogical Dialog	NAACL15
32	WCEP	News Multi-document	A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal	ACL20 Short
33	ArgKP	Argument-to-key Point Mapping	From Arguments to Key Points: Towards Automatic Argument Summarization	ACL20
34	CRD3	Dialogue	Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset	ACL20
35	Gazeta	Russian news	Dataset for Automatic Summarization of Russian News
36	MIND	English news recommendation, Summarization, Classification, Entity	MIND: A Large-scale Dataset for News Recommendation	ACL20
37	public_meetings	French meeting (test set)	Align then Summarize: Automatic Alignment Methods for Summarization Corpus Creation	LREC
38	Enron	Email	Building a Dataset for Summarization and Keyword Extraction from Emails	2014
39	Columbia	Email	Summarizing Email Threads	2004
40	BC3	Email	A publicly available annotated corpus for supervised email summarization
41	WikiLingua	Cross-Lingual	WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization	Findings of EMNLP20
42	LcsPIRT	Chinese Dialogue	Global Encoding for Long Chinese Text Summarization	TALLIP
43	CLTS, CLTS-plus	Chinese News	CLTS: A New Chinese Long Text Summarization Dataset; CLTS+: A New Chinese Long Text Summarization Dataset with Abstractive Summaries	NLPCC20
44	VMSMO	Multi-modal	VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles	EMNLP20
45	Multi-XScience	Multi-document	Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles	EMNLP20 short
46	SCITLDR	Scientific Document	TLDR: Extreme Summarization of Scientific Documents	Findings of EMNLP20
47	scisumm-corpus	Scientific Document
48	QBSUM	Query-Based Chinese	QBSUM: a Large-Scale Query-Based Document Summarization Dataset from Real-world Applications	Computer Speech & Language
49	qMDS	Query-Based Multi-Document	AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization
50	Liputan6	Indonesian	Liputan6: A Large-scale Indonesian Dataset for Text Summarization	AACL20
51	SportsSum	Sports Game	Generating Sports News from Live Commentary: A Chinese Dataset for Sports Game Summarization	AACL20
52	WikiAsp	Aspect-based	WikiAsp: A Dataset for Multi-domain Aspect-based Summarization	Transactions of the ACL
53	DebateSum	argument	DebateSum: A large-scale argument mining and summarization dataset	ARGMIN 2020
54	Open4Business	Business	Open4Business (O4B): An Open Access Dataset for Summarizing Business Documents	Workshop on Dataset Curation and Security-NeurIPS 2020
55	OrangeSum	French	BARThez: a Skilled Pretrained French Sequence-to-Sequence Model
56	Medical Conversation	medical conversation	Summarizing Medical Conversations via Identifying Important Utterances	COLING20
57	SumTitles	movie dialogue	SumTitles: a Summarization Dataset with Low Extractiveness	COLING20
58	BANS	Bengali news	Bengali Abstractive News Summarization (BANS): A Neural Attention Approach	TCCE-2020
59	e-commerce	E-commerce	On the Faithfulness for E-commerce Product Summarization	COLING20
60	TWEETSUM	Twitter	TWEETSUM: Event-oriented Social Summarization Dataset	COLING20
61	SPACE	Opinion	Extractive Opinion Summarization in Quantized Transformer Spaces	TACL
62	pn-summary	Persian	Leveraging ParsBERT and Pretrained mT5 for Persian Abstractive Text Summarization	CSICC 2021
63	E-commerce1desensitized	Dialogue	Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling	AAAI21
64	E-commerce2desensitized	Dialogue	Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders	AAAI21
65	BengaliSummarization	Bengali	Unsupervised Abstractive Summarization of Bengali Text Documents	EACL21
66	MediaSum	Dialogue	MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization	NAACL21
67	Healthline and BreastCancer	multi-document	Nutri-bullets: Summarizing Health Studies by Composing Segments	AAAI21
68	GOVREPORT	Long Government reports	Efficient Attentions for Long Document Summarization	NAACL21
69	SSN	Scientific Paper	Enhancing Scientific Papers Summarization with Citation Graph	AAAI21
70	MTSamples	Medical	Towards objectively evaluating the quality of generated medical summaries
71	QMSum	Meeting, Query	QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization	NAACL21
72	MS2	Medical, Multi-Document	MS2: Multi-Document Summarization of Medical Studies
73	SummScreen	Television Series	SummScreen: A Dataset for Abstractive Screenplay Summarization	ACL 2022
74	SciDuet	Scientific Papers and Slides	D2S: Document-to-Slide Generation Via Query-Based Text Summarization	NAACL21
75	MultiHumES	Multilingual	MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization	EACL21
76	DialogSum	Dialogue	DialogSum: A Real-Life Scenario Dialogue Summarization Dataset	Findings of ACL21
77	BookSum	Book, Long-form	BookSum: A Collection of Datasets for Long-form Narrative Summarization
78	CLES	Chinese Weibo	A Large-Scale Chinese Long-Text Extractive Summarization Corpus	ICASSP
79	FacetSum	Scientific Paper	Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents	ACL2021 short
80	ConvoSumm	Dialogue	ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining	ACL2021
81	AgreeSum	Multi-document with entailment annotations	AgreeSum: Agreement-Oriented Multi-Document Summarization	Findings of ACL2021
82	En2De	Cross-Lingual En2De	Cross-Lingual Abstractive Summarization with Limited Parallel Resources	ACL 2021
83	VT-SSum	Spoken	VT-SSum: A Benchmark Dataset for Video Transcript Segmentation and Summarization
84	AESLC	Email	This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation	ACL 2019
85	XL-Sum	Cross-lingual	XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages	Findings of ACL2021
86	TES 2012-2016	Tweet	TSSuBERT: Tweet Stream Summarization Using BERT
87	PENS	Personalized Headline	PENS: A Dataset and Generic Framework for Personalized News Headline Generation	ACL 2021
88	XSum Hallucination Annotations	Factuality	On Faithfulness and Factuality in Abstractive Summarization	ACL 2020
89	factuality-datasets	Factuality	Annotating and Modeling Fine-grained Factuality in Summarization	NAACL 2021
90	frank	Factuality	Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics	NAACL 2021
91	TRIPOD	Movie	Movie Summarization via Sparse Graph Construction	AAAI 2021
92	AdaptSum	Low-Resource	AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization	NAACL 2021
93	PTS	Product	Multi-Source Pointer Network for Product Title Summarization	CIKM 2018
94	RAMDS	Reader-Aware	Reader-Aware Multi-Document Summarization: An Enhanced Model and The First Dataset	EMNLP 2017 Workshop
95	court judgment	court judgment	How to Write Summaries with Patterns? Learning towards Abstractive Summarization through Prototype Editing	EMNLP 2019
96	ADEGBTS	gaze behaviors	A Dataset for Exploring Gaze Behaviors in Text Summarization	ACM MMSys'20
97	MeQSum	Medical	On the Summarization of Consumer Health Questions	ACL 2019
98	OpoSum	Opinion	Summarizing Opinions: Aspect Extraction Meets Sentiment Prediction and They Are Both Weakly Supervised	EMNLP 2018
99	MM-AVS	Multi-modal	Multi-modal Summarization for Video-containing Documents	NAACL 2021
100	WikiCatSum	multi-doc	Generating Summaries with Topic Templates and Structured Convolutional Decoders	ACL 2019
101	SDF-TLS	Timeline	Summarize Dates First: A Paradigm Shift in Timeline Summarization	SIGIR 2021
102	RWS-Cit		Automatic generation of related work through summarizing citations	2017
103	MTLS	Timeline	Multi-TimeLine Summarization (MTLS): Improving Timeline Summarization by Generating Multiple Summaries	ACL 2021
104	EMAILSUM	Email	EmailSum: Abstractive Email Thread Summarization	ACL 2021
105	WikiSum	WikiHow	WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation	ACL 2021 Short
106	SumPubMed	PubMed Scientific Article	SumPubMed: Summarization Dataset of PubMed Scientific Articles	ACL 2021 Student Research Workshop
107	MLGSum	Multi-lingual	Contrastive Aligned Joint Learning for Multilingual Summarization	ACL 2021 Findings
108	SMARTPHONE,COMPUTER	Product	CUSTOM: Aspect-Oriented Product Summarization for E-Commerce
109	CSDS	Customer Service Dialogue	CSDS: A Fine-grained Chinese Dataset for Customer Service Dialogue Summarization	EMNLP 2021
110	persian-dataset	Persian	ARMAN: Pre-training with Semantically Selecting and Reordering of Sentences for Persian Abstractive Summarization
111	StreamHover	spoken livestream	StreamHover: Livestream Transcript Summarization and Annotation	EMNLP 2021
112	CNewSum	News	CNewSum: A Large-scale Chinese News Summarization Dataset with Human-annotated Adequacy and Deducibility Level	NLPCC 2021
113	MiRANews	news, factual	MiRANews: Dataset and Benchmarks for Multi-Resource-Assisted News Summarization	EMNLP 2021 Findings
114	HowSumm	query multi-doc	HowSumm: A Multi-Document Summarization Dataset Derived from WikiHow Articles
115	SportsSum2.0	Sports	SportsSum2.0: Generating High-Quality Sports News from Live Text Commentary
116	CoCoSum	opinion multi-ref	Comparative Opinion Summarization via Collaborative Decoding
117	MReD	Controllable	MReD: A Meta-Review Dataset for Controllable Text Generation
118	MS^2	Multi-Document, Medical	MS^2: Multi-Document Summarization of Medical Studies	EMNLP 2021
119	MassiveSumm		MassiveSumm: a very large-scale, very multilingual, news summarisation dataset	EMNLP 2021
120	XWikis	multilingual	Models and Datasets for Cross-Lingual Summarisation	EMNLP 2021
121	SUBSUME	Intent, subjective	SUBSUME: A Dataset for Subjective Summary Extraction from Wikipedia Documents	EMNLP 2021 newsum
122	TLDR9+		TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts	EMNLP 2021 newsum
123	20 Minuten	German	A New Dataset and Efficient Baselines for Document-level Text Simplification in German	EMNLP 2021 newsum
124	WSD	multi-lingual	A Novel Wikipedia based Dataset for Monolingual and Cross-Lingual Summarization	EMNLP 2021 newsum
125	TEDSummary	Speech	Attention-based Multi-hypothesis Fusion for Speech Summarization
126	SummaC Benchmark	Factual, NLI	SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization
127	ForumSum	Conversation	ForumSum: A Multi-Speaker Conversation Summarization Dataset	EMNLP 2021 Findings
128	K-SportsSum	Sports	Knowledge Enhanced Sports Game Summarization	WSDM 2022
129	Test-Amazon	Opinion, New test for Amazon reviews	Unsupervised Opinion Summarization as Copycat-Review Generation	ACL 2020
130	Test-Amazon-Yelp	Opinion, New test for Amazon(180) and Yelp(300)	Few-Shot Learning for Opinion Summarization	EMNLP 2020
131	AmaSum	Opinion	Learning Opinion Summarizers by Selecting Informative Reviews	EMNLP 2021
132	CrossSum	Cross lingual	CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs
133	HCSCL-MSDataset	Multi-modal	Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization	AAAI 2022
134	Klexikon	German	Klexikon: A German Dataset for Joint Summarization and Simplification
135	TODSum	Customer Service	TODSum: Task-Oriented Dialogue Summarization with State Tracking
136	TWEETSUMM	Customer Service	TWEETSUMM - A Dialog Summarization Dataset for Customer Service	Findings of EMNLP 2021
137	PeerSum	Multi-document, Scientific	PeerSum: A Peer Review Dataset for Abstractive Multi-document Summarization
138	Celebrity TS, Event TS, Wiki TS	Timeline, person, event	Follow the Timeline! Generating Abstractive and Extractive Timeline Summary in Chronological Order	TOIS 2022
139	Chart-to-Text	chart	Chart-to-Text: A Large-Scale Benchmark for Chart Summarization
140	GovReport-QS	Long Document	HIBRIDS: Attention with Hierarchical Biases for Structure-aware Long Document Summarization	ACL 2022
141	EntSUM	Entity	EntSUM: A Data Set for Entity-Centric Summarization	ACL 2022
142	ALLSIDES	Framing Bias	NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias	ACL 2022
143	GRAPHELSUMS	graph	Summarization with Graphical Elements
144	Annotated-Wikilarge-Newsela	Factuality	Evaluating Factuality in Text Simplification	ACL 2022
145	WikiMulti	Cross-lingual	WikiMulti: a Corpus for Cross-Lingual Summarization
146	Welsh		Introducing the Welsh Text Summarisation Dataset and Baseline Systems
147	SuMe	Biomedical	SuMe: A Dataset Towards Summarizing Biomedical Mechanisms	LREC 2022
148	CiteSum		CiteSum: Citation Text-guided Scientific Extreme Summarization and Low-resource Domain Adaptation
148	MSAMSum	Dialogue	MSAMSum: Towards Benchmarking Multi-lingual Dialogue Summarization	ACL 2022 DialDoc
149	SQuALITY	Long-Document	SQuALITY: Building a Long-Document Summarization Dataset the Hard Way	EMNLP 2022
150	X-SCITLDR		X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents	JCDL 2022
151	NEWTS	News	NEWTS: A Corpus for News Topic-Focused Summarization
152	EntSUM	Entity	EntSUM: A Data Set for Entity-Centric Extractive Summarization	ACL 2022
153	ASPECTNEWS		ASPECTNEWS: Aspect-Oriented Summarization of News Documents	ACL 2022
154	RNSum	Commit Logs	RNSum: A Large-Scale Dataset for Automatic Release Note Generation via Commit Logs Summarization	ACL 2022
155	AnswerSumm	query multi-doc	AnswerSumm: A Manually-Curated Dataset and Pipeline for Answer Summarization	NAACL 2022
156	CHQ-Summ		CHQ-Summ: A Dataset for Consumer Healthcare Question Summarization
157	Multi-LexSum	multi-doc	Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities
158	DACSA	Catalan and Spanish	DACSA: A large-scale Dataset for Automatic summarization of Catalan and Spanish newspaper Articles	NAACL 2022
159	BigSurvey	Academic Multi-doc	Generating a Structured Summary of Numerous Academic Papers: Dataset and Method	IJCAI 2022
160	CSL	Chinese, Academic	CSL: A Large-scale Chinese Scientific Literature Dataset	COLING 2022
161	PCC Summaries	German	Extractive Summarisation for German-language Data: A Text-level Approach with Discourse Features	COLING 2022
162	LipKey	abstractive summaries, absent keyphrases, and titles	LipKey: A Large-Scale News Dataset for Absent Keyphrases Generation and Abstractive Summarization	COLING 2022
163	PLOS	Lay summary of biomedical journal articles	Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature	EMNLP 2022
164	eLife	Lay summary of biomedical journal articles	Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature	EMNLP 2022
165	ECTSum	Long Earnings Call Transcripts	ECTSum: A New Benchmark Dataset For Bullet Point Summarization of Long Earnings Call Transcripts	EMNLP 2022
166	EUR-Lex-Sum	Multi- and Cross-lingual Legal	EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain	EMNLP 2022
167	CrisisLTLSum	Timeline	CrisisLTLSum: A Benchmark for Local Crisis Event Timeline Extraction and Summarization
168	LANS(`upon request`)	Arabic	LANS: Large-scale Arabic News Summarization Corpus
169	MACSUM	Controllable News Dialogue	MACSUM: Controllable Summarization with Mixed Attributes
170	NarraSum	Narrative	NarraSum: A Large-Scale Dataset for Abstractive Narrative Summarization	EMNLP Findings 2022
171	LoRaLay	Long Scientific Visual	LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization	EACL 2023
172	HunSum-1	Hungarian	HunSum-1: an Abstractive Summarization Dataset for Hungarian
173	MCLS	Multimodal Cross-Lingual	Assist Non-native Viewers: Multimodal Cross-Lingual Summarization for How2 Videos	EMNLP 2022
174	JDDC 2.1	multimodal	JDDC 2.1: A Multimodal Chinese Dialogue Dataset with Joint Tasks of Query Rewriting, Response Generation, Discourse Parsing, and Summarization	EMNLP 2022
175	CroCoSum	Code-switched Cross-lingual	CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization
176	unarXive	scholarly	unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata	Scientometrics 2020
177	TempoSum		TempoSum: Evaluating the Temporal Generalization of Abstractive Summarization
178	VCSUM	meeting	VCSUM: A Versatile Chinese Meeting Summarization Dataset	ACL Findings 2023
179	MeetingBank	meeting	MeetingBank: A Benchmark Dataset for Meeting Summarization	ACL 2023

Dialogue

Dataset

MeetingBank: A Benchmark Dataset for Meeting Summarization Yebowen Hu, Timothy Ganter, Hanieh Deilamsalehy, Franck Dernoncourt, Hassan Foroosh, Fei Liu ACL 2023 [pdf] [data]

[Abs]
As the number of recorded meetings increases, it becomes increasingly important to utilize summarization technology to create useful summaries of these recordings. However, there is a crucial lack of annotated meeting corpora for developing this technology, as it can be hard to collect meetings, especially when the topics discussed are confidential. Furthermore, meeting summaries written by experienced writers are scarce, making it hard for abstractive summarizers to produce sensible output without a reliable reference. This lack of annotated corpora has hindered the development of meeting summarization technology. In this paper, we present MeetingBank, a new benchmark dataset of city council meetings over the past decade. MeetingBank is unique among other meeting corpora due to its divide-and-conquer approach, which involves dividing professionally written meeting minutes into shorter passages and aligning them with specific segments of the meeting. This breaks down the process of summarizing a lengthy meeting into smaller, more manageable tasks. The dataset provides a new testbed of various meeting summarization systems and also allows the public to gain insight into how council decisions are made. We make the collection, including meeting video links, transcripts, reference summaries, agenda, and other metadata, publicly available to facilitate the development of better meeting summarization techniques.
ECTSum: A New Benchmark Dataset For Bullet Point Summarization of Long Earnings Call Transcripts Rajdeep Mukherjee, Abhinav Bohra, Akash Banerjee, Soumya Sharma, Manjunath Hegde, Afreen Shaikh, Shivani Shrivastava, Koustuv Dasgupta, Niloy Ganguly, Saptarshi Ghosh, Pawan Goyal EMNLP 2022 [pdf] [data]

[Abs]
Despite tremendous progress in automatic summarization, state-of-the-art methods are predominantly trained to excel in summarizing short newswire articles, or documents with strong layout biases such as scientific articles or government reports. Efficient techniques to summarize financial documents, including facts and figures, have largely been unexplored, majorly due to the unavailability of suitable datasets. In this work, we present ECTSum, a new dataset with transcripts of earnings calls (ECTs), hosted by publicly traded companies, as documents, and short experts-written telegram-style bullet point summaries derived from corresponding Reuters articles. ECTs are long unstructured documents without any prescribed length limit or format. We benchmark our dataset with state-of-the-art summarizers across various metrics evaluating the content quality and factual consistency of the generated summaries. Finally, we present a simple-yet-effective approach, ECT-BPS, to generate a set of bullet points that precisely capture the important facts discussed in the calls.
TODSum: Task-Oriented Dialogue Summarization with State Tracking Lulu Zhao, Fujia Zheng, Keqing He, Weihao Zeng, Yuejie Lei, Huixing Jiang, Wei Wu, Weiran Xu, Jun Guo, Fanyu Meng [pdf]
TWEETSUMM - A Dialog Summarization Dataset for Customer Service Guy Feigenblat, Chulaka Gunasekara, Benjamin Sznajder, Sachindra Joshi, David Konopnicki, Ranit Aharonov Findings of EMNLP 2021 [pdf] [data]
ForumSum: A Multi-Speaker Conversation Summarization Dataset Misha Khalman, Yao Zhao, Mohammad Saleh EMNLP 2021 Findings [pdf] [data]
CSDS: A Fine-grained Chinese Dataset for Customer Service Dialogue Summarization Haitao Lin, Liqun Ma, Junnan Zhu, Lu Xiang, Yu Zhou, Jiajun Zhang, Chengqing Zong EMNLP 2021 [pdf] [data]
EmailSum: Abstractive Email Thread Summarization Shiyue Zhang, Asli Celikyilmaz, Jianfeng Gao, Mohit Bansal ACL 2021 [pdf] [data]
DialogSum: A Real-Life Scenario Dialogue Summarization Dataset Yulong Chen, Yang Liu, Liang Chen, Yue Zhang Findings of ACL21 [pdf] [data]
ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining Alexander R. Fabbri, Faiaz Rahman, Imad Rizvi, Borui Wang, Haoran Li, Yashar Mehdad, Dragomir Radev ACL 2021 [pdf] [code]
MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization Chenguang Zhu, Yang Liu, Jie Mei, Michael Zeng NAACL21 [pdf] [code]
QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, Dragomir Radev NAACL21 [pdf] [data]
Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset Revanth Rameshkumar, Peter Bailey ACL20 [pdf] [data]
SumTitles: a Summarization Dataset with Low Extractiveness Valentin Malykh, Konstantin Chernis, Ekaterina Artemova, Irina Piontkovskaya COLING20 [pdf] [code]
Summarizing Medical Conversations via Identifying Important Utterances Yan Song, Yuanhe Tian, Nan Wang, Fei Xia COLING20 [pdf] [code]
GupShup: Summarizing Open-Domain Code-Switched Conversations Laiba Mehnaz, Debanjan Mahata, Rakesh Gosangi, Uma Sushmitha Gunturi, Riya Jain, Gauri Gupta, Amardeep Kumar, Isabelle G. Lee, Anish Acharya, Rajiv Ratn Shah EMNLP 2021 [pdf] [code]
SummScreen: A Dataset for Abstractive Screenplay Summarization Mingda Chen, Zewei Chu, Sam Wiseman, Kevin Gimpel ACL 2022 [pdf] [data]

[Abs]
We introduce SummScreen, a summarization dataset comprised of pairs of TV series transcripts and human written recaps. The dataset provides a challenging testbed for abstractive summarization for several reasons. Plot details are often expressed indirectly in character dialogues and may be scattered across the entirety of the transcript. These details must be found and integrated to form the succinct plot descriptions in the recaps. Also, TV scripts contain content that does not directly pertain to the central plot but rather serves to develop characters or provide comic relief. This information is rarely contained in recaps. Since characters are fundamental to TV series, we also propose two entity-centric evaluation metrics. Empirically, we characterize the dataset by evaluating several methods, including neural models and those based on nearest neighbors. An oracle extractive approach outperforms all benchmarked models according to automatic metrics, showing that the neural models are unable to fully exploit the input transcripts. Human evaluation and qualitative analysis reveal that our non-oracle models are competitive with their oracle counterparts in terms of generating faithful plot events and can benefit from better content selectors. Both oracle and non-oracle models generate unfaithful facts, suggesting future research directions.
SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization Bogdan Gliwa, Iwona Mochol, Maciej Biesek, Aleksander Wawer EMNLP19 [pdf] [data]
Dial2Desc: End-to-end Dialogue Description Generation Haojie Pan, Junpei Zhou, Zhou Zhao, Yan Liu, Deng Cai, Min Yang [pdf]
The AMI meeting corpus: A pre-announcement Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al. [pdf]
The ICSI meeting corpus Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, et al. [pdf]

Email Summarization

Focus on the Action: Learning to Highlight and Summarize Jointly for Email To-Do Items Summarization Kexun Zhang, Jiaao Chen, Diyi Yang Findings of ACL 2022 [pdf]
EmailSum: Abstractive Email Thread Summarization Shiyue Zhang, Asli Celikyilmaz, Jianfeng Gao, Mohit Bansal ACL 2021 [pdf] [data]
Smart To-Do: Automatic Generation of To-Do Items from Emails Sudipto Mukherjee, Subhabrata Mukherjee, Marcello Hasegawa, Ahmed Hassan Awadallah, Ryen White ACL 2020 [pdf] [code] [bib]
Identifying Implicit Quotes for Unsupervised Extractive Summarization of Conversations Ryuji Kano, Yasuhide Miura, Tomoki Taniguchi, Tomoko Ohkuma AACL20 [pdf]
This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation Rui Zhang, Joel Tetreault ACL 2019 [pdf] [data] [bib]
Building a Dataset for Summarization and Keyword Extraction from Emails Vanessa Loza, Shibamouli Lahiri, Rada Mihalcea, Po-Hsiang Lai LREC 2014 [pdf]
A Publicly Available Annotated Corpus for Supervised Email Summarization Jan Ulrich, Gabriel Murray, Giuseppe Carenini AAAI 2008 [pdf]
Summarizing Email Conversations with Clue Words Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou WWW 2007 [pdf]
Task-focused Summarization of Email Simon H. Corston-Oliver, Eric Ringger, Michael Gamon, Richard Campbell ACL 2004 [pdf]
Summarizing email threads Owen Rambow, Lokesh Shrestha, John Chen, Chirsty Lauridsen NAACL 2004 [pdf] [bib]
Facilitating email thread access by extractive summary generation Ani Nenkova Recent advances in natural language processing III: selected papers from RANLP [pdf]
Summarizing Archived Discussions: A Beginning Paula S. Newman, John C. Blitzer Proceedings of the 8th international conference on Intelligent user interfaces [pdf]
Combining linguistic and machine learning techniques for email summarization Smaranda Muresan, Evelyne Tzoukermann, Judith L. Klavans Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (CoNLL) 2001 [pdf] [bib]

Meeting Summarization

Learning to Rank Utterances for Query-Focused Meeting Summarization Xingxian Liu, Yajing Xu Findings of ACL 2023 [pdf]

[Abs]
Query-focused meeting summarization (QFMS) aims to generate a specific summary for the given query according to the meeting transcripts. Due to the conflict between long meetings and limited input size, previous works mainly adopt extract-then-summarize methods, which use extractors to simulate binary labels or ROUGE scores to extract utterances related to the query and then generate a summary. However, the previous approach fails to fully use the comparison between utterances. To the extractor, comparison orders are more important than specific scores. In this paper, we propose a Ranker-Generator framework. It learns to rank the utterances by comparing them in pairs and learning from the global orders, then uses top utterances as the generator’s input. We show that learning to rank utterances helps to select utterances related to the query effectively, and the summarizer can benefit from it. Experimental results on QMSum show that the proposed model outperforms all existing multi-stage models with fewer parameters.
ExplainMeetSum: A Dataset for Explainable Meeting Summarization Aligned with Human Intent Hyun Kim, Minsoo Cho, Seung-Hoon Na ACL 2023 [pdf] [code]

[Abs]
To enhance the explainability of meeting summarization, we construct a new dataset called “ExplainMeetSum,” an augmented version of QMSum, by newly annotating evidence sentences that faithfully “explain” a summary. Using ExplainMeetSum, we propose a novel multiple extractor guided summarization, namely Multi-DYLE, which extensively generalizes DYLE to enable using a supervised extractor based on human-aligned extractive oracles. We further present an explainability-aware task, named “Explainable Evidence Extraction” (E3), which aims to automatically detect all evidence sentences that support a given summary. Experimental results on the QMSum dataset show that the proposed Multi-DYLE outperforms DYLE with gains of up to 3.13 in the ROUGE-1 score. We further present the initial results on the E3 task, under the settings using separate and joint evaluation metrics.
VCSUM: A Versatile Chinese Meeting Summarization Dataset Han Wu, Mingjie Zhan, Haochen Tan, Zhaohui Hou, Ding Liang, Linqi Song Findings of ACL 2023 [pdf] [data]

[Abs]
Compared to news and chat summarization, the development of meeting summarization is hugely decelerated by the limited data. To this end, we introduce a versatile Chinese meeting summarization dataset, dubbed VCSum, consisting of 239 real-life meetings, with a total duration of over 230 hours. We claim our dataset is versatile because we provide the annotations of topic segmentation, headlines, segmentation summaries, overall meeting summaries, and salient sentences for each meeting transcript. As such, the dataset can adapt to various summarization tasks or methods, including segmentation-based summarization, multi-granularity summarization and retrieval-then-generate summarization. Our analysis confirms the effectiveness and robustness of VCSum. We also provide a set of benchmark models regarding different downstream summarization tasks on VCSum to facilitate further research.
Query-Utterance Attention with Joint modeling for Query-Focused Meeting Summarization Xingxian Liu, Bin Duan, Bo Xiao, Yajing Xu ICASSP 2023 [pdf]

[Abs]
Query-focused meeting summarization (QFMS) aims to generate summaries from meeting transcripts in response to a given query. Previous works typically concatenate the query with meeting transcripts and implicitly model the query relevance only at the token level with attention mechanism. However, due to the dilution of key query-relevant information caused by long meeting transcripts, the original transformer-based model is insufficient to highlight the key parts related to the query. In this paper, we propose a query-aware framework with joint modeling token and utterance based on Query-Utterance Attention. It calculates the utterance-level relevance to the query with a dense retrieval module. Then both token-level query relevance and utterance-level query relevance are combined and incorporated into the generation process with attention mechanism explicitly. We show that the query relevance of different granularities contributes to generating a summary more related to the query. Experimental results on the QMSum dataset show that the proposed model achieves new state-of-the-art performance.
Meeting Decision Tracker: Making Meeting Minutes with De-Contextualized Utterances Shumpei Inoue, Hy Nguyen, Pham Viet Hoang, Tsungwei Liu, Minh-Tien Nguyen AACL-IJCNLP 2022 [pdf] [demo]

[Abs]
Meetings are a universal process to make decisions in business and project collaboration. The capability to automatically itemize the decisions in daily meetings allows for extensive tracking of past discussions. To that end, we developed Meeting Decision Tracker, a prototype system to construct decision items comprising decision utterance detector (DUD) and decision utterance rewriter (DUR). We show that DUR makes a sizable contribution to improving the user experience by dealing with utterance collapse in natural conversation. An introduction video of our system is also available.
ESSumm: Extractive Speech Summarization from Untranscribed Meeting Jun Wang Interspeech 2022 [pdf]

[Abs]
In this paper, we propose a novel architecture for direct extractive speech-to-speech summarization, ESSumm, which is an unsupervised model without dependence on intermediate transcribed text. Different from previous methods with text presentation, we are aimed at generating a summary directly from speech without transcription. First, a set of smaller speech segments are extracted based on speech signal's acoustic features. For each candidate speech segment, a distance-based summarization confidence score is designed for latent speech representation measure. Specifically, we leverage the off-the-shelf self-supervised convolutional neural network to extract the deep speech features from raw audio. Our approach automatically predicts the optimal sequence of speech segments that capture the key information with a target summary length. Extensive results on two well-known meeting datasets (AMI and ICSI corpora) show the effectiveness of our direct speech-based method to improve the summarization quality with untranscribed data. We also observe that our unsupervised speech-based method even performs on par with recent transcript-based summarization approaches, where extra speech recognition is required.
Abstractive Meeting Summarization: A Survey Virgile Rennard, Guokan Shang, Julie Hunter, Michalis Vazirgiannis [pdf]

[Abs]
Recent advances in deep learning, and especially the invention of encoder-decoder architectures, have significantly improved the performance of abstractive summarization systems. While the majority of research has focused on written documents, we have observed an increasing interest in the summarization of dialogues and multi-party conversation over the past few years. A system that could reliably transform the audio or transcript of a human conversation into an abridged version that homes in on the most important points of the discussion would be valuable in a wide variety of real-world contexts, from business meetings to medical consultations to customer service calls. This paper focuses on abstractive summarization for multi-party meetings, providing a survey of the challenges, datasets and systems relevant to this task and a discussion of promising directions for future study.
ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation Peter Polák, Muskaan Singh, Anna Nedoluzhko, Ondřej Bojar LREC 2022 [pdf] [data]
TANet: Thread-Aware Pretraining for Abstractive Conversational Summarization Ze Yang, Liran Wang, Zhoujin Tian, Wei Wu, Zhoujun Li Findings of NAACL 2022 [pdf]

[Abs]
Although pre-trained language models (PLMs) have achieved great success and become a milestone in NLP, abstractive conversational summarization remains a challenging but less studied task. The difficulty lies in two aspects. One is the lack of large-scale conversational summary data. Another is that applying the existing pre-trained models to this task is tricky because of the structural dependence within the conversation and its informal expression, etc. In this work, we first build a large-scale (11M) pretraining dataset called RCSum, based on the multi-person discussions in the Reddit community. We then present TANet, a thread-aware Transformer-based network. Unlike the existing pre-trained models that treat a conversation as a sequence of sentences, we argue that the inherent contextual dependency among the utterances plays an essential role in understanding the entire conversation and thus propose two new techniques to incorporate the structural information into our model. The first is thread-aware attention which is computed by taking into account the contextual dependency within utterances. Second, we apply thread prediction loss to predict the relations between utterances. We evaluate our model on four datasets of real conversations, covering types of meeting transcripts, customer-service records, and forum threads. Experimental results demonstrate that TANet achieves a new state-of-the-art in terms of both automatic evaluation and human judgment.
Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents Yusen Zhang, Ansong Ni, Ziming Mao, Chen Henry Wu, Chenguang Zhu, Budhaditya Deb, Ahmed H. Awadallah, Dragomir Radev, Rui Zhang ACL 2022 [pdf] [code]

[Abs]
Text summarization helps readers capture salient information from documents, news, interviews, and meetings. However, most state-of-the-art pretrained language models (LM) are unable to efficiently process long text for many summarization tasks. In this paper, we propose Summ^N, a simple, flexible, and effective multi-stage framework for input texts that are longer than the maximum context length of typical pretrained LMs. Summ^N first splits the data samples and generates a coarse summary in multiple stages and then produces the final fine-grained summary based on it. Our framework can process input text of arbitrary length by adjusting the number of stages while keeping the LM input size fixed. Moreover, it can deal with both single-source documents and dialogues, and it can be used on top of different backbone abstractive summarization models. To the best of our knowledge, Summ^N is the first multi-stage split-then-summarize framework for long input summarization. Our experiments demonstrate that Summ^N outperforms previous state-of-the-art methods by improving ROUGE scores on three long meeting summarization datasets AMI, ICSI, and QMSum, two long TV series datasets from SummScreen, and a long document summarization dataset GovReport. Our data and code are available at https://github.com/psunlpgroup/Summ-N.
Exploring Neural Models for Query-Focused Summarization Jesse Vig, Alexander R. Fabbri, Wojciech Kryściński [pdf] [code]
Improving Abstractive Dialogue Summarization with Hierarchical Pretraining and Topic Segment MengNan Qi, Hao Liu, YuZhuo Fu, Ting Liu EMNLP 2021 Findings [pdf]
Meeting Summarization with Pre-training and Clustering Methods Andras Huebner, Wei Ji, Xiang Xiao [pdf] [code]
Context or No Context? A preliminary exploration of human-in-the-loop approach for Incremental Temporal Summarization in meetings Nicole Beckage, Shachi H Kumar, Saurav Sahay, Ramesh Manuvinakurike EMNLP 2021 | newsum [pdf]
RetrievalSum: A Retrieval Enhanced Framework for Abstractive Summarization Chenxin An, Ming Zhong, Zhichao Geng, Jianqiang Yang, Xipeng Qiu [pdf]
An Exploratory Study on Long Dialogue Summarization: What Works and What's Next Yusen Zhang, Ansong Ni, Tao Yu, Rui Zhang, Chenguang Zhu, Budhaditya Deb, Asli Celikyilmaz, Ahmed Hassan Awadallah, Dragomir Radev Findings of EMNLP 2021 Short [pdf]
DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization Ming Zhong, Yang Liu, Yichong Xu, Chenguang Zhu, Michael Zeng AAAI 2022 [pdf] [code]
Dynamic Sliding Window for Meeting Summarization Zhengyuan Liu, Nancy F. Chen SummDial@SIGDial 2021 [pdf]
MeetSum: Transforming Meeting Transcript Summarization using Transformers! Nima Sadri, Bohan Zhang, Bihan Liu [pdf]
Incremental temporal summarization in multiparty meetings Ramesh Manuvinakurike, Saurav Sahay, Wenda Chen, Lama Nachman SIGIR 2021 [pdf]
Abstractive Spoken Document Summarization using Hierarchical Model with Multi-stage Attention Diversity Optimization Potsawee Manakul, Mark J. F. Gales, Linlin Wang INTERSPEECH 2020 [pdf] [code]
What are meeting summaries? An analysis of human extractive summaries in meeting corpus Fei Liu, Yang Liu SIGDIAL 2008 [pdf]
Exploring Speaker Characteristics for Meeting Summarization Fei Liu, Yang Liu INTERSPEECH 2010 [pdf]
Automatic meeting summarization and topic detection system Tai-Chia Huang, Chia-Hsuan Hsieh, Hei-Chia Wang [pdf]
A keyphrase based approach to interactive meeting summarization Korbinian Riedhammer, Benoit Favre, Dilek Hakkani-Tür 2008 IEEE Spoken Language Technology Workshop [pdf]
A global optimization framework for meeting summarization Dan Gillick, Korbinian Riedhammer, Benoit Favre, Dilek Hakkani-Tür 2009 IEEE International Conference on Acoustics, Speech and Signal Processing [pdf]
Evaluating the effectiveness of features and sampling in extractive meeting summarization Shasha Xie, Yang Liu, Hui Lin SLT 2008 [pdf]
Abstractive Meeting Summarization Using Dependency Graph Fusion Siddhartha Banerjee, Prasenjit Mitra, Kazunari Sugiyama WWW 2015 [pdf]
Automatic Community Creation for Abstractive Spoken Conversation Summarization Karan Singla, Evgeny Stepanov, Ali Orkan Bayer, Giuseppe Carenini, Giuseppe Riccardi ACL 2017 workshop [pdf] [bib]
Unsupervised Abstractive Meeting Summarization with Multi-Sentence Compression and Budgeted Submodular Maximization Guokan Shang, Wensi Ding, Zekun Zhang, Antoine Jean-Pierre Tixier, Polykarpos Meladianos, Michalis Vazirgiannis, Jean-Pierre Lorré ACL18 [pdf] [code]
Abstractive meeting summarization based on an attentional neural model Nouha Dammak, Yassine BenAyed [pdf]
A Study of Text Summarization Techniques for Generating Meeting Minutes Tu My Doan, Francois Jacquenet, Christine Largeron, Marc Bernard RCIS 2020 [pdf]
Meeting Summarization, A Challenge for Deep Learning Francois Jacquenet, Marc Bernard, Christine Largeron IWANN 2019 [pdf]
Generating Abstractive Summaries from Meeting Transcripts Siddhartha Banerjee, Prasenjit Mitra, Kazunari Sugiyama Proceedings of the 2015 ACM Symposium on Document Engineering, DocEng 2015 [pdf]
Align then Summarize: Automatic Alignment Methods for Summarization Corpus Creation Paul Tardy, David Janiszek, Yannick Estève, Vincent Nguyen LREC 2020 [pdf] [bib]
Dialogue Discourse-Aware Graph Model and Data Augmentation for Meeting Summarization Xiachong Feng, Xiaocheng Feng, Bing Qin, Xinwei Geng IJCAI21 [pdf] [code]
How Domain Terminology Affects Meeting Summarization Performance Jia Jin Koay, Alexander Roustai, Xiaojin Dai, Dillon Burns, Alec Kerrigan, Fei Liu COLING20 Short [pdf] [code]
How to Interact and Change? Abstractive Dialogue Summarization with Dialogue Act Weight and Topic Change Info Jiasheng Di, Xiao Wei, Zhenyu Zhang KSEM 2020 [pdf] [code]
Abstractive Dialogue Summarization with Sentence-Gated Modeling Optimized by Dialogue Acts Chih-Wen Goo, Yun-Nung Chen SLT18 [pdf] [code]
A Sliding-Window Approach to Automatic Creation of Meeting Minutes Jia Jin Koay, Alexander Roustai, Xiaojin Dai, Fei Liu [pdf]
Hierarchical Learning for Generation with Long Source Sequences Tobias Rohde, Xiaoxia Wu, Yinhan Liu [pdf] [code]
A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining Chenguang Zhu, Ruochen Xu, Michael Zeng, Xuedong Huang Findings of EMNLP20 [pdf] [code] [unofficial-code]
Abstractive Meeting Summarization via Hierarchical Adaptive Segmental Network Learning Zhou Zhao, Haojie Pan, Changjie Fan, Yan Liu, Linlin Li, Min Yang WWW19 [pdf]
Restructuring Conversations using Discourse Relations for Zero-shot Abstractive Dialogue Summarization Prakhar Ganesh, Saket Dingliwal [pdf]
Keep Meeting Summaries on Topic: Abstractive Multi-Modal Meeting Summarization Manling Li, Lingyu Zhang, Heng Ji, Richard J. Radke ACL19 [pdf]
Automatic analysis of multiparty meetings Steve Renals [pdf]
A Multimodal Meeting Browser that Implements an Important Utterance Detection Model based on Multimodal Information Fumio Nihei, Yukiko I. Nakano [pdf]
Exploring Methods for Predicting Important Utterances Contributing to Meeting Summarization Fumio Nihei, Yukiko I. Nakano [pdf]
Fusing Verbal and Nonverbal Information for Extractive Meeting Summarization Fumio Nihei, Yukiko I. Nakano, Yutaka Takase GIFT18 [pdf]
Meeting Extracts for Discussion Summarization Based on Multimodal Nonverbal Information Fumio Nihei, Yukiko I. Nakano, Yutaka Takase ICMI16 [pdf]
Extractive Summarization of Meeting Recordings Gabriel Murray, Steve Renals, Jean Carletta [pdf]
Multimodal Summarization of Meeting Recordings Berna Erol, Dar-Shyang Lee, Jonathan Hull ICME 2003 [pdf]
Few-Shot Learning of an Interleaved Text Summarization Model by Pretraining with Synthetic Data Sanjeev Kumar Karn, Francine Chen, Yan-Ying Chen, Ulli Waltinger, Hinrich Schütze EACL21 [pdf]
Leverage Unlabeled Data for Abstractive Speech Summarization with Self-Supervised Learning and Back-Summarization SPECOM 2020 [pdf]
Focused Meeting Summarization via Unsupervised Relation Extraction Lu Wang, Claire Cardie SIGDIAL 2012 [pdf]
QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, Dragomir Radev NAACL21 [pdf] [data]
Domain-Independent Abstract Generation for Focused Meeting Summarization Lu Wang, Claire Cardie ACL 2013 [pdf]
Summarizing Decisions in Spoken Meetings Lu Wang, Claire Cardie ACL 2011 [pdf]
Extracting Decisions from Multi-Party Dialogue Using Directed Graphical Models and Semantic Similarity Trung Bui, Matthew Frampton, John Dowding, Stanley Peters SIGDIAL 2009 [pdf] [bib]
ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining Alexander R. Fabbri, Faiaz Rahman, Imad Rizvi, Borui Wang, Haoran Li, Yashar Mehdad, Dragomir Radev ACL 2021 [pdf] [code]

Chat Summarization

Dialogue Summarization with Static-Dynamic Structure Fusion Graph Shen Gao, Xin Cheng, Mingzhe Li, Xiuying Chen, Jinpeng Li, Dongyan Zhao, Rui Yan [pdf] [code]

[Abs]
Dialogue, the most fundamental and specially privileged arena of language, gains increasing ubiquity across the Web in recent years. Quickly going through the long dialogue context and capturing salient information scattered over the whole dialogue session benefit users in many real-world Web applications such as email thread summarization and meeting minutes draft. Dialogue summarization is a challenging task in that dialogue has dynamic interaction nature and presumably inconsistent information flow among various speakers. Many researchers address this task by modeling dialogue with pre-computed static graph structure using external linguistic toolkits. However, such methods heavily depend on the reliability of external tools and the static graph construction is disjoint with the graph representation learning phase, which makes the graph can’t be dynamically adapted for the downstream summarization task. In this paper, we propose a Static-Dynamic graph-based Dialogue Summarization model (SDDS), which fuses prior knowledge from human expertise and adaptively learns the graph structure in an end-to-end learning fashion. To verify the effectiveness of SDDS, we conduct experiments on three benchmark datasets (SAMSum, MediaSum, and DialogSum) and the results verify the superiority of SDDS.
Mind the Gap! Injecting Commonsense Knowledge for Abstractive Dialogue Summarization Seungone Kim, Se June Joo, Hyungjoo Chae, Chaehyeong Kim, Seung-won Hwang, Jinyoung Yeo COLING 2022 [pdf]

[Abs]
In this paper, we propose to leverage the unique characteristics of dialogues sharing commonsense knowledge across participants, to resolve the difficulties in summarizing them. We present SICK, a framework that uses commonsense inferences as additional context. Compared to previous work that solely relies on the input dialogue, SICK uses an external knowledge model to generate a rich set of commonsense inferences and selects the most probable one with a similarity-based selection method. Built upon SICK, SICK++ utilizes commonsense as supervision, where the task of generating commonsense inferences is added upon summarizing the dialogue in a multi-task learning setting. Experimental results show that with injected commonsense knowledge, our framework generates more informative and consistent summaries than existing methods.
A Finer-grain Universal Dialogue Semantic Structures based Model For Abstractive Dialogue Summarization Yuejie Lei, Fujia Zheng, Yuanmeng Yan, Keqing He, Weiran Xu EMNLP 2021 Findings [pdf] [code]
Capturing Speaker Incorrectness: Speaker-Focused Post-Correction for Abstractive Dialogue Summarization Dongyub Lee, Jungwoo Lim, Taesun Whang, Chanhee Lee, Seungwoo Cho, Mingun Park, Heuiseok Lim EMNLP 2021 | newsum [pdf]
Who says like a style of Vitamin: Towards Syntax-Aware Dialogue Summarization using Multi-task Learning Seolhwa Lee, Kisu Yang, Chanjun Park, João Sedoc, Heuiseok Lim [pdf]
Controllable Neural Dialogue Summarization with Personal Named Entity Planning Zhengyuan Liu, Nancy F. Chen EMNLP 2021 [pdf]
GupShup: Summarizing Open-Domain Code-Switched Conversations Laiba Mehnaz, Debanjan Mahata, Rakesh Gosangi, Uma Sushmitha Gunturi, Riya Jain, Gauri Gupta, Amardeep Kumar, Isabelle G. Lee, Anish Acharya, Rajiv Ratn Shah EMNLP 2021 [pdf] [code]
Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization Junpeng Liu, Yanyan Zou, Hainan Zhang, Hongshen Chen, Zhuoye Ding, Caixia Yuan, Xiaojie Wang EMNLP 2021 Findings [pdf] [code]
Give the Truth: Incorporate Semantic Slot into Abstractive Dialogue Summarization Lulu Zhao, Weihao Zeng, Weiran Xu, Jun Guo EMNLP 2021 Findings [pdf]
Low-Resource Dialogue Summarization with Domain-Agnostic Multi-Source Pretraining Yicheng Zou, Bolin Zhu, Xingwu Hu, Tao Gui, Qi Zhang EMNLP 2021 [pdf] [code]
Enhancing Semantic Understanding with Self-Supervised Methods for Abstractive Dialogue Summarization Hyunjae Lee, Jaewoong Yun, Hyunjin Choi, Seongho Joe, Youngjune L. Gwon Interspeech 2021 [pdf]
Dialogue summarization with supporting utterance flow modeling and fact regularization Wang Chen, Piji Li, Hou Pong Chan, Irwin King Knowledge-Based Systems [pdf]
Situation-Based Multiparticipant Chat Summarization: a Concept, an Exploration-Annotation Tool and an Example Collection Anna Smirnova, Evgeniy Slobodkin, George Chernishev ACL 2021 Student Research Workshop [pdf] [tool] [data]
Coreference-Aware Dialogue Summarization Zhengyuan Liu, Ke Shi, Nancy F. Chen SIGDIAL 2021 [pdf]
Incorporating Commonsense Knowledge into Abstractive Dialogue Summarization via Heterogeneous Graph Networks Xiachong Feng, Xiaocheng Feng, Bing Qin CCL 2021 [pdf]
Hierarchical Speaker-Aware Sequence-to-Sequence Model for Dialogue Summarization Yuejie Lei, Yuanmeng Yan, Zhiyuan Zeng, Keqing He, Ximing Zhang, Weiran Xu ICASSP21 [pdf]
Summary Grounded Conversation Generation Chulaka Gunasekara, Guy Feigenblat, Benjamin Sznajder, Sachindra Joshi, David Konopnicki Findings of ACL 2021 [pdf]
Controllable Abstractive Dialogue Summarization with Sketch Supervision Chien-Sheng Wu, Linqing Liu, Wenhao Liu, Pontus Stenetorp, Caiming Xiong ACL-Findings 2021 [pdf] [code]
Structure-Aware Abstractive Conversation Summarization via Discourse and Action Graphs Jiaao Chen, Diyi Yang NAACL21 [pdf] [code]
Planning with Learned Entity Prompts for Abstractive Summarization Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Simoes, Ryan McDonald TACL 2021 [pdf]
Improving Abstractive Dialogue Summarization with Graph Structures and Topic Words Lulu Zhao, Weiran Xu, Jun Guo COLING20 [pdf]
Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization Jiaao Chen, Diyi Yang EMNLP20 [pdf] [code]
SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization Bogdan Gliwa, Iwona Mochol, Maciej Biesek, Aleksander Wawer EMNLP19 [pdf] [data]

Medical Dialogue Summarization

COSSUM: Towards Conversation-Oriented Structured Summarization for Automatic Medical Insurance Assessment Sheng Xu, Xiaojun Wan, Sen Hu, Mengdi Zhou, Teng Xu, Hongbin Wang, Haitao Mi KDD 2022 [pdf]

[Abs]
In the medical insurance industry, a lot of human labor is required to collect information of claimants. Human assessors need to converse with claimants in order to record key information and organize it into a structured summary. With the purpose of helping save human labor, we propose the task of conversation-oriented structured summarization which aims to automatically produce the desired structured summary from a conversation. One major challenge of the task is that the structured summary contains multiple fields of different types. To tackle this problem, we propose a unified approach COSSUM based on prompting to generate the values of all fields simultaneously. By learning all fields together, our approach can capture the inherent relationship between them. Moreover, we propose a specially designed curriculum learning strategy for model training. Both automatic and human evaluations are performed, and the results show the effectiveness of our proposed approach.
Counseling Summarization using Mental Health Knowledge Guided Utterance Filtering Aseem Srivastava, Tharun Suresh, Sarah Peregrine (Grin) Lord, Md. Shad Akhtar, Tanmoy Chakraborty KDD 2022 ADS Track [pdf]

[Abs]
The psychotherapy intervention technique is a multifaceted conversation between a therapist and a patient. Unlike general clinical discussions, psychotherapy's core components (viz. symptoms) are hard to distinguish, thus becoming a complex problem to summarize later. A structured counseling conversation may contain discussions about symptoms, history of mental health issues, or the discovery of the patient's behavior. It may also contain discussion filler words irrelevant to a clinical summary. We refer to these elements of structured psychotherapy as counseling components. In this paper, the aim is mental health counseling summarization to build upon domain knowledge and to help clinicians quickly glean meaning. We create a new dataset after annotating 12.9K utterances of counseling components and reference summaries for each dialogue. Further, we propose ConSum, a novel counseling-component guided summarization model. ConSum undergoes three independent modules. First, to assess the presence of depressive symptoms, it filters utterances utilizing the Patient Health Questionnaire (PHQ-9), while the second and third modules aim to classify counseling components. At last, we propose a problem-specific Mental Health Information Capture (MHIC) evaluation metric for counseling summaries. Our comparative study shows that we improve on performance and generate cohesive, semantic, and coherent summaries. We comprehensively analyze the generated summaries to investigate the capturing of psychotherapy elements. Human and clinical evaluations on the summary show that ConSum generates quality summary. Further, mental health experts validate the clinical acceptability of the ConSum. Lastly, we discuss the uniqueness in mental health counseling summarization in the real world and show evidences of its deployment on an online application with the support of http://mpathic.ai/
Adding more data does not always help: A study in medical conversation summarization with PEGASUS Varun Nair, Namit Katariya, Xavier Amatriain, Ilya Valmianski, Anitha Kannan [pdf]
Leveraging Pretrained Models for Automatic Summarization of Doctor-Patient Conversations Longxiang Zhang, Renato Negrinho, Arindam Ghosh, Vasudevan Jagannathan, Hamid Reza Hassanzadeh, Thomas Schaaf, and Matthew R. Gormley Findings of EMNLP 2021 [pdf]
Medically Aware GPT-3 as a Data Generator for Medical Dialogue Summarization Bharath Chintagunta, Namit Katariya, Xavier Amatriain, Anitha Kannan NAACL | NLPMC 2021 [pdf1] [pdf2]
Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques Kundan Krishna, Sopan Khosla, Jeffrey P. Bigham, Zachary C. Lipton ACL 2021 [pdf] [code]
Summarizing Medical Conversations via Identifying Important Utterances Yan Song, Yuanhe Tian, Nan Wang, Fei Xia COLING 2020 [pdf] [code] [bib]
Dr.Summarize: Global Summarization of Medical Dialogue by Exploiting Local Structures Anirudh Joshi, Namit Katariya, Xavier Amatriain, Anitha Kannan Findings of EMNLP 2020 [pdf] [bib]
Medical Dialogue Summarization for Automated Reporting in Healthcare Sabine Molenaar, Lientje Maas, Verónica Burriel, Fabiano Dalpiaz, Sjaak Brinkkemper Advanced Information Systems Engineering Workshops 2020 [pdf] [bib]
Generating Medical Reports from Patient-Doctor Conversations using Sequence-to-Sequence Models Seppo Enarvi, Marilisa Amoia, Miguel Del-Agua Teba, Brian Delaney, Frank Diehl, Stefan Hahn, Kristina Harris, Liam McGrath, Yue Pan, Joel Pinto, Luca Rubini, Miguel Ruiz, Gagandeep Singh, Fabian Stemmer, Weiyi Sun, Paul Vozila, Thomas Lin, Ranjani Ramamurthy ACL 2020 Short [pdf] [bib]
Automatically Generating Psychiatric Case Notes From Digital Transcripts of Doctor-Patient Conversations Nazmul Kazi, Indika Kahanda NAACL 2019 [pdf] [bib]
Alignment Annotation for Clinic Visit Dialogue to Clinical Note Sentence Language Generation Wen-wai Yim, Meliha Yetisgen, Jenny Huang, Micah Grossman LREC 2020 [pdf] [bib]
Topic-aware Pointer-Generator Networks for Summarizing Spoken Conversations Zhengyuan Liu, Angela Ng, Sheldon Lee, Ai Ti Aw, Nancy F. Chen ASRU 2019 [pdf]

Customer Service Summarization

Other Roles Matter! Enhancing Role-Oriented Dialogue Summarization via Role Interactions Haitao Lin, Junnan Zhu, Lu Xiang, Yu Zhou, Jiajun Zhang, Chengqing Zong ACL 2022 [pdf] [code]

[Abs]
Role-oriented dialogue summarization is to generate summaries for different roles in the dialogue, e.g., merchants and consumers. Existing methods handle this task by summarizing each role’s content separately and thus are prone to ignore the information from other roles. However, we believe that other roles’ content could benefit the quality of summaries, such as the omitted information mentioned by other roles. Therefore, we propose a novel role interaction enhanced method for role-oriented dialogue summarization. It adopts cross attention and decoder self-attention interactions to interactively acquire other roles’ critical information. The cross attention interaction aims to select other roles’ critical dialogue utterances, while the decoder self-attention interaction aims to obtain key information from other roles’ summaries. Experimental results have shown that our proposed method significantly outperforms strong baselines on two public role-oriented dialogue summarization datasets. Extensive analyses have demonstrated that other roles’ content could help generate summaries with more complete semantics and correct topic structures.
An End-to-End Dialogue Summarization System for Sales Calls Abedelkadir Asi, Song Wang, Roy Eisenstadt, Dean Geckt, Yarin Kuper, Yi Mao, Royi Ronen NAACL 2022 [pdf]
Heuristic-based Inter-training to Improve Few-shot Multi-perspective Dialog Summarization Benjamin Sznajder, Chulaka Gunasekara, Guy Lev, Sachin Joshi, Eyal Shnarch, Noam Slonim [pdf]
Dialogue Summaries as Dialogue States (DS2), Template-Guided Summarization for Few-shot Dialogue State Tracking Jamin Shin, Hangyeol Yu, Hyeongdon Moon, Andrea Madotto, Juneyoung Park Findings of ACL 2022 [pdf] [code]
TWEETSUMM - A Dialog Summarization Dataset for Customer Service Guy Feigenblat, Chulaka Gunasekara, Benjamin Sznajder, Sachindra Joshi, David Konopnicki, Ranit Aharonov [pdf] [data]
Extractive Dialogue Summarization Without Annotation Based on Distantly Supervised Machine Reading Comprehension in Customer Service Bing Ma, Haifeng Sun, Jingyu Wang, Qi Qi, and Jianxin Liao TASLP [pdf]
TODSum: Task-Oriented Dialogue Summarization with State Tracking Lulu Zhao, Fujia Zheng, Keqing He, Weihao Zeng, Yuejie Lei, Huixing Jiang, Wei Wu, Weiran Xu, Jun Guo, Fanyu Meng [pdf]
CSDS: A Fine-grained Chinese Dataset for Customer Service Dialogue Summarization Haitao Lin, Liqun Ma, Junnan Zhu, Lu Xiang, Yu Zhou, Jiajun Zhang, Chengqing Zong EMNLP 2021 [pdf] [data]
Distant Supervision based Machine Reading Comprehension for Extractive Summarization in Customer Service Bing Ma, Cao Liu, Jingyu Wang, Shujie Hu, Fan Yang, Xunliang Cai, Guanglu Wan, Jiansong Chen, Jianxin Liao SIGIR 2021 [pdf]
Unsupervised Abstractive Dialogue Summarization for Tete-a-Tetes Xinyuan Zhang, Ruiyi Zhang, Manzil Zaheer, Amr Ahmed AAAI 2021 [pdf]
Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling Yicheng Zou, Lujun Zhao, Yangyang Kang, Jun Lin, Minlong Peng, Zhuoren Jiang, Changlong Sun, Qi Zhang, Xuanjing Huang, Xiaozhong Liu AAAI 2021 [pdf] [code]
Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders Yicheng Zou, Jun Lin, Lujun Zhao, Yangyang Kang, Zhuoren Jiang, Changlong Sun, Qi Zhang, Xuanjing Huang, Xiaozhong Liu AAAI 2021 [pdf] [code]
Abstractive Dialog Summarization with Semantic Scaffolds Lin Yuan, Zhou Yu [pdf]
Automatic Dialogue Summary Generation for Customer Service Chunyi Liu, Peng Wang, Jiang Xu, Zang Li and Jieping Ye KDD 2019 [pdf]

Domain Adaptation

DIONYSUS: A Pre-trained Model for Low-Resource Dialogue Summarization Yu Li, Baolin Peng, Pengcheng He, Michel Galley, Zhou Yu, Jianfeng Gao [pdf]

[Abs]
Dialogue summarization has recently garnered significant attention due to its wide range of applications. However, existing methods for summarizing dialogues are suboptimal because they do not take into account the inherent structure of dialogue and rely heavily on labeled data, which can lead to poor performance in new domains. In this work, we propose DIONYSUS (dynamic input optimization in pre-training for dialogue summarization), a pre-trained encoder-decoder model for summarizing dialogues in any new domain. To pre-train DIONYSUS, we create two pseudo summaries for each dialogue example: one is produced by a fine-tuned summarization model, and the other is a collection of dialogue turns that convey important information. We then choose one of these pseudo summaries based on the difference in information distribution across different types of dialogues. This selected pseudo summary serves as the objective for pre-training DIONYSUS using a self-supervised approach on a large dialogue corpus. Our experiments show that DIONYSUS outperforms existing methods on six datasets, as demonstrated by its ROUGE scores in zero-shot and few-shot settings.
Domain-Oriented Prefix-Tuning: Towards Efficient and Generalizable Fine-tuning for Zero-Shot Dialogue Summarization Lulu Zhao, Fujia Zheng, Weihao Zeng, Keqing He, Weiran Xu, Huixing Jiang, Wei Wu, Yanan Wu NAACL 2022 [pdf] [code]
AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization Tiezheng Yu, Zihan Liu, Pascale Fung NAACL 2021 [pdf] [code]
Domain Adaptation to Summarize Human Conversations Oana Sandu, Giuseppe Carenini, Gabriel Murray, Raymond Ng ACL 2010 Workshop [pdf]

Others

Summarizing Community-based Question-Answer Pairs Ting-Yao Hsu, Yoshi Suhara, Xiaolan Wang EMNLP 2022 [pdf] [code]

[Abs]
Community-based Question Answering (CQA), which allows users to acquire their desired information, has increasingly become an essential component of online services in various domains such as E-commerce, travel, and dining. However, an overwhelming number of CQA pairs makes it difficult for users without particular intent to find useful information spread over CQA pairs. To help users quickly digest the key information, we propose the novel CQA summarization task that aims to create a concise summary from CQA pairs. To this end, we first design a multi-stage data annotation process and create a benchmark dataset, COQASUM, based on the Amazon QA corpus. We then compare a collection of extractive and abstractive summarization methods and establish a strong baseline approach DedupLED for the CQA summarization task. Our experiment further confirms two key challenges, sentence-type transfer and deduplication removal, towards the CQA summarization task. Our data and code are publicly available.
Curriculum Prompt Learning with Self-Training for Abstractive Dialogue Summarization Changqun Li, Linlin Wang, Xin Lin, Gerard de Melo, Liang He EMNLP 2022 [pdf]

[Abs]
Succinctly summarizing dialogue is a task of growing interest, but inherent challenges, such as insufficient training data and low information density impede our ability to train abstractive models. In this work, we propose a novel curriculum-based prompt learning method with self-training to address these problems. Specifically, prompts are learned using a curriculum learning strategy that gradually increases the degree of prompt perturbation, thereby improving the dialogue understanding and modeling capabilities of our model. Unlabeled dialogue is incorporated by means of self-training so as to reduce the dependency on labeled data. We further investigate topic-aware prompts to better plan for the generation of summaries. Experiments confirm that our model substantially outperforms strong baselines and achieves new state-of-the-art results on the AMI and ICSI datasets. Human evaluations also show the superiority of our model with regard to the summary generation quality.
STRUDEL: Structured Dialogue Summarization for Dialogue Comprehension Borui Wang, Chengcheng Feng, Arjun Nair, Madelyn Mao, Jai Desai, Asli Celikyilmaz, Haoran Li, Yashar Mehdad, Dragomir Radev EMNLP 2022 [pdf]

[Abs]
Abstractive dialogue summarization has long been viewed as an important standalone task in natural language processing, but no previous work has explored the possibility of whether abstractive dialogue summarization can also be used as a means to boost an NLP system's performance on other important dialogue comprehension tasks. In this paper, we propose a novel type of dialogue summarization task - STRUctured DiaLoguE Summarization - that can help pre-trained language models to better understand dialogues and improve their performance on important dialogue comprehension tasks. We further collect human annotations of STRUDEL summaries over 400 dialogues and introduce a new STRUDEL dialogue comprehension modeling framework that integrates STRUDEL into a graph-neural-network-based dialogue reasoning module over transformer encoder language models to improve their dialogue comprehension abilities. In our empirical experiments on two important downstream dialogue comprehension tasks - dialogue question answering and dialogue response prediction - we show that our STRUDEL dialogue comprehension model can significantly improve the dialogue comprehension performance of transformer encoder language models.
Enhancing Dialogue Summarization with Topic-Aware Global- and Local- Level Centrality Xinnian Liang, Shuangzhi Wu, Chenhao Cui, Jiaqi Bai, Chao Bian, Zhoujun Li EACL 2023 [pdf] [code]

[Abs]
Dialogue summarization aims to condense a given dialogue into a simple and focused summary text. Typically, both the roles' viewpoints and conversational topics change in the dialogue stream. Thus how to effectively handle the shifting topics and select the most salient utterance becomes one of the major challenges of this task. In this paper, we propose a novel topic-aware Global-Local Centrality (GLC) model to help select the salient context from all sub-topics. The centralities are constructed at both the global and local levels. The global one aims to identify vital sub-topics in the dialogue and the local one aims to select the most important context in each sub-topic. Specifically, the GLC collects sub-topic based on the utterance representations. And each utterance is aligned with one sub-topic. Based on the sub-topics, the GLC calculates global- and local-level centralities. Finally, we combine the two to guide the model to capture both salient context and sub-topics when generating summaries. Experimental results show that our model outperforms strong baselines on three public dialogue summarization datasets: CSDS, MC, and SAMSUM. Further analysis demonstrates that our GLC can exactly identify vital contents from sub-topics.
SWING: Balancing Coverage and Faithfulness for Dialogue Summarization Kung-Hsiang Huang, Siffi Singh, Xiaofei Ma, Wei Xiao, Feng Nan, Nicholas Dingwall, William Yang Wang, Kathleen McKeown Findings of EACL 2023 [pdf] [code]

[Abs]
Missing information is a common issue of dialogue summarization where some information in the reference summaries is not covered in the generated summaries. To address this issue, we propose to utilize natural language inference (NLI) models to improve coverage while avoiding introducing factual inconsistencies. Specifically, we use NLI to compute fine-grained training signals to encourage the model to generate content in the reference summaries that have not been covered, as well as to distinguish between factually consistent and inconsistent generated sentences. Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach in balancing coverage and faithfulness, validated with automatic metrics and human evaluations. Additionally, we compute the correlation between commonly used automatic metrics with human judgments in terms of three different dimensions regarding coverage and factual consistency to provide insight into the most suitable metric for evaluating dialogue summaries.
Human-in-the-loop Abstractive Dialogue Summarization Jiaao Chen, Mohan Dodda, Diyi Yang Findings of ACL 2023 [pdf]

[Abs]
Abstractive dialogue summarization has received increasing attention recently. Despite the fact that most of the current dialogue summarization systems are trained to maximize the likelihood of human-written summaries and have achieved significant results, there is still a huge gap in generating high-quality summaries as determined by humans, such as coherence and faithfulness, partly due to the misalignment in maximizing a single human-written summary. To this end, we propose to incorporate different levels of human feedback into the training process. This will enable us to guide the models to capture the behaviors humans care about for summaries. Specifically, we ask humans to highlight the salient information to be included in summaries to provide the local feedback, and to make overall comparisons among summaries in terms of coherence, accuracy, coverage, concise and overall quality, as the global feedback. We then combine both local and global feedback to fine-tune the dialog summarization policy with Reinforcement Learning. Experiments conducted on multiple datasets demonstrate the effectiveness and generalization of our methods over the state-of-the-art supervised baselines, especially in terms of human judgments.
ED-FAITH: Evaluating Dialogue Summarization on Faithfulness Sicong Huang, Asli Celikyilmaz, Haoran Li [pdf]

[Abs]
Abstractive summarization models typically generate content unfaithful to the input, thus highlighting the significance of evaluating the faithfulness of generated summaries. Most faithfulness metrics are only evaluated on news domain, can they be transferred to other summarization tasks? In this work, we first present a systematic study of faithfulness metrics for dialogue summarization. We evaluate common faithfulness metrics on dialogue datasets and observe that most metrics correlate poorly with human judgements despite performing well on news datasets. Given these findings, to improve existing metrics’ performance on dialogue summarization, we first finetune on in-domain dataset, then apply unlikelihood training on negative samples, and show that they can successfully improve metric performance on dialogue data. Inspired by the strong zero-shot performance of the T0 language model, we further propose T0-Score – a new metric for faithfulness evaluation, which shows consistent improvement against baseline metrics across multiple domains.
Towards Understanding Omission in Dialogue Summarization Yicheng Zou, Kaitao Song, Xu Tan, Zhongkai Fu, Tao Gui, Qi Zhang, Dongsheng Li [pdf]

[Abs]
Dialogue summarization aims to condense the lengthy dialogue into a concise summary, and has recently achieved significant progress. However, the result of existing methods is still far from satisfactory. Previous works indicated that omission is a major factor in affecting the quality of summarization, but few of them have further explored the omission problem, such as how omission affects summarization results and how to detect omission, which is critical for reducing omission and improving summarization quality. Moreover, analyzing and detecting omission relies on summarization datasets with omission labels (i.e., which dialogue utterances are omitted in the summarization), which are not available in the current literature. In this paper, we propose the OLDS dataset, which provides high-quality Omission Labels for Dialogue Summarization. By analyzing this dataset, we find that a large improvement in summarization quality can be achieved by providing ground-truth omission labels for the summarization model to recover omission information, which demonstrates the importance of omission detection for omission mitigation in dialogue summarization. Therefore, we formulate an omission detection task and demonstrate our proposed dataset can support the training and evaluation of this task well. We also call for research action on omission detection based on our proposed datasets. Our dataset and codes are publicly available.
Analyzing and Evaluating Faithfulness in Dialogue Summarization Bin Wang, Chen Zhang, Yan Zhang, Yiming Chen, Haizhou Li EMNLP 2022 [pdf] [code]

[Abs]
Dialogue summarization is abstractive in nature, making it suffer from factual errors. The factual correctness of summaries has the highest priority before practical applications. Many efforts have been made to improve faithfulness in text summarization. However, there is a lack of systematic study on dialogue summarization systems. In this work, we first perform the fine-grained human analysis on the faithfulness of dialogue summaries and observe that over 35% of generated summaries are faithfully inconsistent with respect to the source dialogues. Furthermore, we present a new model-level faithfulness evaluation method. It examines generation models with multi-choice questions created by rule-based transformations. Experimental results show that our evaluation schema is a strong proxy for the factual correctness of summarization models. The human-annotated faithfulness samples and the evaluation toolkit are released to facilitate future research toward faithful dialogue summarization.
Taxonomy of Abstractive Dialogue Summarization: Scenarios, Approaches and Future Directions Qi Jia, Siyu Ren, Yizhu Liu, Kenny Q. Zhu [pdf]

[Abs]
Abstractive dialogue summarization is to generate a concise and fluent summary covering the salient information in a dialogue among two or more interlocutors. It has attracted great attention in recent years based on the massive emergence of social communication platforms and an urgent requirement for efficient dialogue information understanding and digestion. Different from news or articles in traditional document summarization, dialogues bring unique characteristics and additional challenges, including different language styles and formats, scattered information, flexible discourse structures and unclear topic boundaries. This survey provides a comprehensive investigation on existing work for abstractive dialogue summarization from scenarios, approaches to evaluations. It categorizes the task into two broad categories according to the type of input dialogues, i.e., open-domain and task-oriented, and presents a taxonomy of existing techniques in three directions, namely, injecting dialogue features, designing auxiliary training tasks and using additional data. A list of datasets under different scenarios and widely-accepted evaluation metrics are summarized for completeness. After that, the trends of scenarios and techniques are summarized, together with deep insights on correlations between extensively exploited features and different scenarios. Based on these analyses, we recommend future directions including more controlled and complicated scenarios, technical innovations and comparisons, publicly available datasets in special domains, etc.
Leveraging Non-dialogue Summaries for Dialogue Summarization Seongmin Park, Dongchan Shin, Jihwa Lee Transcript Understanding Workshop at COLING 2022 [pdf]

[Abs]
To mitigate the lack of diverse dialogue summarization datasets in academia, we present methods to utilize non-dialogue summarization data for enhancing dialogue summarization systems. We apply transformations to document summarization data pairs to create training data that better befit dialogue summarization. The suggested transformations also retain desirable properties of non-dialogue datasets, such as improved faithfulness to the source text. We conduct extensive experiments across both English and Korean to verify our approach. Although absolute gains in ROUGE naturally plateau as more dialogue summarization samples are introduced, utilizing non-dialogue data for training significantly improves summarization performance in zero- and few-shot settings and enhances faithfulness across all training regimes.
Improving Abstractive Dialogue Summarization with Speaker-Aware Supervised Contrastive Learning Zhichao Geng, Ming Zhong, Zhangyue Yin, Xipeng Qiu, Xuanjing Huang COLING 2022 [pdf]

[Abs]
Pre-trained models have brought remarkable success on the text summarization task. For dialogue summarization, the subdomain of text summarization, utterances are concatenated to flat text before being processed. As a result, existing summarization systems based on pre-trained models are unable to recognize the unique format of the speaker-utterance pair well in the dialogue. To investigate this issue, we conduct probing tests and manual analysis, and find that the powerful pre-trained model can not identify different speakers well in the conversation, which leads to various factual errors. Moreover, we propose three speaker-aware supervised contrastive learning (SCL) tasks: Token-level SCL, Turn-level SCL, and Global-level SCL. Comprehensive experiments demonstrate that our methods achieve significant performance improvement on two mainstream dialogue summarization datasets. According to detailed human evaluations, pre-trained models equipped with SCL tasks effectively generate summaries with better factual consistency.
View Dialogue in 2D: A Two-stream Model in Time-speaker Perspective for Dialogue Summarization and beyond Keli Xie, Dongchen He, Jiaxin Zhuang, Siyuan Lu, Zhongfeng Wang COLING 2022 [pdf] [code]

[Abs]
Existing works on dialogue summarization often follow the common practice in document summarization and view the dialogue, which comprises utterances of different speakers, as a single utterance stream ordered by time. However, this single-stream approach without specific attention to the speaker-centered points has limitations in fully understanding the dialogue. To better capture the dialogue information, we propose a 2D view of dialogue based on a time-speaker perspective, where the time and speaker streams of dialogue can be obtained as strengthened input. Based on this 2D view, we present an effective two-stream model called ATM to combine the two streams. Extensive experiments on various summarization datasets demonstrate that ATM significantly surpasses other models regarding diverse metrics and beats the state-of-the-art models on the QMSum dataset in ROUGE scores. Besides, ATM achieves great improvements in summary faithfulness and human evaluation. Moreover, results on machine reading comprehension datasets show the generalization ability of the proposed methods and shed light on other dialogue-based tasks. Our code will be publicly available online.
Summarizing Dialogues with Negative Cues Junpeng Liu, Yanyan Zou, Yuxuan Xi, Shengjie Li, Mian Ma, Zhuoye Ding COLING 2022 [pdf]

[Abs]
Abstractive dialogue summarization aims to convert a long dialogue content into its short form where the salient information is preserved while the redundant pieces are ignored. Different from the well-structured text, such as news and scientific articles, dialogues often consist of utterances coming from two or more interlocutors, where the conversations are often informal, verbose, and repetitive, sprinkled with false-starts, backchanneling, reconfirmations, hesitations, speaker interruptions and the salient information is often scattered across the whole chat. The above properties of conversations make it difficult to directly concentrate on scattered outstanding utterances and thus present new challenges of summarizing dialogues. In this work, rather than directly forcing a summarization system to merely pay more attention to the salient pieces, we propose to explicitly have the model perceive the redundant parts of an input dialogue history during the training phase. To be specific, we design two strategies to construct examples without salient pieces as negative cues. Then, the sequence-to-sequence likelihood loss is cooperated with the unlikelihood objective to drive the model to focus less on the unimportant information and also pay more attention to the salient pieces. Extensive experiments on the benchmark dataset demonstrate that our simple method significantly outperforms the baselines with regard to both semantic matching and factual consistent based metrics. The human evaluation also proves the performance gains.
ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization Jiaan Wang, Fandong Meng, Ziyao Lu, Duo Zheng, Zhixu Li, Jianfeng Qu, Jie Zhou EMNLP 2022 [pdf] [code]

[Abs]
We present ClidSum, a benchmark dataset for building cross-lingual summarization systems on dialogue documents. It consists of 67k+ dialogue documents from two subsets (i.e., SAMSum and MediaSum) and 112k+ annotated summaries in different target languages. Based on the proposed ClidSum, we introduce two benchmark settings for supervised and semi-supervised scenarios, respectively. We then build various baseline systems in different paradigms (pipeline and end-to-end) and conduct extensive experiments on ClidSum to provide deeper analyses. Furthermore, we propose mDialBART which extends mBART-50 (a multi-lingual BART) via further pre-training. The multiple objectives used in the further pre-training stage help the pre-trained model capture the structural characteristics as well as important content in dialogues and the transformation from source to the target language. Experimental results show the superiority of mDialBART, as an end-to-end model, outperforms strong pipeline models on ClidSum. Finally, we discuss specific challenges that current approaches faced with this task and give multiple promising directions for future research.
A Focused Study on Sequence Length for Dialogue Summarization Bin Wang, Chen Zhang, Chengwei Wei, Haizhou Li [pdf]

[Abs]
Output length is critical to dialogue summarization systems. The dialogue summary length is determined by multiple factors, including dialogue complexity, summary objective, and personal preferences. In this work, we approach dialogue summary length from three perspectives. First, we analyze the length differences between existing models' outputs and the corresponding human references and find that summarization models tend to produce more verbose summaries due to their pretraining objectives. Second, we identify salient features for summary length prediction by comparing different model settings. Third, we experiment with a length-aware summarizer and show notable improvement on existing models if summary length can be well incorporated. Analysis and experiments are conducted on popular DialogSum and SAMSum datasets to validate our findings.
DialogSum Challenge: Results of the Dialogue Summarization Shared Task Yulong Chen, Naihao Deng, Yang Liu, Yue Zhang [pdf]

[Abs]
We report the results of DialogSum Challenge, the shared task on summarizing real-life scenario dialogues at INLG 2022. Four teams participate in this shared task and three submit their system reports, exploring different methods to improve the performance of dialogue summarization. Although there is a great improvement over the baseline models regarding automatic evaluation metrics, such as Rouge scores, we find that there is a salient gap between model generated outputs and human annotated summaries by human evaluation from multiple aspects. These findings demonstrate the difficulty of dialogue summarization and suggest that more fine-grained evaluation metrics are in need.
Effectiveness of French Language Models on Abstractive Dialogue Summarization Task Yongxin Zhou, François Portet, Fabien Ringeval LREC 2022 [pdf]

[Abs]
Pre-trained language models have established the state-of-the-art on various natural language processing tasks, including dialogue summarization, which allows the reader to quickly access key information from long conversations in meetings, interviews or phone calls. However, such dialogues are still difficult to handle with current models because the spontaneity of the language involves expressions that are rarely present in the corpora used for pre-training the language models. Moreover, the vast majority of the work accomplished in this field has been focused on English. In this work, we present a study on the summarization of spontaneous oral dialogues in French using several language specific pre-trained models: BARThez, and BelGPT-2, as well as multilingual pre-trained models: mBART, mBARThez, and mT5. Experiments were performed on the DECODA (Call Center) dialogue corpus whose task is to generate abstractive synopses from call center conversations between a caller and one or several agents depending on the situation. Results show that the BARThez models offer the best performance far above the previous state-of-the-art on DECODA. We further discuss the limits of such pre-trained models and the challenges that must be addressed for summarizing spontaneous dialogues.
Data Augmentation for Low-Resource Dialogue Summarization Yongtai Liu, Joshua Maynez, Gonçalo Simões, Shashi Narayan Findings of NAACL 2022 [pdf]

[Abs]
We present DADS, a novel Data Augmentation technique for low-resource Dialogue Summarization. Our method generates synthetic examples by replacing sections of text from both the input dialogue and summary while preserving the augmented summary to correspond to a viable summary for the augmented dialogue. We utilize pretrained language models that produce highly likely dialogue alternatives while still being free to generate diverse alternatives. We applied our data augmentation method to the SAMSum dataset in low resource scenarios, mimicking real world problems such as chat, thread, and meeting summarization where large scale supervised datasets with human-written summaries are scarce. Through both automatic and human evaluations, we show that DADS shows strong improvements for low resource scenarios while generating topically diverse summaries without introducing additional hallucinations to the summaries.
An End-to-End Dialogue Summarization System for Sales Calls Abedelkadir Asi, Song Wang, Roy Eisenstadt, Dean Geckt, Yarin Kuper, Yi Mao, Royi Ronen NAACL 2022 Industry Track [pdf]

[Abs]
Summarizing sales calls is a routine task performed manually by salespeople. We present a production system which combines generative models fine-tuned for customer-agent setting, with a human-in-the-loop user experience for an interactive summary curation process. We address challenging aspects of dialogue summarization task in a real-world setting including long input dialogues, content validation, lack of labeled data and quality evaluation. We show how GPT-3 can be leveraged as an offline data labeler to handle training data scarcity and accommodate privacy constraints in an industrial setting. Experiments show significant improvements by our models in tackling the summarization and content validation tasks on public datasets.
Few-shot fine-tuning SOTA summarization models for medical dialogues David Fraile Navarro, Mark Dras, Shlomo Berkovsky NAACL 2022 Student Research Workshop [pdf] [code]

[Abs]
Abstractive summarization of medical dialogues presents a challenge for standard training approaches, given the paucity of suitable datasets. We explore the performance of state-of-the-art models with zero-shot and few-shot learning strategies and measure the impact of pretraining with general domain and dialogue-specific text on the summarization performance.
DialSummEval: Revisiting Summarization Evaluation for Dialogues Mingqi Gao, Xiaojun Wan NAACL 2022 [pdf] [code]

[Abs]
Dialogue summarization is receiving increasing attention from researchers due to its extraordinary difficulty and unique application value. We observe that current dialogue summarization models have flaws that may not be well exposed by frequently used metrics such as ROUGE. In our paper, we re-evaluate 18 categories of metrics in terms of four dimensions: coherence, consistency, fluency and relevance, as well as a unified human evaluation of various models for the first time. Some noteworthy trends which are different from the conventional summarization tasks are identified. We will release DialSummEval, a multi-faceted dataset of human judgments containing the outputs of 14 models on SAMSum.
Domain-Oriented Prefix-Tuning: Towards Efficient and Generalizable Fine-tuning for Zero-Shot Dialogue Summarization Lulu Zhao, Fujia Zheng, Weihao Zeng, Keqing He, Weiran Xu, Huixing Jiang, Wei Wu, Yanan Wu NAACL 2022 [pdf] [code]

[Abs]
The most advanced abstractive dialogue summarizers lack generalization ability on new domains and the existing researches for domain adaptation in summarization generally rely on large-scale pre-trainings. To explore the lightweight fine-tuning methods for domain adaptation of dialogue summarization, in this paper, we propose an efficient and generalizable Domain-Oriented Prefix-tuning model, which utilizes a domain word initialized prefix module to alleviate domain entanglement and adopts discrete prompts to guide the model to focus on key contents of dialogues and enhance model generalization. We conduct zero-shot experiments and build domain adaptation benchmarks on two multi-domain dialogue summarization datasets, TODSum and QMSum. Adequate experiments and qualitative analysis prove the effectiveness of our methods.
From spoken dialogue to formal summary: An utterance rewriting for dialogue summarization Yue Fang, Hainan Zhang, Hongshen Chen, Zhuoye Ding, Bo Long, Yanyan Lan, Yanquan Zhou NAACL 2022 [pdf]

[Abs]
Due to the dialogue characteristics of unstructured contexts and multi-parties with first-person perspective, many successful text summarization works have failed when dealing with dialogue summarization. In dialogue summarization task, the input dialogue is usually spoken style with ellipsis and co-references but the output summaries are more formal and complete. Therefore, the dialogue summarization model should be able to complete the ellipsis content and co-reference information and then produce a suitable summary accordingly. However, the current state-of-the-art models pay more attention on the topic or structure of summary, rather than the consistency of dialogue summary with its input dialogue context, which may suffer from the personal and logical inconsistency problem. In this paper, we propose a new model, named ReWriteSum, to tackle this problem. Firstly, an utterance rewriter is conducted to complete the ellipsis content of dialogue content and then obtain the rewriting utterances. Then, the co-reference data augmentation mechanism is utilized to replace the referential person name with its specific name to enhance the personal information. Finally, the rewriting utterances and the co-reference replacement data are used in the standard BART model. Experimental results on both SAMSum and DialSum datasets show that our ReWriteSum significantly outperforms baseline models, in terms of both metric-based and human evaluations. Further analysis on multi-speakers also shows that ReWriteSum can obtain relatively higher improvement with more speakers, validating the correctness and property of ReWriteSum.
Unsupervised Abstractive Dialogue Summarization with Word Graphs and POV Conversion Seongmin Park, Jihwa Lee WIT Workshop @ ACL 2022 [pdf] [code]
MSAMSum: Towards Benchmarking Multi-lingual Dialogue Summarization Xiachong Feng, Xiaocheng Feng, Bing Qin ACL 2022 DialDoc Workshop [pdf] [data]
The Cross-lingual Conversation Summarization Challenge Yulong Chen, Ming Zhong, Xuefeng Bai, Naihao Deng, Jing Li, Xianchao Zhu, Yue Zhang [pdf]
Post-Training Dialogue Summarization using Pseudo-Paraphrasing Qi Jia, Yizhu Liu, Haifeng Tang, Kenny Q. Zhu Findings of NAACL 2022 [pdf] [code]

[Abs]
Previous dialogue summarization techniques adapt large language models pretrained on the narrative text by injecting dialogue-specific features into the models. These features either require additional knowledge to recognize or make the resulting models harder to tune. To bridge the format gap between dialogues and narrative summaries in dialogue summarization tasks, we propose to post-train pretrained language models (PLMs) to rephrase from dialogue to narratives. After that, the model is fine-tuned for dialogue summarization as usual. Comprehensive experiments show that our approach significantly improves vanilla PLMs on dialogue summarization and outperforms other SOTA models by the summary quality and implementation costs.
CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning Xiangru Tang, Arjun Nair, Borui Wang, Bingyao Wang, Jai Desai, Aaron Wade, Haoran Li, Asli Celikyilmaz, Yashar Mehdad, Dragomir Radev [pdf]
Are We Summarizing the Right Way? A Survey of Dialogue Summarization Data Sets Don Tuggener, Margot Mieskes, Jan Deriu, Mark Cieliebak EMNLP 2021 | newsum [pdf]
Dialogue Inspectional Summarization with Factual Inconsistency Awareness Leilei Gan, Yating Zhang, Kun Kuang, Lin Yuan, Shuo Li, Changlong Sun, Xiaozhong Liu, Fei Wu [pdf]
Do Boat and Ocean Suggest Beach? Dialogue Summarization with External Knowledge Tianqing Fang, Haojie Pan, Hongming Zhang, Yangqiu Song, Kun Xu, Dong Yu AKBC 2021 [pdf] [code]
Prompt scoring system for dialogue summarization using GPT3 George Prodan, Elena Pelican [pdf]
Simple Conversational Data Augmentation for Semi-supervised Abstractive Dialogue Summarization Jiaao Chen, Diyi Yang EMNLP 2021 [pdf] [code]
A Bag of Tricks for Dialogue Summarization Muhammad Khalifa, Miguel Ballesteros, Kathleen McKeown EMNLP 2021 Short [pdf]
Hierarchical Summarization for Longform Spoken Dialog Daniel Li, Thomas Chen, Albert Tung, Lydia Chilton UIST 2021 [pdf]
RepSum: Unsupervised Dialogue Summarization based on Replacement Strategy Xiyan Fu, Yating Zhang, Tianyi Wang, Xiaozhong Liu, Changlong Sun, Zhenglu Yang ACL 2021 [pdf] [code]
Language Model as an Annotator: Exploring DialoGPT for Dialogue Summarization Xiachong Feng, Xiaocheng Feng, Libo Qin, Bing Qin, Ting Liu ACL 2021 [pdf] [code]
A Two-Phase Approach for Abstractive Podcast Summarization Chujie Zheng, Kunpeng Zhang, Harry Jiannan Wang, Ling Fan TREC 2020 Podcasts Track [pdf]
Hierarchical Learning for Generation with Long Source Sequences Tobias Rohde, Xiaoxia Wu, Yinhan Liu [pdf] [code]
Improving Online Forums Summarization via Unifying Hierarchical Attention Networks with Convolutional Neural Networks Sansiri Tarnpradab, Fereshteh Jafariakinabad, Kien A. Hua [pdf] [code]
Extractive Summarization of Call Transcripts Pratik K. Biswas, Aleksandr Iakubovich [pdf]
Legal Summarization for Multi-role Debate Dialogue via Controversy Focus Mining and Multi-task Learning Xinyu Duan, Yating Zhang, Lin Yuan, Xin Zhou, Xiaozhong Liu, Tianyi Wang, Ruocheng Wang, Qiong Zhang, Changlong Sun, Fei Wu CIKM 2019 [pdf]
Collabot: Personalized Group Chat Summarization Naama Tepper, Anat Hashavit, Maya Barnea, Inbal Ronen, Lior Leiba WSDM 2018 [pdf]
Summarizing Dialogic Arguments from Social Media Amita Misra, Shereen Oraby, Shubhangi Tandon, Sharath TS, Pranav Anand, Marilyn Walker SemDial 2017 [pdf]
The SENSEI Annotated Corpus: Human Summaries of Reader Comment Conversations in On-line News Emma Barker, Monica Lestari Paramita, Ahmet Aker, Emina Kurtic, Mark Hepple, Robert Gaizauskas SIGDIAL 2016 [pdf]
Semantic Similarity Applied to Spoken Dialogue Summarization Iryna Gurevych, Michael Strube COLING 2004 [pdf] [bib] Switchboard dialogues

Long Document

SmartBook: AI-Assisted Situation Report Generation Revanth Gangi Reddy, Yi R. Fung, Qi Zeng, Manling Li, Ziqi Wang, Paul Sullivan, Heng Ji [pdf] [code]
LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, Kyle Lo EACL 2023 [pdf] [code]

[Abs]
While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of these papers do not perform any human evaluation on model-generated summaries, while other works face new difficulties that manifest when dealing with long documents (e.g., low inter-annotator agreement). Motivated by our survey, we present LongEval, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: (1) How can we achieve high inter-annotator agreement on faithfulness scores? (2) How can we minimize annotator workload while maintaining accurate faithfulness scores? and (3) Do humans benefit from automated alignment between summary and source snippets? We deploy LongEval in annotation studies on two long-form summarization datasets in different domains (SQuALITY and PubMed), and we find that switching to a finer granularity of judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a partial annotation of fine-grained units highly correlates with scores from a full annotation workload (0.89 Kendall's tau using 50% judgments). We release our human judgments, annotation templates, and our software as a Python library for future research.
LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization Laura Nguyen, Thomas Scialom, Benjamin Piwowarski, Jacopo Staiano EACL 2023 [pdf] [code]

[Abs]
Text Summarization is a popular task and an active area of research for the Natural Language Processing community. By definition, it requires to account for long input texts, a characteristic which poses computational challenges for neural models. Moreover, real-world documents come in a variety of complex, visually-rich, layouts. This information is of great relevance, whether to highlight salient content or to encode long-range interactions between textual passages. Yet, all publicly available summarization datasets only provide plain text content. To facilitate research on how to exploit visual/layout information to better capture long-range dependencies in summarization models, we present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information. We extend existing and popular English datasets (arXiv and PubMed) with layout information and propose four novel datasets -- consistently built from scholar resources -- covering French, Spanish, Portuguese, and Korean languages. Further, we propose new baselines merging layout-aware and long-range models -- two orthogonal approaches -- and obtain state-of-the-art results, showing the importance of combining both lines of research.
GoSum: Extractive Summarization of Long Documents by Reinforcement Learning and Graph Organized discourse state Junyi Bian, Xiaodi Huang, Hong Zhou, Shanfeng Zhu [pdf]

[Abs]
Handling long texts with structural information and excluding redundancy between summary sentences are essential in extractive document summarization. In this work, we propose GoSum, a novel reinforcement-learning-based extractive model for long-paper summarization. GoSum encodes states by building a heterogeneous graph from different discourse levels for each input document. We evaluate the model on two datasets of scientific articles summarization: PubMed and arXiv where it outperforms all extractive summarization models and most of the strong abstractive baselines.
Novel Chapter Abstractive Summarization using Spinal Tree Aware Sub-Sentential Content Selection Hardy Hardy, Miguel Ballesteros, Faisal Ladhak, Muhammad Khalifa, Vittorio Castelli, Kathleen McKeown [pdf]

[Abs]
Summarizing novel chapters is a difficult task due to the input length and the fact that sentences that appear in the desired summaries draw content from multiple places throughout the chapter. We present a pipelined extractive-abstractive approach where the extractive step filters the content that is passed to the abstractive component. Extremely lengthy input also results in a highly skewed dataset towards negative instances for extractive summarization; we thus adopt a margin ranking loss for extraction to encourage separation between positive and negative examples. Our extraction component operates at the constituent level; our approach to this problem enriches the text with spinal tree information which provides syntactic context (in the form of constituents) to the extraction model. We show an improvement of 3.71 Rouge-1 points over best results reported in prior work on an existing novel chapter dataset.
How Far are We from Robust Long Abstractive Summarization? Huan Yee Koh, Jiaxin Ju, He Zhang, Ming Liu, Shirui Pan EMNLP 2022 [pdf] [code]

[Abs]
Abstractive summarization has made tremendous progress in recent years. In this work, we perform fine-grained human annotations to evaluate long document abstractive summarization systems (i.e., models and metrics) with the aim of implementing them to generate reliable summaries. For long document abstractive models, we show that the constant strive for state-of-the-art ROUGE results can lead us to generate more relevant summaries but not factual ones. For long document evaluation metrics, human evaluation results show that ROUGE remains the best at evaluating the relevancy of a summary. It also reveals important limitations of factuality metrics in detecting different types of factual errors and the reasons behind the effectiveness of BARTScore. We then suggest promising directions in the endeavor of developing factual consistency metrics. Finally, we release our annotated long document dataset with the hope that it can contribute to the development of metrics across a broader range of summarization settings.
Toward Unifying Text Segmentation and Long Document Summarization Sangwoo Cho, Kaiqiang Song, Xiaoyang Wang, Fei Liu, Dong Yu EMNLP 2022 [pdf] [code]

[Abs]
Text segmentation is important for signaling a document's structure. Without segmenting a long document into topically coherent sections, it is difficult for readers to comprehend the text, let alone find important information. The problem is only exacerbated by a lack of segmentation in transcripts of audio/video recordings. In this paper, we explore the role that section segmentation plays in extractive summarization of written and spoken documents. Our approach learns robust sentence representations by performing summarization and segmentation simultaneously, which is further enhanced by an optimization-based regularizer to promote selection of diverse summary sentences. We conduct experiments on multiple datasets ranging from scientific articles to spoken transcripts to evaluate the model's performance. Our findings suggest that the model can not only achieve state-of-the-art performance on publicly available benchmarks, but demonstrate better cross-genre transferability when equipped with text segmentation. We perform a series of analyses to quantify the impact of section segmentation on summarizing written and spoken documents of substantial length and complexity.
HeterGraphLongSum: Heterogeneous Graph Neural Network with Passage Aggregation for Extractive Long Document Summarization Tuan-Anh Phan, Ngoc-Dung Ngoc Nguyen, Khac-Hoai Nam Bui COLING 2022 [pdf] [code]

[Abs]
Graph Neural Network (GNN)-based models have proven effective in various Natural Language Processing (NLP) tasks in recent years. Specifically, in the case of the Extractive Document Summarization (EDS) task, modeling documents under graph structure is able to analyze the complex relations between semantic units (e.g., word-to-word, word-to-sentence, sentence-to-sentence) and enrich sentence representations via valuable information from their neighbors. However, long-form document summarization using graph-based methods is still an open research issue. The main challenge is to represent long documents in a graph structure in an effective way. In this regard, this paper proposes a new heterogeneous graph neural network (HeterGNN) model to improve the performance of long document summarization (HeterGraphLongSum). Specifically, the main idea is to add the passage nodes into the heterogeneous graph structure of word and sentence nodes for enriching the final representation of sentences. In this regard, HeterGraphLongSum is designed with three types of semantic units such as word, sentence, and passage. Experiments on two benchmark datasets for long documents such as Pubmed and Arxiv indicate promising results of the proposed model for the extractive long document summarization problem. Especially, HeterGraphLongSum is able to achieve state-of-the-art performance without relying on any pre-trained language models (e.g., BERT). The source code is available for further exploitation on the Github.
Multi Graph Neural Network for Extractive Long Document Summarization Xuan-Dung Doan, Le-Minh Nguyen, Khac-Hoai Nam Bui COLING 2022 [pdf] [code]

[Abs]
Heterogeneous Graph Neural Networks (HeterGNN) have been recently introduced as an emergent approach for extracting document summarization (EDS) by exploiting the cross-relations between words and sentences. However, applying HeterGNN for long documents is still an open research issue. One of the main majors is the lacking of inter-sentence connections. In this regard, this paper exploits how to apply HeterGNN for long documents by building a graph on sentence-level nodes (homogeneous graph) and combine with HeterGNN for capturing the semantic information in terms of both inter and intra-sentence connections. Experiments on two benchmark datasets of long documents such as PubMed and ArXiv show that our method is able to achieve state-of-the-art results in this research field.
HEGEL: Hypergraph Transformer for Long Document Summarization Haopeng Zhang, Xiao Liu, Jiawei Zhang EMNLP 2022 [pdf]

[Abs]
Extractive summarization for long documents is challenging due to the extended structured input context. The long-distance sentence dependency hinders cross-sentence relations modeling, the critical step of extractive summarization. This paper proposes HEGEL, a hypergraph neural network for long document summarization by capturing high-order cross-sentence relations. HEGEL updates and learns effective sentence representations with hypergraph transformer layers and fuses different types of sentence dependencies, including latent topics, keywords coreference, and section structure. We validate HEGEL by conducting extensive experiments on two benchmark datasets, and experimental results demonstrate the effectiveness and efficiency of HEGEL.
GRETEL: Graph Contrastive Topic Enhanced Language Model for Long Document Extractive Summarization Qianqian Xie, Jimin Huang, Tulika Saha, Sophia Ananiadou COLING 2022 [pdf] [code]

[Abs]
Recently, neural topic models (NTMs) have been incorporated into pre-trained language models (PLMs), to capture the global semantic information for text summarization. However, in these methods, there remain limitations in the way they capture and integrate the global semantic information. In this paper, we propose a novel model, the graph contrastive topic enhanced language model (GRETEL), that incorporates the graph contrastive topic model with the pre-trained language model, to fully leverage both the global and local contextual semantics for long document extractive summarization. To better capture and incorporate the global semantic information into PLMs, the graph contrastive topic model integrates the hierarchical transformer encoder and the graph contrastive learning to fuse the semantic information from the global document context and the gold summary. To this end, GRETEL encourages the model to efficiently extract salient sentences that are topically related to the gold summary, rather than redundant sentences that cover sub-optimal topics. Experimental results on both general domain and biomedical datasets demonstrate that our proposed method outperforms SOTA methods.
Sparse Optimization for Unsupervised Extractive Summarization of Long Documents with the Frank-Wolfe Algorithm Alicia Y. Tsai, Laurent El Ghaoui SustaiNLP at EMNLP 2020 [pdf]

[Abs]
We address the problem of unsupervised extractive document summarization, especially for long documents. We model the unsupervised problem as a sparse auto-regression one and approximate the resulting combinatorial problem via a convex, norm-constrained problem. We solve it using a dedicated Frank-Wolfe algorithm. To generate a summary with k sentences, the algorithm only needs to execute ≈k iterations, making it very efficient. We explain how to avoid explicit calculation of the full gradient and how to include sentence embedding information. We evaluate our approach against two other unsupervised methods using both lexical (standard) ROUGE scores, as well as semantic (embedding-based) ones. Our method achieves better results with both datasets and works especially well when combined with embeddings for highly paraphrased summaries.
An Efficient Coarse-to-Fine Facet-Aware Unsupervised Summarization Framework based on Semantic Blocks Xinnian Liang, Jing Li, Shuangzhi Wu, Jiali Zeng, Yufan Jiang, Mu Li, Zhoujun Li COLING 2022 [pdf] [code]

[Abs]
Unsupervised summarization methods have achieved remarkable results by incorporating representations from pre-trained language models. However, existing methods fail to consider efficiency and effectiveness at the same time when the input document is extremely long. To tackle this problem, in this paper, we proposed an efficient Coarse-to-Fine Facet-Aware Ranking (C2F-FAR) framework for unsupervised long document summarization, which is based on the semantic block. The semantic block refers to continuous sentences in the document that describe the same facet. Specifically, we address this problem by converting the one-step ranking method into the hierarchical multi-granularity two-stage ranking. In the coarse-level stage, we propose a new segment algorithm to split the document into facet-aware semantic blocks and then filter insignificant blocks. In the fine-level stage, we select salient sentences in each block and then extract the final summary from selected sentences. We evaluate our framework on four long document summarization datasets: Gov-Report, BillSum, arXiv, and PubMed. Our C2F-FAR can achieve new state-of-the-art unsupervised summarization results on Gov-Report and BillSum. In addition, our method speeds up 4-28 times more than previous methods.
Investigating Efficiently Extending Transformers for Long Input Summarization Jason Phang, Yao Zhao, Peter J. Liu [pdf] [code]

[Abs]
While large pretrained Transformer models have proven highly capable at tackling natural language tasks, handling long sequence inputs continues to be a significant challenge. One such task is long input summarization, where inputs are longer than the maximum input context of most pretrained models. Through an extensive set of experiments, we investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance of performance and efficiency, and that an additional pretraining phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long input summarization tasks comparable with much larger models while adding few additional parameters and not requiring model parallelism to train.
An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics Huan Yee Koh, Jiaxin Ju, Ming Liu, Shirui Pan ACM Computing Surveys [pdf]

[Abs]
Long documents such as academic articles and business reports have been the standard format to detail out important issues and complicated subjects that require extra attention. An automatic summarization system that can effectively condense long documents into short and concise texts to encapsulate the most important information would thus be significant in aiding the reader's comprehension. Recently, with the advent of neural architectures, significant research efforts have been made to advance automatic text summarization systems, and numerous studies on the challenges of extending these systems to the long document domain have emerged. In this survey, we provide a comprehensive overview of the research on long document summarization and a systematic evaluation across the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature within the context of long document summarization and conduct an empirical analysis to broaden the perspective on current research progress. The empirical analysis includes a study on the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of the summarization evaluation metrics. Based on the overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.
MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes Nianlong Gu, Elliott Ash, Richard Hahnloser ACL 2022 [pdf] [code]

[Abs]
We introduce MemSum (Multi-step Episodic Markov decision process extractive SUMmarizer), a reinforcement-learning-based extractive summarizer enriched at each step with information on the current extraction history. When MemSum iteratively selects sentences into the summary, it considers a broad information set that would intuitively also be used by humans in this task: 1) the text content of the sentence, 2) the global text context of the rest of the document, and 3) the extraction history consisting of the set of sentences that have already been extracted. With a lightweight architecture, MemSum obtains state-of-the-art test-set performance (ROUGE) in summarizing long documents taken from PubMed, arXiv, and GovReport. Ablation studies demonstrate the importance of local, global, and history information. A human evaluation confirms the high quality and low redundancy of the generated summaries, stemming from MemSum’s awareness of extraction history.
Semantic Self-Segmentation for Abstractive Summarization of Long Legal Documents in Low-Resource Regimes Gianluca Moro, Luca Ragazzi AAAI 2022 [pdf]
Factorizing Content and Budget Decisions in Abstractive Summarization of Long Documents by Sampling Summary Views Marcio Fonseca, Yftah Ziser, Shay B. Cohen EMNLP 2022 [pdf]

[Abs]
We argue that disentangling content selection from the budget used to cover salient content improves the performance and applicability of abstractive summarizers. Our method, FactorSum, does this disentanglement by factorizing summarization into two steps through an energy function: (1) generation of abstractive summary views covering salient information in subsets of the input document (document views); (2) combination of these views into a final summary, following a budget and content guidance. This guidance may come from different sources, including from an advisor model such as BART or BigBird, or in oracle mode – from the reference. This factorization achieves significantly higher ROUGE scores on multiple benchmarks for long document summarization, namely PubMed, arXiv, and GovReport. Most notably, our model is effective for domain adaptation. When trained only on PubMed samples, it achieves a 46.29 ROUGE-1 score on arXiv, outperforming PEGASUS trained in domain by a large margin. Our experimental results indicate that the performance gains are due to more flexible budget adaptation and processing of shorter contexts provided by partial document views.
Leveraging Locality in Abstractive Text Summarization Yixin Liu, Ansong Ni, Linyong Nan, Budhaditya Deb, Chenguang Zhu, Ahmed H. Awadallah, Dragomir Radev EMNLP 2022 [pdf]

[Abs]
Neural attention models have achieved significant improvements on many natural language processing tasks. However, the quadratic memory complexity of the self-attention module with respect to the input length hinders their applications in long text summarization. Instead of designing more efficient attention modules, we approach this problem by investigating if models with a restricted context can have competitive performance compared with the memory-efficient attention models that maintain a global context by treating the input as a single sequence. Our model is applied to individual pages, which contain parts of inputs grouped by the principle of locality, during both the encoding and decoding stages. We empirically investigated three kinds of locality in text summarization at different levels of granularity, ranging from sentences to documents. Our experimental results show that our model has a better performance compared with strong baseline models with efficient attention modules, and our analysis provides further insights into our locality-aware modeling strategy.
SNaC: Coherence Error Detection for Narrative Summarization Tanya Goyal, Junyi Jessy Li, Greg Durrett EMNLP 2022 [pdf] [data]

[Abs]
Progress in summarizing long texts is inhibited by the lack of appropriate evaluation frameworks. A long summary that appropriately covers the facets of that text must also present a coherent narrative, but current automatic and human evaluation methods fail to identify gaps in coherence. In this work, we introduce SNaC, a narrative coherence evaluation framework for fine-grained annotations of long summaries. We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie summaries. Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowdworkers. Furthermore, we show that the collected annotations allow us to benchmark past work in coherence modeling and train a strong classifier for automatically localizing coherence errors in generated summaries. Finally, our SNaC framework can support future work in long document summarization and coherence evaluation, including improved summarization modeling and post-hoc summary correction.
Sequence-Based Extractive Summarisation for Scientific Articles Daniel Kershaw, Rob Koeling [pdf]
LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents Debanjan Mahata, Naveen Agarwal, Dibya Gautam, Amardeep Kumar, Swapnil Parekh, Yaman Kumar Singla, Anish Acharya, Rajiv Ratn Shah [pdf] [data1] [data2]
HIBRIDS: Attention with Hierarchical Biases for Structure-aware Long Document Summarization Shuyang Cao, Lu Wang ACL 2022 [pdf] [code] [data]

[Abs]
Document structure is critical for efficient information consumption. However, it is challenging to encode it efficiently into the modern Transformer architecture. In this work, we present HIBRIDS, which injects Hierarchical Biases foR Incorporating Document Structure into attention score calculation. We further present a new task, hierarchical question-summary generation, for summarizing salient content in the source document into a hierarchy of questions and summaries, where each follow-up question inquires about the content of its parent question-summary pair. We also annotate a new dataset with 6,153 question-summary hierarchies labeled on government reports. Experiment results show that our model produces better question-summary hierarchies than comparisons on both hierarchy quality and content coverage, a finding also echoed by human judges. Additionally, our model improves the generation of long-form summaries from long government reports and Wikipedia articles, as measured by ROUGE scores.
HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information Qian Ruan, Malte Ostendorff, Georg Rehm [pdf] [code]
Long Document Summarization with Top-down and Bottom-up Inference Bo Pang, Erik Nijkamp, Wojciech Kryściński, Silvio Savarese, Yingbo Zhou, Caiming Xiong [pdf]
Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents Yusen Zhang, Ansong Ni, Ziming Mao, Chen Henry Wu, Chenguang Zhu, Budhaditya Deb, Ahmed H. Awadallah, Dragomir Radev, Rui Zhang ACL 2022 [pdf]
DYLE: Dynamic Latent Extraction for Abstractive Long-Input Summarization Ziming Mao, Chen Henry Wu, Ansong Ni, Yusen Zhang, Rui Zhang, Tao Yu, Budhaditya Deb, Chenguang Zhu, Ahmed H. Awadallah, Dragomir Radev ACL 2022 [pdf] [code]

[Abs]
Transformer-based models have achieved state-of-the-art performance on short-input summarization. However, they still struggle with summarizing longer text. In this paper, we present DYLE, a novel dynamic latent extraction approach for abstractive long-input summarization. DYLE jointly trains an extractor and a generator and treats the extracted text snippets as the latent variable, allowing dynamic snippet-level attention weights during decoding. To provide adequate supervision, we propose simple yet effective heuristics for oracle extraction as well as a consistency loss term, which encourages the extractor to approximate the averaged dynamic weights predicted by the generator. We evaluate our method on different long-document and long-dialogue summarization tasks: GovReport, QMSum, and arXiv. Experiment results show that DYLE outperforms all existing methods on GovReport and QMSum, with gains up to 6.1 ROUGE, while yielding strong results on arXiv. Further analysis shows that the proposed dynamic weights provide interpretability of our generation process.
SciBERTSUM: Extractive Summarization for Scientific Documents Athar Sefid, C Lee Giles [pdf] [code]
Neural Content Extraction for Poster Generation of Scientific Papers Sheng Xu, Xiaojun Wan [pdf]
LongT5: Efficient Text-To-Text Transformer for Long Sequences Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang [pdf]
The Influence of Data Pre-processing and Post-processing on Long Document Summarization Xinwei Du, Kailun Dong, Yuchen Zhang, Yongsheng Li, Ruei-Yu Tsay [pdf]
End-to-End Segmentation-based News Summarization Yang Liu, Chenguang Zhu, Michael Zeng [pdf]
Leveraging Information Bottleneck for Scientific Document Summarization Jiaxin Ju, Ming Liu, Huan Yee Koh, Yuan Jin, Lan Du, Shirui Pan EMNLP 2021 Findings [pdf]
Generating Summaries for Scientific Paper Review Ana Sabina Uban, Cornelia Caragea [pdf]
Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems Potsawee Manakul, Mark J. F. Gales EMNLP 2021 short paper [pdf] [code]
Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents Rui Meng, Khushboo Thaker, Lei Zhang, Yue Dong, Xingdi Yuan, Tong Wang, Daqing He ACL 2021 short [pdf] [data]
Sliding Selector Network with Dynamic Memory for Extractive Summarization of Long Documents Peng Cui, Le Hu NAACL21 [pdf] [code]
Long-Span Summarization via Local Attention and Content Selection Potsawee Manakul, Mark J. F. Gales ACL 2021 [pdf]
Globalizing BERT-based Transformer Architectures for Long Document Summarization Quentin Grail, Julien Perez, Eric Gaussier EACL 2021 [pdf]
Discourse-Aware Unsupervised Summarization for Long Scientific Documents Yue Dong, Andrei Mircea Romascanu, Jackie Chi Kit Cheung EACL21 [pdf] [code]
Enhancing Scientific Papers Summarization with Citation Graph Chenxin An, Ming Zhong, Yiran Chen, Danqing Wang, Xipeng Qiu, Xuanjing Huang AAAI 2021 [pdf] [code]
Efficient Attentions for Long Document Summarization Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, Lu Wang NAACL 2021 [pdf] [code] [data]
Can We Automate Scientific Reviewing? Weizhe Yuan, Pengfei Liu, and Graham Neubig [pdf] [code]
Long Document Summarization in a Low Resource Setting using Pretrained Language Models Ahsaas Bajaj, Pavitra Dangati, Kalpesh Krishna, Pradhiksha Ashok Kumar, Rheeya Uppaal, Bradford Windsor, Eliot Brenner, Dominic Dotterrer, Rajarshi Das, Andrew McCallum ACL 2021 Student Research Workshop [pdf]
Summaformers @ LaySumm 20, LongSumm 20 Sayar Ghosh Roy, Nikhil Pinnaparaju, Risubh Jain, Manish Gupta, Vasudeva Varma SDP EMNLP 2020 [pdf]
On Generating Extended Summaries of Long Documents Sajad Sotudeh, Arman Cohan, Nazli Goharian SDU21 [pdf] [code]
Self-Supervised Learning for Visual Summary Identification in Scientific Publications Shintaro Yamamoto, Anne Lauscher, Simone Paolo Ponzetto, Goran Glavaš, Shigeo Morishima [pdf]
Systematically Exploring Redundancy Reduction in Summarizing Long Documents Wen Xiao, Giuseppe Carenini AACL20 [pdf] [code]
On Extractive and Abstractive Neural Document Summarization with Transformer Language Models Sandeep Subramanian, Raymond Li, Jonathan Pilault, Christopher Pal EMNLP20 [pdf]
Dimsum @LaySumm 20: BART-based Approach for Scientific Document Summarization Tiezheng Yu, Dan Su, Wenliang Dai, Pascale Fung [pdf] [code]
SciSummPip: An Unsupervised Scientific Paper Summarization Pipeline Jiaxin Ju, Ming Liu, Longxiang Gao, Shirui Pan [pdf] [code]
Enhancing Extractive Text Summarization with Topic-Aware Graph Neural Networks Peng Cui, Le Hu, Yuanchao Liu COLING20 [pdf]
Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles Yao Lu, Yue Dong, Laurent Charlin EMNLP20 Short [pdf] [data]
A Divide-and-Conquer Approach to the Summarization of Long Documents Alexios Gidiotis, Grigorios Tsoumakas IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING [pdf]
TLDR: Extreme Summarization of Scientific Documents Isabel Cachola, Kyle Lo, Arman Cohan, Daniel S. Weld Findings of EMNLP20 [pdf] [data]
Extractive Summarization of Long Documents by Combining Global and Local Context Wen Xiao, Giuseppe Carenini EMNLP19 [pdf] [code]
ScisummNet: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks Michihiro Yasunaga, Jungo Kasai, Rui Zhang, Alexander R. Fabbri, Irene Li, Dan Friedman, Dragomir R. Radev AAAI19 [pdf] [data]
TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks Guy Lev, Michal Shmueli-Scheuer, Jonathan Herzig, Achiya Jerbi, David Konopnicki ACL19 [pdf] [data]
A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, Nazli Goharian NAACL18 [pdf] [data]

Factual Consistency

Toolkit: factsumm

ChatGPT as a Factual Inconsistency Evaluator for Abstractive Text Summarization Zheheng Luo, Qianqian Xie, Sophia Ananiadou [pdf]

A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization Griffin Adams, Jason Zucker, Noémie Elhadad [pdf]

[Abs]
Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients. The faithfulness of summaries is critical to their safe usage in clinical settings. To better understand the limitations of abstractive systems, as well as the suitability of existing evaluation metrics, we benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course. We create a corpus of patient hospital admissions and summaries for a cohort of HIV patients, each with complex medical histories. Annotators are presented with summaries and source notes, and asked to categorize manually highlighted summary elements (clinical entities like conditions and medications as well as actions like "following up") into one of three categories: "Incorrect," "Missing," and "Not in Notes." We meta-evaluate a broad set of proposed faithfulness metrics and, across metrics, explore the importance of domain adaptation (e.g. the impact of in-domain pre-training and metric fine-tuning), the use of source-summary alignments, and the effects of distilling a single metric from an ensemble of pre-existing metrics. Off-the-shelf metrics with no exposure to clinical text correlate well yet overly rely on summary extractiveness. As a practical guide to long-form clinical narrative summarization, we find that most metrics correlate best to human judgments when provided with one summary sentence at a time and a minimal set of relevant source context.
Faithfulness-Aware Decoding Strategies for Abstractive Summarization David Wan, Mengwen Liu, Kathleen McKeown, Markus Dreyer, Mohit Bansal EACL 2023 [pdf] [code]

[Abs]
Despite significant progress in understanding and improving faithfulness in abstractive summarization, the question of how decoding strategies affect faithfulness is less studied. We present a systematic study of the effect of generation techniques such as beam search and nucleus sampling on faithfulness in abstractive summarization. We find a consistent trend where beam search with large beam sizes produces the most faithful summaries while nucleus sampling generates the least faithful ones. We propose two faithfulness-aware generation methods to further improve faithfulness over current generation techniques: (1) ranking candidates generated by beam search using automatic faithfulness metrics and (2) incorporating lookahead heuristics that produce a faithfulness score on the future summary. We show that both generation methods significantly improve faithfulness across two datasets as evaluated by four automatic faithfulness metrics and human evaluation. To reduce computational cost, we demonstrate a simple distillation approach that allows the model to generate faithful summaries with just greedy decoding. Our code is publicly available.
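As an illustration of the first method in the abstract above (ranking beam-search candidates with an automatic faithfulness metric), here is a minimal Python sketch. The function names and the token-overlap scorer are hypothetical stand-ins, not the authors' code; a real setup would presumably plug a learned faithfulness metric in place of `token_support`.

```python
from typing import Callable, List


def token_support(source: str, summary: str) -> float:
    """Toy stand-in metric (an assumption, not a published metric):
    fraction of summary tokens that also occur in the source."""
    source_tokens = set(source.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    hits = sum(tok in source_tokens for tok in summary_tokens)
    return hits / len(summary_tokens)


def rerank_by_faithfulness(
    source: str,
    candidates: List[str],
    metric: Callable[[str, str], float] = token_support,
) -> List[str]:
    """Return the candidates ordered from most to least faithful
    according to `metric(source, candidate)`."""
    return sorted(candidates, key=lambda cand: metric(source, cand), reverse=True)
```

Calling `rerank_by_faithfulness(source, beams)[0]` would then select the candidate the chosen metric considers most faithful. Because the metric is passed in as a callable, the ranking logic stays the same whichever scorer is used.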
DIFFQG: Generating Questions to Summarize Factual Changes Jeremy R. Cole, Palak Jain, Julian Martin Eisenschlos, Michael J.Q. Zhang, Eunsol Choi, Bhuwan Dhingra EACL 2023 [pdf] [code]

[Abs]
Identifying the difference between two versions of the same article is useful to update knowledge bases and to understand how articles evolve. Paired texts occur naturally in diverse situations: reporters write similar news stories and maintainers of authoritative websites must keep their information up to date. We propose representing factual changes between paired documents as question-answer pairs, where the answer to the same question differs between two versions. We find that question-answer pairs can flexibly and concisely capture the updated contents. Provided with paired documents, annotators identify questions that are answered by one passage but answered differently or cannot be answered by the other. We release DIFFQG which consists of 759 QA pairs and 1153 examples of paired passages with no factual change. These questions are intended to be both unambiguous and information-seeking and involve complex edits, pushing beyond the capabilities of current question generation and factual change detection systems. Our dataset summarizes the changes between two versions of the document as questions and answers, studying automatic update summarization in a novel way.
Improving Faithfulness by Augmenting Negative Summaries from Fake Documents Tianshu Wang, Faisal Ladhak, Esin Durmus, He He EMNLP 2022 [pdf] [code]

[Abs]
Current abstractive summarization systems tend to hallucinate content that is unfaithful to the source document, posing a risk of misinformation. To mitigate hallucination, we must teach the model to distinguish hallucinated summaries from faithful ones. However, the commonly used maximum likelihood training does not disentangle factual errors from other model errors. To address this issue, we propose a back-translation-style approach to augment negative samples that mimic factual errors made by the model. Specifically, we train an elaboration model that generates hallucinated documents given the reference summaries, and then generates negative summaries from the fake documents. We incorporate the negative samples into training through a controlled generator, which produces faithful/unfaithful summaries conditioned on the control codes. Additionally, we find that adding textual entailment data through multitasking further boosts the performance. Experiments on three datasets (XSum, Gigaword, and WikiHow) show that our method consistently improves faithfulness without sacrificing informativeness according to both human and automatic evaluation.
Learning with Rejection for Abstractive Text Summarization Meng Cao, Yue Dong, Jingyi He, Jackie Chi Kit Cheung EMNLP 2022 [pdf] [code]

[Abs]
State-of-the-art abstractive summarization systems frequently hallucinate content that is not supported by the source document, mainly due to noise in the training dataset. Existing methods opt to drop the noisy samples or tokens from the training set entirely, reducing the effective training set size and creating an artificial propensity to copy words from the source. In this work, we propose a training objective for abstractive summarization based on rejection learning, in which the model learns whether or not to reject potentially noisy tokens. We further propose a regularized decoding objective that penalizes non-factual candidate summaries during inference by using the rejection probability learned during training. We show that our method considerably improves the factuality of generated summaries in automatic and human evaluations when compared to five baseline models, and that it does so while increasing the abstractiveness of the generated summaries.
X-FACTOR: A Cross-metric Evaluation of Factual Correctness in Abstractive Summarization Subhajit Chaudhury, Sarathkrishna Swaminathan, Chulaka Gunasekara, Maxwell Crouse, Srinivas Ravishankar, Daiki Kimura, Keerthiram Murugesan, Ramón Fernandez Astudillo, Tahira Naseem, Pavan Kapanipathi, Alexander Gray EMNLP 2022 [pdf]

[Abs]
Abstractive summarization models often produce factually inconsistent summaries that are not supported by the original article. Recently, a number of fact-consistent evaluation techniques have been proposed to address this issue; however, a detailed analysis of how these metrics agree with one another has yet to be conducted. In this paper, we present X-FACTOR, a cross-evaluation of three high-performing fact-aware abstractive summarization methods. First, we show that summarization models are often fine-tuned on datasets that contain factually inconsistent summaries and propose a fact-aware filtering mechanism that improves the quality of training data and, consequently, the factuality of these models. Second, we propose a corrector module that can be used to improve the factual consistency of generated summaries. Third, we present a re-ranking technique that samples summary instances from the output distribution of a summarization model and re-ranks the sampled instances based on their factuality. Finally, we provide a detailed cross-metric agreement analysis that shows how tuning a model to output summaries based on a particular factuality metric influences factuality as determined by the other metrics. Our goal in this work is to facilitate research that improves the factuality and faithfulness of abstractive summarization models.
LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, Kyle Lo EACL 2023 [pdf] [code]

[Abs]
While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of these papers do not perform any human evaluation on model-generated summaries, while other works face new difficulties that manifest when dealing with long documents (e.g., low inter-annotator agreement). Motivated by our survey, we present LongEval, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: (1) How can we achieve high inter-annotator agreement on faithfulness scores? (2) How can we minimize annotator workload while maintaining accurate faithfulness scores? and (3) Do humans benefit from automated alignment between summary and source snippets? We deploy LongEval in annotation studies on two long-form summarization datasets in different domains (SQuALITY and PubMed), and we find that switching to a finer granularity of judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a partial annotation of fine-grained units correlate highly with scores from a full annotation workload (0.89 Kendall's tau using 50% judgments). We release our human judgments, annotation templates, and our software as a Python library for future research.
mFACE: Multilingual Summarization with Factual Consistency Evaluation Roee Aharoni, Shashi Narayan, Joshua Maynez, Jonathan Herzig, Elizabeth Clark, Mirella Lapata [pdf]

[Abs]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsistent summaries, reducing their utility for real-world application. Several recent efforts attempt to address this by devising models that automatically detect factual inconsistencies in machine generated summaries. However, they focus exclusively on English, a language with abundant resources. In this work, we leverage factual consistency evaluation models to improve multilingual summarization. We explore two intuitive approaches to mitigate hallucinations based on the signal provided by a multilingual NLI model, namely data filtering and controlled generation. Experimental results in the 45 languages from the XLSum dataset show gains over strong baselines in both automatic and human evaluation.
Improving Faithfulness of Abstractive Summarization by Controlling Confounding Effect of Irrelevant Sentences Asish Ghoshal, Arash Einolghozati, Ankit Arun, Haoran Li, Lili Yu, Yashar Mehdad, Scott Wen-tau Yih, Asli Celikyilmaz [pdf]

[Abs]
Lack of factual correctness is an issue that still plagues state-of-the-art summarization systems despite their impressive progress on generating seemingly fluent summaries. In this paper, we show that factual inconsistency can be caused by irrelevant parts of the input text, which act as confounders. To that end, we leverage information-theoretic measures of causal effects to quantify the amount of confounding and precisely quantify how they affect the summarization performance. Based on insights derived from our theoretical results, we design a simple multi-task model to control such confounding by leveraging human-annotated relevant sentences when available. Crucially, we give a principled characterization of data distributions where such confounding can be large thereby necessitating the use of human annotated relevant sentences to generate factual summaries. Our approach improves faithfulness scores by 20% over strong baselines on AnswerSumm (Fabbri et al., 2021), a conversation summarization dataset where lack of faithfulness is a significant issue due to the subjective nature of the task. Our best method achieves the highest faithfulness score while also achieving state-of-the-art results on standard metrics like ROUGE and METEOR. We corroborate these improvements through human evaluation.
Improved Beam Search for Hallucination Mitigation in Abstractive Summarization Arvind Krishna Sridhar, Erik Visser [pdf]

[Abs]
Advancement in large pretrained language models has significantly improved their performance for conditional language generation tasks including summarization albeit with hallucinations. To reduce hallucinations, conventional methods proposed improving beam search or using a fact checker as a postprocessing step. In this paper, we investigate the use of the Natural Language Inference (NLI) entailment metric to detect and prevent hallucinations in summary generation. We propose an NLI-assisted beam re-ranking mechanism by computing entailment probability scores between the input context and summarization model-generated beams during saliency-enhanced greedy decoding. Moreover, a diversity metric is introduced to compare its effectiveness against vanilla beam search. Our proposed algorithm significantly outperforms vanilla beam decoding on XSum and CNN/DM datasets.
Revisiting text decomposition methods for NLI-based factuality scoring of summaries John Glover, Federico Fancellu, Vasudevan Jagannathan, Matthew R. Gormley, Thomas Schaaf [pdf]

[Abs]
Scoring the factuality of a generated summary involves measuring the degree to which a target text contains factual information using the input document as support. Given the similarities in the problem formulation, previous work has shown that Natural Language Inference models can be effectively repurposed to perform this task. As these models are trained to score entailment at a sentence level, several recent studies have shown that decomposing either the input document or the summary into sentences helps with factuality scoring. But is fine-grained decomposition always a winning strategy? In this paper we systematically compare different granularities of decomposition -- from document to sub-sentence level, and we show that the answer is no. Our results show that incorporating additional context can yield improvement, but that this does not necessarily apply to all datasets. We also show that small changes to previously proposed entailment-based scoring methods can result in better performance, highlighting the need for caution in model and methodology selection for downstream tasks.
HaRiM+: Evaluating Summary Quality with Hallucination Risk Seonil Son, Junsoo Park, Jeong-in Hwang, Junghwa Lee, Hyungjong Noh, Yeonsoo Lee AACL 2022 [pdf]

[Abs]
One of the challenges of developing a summarization model arises from the difficulty in measuring the factual inconsistency of the generated text. In this study, we reinterpret the decoder overconfidence-regularizing objective suggested in (Miao et al., 2021) as a hallucination risk measurement to better estimate the quality of generated summaries. We propose a reference-free metric, HaRiM+, which only requires an off-the-shelf summarization model to compute the hallucination risk based on token likelihoods. Deploying it requires no additional training of models or ad-hoc modules, which usually need alignment to human judgments. For summary-quality estimation, HaRiM+ records state-of-the-art correlation to human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval. We hope that our work, which merits the use of summarization models, facilitates the progress of both automated evaluation and generation of summary.
ED-FAITH: Evaluating Dialogue Summarization on Faithfulness Sicong Huang, Asli Celikyilmaz, Haoran Li [pdf]

[Abs]
Abstractive summarization models typically generate content unfaithful to the input, thus highlighting the significance of evaluating the faithfulness of generated summaries. Most faithfulness metrics are only evaluated on the news domain; can they be transferred to other summarization tasks? In this work, we first present a systematic study of faithfulness metrics for dialogue summarization. We evaluate common faithfulness metrics on dialogue datasets and observe that most metrics correlate poorly with human judgements despite performing well on news datasets. Given these findings, to improve existing metrics’ performance on dialogue summarization, we first fine-tune on an in-domain dataset, then apply unlikelihood training on negative samples, and show that they can successfully improve metric performance on dialogue data. Inspired by the strong zero-shot performance of the T0 language model, we further propose T0-Score – a new metric for faithfulness evaluation, which shows consistent improvement against baseline metrics across multiple domains.
Evaluating the Factual Consistency of Large Language Models Through Summarization Derek Tam, Anisha Mascarenhas, Shiyue Zhang, Sarah Kwan, Mohit Bansal, Colin Raffel [pdf]

[Abs]
While large language models (LLMs) have proven to be effective on a large variety of tasks, they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization. Specifically, our benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factually inconsistent summary for an input news article. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. To generate summaries that are factually inconsistent, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent. A model's factual consistency is then measured according to its accuracy, i.e. the proportion of documents where it assigns a higher score to the factually consistent summary. To validate the usefulness of FIB, we evaluate 23 large language models ranging from 1B to 176B parameters from six different model families including BLOOM and OPT. We find that existing LLMs generally assign a higher score to factually consistent summaries than to factually inconsistent summaries. However, if the factually inconsistent summaries occur verbatim in the document, then LLMs assign a higher score to these factually inconsistent summaries than factually consistent summaries. We validate design choices in our benchmark including the scoring method and source of distractor summaries. Our code and benchmark data are publicly available.
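The accuracy measure described in the abstract above can be sketched in a few lines of Python. This is a simplified illustration, not the benchmark's own code: the function name is hypothetical, and it assumes exactly one consistent and one inconsistent summary score per document, with scores already computed by some model.

```python
from typing import Sequence


def pairwise_consistency_accuracy(
    consistent_scores: Sequence[float],
    inconsistent_scores: Sequence[float],
) -> float:
    """Proportion of documents where the factually consistent summary
    receives a strictly higher score than the inconsistent one.

    Assumption: index i in both sequences refers to the same document.
    """
    if len(consistent_scores) != len(inconsistent_scores):
        raise ValueError("score lists must have the same length")
    if not consistent_scores:
        raise ValueError("at least one document is required")
    wins = sum(c > i for c, i in zip(consistent_scores, inconsistent_scores))
    return wins / len(consistent_scores)
```

For example, with consistent scores `[0.9, 0.2, 0.5]` and inconsistent scores `[0.1, 0.4, 0.3]`, the consistent summary wins on two of three documents, so the accuracy is 2/3. Ties count as failures here, which is a design choice of this sketch.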
Improving Factual Consistency in Summarization with Compression-Based Post-Editing Alexander R. Fabbri, Prafulla Kumar Choubey, Jesse Vig, Chien-Sheng Wu, Caiming Xiong EMNLP 2022 [pdf] [code]

[Abs]
State-of-the-art summarization models still struggle to be factually consistent with the input text. A model-agnostic way to address this problem is post-editing the generated summaries. However, existing approaches typically fail to remove entity errors if a suitable input entity replacement is not available or may insert erroneous content. In our work, we focus on removing extrinsic entity errors, or entities not in the source, to improve consistency while retaining the summary's essential information and form. We propose to use sentence-compression data to train the post-editing model to take a summary with extrinsic entity errors marked with special tokens and output a compressed, well-formed summary with those errors removed. We show that this model improves factual consistency while maintaining ROUGE, improving entity precision by up to 30% on XSum, and that this model can be applied on top of another post-editor, improving entity precision by up to a total of 38%. We perform an extensive comparison of post-editing approaches that demonstrate trade-offs between factual consistency, informativeness, and grammaticality, and we analyze settings where post-editors show the largest improvements.
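To make the notions of extrinsic entity errors and entity precision in the abstract above concrete, here is a minimal Python sketch. It assumes the summary's entities have already been extracted (for instance by an NER tagger, not shown) and uses case-insensitive substring matching against the source, which is a simplification; both function names are hypothetical and this is not the paper's implementation.

```python
from typing import Iterable, List


def extrinsic_entities(summary_entities: Iterable[str], source: str) -> List[str]:
    """Entities mentioned in the summary that never appear in the source text."""
    source_lower = source.lower()
    return [ent for ent in summary_entities if ent.lower() not in source_lower]


def entity_precision(summary_entities: Iterable[str], source: str) -> float:
    """Fraction of summary entities that are supported by the source.

    Assumption: a summary with no entities has nothing extrinsic,
    so it is scored 1.0.
    """
    entities = list(summary_entities)
    if not entities:
        return 1.0
    unsupported = extrinsic_entities(entities, source)
    return 1.0 - len(unsupported) / len(entities)
```

With the source "She moved to Paris in 2019." and summary entities `["Paris", "Berlin"]`, "Berlin" is flagged as extrinsic and the entity precision is 0.5.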
Evaluating and Improving Factuality in Multimodal Abstractive Summarization David Wan, Mohit Bansal EMNLP 2022 [pdf] [code]

[Abs]
Current metrics for evaluating factuality for abstractive document summarization have achieved high correlations with human judgment, but they do not account for the vision modality and thus are not adequate for vision-and-language summarization. We propose CLIPBERTScore, a simple weighted combination of CLIPScore and BERTScore to leverage the robustness and strong factuality detection performance between image-summary and document-summary, respectively. Next, due to the lack of meta-evaluation benchmarks to evaluate the quality of multimodal factuality metrics, we collect human judgments of factuality with respect to documents and images. We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization, outperforms an existing multimodal summarization metric, and performs competitively with strong multimodal factuality metrics specifically fine-tuned for the task. Our thorough analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks. Finally, we demonstrate two practical downstream applications of our CLIPBERTScore metric: for selecting important images to focus on during training, and as a reward for reinforcement learning to improve factuality of multimodal summary generation w.r.t. automatic and human evaluation. Our data and code are publicly available at this https URL.
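The "simple weighted combination" mentioned in this abstract can be sketched as a convex mix of two precomputed scores. The function name, the default weight `alpha`, and the input values below are assumptions for illustration; the abstract does not state the weighting the authors use.

```python
def clipbert_score(clip_s: float, bert_s: float, alpha: float = 0.25) -> float:
    """Convex combination of an image-summary score and a document-summary score.

    clip_s: precomputed CLIPScore between the image and the summary.
    bert_s: precomputed BERTScore between the document and the summary.
    alpha:  hypothetical weight on the image side (not taken from the paper).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * clip_s + (1.0 - alpha) * bert_s


# Hypothetical component scores for one summary.
print(clipbert_score(0.6, 0.8, alpha=0.5))
```

Setting `alpha` to 0 or 1 recovers the document-only or image-only metric, which makes it easy to compare the combined score against its components.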
FRSUM: Towards Faithful Abstractive Summarization via Enhancing Factual Robustness Wenhao Wu, Wei Li, Jiachen Liu, Xinyan Xiao, Ziqiang Cao, Sujian Li, Hua Wu EMNLP 2022 [pdf]

[Abs]
Despite being able to generate fluent and grammatical text, current Seq2Seq summarization models still suffer from the unfaithful generation problem. In this paper, we study the faithfulness of existing systems from a new perspective of factual robustness, which is the ability to correctly generate factual information over adversarial unfaithful information. We first measure a model's factual robustness by its success rate to defend against adversarial attacks when generating factual information. The factual robustness analysis on a wide range of current systems shows its good consistency with human judgments on faithfulness. Inspired by these findings, we propose to improve the faithfulness of a model by enhancing its factual robustness. Specifically, we propose a novel training strategy, namely FRSUM, which teaches the model to defend against both explicit adversarial samples and implicit factual adversarial perturbations. Extensive automatic and human evaluation results show that FRSUM consistently improves the faithfulness of various Seq2Seq models, such as T5 and BART.
Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency Yanzhu Guo, Chloé Clavel, Moussa Kamal Eddine, Michalis Vazirgiannis EMNLP 2022 [pdf] [data]

[Abs]
The topic of summarization evaluation has recently attracted a surge of attention due to the rapid development of abstractive summarization systems. However, the formulation of the task is rather ambiguous: neither the linguistic nor the natural language processing community has succeeded in giving a mutually agreed-upon definition. Due to this lack of a well-defined formulation, a large number of popular abstractive summarization datasets are constructed in a manner that neither guarantees validity nor meets one of the most essential criteria of summarization: factual consistency. In this paper, we address this issue by combining state-of-the-art factual consistency models to identify the problematic instances present in popular summarization datasets. We release SummFC, a filtered summarization dataset with improved factual consistency, and demonstrate that models trained on this dataset achieve improved performance in nearly all quality aspects. We argue that our dataset should become a valid benchmark for developing and evaluating summarization systems.
Mutual Information Alleviates Hallucinations in Abstractive Summarization Liam van der Poel, Ryan Cotterell, Clara Meister EMNLP 2022 [pdf] [code]

[Abs]
Despite significant progress in the quality of language generated from abstractive summarization models, these models still exhibit the tendency to hallucinate, i.e., output content not supported by the source document. A number of works have tried to fix--or at least uncover the source of--the problem with limited success. In this paper, we identify a simple criterion under which models are significantly more likely to assign more probability to hallucinated content during generation: high model uncertainty. This finding offers a potential explanation for hallucinations: models default to favoring text with high marginal probability, i.e., high-frequency occurrences in the training set, when uncertain about a continuation. It also motivates possible routes for real-time intervention during decoding to prevent such hallucinations. We propose a decoding strategy that switches to optimizing for pointwise mutual information of the source and target token--rather than purely the probability of the target token--when the model exhibits uncertainty. Experiments on the XSum dataset show that our method decreases the probability of hallucinated tokens while maintaining the Rouge and BertS scores of top-performing decoding strategies.
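The decoding rule described in this abstract (switch to a mutual-information objective only when the model is uncertain) can be sketched for a single greedy step. This is a toy illustration under stated assumptions: the two distributions are given as plain dictionaries, and the entropy threshold `tau` and weight `lam` are hypothetical parameters, not values from the paper.

```python
import math
from typing import Dict


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def pick_token(cond: Dict[str, float], marg: Dict[str, float],
               tau: float = 1.0, lam: float = 1.0) -> str:
    """Greedy choice of the next token.

    cond: p(token | source, prefix) from the summarizer.
    marg: p(token | prefix) from a source-free language model.
    tau, lam: hypothetical entropy threshold and PMI weight.
    Assumes every candidate has positive probability in both distributions.
    """
    uncertain = entropy(cond) >= tau

    def score(tok: str) -> float:
        s = math.log(cond[tok])
        if uncertain:
            # log p(y | x) - lam * log p(y): a PMI-style score that
            # penalises tokens that are likely regardless of the source.
            s -= lam * math.log(marg[tok])
        return s

    return max(cond, key=score)


marg = {"London": 0.7, "Paris": 0.1, "Rome": 0.2}
uncertain_cond = {"London": 0.4, "Paris": 0.35, "Rome": 0.25}
confident_cond = {"London": 0.9, "Paris": 0.05, "Rome": 0.05}
print(pick_token(uncertain_cond, marg))  # high entropy: PMI-style score is used
print(pick_token(confident_cond, marg))  # low entropy: plain probability is used
```

In the high-entropy case the frequent-but-unsupported token "London" is demoted in favour of "Paris"; in the low-entropy case the ordinary argmax is kept.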
Correcting Diverse Factual Errors in Abstractive Summarization via Post-Editing and Language Model Infilling Vidhisha Balachandran, Hannaneh Hajishirzi, William Cohen, Yulia Tsvetkov EMNLP 2022 [pdf] [code]

[Abs]
Abstractive summarization models often generate inconsistent summaries containing factual errors or hallucinated content. Recent works focus on correcting factual errors in generated summaries via post-editing. Such correction models are trained using adversarial non-factual summaries constructed using heuristic rules for injecting errors. However, generating non-factual summaries using heuristics often does not generalize well to actual model errors. In this work, we propose to generate hard, representative synthetic examples of non-factual summaries through infilling language models. With this data, we train a more robust fact-correction model to post-edit the summaries to improve factual consistency. Through quantitative and qualitative experiments on two popular summarization datasets -- CNN/DM and XSum -- we show that our approach vastly outperforms prior methods in correcting erroneous summaries. Our model -- FactEdit -- improves factuality scores by over ~11 points on CNN/DM and over ~31 points on XSum on average across multiple summarization models, producing more factual summaries while maintaining competitive summarization quality.
Phrase-Level Localization of Inconsistency Errors in Summarization by Weak Supervision Masato Takatsuka, Tetsunori Kobayashi, Yoshihiko Hayashi COLING 2022 [pdf] [code]

[Abs]
Although the fluency of automatically generated abstractive summaries has improved significantly with advanced methods, the inconsistency that remains in summarization is recognized as an issue to be addressed. In this study, we propose a methodology for localizing inconsistency errors in summarization. A synthetic dataset that contains a variety of factual errors likely to be produced by a common summarizer is created by applying sentence fusion, compression, and paraphrasing operations. In creating the dataset, we automatically label erroneous phrases and the dependency relations between them as “inconsistent,” which can contribute to detecting errors more adequately than existing models that rely only on dependency arc-level labels. Subsequently, this synthetic dataset is employed as weak supervision to train a model called SumPhrase, which jointly localizes errors in a summary and their corresponding sentences in the source document. The empirical results demonstrate that our SumPhrase model can detect factual errors in summarization more effectively than existing weakly supervised methods owing to the phrase-level labeling. Moreover, the joint identification of error-corresponding original sentences is proven to be effective in improving error detection accuracy.
Just ClozE! A Fast and Simple Method for Evaluating the Factual Consistency in Abstractive Summarization Yiyang Li, Lei Li, Qing Yang, Marina Litvak, Natalia Vanetik, Dingxin Hu, Yuze Li, Yanquan Zhou, Dongliang Xu, Xuanyu Zhang EMNLP 2022 [pdf]

[Abs]
The issue of factual consistency in abstractive summarization has attracted much attention in recent years, and the evaluation of factual consistency between summary and document has become an important and urgent task. Most of the current evaluation metrics are adopted from question answering (QA). However, the application of QA-based metrics is extremely time-consuming in practice, causing the iteration cycle of abstractive summarization research to be severely prolonged. In this paper, we propose a new method called ClozE to evaluate factual consistency with a cloze model, instantiated based on a masked language model (MLM), with strong interpretability and substantially higher speed. We demonstrate that ClozE can reduce the evaluation time by nearly 96% relative to QA-based metrics while retaining their interpretability and performance through experiments on six human-annotated datasets and a meta-evaluation benchmark GO FIGURE (Gabriel et al., 2020). We also implement experiments to further demonstrate more characteristics of ClozE in terms of performance and speed. In addition, we conduct an experimental analysis of the limitations of ClozE, which suggests future research directions. The code and models for ClozE will be released upon paper acceptance.
Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization Shiyue Zhang, David Wan, Mohit Bansal [pdf] [code]

[Abs]
The problems of unfaithful summaries have been widely discussed under the context of abstractive summarization. Though extractive summarization is less prone to the common unfaithfulness issues of abstractive summaries, does that mean extractive is equal to faithful? Turns out that the answer is no. In this work, we define a typology with five types of broad unfaithfulness problems (including and beyond not-entailment) that can appear in extractive summaries, including incorrect coreference, incomplete coreference, incorrect discourse, incomplete discourse, as well as other misleading information. We ask humans to label these problems out of 1500 English summaries produced by 15 diverse extractive systems. We find that 33% of the summaries have at least one of the five issues. To automatically detect these problems, we find that 5 existing faithfulness evaluation metrics for summarization have poor correlations with human judgment. To remedy this, we propose a new metric, ExtEval, that is designed for detecting unfaithful extractive summaries and is shown to have the best performance. We hope our work can increase the awareness of unfaithfulness problems in extractive summarization and help future work to evaluate and resolve these issues. Our data and code are publicly available at this https URL
Entity-based SpanCopy for Abstractive Summarization to Improve the Factual Consistency Wen Xiao, Giuseppe Carenini [pdf] [code]

[Abs]
Despite the success of recent abstractive summarizers on automatic evaluation metrics, the generated summaries still present factual inconsistencies with the source document. In this paper, we focus on entity-level factual inconsistency, i.e. reducing the mismatched entities between the generated summaries and the source documents. We therefore propose a novel entity-based SpanCopy mechanism, and explore its extension with a Global Relevance component. Experiment results on four summarization datasets show that SpanCopy can effectively improve the entity-level factual consistency with essentially no change in the word-level and entity-level saliency. The code is available at this https URL
Jointly Learning Guidance Induction and Faithful Summary Generation via Conditional Variational Autoencoders Wang Xu, Tiejun Zhao Findings of NAACL 2022 [pdf]

[Abs]
Abstractive summarization can generate high quality results with the development of the neural network. However, generating factually consistent summaries is a challenging task for abstractive summarization. Recent studies extract additional information from the source document with off-the-shelf tools as a clue to guide the summary generation, which has proven effective at improving faithfulness. Unlike these works, we present a novel framework based on conditional variational autoencoders, which induces the guidance information and generates the summary equipped with the guidance synchronously. Experiments on the XSUM and CNNDM datasets show that our approach can generate relevant and fluent summaries that are more faithful than those of the existing state-of-the-art approaches, according to multiple factual consistency metrics.
Masked Summarization to Generate Factually Inconsistent Summaries for Improved Factual Consistency Checking Hwanhee Lee, Kang Min Yoo, Joonsuk Park, Hwaran Lee, Kyomin Jung Findings of NAACL 2022 [pdf] [code]

[Abs]
Despite the recent advances in abstractive summarization systems, it is still difficult to determine whether a generated summary is factually consistent with the source text. To this end, the latest approach is to train a factual consistency classifier on factually consistent and inconsistent summaries. Luckily, the former is readily available as reference summaries in existing summarization datasets. However, generating the latter remains a challenge, as they need to be factually inconsistent, yet closely relevant to the source text to be effective. In this paper, we propose to generate factually inconsistent summaries using source texts and reference summaries with key information masked. Experiments on seven benchmark datasets demonstrate that factual consistency classifiers trained on summaries generated using our method generally outperform existing models and show a competitive correlation with human judgments. We also analyze the characteristics of the summaries generated using our method. We will release the pre-trained model and the code at https://github.com/hwanheelee1993/MFMA.
Improving the Faithfulness of Abstractive Summarization via Entity Coverage Control Haopeng Zhang, Semih Yavuz, Wojciech Kryscinski, Kazuma Hashimoto, Yingbo Zhou Findings of NAACL 2022 [pdf] [code]

[Abs]
Abstractive summarization systems leveraging pre-trained language models have achieved superior results on benchmark datasets. However, such models have been shown to be more prone to hallucinate facts that are unfaithful to the input context. In this paper, we propose a method to remedy entity-level extrinsic hallucinations with Entity Coverage Control (ECC). We first compute entity coverage precision and prepend the corresponding control code for each training example, which implicitly guides the model to recognize faithful content in the training phase. We further extend our method via intermediate fine-tuning on large but noisy data extracted from Wikipedia to unlock zero-shot summarization. We show that the proposed method leads to more faithful and salient abstractive summarization in supervised fine-tuning and zero-shot settings according to our experimental results on three benchmark datasets XSum, Pubmed, and SAMSum of very different domains and styles.
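The two steps named in this abstract (compute entity coverage precision, then prepend a control code) can be sketched as follows. This is a minimal illustration under several assumptions: entities are supplied by an upstream NER step, coverage is checked by case-insensitive substring match, and the bin count and the `<ecc_k>` token format are hypothetical rather than taken from the paper.

```python
from typing import List


def entity_coverage_precision(summary_entities: List[str], source: str) -> float:
    """Share of summary entities that also appear in the source text."""
    if not summary_entities:
        return 1.0  # assumption: a summary with no entities counts as fully covered
    src = source.lower()
    covered = sum(1 for ent in summary_entities if ent.lower() in src)
    return covered / len(summary_entities)


def control_code(precision: float, n_bins: int = 3) -> str:
    """Map a precision in [0, 1] to a hypothetical control token such as '<ecc_1>'."""
    idx = min(int(precision * n_bins), n_bins - 1)
    return f"<ecc_{idx}>"


def add_control_code(source: str, summary_entities: List[str]) -> str:
    """Prepend the control token to the source, as done for each training example."""
    code = control_code(entity_coverage_precision(summary_entities, source))
    return f"{code} {source}"


source = "Alice met Bob in Paris on Monday."
entities = ["Alice", "Paris", "Berlin", "Tuesday"]  # hypothetical NER output
print(entity_coverage_precision(entities, source))
print(add_control_code(source, entities))
```

At inference time, a setup like this would typically prepend the highest-coverage code to request a more faithful summary; that usage is an assumption here, since the abstract only describes the training side.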
FactPEGASUS: Factuality-Aware Pre-training and Fine-tuning for Abstractive Summarization David Wan, Mohit Bansal NAACL 2022 [pdf] [code]

[Abs]
We present FactPEGASUS, an abstractive summarization model that addresses the problem of factuality during pre-training and fine-tuning: (1) We augment the sentence selection strategy of PEGASUS’s (Zhang et al., 2019) pre-training objective to create pseudo-summaries that are both important and factual; (2) We introduce three complementary components for fine-tuning. The corrector removes hallucinations present in the reference summary, the contrastor uses contrastive learning to better differentiate nonfactual summaries from factual ones, and the connector bridges the gap between the pre-training and fine-tuning for better transfer of knowledge. Experiments on three downstream tasks demonstrate that FactPEGASUS substantially improves factuality evaluated by multiple automatic metrics and humans. Our thorough analysis suggests that FactPEGASUS is more factual than using the original pre-training objective in zero-shot and few-shot settings, retains factual behavior more robustly than strong baselines, and does not rely entirely on becoming more extractive to improve factuality.
SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization Philippe Laban, Tobias Schnabel, Paul N. Bennett, Marti A. Hearst TACL 2022 Volume 10 [pdf] [code]

[Abs]
In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection. In this work, we revisit the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level), and inconsistency detection (document level). We provide a highly effective and light-weight method called SummaCConv that enables NLI models to be successfully used for this task by segmenting documents into sentence units and aggregating scores between pairs of sentences. We furthermore introduce a new benchmark called SummaC (Summary Consistency) which consists of six large inconsistency detection datasets. On this dataset, SummaCConv obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5% improvement compared with prior work.
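The idea of segmenting both texts into sentences and aggregating pairwise NLI scores can be sketched as below. This is not the learned SummaCConv aggregation: it uses a simple assumed rule (maximum over document sentences, then mean over summary sentences), a naive regex sentence splitter, and a word-overlap function `toy_entail_prob` that merely stands in for a real NLI model.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def consistency_score(document: str, summary: str,
                      entail_prob: Callable[[str, str], float]) -> float:
    """Max over document sentences, then mean over summary sentences."""
    doc_sents = split_sentences(document)
    sum_sents = split_sentences(summary)
    if not doc_sents or not sum_sents:
        raise ValueError("document and summary must each contain a sentence")
    per_sentence = [max(entail_prob(d, s) for d in doc_sents) for s in sum_sents]
    return sum(per_sentence) / len(per_sentence)


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model: share of hypothesis words found in the premise."""
    premise_words = set(re.findall(r"\w+", premise.lower()))
    hyp_words = re.findall(r"\w+", hypothesis.lower())
    return sum(w in premise_words for w in hyp_words) / len(hyp_words)


document = "The cat sat on the mat. The dog barked loudly."
summary = "The cat sat. The dog slept."
print(consistency_score(document, summary, toy_entail_prob))
```

Scoring at sentence granularity means each summary sentence is matched against its best-supporting document sentence, which is the granularity mismatch the abstract identifies in earlier NLI-based approaches.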
Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization Meng Cao, Yue Dong, Jackie Cheung ACL 2022 [pdf] [code]

[Abs]
State-of-the-art abstractive summarization systems often generate hallucinations; i.e., content that is not directly inferable from the source text. Despite being assumed to be incorrect, we find that much hallucinated content is actually consistent with world knowledge, which we call factual hallucinations. Including these factual hallucinations in a summary can be beneficial because they provide useful background information. In this work, we propose a novel detection approach that separates factual from non-factual hallucinations of entities. Our method is based on an entity’s prior and posterior probabilities according to pre-trained and finetuned masked language models, respectively. Empirical results suggest that our method vastly outperforms two baselines in both accuracy and F1 scores and has a strong correlation with human judgments on factuality classification tasks. Furthermore, we use our method as a reward signal to train a summarization system using an off-line reinforcement learning (RL) algorithm that can significantly improve the factuality of generated summaries while maintaining the level of abstractiveness.
Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors Liyan Tang, Tanya Goyal, Alexander R. Fabbri, Philippe Laban, Jiacheng Xu, Semih Yavuz, Wojciech Kryściński, Justin F. Rousseau, Greg Durrett [pdf] [code]
Falsesum: Generating Document-level NLI Examples for Recognizing Factual Inconsistency in Summarization Prasetya Ajie Utama, Joshua Bambrick, Nafise Sadat Moosavi, Iryna Gurevych NAACL 2022 [pdf] [code]

[Abs]
Neural abstractive summarization models are prone to generate summaries that are factually inconsistent with their source documents. Previous work has introduced the task of recognizing such factual inconsistency as a downstream application of natural language inference (NLI). However, state-of-the-art NLI models perform poorly in this context due to their inability to generalize to the target task. In this work, we show that NLI models can be effective for this task when the training data is augmented with high-quality task-oriented examples. We introduce Falsesum, a data generation pipeline leveraging a controllable text generation model to perturb human-annotated summaries, introducing varying types of factual inconsistencies. Unlike previously introduced document-level NLI datasets, our generated dataset contains examples that are diverse and inconsistent yet plausible. We show that models trained on a Falsesum-augmented NLI dataset improve the state-of-the-art performance across four benchmarks for detecting factual inconsistency in summarization.
Faithful to the Document or to the World? Mitigating Hallucinations via Entity-linked Knowledge in Abstractive Summarization Yue Dong, John Wieting, Pat Verga [pdf]
Learning to Revise References for Faithful Summarization Griffin Adams, Han-Chin Shing, Qing Sun, Christopher Winestock, Kathleen McKeown, Noémie Elhadad [pdf] [code]
Factual Error Correction for Abstractive Summaries Using Entity Retrieval Hwanhee Lee, Cheoneum Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Juae Kim, Kyomin Jung [pdf]
Evaluating Factuality in Text Simplification Ashwin Devaraj, William Sheffield, Byron C. Wallace, Junyi Jessy Li ACL 2022 [pdf] [code]
FactGraph: Evaluating Factuality in Summarization with Semantic Graph Representations Leonardo F. R. Ribeiro, Mengwen Liu, Iryna Gurevych, Markus Dreyer, Mohit Bansal NAACL 2022 [pdf] [code]

[Abs]
Despite recent improvements in abstractive summarization, most current approaches generate summaries that are not factually consistent with the source document, severely restricting their trust and usage in real-world applications. Recent works have shown promising improvements in factuality error identification using text or dependency arc entailments; however, they do not consider the entire semantic graph simultaneously. To this end, we propose FactGraph, a method that decomposes the document and the summary into structured meaning representations (MR), which are more suitable for factuality evaluation. MRs describe core semantic concepts and their relations, aggregating the main content in both document and summary in a canonical form, and reducing data sparsity. FactGraph encodes such graphs using a graph encoder augmented with structure-aware adapters to capture interactions among the concepts based on the graph connectivity, along with text representations using an adapter-based text encoder. Experiments on different benchmarks for evaluating factuality show that FactGraph outperforms previous approaches by up to 15%. Furthermore, FactGraph improves performance on identifying content verifiability errors and better captures subsentence-level factual inconsistencies.
Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search Daniel King, Zejiang Shen, Nishant Subramani, Daniel S. Weld, Iz Beltagy, Doug Downey [pdf] [code]
CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning Xiangru Tang, Arjun Nair, Borui Wang, Bingyao Wang, Jai Desai, Aaron Wade, Haoran Li, Asli Celikyilmaz, Yashar Mehdad, Dragomir Radev NAACL 2022 [pdf]

[Abs]
Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained neural language models, substantial amounts of hallucinated content are found during the human evaluation. In this work, we first devise a typology of factual errors to better understand the types of hallucinations generated by current models and conduct a human evaluation on a popular dialog summarization dataset. We further propose a training strategy that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning, called CONFIT. To tackle top factual errors from our annotation, we introduce additional contrastive loss with carefully designed hard negative samples and self-supervised dialogue-specific loss to capture the key information between speakers. We show that our model significantly reduces all kinds of factual errors on both SAMSum dialogue summarization and AMI meeting summarization. On both datasets, we achieve significant improvements over state-of-the-art baselines using both automatic metrics, ROUGE and BARTScore, and human evaluation.
QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization Alexander R. Fabbri, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong NAACL 2022 [pdf] [code]

[Abs]
Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research, entailment-based and question answering (QA)-based metrics, and different experimental setups often lead to contrasting conclusions as to which paradigm performs the best. In this work, we conduct an extensive comparison of entailment and QA-based metrics, demonstrating that carefully choosing the components of a QA-based metric, especially question generation and answerability classification, is critical to performance. Building on those insights, we propose an optimized metric, which we call QAFactEval, that leads to a 14% average improvement over previous QA-based metrics on the SummaC factual consistency benchmark, and also outperforms the best-performing entailment-based metric. Moreover, we find that QA-based and entailment-based metrics can offer complementary signals and be combined into a single metric for a further performance boost.
CO2Sum: Contrastive Learning for Factual-Consistent Abstractive Summarization Wei Liu, Huanqin Wu, Wenjing Mu, Zhen Li, Tao Chen, Dan Nie [pdf]
Are Factuality Checkers Reliable? Adversarial Meta-evaluation of Factuality in Summarization Yiran Chen, Pengfei Liu, Xipeng Qiu EMNLP 2021 Findings [pdf] [code]
Dialogue Inspectional Summarization with Factual Inconsistency Awareness Leilei Gan, Yating Zhang, Kun Kuang, Lin Yuan, Shuo Li, Changlong Sun, Xiaozhong Liu, Fei Wu [pdf]
Fine-grained Factual Consistency Assessment for Abstractive Summarization Models Sen Zhang, Jianwei Niu, Chuyuan Wei [pdf]
MoFE: Mixture of Factual Experts for Controlling Hallucinations in Abstractive Summarization Prafulla Kumar Choubey, Jesse Vig, Wenhao Liu, Nazneen Fatema Rajani [pdf]
Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries Xiangru Tang, Alexander R. Fabbri, Ziming Mao, Griffin Adams, Borui Wang, Haoran Li, Yashar Mehdad, Dragomir Radev NAACL 2022 [pdf]

[Abs]
Current pre-trained models applied for summarization are prone to factual inconsistencies that misrepresent the source text. Evaluating the factual consistency of summaries is thus necessary to develop better models. However, the human evaluation setup for evaluating factual consistency has not been standardized. To determine the factors that affect the reliability of the human evaluation, we crowdsource evaluations for factual consistency across state-of-the-art models on two news summarization datasets using the rating-based Likert Scale and ranking-based Best-Worst Scaling. Our analysis reveals that the ranking-based Best-Worst Scaling offers a more reliable measure of summary quality across datasets and that the reliability of Likert ratings highly depends on the target dataset and the evaluation design. To improve crowdsourcing reliability, we extend the scale of the Likert rating and present a scoring algorithm for Best-Worst Scaling that we call value learning. Our crowdsourcing guidelines will be publicly available to facilitate future work on factual consistency in summarization.
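For readers unfamiliar with Best-Worst Scaling, the sketch below shows the standard counting score (times chosen best minus times chosen worst, divided by times shown). It is not the "value learning" algorithm the abstract mentions, whose details are not given here; the function name and the annotation tuples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

Judgment = Tuple[Sequence[str], str, str]  # (systems shown, best pick, worst pick)


def bws_scores(judgments: Iterable[Judgment]) -> Dict[str, float]:
    """Counting-based Best-Worst Scaling score per system, in [-1, 1]."""
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, best_pick, worst_pick in judgments:
        for item in items:
            shown[item] += 1
        best[best_pick] += 1
        worst[worst_pick] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical annotations: three annotators each rank summaries from systems A, B, C.
judgments = [
    (("A", "B", "C"), "A", "C"),
    (("A", "B", "C"), "A", "B"),
    (("A", "B", "C"), "B", "C"),
]
print(bws_scores(judgments))
```

Unlike a Likert rating, each annotation here is a relative choice, so the resulting scores only rank the systems against each other.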
MiRANews: Dataset and Benchmarks for Multi-Resource-Assisted News Summarization Xinnuo Xu, Ondřej Dušek, Shashi Narayan, Verena Rieser, Ioannis Konstas EMNLP 2021 Findings [pdf] [data]
Inspecting the Factuality of Hallucinated Entities in Abstractive Summarization Meng Cao, Yue Dong, Jackie Chi Kit Cheung [pdf]
CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization Shuyang Cao, Lu Wang EMNLP 2021 [pdf] [code]
Faithful or Extractive? On Mitigating the Faithfulness-Abstractiveness Trade-off in Abstractive Summarization Faisal Ladhak, Esin Durmus, He He, Claire Cardie, Kathleen McKeown ACL 2022 [pdf] [code]

[Abs]
Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed models that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness of the model outputs as one naive way to improve faithfulness is to make summarization models more extractive. In this work, we present a framework for evaluating the effective faithfulness of summarization systems, by generating a faithfulness-abstractiveness trade-off curve that serves as a control at different operating points on the abstractiveness spectrum. We then show that the Maximum Likelihood Estimation (MLE) baseline as well as recently proposed methods for improving faithfulness, fail to consistently improve over the control at the same level of abstractiveness. Finally, we learn a selector to identify the most faithful and abstractive summary for a given document, and show that this system can attain higher faithfulness scores in human evaluations while being more abstractive than the baseline system on two datasets. Moreover, we show that our system is able to achieve a better faithfulness-abstractiveness trade-off than the control at the same level of abstractiveness.
Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation Yuexiang Xie, Fei Sun, Yang Deng, Yaliang Li, Bolin Ding EMNLP 2021 Findings [pdf] [code]
Improving Factual Consistency of Abstractive Summarization on Customer Feedback Yang Liu, Yifei Sun, Vincent Gao ACL 2021 Proceedings of The 4th Workshop on e-Commerce and NLP [pdf]
AgreeSum: Agreement-Oriented Multi-Document Summarization Richard Yuanzhe Pang, Adam D. Lelkes, Vinh Q. Tran, Cong Yu Findings of ACL 2021 [pdf] [data]
Focus Attention: Promoting Faithfulness and Diversity in Summarization Rahul Aralikatte, Shashi Narayan, Joshua Maynez, Sascha Rothe, Ryan McDonald ACL 2021 [pdf]
Improving Factual Consistency of Abstractive Summarization via Question Answering Feng Nan, Cicero Nogueira dos Santos, Henghui Zhu, Patrick Ng, Kathleen McKeown, Ramesh Nallapati, Dejiao Zhang, Zhiguo Wang, Andrew O. Arnold, Bing Xiang ACL 2021 [pdf] [code]
Discourse Understanding and Factual Consistency in Abstractive Summarization Saadia Gabriel, Antoine Bosselut, Jeff Da, Ari Holtzman, Jan Buys, Kyle Lo, Asli Celikyilmaz, Yejin Choi EACL21 [pdf] [code]
Improving Faithfulness in Abstractive Summarization with Contrast Candidate Generation and Selection Sihao Chen, Fan Zhang, Kazoo Sone and Dan Roth NAACL21 [pdf] [code]
Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics Artidoro Pagnoni, Vidhisha Balachandran and Yulia Tsvetkov NAACL21 [pdf] [code]
Annotating and Modeling Fine-grained Factuality in Summarization Tanya Goyal, Greg Durrett NAACL21 [pdf] [code]
SAFEval: Summarization Asks for Fact-based Evaluation Thomas Scialom, Paul-Alexis Dray, Patrick Gallinari, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang [pdf] [code]
Enhancing Factual Consistency of Abstractive Summarization Chenguang Zhu, William Hinthorn, Ruochen Xu, Qingkai Zeng, Michael Zeng, Xuedong Huang, Meng Jiang NAACL21 [pdf]
Entity-level Factual Consistency of Abstractive Text Summarization Feng Nan, Ramesh Nallapati, Zhiguo Wang, Cicero Nogueira dos Santos, Henghui Zhu, Dejiao Zhang, Kathleen McKeown, Bing Xiang EACL21 [pdf] [code]
On the Faithfulness for E-commerce Product Summarization Peng Yuan, Haoran Li, Song Xu, Youzheng Wu, Xiaodong He, Bowen Zhou COLING20 [pdf] [code]
FFCI: A Framework for Interpretable Automatic Evaluation of Summarization Fajri Koto, Jey Han Lau, Timothy Baldwin [pdf] [code]
GSum: A General Framework for Guided Neural Abstractive Summarization Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, Graham Neubig NAACL21 [pdf] [code]
Truth or Error? Towards systematic analysis of factual errors in abstractive summaries Klaus-Michael Lux, Maya Sappelli, Martha Larson EMNLP | Eval4NLP 20 [pdf]
Detecting Hallucinated Content in Conditional Neural Sequence Generation Chunting Zhou, Jiatao Gu, Mona Diab, Paco Guzman, Luke Zettlemoyer, Marjan Ghazvininejad [pdf] [code]
Go Figure! A Meta Evaluation of Factuality in Summarization Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, Jianfeng Gao Findings of ACL 2021 [pdf]
Constrained Abstractive Summarization: Preserving Factual Consistency with Constrained Generation Yuning Mao, Xiang Ren, Heng Ji, Jiawei Han [pdf]
Factual Error Correction for Abstractive Summarization Models Meng Cao, Yue Dong, Jiapeng Wu, Jackie Chi Kit Cheung EMNLP20 short [pdf] [code]
Multi-Fact Correction in Abstractive Text Summarization. Yue Dong, Shuohang Wang, Zhe Gan, Yu Cheng, Jackie Chi Kit Cheung, Jingjing Liu EMNLP20 [pdf]
Evaluating the Factual Consistency of Abstractive Text Summarization Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher EMNLP20 [pdf] [code]
Reducing Quantity Hallucinations in Abstractive Summarization Zheng Zhao, Shay B. Cohen, Bonnie Webber Findings of EMNLP 2020 [pdf]
On Faithfulness and Factuality in Abstractive Summarization Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald ACL20 [pdf] [data]
Improving Truthfulness of Headline Generation Kazuki Matsumaru, Sho Takase, Naoaki Okazaki ACL20 [pdf]
Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christopher D. Manning, Curtis P. Langlotz ACL20 [pdf]
FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization Esin Durmus, He He, Mona Diab ACL20 [pdf] [code]
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries Alex Wang, Kyunghyun Cho, Mike Lewis ACL20 [pdf] [code]
Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward Luyang Huang, Lingfei Wu, Lu Wang ACL20 [pdf]
Mind The Facts: Knowledge-Boosted Coherent Abstractive Text Summarization Beliz Gunel, Chenguang Zhu, Michael Zeng, Xuedong Huang NIPS19 [pdf]
Assessing The Factual Accuracy of Generated Text Ben Goodrich, Vinay Rao, Mohammad Saleh, Peter J Liu KDD19 [pdf]
Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, Iryna Gurevych ACL19 [pdf] [data]
Ensure the Correctness of the Summary: Incorporate Entailment Knowledge into Abstractive Sentence Summarization Haoran Li, Junnan Zhu, Jiajun Zhang, Chengqing Zong COLING18 [pdf] [code]
Faithful to the Original: Fact-Aware Neural Abstractive Summarization Ziqiang Cao, Furu Wei, Wenjie Li, Sujian Li AAAI18 [pdf]
FAR-ASS: Fact-aware reinforced abstractive sentence summarization Mengli Zhang, Gang Zhou, Wanting Yu, Wenfen Liu [pdf]

Contrastive Learning

COLO: A Contrastive Learning based Re-ranking Framework for One-Stage Summarization COLING 2022 [pdf] [code]

[Abs]
Traditional training paradigms for extractive and abstractive summarization systems always use only token-level or sentence-level training objectives. However, the output summary is always evaluated at the summary level, which leads to an inconsistency between training and evaluation. In this paper, we propose a Contrastive Learning based re-ranking framework for one-stage summarization called COLO. By modeling a contrastive objective, we show that the summarization model is able to directly generate summaries according to the summary-level score without additional modules and parameters. Extensive experiments demonstrate that COLO boosts the extractive and abstractive results of one-stage systems on the CNN/DailyMail benchmark to 44.58 and 46.33 ROUGE-1 score while preserving the parameter efficiency and inference efficiency. Compared with state-of-the-art multi-stage systems, we save more than 100 GPU training hours and obtain a 3~8 speed-up ratio during inference while maintaining comparable results.
CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization Shuyang Cao, Lu Wang EMNLP 2021 [pdf] [code]
Sequence Level Contrastive Learning for Text Summarization Shusheng Xu, Xingxing Zhang, Yi Wu, Furu Wei AAAI 2022 [pdf](https://arxiv.org/abs/2109.03481)
Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization Chujie Zheng, Kunpeng Zhang, Harry Jiannan Wang, Ling Fan, Zhe Wang [pdf] [code]
Constructing Contrastive samples via Summarization for Text Classification with limited annotations Yangkai Du, Tengfei Ma, Lingfei Wu, Fangli Xu, Xuhong Zhang, Shouling Ji Findings of EMNLP 2021 Short [pdf]
Alleviating Exposure Bias via Contrastive Learning for Abstractive Text Summarization Shichao Sun, Wenjie Li [pdf] [code]
SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization Yixin Liu, Pengfei Liu ACL 2021 short [pdf] [code]
Contrastive Learning with Adversarial Perturbations for Conditional Text Generation Seanie Lee, Dong Bok Lee, Sung Ju Hwang ICLR 2021 [pdf]
DeepChannel: Salience Estimation by Contrastive Learning for Extractive Document Summarization Jiaxin Shi, Chen Liang, Lei Hou, Juanzi Li, Zhiyuan Liu, Hanwang Zhang AAAI 2019 [pdf] [code]
Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning Hanlu Wu, Tengfei Ma, Lingfei Wu, Tariro Manyumwa, Shouling Ji EMNLP 2020 [pdf] [code]
Contrastive Attention Mechanism for Abstractive Sentence Summarization Xiangyu Duan, Hongfei Yu, Mingming Yin, Min Zhang, Weihua Luo, Yue Zhang EMNLP 2019 [pdf] [code]

Evaluation

ChatGPT as a Factual Inconsistency Evaluator for Abstractive Text Summarization Zheheng Luo, Qianqian Xie, Sophia Ananiadou [pdf]

[Abs]
Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients. The faithfulness of summaries is critical to their safe usage in clinical settings. To better understand the limitations of abstractive systems, as well as the suitability of existing evaluation metrics, we benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course. We create a corpus of patient hospital admissions and summaries for a cohort of HIV patients, each with complex medical histories. Annotators are presented with summaries and source notes, and asked to categorize manually highlighted summary elements (clinical entities like conditions and medications as well as actions like "following up") into one of three categories: "Incorrect," "Missing," and "Not in Notes." We meta-evaluate a broad set of proposed faithfulness metrics and, across metrics, explore the importance of domain adaptation (e.g. the impact of in-domain pre-training and metric fine-tuning), the use of source-summary alignments, and the effects of distilling a single metric from an ensemble of pre-existing metrics. Off-the-shelf metrics with no exposure to clinical text correlate well yet overly rely on summary extractiveness. As a practical guide to long-form clinical narrative summarization, we find that most metrics correlate best to human judgments when provided with one summary sentence at a time and a minimal set of relevant source context.
Large Language Models are Diverse Role-Players for Summarization Evaluation Ning Wu, Ming Gong, Linjun Shou, Shining Liang, Daxin Jiang [pdf]
GPTScore: Evaluate as You Desire Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu [pdf] [code]

[Abs]
Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., zero-shot instruction) from generative pre-trained models to score generated texts. Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions. This nature helps us overcome several long-standing challenges in text evaluation--how to achieve customized, multi-faceted evaluation without the need for annotated samples. We make our code publicly available at this https URL.
Needle in a Haystack: An Analysis of Finding Qualified Workers on MTurk for Summarization Lining Zhang, João Sedoc, Simon Mille, Yufang Hou, Sebastian Gehrmann, Daniel Deutsch, Elizabeth Clark, Yixin Liu, Miruna Clinciu, Saad Mahamood, Khyathi Chandu [pdf]

[Abs]
The acquisition of high-quality human annotations through crowdsourcing platforms like Amazon Mechanical Turk (MTurk) is more challenging than expected. The annotation quality might be affected by various aspects like annotation instructions, Human Intelligence Task (HIT) design, and wages paid to annotators. To avoid potentially low-quality annotations which could mislead the evaluation of automatic summarization system outputs, we investigate the recruitment of high-quality MTurk workers via a three-step qualification pipeline. We show that we can successfully filter out bad workers before they carry out the evaluations and obtain high-quality annotations while optimizing the use of resources. This paper can serve as a basis for the recruitment of qualified annotators in other challenging annotation tasks.
DocAsRef: A Pilot Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-Freely Forrest Sheng Bao, Ruixuan Tu, Ge Luo [pdf]

[Abs]
Summary quality assessment metrics have two categories: reference-based and reference-free. Reference-based metrics are theoretically more accurate but are limited by the availability and quality of the human-written references, which are both difficult to ensure. This has inspired the development of reference-free metrics, which are independent from human-written references, in the past few years. However, existing reference-free metrics cannot be both zero-shot and accurate. In this paper, we propose a zero-shot but accurate reference-free approach in a sneaky way: feeding documents, based upon which summaries are generated, as references into reference-based metrics. Experimental results show that this zero-shot approach can give us the best-performing reference-free metrics on nearly all aspects on several recently-released datasets, sometimes even beating reference-free metrics specifically trained for this task. We further investigate what reference-based metrics can benefit from such repurposing and whether our additional tweaks help.
RISE: Leveraging Retrieval Techniques for Summarization Evaluation David Uthus, Jianmo Ni [pdf] [code]

[Abs]
Evaluating automatically-generated text summaries is a challenging task. While there have been many interesting approaches, they still fall short of human evaluations. We present RISE, a new approach for evaluating summaries by leveraging techniques from information retrieval. RISE is first trained as a retrieval task using a dual-encoder retrieval setup, and can then be subsequently utilized for evaluating a generated summary given an input document, without gold reference summaries. RISE is especially well suited when working on new datasets where one may not have reference summaries available for evaluation. We conduct comprehensive experiments on the SummEval benchmark (Fabbri et al., 2021) and the results show that RISE has higher correlation with human evaluations compared to many past approaches to summarization evaluation. Furthermore, RISE also demonstrates data-efficiency and generalizability across languages.
Universal Evasion Attacks on Summarization Scoring Wenchuan Mu, Kwan Hui Lim [pdf]

[Abs]
The automatic scoring of summaries is important as it guides the development of summarizers. Scoring is also complex, as it involves multiple aspects such as fluency, grammar, and even textual entailment with the source text. However, summary scoring has not been considered a machine learning task to study its accuracy and robustness. In this study, we place automatic scoring in the context of regression machine learning tasks and perform evasion attacks to explore its robustness. Attack systems predict a non-summary string from each input, and these non-summary strings achieve competitive scores with good summarizers on the most popular metrics: ROUGE, METEOR, and BERTScore. Attack systems also "outperform" state-of-the-art summarization methods on ROUGE-1 and ROUGE-L, and score the second-highest on METEOR. Furthermore, a BERTScore backdoor is observed: a simple trigger can score higher than any automatic summarization method. The evasion attacks in this work indicate the low robustness of current scoring systems at the system level. We hope that our highlighting of these proposed attacks will facilitate the development of summary scores.
Self-Repetition in Abstractive Neural Summarizers Nikita Salkar, Thomas Trikalinos, Byron C. Wallace, Ani Nenkova [pdf]

[Abs]
We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5, and Pegasus), fine-tuned on five datasets. In a regression analysis, we find that the three architectures have different propensities for repeating content across output summaries for inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language, is associated with a higher rate of self-repetition. In qualitative analysis we find systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus-level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.
How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation Julius Steen, Katja Markert COLING 2022 [pdf] [code]

[Abs]
Automatically evaluating the coherence of summaries is of great significance both to enable cost-efficient summarizer evaluation and as a tool for improving coherence by selecting high-scoring candidate summaries. While many different approaches have been suggested to model summary coherence, they are often evaluated using disparate datasets and metrics. This makes it difficult to understand their relative performance and identify ways forward towards better summary coherence modelling. In this work, we conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field. Additionally, we introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders. While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models fine-tuned on self-supervised tasks show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.
PrefScore: Pairwise Preference Learning for Reference-free Summarization Quality Assessment Ge Luo, Hebi Li, Youbiao He, Forrest Sheng Bao COLING 2022 [pdf] [code]

[Abs]
Evaluating machine-generated summaries without a human-written reference summary has been a need for a long time. Inspired by preference labeling in existing work of summarization evaluation, we propose to judge summary quality by learning the preference rank of summaries using the Bradley-Terry power ranking model from inferior summaries generated by corrupting base summaries. Extensive experiments on several datasets show that our weakly supervised scheme can produce scores highly correlated with human ratings.
SummScore: A Comprehensive Evaluation Metric for Summary Quality Based on Cross-Encoder Wuhang Lin, Shasha Li, Chen Zhang, Bin Ji, Jie Yu, Jun Ma, Zibo Yi APWeb-WAIM 2022 [pdf]

[Abs]
Text summarization models are often trained to produce summaries that meet human quality requirements. However, the existing evaluation metrics for summary text are only rough proxies for summary quality, suffering from low correlation with human scoring and inhibition of summary diversity. To solve these problems, we propose SummScore, a comprehensive metric for summary quality evaluation based on Cross-Encoder. Firstly, by adopting the original-summary measurement mode and comparing the semantics of the original text, SummScore gets rid of the inhibition of summary diversity. With the help of the text-matching pre-training Cross-Encoder, SummScore can effectively capture the subtle differences between the semantics of summaries. Secondly, to improve the comprehensiveness and interpretability, SummScore consists of four fine-grained submodels, which measure Coherence, Consistency, Fluency, and Relevance separately. We use multiple rounds of semi-supervised training to improve the performance of our model on extremely limited annotated data. Extensive experiments show that SummScore significantly outperforms existing evaluation metrics in the above four dimensions in correlation with human scoring. We also provide the quality evaluation results of SummScore on 16 mainstream summarization models for later research.
Does Summary Evaluation Survive Translation to Other Languages? Spencer Braun, Oleg Vasilyev, Neslihan Iskender, John Bohannon NAACL 2022 [pdf] [code]

[Abs]
The creation of a quality summarization dataset is an expensive, time-consuming effort, requiring the production and evaluation of summaries by both trained humans and machines. The returns to such an effort would increase significantly if the dataset could be used in additional languages without repeating human annotations. To investigate how much we can trust machine translation of summarization datasets, we translate the English SummEval dataset to seven languages and compare performances across automatic evaluation measures. We explore equivalence testing as the appropriate statistical paradigm for evaluating correlations between human and automated scoring of summaries. We also consider the effect of translation on the relative performance between measures. We find some potential for dataset reuse in languages similar to the source and along particular dimensions of summary quality. Our code and data can be found at https://github.com/PrimerAI/primer-research/.
Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics Daniel Deutsch, Rotem Dror, Dan Roth NAACL 2022 [pdf] [code]

[Abs]
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice and propose changes to rectify this disconnect. First, we calculate the system score for an automatic metric using the full test set instead of the subset of summaries judged by humans, which is currently standard practice. We demonstrate how this small change leads to more precise estimates of system-level correlations. Second, we propose to calculate correlations only on pairs of systems that are separated by small differences in automatic scores which are commonly observed in practice. This allows us to demonstrate that our best estimate of the correlation of ROUGE to human judgments is near 0 in realistic scenarios. The results from the analyses point to the need to collect more high-quality human judgments and to improve automatic metrics when differences in system scores are small.
SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling Forrest Bao, Ge Luo, Hebi Li, Minghui Qiu, Yinfei Yang, Youbiao He, Cen Chen NAACL 2022 [pdf] [code]

[Abs]
Canonical automatic summary evaluation metrics, such as ROUGE, focus on lexical similarity, which captures neither semantics nor linguistic quality well, and require a reference summary, which is costly to obtain. Recently, there have been a growing number of efforts to alleviate either or both of the two drawbacks. In this paper, we present a proof-of-concept study of a weakly supervised summary evaluation approach without the presence of reference summaries. Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries. In cross-domain tests, our strategy outperforms baselines with promising improvements, and shows a great advantage in gauging linguistic qualities over all metrics.
Reference-free Summarization Evaluation via Semantic Correlation and Compression Ratio Yizhu Liu, Qi Jia, Kenny Zhu NAACL 2022 [pdf] [code]

[Abs]
A document can be summarized in a number of ways. Reference-based evaluation of summarization has been criticized for its inflexibility. The more sufficient the number of abstracts, the more accurate the evaluation results. However, it is difficult to collect sufficient reference summaries. In this paper, we propose a new automatic reference-free evaluation metric that compares semantic distribution between source document and summary by pretrained language models and considers summary compression ratio. The experiments show that this metric is more consistent with human evaluation in terms of coherence, consistency, relevance and fluency.
MaskEval: Weighted MLM-Based Evaluation for Text Summarization and Simplification Yu Lu Liu, Rachel Bawden, Thomas Scialom, Benoît Sagot, Jackie Chi Kit Cheung [pdf] [code]
TRUE: Re-evaluating Factual Consistency Evaluation NAACL 2022 [pdf]
Play the Shannon Game With Language Models: A Human-Free Approach to Summary Evaluation Nicholas Egan, Oleg Vasilyev, John Bohannon AAAI 2022 [pdf] [code]
Differentiable N-gram Objective on Abstractive Summarization Yunqi Zhu, Wensheng Zhang, Mingjin Zhu [pdf] [code]
DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence Wei Zhao, Michael Strube, Steffen Eger [pdf] [code]
WIDAR -- Weighted Input Document Augmented ROUGE Raghav Jain, Vaibhav Mavi, Anubhav Jangra, Sriparna Saha ECIR 2022 [pdf] [code]
InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation Pierre Colombo, Chloe Clave, Pablo Piantanida AAAI 2022 [pdf]
Evaluation of Summarization Systems across Gender, Age, and Race Anna Jørgensen, Anders Søgaard EMNLP 2021 New Frontiers in Summarization Workshop [pdf]
Evaluation of Abstractive Summarisation Models with Machine Translation in Deliberative Processes M. Arana-Catania, Rob Procter, Yulan He, Maria Liakata EMNLP 2021 New Frontiers in Summarization Workshop [pdf]
Finding a Balanced Degree of Automation for Summary Evaluation Shiyue Zhang, Mohit Bansal EMNLP 2021 [pdf] [code]
QuestEval: Summarization Asks for Fact-based Evaluation Thomas Scialom, Paul-Alexis Dray, Patrick Gallinari, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang EMNLP 2021 [pdf] [code]
BARTScore: Evaluating Generated Text as Text Generation Weizhe Yuan, Graham Neubig, Pengfei Liu [pdf] [code]
A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy Wang Chen, Piji Li, Irwin King ACL 2021 [pdf] [code]
Evaluating the Efficacy of Summarization Evaluation across Languages Fajri Koto, Jey Han Lau, Timothy Baldwin Findings of ACL 2021 [pdf]
Question-aware Transformer Models for Consumer Health Question Summarization Shweta Yadav, Deepak Gupta, Asma Ben Abacha, Dina Demner-Fushman [pdf]
Towards Human-Free Automatic Quality Evaluation of German Summarization Neslihan Iskender, Oleg Vasilyev, Tim Polzehl, John Bohannon, Sebastian Möller [pdf]
Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead Neslihan Iskender, Tim Polzehl, Sebastian Möller EACL21 [pdf] [code]
SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization Jesse Vig, Wojciech Kryscinski, Karan Goel, Nazneen Fatema Rajani ACL 2021 demo [pdf] [data]
Is human scoring the best criteria for summary evaluation? Oleg Vasilyev, John Bohannon Findings of ACL 2021 [pdf]
How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation Julius Steen, Katja Markert EACL21 [pdf] [code]
HOLMS: Alternative Summary Evaluation with Large Language Models Yassine Mrabet, Dina Demner-Fushman COLING20 [pdf] [bib]
FFCI: A Framework for Interpretable Automatic Evaluation of Summarization Fajri Koto, Jey Han Lau, Timothy Baldwin [pdf] [code]
Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning Hanlu Wu, Tengfei Ma, Lingfei Wu, Tariro Manyumwa, Shouling Ji EMNLP20 [pdf] [code]
SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics Daniel Deutsch, Dan Roth [pdf] [code]
SummEval: Re-evaluating Summarization Evaluation Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev [pdf] [code]
HIGHRES: Highlight-based Reference-less Evaluation of Summarization Hardy, Shashi Narayan, Andreas Vlachos ACL19 [pdf] [code]

Multi-Document

Compressed Heterogeneous Graph for Abstractive Multi-Document Summarization Miao Li, Jianzhong Qi, Jey Han Lau AAAI 2023 [pdf] [code]

[Abs]
Multi-document summarization (MDS) aims to generate a summary for a number of related documents. We propose HGSUM, an MDS model that extends an encoder-decoder architecture, to incorporate a heterogeneous graph to represent different semantic units (e.g., words and sentences) of the documents. This contrasts with existing MDS models which do not consider different edge types of graphs and as such do not capture the diversity of relationships in the documents. To preserve only key information and relationships of the documents in the heterogeneous graph, HGSUM uses graph pooling to compress the input graph. And to guide HGSUM to learn compression, we introduce an additional objective that maximizes the similarity between the compressed graph and the graph constructed from the ground-truth summary during training. HGSUM is trained end-to-end with graph similarity and standard cross-entropy objectives. Experimental results over MULTI-NEWS, WCEP-100, and ARXIV show that HGSUM outperforms state-of-the-art MDS models. The code for our model and experiments is available at: this https URL.
Do Multi-Document Summarization Models Synthesize? Jay DeYoung, Stephanie C. Martinez, Iain J. Marshall, Byron C. Wallace [pdf]

[Abs]
Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key property or aspect. For example, a synopsis of film reviews all written about a particular movie should reflect the average critic consensus. As a more consequential example, consider narrative summaries that accompany biomedical systematic reviews of clinical trial results. These narratives should fairly summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this type of synthesis? To assess this we perform a suite of experiments that probe the degree to which conditional generation models trained for summarization using standard methods yield outputs that appropriately synthesize inputs. We find that existing models do partially perform synthesis, but do so imperfectly. In particular, they are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., the ratio of positive to negative movie reviews). We propose a simple, general method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate. This approach improves model synthesis performance. We hope that highlighting the need for synthesis (in some summarization settings) motivates further research into multi-document summarization methods and learning objectives that explicitly account for the need to synthesize.
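The select-or-abstain step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_or_abstain`, the `predict_measure` scorer, the `tolerance` threshold, and the toy sentiment scores are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence


def select_or_abstain(
    candidates: Sequence[str],
    predict_measure: Callable[[str], float],
    expected: float,
    tolerance: float,
) -> Optional[str]:
    """Return the candidate whose measured property is closest to the
    expected aggregate, or None (abstain) if none is within tolerance."""
    if not candidates:
        return None
    # Pick the candidate with the smallest gap to the expected aggregate.
    best = min(candidates, key=lambda c: abs(predict_measure(c) - expected))
    if abs(predict_measure(best) - expected) > tolerance:
        return None  # abstain: no candidate reflects the inputs well enough
    return best


# Hypothetical sentiment scores standing in for a learned classifier.
toy_scores = {
    "Critics loved it.": 0.9,
    "Reviews were mixed.": 0.5,
    "Critics hated it.": 0.1,
}
expected_sentiment = 0.55  # e.g. mean sentiment of the input reviews (assumed)

chosen = select_or_abstain(list(toy_scores), toy_scores.get, expected_sentiment, 0.1)
abstained = select_or_abstain(["Critics hated it."], toy_scores.get, expected_sentiment, 0.1)
```

Here `chosen` is the mixed-review candidate, while `abstained` is `None` because the only available candidate is too far from the expected aggregate.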
Exploring the Challenges of Open Domain Multi-Document Summarization John Giorgi, Luca Soldaini, Bo Wang, Gary Bader, Kyle Lo, Lucy Lu Wang, Arman Cohan [pdf] [code]

[Abs]
Multi-document summarization (MDS) has traditionally been studied assuming a set of ground-truth topic-related input documents is provided. In practice, the input document set is unlikely to be available a priori and would need to be retrieved based on an information need, a setting we call open-domain MDS. We experiment with current state-of-the-art retrieval and summarization models on several popular MDS datasets extended to the open-domain setting. We find that existing summarizers suffer large reductions in performance when applied as-is to this more realistic task, though training summarizers with retrieved inputs can reduce their sensitivity to retrieval errors. To further probe these findings, we conduct perturbation experiments on summarizer inputs to study the impact of different types of document retrieval errors. Based on our results, we provide practical guidelines to help facilitate a shift to open-domain MDS. We release our code and experimental results alongside all data or model artifacts created during our investigation.
How "Multi" is Multi-Document Summarization? Ruben Wolhandler, Arie Cattan, Ori Ernst, Ido Dagan EMNLP 2022 [pdf] [code]

[Abs]
The task of multi-document summarization (MDS) aims at models that, given multiple documents as input, are able to generate a summary that combines disperse information, originally spread across these documents. Accordingly, it is expected that both reference summaries in MDS datasets, as well as system summaries, would indeed be based on such dispersed information. In this paper, we argue for quantifying and assessing this expectation. To that end, we propose an automated measure for evaluating the degree to which a summary is "disperse", in the sense of the number of source documents needed to cover its content. We apply our measure to empirically analyze several popular MDS datasets, with respect to their reference summaries, as well as the output of state-of-the-art systems. Our results show that certain MDS datasets barely require combining information from multiple documents, where a single document often covers the full summary content. Overall, we advocate using our metric for assessing and improving the degree to which summarization datasets require combining multi-document information, and similarly how summarization models actually meet this challenge. Our code is available in this https URL.
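To make the idea of "the number of source documents needed to cover a summary" concrete, here is a rough sketch based on greedy set cover over summary tokens. This is an assumption-laden stand-in, not the measure proposed in the paper: whitespace token overlap as a proxy for content coverage, the `threshold` of 0.8, and the names `tokens` and `documents_needed` are all hypothetical.

```python
from typing import List, Optional, Set


def tokens(text: str) -> Set[str]:
    # Crude whitespace tokenization; a real measure would compare content units.
    return set(text.lower().split())


def documents_needed(
    summary: str, documents: List[str], threshold: float = 0.8
) -> Optional[int]:
    """Greedily count documents needed to cover `threshold` of the summary
    tokens. Returns None if the documents cannot reach that coverage."""
    target = tokens(summary)
    if not target:
        return 0
    # Keep only the part of each document that overlaps with the summary.
    doc_tokens = [tokens(d) & target for d in documents]
    covered: Set[str] = set()
    used = 0
    while len(covered) / len(target) < threshold:
        # Choose the document that adds the most uncovered summary tokens.
        gain, best = max(
            ((len(t - covered), i) for i, t in enumerate(doc_tokens)),
            default=(0, -1),
        )
        if gain == 0:
            return None
        covered |= doc_tokens[best]
        used += 1
    return used


summary = "storm hits coast and schools close"
dispersed = documents_needed(
    summary, ["a storm hits the coast", "schools close after the storm and floods"]
)
concentrated = documents_needed(
    summary, ["storm hits coast and schools close today", "unrelated text"]
)
```

In the first toy case both documents are needed, while in the second a single document already covers the summary, which is the situation the abstract reports for some MDS datasets.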
Analyzing the Dialect Diversity in Multi-document Summaries Olubusayo Olabisi, Aaron Hudson, Antonie Jetter, Ameeta Agrawal COLING 2022 [pdf] [code]

[Abs]
Social media posts provide a compelling, yet challenging source of data of diverse perspectives from many socially salient groups. Automatic text summarization algorithms make this data accessible at scale by compressing large collections of documents into short summaries that preserve salient information from the source text. In this work, we take a complementary approach to analyzing and improving the quality of summaries generated from social media data in terms of their ability to represent salient as well as diverse perspectives. We introduce a novel dataset, DivSumm, of dialect diverse tweets and human-written extractive and abstractive summaries. Then, we study the extent of dialect diversity reflected in human-written reference summaries as well as system-generated summaries. The results of our extensive experiments suggest that humans annotate fairly well-balanced dialect diverse summaries, and that cluster-based pre-processing approaches seem beneficial in improving the overall quality of the system-generated summaries without loss in diversity.
Document-aware Positional Encoding and Linguistic-guided Encoding for Abstractive Multi-document Summarization Congbo Ma, Wei Emma Zhang, Pitawelayalage Dasun Dileepa Pitawela, Yutong Qu, Haojie Zhuang, Hu Wang [pdf]

[Abs]
One key challenge in multi-document summarization is to capture the relations among input documents that distinguish between single document summarization (SDS) and multi-document summarization (MDS). Few existing MDS works address this issue. One effective way is to encode document positional information to assist models in capturing cross-document relations. However, existing MDS models, such as Transformer-based models, only consider token-level positional information. Moreover, these models fail to capture sentences' linguistic structure, which inevitably causes confusion in the generated summaries. Therefore, in this paper, we propose document-aware positional encoding and linguistic-guided encoding that can be fused with the Transformer architecture for MDS. For document-aware positional encoding, we introduce a general protocol to guide the selection of document encoding functions. For linguistic-guided encoding, we propose to embed syntactic dependency relations into the dependency relation mask with a simple but effective non-linear encoding learner for feature learning. Extensive experiments show the proposed model can generate summaries with high quality.
Multi-Document Scientific Summarization from a Knowledge Graph-Centric View Pancheng Wang, Shasha Li, Kunyuan Pang, Liangliang He, Dong Li, Jintao Tang, Ting Wang COLING 2022 [pdf] [code]

[Abs]
Multi-Document Scientific Summarization (MDSS) aims to produce coherent and concise summaries for clusters of topic-relevant scientific papers. This task requires precise understanding of paper content and accurate modeling of cross-paper relationships. Knowledge graphs convey compact and interpretable structured information for documents, which makes them ideal for content modeling and relationship modeling. In this paper, we present KGSum, an MDSS model centred on knowledge graphs during both the encoding and decoding process. Specifically, in the encoding process, two graph-based modules are proposed to incorporate knowledge graph information into paper encoding, while in the decoding process, we propose a two-stage decoder by first generating knowledge graph information of summary in the form of descriptive sentences, followed by generating the final summary. Empirical results show that the proposed architecture brings substantial improvements over baselines on the Multi-Xscience dataset.
Generating a Structured Summary of Numerous Academic Papers: Dataset and Method Shuaiqi Liu, Jiannong Cao, Ruosong Yang, Zhiyuan Wen IJCAI 2022 [pdf] [data]

[Abs]
Writing a survey paper on one research topic usually needs to cover the salient content from numerous related papers, which can be modeled as a multi-document summarization (MDS) task. Existing MDS datasets usually focus on producing the structureless summary covering a few input documents. Meanwhile, previous structured summary generation works focus on summarizing a single document into a multi-section summary. These existing datasets and methods cannot meet the requirements of summarizing numerous academic papers into a structured summary. To deal with the scarcity of available data, we propose BigSurvey, the first large-scale dataset for generating comprehensive summaries of numerous academic papers on each topic. We collect target summaries from more than seven thousand survey papers and utilize their 430 thousand reference papers’ abstracts as input documents. To organize the diverse content from dozens of input documents and ensure the efficiency of processing long text sequences, we propose a summarization method named category-based alignment and sparse transformer (CAST). The experimental results show that our CAST method outperforms various advanced summarization methods.
Proposition-Level Clustering for Multi-Document Summarization Ori Ernst, Avi Caciularu, Ori Shapira, Ramakanth Pasunuru, Mohit Bansal, Jacob Goldberger, Ido Dagan NAACL 2022 [pdf] [code]

[Abs]
Text clustering methods were traditionally incorporated into multi-document summarization (MDS) as a means for coping with considerable information repetition. Particularly, clusters were leveraged to indicate information saliency as well as to avoid redundancy. Such prior methods focused on clustering sentences, even though closely related sentences usually also contain non-aligned parts. In this work, we revisit the clustering approach, grouping together sub-sentential propositions, aiming at more precise information alignment. Specifically, our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster via text fusion. Our summarization method improves over the previous state-of-the-art MDS method in the DUC 2004 and TAC 2011 datasets, both in automatic ROUGE scores and human preference.
Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities Zejiang Shen, Kyle Lo, Lauren Yu, Nathan Dahlberg, Margo Schlanger, Doug Downey [pdf] [data]

[Abs]
With the advent of large language models, methods for abstractive summarization have made great strides, creating potential for use in applications to aid knowledge workers processing unwieldy document collections. One such setting is the Civil Rights Litigation Clearinghouse (CRLC) (this https URL), which posts information about large-scale civil rights lawsuits, serving lawyers, scholars, and the general public. Today, summarization in the CRLC requires extensive training of lawyers and law students who spend hours per case understanding multiple relevant documents in order to produce high-quality summaries of key events and outcomes. Motivated by this ongoing real-world summarization effort, we introduce Multi-LexSum, a collection of 9,280 expert-authored summaries drawn from ongoing CRLC writing. Multi-LexSum presents a challenging multi-document summarization task given the length of the source documents, often exceeding two hundred pages per case. Furthermore, Multi-LexSum is distinct from other datasets in its multiple target summaries, each at a different granularity (ranging from one-sentence "extreme" summaries to multi-paragraph narrations of over five hundred words). We present extensive analysis demonstrating that despite the high-quality summaries in the training data (adhering to strict content and style guidelines), state-of-the-art summarization models perform poorly on this task. We release Multi-LexSum for further research in summarization methods as well as to facilitate development of applications to assist in the CRLC's mission at this https URL.
AnswerSumm: A Manually-Curated Dataset and Pipeline for Answer Summarization Alexander R. Fabbri, Xiaojian Wu, Srini Iyer, Haoran Li, Mona Diab NAACL 2022 [pdf] [code]

[Abs]
Community Question Answering (CQA) fora such as Stack Overflow and Yahoo! Answers contain a rich resource of answers to a wide range of community-based questions. Each question thread can receive a large number of answers with different perspectives. One goal of answer summarization is to produce a summary that reflects the range of answer perspectives. A major obstacle for this task is the absence of a dataset to provide supervision for producing such summaries. Recent works propose heuristics to create such data, but these are often noisy and do not cover all answer perspectives present. This work introduces a novel dataset of 4,631 CQA threads for answer summarization curated by professional linguists. Our pipeline gathers annotations for all subtasks of answer summarization, including relevant answer sentence selection, grouping these sentences based on perspectives, summarizing each perspective, and producing an overall summary. We analyze and benchmark state-of-the-art models on these subtasks and introduce a novel unsupervised approach for multi-perspective data augmentation that boosts summarization performance according to automatic evaluation. Finally, we propose reinforcement learning rewards to improve factual consistency and answer coverage and analyze areas for improvement.
The patient is more dead than alive: exploring the current state of the multi-document summarisation of the biomedical literature Yulia Otmakhova, Karin Verspoor, Timothy Baldwin, Jey Han Lau ACL 2022 [pdf]

[Abs]
Although multi-document summarisation (MDS) of the biomedical literature is a highly valuable task that has recently attracted substantial interest, evaluation of the quality of biomedical summaries lacks consistency and transparency. In this paper, we examine the summaries generated by two current models in order to understand the deficiencies of existing evaluation approaches in the context of the challenges that arise in the MDS task. Based on this analysis, we propose a new approach to human evaluation and identify several challenges that must be overcome to develop effective biomedical MDS systems.
Predicting Intervention Approval in Clinical Trials through Multi-Document Summarization Georgios Katsimpras, Georgios Paliouras ACL 2022 [pdf] [code]

[Abs]
Clinical trials offer a fundamental opportunity to discover new treatments and advance medical knowledge. However, the uncertainty of the outcome of a trial can lead to unforeseen costs and setbacks. In this study, we propose a new method to predict the effectiveness of an intervention in a clinical trial. Our method relies on generating an informative summary from multiple documents available in the literature about the intervention under study. Specifically, our method first gathers all the abstracts of PubMed articles related to the intervention. Then, an evidence sentence, which conveys information about the effectiveness of the intervention, is extracted automatically from each abstract. Based on the set of evidence sentences extracted from the abstracts, a short summary about the intervention is constructed. Finally, the produced summaries are used to train a BERT-based classifier, in order to infer the effectiveness of an intervention. To evaluate our proposed method, we introduce a new dataset which is a collection of clinical trials together with their associated PubMed articles. Our experiments demonstrate the effectiveness of producing short informative summaries and using them to predict the effectiveness of an intervention.
Discriminative Marginalized Probabilistic Neural Method for Multi-Document Summarization of Medical Literature Gianluca Moro, Luca Ragazzi, Lorenzo Valgimigli, Davide Freddi ACL 2022 [pdf] [code]

[Abs]
Although current state-of-the-art Transformer-based solutions succeeded in a wide range of single-document NLP tasks, they still struggle to address multi-input tasks such as multi-document summarization. Many solutions truncate the inputs, thus ignoring potential summary-relevant contents, which is unacceptable in the medical domain where each piece of information can be vital. Others leverage linear model approximations to apply multi-input concatenation, worsening the results because all information is considered, even if it is conflicting or noisy with respect to a shared background. Despite the importance and social impact of medicine, there are no ad-hoc solutions for multi-document summarization. For this reason, we propose a novel discriminative marginalized probabilistic method (DAMEN) trained to discriminate critical information from a cluster of topic-related medical documents and generate a multi-document summary via token probability marginalization. Results prove we outperform the previous state-of-the-art on a biomedical dataset for multi-document summarization of systematic literature reviews. Moreover, we perform extensive ablation studies to motivate the design choices and prove the importance of each module of our method.
ACM -- Attribute Conditioning for Abstractive Multi Document Summarization Aiswarya Sankar, Ankit Chadha [pdf]
Improving Multi-Document Summarization through Referenced Flexible Extraction with Credit-Awareness Yun-Zhu Song, Yi-Syuan Chen, Hong-Han Shuai NAACL 2022 [pdf] [code]

[Abs]
A notable challenge in Multi-Document Summarization (MDS) is the extreme length of the input. In this paper, we present an extract-then-abstract Transformer framework to overcome the problem. Specifically, we leverage pre-trained language models to construct a hierarchical extractor for salient sentence selection across documents and an abstractor for rewriting the selected contents as summaries. However, learning such a framework is challenging since the optimal contents for the abstractor are generally unknown. Previous works typically create a pseudo extraction oracle to enable supervised learning for both the extractor and the abstractor. Nevertheless, we argue that the performance of such methods could be restricted due to the insufficient information for prediction and inconsistent objectives between training and testing. To this end, we propose a loss weighting mechanism that makes the model aware of the unequal importance of the sentences not in the pseudo extraction oracle, and leverage the fine-tuned abstractor to generate summary references as auxiliary signals for learning the extractor. Moreover, we propose a reinforcement learning method that can efficiently apply to the extractor for harmonizing the optimization between training and testing. Experimental results show that our framework substantially outperforms strong baselines with comparable model sizes and achieves the best results on the Multi-News, Multi-XScience, and WikiCatSum corpora.
NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias Nayeon Lee, Yejin Bang, Tiezheng Yu, Andrea Madotto, Pascale Fung NAACL 2022 [pdf] [code]

[Abs]
Media news framing bias can increase political polarization and undermine civil society. The need for automatic mitigation methods is therefore growing. We propose a new task, neutral summary generation from multiple news articles of varying political leanings, to facilitate balanced and unbiased news reading. In this paper, we first collect a new dataset, illustrate insights about framing bias through a case study, and propose a new effective metric and model (NeuS-Title) for the task. Based on our discovery that title provides a good signal for framing bias, we present NeuS-Title that learns to neutralize news content in hierarchical order from title to article. Our hierarchical multi-task learning is achieved by formatting our hierarchical data pair (title, article) sequentially with identifier-tokens (“TITLE=>”, “ARTICLE=>”) and fine-tuning the auto-regressive decoder with the standard negative log-likelihood objective. We then analyze and point out the remaining challenges and future directions. One of the most interesting observations is that neural NLG models can hallucinate not only factually inaccurate or unverifiable content but also politically biased content.
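The sequential (title, article) formatting with identifier tokens mentioned in this abstract could look roughly like the sketch below. Only the two token strings come from the abstract; the single-space separators, the helper names `format_pair` and `split_generation`, and the parsing logic are assumptions for illustration, not the authors' code.

```python
from typing import Tuple

# Identifier tokens quoted in the abstract.
TITLE_TOKEN = "TITLE=>"
ARTICLE_TOKEN = "ARTICLE=>"


def format_pair(title: str, article: str) -> str:
    """Serialize a (title, article) pair as one decoder target, title first.
    The single-space separators are an assumption."""
    return f"{TITLE_TOKEN} {title.strip()} {ARTICLE_TOKEN} {article.strip()}"


def split_generation(text: str) -> Tuple[str, str]:
    """Recover (title, article) from a string in the format above."""
    head, _, article = text.partition(ARTICLE_TOKEN)
    title = head.replace(TITLE_TOKEN, "", 1)
    return title.strip(), article.strip()


target = format_pair("Senate passes budget bill", "The Senate voted on Tuesday to pass the bill.")
title, article = split_generation(target)
```

Placing the title before the article means an auto-regressive decoder generates the title first and then conditions the article on it, which matches the title-to-article order the abstract describes.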
Read Top News First: A Document Reordering Approach for Multi-Document News Summarization Chao Zhao, Tenghao Huang, Somnath Basu Roy Chowdhury, Muthu Kumar Chandrasekaran, Kathleen McKeown, Snigdha Chaturvedi Findings of ACL 2022 [pdf] [code]
A Multi-Document Coverage Reward for RELAXed Multi-Document Summarization Jacob Parnell, Inigo Jauregi Unanue, Massimo Piccardi ACL 2022 [pdf] [code]

[Abs]
Multi-document summarization (MDS) has made significant progress in recent years, in part facilitated by the availability of new, dedicated datasets and capacious language models. However, a standing limitation of these models is that they are trained against limited references and with plain maximum-likelihood objectives. As for many other generative tasks, reinforcement learning (RL) offers the potential to improve the training of MDS models; yet, it requires a carefully-designed reward that can ensure appropriate leverage of both the reference summaries and the input documents. For this reason, in this paper we propose fine-tuning an MDS baseline with a reward that balances a reference-based metric such as ROUGE with coverage of the input documents. To implement the approach, we utilize RELAX (Grathwohl et al., 2018), a contemporary gradient estimator which is both low-variance and unbiased, and we fine-tune the baseline in a few-shot style for both stability and computational efficiency. Experimental results over the Multi-News and WCEP MDS datasets show significant improvements of up to +0.95 pp average ROUGE score and +3.17 pp METEOR score over the baseline, and competitive results with the literature. In addition, they show that the coverage of the input documents is increased, and evenly across all documents.
PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization Wen Xiao, Iz Beltagy, Giuseppe Carenini, Arman Cohan ACL 2022 [pdf] [code]

[Abs]
We introduce PRIMERA, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data. PRIMERA uses our newly proposed pre-training objective designed to teach the model to connect and aggregate information across documents. It also uses efficient encoder-decoder transformers to simplify the processing of concatenated input documents. With extensive experiments on 6 multi-document summarization datasets from 3 different domains on zero-shot, few-shot and full-supervised settings, PRIMERA outperforms current state-of-the-art dataset-specific and pre-trained models on most of these settings with large margins.
PeerSum: A Peer Review Dataset for Abstractive Multi-document Summarization Miao Li, Jianzhong Qi, Jey Han Lau [pdf] [data]
A Proposition-Level Clustering Approach for Multi-Document Summarization Ori Ernst, Avi Caciularu, Ori Shapira, Ramakanth Pasunuru, Mohit Bansal, Jacob Goldberger, Ido Dagan [pdf] [code]
MS^2: Multi-Document Summarization of Medical Studies Jay DeYoung, Iz Beltagy, Madeleine van Zuylen, Bailey Kuehl, Lucy Lu Wang EMNLP 2021 [pdf] [data]
SgSum: Transforming Multi-document Summarization into Sub-graph Selection Moye Chen, Wei Li, Jiachen Liu, Xinyan Xiao, Hua Wu, Haifeng Wang EMNLP 2021 [pdf] [code]
Topic-Guided Abstractive Multi-Document Summarization Peng Cui, Le Hu Findings of EMNLP 2021 [pdf]
Modeling Endorsement for Multi-Document Abstractive Summarization Logan Lebanoff, Bingqing Wang, Zhe Feng, Fei Liu EMNLP 2021|newsum [pdf]
Incorporating Linguistic Knowledge for Abstractive Multi-document Summarization Congbo Ma, Wei Emma Zhang, Hu Wang, Shubham Gupta, Mingyu Guo [pdf]
Capturing Relations between Scientific Papers: An Abstractive Model for Related Work Section Generation Xiuying Chen, Hind Alamro, Mingzhe Li, Shen Gao, Xiangliang Zhang, Dongyan Zhao, Rui Yan ACL 2021 [pdf] [data]
Highlight-Transformer: Leveraging Key Phrase Aware Attention to Improve Abstractive Multi-Document Summarization Shuaiqi Liu, Jiannong Cao, Ruosong Yang, Zhiyuan Wen ACL 2021 Findings [pdf]
Entity-Aware Abstractive Multi-Document Summarization Hao Zhou, Weidong Ren, Gongshen Liu, Bo Su, Wei Lu ACL 2021 Findings [pdf] [code]
TWAG: A Topic-Guided Wikipedia Abstract Generator Fangwei Zhu, Shangqing Tu, Jiaxin Shi, Juanzi Li, Lei Hou, Tong Cui ACL 2021 [pdf] [code]
AgreeSum: Agreement-Oriented Multi-Document Summarization Richard Yuanzhe Pang, Adam D. Lelkes, Vinh Q. Tran, Cong Yu Findings of ACL 2021 [pdf] [data]
Analysis of GraphSum's Attention Weights to Improve the Explainability of Multi-Document Summarization M. Lautaro Hickmann, Fabian Wurzberger, Megi Hoxhalli, Arne Lochner, Jessica Töllich, Ansgar Scherp [pdf]
Extending Multi-Document Summarization Evaluation to the Interactive Setting Ori Shapira, Ramakanth Pasunuru, Hadar Ronen, Mohit Bansal, Yael Amsterdamer, Ido Dagan NAACL21 [pdf] [code]
Efficiently Summarizing Text and Graph Encodings of Multi-Document Clusters Ramakanth Pasunuru, Mengwen Liu, Mohit Bansal, Sujith Ravi, Markus Dreyer NAACL21 [pdf] [code]
Self-Supervised and Controlled Multi-Document Opinion Summarization Hady Elsahar, Maximin Coavoux, Jos Rozen, Matthias Gallé EACL 2021 [pdf]
MS2: Multi-Document Summarization of Medical Studies Jay DeYoung, Iz Beltagy, Madeleine van Zuylen, Bailey Kuehl, Lucy Lu Wang [pdf] [data]
Nutri-bullets: Summarizing Health Studies by Composing Segments Darsh J Shah, Lili Yu, Tao Lei, Regina Barzilay AAAI21 [pdf] [code]
Multi-document Summarization using Semantic Role Labeling and Semantic Graph for Indonesian News Article Yuly Haruka Berliana Gunawan, Masayu Leylia Khodra [pdf]
Flight of the PEGASUS? Comparing Transformers on Few-Shot and Zero-Shot Multi-document Abstractive Summarization Travis Goodwin, Max Savery, Dina Demner-Fushman COLING20 [pdf]
Abstractive Multi-Document Summarization via Joint Learning with Single-Document Summarization Hanqi Jin, Xiaojun Wan Findings of EMNLP [pdf] [code]
Coarse-to-Fine Query Focused Multi-Document Summarization Yumo Xu, Mirella Lapata EMNLP20 [pdf] [code] [code]
WSL-DS: Weakly Supervised Learning with Distant Supervision for Query Focused Multi-Document Abstractive Summarization Md Tahmid Rahman Laskar, Enamul Hoque, Jimmy Xiangji Huang COLING20 Short [pdf] [code]
AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization Sayali Kulkarni, Sheide Chammas, Wan Zhu, Fei Sha, Eugene Ie [pdf] [data]
Multi-document Summarization with Maximal Marginal Relevance-guided Reinforcement Learning Yuning Mao, Yanru Qu, Yiqing Xie, Xiang Ren, Jiawei Han EMNLP20 [pdf] [code]
Heterogeneous Graph Neural Networks for Extractive Document Summarization Danqing Wang, Pengfei Liu, Yining Zheng, Xipeng Qiu, Xuanjing Huang ACL20 [pdf] [code]
Multi-Granularity Interaction Network for Extractive and Abstractive Multi-Document Summarization Hanqi Jin, Tianming Wang, Xiaojun Wan ACL20 [pdf]
SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization Yang Gao, Wei Zhao, Steffen Eger ACL20 [pdf] [code]
Leveraging Graph to Improve Abstractive Multi-Document Summarization Wei Li, Xinyan Xiao, Jiachen Liu, Hua Wu, Haifeng Wang, Junping Du ACL20 [pdf] [code]
Generating Representative Headlines for News Stories Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, Hongkun Yu, You Wu, Cong Yu, Daniel Finnie, Jiaqi Zhai, Nicholas Zukoski WWW20 [pdf] [code]
Learning to Create Sentence Semantic Relation Graphs for Multi-Document Summarization Diego Antognini, Boi Faltings EMNLP19 [pdf]
Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization Sangwoo Cho, Logan Lebanoff, Hassan Foroosh, Fei Liu ACL19 [pdf] [code]
Hierarchical Transformers for Multi-Document Summarization Yang Liu, Mirella Lapata ACL19 [pdf] [code]
MeanSum: A Neural Model for Unsupervised Multi-Document Abstractive Summarization Eric Chu, Peter J. Liu ICML19 [pdf] [code]
Generating Wikipedia by Summarizing Long Sequences Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer ICLR18 [pdf] [code]
Adapting the Neural Encoder-Decoder Framework from Single to Multi-Document Summarization Logan Lebanoff, Kaiqiang Song, Fei Liu EMNLP18 [pdf] [code]
Graph-based Neural Multi-Document Summarization Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, Dragomir Radev CoNLL17 [pdf]
Improving Multi-Document Summarization via Text Classification Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei AAAI17 [pdf]
Automatic generation of related work through summarizing citations Jingqiang Chen, Hai Zhuge [pdf] [data]
An Unsupervised Multi-Document Summarization Framework Based on Neural Document Model Shulei Ma, Zhi-Hong Deng, Yunlun Yang COLING16 [pdf]
Event-Centric Summary Generation Lucy Vanderwende, Michele Banko, Arul Menezes ACL04 [pdf]

Cross-Lingual

Towards Unifying Multi-Lingual and Cross-Lingual Summarization Jiaan Wang, Fandong Meng, Duo Zheng, Yunlong Liang, Zhixu Li, Jianfeng Qu, Jie Zhou ACL 2023 [pdf]

[Abs]
To adapt text summarization to the multilingual world, previous work proposes multi-lingual summarization (MLS) and cross-lingual summarization (CLS). However, these two tasks have been studied separately due to the different definitions, which limits the compatible and systematic research on both of them. In this paper, we aim to unify MLS and CLS into a more general setting, i.e., many-to-many summarization (M2MS), where a single model could process documents in any language and generate their summaries also in any language. As the first step towards M2MS, we conduct preliminary studies to show that M2MS can better transfer task knowledge across different languages than MLS and CLS. Furthermore, we propose Pisces, a pre-trained M2MS model that learns language modeling, cross-lingual ability and summarization ability via three-stage pre-training. Experimental results indicate that our Pisces significantly outperforms the state-of-the-art baselines, especially in the zero-shot directions, where there is no training data from the source-language documents to the target-language summaries.
XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages Dhaval Taunk, Shivprasad Sagare, Anupam Patil, Shivansh Subramanian, Manish Gupta, Vasudeva Varma [pdf] [code]

[Abs]
Lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused on English only, where English reference articles are summarized to generate English Wikipedia pages. But, for low-resource languages, the scarcity of reference articles makes monolingual summarization ineffective in solving this problem. Hence, in this work, we propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset spanning ∼69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system where the input is a set of citations and a section title and the output is a section-specific LR summary. The proposed system is based on a novel idea of neural unsupervised extractive summarization to coarsely identify salient information followed by a neural abstractive model to generate the section-specific text. Extensive experiments show that multi-domain training is better than the multi-lingual setup on average.
CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization Ruochen Zhang, Carsten Eickhoff [pdf] [code]

[Abs]
Cross-lingual summarization (CLS) has attracted increasing interest in recent years due to the availability of large-scale web-mined datasets and the advancements of multilingual language models. However, given the rareness of naturally occurring CLS resources, the majority of datasets are forced to rely on translation which can contain overly literal artifacts. This restricts our ability to observe naturally occurring CLS pairs that capture organic diction, including instances of code-switching. This alteration between languages in mid-message is a common phenomenon in multilingual settings yet has been largely overlooked in cross-lingual contexts due to data scarcity. To address this gap, we introduce CroCoSum, a dataset of cross-lingual code-switched summarization of technology news. It consists of over 24,000 English source articles and 18,000 human-curated Chinese news summaries, with more than 92% of the summaries containing code-switched phrases. For reference, we evaluate the performance of existing approaches including pipeline, end-to-end, and zero-shot methods. We show that leveraging existing resources as a pretraining step does not improve performance on CroCoSum, indicating the limited generalizability of existing resources. Finally, we discuss the challenges of evaluating cross-lingual summarizers on code-switched generation through qualitative error analyses.
Large Scale Multi-Lingual Multi-Modal Summarization Dataset Yash Verma, Anubhav Jangra, Raghvendra Kumar, Sriparna Saha [pdf] [code]

[Abs]
Significant developments in techniques such as encoder-decoder models have enabled us to represent information comprising multiple modalities. This information can further enhance many downstream tasks in the field of information retrieval and natural language processing; however, improvements in multi-modal techniques and their performance evaluation require large-scale multi-modal data which offers sufficient diversity. Multi-lingual modeling for a variety of tasks like multi-modal summarization, text generation, and translation leverages information derived from high-quality multi-lingual annotated data. In this work, we present the current largest multi-lingual multi-modal summarization dataset (M3LS), and it consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair. It is derived from news articles published by British Broadcasting Corporation (BBC) over a decade and spans 20 languages, targeting diversity across five language roots, it is also the largest summarization dataset for 13 languages and consists of cross-lingual summarization data for 2 languages. We formally define the multi-lingual multi-modal summarization task utilizing our dataset and report baseline scores from various state-of-the-art summarization techniques in a multi-lingual setting. We also compare it with many similar datasets to analyze the uniqueness and difficulty of M3LS. https://arxiv.org/abs/2302.06560
Understanding Translationese in Cross-Lingual Summarization Jiaan Wang, Fandong Meng, Tingyi Zhang, Yunlong Liang, Jiarong Xu, Zhixu Li, Jie Zhou [pdf]

[Abs]
Given a document in a source language, cross-lingual summarization (CLS) aims at generating a concise summary in a different target language. Unlike monolingual summarization (MS), naturally occurring source-language documents paired with target-language summaries are rare. To collect large-scale CLS samples, existing datasets typically involve translation in their creation. However, the translated text is distinguished from the text originally written in that language, i.e., translationese. Though many efforts have been devoted to CLS, none of them notice the phenomenon of translationese. In this paper, we first confirm that the different approaches to constructing CLS datasets will lead to different degrees of translationese. Then we design systematic experiments to investigate how translationese affects CLS model evaluation and performance when it appears in source documents or target summaries. In detail, we find that (1) the translationese in documents or summaries of test sets might lead to the discrepancy between human judgment and automatic evaluation; (2) the translationese in training sets would harm model performance in the real scene; (3) though machine-translated documents involve translationese, they are very useful for building CLS systems on low-resource languages under specific training strategies. Furthermore, we give suggestions for future CLS research including dataset and model developments. We hope that our work could let researchers notice the phenomenon of translationese in CLS and take it into account in the future.
Searching for Effective Multilingual Fine-Tuning Methods: A Case Study in Summarization Yiwei Qin, Graham Neubig, Pengfei Liu [pdf] [code]

[Abs]
Recently, a large number of tuning strategies have been proposed to adapt pre-trained language models to downstream tasks. In this paper, we perform an extensive empirical evaluation of various tuning strategies for multilingual learning, particularly in the context of text summarization. Specifically, we explore the relative advantages of three families of multilingual tuning strategies (a total of five models) and empirically evaluate them for summarization over 45 languages. Experimentally, we not only established a new state-of-the-art on the XL-Sum dataset but also derive a series of observations that hopefully can provide hints for future research on the design of multilingual tuning strategies.
ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization Jiaan Wang, Fandong Meng, Ziyao Lu, Duo Zheng, Zhixu Li, Jianfeng Qu, Jie Zhou EMNLP 2022 [pdf] [code]
A Survey on Cross-Lingual Summarization Jiaan Wang, Fandong Meng, Duo Zheng, Yunlong Liang, Zhixu Li, Jianfeng Qu, Jie Zhou TACL 2022 [pdf]
Overcoming Catastrophic Forgetting in Zero-Shot Cross-Lingual Generation Tu Vu, Aditya Barua, Brian Lester, Daniel Cer, Mohit Iyyer, Noah Constant [pdf]
MSAMSum: Towards Benchmarking Multi-lingual Dialogue Summarization Xiachong Feng, Xiaocheng Feng, Bing Qin ACL 2022 DialDoc Workshop [pdf] [data]
The Cross-lingual Conversation Summarization Challenge Yulong Chen, Ming Zhong, Xuefeng Bai, Naihao Deng, Jing Li, Xianchao Zhu, Yue Zhang [pdf]
Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization Ruipeng Jia, Xingxing Zhang, Yanan Cao, Shi Wang, Zheng Lin, Furu Wei ACL 2022 [pdf]

[Abs]
In zero-shot multilingual extractive text summarization, a model is typically trained on English summarization dataset and then applied on summarization datasets of other languages. Given English gold summaries and documents, sentence-level labels for extractive summarization are usually generated using heuristics. However, these monolingual labels created on English datasets may not be optimal on datasets of other languages, for that there is the syntactic or semantic discrepancy between different languages. In this way, it is possible to translate the English dataset to other languages and obtain different sets of labels again using heuristics. To fully leverage the information of these different sets of labels, we propose NLSSum (Neural Label Search for Summarization), which jointly learns hierarchical weights for these different sets of labels together with our summarization model. We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations across these two datasets.
Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation Changtong Zan, Liang Ding, Li Shen, Yu Cao, Weifeng Liu, Dacheng Tao [pdf]
A Variational Hierarchical Model for Neural Cross-Lingual Summarization Yunlong Liang, Fandong Meng, Chulun Zhou, Jinan Xu, Yufeng Chen, Jinsong Su, Jie Zhou ACL 2022 [pdf] [code]

[Abs]
The goal of the cross-lingual summarization (CLS) is to convert a document in one language (e.g., English) to a summary in another one (e.g., Chinese). The CLS task is essentially the combination of machine translation (MT) and monolingual summarization (MS), and thus there exists the hierarchical relationship between MT&MS and CLS. Existing studies on CLS mainly focus on utilizing pipeline methods or jointly training an end-to-end model through an auxiliary MT or MS objective. However, it is very challenging for the model to directly conduct CLS as it requires both the abilities to translate and summarize. To address this issue, we propose a hierarchical model for the CLS task, based on the conditional variational auto-encoder. The hierarchical model contains two kinds of latent variables at the local and global levels, respectively. At the local level, there are two latent variables, one for translation and the other for summarization. As for the global level, there is another latent variable for cross-lingual summarization conditioned on the two local-level variables. Experiments on two language directions (English-Chinese) verify the effectiveness and superiority of the proposed approach. In addition, we show that our model is able to generate better cross-lingual summaries than comparison models in the few-shot setting.
CptGraphSum: Let key clues guide the cross-lingual abstractive summarization Shuyu Jiang, Dengbiao Tu, Xingshu Chen, Rui Tang, Wenxian Wang, Haizhou Wang [pdf]
CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs Tahmid Hasan, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yuan-Fang Li, Yong-Bin Kang, Rifat Shahriyar [pdf] [code]
Improving Neural Cross-Lingual Summarization via Employing Optimal Transport Distance for Knowledge Distillation Thong Nguyen, Luu Anh Tuan AAAI 2022 [pdf] [code]
Evaluation of Abstractive Summarisation Models with Machine Translation in Deliberative Processes Miguel Arana-Catania, Rob Procter, Yulan He, Maria Liakata EMNLP 2021 | newsum [pdf]
Models and Datasets for Cross-Lingual Summarisation Laura Perez-Beltrachini, Mirella Lapata EMNLP 2021 [pdf] [data]
MassiveSumm: a very large-scale, very multilingual, news summarisation dataset Daniel Varab, Natalie Schluter EMNLP 2021 [pdf] [code]
Bridging the Gap: Cross-Lingual Summarization with Compression Rate Yu Bai, Heyan Huang, Kai Fan, Yang Gao, Zewen Chi, Boxing Chen [pdf]
Contrastive Aligned Joint Learning for Multilingual Summarization Danqing Wang, Jiaze Chen, Hao Zhou, Xipeng Qiu, Lei Li ACL 2021 Findings [pdf] [data]
XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages T. Hasan, A. Bhattacharjee, M. S. Islam, K. Samin, Y. Li, Y. Kang, M. S. Rahman, R. Shahriyar Findings of ACL 2021 [pdf] [data]
ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation Kaushal Kumar Maurya, Maunendra Sankar Desarkar, Yoshinobu Kano, Kumari Deepshikha Findings of ACL 2021 [pdf] [code]
mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang, Xian-Ling Mao, Heyan Huang, Furu Wei [pdf] [code]
Evaluating the Efficacy of Summarization Evaluation across Languages Fajri Koto, Jey Han Lau, Timothy Baldwin Findings of ACL 2021 [pdf]
Cross-Lingual Abstractive Summarization with Limited Parallel Resources Yu Bai, Yang Gao, Heyan Huang ACL 2021 [pdf] [code]
Unsupervised Approach to Multilingual User Comments Summarization Aleš Žagar, Marko Robnik-Šikonja EACL21 [pdf] [code]
MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization Jenny Paola Yela-Bello, Ewan Oglethorpe, Navid Rekabsaz EACL21 [pdf] [data]
Cross-lingual Approach to Abstractive Summarization Aleš Žagar, Marko Robnik-Šikonja [pdf]
Mixed-Lingual Pre-training for Cross-lingual Summarization Ruochen Xu, Chenguang Zhu, Yu Shi, Michael Zeng, Xuedong Huang AACL20 [pdf]
Multi-Task Learning for Cross-Lingual Abstractive Summarization Sho Takase, Naoaki Okazaki [pdf]
WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization Faisal Ladhak, Esin Durmus, Claire Cardie, Kathleen McKeown Findings of EMNLP20 [pdf] [data]
A Deep Reinforced Model for Zero-Shot Cross-Lingual Summarization with Bilingual Semantic Similarity Rewards Zi-Yi Dou, Sachin Kumar, Yulia Tsvetkov ACL20 workshop [pdf] [code]
Jointly Learning to Align and Summarize for Neural Cross-Lingual Summarization Yue Cao, Hui Liu, Xiaojun Wan ACL20 [pdf]
Attend, Translate and Summarize: An Efficient Method for Neural Cross-Lingual Summarization Junnan Zhu, Yu Zhou, Jiajun Zhang, Chengqing Zong ACL20 [pdf] [code]
MultiSumm: Towards a Unified Model for Multi-Lingual Abstractive Summarization Yue Cao, Xiaojun Wan, Jinge Yao, Dian Yu AAAI20 [pdf] [code]
Cross-Lingual Natural Language Generation via Pre-Training Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, Heyan Huang AAAI 2020 [pdf] [code]
Global Voices: Crossing Borders in Automatic News Summarization Khanh Nguyen, Hal Daumé III EMNLP19 workshop [pdf] [data]
NCLS: Neural Cross-Lingual Summarization Junnan Zhu, Qian Wang, Yining Wang, Yu Zhou, Jiajun Zhang, Shaonan Wang, Chengqing Zong EMNLP19 [pdf] [code]
Zero-Shot Cross-Lingual Abstractive Sentence Summarization through Teaching Generation and Attention Xiangyu Duan, Mingming Yin, Min Zhang, Boxing Chen, Weihua Luo ACL19 [pdf] [code]
A Robust Abstractive System for Cross-Lingual Summarization Jessica Ouyang, Boya Song, Kathy McKeown NAACL19 [pdf]
Cross-Lingual Korean Speech-to-Text Summarization HyoJeon Yoon, Dinh Tuyen Hoang, Ngoc Thanh Nguyen, Dosam Hwang ACIIDS19 [pdf]
Cross-language document summarization via extraction and ranking of multiple summaries Xiaojun Wan, Fuli Luo, Xue Sun, Songfang Huang & Jin-ge Yao [pdf]
Zero-Shot Cross-Lingual Neural Headline Generation Shi-qi Shen, Yun Chen, Cheng Yang, Zhi-yuan Liu, Mao-song Sun TASLP18 [pdf]
Cross-Language Text Summarization using Sentence and Multi-Sentence Compression Elvys Linhares Pontes, Stéphane Huet, Juan-Manuel Torres-Moreno, Andréa Carneiro Linhares NLDB18 [pdf]
Abstractive Cross-Language Summarization via Translation Model Enhanced Predicate Argument Structure Fusing Jiajun Zhang, Yu Zhou, Chengqing Zong TASLP16 [pdf]
Phrase-based Compressive Cross-Language Summarization Jin-ge Yao, Xiaojun Wan, Jianguo Xiao EMNLP15 [pdf]
Multilingual Single-Document Summarization with MUSE Marina Litvak, Mark Last MultiLing13 [pdf]
Using bilingual information for cross-language document summarization Xiaojun Wan ACL11 [pdf]
A Graph-based Approach to Cross-language Multi-document Summarization Florian Boudin, Stéphane Huet, Juan-Manuel Torres-Moreno [pdf]
Cross-language document summarization based on machine translation quality prediction Xiaojun Wan, Huiying Li, Jianguo Xiao ACL10 [pdf]
Evaluation of a Cross-lingual Romanian-English Multi-document Summariser Constantin Orasan, Oana Andreea Chiorean LREC08 [pdf]
Cross-lingual C*ST*RD: English access to Hindi information Anton Leuski, Chin-Yew Lin, Liang Zhou, Ulrich Germann, Franz Josef Och, Eduard Hovy [pdf]

Multi-modal

Exploiting Pseudo Image Captions for Multimodal Summarization Chaoya Jiang, Rui Xie, Wei Ye, Jinan Sun, Shikun Zhang Findings ACL 2023 [pdf]

[Abs]
Cross-modal contrastive learning in vision language pretraining (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of Mutual Information (MI) optimization. It is common sense that InfoNCE loss used in contrastive learning will maximize the lower bound of MI between anchors and their positives, while we theoretically prove that MI involving negatives also matters when noises commonly exist. Guided by a more general lower bound form for optimization, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, to more accurately optimize MI between an image/text anchor and its negative texts/images instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and systematically balances the beneficial and harmful effects of (partial) false negative samples under theoretical guidance.
Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video Zenan Xu, Xiaojun Meng, Yasheng Wang, Qinliang Su, Zexuan Qiu, Xin Jiang, Qun Liu IJCAI 2023 [pdf]

[Abs]
Multimodal abstractive summarization for videos (MAS) requires generating a concise textual summary to describe the highlights of a video according to multimodal resources, in our case, the video content and its transcript. Inspired by the success of the large-scale generative pre-trained language model (GPLM) in generating high-quality textual content (e.g., summary), recent MAS methods have proposed to adapt the GPLM to this task by equipping it with the visual information, which is often obtained through a general-purpose visual feature extractor. However, the generally extracted visual features may overlook some summary-worthy visual information, which impedes model performance. In this work, we propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization. Our method exploits the summary-worthy information from both the cross-modal transcript data and the knowledge that distills from the pseudo summary. Extensive experiments on three public multimodal datasets show that our method outperforms all competing baselines. Furthermore, with the advantages of summary-worthy visual information, our model can have a significant improvement on small datasets or even datasets with limited training data.
VideoXum: Cross-modal Visual and Textural Summarization of Videos Jingyang Lin, Hang Hua, Ming Chen, Yikang Li, Jenhao Hsiao, Chiuman Ho, Jiebo Luo [pdf] [code]

[Abs]
Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated shortened video clip and text narratives should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After we filter out the videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in our reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model -- VTSUM-BLIP to address the challenges of our proposed task. Moreover, we propose a new metric called VT-CLIPScore to help evaluate the semantic consistency of cross-modality summary. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.
Sample Efficient Multimodal Semantic Augmentation for Incremental Summarization Sumanta Bhattacharyya, Ramesh Manuvinakurike, Sahisnu Mazumder, Saurav Sahay [pdf]

[Abs]
In this work, we develop a prompting approach for incremental summarization of task videos. We develop a sample-efficient few-shot approach for extracting semantic concepts as an intermediate step. We leverage an existing model for extracting the concepts from the images and extend it to videos and introduce a clustering and querying approach for sample efficiency, motivated by the recent advances in perceiver-based architectures. Our work provides further evidence that an approach with richer input context with relevant entities and actions from the videos and using these as prompts could enhance the summaries generated by the model. We show the results on a relevant dataset and discuss possible directions for the work.
Large Scale Multi-Lingual Multi-Modal Summarization Dataset Yash Verma, Anubhav Jangra, Raghvendra Kumar, Sriparna Saha [pdf] [code]

[Abs]
Significant developments in techniques such as encoder-decoder models have enabled us to represent information comprising multiple modalities. This information can further enhance many downstream tasks in the field of information retrieval and natural language processing; however, improvements in multi-modal techniques and their performance evaluation require large-scale multi-modal data which offers sufficient diversity. Multi-lingual modeling for a variety of tasks like multi-modal summarization, text generation, and translation leverages information derived from high-quality multi-lingual annotated data. In this work, we present the current largest multi-lingual multi-modal summarization dataset (M3LS), and it consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair. It is derived from news articles published by British Broadcasting Corporation (BBC) over a decade and spans 20 languages, targeting diversity across five language roots, it is also the largest summarization dataset for 13 languages and consists of cross-lingual summarization data for 2 languages. We formally define the multi-lingual multi-modal summarization task utilizing our dataset and report baseline scores from various state-of-the-art summarization techniques in a multi-lingual setting. We also compare it with many similar datasets to analyze the uniqueness and difficulty of M3LS. https://arxiv.org/abs/2302.06560
Assist Non-native Viewers: Multimodal Cross-Lingual Summarization for How2 Videos Nayu Liu, Kaiwen Wei, Xian Sun, Hongfeng Yu, Fanglong Yao, Li Jin, Guo Zhi, Guangluan Xu EMNLP 2022 [pdf] [data]

[Abs]
Multimodal summarization for videos aims to generate summaries from multi-source information (videos, audio transcripts), which has achieved promising progress. However, existing works are restricted to monolingual video scenarios, ignoring the demands of non-native video viewers to understand the cross-language videos in practical applications. It stimulates us to propose a new task, named Multimodal Cross-Lingual Summarization for videos (MCLS), which aims to generate cross-lingual summaries from multimodal inputs of videos. First, to make it applicable to MCLS scenarios, we conduct a Video-guided Dual Fusion network (VDF) that integrates multimodal and cross-lingual information via diverse fusion strategies at both encoder and decoder. Moreover, to alleviate the problem of high annotation costs and limited resources in MCLS, we propose a triple-stage training framework to assist MCLS by transferring the knowledge from monolingual multimodal summarization data, which includes: 1) multimodal summarization on sufficient prevalent language videos with a VDF model; 2) knowledge distillation (KD) guided adjustment on bilingual transcripts; 3) multimodal summarization for cross-lingual videos with a KD induced VDF model. Experiment results on the reorganized How2 dataset show that the VDF model alone outperforms previous methods for multimodal summarization, and the performance further improves by a large margin via the proposed triple-stage training framework.
TLDW: Extreme Multimodal Summarisation of News Videos Peggy Tang, Kun Hu, Lei Zhang, Jiebo Luo, Zhiyong Wang [pdf]

[Abs]
Multimodal summarisation with multimodal output is drawing increasing attention due to the rapid growth of multimedia data. While several methods have been proposed to summarise visual-text contents, their multimodal outputs are not succinct enough at an extreme level to address the information overload issue. To the end of extreme multimodal summarisation, we introduce a new task, eXtreme Multimodal Summarisation with Multimodal Output (XMSMO) for the scenario of TL;DW - Too Long; Didn't Watch, akin to TL;DR. XMSMO aims to summarise a video-document pair into a summary with an extremely short length, which consists of one cover frame as the visual summary and one sentence as the textual summary. We propose a novel unsupervised Hierarchical Optimal Transport Network (HOT-Net) consisting of three components: hierarchical multimodal encoders, hierarchical multimodal fusion decoders, and optimal transport solvers. Our method is trained, without using reference summaries, by optimising the visual and textual coverage from the perspectives of the distance between the semantic distributions under optimal transport plans. To facilitate the study on this task, we collect a large-scale dataset XMSMO-News by harvesting 4,891 video-document pairs. The experimental results show that our method achieves promising performance in terms of ROUGE and IoU metrics.
Hierarchical3D Adapters for Long Video-to-text Summarization Pinelopi Papalampidi, Mirella Lapata [pdf]

[Abs]
In this paper, we focus on video-to-text summarization and investigate how to best utilize multimodal information for summarizing long inputs (e.g., an hour-long TV show) into long outputs (e.g., a multi-sentence summary). We extend SummScreen (Chen et al., 2021), a dialogue summarization dataset consisting of transcripts of TV episodes with reference summaries, and create a multimodal variant by collecting corresponding full-length videos. We incorporate multimodal information into a pre-trained textual summarizer efficiently using adapter modules augmented with a hierarchical structure while tuning only 3.8% of model parameters. Our experiments demonstrate that multimodal information offers superior performance over more memory-heavy and fully fine-tuned textual summarization methods.
Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization Xinnian Liang, Chenhao Cui, Shuangzhi Wu, Jiali Zeng, Yufan Jiang, Zhoujun Li [pdf]

[Abs]
Most current multi-modal summarization methods follow a cascaded manner, where an off-the-shelf object detector is first used to extract visual features, then these features are fused with language representations to generate the summary with an encoder-decoder model. The cascaded way cannot capture the semantic alignments between images and paragraphs, which are crucial to a precise summary. In this paper, we propose ViL-Sum to jointly model paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization. The core of ViL-Sum is a joint multi-modal encoder with two well-designed tasks, image reordering and image selection. The joint multi-modal encoder captures the interactions between modalities, where the reordering task guides the model to learn paragraph-level semantic alignment and the selection task guides the model to select summary-related images in the final summary. Experimental results show that our proposed ViL-Sum significantly outperforms current state-of-the-art methods. In further analysis, we find that two well-designed tasks and joint multi-modal encoder can effectively guide the model to learn reasonable paragraphs-images and summary-images relations.
MHMS: Multimodal Hierarchical Multimedia Summarization Jielin Qiu, Jiacheng Zhu, Mengdi Xu, Franck Dernoncourt, Trung Bui, Zhaowen Wang, Bo Li, Ding Zhao, Hailin Jin [pdf]
Video Summarization Based on Video-text Representation Li Haopeng, Ke Qiuhong, Gong Mingming, Zhang Rui [pdf]
UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation Zhengkun Zhang, Xiaojun Meng, Yasheng Wang, Xin Jiang, Qun Liu, Zhenglu Yang AAAI 2022 [pdf]
Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization Litian Zhang, Xiaoming Zhang, Junshu Pan, Feiran Huang AAAI 2022 [pdf] [data]
Attention-based Multi-hypothesis Fusion for Speech Summarization Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe [pdf]
Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization Tiezheng Yu, Wenliang Dai, Zihan Liu, Pascale Fung EMNLP 2021 [pdf] [code]
Multi-Modal Supplementary-Complementary Summarization using Multi-Objective Optimization Anubhav Jangra, Sriparna Saha, Adam Jatowt, Mohammad Hasanuzzaman SIGIR 2021 [pdf]
Self-Supervised Multimodal Opinion Summarization Jinbae Im, Moonki Kim, Hoyeop Lee, Hyunsouk Cho, Sehee Chung ACL21 [pdf] [code]
GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization Jia-Hong Huang, Luka Murn, Marta Mrak, Marcel Worring ICMR21 [pdf]
Multimodal Sentence Summarization via Multimodal Selective Encoding Haoran Li, Junnan Zhu, Jiajun Zhang, Xiaodong He, Chengqing Zong COLING20 [pdf]
Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos Nayu Liu, Xian Sun, Hongfeng Yu, Wenkai Zhang, Guangluan Xu EMNLP20 [pdf]
MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention Aman Khullar, Udit Arora EMNLP20 Workshop [pdf] [code]
VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles Mingzhe Li, Xiuying Chen, Shen Gao, Zhangming Chan, Dongyan Zhao, Rui Yan EMNLP20 [pdf] [data]
Multi-modal Summarization for Video-containing Documents Xiyan Fu, Jun Wang, Zhenglu Yang [pdf] [code]
Text-Image-Video Summary Generation Using Joint Integer Linear Programming Anubhav Jangra, Adam Jatowt, Mohammad Hasanuzzaman, Sriparna Saha ECIR20 [pdf]
Aspect-Aware Multimodal Summarization for Chinese E-Commerce Products Haoran Li, Peng Yuan, Song Xu, Youzheng Wu, Xiaodong He, Bowen Zhou AAAI20 [pdf] [code]
Convolutional Hierarchical Attention Network for Query-Focused Video Summarization Shuwen Xiao, Zhou Zhao, Zijian Zhang, Xiaohui Yan, Min Yang AAAI20 [pdf]
Multimodal Summarization with Guidance of Multimodal Reference Junnan Zhu, Yu Zhou, Jiajun Zhang, Haoran Li, Chengqing Zong, Changliang Li AAAI20 [pdf]
EmotionCues: Emotion-Oriented Visual Summarization of Classroom Videos Haipeng Zeng, Xinhuan Shu, Yanbang Wang, Yong Wang, Liguo Zhang, Ting-Chuen Pong, Huamin Qu [pdf]
A Survey on Automatic Summarization Using Multi-Modal Summarization System for Asynchronous Collections Shilpadevi Vasant Bhagwat, Sheetal S. Thokal [pdf]
Extractive summarization of documents with images based on multi-modal RNN Jingqiang Chen, Hai Zhuge [pdf]
Keep Meeting Summaries on Topic: Abstractive Multi-Modal Meeting Summarization Manling Li, Lingyu Zhang, Heng Ji, Richard J. Radke ACL19 [pdf]
Multimodal Abstractive Summarization for How2 Videos Shruti Palaskar, Jindřich Libovický, Spandana Gella, Florian Metze ACL19 [pdf]
MSMO: Multimodal Summarization with Multimodal Output Junnan Zhu, Haoran Li, Tianshang Liu, Yu Zhou, Jiajun Zhang, Chengqing Zong EMNLP18 [pdf] [data]
Abstractive Text-Image Summarization Using Multi-Modal Attentional Hierarchical RNN Jingqiang Chen, Hai Zhuge EMNLP18 [pdf]
Multi-modal Sentence Summarization with Modality Attention and Image Filtering Haoran Li, Junnan Zhu, Tianshang Liu, Jiajun Zhang, Chengqing Zong IJCAI18 [pdf]
Multimodal Abstractive Summarization for Open-Domain Videos Jindřich Libovický, Shruti Palaskar, Spandana Gella, Florian Metze NIPS18 [pdf] [data]
Read, Watch, Listen, and Summarize: Multi-Modal Summarization for Asynchronous Text, Image, Audio and Video Haoran Li, Junnan Zhu, Cong Ma, Jiajun Zhang, Chengqing Zong [pdf]
Fusing Verbal and Nonverbal Information for Extractive Meeting Summarization Fumio Nihei, Yukiko I. Nakano, Yutaka Takase GIFT18 [pdf]
Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video Haoran Li, Junnan Zhu, Cong Ma, Jiajun Zhang, Chengqing Zong EMNLP17 [pdf]
Meeting Extracts for Discussion Summarization Based on Multimodal Nonverbal Information Fumio Nihei, Yukiko I. Nakano, Yutaka Takase ICMI16 [pdf]
Summarizing a multimodal set of documents in a Smart Room Maria Fuentes, Horacio Rodríguez, Jordi Turmo LREC12 [pdf]
Multi-modal summarization of key events and top players in sports tournament videos Dian Tjondronegoro, Xiaohui Tao, Johannes Sasongko and Cher Han Lau [pdf]
Multimodal Summarization of Complex Sentences Naushad UzZaman, Jeffrey P. Bigham, James F. Allen [pdf]
Summarization of Multimodal Information Saif Ahmad, Paulo C F de Oliveira, Khurshid Ahmad LREC04 [pdf]
Multimodal Summarization of Meeting Recordings Berna Erol, Dar-Shyang Lee, and Jonathan Hull ICME03 [pdf]

Sentiment Related

Why Do You Feel This Way? Summarizing Triggers of Emotions in Social Media Posts Hongli Zhan, Tiberiu Sosea, Cornelia Caragea, Junyi Jessy Li EMNLP 2022 [pdf] [code]

[Abs]
Crises such as the COVID-19 pandemic continuously threaten our world and emotionally affect billions of people worldwide in distinct ways. Understanding the triggers leading to people’s emotions is of crucial importance. Social media posts can be a good source of such analysis, yet these texts tend to be charged with multiple emotions, with triggers scattering across multiple sentences. This paper takes a novel angle, namely, emotion detection and trigger summarization, aiming to both detect perceived emotions in text, and summarize events and their appraisals that trigger each emotion. To support this goal, we introduce CovidET (Emotions and their Triggers during Covid-19), a dataset of ~1,900 English Reddit posts related to COVID-19, which contains manual annotations of perceived emotions and abstractive summaries of their triggers described in the post. We develop strong baselines to jointly detect emotions and summarize emotion triggers. Our analyses show that CovidET presents new challenges in emotion-specific summarization, as well as multi-emotion detection in long social media posts.
Making the Best Use of Review Summary for Sentiment Analysis Sen Yang, Leyang Cui, Jun Xie, Yue Zhang COLING20 [pdf] [code] [bib]
A Unified Dual-view Model for Review Summarization and Sentiment Classification with Inconsistency Loss Hou Pong Chan, Wang Chen, Irwin King SIGIR20 [pdf] [code]
A Hierarchical End-to-End Model for Jointly Improving Text Summarization and Sentiment Classification Shuming Ma, Xu Sun, Junyang Lin, Xuancheng Ren IJCAI18 [pdf]
Two-level Text Summarization from Online News Sources with Sentiment Analysis Tarun B. Mirani, Sreela Sasi IEEE17 [pdf]
Creating Video Summarization From Emotion Perspective Yijie Lan, Shikui Wei, Ruoyu Liu, Yao Zhao ICSP16 [pdf]

Pre-trained Language Model Based

SOCRATIC Pretraining: Question-Driven Pretraining for Controllable Summarization Artidoro Pagnoni, Alexander R. Fabbri, Wojciech Kryściński, Chien-Sheng Wu ACL 2023 [pdf] [code]

[Abs]
In long document controllable summarization, where labeled data is scarce, pretrained models struggle to adapt to the task and effectively respond to user queries. In this paper, we introduce Socratic pretraining, a question-driven, unsupervised pretraining objective specifically designed to improve controllability in summarization tasks. By training a model to generate and answer relevant questions in a given context, Socratic pretraining enables the model to more effectively adhere to user-provided queries and identify relevant content to be summarized. We demonstrate the effectiveness of this approach through extensive experimentation on two summarization domains, short stories and dialogue, and multiple control strategies: keywords, questions, and factoid QA pairs. Our pretraining method relies only on unlabeled documents and a question generation system and outperforms pre-finetuning approaches that use additional supervised data. Furthermore, our results show that Socratic pretraining cuts task-specific labeled data requirements in half, is more faithful to user-provided queries, and achieves state-of-the-art performance on QMSum and SQuALITY.
An Analysis of Abstractive Text Summarization Using Pre-trained Models Tohida Rehman, Suchandan Das, Debarshi Kumar Sanyal, Samiran Chattopadhyay [pdf]

[Abs]
People nowadays use search engines like Google, Yahoo, and Bing to find information on the Internet. Due to explosion in data, it is helpful for users if they are provided relevant summaries of the search results rather than just links to webpages. Text summarization has become a vital approach to help consumers swiftly grasp vast amounts of information. In this paper, different pre-trained models for text summarization are evaluated on different datasets. Specifically, we have used three different pre-trained models, namely, google/pegasus-cnn-dailymail, T5-base, facebook/bart-large-cnn. We have considered three different datasets, namely, CNN-dailymail, SAMSum and BillSum to get the output from the above three models. The pre-trained models are compared over these different datasets, each of 2000 examples, through ROUGE and BLEU metrics.
Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization Pengcheng He, Baolin Peng, Liyang Lu, Song Wang, Jie Mei, Yang Liu, Ruochen Xu, Hany Hassan Awadalla, Yu Shi, Chenguang Zhu, Wayne Xiong, Michael Zeng, Jianfeng Gao, Xuedong Huang [pdf]

[Abs]
This paper presents Z-Code++, a new pre-trained language model optimized for abstractive text summarization. The model extends the state of the art encoder-decoder model using three techniques. First, we use a two-phase pre-training process to improve model's performance on low-resource summarization tasks. The model is first pre-trained using text corpora for language understanding, and then is continually pre-trained on summarization corpora for grounded text generation. Second, we replace self-attention layers in the encoder with disentangled attention layers, where each word is represented using two vectors that encode its content and position, respectively. Third, we use fusion-in-encoder, a simple yet effective method of encoding long sequences in a hierarchical manner. Z-Code++ creates new state of the art on 9 out of 13 text summarization tasks across 5 languages. Our model is parameter-efficient in that it outperforms the 600x larger PaLM-540B on XSum, and the finetuned 200x larger GPT3-175B on SAMSum. In zero-shot and few-shot settings, our model substantially outperforms the competing models.
MVP: Multi-task Supervised Pre-training for Natural Language Generation Tianyi Tang, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen [pdf] [code]

[Abs]
Pre-trained language models (PLMs) have achieved notable success in natural language generation (NLG) tasks. Up to now, most of the PLMs are pre-trained in an unsupervised manner using large-scale general corpus. In the meanwhile, an increasing number of models pre-trained with less labeled data showcase superior performance compared to unsupervised models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. For pre-training the text generation model MVP, we collect a labeled pre-training corpus from 45 datasets over seven generation tasks. For each task, we further pre-train specific soft prompts to stimulate the model capacity in performing a specific task. Extensive experiments have demonstrated the effectiveness of our supervised pre-training in a number of NLG tasks, and our general methods achieve state-of-the-art performance on 12 of 17 datasets.
E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao [pdf]
Does Pretraining for Summarization Require Knowledge Transfer? Kundan Krishna, Jeffrey Bigham, Zachary C. Lipton EMNLP 2021 Findings [pdf] [code]
ARMAN: Pre-training with Semantically Selecting and Reordering of Sentences for Persian Abstractive Summarization Alireza Salemi, Emad Kebriaei, Ghazal Neisi Minaei, Azadeh Shakery EMNLP 2021 [pdf] [code]
Leveraging Lead Bias for Zero-shot Abstractive News Summarization Chenguang Zhu, Ziyi Yang, Robert Gmyr, Michael Zeng, Xuedong Huang SIGIR 2021 [pdf]
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhihua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Shang, Peng Sun, Wei Liu, Xuan Ouyang, Dianhai Yu, Hao Tian, Hua Wu, Haifeng Wang [pdf]
BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining Weizhen Qi, Yeyun Gong, Jian Jiao, Yu Yan, Weizhu Chen, Dayiheng Liu, Kewen Tang, Houqiang Li, Jiusheng Chen, Ruofei Zhang, Ming Zhou, Nan Duan ICML 2021 [pdf] [code]
Fact-level Extractive Summarization with Hierarchical Graph Mask on BERT Ruifeng Yuan, Zili Wang, Wenjie Li COLING20 [pdf] [code]
Towards Zero-Shot Conditional Summarization with Adaptive Multi-Task Fine-Tuning Travis Goodwin, Max Savery, Dina Demner-Fushman Findings of EMNLP [pdf] [code]
Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation Alexander R. Fabbri, Simeng Han, Haoyuan Li, Haoran Li, Marjan Ghazvininejad, Shafiq Joty, Dragomir Radev, Yashar Mehdad [pdf]
Pre-trained Summarization Distillation Sam Shleifer, Alexander M. Rush [pdf] [code]
Pre-training for Abstractive Document Summarization by Reinstating Source Text Yanyan Zou, Xingxing Zhang, Wei Lu, Furu Wei, Ming Zhou EMNLP20 [pdf] [code]
PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation Bin Bi, Chenliang Li, Chen Wu, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, Luo Si EMNLP20 [pdf]
TED: A Pretrained Unsupervised Summarization Model with Theme Modeling and Denoising Ziyi Yang, Chenguang Zhu, Robert Gmyr, Michael Zeng, Xuedong Huang, Eric Darve Findings of EMNLP20 [pdf]
QURIOUS: Question Generation Pretraining for Text Generation Shashi Narayan, Gonçalo Simoes, Ji Ma, Hannah Craighead, Ryan Mcdonald ACL20 Short [pdf]
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu ICML20 [pdf] [code]
Abstractive Text Summarization based on Language Model Conditioning and Locality Modeling Dmitrii Aksenov, Julián Moreno-Schneider, Peter Bourgonje, Robert Schwarzenberg, Leonhard Hennig, Georg Rehm LREC20 [pdf]
Abstractive Summarization with Combination of Pre-trained Sequence-to-Sequence and Saliency Models Itsumi Saito, Kyosuke Nishida, Kosuke Nishida, Junji Tomita [pdf]
Learning by Semantic Similarity Makes Abstractive Summarization Better Wonjin Yoon, Yoon Sun Yeo, Minbyul Jeong, Bong-Jun Yi, Jaewoo Kang ICML20 [pdf] [code]
Text Summarization with Pretrained Encoders Yang Liu, Mirella Lapata EMNLP19 [pdf] [code]
HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization Xingxing Zhang, Furu Wei, Ming Zhou ACL19 [pdf]
MASS: Masked Sequence to Sequence Pre-training for Language Generation Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu ICML19 [pdf] [code]
Pretraining-Based Natural Language Generation for Text Summarization Haoyu Zhang, Jianjun Xu, Ji Wang [pdf]
Fine-tune BERT for Extractive Summarization Yang Liu [pdf] [code]
Unified Language Model Pre-training for Natural Language Understanding and Generation Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon NIPS19 [pdf] [code]
Self-Supervised Learning for Contextualized Extractive Summarization Hong Wang, Xin Wang, Wenhan Xiong, Mo Yu, Xiaoxiao Guo, Shiyu Chang, William Yang Wang ACL19 [pdf] [code]
Efficient Adaptation of Pretrained Transformers for Abstractive Summarization Andrew Hoang, Antoine Bosselut, Asli Celikyilmaz, Yejin Choi [pdf] [code]
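
One of the entries above (An Analysis of Abstractive Text Summarization Using Pre-trained Models) compares models through ROUGE and BLEU. As a rough illustration of what an n-gram overlap score of the ROUGE-N kind measures, here is a minimal sketch. It assumes whitespace tokenization and lowercasing, skips stemming and multi-reference handling, and the function name `rouge_n` is ours; it is not the official ROUGE toolkit and its numbers will not match published scores exactly.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Simplified ROUGE-N: clipped n-gram overlap between two strings.

    Assumption: tokens are lowercased and split on whitespace only.
    """

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge_n("the cat sat", "the cat sat on the mat")
    print(scores)
```

In the example call, every candidate unigram appears in the reference, so precision is 1.0, while only three of the six reference unigrams are covered, so recall is 0.5.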

Controllable

Summarization with Precise Length Control Lesly Miculicich, Yujia Xie, Song Wang, Pengcheng He [pdf]

[Abs]
Many applications of text generation such as summarization benefit from accurately controlling the text length. Existing approaches on length-controlled summarization either result in degraded performance or can only control the length approximately. In this work, we present a framework to generate summaries with precisely the specified number of tokens or sentences, while maintaining or even improving the text quality. In addition, we jointly train the models to predict the lengths, so our model can generate summaries with optimal length. We evaluate the proposed framework on the CNNDM dataset and show improved performance compared to existing methods.
HydraSum: Disentangling Style Features in Text Summarization with Multi-Decoder Models Tanya Goyal, Nazneen Rajani, Wenhao Liu, Wojciech Kryscinski EMNLP 2022 [pdf] [code]

[Abs]
Summarization systems make numerous “decisions” about summary properties during inference, e.g. degree of copying, specificity and length of outputs, etc. However, these are implicitly encoded within model parameters and specific styles cannot be enforced. To address this, we introduce HydraSum, a new summarization architecture that extends the single decoder framework of current models to a mixture-of-experts version with multiple decoders. We show that HydraSum’s multiple decoders automatically learn contrasting summary styles when trained under the standard training objective without any extra supervision. Through experiments on three summarization datasets (CNN, Newsroom and XSum), we show that HydraSum provides a simple mechanism to obtain stylistically-diverse summaries by sampling from either individual decoders or their mixtures, outperforming baseline models. Finally, we demonstrate that a small modification to the gating strategy during training can enforce an even stricter style partitioning, e.g. high- vs low-abstractiveness or high- vs low-specificity, allowing users to sample from a larger area in the generation space and vary summary styles along multiple dimensions.
Socratic Pretraining: Question-Driven Pretraining for Controllable Summarization Artidoro Pagnoni, Alexander R. Fabbri, Wojciech Kryściński, Chien-Sheng Wu [pdf] [code]

[Abs]
In long document controllable summarization, where labeled data is scarce, pretrained models struggle to adapt to the task and effectively respond to user queries. In this paper, we introduce Socratic pretraining, a question-driven, unsupervised pretraining objective specifically designed to improve controllability in summarization tasks. By training a model to generate and answer relevant questions in a given context, Socratic pretraining enables the model to more effectively adhere to user-provided queries and identify relevant content to be summarized. We demonstrate the effectiveness of this approach through extensive experimentation on two summarization domains, short stories and dialogue, and multiple control strategies: keywords, questions, and factoid QA pairs. Our pretraining method relies only on unlabeled documents and a question generation system and outperforms pre-finetuning approaches that use additional supervised data. Furthermore, our results show that Socratic pretraining cuts task-specific labeled data requirements in half, is more faithful to user-provided queries, and achieves state-of-the-art performance on QMSum and SQuALITY.
Attend to the Right Context: A Plug-and-Play Module for Content-Controllable Summarization Wen Xiao, Lesly Miculicich, Yang Liu, Pengcheng He, Giuseppe Carenini [pdf] [code]

[Abs]
Content-Controllable Summarization generates summaries focused on the given controlling signals. Due to the lack of large-scale training corpora for the task, we propose a plug-and-play module RelAttn to adapt any general summarizers to the content-controllable summarization task. RelAttn first identifies the relevant content in the source documents, and then makes the model attend to the right context by directly steering the attention weight. We further apply an unsupervised online adaptive parameter searching algorithm to determine the degree of control in the zero-shot setting, while such parameters are learned in the few-shot setting. By applying the module to three backbone summarization models, experiments show that our method effectively improves all the summarizers, and outperforms the prefix-based method and a widely used plug-and-play model in both zero- and few-shot settings. Tellingly, more benefit is observed in the scenarios when more control is needed.
MACSUM: Controllable Summarization with Mixed Attributes Yusen Zhang, Yang Liu, Ziyi Yang, Yuwei Fang, Yulong Chen, Dragomir Radev, Chenguang Zhu, Michael Zeng, Rui Zhang [pdf] [code]

[Abs]
Controllable summarization allows users to generate customized summaries with specified attributes. However, due to the lack of designated annotations of controlled summaries, existing works have to craft pseudo datasets by adapting generic summarization benchmarks. Furthermore, most research focuses on controlling single attributes individually (e.g., a short summary or a highly abstractive summary) rather than controlling a mix of attributes together (e.g., a short and highly abstractive summary). In this paper, we propose MACSum, the first human-annotated summarization dataset for controlling mixed attributes. It contains source texts from two domains, news articles and dialogues, with human-annotated summaries controlled by five designed attributes (Length, Extractiveness, Specificity, Topic, and Speaker). We propose two simple and effective parameter-efficient approaches for the new task of mixed controllable summarization based on hard prompt tuning and soft prefix tuning. Results and analysis demonstrate that hard prompt models yield the best performance on all metrics and human evaluations. However, mixed-attribute control is still challenging for summarization tasks. Our dataset and code are publicly available.
SentBS: Sentence-level Beam Search for Controllable Summarization Chenhui Shen, Liying Cheng, Lidong Bing, Yang You, Luo Si EMNLP 2022 [pdf] [code]

[Abs]
A wide range of control perspectives have been explored in controllable text generation. Structure-controlled summarization is recently proposed as a useful and interesting research direction. However, current structure-controlling methods have limited effectiveness in enforcing the desired structure. To address this limitation, we propose a sentence-level beam search generation method (SentBS), where evaluation is conducted throughout the generation process to select suitable sentences for subsequent generations. We experiment with different combinations of decoding methods to be used as subcomponents by SentBS and evaluate results on the structure-controlled dataset MReD. Experiments show that all explored combinations for SentBS can improve the agreement between the generated text and the desired structure, with the best method significantly reducing the structural discrepancies suffered by the existing model, by approximately 68%.
Readability Controllable Biomedical Document Summarization Zheheng Luo, Qianqian Xie, Sophia Ananiadou Findings of EMNLP 2022 [pdf]

[Abs]
Different from general documents, it is recognised that the ease with which people can understand a biomedical text is eminently varied, owing to the highly technical nature of biomedical documents and the variance of readers' domain knowledge. However, existing biomedical document summarization systems have paid little attention to readability control, leaving users with summaries that are incompatible with their levels of expertise. In recognition of this urgent demand, we introduce a new task of readability controllable summarization for biomedical documents, which aims to recognise users' readability demands and generate summaries that better suit their needs: technical summaries for experts and plain language summaries (PLS) for laymen. To establish this task, we construct a corpus consisting of biomedical papers with technical summaries and PLSs written by the authors, and benchmark multiple advanced controllable abstractive and extractive summarization models based on pre-trained language models (PLMs) with prevalent controlling and generation techniques. Moreover, we propose a novel masked language model (MLM) based metric and its variant to effectively evaluate the readability discrepancy between lay and technical summaries. Experimental results from automated and human evaluations show that though current control techniques allow for a certain degree of readability adjustment during generation, the performance of existing controllable summarization methods is far from desirable in this task.
EDU-level Extractive Summarization with Varying Summary Lengths Yuping Wu, Ching-Hsun Tseng, Jiayu Shang, Shengzhong Mao, Goran Nenadic, Xiao-Jun Zeng [pdf]

[Abs]
Extractive models usually formulate text summarization as extracting top-k important sentences from document as summary. Few work exploited extracting finer-grained Elementary Discourse Unit (EDU) and there is little analysis and justification for the extractive unit selection. To fill such a gap, this paper firstly conducts oracle analysis to compare the upper bound of performance for models based on EDUs and sentences. The analysis provides evidences from both theoretical and experimental perspectives to justify that EDUs make more concise and precise summary than sentences without losing salient information. Then, considering this merit of EDUs, this paper further proposes EDU-level extractive model with Varying summary Lengths (EDU-VL) and develops the corresponding learning algorithm. EDU-VL learns to encode and predict probabilities of EDUs in document, and encode EDU-level candidate summaries with different lengths based on various k values and select the best candidate summary in an end-to-end training manner. Finally, the proposed and developed approach is experimented on single and multi-document benchmark datasets and shows the improved performances in comparison with the state-of-the-art models.
Topic-Aware Evaluation and Transformer Methods for Topic-Controllable Summarization Tatiana Passali, Grigorios Tsoumakas [pdf] [code]

[Abs]
Topic-controllable summarization is an emerging research area with a wide range of potential applications. However, existing approaches suffer from significant limitations. First, there is currently no established evaluation metric for this task. Furthermore, existing methods built upon recurrent architectures, which can significantly limit their performance compared to more recent Transformer-based architectures, while they also require modifications to the model's architecture for controlling the topic. In this work, we propose a new topic-oriented evaluation measure to automatically evaluate the generated summaries based on the topic affinity between the generated summary and the desired topic. We also conducted a user study that validates the reliability of this measure. Finally, we propose simple, yet powerful methods for topic-controllable summarization either incorporating topic embeddings into the model's architecture or employing control tokens to guide the summary generation. Experimental results show that control tokens can achieve better performance compared to more complicated embedding-based approaches while being at the same time significantly faster.
Length Control in Abstractive Summarization by Pretraining Information Selection Yizhu Liu, Qi Jia, Kenny Zhu ACL 2022 [pdf] [code]

[Abs]
Previous length-controllable summarization models mostly control lengths at the decoding stage, whereas the encoding or the selection of information from the source document is not sensitive to the designed length. They also tend to generate summaries as long as those in the training data. In this paper, we propose a length-aware attention mechanism (LAAM) to adapt the encoding of the source based on the desired length. Our approach works by training LAAM on a summary length balanced dataset built from the original training data, and then fine-tuning as usual. Results show that this approach is effective in generating high-quality summaries with desired lengths and even those short lengths never seen in the original training set.
A Character-Level Length-Control Algorithm for Non-Autoregressive Sentence Summarization Puyuan Liu, Xiang Zhang, Lili Mou [pdf] [code]
EntSUM: A Data Set for Entity-Centric Summarization Mounica Maddela, Mayank Kulkarni, Daniel Preotiuc-Pietro ACL 2022 [pdf] [code] [data]
Reinforced Abstractive Summarization with Adaptive Length Controlling Mingyang Song, Yi Feng, Liping Jing [pdf]
HydraSum -- Disentangling Stylistic Features in Text Summarization using Multi-Decoder Models Tanya Goyal, Nazneen Fatema Rajani, Wenhao Liu, Wojciech Kryściński [pdf]
RetrievalSum: A Retrieval Enhanced Framework for Abstractive Summarization Chenxin An, Ming Zhong, Zhichao Geng, Jianqiang Yang, Xipeng Qiu [pdf]
Aspect-Controllable Opinion Summarization Reinald Kim Amplayo, Stefanos Angelidis, Mirella Lapata EMNLP 2021 [pdf] [code]
Extract, Denoise, and Enforce: Evaluating and Predicting Lexical Constraints for Conditional Text Generation Yuning Mao, Wenchang Ma, Deren Lei, Xiang Ren [pdf] [code]
Planning with Learned Entity Prompts for Abstractive Summarization Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Simoes, Ryan McDonald TACL [pdf]
GSum: A General Framework for Guided Neural Abstractive Summarization Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, Graham Neubig NAACL21 [pdf] [code]
Abstractive summarization with combination of pre-trained sequence-to-sequence and saliency models Itsumi Saito, Kyosuke Nishida, Kosuke Nishida, Junji Tomita [pdf]
Self-Supervised and Controlled Multi-Document Opinion Summarization Hady Elsahar, Maximin Coavoux, Jos Rozen, Matthias Gallé EACL 2021 [pdf]
Controllable Summarization with Constrained Markov Decision Process Hou Pong Chan, Lu Wang, Irwin King TACL 2021 [pdf] [code]
LenAtten: An Effective Length Controlling Unit For Text Summarization Zhongyi Yu, Zhenghao Wu, Hao Zheng, Zhe XuanYuan, Jefferson Fong, Weifeng Su Findings of ACL 2021 (short) [pdf] [code]
Controllable Abstractive Dialogue Summarization with Sketch Supervision Chien-Sheng Wu, Linqing Liu, Wenhao Liu, Pontus Stenetorp, Caiming Xiong ACL-Findings 2021 [pdf] [code]
Enhancing Factual Consistency of Abstractive Summarization Chenguang Zhu, William Hinthorn, Ruochen Xu, Qingkai Zeng, Michael Zeng, Xuedong Huang, Meng Jiang NAACL21 [pdf]
Inference Time Style Control for Summarization Shuyang Cao, Lu Wang NAACL21 short [pdf] [code]
CTRLsum: Towards Generic Controllable Text Summarization Junxian He, Wojciech Kryściński, Bryan McCann, Nazneen Rajani, Caiming Xiong [pdf] [code]
Constrained Abstractive Summarization: Preserving Factual Consistency with Constrained Generation Yuning Mao, Xiang Ren, Heng Ji, Jiawei Han [pdf]
Keywords-Guided Abstractive Sentence Summarization Haoran Li, Junnan Zhu, Jiajun Zhang, Chengqing Zong, Xiaodong He AAAI20 [pdf]
SemSUM: Semantic Dependency Guided Neural Abstractive Summarization Hanqi Jin, Tianming Wang, Xiaojun Wan AAAI2020 [pdf] [code]
Interpretable Multi-Headed Attention for Abstractive Summarization at Controllable Lengths Ritesh Sarkhel, Moniba Keymanesh, Arnab Nandi, Srinivasan Parthasarathy COLING20 [pdf]
Controllable Abstractive Sentence Summarization with Guiding Entities Changmeng Zheng, Yi Cai, Guanjie Zhang, Qing Li COLING20 [pdf] [code]
Summarizing Text on Any Aspects: A Knowledge-Informed Weakly-Supervised Approach Bowen Tan, Lianhui Qin, Eric P. Xing, Zhiting Hu EMNLP20 Short [pdf] [code]
Length-controllable Abstractive Summarization by Guiding with Summary Prototype Itsumi Saito, Kyosuke Nishida, Kosuke Nishida, Atsushi Otsuka, Hisako Asano, Junji Tomita, Hiroyuki Shindo, Yuji Matsumoto [pdf]
The Summary Loop: Learning to Write Abstractive Summaries Without Examples Philippe Laban, Andrew Hsi, John Canny, Marti A. Hearst ACL20 [pdf]
Hooks in the Headline: Learning to Generate Headlines with Controlled Styles Di Jin, Zhijing Jin, Joey Tianyi Zhou, Lisa Orii, Peter Szolovits ACL20 [pdf] [code]
BiSET: Bi-directional Selective Encoding with Template for Abstractive Summarization Kai Wang, Xiaojun Quan, Rui Wang ACL19 [pdf] [code]
Improving Abstractive Document Summarization with Salient Information Modeling Yongjian You, Weijia Jia, Tianyi Liu, Wenmian Yang ACL19 [pdf] [code]
Positional Encoding to Control Output Sequence Length Sho Takase, Naoaki Okazaki NAACL19 [pdf] [code]
Query Focused Abstractive Summarization: Incorporating Query Relevance, Multi-Document Coverage, and Summary Length Constraints into seq2seq Models Tal Baumel, Matan Eyal, Michael Elhadad [pdf]
Guiding Generation for Abstractive Text Summarization based on Key Information Guide Network Chenliang Li, Weiran Xu, Si Li, Sheng Gao NAACL18 [pdf]
Controllable Abstractive Summarization Angela Fan, David Grangier, Michael Auli ACL2018 Workshop [pdf]
Retrieve, Rerank and Rewrite: Soft Template Based Neural Summarization Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei ACL18 [pdf]
Controlling Length in Abstractive Summarization Using a Convolutional Neural Network Yizhu Liu, Zhiyi Luo, Kenny Zhu EMNLP18 [pdf] [code]
Generating Wikipedia by Summarizing Long Sequences Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer ICLR18 [pdf] [code]
Controlling Output Length in Neural Encoder-Decoders Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, Manabu Okumura EMNLP16 [pdf] [code]
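
Several abstracts in this section mention steering a summarizer with control tokens, for example to request a length or a topic. The idea can be sketched as simple input preprocessing: a marker for each requested attribute is prepended to the source before it is fed to a sequence-to-sequence model. The sketch below is a generic illustration, not the implementation of any paper listed here. The token formats (`<len_k>`, `<topic_x>`), the bucket boundaries, and both function names are hypothetical.

```python
from typing import Optional, Sequence


def length_bucket(n_tokens: int, boundaries: Sequence[int] = (20, 50, 100)) -> int:
    """Map a desired summary length to a bucket index.

    The boundaries are hypothetical; in practice they might be chosen
    from the length distribution of the training summaries.
    """
    for i, upper in enumerate(boundaries):
        if n_tokens <= upper:
            return i
    return len(boundaries)


def add_control_tokens(source: str,
                       desired_len: Optional[int] = None,
                       topic: Optional[str] = None) -> str:
    """Prepend hypothetical control tokens to the source text."""
    prefix = []
    if desired_len is not None:
        prefix.append(f"<len_{length_bucket(desired_len)}>")
    if topic is not None:
        prefix.append(f"<topic_{topic.lower().replace(' ', '_')}>")
    return " ".join(prefix + [source])


if __name__ == "__main__":
    print(add_control_tokens("The match ended 2-1.", desired_len=30, topic="Sports"))
```

For a model to respect such markers, it would have to be trained on inputs where the same markers are derived from the reference summaries, and the markers would usually be registered as special tokens in the tokenizer. Both steps are outside this sketch.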

Abstractive

Exploiting Summarization Data to Help Text Simplification Renliang Sun, Zhixian Yang, Xiaojun Wan EACL 2023 [pdf] [code]

[Abs]
One of the major problems with text simplification is the lack of high-quality data. The sources of simplification datasets are limited to Wikipedia and Newsela, restricting further development of this field. In this paper, we analyzed the similarity between text summarization and text simplification and exploited summarization data to help simplify. First, we proposed an alignment algorithm to extract sentence pairs from summarization datasets. Then, we designed four attributes to characterize the degree of simplification and proposed a method to filter suitable pairs. We named these pairs Sum4Simp (S4S). Next, we conducted human evaluations to show that S4S is high-quality and compared it with a real simplification dataset. Finally, we conducted experiments to illustrate that the S4S can improve the performance of several mainstream simplification models, especially in low-resource scenarios.
Curriculum-Guided Abstractive Summarization Sajad Sotudeh, Hanieh Deilamsalehy, Franck Dernoncourt, Nazli Goharian [pdf]

[Abs]
Recent Transformer-based summarization models have provided a promising approach to abstractive summarization. They go beyond sentence selection and extractive strategies to deal with more complicated tasks such as novel word generation and sentence paraphrasing. Nonetheless, these models have two shortcomings: (1) they often perform poorly in content selection, and (2) their training strategy is not quite efficient, which restricts model performance. In this paper, we explore two orthogonal ways to compensate for these pitfalls. First, we augment the Transformer network with a sentence cross-attention module in the decoder, encouraging more abstraction of salient content. Second, we include a curriculum learning approach to reweight the training samples, bringing about an efficient learning procedure. Our second approach to enhance the training strategy of Transformers networks makes stronger gains as compared to the first approach. We apply our model on extreme summarization dataset of Reddit TIFU posts. We further look into three cross-domain summarization datasets (Webis-TLDR-17, CNN/DM, and XSum), measuring the efficacy of curriculum learning when applied in summarization. Moreover, a human evaluation is conducted to show the efficacy of the proposed method in terms of qualitative criteria, namely, fluency, informativeness, and overall quality.
R-TeaFor: Regularized Teacher-Forcing for Abstractive Summarization Guan-Yu Lin, Pu-Jen Cheng EMNLP 2022 [pdf]

[Abs]
Teacher-forcing is widely used in training sequence generation models to improve sampling efficiency and to stabilize training. However, teacher-forcing is vulnerable to the exposure bias problem. Previous works have attempted to address exposure bias by modifying the training data to simulate model-generated results. Nevertheless, they do not consider the pairwise relationship between the original training data and the modified ones, which provides more information during training. Hence, we propose Regularized Teacher-Forcing (R-TeaFor) to utilize this relationship for better regularization. Empirically, our experiments show that R-TeaFor outperforms previous summarization state-of-the-art models, and the results can be generalized to different pre-trained models.
Improving abstractive summarization with energy-based re-ranking Diogo Pernes, Afonso Mendes, André F.T. Martins GEM at EMNLP 2022 [pdf] [code]

[Abs]
Current abstractive summarization systems present important weaknesses which prevent their deployment in real-world applications, such as the omission of relevant information and the generation of factual inconsistencies (also known as hallucinations). At the same time, automatic evaluation metrics such as CTC scores have been recently proposed that exhibit a higher correlation with human judgments than traditional lexical-overlap metrics such as ROUGE. In this work, we intend to close the loop by leveraging the recent advances in summarization metrics to create quality-aware abstractive summarizers. Namely, we propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics. We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries. Nonetheless, human evaluation results show that the re-ranking approach should be used with care for highly abstractive summaries, as the available metrics are not yet sufficiently reliable for this purpose.
Salience Allocation as Guidance for Abstractive Summarization Fei Wang, Kaiqiang Song, Hongming Zhang, Lifeng Jin, Sangwoo Cho, Wenlin Yao, Xiaoyang Wang, Muhao Chen, Dong Yu EMNLP 2022 [pdf] [code]

[Abs]
Abstractive summarization models typically learn to capture the salient information from scratch implicitly. Recent literature adds extractive summaries as guidance for abstractive summarization models to provide hints of salient content and achieves better performance. However, extractive summaries as guidance could be over strict, leading to information loss or noisy signals. Furthermore, it cannot easily adapt to documents with various abstractiveness. As the number and allocation of salience content pieces vary, it is hard to find a fixed threshold deciding which content should be included in the guidance. In this paper, we propose a novel summarization approach with a flexible and reliable salience guidance, namely SEASON (SaliencE Allocation as Guidance for Abstractive SummarizatiON). SEASON utilizes the allocation of salience expectation to guide abstractive summarization and adapts well to articles in different abstractiveness. Automatic and human evaluations on two benchmark datasets show that the proposed method is effective and reliable. Empirical results on more than one million news articles demonstrate a natural fifty-fifty salience split for news article sentences, providing a useful insight for composing news articles.
Towards Summary Candidates Fusion Mathieu Ravaut, Shafiq Joty, Nancy F. Chen EMNLP 2022 [pdf] [code]

[Abs]
Sequence-to-sequence deep neural models fine-tuned for abstractive summarization can achieve great performance on datasets with enough human annotations. Yet, it has been shown that they have not reached their full potential, with a wide gap between the top beam search output and the oracle beam. Recently, re-ranking methods have been proposed, to learn to select a better summary candidate. However, such methods are limited by the summary quality aspects captured by the first-stage candidates. To bypass this limitation, we propose a new paradigm in second-stage abstractive summarization called SummaFusion that fuses several summary candidates to produce a novel abstractive second-stage summary. Our method works well on several summarization datasets, improving both the ROUGE scores and qualitative properties of fused summaries. It is especially good when the candidates to fuse are worse, such as in the few-shot setup where we set a new state-of-the-art. We will make our code and checkpoints available at this https URL.
Generation of Patient After-Visit Summaries to Support Physicians Pengshan Cai, Fei Liu, Adarsha Bajracharya, Joe Sills, Alok Kapoor, Weisong Liu, Dan Berlowitz, David Levy, Richeek Pradhan, Hong Yu [pdf] [code]

[Abs]
An after-visit summary (AVS) is a summary note given to patients after their clinical visit. It recaps what happened during their clinical visit and guides patients’ disease self-management. Studies have shown that a majority of patients found after-visit summaries useful. However, many physicians face excessive workloads and do not have time to write clear and informative summaries. In this paper, we study the problem of automatic generation of after-visit summaries and examine whether those summaries can convey the gist of clinical visits. We report our findings on a new clinical dataset that contains a large number of electronic health record (EHR) notes and their associated summaries. Our results suggest that generation of lay language after-visit summaries remains a challenging task. Crucially, we introduce a feedback mechanism that alerts physicians when an automatic summary fails to capture the important details of the clinical notes or when it contains hallucinated facts that are potentially detrimental to the summary quality. Automatic and human evaluation demonstrates the effectiveness of our approach in providing writing feedback and supporting physicians.
ArgLegalSumm: Improving Abstractive Summarization of Legal Documents with Argument Mining Mohamed Elaraby, Diane Litman COLING 2022 [pdf] [code]

[Abs]
A challenging task when generating summaries of legal documents is the ability to address their argumentative nature. We introduce a simple technique to capture the argumentative structure of legal documents by integrating argument role labeling into the summarization process. Experiments with pretrained language models show that our proposed approach improves performance over strong baselines.
Source-summary Entity Aggregation in Abstractive Summarization José Ángel González, Annie Louis, Jackie Chi Kit Cheung COLING 2022 [pdf] [code]

[Abs]
In a text, entities mentioned earlier can be referred to in later discourse by a more general description. For example, Celine Dion and Justin Bieber can be referred to by Canadian singers or celebrities. In this work, we study this phenomenon in the context of summarization, where entities from a source text are generalized in the summary. We call such instances source-summary entity aggregations. We categorize these aggregations into two types and analyze them in the CNN/DailyMail corpus, showing that they are reasonably frequent. We then examine how well three state-of-the-art summarization systems can generate such aggregations within summaries. We also develop techniques to encourage them to generate more aggregations. Our results show that there is significant room for improvement in producing semantically correct aggregations.
Summarizing Patients Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models Yanjun Gao, Dmitry Dligach, Timothy Miller, Dongfang Xu, Matthew M. Churpek, Majid Afshar COLING 2022 [pdf]

[Abs]
Automatically summarizing patients' main problems from daily progress notes using natural language processing methods helps to battle against information and cognitive overload in hospital settings and potentially assists providers with computerized diagnostic decision support. Problem list summarization requires a model to understand, abstract, and generate clinical documentation. In this work, we propose a new NLP task that aims to generate a list of problems in a patient's daily care plan using input from the provider's progress notes during hospitalization. We investigate the performance of T5 and BART, two state-of-the-art seq2seq transformer architectures, in solving this problem. We provide a corpus built on top of progress notes from publicly available electronic health record progress notes in the Medical Information Mart for Intensive Care (MIMIC)-III. T5 and BART are trained on general domain text, and we experiment with a data augmentation method and a domain adaptation pre-training method to increase exposure to medical vocabulary and knowledge. Evaluation methods include ROUGE, BERTScore, cosine similarity on sentence embedding, and F-score on medical concepts. Results show that T5 with domain adaptive pre-training achieves significant performance gains compared to a rule-based system and general domain pre-trained language models, indicating a promising direction for tackling the problem summarization task.
Semantic-Preserving Abstractive Text Summarization with Siamese Generative Adversarial Net Xin Sheng, Linli Xu, Yinlong Xu, Deqiang Jiang, Bo Ren Findings of NAACL 2022 [pdf]

[Abs]
We propose a novel siamese generative adversarial net for abstractive text summarization (SSPGAN), which can preserve the main semantics of the source text. Different from previous generative adversarial net based methods, SSPGAN is equipped with a siamese semantic-preserving discriminator, which can not only be trained to discriminate the machine-generated summaries from the human-summarized ones, but also ensure the semantic consistency between the source text and target summary. As a consequence of the min-max game between the generator and the siamese semantic-preserving discriminator, the generator can generate a summary that conveys the key content of the source text more accurately. Extensive experiments on several text summarization benchmarks in different languages demonstrate that the proposed model can achieve significant improvements over the state-of-the-art methods.
ExtraPhrase: Efficient Data Augmentation for Abstractive Summarization Mengsay Loem, Sho Takase, Masahiro Kaneko, Naoaki Okazaki NAACL 2022 Student Research Workshop [pdf] [code]

[Abs]
Neural models trained with large amount of parallel data have achieved impressive performance in abstractive summarization tasks. However, large-scale parallel corpora are expensive and challenging to construct. In this work, we introduce a low-cost and effective strategy, ExtraPhrase, to augment training data for abstractive summarization tasks. ExtraPhrase constructs pseudo training data in two steps: extractive summarization and paraphrasing. We extract major parts of an input text in the extractive summarization step and obtain its diverse expressions with the paraphrasing step. Through experiments, we show that ExtraPhrase improves the performance of abstractive summarization tasks by more than 0.50 points in ROUGE scores compared to the setting without data augmentation. ExtraPhrase also outperforms existing methods such as back-translation and self-training. We also show that ExtraPhrase is significantly effective when the amount of genuine training data is remarkably small, i.e., a low-resource setting. Moreover, ExtraPhrase is more cost-efficient than the existing approaches.
BRIO: Bringing Order to Abstractive Summarization Yixin Liu, Pengfei Liu, Dragomir Radev, Graham Neubig ACL 2022 [pdf] [code]

[Abs]
Abstractive summarization models are commonly trained using maximum likelihood estimation, which assumes a deterministic (one-point) target distribution in which an ideal model will assign all the probability mass to the reference summary. This assumption may lead to performance degradation during inference, where the model needs to compare several system-generated (candidate) summaries that have deviated from the reference summary. To address this problem, we propose a novel training paradigm which assumes a non-deterministic distribution so that different candidate summaries are assigned probability mass according to their quality. Our method achieves a new state-of-the-art result on the CNN/DailyMail (47.78 ROUGE-1) and XSum (49.07 ROUGE-1) datasets. Further analysis also shows that our model can estimate probabilities of candidate summaries that are more correlated with their level of quality.
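The idea in the abstract above, that better candidates should receive more probability mass than worse ones, can be illustrated with a pairwise margin ranking loss. This is a minimal sketch under stated assumptions, not the paper's exact objective: the function name `pairwise_ranking_loss`, the `margin` value, and the convention that candidates arrive already sorted from best to worst are all hypothetical choices made for illustration.

```python
from typing import Sequence


def pairwise_ranking_loss(scores: Sequence[float], margin: float = 0.01) -> float:
    """Margin ranking loss over candidates sorted from best to worst quality.

    scores[i] is the model's score for the i-th candidate (for example a
    length-normalized log-probability); index 0 is assumed to be the
    highest-quality candidate according to some external metric.
    """
    loss = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            # A better candidate i should outscore a worse candidate j by a
            # margin that grows with their rank distance (j - i).
            loss += max(0.0, scores[j] - scores[i] + (j - i) * margin)
    return loss


# Scores that already respect the quality order incur no loss.
well_ordered = pairwise_ranking_loss([-0.1, -0.5, -0.9])
# Scores in the reverse order are penalized for every pair.
mis_ordered = pairwise_ranking_loss([-0.9, -0.5, -0.1])
```

In a training setup such a term would typically be combined with the usual likelihood loss; how the two are weighted is left open here.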
SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization Mathieu Ravaut, Shafiq Joty, Nancy F. Chen ACL 2022 [pdf] [code]

[Abs]
Sequence-to-sequence neural networks have recently achieved great success in abstractive summarization, especially through fine-tuning large pre-trained language models on the downstream dataset. These models are typically decoded with beam search to generate a unique summary. However, the search space is very large, and with the exposure bias, such decoding is not optimal. In this paper, we show that it is possible to directly train a second-stage model performing re-ranking on a set of summary candidates. Our mixture-of-experts SummaReranker learns to select a better candidate and consistently improves the performance of the base model. With a base PEGASUS, we push ROUGE scores by 5.44% on CNN-DailyMail (47.16 ROUGE-1), 1.31% on XSum (48.12 ROUGE-1) and 9.34% on Reddit TIFU (29.83 ROUGE-1), reaching a new state-of-the-art. Our code and checkpoints will be available at https://github.com/ntunlp/SummaReranker.
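Second-stage re-ranking, as described in the abstract above, amounts to scoring each candidate summary and keeping the best one. The sketch below is illustrative only: `rerank` and `unigram_f1` are hypothetical helpers, and the scorer used here is an oracle that peeks at the reference (a crude stand-in for ROUGE-1), which is useful only for measuring the upper bound a learned re-ranker could reach. At inference time no reference exists, so a trained model would take the scorer's place.

```python
from collections import Counter
from typing import Callable, List


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 on whitespace tokens (a rough ROUGE-1 stand-in)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rerank(candidates: List[str], scorer: Callable[[str], float]) -> str:
    """Return the candidate with the highest score under `scorer`."""
    return max(candidates, key=scorer)


reference = "the cat sat on the mat"
candidates = [
    "a dog ran",
    "the cat sat",
    "the cat sat on the mat today",
]
# Oracle selection: the scorer has access to the reference summary.
best = rerank(candidates, lambda c: unigram_f1(c, reference))
```

Comparing the oracle choice with the top beam output on a validation set gives a quick estimate of how much headroom re-ranking has for a given base model.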
Adaptive Beam Search to Enhance On-device Abstractive Summarization Harichandana B S S, Sumit Kumar IEEE INDICON 2021 [pdf]
PLSUM: Generating PT-BR Wikipedia by Summarizing Multiple Websites André Seidel Oliveira, Anna Helena Reali Costa ENIAC 2021 [pdf]
Pointer over Attention: An Improved Bangla Text Summarization Approach Using Hybrid Pointer Generator Network Nobel Dhar, Gaurob Saha, Prithwiraj Bhattacharjee, Avi Mallick, Md Saiful Islam [pdf]
Template-aware Attention Model for Earnings Call Report Generation Yangchen Huang, Prashant K. Dhingra, Seyed Danial Mohseni Taheri EMNLP 2021 | newsum [pdf]
Rewards with Negative Examples for Reinforced Topic-Focused Abstractive Summarization Khalil Mrini, Can Liu, Markus Dreyer EMNLP 2021 | newsum [pdf]
Knowledge and Keywords Augmented Abstractive Sentence Summarization Shuo Guan, Ping Zhu, Zhihua Wei EMNLP 2021 | newsum [pdf] [code]
Sentence-level Planning for Especially Abstractive Summarization Andreas Marfurt, James Henderson EMNLP 2021 | newsum [pdf] [code]
Learn to Copy from the Copying History: Correlational Copy Network for Abstractive Summarization Haoran Li, Song Xu, Peng Yuan, Yujia Wang, Youzheng Wu, Xiaodong He, Bowen Zhou EMNLP 2021 [pdf] [code]
Enhance Long Text Understanding via Distilled Gist Detector from Abstractive Summarization Yan Liu, Yazheng Yang [pdf]
VieSum: How Robust Are Transformer-based Models on Vietnamese Summarization? Hieu Nguyen, Long Phan, James Anibal, Alec Peltekian, Hieu Tran [pdf]
Enriching and Controlling Global Semantics for Text Summarization Thong Nguyen, Anh Tuan Luu, Truc Lu, Tho Quan EMNLP 2021 [pdf]
Augmented Abstractive Summarization With Document-Level Semantic Graph Qiwei Bi, Haoyuan Li, Kun Lu, Hanfang Yang Journal of Data Science [pdf]
ARMAN: Pre-training with Semantically Selecting and Reordering of Sentences for Persian Abstractive Summarization Alireza Salemi, Emad Kebriaei, Ghazal Neisi Minaei, Azadeh Shakery [pdf] [data]
Subjective Bias in Abstractive Summarization Lei Li, Wei Liu, Marina Litvak, Natalia Vanetik, Jiacheng Pei, Yinan Liu, Siya Qi [pdf] [code]
Neural Abstractive Unsupervised Summarization of Online News Discussions Ignacio Tampe Palma, Marcelo Mendoza, Evangelos Milios [pdf]
Attention Temperature Matters in Abstractive Summarization Distillation Shengqiang Zhang, Xingxing Zhang, Hangbo Bao, Furu Wei ACL 2022 [pdf] [code]

[Abs]
Recent progress of abstractive text summarization largely relies on large pre-trained sequence-to-sequence Transformer models, which are computationally expensive. This paper aims to distill these large models into smaller ones for faster inference and with minimal performance loss. Pseudo-labeling based methods are popular in sequence-to-sequence model distillation. In this paper, we find simply manipulating attention temperatures in Transformers can make pseudo labels easier to learn for student models. Our experiments on three summarization datasets show our proposed method consistently improves vanilla pseudo-labeling based methods. Further empirical analysis shows that both pseudo labels and summaries produced by our students are shorter and more abstractive.
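To make the notion of an attention temperature concrete, the sketch below divides attention logits by a temperature before the softmax. It is a generic illustration, not the paper's implementation: `attention_weights` and `entropy` are hypothetical helper names, and the input scores are assumed to be already-scaled dot products for a single query.

```python
import math
from typing import List


def attention_weights(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over attention logits divided by a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.0]
sharp = attention_weights(logits, temperature=1.0)
smooth = attention_weights(logits, temperature=2.0)
```

A temperature above 1 flattens the distribution so that attention spreads over more positions, while a temperature below 1 sharpens it.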
BASS: Boosting Abstractive Summarization with Unified Semantic Graph Wenhao Wu, Wei Li, Xinyan Xiao, Jiachen Liu, Ziqiang Cao, Sujian Li, Hua Wu, Haifeng Wang ACL21 [pdf]
Enriching Transformers with Structured Tensor-Product Representations for Abstractive Summarization Yichen Jiang, Asli Celikyilmaz, Paul Smolensky, Paul Soulos, Sudha Rao, Hamid Palangi, Roland Fernandez, Caitlin Smith, Mohit Bansal, Jianfeng Gao NAACL21 [pdf] [code]
Uncertainty-Aware Abstractive Summarization Alexios Gidiotis, Grigorios Tsoumakas [pdf]
What's in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization Griffin Adams, Emily Alsentzer, Mert Ketenci, Jason Zucker, Noémie Elhadad NAACL21 [pdf]
Generating abstractive summaries of Lithuanian news articles using a transformer model Lukas Stankevičius, Mantas Lukoševičius [pdf]
Summarization, Simplification, and Generation: The Case of Patents Silvia Casola, Alberto Lavelli [pdf]
Quantifying Appropriateness of Summarization Data for Curriculum Learning Ryuji Kano, Takumi Takahashi, Toru Nishino, Motoki Taniguchi, Tomoki Taniguchi, Tomoko Ohkuma EACL21 [pdf]
Text Summarization of Czech News Articles Using Named Entities Petr Marek, Štěpán Müller, Jakub Konrád, Petr Lorenc, Jan Pichl, Jan Šedivý Journal [pdf]
Planning with Entity Chains for Abstractive Summarization Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Simoes, Ryan McDonald [pdf]
Attention Head Masking for Inference Time Content Selection in Abstractive Summarization Shuyang Cao, Lu Wang NAACL21 short [pdf] [code]
A New Approach to Overgenerating and Scoring Abstractive Summaries Kaiqiang Song, Bingqing Wang, Zhe Feng, Fei Liu NAACL21 [pdf] [code]
Exploring Explainable Selection to Control Abstractive Summarization Wang Haonan, Gao Yang, Bai Yu, Mirella Lapata, Huang Heyan AAAI21 [pdf] [code]
Friendly Topic Assistant for Transformer Based Abstractive Summarization Zhengjue Wang, Zhibin Duan, Hao Zhang, Chaojie Wang, Long Tian, Bo Chen, Mingyuan Zhou EMNLP20 [pdf] [code]
Neural Abstractive Text Summarizer for Telugu Language Mohan Bharath B, Aravindh Gowtham B, Akhil M ICSCSP20 [pdf]
Topic-Aware Abstractive Text Summarization Chujie Zheng, Kunpeng Zhang, Harry Jiannan Wang, Ling Fan [pdf] [code]
Multi-hop Inference for Question-driven Summarization Yang Deng, Wenxuan Zhang, Wai Lam EMNLP20 [pdf]
Quantitative Argument Summarization and Beyond-Cross-Domain Key Point Analysis Roy Bar-Haim, Yoav Kantor, Lilach Eden, Roni Friedman, Dan Lahav, Noam Slonim EMNLP20 [pdf]
Learning to Fuse Sentences with Transformers for Summarization Logan Lebanoff, Franck Dernoncourt, Doo Soon Kim, Lidan Wang, Walter Chang, Fei Liu EMNLP20 short [pdf] [code]
A Cascade Approach to Neural Abstractive Summarization with Content Selection and Fusion Logan Lebanoff, Franck Dernoncourt, Doo Soon Kim, Walter Chang, Fei Liu AACL20 [pdf] [code]
AutoSurvey: Automatic Survey Generation based on a Research Draft Hen-Hsen Huang IJCAI20 [pdf] [code]
Neural Abstractive Summarization with Structural Attention Tanya Chowdhury, Sachin Kumar, Tanmoy Chakraborty IJCAI20 [pdf]
A Unified Model for Financial Event Classification, Detection and Summarization Quanzhi Li, Qiong Zhang IJCAI20 Special Track on AI in FinTech [pdf]
Discriminative Adversarial Search for Abstractive Summarization Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano ICML20 [pdf]
Controlling the Amount of Verbatim Copying in Abstractive Summarization Kaiqiang Song, Bingqing Wang, Zhe Feng, Liu Ren, Fei Liu AAAI20 [pdf] [code]
GRET: Global Representation Enhanced Transformer Rongxiang Weng, Haoran Wei, Shujian Huang, Heng Yu, Lidong Bing, Weihua Luo, Jiajun Chen AAAI20 [pdf]
Abstractive Summarization of Spoken and Written Instructions with BERT Alexandra Savelieva, Bryan Au-Yeung, Vasanth Ramani KDD Converse 2020 [pdf]
Concept Pointer Network for Abstractive Summarization Wang Wenbo, Gao Yang, Huang Heyan, Zhou Yuxiang EMNLP19 [pdf] [code]
Co-opNet: Cooperative Generator–Discriminator Networks for Abstractive Summarization with Narrative Flow Saadia Gabriel, Antoine Bosselut, Ari Holtzman, Kyle Lo, Asli Celikyilmaz, Yejin Choi [pdf]
Contrastive Attention Mechanism for Abstractive Sentence Summarization Xiangyu Duan, Hongfei Yu, Mingming Yin, Min Zhang, Weihua Luo, Yue Zhang EMNLP19 [pdf] [code]
An Entity-Driven Framework for Abstractive Summarization Eva Sharma, Luyang Huang, Zhe Hu, Lu Wang EMNLP19 [pdf] [code]
Abstract Text Summarization: A Low Resource Challenge Shantipriya Parida, Petr Motlicek EMNLP19 [pdf] [code]
Attention Optimization for Abstractive Document Summarization Min Gui, Junfeng Tian, Rui Wang, Zhenglu Yang EMNLP19 [pdf] [code]
Scoring Sentence Singletons and Pairs for Abstractive Summarization Logan Lebanoff, Kaiqiang Song, Franck Dernoncourt, Doo Soon Kim, Seokhwan Kim, Walter Chang, Fei Liu ACL19 [pdf] [code]
Inducing Document Structure for Aspect-based Summarization Lea Frermann, Alexandre Klementiev ACL19 [pdf] [code]
Generating Summaries with Topic Templates and Structured Convolutional Decoders Laura Perez-Beltrachini, Yang Liu, Mirella Lapata ACL19 [pdf] [code]
Summary Refinement through Denoising Nikola I. Nikolov, Alessandro Calmanovici, Richard H.R. Hahnloser RANLP19 [pdf] [code]
Closed-Book Training to Improve Summarization Encoder Memory Yichen Jiang, Mohit Bansal EMNLP18 [pdf]
Improving Neural Abstractive Document Summarization with Structural Regularization Wei Li, Xinyan Xiao, Yajuan Lyu, Yuanzhuo Wang EMNLP18 [pdf]
Bottom-Up Abstractive Summarization Sebastian Gehrmann, Yuntian Deng, Alexander M. Rush EMNLP18 [pdf] [code]
A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, Min Sun ACL18 [pdf]
Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation Han Guo, Ramakanth Pasunuru, Mohit Bansal ACL18 [pdf]
Abstractive Document Summarization via Bidirectional Decoder Xin Wan, Chen Li, Ruijia Wang, Ding Xiao, Chuan Shi ADMA18 [pdf]
Entity Commonsense Representation for Neural Abstractive Summarization Reinald Kim Amplayo, Seonjae Lim, Seung-won Hwang NAACL18 [pdf]
Get To The Point: Summarization with Pointer-Generator Networks Abigail See, Peter J. Liu, Christopher D. Manning ACL17 [pdf] [code]
Selective Encoding for Abstractive Sentence Summarization Qingyu Zhou, Nan Yang, Furu Wei, Ming Zhou ACL17 [pdf]
Abstractive Document Summarization with a Graph-Based Attentional Neural Model Jiwei Tan, Xiaojun Wan, Jianguo Xiao ACL17 [pdf]
Toward Abstractive Summarization Using Semantic Representations Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, Noah A. Smith NAACL15 [pdf]
Abstractive Meeting Summarization with Entailment and Fusion Yashar Mehdad, Giuseppe Carenini, Frank Tompa, Raymond T. Ng ENLG13 [pdf]

Graph-Based

Abstractive Summarization Guided by Latent Hierarchical Document Structure Yifu Qiu, Shay B. Cohen EMNLP 2022 [pdf] [code]

[Abs]
Sequential abstractive neural summarizers often do not use the underlying structure in the input article or dependencies between the input sentences. This structure is essential to integrate and consolidate information from different parts of the text. To address this shortcoming, we propose a hierarchy-aware graph neural network (HierGNN) which captures such dependencies through three main steps: 1) learning a hierarchical document structure through a latent structure tree learned by a sparse matrix-tree computation; 2) propagating sentence information over this structure using a novel message-passing node propagation mechanism to identify salient information; 3) using graph-level attention to concentrate the decoder on salient information. Experiments confirm HierGNN improves strong sequence models such as BART, with a 0.55 and 0.75 margin in average ROUGE-1/2/L for CNN/DM and XSum. Further human evaluation demonstrates that summaries produced by our model are more relevant and less redundant than the baselines, into which HierGNN is incorporated. We also find HierGNN synthesizes summaries by fusing multiple source sentences more, rather than compressing a single source sentence, and that it processes long inputs more effectively.
Hierarchical Heterogeneous Graph Attention Network for Syntax-Aware Summarization Zixing Song, Irwin King AAAI 2022 [pdf]
Summarization with Graphical Elements Maartje ter Hoeve, Julia Kiseleva, Maarten de Rijke [pdf] [code]
HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization Ye Liu, Jian-Guo Zhang, Yao Wan, Congying Xia, Lifang He, Philip S. Yu EMNLP 2021 short [pdf]
Centrality Meets Centroid: A Graph-based Approach for Unsupervised Document Summarization Haopeng Zhang, Jiawei Zhang [pdf]
Neural Extractive Summarization with Hierarchical Attentive Heterogeneous Graph Network Ruipeng Jia, Yanan Cao, Hengzhu Tang, Fang Fang, Cong Cao, Shi Wang EMNLP20 [pdf] [code]
Enhancing Extractive Text Summarization with Topic-Aware Graph Neural Networks Peng Cui, Le Hu, Yuanchao Liu COLING20 [pdf]
Heterogeneous Graph Neural Networks for Extractive Document Summarization Danqing Wang, Pengfei Liu, Yining Zheng, Xipeng Qiu, Xuanjing Huang ACL20 [pdf] [code]
Structured Neural Summarization Patrick Fernandes, Miltiadis Allamanis, Marc Brockschmidt ICLR19 [pdf] [code]
Hierarchical Transformers for Multi-Document Summarization Yang Liu, Mirella Lapata ACL19 [pdf] [code]
Learning to Create Sentence Semantic Relation Graphs for Multi-Document Summarization Diego Antognini, Boi Faltings EMNLP19 [pdf]
Graph-based Neural Multi-Document Summarization Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, Dragomir Radev CoNLL17 [pdf]
Abstractive Document Summarization with a Graph-Based Attentional Neural Model Jiwei Tan, Xiaojun Wan, Jianguo Xiao ACL17 [pdf]

Unsupervised

Improving Sentence Similarity Estimation for Unsupervised Extractive Summarization Shichao Sun, Ruifeng Yuan, Wenjie Li, Sujian Li ICASSP 2023 [pdf] [code]

[Abs]
Unsupervised extractive summarization aims to extract salient sentences from a document as the summary without labeled data. Recent literatures mostly research how to leverage sentence similarity to rank sentences in the order of salience. However, sentence similarity estimation using pre-trained language models mostly takes little account of document-level information and has a weak correlation with sentence salience ranking. In this paper, we proposed two novel strategies to improve sentence similarity estimation for unsupervised extractive summarization. We use contrastive learning to optimize a document-level objective that sentences from the same document are more similar than those from different documents. Moreover, we use mutual learning to enhance the relationship between sentence similarity estimation and sentence salience ranking, where an extra signal amplifier is used to refine the pivotal information. Experimental results demonstrate the effectiveness of our strategies.
Generating Multiple-Length Summaries via Reinforcement Learning for Unsupervised Sentence Summarization Dongmin Hyun, Xiting Wang, Chanyoung Park, Xing Xie, Hwanjo Yu [pdf] [code]

[Abs]
Sentence summarization shortens given texts while maintaining core contents of the texts. Unsupervised approaches have been studied to summarize texts without human-written summaries. However, recent unsupervised models are extractive, which remove words from texts and thus they are less flexible than abstractive summarization. In this work, we devise an abstractive model based on reinforcement learning without ground-truth summaries. We formulate the unsupervised summarization based on the Markov decision process with rewards representing the summary quality. To further enhance the summary quality, we develop a multi-summary learning mechanism that generates multiple summaries with varying lengths for a given text, while making the summaries mutually enhance each other. Experimental results show that the proposed model substantially outperforms both abstractive and extractive models, yet frequently generating new words not contained in input texts.
Referee: Reference-Free Sentence Summarization with Sharper Controllability through Symbolic Knowledge Distillation Melanie Sclar, Peter West, Sachin Kumar, Yulia Tsvetkov, Yejin Choi EMNLP 2022 [pdf] [code]

[Abs]
We present Referee, a novel framework for sentence summarization that can be trained reference-free (i.e., requiring no gold summaries for supervision), while allowing direct control for compression ratio. Our work is the first to demonstrate that reference-free, controlled sentence summarization is feasible via the conceptual framework of Symbolic Knowledge Distillation (West et al., 2022), where latent knowledge in pre-trained language models is distilled via explicit examples sampled from the teacher models, further purified with three types of filters: length, fidelity, and Information Bottleneck. Moreover, we uniquely propose iterative distillation of knowledge, where student models from the previous iteration of distillation serve as teacher models in the next iteration. Starting off from a relatively modest set of GPT3-generated summaries, we demonstrate how iterative knowledge distillation can lead to considerably smaller, but better summarizers with sharper controllability. A useful by-product of this iterative distillation process is a high-quality dataset of sentence-summary pairs with varying degrees of compression ratios. Empirical results demonstrate that the final student models vastly outperform the much larger GPT3-Instruct model in terms of the controllability of compression ratios, without compromising the quality of resulting summarization.
UPER: Boosting Multi-Document Summarization with an Unsupervised Prompt-based Extractor Shangqing Tu, Jifan Yu, Fangwei Zhu, Juanzi Li, Lei Hou, Jian-Yun Nie COLING 2022 [pdf] [code]

[Abs]
Multi-Document Summarization (MDS) commonly employs the 2-stage extract-then-abstract paradigm, which first extracts a relatively short meta-document, then feeds it into the deep neural networks to generate an abstract. Previous work usually takes the ROUGE score as the label for training a scoring model to evaluate source documents. However, the trained scoring model is prone to under-fitting for low-resource settings, as it relies on the training data. To extract documents effectively, we construct prompting templates that invoke the underlying knowledge in Pre-trained Language Model (PLM) to calculate the document and keyword’s perplexity, which can assess the document’s semantic salience. Our unsupervised approach can be applied as a plug-in to boost other metrics for evaluating a document’s salience, thus improving the subsequent abstract generation. We get positive results on 2 MDS datasets, 2 data settings, and 2 abstractive backbone models, showing our method’s effectiveness. Our code is available at https://github.com/THU-KEG/UPER
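The abstract above relies on perplexity as a salience signal. As a reminder of how that quantity is computed, here is a minimal sketch that derives perplexity from per-token log-probabilities. The function name `perplexity` is hypothetical, and the log-probabilities are assumed to come from some pre-trained language model using natural logarithms; how the prompt templates are built is not shown.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean token log-probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Four tokens, each assigned probability 0.5 by the (assumed) language model.
confident = perplexity([math.log(0.5)] * 4)
# The same length, but each token is assigned probability 0.1.
uncertain = perplexity([math.log(0.1)] * 4)
```

Lower perplexity means the model finds the text more predictable given its prompt, which is the property a perplexity-based ranking of documents would exploit.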
Learning Non-Autoregressive Models from Search for Unsupervised Sentence Summarization Puyuan Liu, Chenyang Huang, Lili Mou ACL 2022 [pdf] [code]

[Abs]
Text summarization aims to generate a short summary for an input text. In this work, we propose a Non-Autoregressive Unsupervised Summarization (NAUS) approach, which does not require parallel data for training. Our NAUS first performs edit-based search towards a heuristically defined score, and generates a summary as pseudo-groundtruth. Then, we train an encoder-only non-autoregressive Transformer based on the search result. We also propose a dynamic programming approach for length-control decoding, which is important for the summarization task. Experiments on two datasets show that NAUS achieves state-of-the-art performance for unsupervised summarization, yet largely improving inference efficiency. Further, our algorithm is able to perform explicit length-transfer summary generation.
Unsupervised Extractive Opinion Summarization Using Sparse Coding Somnath Basu Roy Chowdhury, Chao Zhao, Snigdha Chaturvedi ACL 2022 [pdf] [code]

[Abs]
Opinion summarization is the task of automatically generating summaries that encapsulate information expressed in multiple user reviews. We present Semantic Autoencoder (SemAE) to perform extractive opinion summarization in an unsupervised manner. SemAE uses dictionary learning to implicitly capture semantic information from the review text and learns a latent representation of each sentence over semantic units. Our extractive summarization algorithm leverages the representations to identify representative opinions among hundreds of reviews. SemAE is also able to perform controllable summarization to generate aspect-specific summaries using only a few samples. We report strong performance on SPACE and AMAZON datasets and perform experiments to investigate the functioning of our model.
Want To Reduce Labeling Cost? GPT-3 Can Help Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, Michael Zeng Findings of EMNLP 2021 [pdf]
Improving Unsupervised Extractive Summarization with Facet-Aware Modeling Xinnian Liang, Shuangzhi Wu, Mu Li, Zhoujun Li ACL 2021 Findings [pdf] [code]
MRCBert: A Machine Reading Comprehension Approach for Unsupervised Summarization Saurabh Jain, Guokai Tang, Lim Sze Chi [pdf] [code]
Centrality Meets Centroid: A Graph-based Approach for Unsupervised Document Summarization Haopeng Zhang, Jiawei Zhang [pdf]
Unsupervised Opinion Summarization with Content Planning Reinald Kim Amplayo, Stefanos Angelidis, Mirella Lapata AAAI21 [pdf] [code]
Biased TextRank: Unsupervised Graph-Based Content Extraction Ashkan Kazemi, Verónica Pérez-Rosas, Rada Mihalcea COLING20 [pdf] [code]
Unsupervised Extractive Summarization by Pre-training Hierarchical Transformers Shusheng Xu, Xingxing Zhang, Yi Wu, Furu Wei, Ming Zhou [pdf] [code]
Q-learning with Language Model for Edit-based Unsupervised Summarization Ryosuke Kohita, Akifumi Wachi, Yang Zhao, Ryuki Tachibana EMNLP20 [pdf] [code]
Abstractive Document Summarization without Parallel Data Nikola I. Nikolov, Richard H.R. Hahnloser LREC20 [pdf] [code]
Unsupervised Neural Single-Document Summarization of Reviews via Learning Latent Discourse Structure and its Ranking Masaru Isonuma, Junichiro Mori, Ichiro Sakata ACL19 [pdf] [code]
Sentence Centrality Revisited for Unsupervised Summarization Hao Zheng, Mirella Lapata ACL19 [pdf] [code]
Discrete Optimization for Unsupervised Sentence Summarization with Word-Level Extraction Raphael Schumann, Lili Mou, Yao Lu, Olga Vechtomova, Katja Markert ACL20 [pdf] [code]
SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders Peter J. Liu, Yu-An Chung, Jie Ren [pdf] [code]
MeanSum : A Neural Model for Unsupervised Multi-Document Abstractive Summarization Eric Chu, Peter J. Liu ICML19 [pdf] [code]
SEQ3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, Alexandros Potamianos NAACL19 [pdf] [code]
Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks Yaushian Wang, Hung-Yi Lee EMNLP18 [pdf] [code]
Unsupervised Abstractive Meeting Summarization with Multi-Sentence Compression and Budgeted Submodular Maximization Guokan Shang, Wensi Ding, Zekun Zhang, Antoine Tixier, Polykarpos Meladianos, Michalis Vazirgiannis, Jean-Pierre Lorré ACL18 [pdf] [code]

Concept-map-based

Fast Concept Mention Grouping for Concept Map–based Multi-Document Summarization Tobias Falke, Iryna Gurevych NAACL19 [pdf] [code]
Bringing Structure into Summaries : Crowdsourcing a Benchmark Corpus of Concept Maps Tobias Falke, Iryna Gurevych EMNLP17 [pdf] [code]

Timeline

Follow the Timeline! Generating Abstractive and Extractive Timeline Summary in Chronological Order Xiuying Chen, Mingzhe Li, Shen Gao, Zhangming Chan, Dongyan Zhao, Xin Gao, Xiangliang Zhang, Rui Yan TOIS [pdf] [code]

[Abs]
Nowadays, time-stamped web documents related to a general news query floods spread throughout the Internet, and timeline summarization targets concisely summarizing the evolution trajectory of events along the timeline. Unlike traditional document summarization, timeline summarization needs to model the time series information of the input events and summarize important events in chronological order. To tackle this challenge, in this paper, we propose a Unified Timeline Summarizer (UTS) that can generate abstractive and extractive timeline summaries in time order. Concretely, in the encoder part, we propose a graph-based event encoder that relates multiple events according to their content dependency and learns a global representation of each event. In the decoder part, to ensure the chronological order of the abstractive summary, we propose to extract the feature of event-level attention in its generation process with sequential information remained and use it to simulate the evolutionary attention of the ground truth summary. The event-level attention can also be used to assist in extracting summary, where the extracted summary also comes in time sequence. We augment the previous Chinese large-scale timeline summarization dataset and collect a new English timeline dataset. Extensive experiments conducted on these datasets and on the out-of-domain Timeline 17 dataset show that UTS achieves state-of-the-art performance in terms of both automatic and human evaluations.
CrisisLTLSum: A Benchmark for Local Crisis Event Timeline Extraction and Summarization Hossein Rajaby Faghihi, Bashar Alhafni, Ke Zhang, Shihao Ran, Joel Tetreault, Alejandro Jaimes [pdf] [data]

[Abs]
Social media has increasingly played a key role in emergency response: first responders can use public posts to better react to ongoing crisis events and deploy the necessary resources where they are most needed. Timeline extraction and abstractive summarization are critical technical tasks to leverage large numbers of social media posts about events. Unfortunately, there are few datasets for benchmarking technical approaches for those tasks. This paper presents CrisisLTLSum, the largest dataset of local crisis event timelines available to date. CrisisLTLSum contains 1,000 crisis event timelines across four domains: wildfires, local fires, traffic, and storms. We built CrisisLTLSum using a semi-automated cluster-then-refine approach to collect data from the public Twitter stream. Our initial experiments indicate a significant gap between the performance of strong baselines compared to the human performance on both tasks. Our dataset, code, and models are publicly available.
Joint Learning-based Heterogeneous Graph Attention Network for Timeline Summarization Jingyi You, Dongyuan Li, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura NAACL 2022 [pdf] [data]

[Abs]
Previous studies on the timeline summarization (TLS) task ignored the information interaction between sentences and dates, and adopted pre-defined unlearnable representations for them. They also considered date selection and event detection as two independent tasks, which makes it impossible to integrate their advantages and obtain a globally optimal summary. In this paper, we present a joint learning-based heterogeneous graph attention network for TLS (HeterTls), in which date selection and event detection are combined into a unified framework to improve the extraction accuracy and remove redundant sentences simultaneously. Our heterogeneous graph involves multiple types of nodes, the representations of which are iteratively learned across the heterogeneous graph attention layer. We evaluated our model on four datasets, and found that it significantly outperformed the current state-of-the-art baselines with regard to ROUGE scores and date selection metrics.
Updated Headline Generation: Creating Updated Summaries for Evolving News Stories Sheena Panthaplackel, Adrian Benton, Mark Dredze ACL 2022 [pdf] [code]

[Abs]
We propose the task of updated headline generation, in which a system generates a headline for an updated article, considering both the previous article and headline. The system must identify the novel information in the article update, and modify the existing headline accordingly. We create data for this task using the NewsEdits corpus by automatically identifying contiguous article versions that are likely to require a substantive headline update. We find that models conditioned on the prior headline and body revisions produce headlines judged by humans to be as factual as gold headlines while making fewer unnecessary edits compared to a standard headline generation model. Our experiments establish benchmarks for this new contextual summarization task.
Abstractive summarization of hospitalisation histories with transformer networks Alexander Yalunin, Dmitriy Umerenkov, Vladimir Kokh [pdf]
Follow the Timeline! Generating Abstractive and Extractive Timeline Summary in Chronological Order Xiuying Chen, Mingzhe Li, Shen Gao, Zhangming Chan, Dongyan Zhao, Xin Gao, Xiangliang Zhang, Rui Yan TOIS [pdf] [data]
Multi-TimeLine Summarization (MTLS): Improving Timeline Summarization by Generating Multiple Summaries Yi Yu, Adam Jatowt, Antoine Doucet, Kazunari Sugiyama, Masatoshi Yoshikawa ACL 2021 [pdf] [data]
Summarize Dates First: A Paradigm Shift in Timeline Summarization Moreno La Quatra, Luca Cagliero, Elena Baralis, Alberto Messina, Maurizio Montagnuolo SIGIR 2021 [pdf] [data]
Examining the State-of-the-Art in News Timeline Summarization Demian Gholipour Ghalandari, Georgiana Ifrim ACL20 [pdf] [code]
Learning towards Abstractive Timeline Summarization Xiuying Chen, Zhangming Chan, Shen Gao, Meng-Hsuan Yu, Dongyan Zhao, Rui Yan IJCAI19 [pdf] [data]

Opinion

Simple Yet Effective Synthetic Dataset Construction for Unsupervised Opinion Summarization Ming Shen, Jie Ma, Shuai Wang, Yogarshi Vyas, Kalpit Dixit, Miguel Ballesteros, Yassine Benajiba EACL 2023 Findings [pdf]

[Abs]
Opinion summarization provides an important solution for summarizing opinions expressed among a large number of reviews. However, generating aspect-specific and general summaries is challenging due to the lack of annotated data. In this work, we propose two simple yet effective unsupervised approaches to generate both aspect-specific and general opinion summaries by training on synthetic datasets constructed with aspect-related review contents. Our first approach, Seed Words Based Leave-One-Out (SW-LOO), identifies aspect-related portions of reviews simply by exact-matching aspect seed words and outperforms existing methods by 3.4 ROUGE-L points on SPACE and 0.5 ROUGE-1 point on OPOSUM+ for aspect-specific opinion summarization. Our second approach, Natural Language Inference Based Leave-One-Out (NLI-LOO) identifies aspect-related sentences utilizing an NLI model in a more general setting without using seed words and outperforms existing approaches by 1.2 ROUGE-L points on SPACE for aspect-specific opinion summarization and remains competitive on other metrics.
Opinion Summarization by Weak-Supervision from Mix-structured Data Yizhu Liu, Qi Jia, Kenny Zhu EMNLP 2022 [pdf] [code]

[Abs]
Opinion summarization of multiple reviews suffers from the lack of reference summaries for training. Most previous approaches construct multiple reviews and their summary based on textual similarities between reviews, resulting in information mismatch between the review input and the summary. In this paper, we convert each review into a mix of structured and unstructured data, which we call opinion-aspect pairs (OAs) and implicit sentences (ISs). We propose a new method to synthesize training pairs of such mix-structured data as input and the textual summary as output, and design a summarization model with OA encoder and IS encoder. Experiments show that our approach outperforms previous methods on Yelp, Amazon and RottenTomatos datasets.
OpineSum: Entailment-based self-training for abstractive opinion summarization Annie Louis, Joshua Maynez [pdf]

[Abs]
A typical product or place often has hundreds of reviews, and summarization of these texts is an important and challenging problem. Recent progress on abstractive summarization in domains such as news has been driven by supervised systems trained on hundreds of thousands of news articles paired with human-written summaries. However for opinion texts, such large scale datasets are rarely available. Unsupervised methods, self-training, and few-shot learning approaches bridge that gap. In this work, we present a novel self-training approach, OpineSum, for abstractive opinion summarization. The summaries in this approach are built using a novel application of textual entailment and capture the consensus of opinions across the various reviews for an item. This method can be used to obtain silver-standard summaries on a large scale and train both unsupervised and few-shot abstractive summarization systems. OpineSum achieves state-of-the-art performance in both settings.
Zero-Shot Opinion Summarization with GPT-3 Adithya Bhaskar, Alexander R. Fabbri, Greg Durrett [pdf] [code]

[Abs]
Very large language models such as GPT-3 have shown impressive performance across a wide variety of tasks, including text summarization. In this paper, we show that this strong performance extends to opinion summarization. We explore several pipeline methods for applying GPT-3 to summarize a large collection of user reviews in a zero-shot fashion, notably approaches based on recursive summarization and selecting salient content to summarize through supervised clustering or extraction. On two datasets, an aspect-oriented summarization dataset of hotel reviews and a generic summarization dataset of Amazon and Yelp reviews, we show that the GPT-3 models achieve very strong performance in human evaluation. We argue that standard evaluation metrics do not reflect this, and evaluate against several new measures targeting faithfulness, factuality, and genericity to contrast these different methods.
Unsupervised Opinion Summarisation in the Wasserstein Space Jiayu Song, Iman Munire Bilal, Adam Tsakalidis, Rob Procter, Maria Liakata [pdf]

[Abs]
Opinion summarisation synthesises opinions expressed in a group of documents discussing the same topic to produce a single summary. Recent work has looked at opinion summarisation of clusters of social media posts. Such posts are noisy and have unpredictable structure, posing additional challenges for the construction of the summary distribution and the preservation of meaning compared to online reviews, which has been so far the focus of opinion summarisation. To address these challenges we present WassOS, an unsupervised abstractive summarization model which makes use of the Wasserstein distance. A Variational Autoencoder is used to get the distribution of documents/posts, and the distributions are disentangled into separate semantic and syntactic spaces. The summary distribution is obtained using the Wasserstein barycenter of the semantic and syntactic distributions. A latent variable sampled from the summary distribution is fed into a GRU decoder with a transformer layer to produce the final summary. Our experiments on multiple datasets including Twitter clusters, Reddit threads, and reviews show that WassOS almost always outperforms the state-of-the-art on ROUGE metrics and consistently produces the best summaries with respect to meaning preservation according to human evaluations.
Noisy Pairing and Partial Supervision for Opinion Summarization Hayate Iso, Xiaolan Wang, Yoshi Suhara [pdf]

[Abs]
Current opinion summarization systems simply generate summaries reflecting important opinions from customer reviews, but the generated summaries may not attract the reader's attention. Although it is helpful to automatically generate professional reviewer-like summaries from customer reviews, collecting many training pairs of customer and professional reviews is generally tricky. We propose a weakly supervised opinion summarization framework, Noisy Pairing and Partial Supervision (NAPA) that can build a stylized opinion summarization system with no customer-professional review pairs. Experimental results show consistent improvements in automatic evaluation metrics, and qualitative analysis shows that our weakly supervised opinion summarization system can generate summaries that look more like those written by professional reviewers.
Unsupervised Opinion Summarization Using Approximate Geodesics Somnath Basu Roy Chowdhury, Nicholas Monath, Avinava Dubey, Amr Ahmed, Snigdha Chaturvedi [pdf]

[Abs]
Opinion summarization is the task of creating summaries capturing popular opinions from user reviews. In this paper, we introduce Geodesic Summarizer (GeoSumm), a novel system to perform unsupervised extractive opinion summarization. GeoSumm involves an encoder-decoder based representation learning model, that generates representations of text as a distribution over latent semantic units. GeoSumm generates these representations by performing dictionary learning over pre-trained text representations at multiple decoder layers. We then use these representations to quantify the relevance of review sentences using a novel approximate geodesic distance based scoring mechanism. We use the relevance scores to identify popular opinions in order to compose general and aspect-specific summaries. Our proposed model, GeoSumm, achieves state-of-the-art performance on three opinion summarization datasets. We perform additional experiments to analyze the functioning of our model and showcase the generalization ability of GeoSumm across different domains.
Template-based Abstractive Microblog Opinion Summarisation Iman Munire Bilal, Bo Wang, Adam Tsakalidis, Dong Nguyen, Rob Procter, Maria Liakata TACL 2022 [pdf]

[Abs]
We introduce the task of microblog opinion summarisation (MOS) and share a dataset of 3100 gold-standard opinion summaries to facilitate research in this domain. The dataset contains summaries of tweets spanning a 2-year period and covers more topics than any other public Twitter summarisation dataset. Summaries are abstractive in nature and have been created by journalists skilled in summarising news articles following a template separating factual information (main story) from author opinions. Our method differs from previous work on generating gold-standard summaries from social media, which usually involves selecting representative posts and thus favours extractive summarisation models. To showcase the dataset's utility and challenges, we benchmark a range of abstractive and extractive state-of-the-art summarisation models and achieve good performance, with the former outperforming the latter. We also show that fine-tuning is necessary to improve performance and investigate the benefits of using different sample sizes.
Efficient Few-Shot Fine-Tuning for Opinion Summarization Arthur Bražinskas, Ramesh Nallapati, Mohit Bansal, Markus Dreyer Findings of NAACL 2022 [pdf] [code]

[Abs]
Abstractive summarization models are typically pre-trained on large amounts of generic texts, then fine-tuned on tens or hundreds of thousands of annotated samples. However, in opinion summarization, large annotated datasets of reviews paired with reference summaries are not available and would be expensive to create. This calls for fine-tuning methods robust to overfitting on small datasets. In addition, generically pre-trained models are often not accustomed to the specifics of customer reviews and, after fine-tuning, yield summaries with disfluencies and semantic mistakes. To address these problems, we utilize an efficient few-shot method based on adapters which, as we show, can easily store in-domain knowledge. Instead of fine-tuning the entire model, we add adapters and pre-train them in a task-specific way on a large corpus of unannotated customer reviews, using held-out reviews as pseudo summaries. Then, fine-tune the adapters on the small available human-annotated dataset. We show that this self-supervised adapter pre-training improves summary quality over standard fine-tuning by 2.0 and 1.3 ROUGE-L points on the Amazon and Yelp datasets, respectively. Finally, for summary personalization, we condition on aspect keyword queries, automatically created from generic datasets. In the same vein, we pre-train the adapters in a query-based manner on customer reviews and then fine-tune them on annotated datasets. This results in better-organized summary content reflected in improved coherence and fewer redundancies.
DSGPT: Domain-Specific Generative Pre-Training of Transformers for Text Generation in E-commerce Title and Review Summarization Xueying Zhang, Yunjiang Jiang, Yue Shang, Zhaomeng Cheng, Chi Zhang, Xiaochuan Fan, Yun Xiao, Bo Long SIGIR 2021 [pdf]
Convex Aggregation for Opinion Summarization Hayate Iso, Xiaolan Wang, Yoshihiko Suhara, Stefanos Angelidis, Wang-Chiew Tan EMNLP 2021 Findings [pdf] [code]
Measuring Similarity of Opinion-bearing Sentences Wenyi Tay, Xiuzhen Zhang, Stephen Wan, Sarvnaz Karimi EMNLP 2021 | newsum [pdf] [data]
Comparative Opinion Summarization via Collaborative Decoding Hayate Iso, Xiaolan Wang, Yoshihiko Suhara [pdf] [data]
Learning Opinion Summarizers by Selecting Informative Reviews Arthur Bražinskas, Mirella Lapata, Ivan Titov EMNLP 2021 [pdf] [code]
Aspect-Controllable Opinion Summarization Reinald Kim Amplayo, Stefanos Angelidis, Mirella Lapata EMNLP 2021 [pdf] [code]
CUSTOM: Aspect-Oriented Product Summarization for E-Commerce Jiahui Liang, Junwei Bao, Yifan Wang, Youzheng Wu, Xiaodong He, Bowen Zhou [pdf] [code]
TransSum: Translating Aspect and Sentiment Embeddings for Self-Supervised Opinion Summarization Ke Wang, Xiaojun Wan
