Invited Talk

The 4th Workshop on "Evaluation & Comparison of NLP Systems", 1st November 2023, co-located with IJCNLP-AACL 2023

Re-Evaluating Summarization Evaluation in the Era of LLMs

Alexander Fabbri


Recent advances in Large Language Models (LLMs) have yielded significant performance gains across NLP tasks, including text summarization. These improvements also call for more nuanced evaluation of the resulting models. In this talk, I will first explore the current landscape of human evaluation in summarization and present a fine-grained protocol and evaluation benchmark for assessing the salience of summaries. This work highlights possible biases in human evaluation in the era of LLMs, underscoring the need for more targeted evaluation. I will then introduce our work that points to issues in existing evaluation benchmarks for factual consistency in summarization and proposes a novel protocol for efficiently creating similar benchmarks targeting specific error types. Our resulting benchmark reveals gaps in the ability of LLMs to detect factual inconsistencies. To conclude, I will discuss additional challenges and future directions for summarization evaluation.