The 2nd Workshop on "Evaluation & Comparison of NLP Systems", Co-located with EMNLP 2021

Wednesday 10th November 2021

Punta Cana Time (UTC-4)
09:00 - 09:10 Opening Remarks
09:15 - 09:55 Keynote Talk 1 (Ehud Reiter)
High-Quality Human Evaluations of NLG
10:00 - 10:40 Paper Presentation Session 1
Session chair: Yang Gao
  • How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology? (Findings)
    Chantal Amrhein and Rico Sennrich
  • AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models (Findings)
    Harish Tayyar Madabushi, Edward Gow-Smith, Carolina Scarton and Aline Villavicencio
  • Entity-Based Semantic Adequacy for Data-to-Text Generation (Findings)
    Juliette Faille, Albert Gatt and Claire Gardent
  • Differential Evaluation: a Qualitative Analysis of Natural Language Processing System Behavior Based Upon Data Resistance to Processing
    Lucie Gianola, Hicham El Boukkouri, Cyril Grouin, Thomas Lavergne, Patrick Paroubek and Pierre Zweigenbaum
  • Validating Label Consistency in NER Data Annotation
    Qingkai Zeng, Mengxia Yu, Wenhao Yu, Tianwen Jiang and Meng Jiang
  • MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation
    Ayush Garg, Sammed Kagi, Vivek Srivastava and Mayank Singh
10:45 - 11:25 Keynote Talk 2 (Sebastian Ruder)
Challenges and Opportunities in Multilingual Evaluation
11:30 - 12:10 Paper Presentation Session 2
Session chair: Steffen Eger
  • How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task
    Urja Khurana, Eric Nalisnick and Antske Fokkens
  • Challenges in Detoxifying Language Models (Findings)
    Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin and Po-Sen Huang
  • Adversarial Examples for Evaluating Math Word Problem Solvers (Findings)
    Vivek Kumar, Rishabh Maheshwary and Vikram Pudi
  • Making Heads and Tails of Models with Marginal Calibration for Sparse Tagsets (Findings)
    Michael Kranzlein, Nelson F. Liu and Nathan Schneider
  • TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation (Findings)
    Adaku Uchendu, Zeyu Ma, Thai Le, Rui Zhang and Dongwon Lee
  • StoryDB: Broad Multi-language Narrative Dataset
    Alexey Tikhonov, Igor Samenko and Ivan Yamshchikov
12:15 - 12:55 Keynote Talk 3 (Dan Roth)
Evaluating Evaluation
13:00 - 14:00 Lunch Break
14:00 - 14:40 Keynote Talk 4 (Jason Wu)
Towards Trustworthy Evaluation and Interpretation for Summarization and Dialogue
14:45 - 15:25 Paper Presentation Session 3
Session chair: Piyawat Lertvittayakumjorn
  • SeqScore: Addressing Barriers to Reproducible Named Entity Recognition Evaluation
    Chester Palen-Michel, Nolan Holley and Constantine Lignos
  • Trainable Ranking Models to Evaluate the Semantic Accuracy of Data-to-Text Neural Generator
    Nicolas Garneau and Luc Lamontagne
  • TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning (Findings)
    Kexin Wang, Nils Reimers and Iryna Gurevych
  • Evaluation of Unsupervised Automatic Readability Assessors Using Rank Correlations
    Yo Ehara
  • Evaluating Cross-Database Semantic Parsers With Canonical Utterances
    Heather Lent, Semih Yavuz, Tao Yu, Tong Niu, Yingbo Zhou, Dragomir Radev and Xi Victoria Lin
  • Writing Style Author Embedding Evaluation
    Enzo Terreau, Antoine Gourru and Julien Velcin
15:30 - 16:10 Keynote Talk 5 (Ani Nenkova)
Temporal effects on NLP models
16:15 - 16:55 Paper Presentation Session 4
Session chair: Marina Fomicheva
  • ESTIME: Estimation of Summary-to-Text Inconsistency by Mismatched Embeddings
    Oleg Vasilyev and John Bohannon
  • Towards Realistic Single-Task Continuous Learning Research for NER (Findings)
    Justin Payan, Yuval Merhav, He Xie, Satyapriya Krishna, Anil Ramakrishna, Mukund Sridhar and Rahul Gupta
  • Statistically Significant Detection of Semantic Shifts using Contextual Word Embeddings
    Yang Liu, Alan Medlar and Dorota Glowacka
  • Benchmarking Meta-embeddings: What Works and What Does Not (Findings)
    Iker García, Rodrigo Agerri and German Rigau
  • Referenceless Parsing-Based Evaluation of AMR-to-English Generation
    Emma Manning and Nathan Schneider
17:00 - 17:45 Award Announcement & Shared Task Winners Presentation
Session chair: Steffen Eger
  • IST-Unbabel 2021 Submission for the Explainable Quality Estimation Shared Task
    Marcos Treviso, Nuno M. Guerreiro, Ricardo Rei and André F. T. Martins
  • Error Identification for Machine Translation with Metric Embedding and Attention
    Raphael Rubino, Atsushi Fujita and Benjamin Marie
  • Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics
    Christoph Wolfgang Leiter
  • Explainable Quality Estimation: CUNI Eval4NLP Submission
    Peter Polák, Muskaan Singh and Ondřej Bojar
  • The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results
    Marina Fomicheva, Piyawat Lertvittayakumjorn, Wei Zhao, Steffen Eger and Yang Gao
  • Award Announcement
17:50 - 18:00 Concluding Remarks

Invited Keynote Speakers

Ehud Reiter

Talk: High-Quality Human Evaluations of NLG
Most evaluations in NLP are based on metrics or cheap human evaluations. But it is important to also conduct high-quality human evaluations, even though these can be expensive and time-consuming; such evaluations give us the best understanding of the performance of NLP systems, and also can serve as "gold-standard" evaluations for validating and assessing the reliability of metrics and cheaper human evaluations. In this talk, I will review some of our previous work on high-quality human evaluation of NLG, and then discuss current work on evaluation of factual accuracy in generated texts, evaluation of real-world utility of summaries of medical consultations, and enhancing reproducibility of human evaluations.

Sebastian Ruder

Talk: Challenges and Opportunities in Multilingual Evaluation
As NLP systems become increasingly multilingual, we are presented with the challenge of how to effectively evaluate them across many languages. In this talk, I will discuss some of the challenges in multilingual evaluation, including creating benchmarks that accurately assess performance in different languages as well as designing evaluation setups and metrics that are not biased towards a particular language. I will provide examples to illustrate these challenges and finally discuss potential solutions.

Dan Roth

Talk: Evaluating Evaluation
I will address problems with our community's evaluation methodologies in a range of NLP tasks, from Text Correction to Summarization to Commonsense tasks, and will propose changes in our methodologies to address these problems.

Chien-Sheng (Jason) Wu

Talk: Towards Trustworthy Evaluation and Interpretation for Summarization and Dialogue
In this talk, we first introduce the SummEval library, which re-evaluates 23 text summarizers on 14 automatic metrics and investigates their correlation with human judgement. Then we introduce the SummVis library, a visualization toolkit for interacting with models, data, and predictions. We show a few case studies to demonstrate its use in identifying hallucinations in text generation. Lastly, we briefly discuss recent trends and challenges in factual consistency evaluation for summarization and dialogue tasks.

Ani Nenkova

Talk: Temporal effects on NLP models
How does the performance of models trained to perform language-related tasks change over time? Proper experimental design to study this question is trickier than for most other tasks. I will present a set of experiments with systems powered by large neural pretrained representations for English to demonstrate that temporal model deterioration is not as big a concern, with some models in fact improving when tested on data drawn from later time periods. It is, however, the case that temporal domain adaptation is beneficial, with better performance for a given time period possible when the system is trained on temporally more recent data. I will highlight the difficulties in evaluating temporal effects and how these can potentially be mitigated.