Accepted Papers

The 2nd Workshop on "Evaluation & Comparison of NLP Systems" (Eval4NLP), co-located with EMNLP 2021

Research papers

Statistically Significant Detection of Semantic Shifts using Contextual Word Embeddings
Yang Liu, Alan Medlar and Dorota Glowacka

Developing a Benchmark for Reducing Data Bias in Authorship Attribution
Benjamin Murauer and Günther Specht

ESTIME: Estimation of Summary-to-Text Inconsistency by Mismatched Embeddings
Oleg Vasilyev and John Bohannon

Testing Cross-Database Semantic Parsers With Canonical Utterances
Heather Lent, Semih Yavuz, Tao Yu, Tong Niu, Yingbo Zhou, Dragomir Radev and Xi Victoria Lin

Writing Style Author Embedding Evaluation
Enzo Terreau, Antoine Gourru and Julien Velcin

HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text
Vivek Srivastava and Mayank Singh

Error-Sensitive Evaluation for Ordinal Target Variables
David Chen, Maury Courtland, Adam Faulkner and Aysu Ezen-Can

MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation
Ayush Garg, Sammed Kagi, Vivek Srivastava and Mayank Singh

SeqScore: Addressing Barriers to Reproducible Named Entity Recognition Evaluation
Chester Palen-Michel, Nolan Holley and Constantine Lignos

How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task
Urja Khurana, Eric Nalisnick and Antske Fokkens

StoryDB: Broad Multi-language Narrative Dataset
Alexey Tikhonov, Igor Samenko and Ivan Yamshchikov

What is SemEval evaluating? A Systematic Analysis of Evaluation Campaigns in NLP
Oskar Wysocki, Malina Florea, Dónal Landers and André Freitas

Differential Evaluation: a Qualitative Analysis of Natural Language Processing System Behavior Based Upon Data Resistance to Processing
Lucie Gianola, Hicham El Boukkouri, Cyril Grouin, Thomas Lavergne, Patrick Paroubek and Pierre Zweigenbaum

Trainable Ranking Models to Evaluate the Semantic Accuracy of Data-to-Text Neural Generator
Nicolas Garneau and Luc Lamontagne

Referenceless Parsing-Based Evaluation of AMR-to-English Generation
Emma Manning and Nathan Schneider

Evaluation of Unsupervised Automatic Readability Assessors Using Rank Correlations
Yo Ehara

Validating Label Consistency in NER Data Annotation
Qingkai Zeng, Mengxia Yu, Wenhao Yu, Tianwen Jiang and Meng Jiang


Shared task papers

The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results
Marina Fomicheva, Piyawat Lertvittayakumjorn, Wei Zhao, Steffen Eger and Yang Gao

IST-Unbabel 2021 Submission for the Explainable Quality Estimation Shared Task
Marcos Treviso, Nuno M. Guerreiro, Ricardo Rei and André F. T. Martins

The UMD Submission to the Explainable MT Quality Estimation Shared Task: Combining Explanation Models with Sequence Labeling
Tasnim Kabir and Marine Carpuat

Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics
Christoph Wolfgang Leiter

Explaining Errors in Machine Translation with Absolute Gradient Ensembles
Melda Eksi, Erik Gelbing, Jonathan Stieber and Chi Viet Vu

Explainable Quality Estimation: CUNI Eval4NLP Submission
Peter Polák, Muskaan Singh and Ondřej Bojar

Error Identification for Machine Translation with Metric Embedding and Attention
Raphael Rubino, Atsushi Fujita and Benjamin Marie