The 2nd Workshop on "Evaluation & Comparison of NLP Systems", co-located with EMNLP 2021
Zhao et al. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. EMNLP 2019.
Clark et al. Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts. ACL 2019.
Louis and Nenkova. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2), 2013.
Reimers and Gurevych. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. EMNLP 2017.
Dror et al. The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing. ACL 2018.
Glavaš et al. How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions. ACL 2019.
Shen et al. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. ACL 2018.
McCoy et al. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. ACL 2019.
Dodge et al. Show Your Work: Improved Reporting of Experimental Results. EMNLP 2019.
Wu et al. Errudite: Scalable, Reproducible, and Testable Error Analysis. ACL 2019.
Böhm et al. Better Rewards Yield Better Summaries: Learning to Summarise Without References. EMNLP 2019.
Sun et al. How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature. NeuralGen Workshop@NAACL 2019.
Jin et al. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. arXiv:1907.11932.
Peyrard. Studying Summarization Evaluation Metrics in the Appropriate Scoring Range. ACL 2019.
Peyrard and Eckle-Kohler. A Principled Framework for Evaluating Summarizers: Comparing Models of Summary Quality against Human Judgments. ACL 2017.
Owczarzak et al. An Assessment of the Accuracy of Automatic Evaluation in Summarization. Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, 2012.
Graham. Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE. EMNLP 2015.
Nenkova and Passonneau. Evaluating Content Selection in Summarization: The Pyramid Method. NAACL 2004.
Zhao et al. XMoverScore: Evaluating Machine Translation without Human Reference. EurNLP 2019.
Hase and Bansal. Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? ACL 2020.
Lertvittayakumjorn and Toni. Human-grounded Evaluations of Explanation Methods for Text Classification. EMNLP 2019.
DeYoung et al. ERASER: A Benchmark to Evaluate Rationalized NLP Models. ACL 2020.
Jacovi and Goldberg. Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? ACL 2020.
Gao et al. SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization. ACL 2020.
Zhao et al. On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation. ACL 2020.
Mehri and Eskenazi. USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. ACL 2020.
Beigman Klebanov and Madnani. Automated Evaluation of Writing – 50 Years and Counting. ACL 2020.
Mueller et al. Cross-Linguistic Syntactic Evaluation of Word Prediction Models. ACL 2020.