Vu et al. Layer or Representation Space: What Makes BERT-based Evaluation Metrics Robust? In: COLING 2022.
Deutsch and Roth. Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics. In: Findings of ACL 2022.
Deutsch and Roth. Understanding the Extent to which Content Quality Metrics Measure the Information Quality of Summaries. In: CoNLL 2021.
Deutsch et al. A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods. In: TACL 2021.
Deutsch et al. Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary. In: TACL 2021.
Deutsch and Roth. SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics. In: NLP-OSS 2020.
Mueller et al. Cross-Linguistic Syntactic Evaluation of Word Prediction Models. In: ACL 2020.
Beigman Klebanov and Madnani. Automated Evaluation of Writing – 50 Years and Counting. In: ACL 2020.
Mehri and Eskenazi. USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. In: ACL 2020.
Zhao et al. On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation. In: ACL 2020.
Gao et al. SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization. In: ACL 2020.
Jacovi and Goldberg. Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? In: ACL 2020.
DeYoung et al. ERASER: A Benchmark to Evaluate Rationalized NLP Models. In: ACL 2020.
Lertvittayakumjorn and Toni. Human-grounded Evaluations of Explanation Methods for Text Classification. In: EMNLP 2019.
Hase and Bansal. Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? In: ACL 2020.
Zhao et al. XMoverScore: Evaluating Machine Translation without Human Reference. In: EurNLP 2019.
Nenkova and Passonneau. Evaluating Content Selection in Summarization: The Pyramid Method. In: NAACL 2004.
Graham. Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE. In: EMNLP 2015.
Owczarzak et al. An Assessment of the Accuracy of Automatic Evaluation in Summarization. In: Workshop on Evaluation Metrics and System Comparison for Automatic Summarization 2012.
Peyrard and Eckle-Kohler. A Principled Framework for Evaluating Summarizers: Comparing Models of Summary Quality against Human Judgments. In: ACL 2017.
Peyrard. Studying Summarization Evaluation Metrics in the Appropriate Scoring Range. In: ACL 2019.
Jin et al. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. arXiv preprint arXiv:1907.11932.
Sun et al. How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature. In: NeuralGen Workshop @ NAACL 2019.
Böhm et al. Better Rewards Yield Better Summaries: Learning to Summarise Without References. In: EMNLP 2019.
Wu et al. Errudite: Scalable, Reproducible, and Testable Error Analysis. In: ACL 2019.
Dodge et al. Show Your Work: Improved Reporting of Experimental Results. In: EMNLP 2019.
McCoy et al. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In: ACL 2019.
Shen et al. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. In: ACL 2018.
Glavaš et al. How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions. In: ACL 2019.
Dror et al. The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing. In: ACL 2018.
Reimers and Gurevych. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In: EMNLP 2017.
Louis and Nenkova. Automatically Assessing Machine Summary Content Without a Gold Standard. In: Computational Linguistics 39(2), 2013.
Clark et al. Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts. In: ACL 2019.
Zhao et al. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. In: EMNLP 2019.