Vu et al. Layer or Representation Space: What Makes BERT-based Evaluation Metrics Robust? In: COLING 2022.
Deutsch and Roth. Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics. In: Findings of ACL 2022.
Deutsch and Roth. Understanding the Extent to which Content Quality Metrics Measure the Information Quality of Summaries. In: CoNLL 2021.
Deutsch et al. A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods. In: TACL 2021.
Deutsch et al. Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary. In: TACL 2021.
Deutsch and Roth. SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics. In: NLP-OSS 2020.
Mueller et al. Cross-Linguistic Syntactic Evaluation of Word Prediction Models. In: ACL 2020.
Beigman Klebanov and Madnani. Automated Evaluation of Writing – 50 Years and Counting. In: ACL 2020.
Mehri and Eskenazi. USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. In: ACL 2020.
Zhao et al. On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation. In: ACL 2020.
Gao et al. SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization. In: ACL 2020.
Jacovi and Goldberg. Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? In: ACL 2020.
DeYoung et al. ERASER: A Benchmark to Evaluate Rationalized NLP Models. In: ACL 2020.
Lertvittayakumjorn and Toni. Human-grounded Evaluations of Explanation Methods for Text Classification. In: EMNLP 2019.
Hase and Bansal. Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? In: ACL 2020.
Zhao et al. XMoverScore: Evaluating Machine Translation without Human Reference. In: EurNLP 2019.
Nenkova and Passonneau. Evaluating Content Selection in Summarization: The Pyramid Method. In: NAACL 2004.
Graham. Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE. In: EMNLP 2015.
Owczarzak et al. An Assessment of the Accuracy of Automatic Evaluation in Summarization. In: Workshop on Evaluation Metrics and System Comparison for Automatic Summarization 2012.
Peyrard and Eckle-Kohler. A Principled Framework for Evaluating Summarizers: Comparing Models of Summary Quality against Human Judgments. In: ACL 2017.
Peyrard. Studying Summarization Evaluation Metrics in the Appropriate Scoring Range. In: ACL 2019.
Jin et al. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. arXiv preprint arXiv:1907.11932.
Sun et al. How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature. In: NeuralGen Workshop @ NAACL 2019.
Böhm et al. Better Rewards Yield Better Summaries: Learning to Summarise Without References. In: EMNLP 2019.
Wu et al. Errudite: Scalable, Reproducible, and Testable Error Analysis. In: ACL 2019.
Dodge et al. Show Your Work: Improved Reporting of Experimental Results. In: EMNLP 2019.
McCoy et al. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In: ACL 2019.
Shen et al. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. In: ACL 2018.
Glavaš et al. How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions. In: ACL 2019.
Dror et al. The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing. In: ACL 2018.
Reimers and Gurevych. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In: EMNLP 2017.
Louis and Nenkova. Automatically Assessing Machine Summary Content Without a Gold Standard. In: Computational Linguistics 39(2), 2013.
Clark et al. Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts. In: ACL 2019.
Zhao et al. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. In: EMNLP 2019.