References

The 2nd Workshop on "Evaluation & Comparison of NLP Systems", co-located with EMNLP 2021
  1. Zhao et al. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. EMNLP 2019.
  2. Clark et al. Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts. ACL 2019.
  3. Louis and Nenkova. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2), 2013.
  4. Reimers and Gurevych. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. EMNLP 2017.
  5. Dror et al. The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing. ACL 2018.
  6. Glavaš et al. How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions. ACL 2019.
  7. Shen et al. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. ACL 2018.
  8. McCoy et al. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. ACL 2019.
  9. Dodge et al. Show Your Work: Improved Reporting of Experimental Results. EMNLP 2019.
  10. Wu et al. Errudite: Scalable, Reproducible, and Testable Error Analysis. ACL 2019.
  11. Böhm et al. Better Rewards Yield Better Summaries: Learning to Summarise Without References. EMNLP 2019.
  12. Sun et al. How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature. NeuralGen Workshop@NAACL 2019.
  13. Jin et al. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. arXiv:1907.11932.
  14. Peyrard. Studying Summarization Evaluation Metrics in the Appropriate Scoring Range. ACL 2019.
  15. Peyrard and Eckle-Kohler. A Principled Framework for Evaluating Summarizers: Comparing Models of Summary Quality against Human Judgments. ACL 2017.
  16. Owczarzak et al. An Assessment of the Accuracy of Automatic Evaluation in Summarization. Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, 2012.
  17. Graham. Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE. EMNLP 2015.
  18. Nenkova and Passonneau. Evaluating content selection in summarization: The pyramid method. NAACL 2004.
  19. Zhao et al. XMoverScore: Evaluating Machine Translation without Human Reference. EurNLP 2019.
  20. Hase and Bansal. Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? ACL 2020.
  21. Lertvittayakumjorn and Toni. Human-grounded Evaluations of Explanation Methods for Text Classification. EMNLP 2019.
  22. DeYoung et al. ERASER: A Benchmark to Evaluate Rationalized NLP Models. ACL 2020.
  23. Jacovi and Goldberg. Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? ACL 2020.
  24. Gao et al. SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization. ACL 2020.
  25. Zhao et al. On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation. ACL 2020.
  26. Mehri and Eskenazi. USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. ACL 2020.
  27. Beigman Klebanov and Madnani. Automated Evaluation of Writing – 50 Years and Counting. ACL 2020.
  28. Mueller et al. Cross-Linguistic Syntactic Evaluation of Word Prediction Models. ACL 2020.