Eval4NLP 2022

The 3rd Workshop on "Evaluation & Comparison of NLP Systems", 20th November 2022, co-located with AACL 2022

Latest News

September 13, 2022: The link to submit your paper with reviews from another venue is now available. Upload your paper and 3 reviews from any other venue (ARR, *ACL, etc.) here by September 21, AoE! See here for more details.
August 8, 2022: The submission deadline has been extended and the important dates have been adjusted accordingly. See the Call for Papers for more details.
May 18, 2022: The Call for Papers is out!
April 10, 2022: Workshop accepted at AACL 2022!

Important Dates

All deadlines are 11:59 pm UTC-12 (“Anywhere on Earth”).

  • August 15, 2022 (extended from August 8, 2022): Deadline for direct submission to Eval4NLP through OpenReview here
  • September 21, 2022 (extended from September 10, 2022): Submission of papers and reviews from another venue (ARR, *ACL, etc.) to Eval4NLP (see more details)
  • September 28, 2022 (moved from September 25, 2022): Notification of acceptance
  • October 10, 2022: Camera-ready papers due
  • November 20, 2022: Workshop day


Fair evaluations and comparisons are of fundamental importance to the NLP community to properly track progress, especially within the current deep learning revolution, with new state-of-the-art results reported in ever shorter intervals. This concerns the creation of benchmark datasets that cover typical use cases and blind spots of existing systems, the design of metrics for evaluating the performance of NLP systems along different dimensions, and the reporting of evaluation results in an unbiased manner.

Although certain aspects of NLP evaluation and comparison have been addressed in previous workshops (e.g., Metrics Tasks at WMT, NeuralGen, NLG-Evaluation, and New Frontiers in Summarization), we believe that new insights and methodology, particularly in the last 2-3 years, have led to much renewed interest in the workshop topic. The first workshop in the series, Eval4NLP’20 (co-located with EMNLP’20), was the first workshop to take a broad and unifying perspective on the subject matter. The second workshop, Eval4NLP’21 (co-located with EMNLP’21), extended this perspective. We believe the third workshop will continue the tradition and become a reputed platform for presenting and discussing the latest advances in NLP evaluation methods and resources.

Particular topics of interest of the workshop include (but are not limited to):

  1. Designing evaluation metrics
    Proposing and/or analyzing:
    • Metrics with desirable properties, e.g., high correlation with human judgments, strong ability to distinguish high-quality outputs from mediocre and low-quality outputs, robustness across lengths of input and output sequences, efficiency at run time, etc.;
    • Reference-free evaluation metrics, which only require source text(s) and system predictions;
    • Cross-domain metrics, which can reliably and robustly measure the quality of system outputs from heterogeneous modalities (e.g., image and speech), different genres (e.g., newspapers, Wikipedia articles and scientific papers) and different languages;
    • Cost-effective methods for eliciting high-quality manual annotations; and
    • Methods and metrics for evaluating interpretability and explanations of NLP models
  2. Creating adequate evaluation data
    Proposing new datasets or analyzing existing ones by studying their:
    • Coverage and diversity, e.g., size of the corpus, covered phenomena, representativeness of samples, distribution of sample types, variability among data sources, eras, and genres; and
    • Quality of annotations, e.g., consistency of annotations, inter-rater agreement, and bias checks
  3. Reporting correct results
    Ensuring and reporting:
    • Statistics for the trustworthiness of results, e.g., via appropriate significance tests, and reporting of score distributions rather than single-point estimates, to avoid chance findings;
    • Reproducibility of experiments, e.g., quantifying the reproducibility of papers and issuing reproducibility guidelines; and
    • Comprehensive and unbiased error analyses and case studies, avoiding cherry-picking and sampling bias.

See reference papers here.

Contact us

Email: eval4nlp@gmail.com