Eval4NLP 2021

The 2nd Workshop on "Evaluation & Comparison of NLP Systems", 10th November 2021, co-located virtually with EMNLP 2021

Latest News

Nov 10, 2021: Our workshop has concluded successfully. We would like to thank all authors, reviewers, the steering committee, keynote speakers, sponsors, and participants for making this workshop fantastic. We hope to see you all again at the 3rd Eval4NLP workshop.
Nov 10, 2021: The list of best paper awards has been announced. Congratulations!
Nov 08, 2021: The list of keynote speakers and their talk abstracts has been added to our Program.
Oct 28, 2021: The program of the workshop has been published.
Oct 10, 2021: The list of accepted papers has been published.
Aug 31, 2021: As the CodaLab competition system is unstable, the organizers have decided to extend the submission deadline of the shared task to September 3, 2021.
Aug 20, 2021: The test phase of our shared task begins now! The test data is here. Please don't forget to join our Google Group for the latest updates.
Jul 25, 2021: The submission deadline for research papers has been extended to July 31, 2021.
Jul 20, 2021: The Multiple Submission Policy and Presenting Published Papers sections in our call for papers have been updated.
Jun 30, 2021: The CodaLab competition of our shared task is now live!
Jun 16, 2021: Please join this Google Group to post questions related to our shared task.
Jun 10, 2021: The baseline and the annotation guidelines of our shared task are now available.
May 24, 2021: We announce the shared task on "Explainable Quality Estimation".
May 22, 2021: The submission system is now open! More details about preprints and supplementary materials have been added to the Call for Papers.
May 14, 2021: We also welcome submissions from ACL Rolling Review.
Apr 22, 2021: The Call for Papers is out!
Apr 17, 2021: The Artificial Intelligence Journal (AIJ) and Salesforce are our generous sponsors this year.
Nov 19, 2020: The workshop website has been launched.


Fair evaluations and comparisons are of fundamental importance to the NLP community for properly tracking progress, especially within the current deep learning revolution, with new state-of-the-art results reported at ever shorter intervals. This concerns the creation of benchmark datasets that cover typical use cases and blind spots of existing systems, the design of metrics for evaluating the performance of NLP systems along different dimensions, and the reporting of evaluation results in an unbiased manner.

Although certain aspects of NLP evaluation and comparison have been addressed in previous workshops (e.g., Metrics Tasks at WMT, NeuralGen, NLG-Evaluation, and New Frontiers in Summarization), we believe that new insights and methodology, particularly in the last 1-2 years, have led to much renewed interest in the workshop topic. The first workshop in the series, Eval4NLP’20 (co-located with EMNLP’20), was the first workshop to take a broad and unifying perspective on the subject matter. We believe the second workshop will continue the tradition and become a reputable platform for presenting and discussing the latest advances in NLP evaluation methods and resources.

Particular topics of interest of the workshop include (but are not limited to):

  1. Designing evaluation metrics
    Proposing and/or analyzing:
    • Metrics with desirable properties, e.g., high correlation with human judgments, strong ability to distinguish high-quality outputs from mediocre and low-quality outputs, robustness across lengths of input and output sequences, and efficiency to run;
    • Reference-free evaluation metrics, which only require source text(s) and system predictions;
    • Cross-domain metrics, which can reliably and robustly measure the quality of system outputs from heterogeneous modalities (e.g., image and speech), different genres (e.g., newspapers, Wikipedia articles and scientific papers) and different languages;
    • Cost-effective methods for eliciting high-quality manual annotations; and
    • Methods and metrics for evaluating interpretability and explanations of NLP models
  2. Creating adequate evaluation data
    Proposing new datasets or analyzing existing ones by studying their:
    • Coverage and diversity, e.g., size of the corpus, covered phenomena, representativeness of samples, distribution of sample types, variability among data sources, eras, and genres; and
    • Quality of annotations, e.g., consistency of annotations, inter-rater agreement, and bias checks
  3. Reporting correct results
    Ensuring and reporting:
    • Statistics for the trustworthiness of results, e.g., via appropriate significance tests, and reporting of score distributions rather than single-point estimates, to avoid chance findings;
    • Reproducibility of experiments, e.g., quantifying the reproducibility of papers and issuing reproducibility guidelines; and
    • Comprehensive and unbiased error analyses and case studies, avoiding cherry-picking and sampling bias.

See reference papers here.

Related Workshops

HumEval invites submissions on all aspects of human evaluation of NLP systems.

Contact us

Email: eval4nlp@gmail.com