Eval4NLP 2023

The 4th Workshop on "Evaluation & Comparison of NLP Systems", 1 November 2023, co-located with AACL 2023
The proceedings of our workshop (including the shared task) are available here. An updated version of the shared task overview paper can also be found on arXiv.
We're excited to announce this year's invited talk by Alexander Fabbri! It will take place on 1 November 2023 at 09:00 Central Indonesian Time (UTC+8). See our program for full scheduling details.
Important information: Eval4NLP 2023 will take place virtually!

Important Dates

All deadlines are 11:59 pm UTC-12 ("Anywhere on Earth").

  • September 1, 2023: Direct submission to Eval4NLP deadline through OpenReview here
  • September 30, 2023: Submission of pre-reviewed papers to Eval4NLP through OpenReview here
  • October 2, 2023: Notification of acceptance
  • October 13, 2023: Camera-ready papers due
  • November 1, 2023: Workshop day

New: This year's edition of the Eval4NLP workshop focuses on the evaluation of and through large language models (LLMs). Notably, the workshop features a shared task on LLM evaluation and specifically encourages the submission of papers focused on LLM evaluation. Other submissions that fit the general scope of Eval4NLP are, of course, also welcome. See below for more details.


The current year has brought astonishing achievements in NLP. Generative large language models (LLMs) like ChatGPT and GPT-4 demonstrate broad capabilities in understanding and performing tasks from in-context descriptions without fine-tuning, bringing worldwide attention to the risks and opportunities arising from current and ongoing research. Further, the release of open-source models like LLaMA and Falcon LLM, better quantization techniques for inference and training, and the adoption of efficient fine-tuning techniques such as LoRA have accelerated research progress by improving hardware and runtime efficiency. Given the ever-growing pace of research, fair evaluations and comparisons are of fundamental importance to the NLP community in order to properly track progress. This concerns the creation of benchmark datasets that cover typical use cases and blind spots of existing systems, the design of metrics for evaluating the performance of NLP systems along different dimensions, and the reporting of evaluation results in an unbiased manner.

Although certain aspects of NLP evaluation and comparison have been addressed in previous workshops (e.g., the Metrics Tasks at WMT, NeuralGen, NLG-Evaluation, and New Frontiers in Summarization), we believe that new insights and methodology, particularly over the last 2-3 years, have led to renewed interest in the workshop topic. The first workshop in the series, Eval4NLP’20 (co-located with EMNLP’20), was the first to take a broad and unifying perspective on the subject matter. The second (Eval4NLP’21, co-located with EMNLP’21) and third (Eval4NLP’22, co-located with AACL’22) workshops extended this perspective. We believe the fourth workshop will continue this tradition and become an established platform for presenting and discussing the latest advances in NLP evaluation methods and resources. As indicated above, this year we especially encourage the submission of works that consider the evaluation of LLMs and their generated content, as well as works that leverage LLMs in their evaluation strategies.

Further topics of interest for the workshop include (but are not limited to):

  1. Designing evaluation metrics and evaluation methodology
  2. Creating adequate evaluation data and evaluation test suites
  3. Reporting correct and reproducible results

See the call for papers for more details. Further reference papers can be found here.

Contact us

Email: eval4nlp@gmail.com