| Time | Session |
|---|---|
| 09:00 - 09:10 | Opening Remarks |
| 09:15 - 09:55 | Keynote Talk 1 (Ehud Reiter): High-Quality Human Evaluations of NLG |
| 10:00 - 10:40 | Paper Presentation Session 1 (Session chair: Yang Gao) |
| 10:45 - 11:25 | Keynote Talk 2 (Sebastian Ruder): Challenges and Opportunities in Multilingual Evaluation |
| 11:30 - 12:10 | Paper Presentation Session 2 (Session chair: Steffen Eger) |
| 12:15 - 12:55 | Keynote Talk 3 (Dan Roth): Evaluating Evaluation |
| 13:00 - 14:00 | Lunch Break |
| 14:00 - 14:40 | Keynote Talk 4 (Jason Wu): Towards Trustworthy Evaluation and Interpretation for Summarization and Dialogue |
| 14:45 - 15:25 | Paper Presentation Session 3 (Session chair: Piyawat Lertvittayakumjorn) |
| 15:30 - 16:10 | Keynote Talk 5 (Ani Nenkova): Temporal effects on NLP models |
| 16:15 - 16:55 | Paper Presentation Session 4 (Session chair: Marina Fomicheva) |
| 17:00 - 17:45 | Award Announcement & Shared Task Winners Presentation (Session chair: Steffen Eger) |
| 17:50 - 18:00 | Concluding Remarks |
Talk: High-Quality Human Evaluations of NLG [Slides]
Most evaluations in NLP are based on metrics or cheap human evaluations. But it is important to also conduct high-quality human evaluations, even though these can be expensive and time-consuming: such evaluations give us the best understanding of the performance of NLP systems, and they can also serve as "gold-standard" evaluations for validating and assessing the reliability of metrics and cheaper human evaluations. In this talk, I will review some of our previous work on high-quality human evaluation of NLG, and then discuss current work on evaluating the factual accuracy of generated texts, evaluating the real-world utility of summaries of medical consultations, and enhancing the reproducibility of human evaluations.
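As a rough illustration of the reliability questions the talk raises (not part of the talk itself), a minimal sketch of checking agreement between two annotators in a human evaluation; all ratings below are invented:

```python
# Minimal sketch (hypothetical data): estimating how reliable a human
# evaluation is by measuring agreement between two annotators.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

annotator_a = [5, 4, 4, 2, 3, 5, 1, 4]   # invented 1-5 ratings per system output
annotator_b = [5, 3, 4, 2, 2, 5, 2, 4]

kappa = cohen_kappa_score(annotator_a, annotator_b)   # chance-corrected agreement
rho, _ = spearmanr(annotator_a, annotator_b)          # rank correlation of ratings
print(f"Cohen's kappa = {kappa:.2f}, Spearman rho = {rho:.2f}")
```

Low agreement between annotators would suggest the human evaluation itself is not yet reliable enough to serve as a gold standard for validating metrics.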
Talk: Challenges and Opportunities in Multilingual Evaluation [Slides]
As NLP systems become increasingly multilingual, we are presented with the challenge of how to effectively evaluate them across many languages. In this talk, I will discuss some of the challenges in multilingual evaluation, including creating benchmarks that accurately assess performance in different languages as well as designing evaluation setups and metrics that are not biased towards a particular language. I will provide examples to illustrate these challenges and finally discuss potential solutions.
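One concrete way an evaluation setup can be biased towards a particular language, sketched here with invented numbers (not an example from the talk): pooling all test examples favours languages with larger test sets, whereas a macro-average weights every language equally.

```python
# Minimal sketch (invented per-language accuracies and test-set sizes):
# pooled vs. macro-averaged scores in a multilingual evaluation.
results = {            # language: (accuracy, number of test examples)
    "en": (0.91, 10000),
    "de": (0.84, 3000),
    "sw": (0.62, 500),
    "yo": (0.55, 200),
}

total_correct = sum(acc * n for acc, n in results.values())
total_n = sum(n for _, n in results.values())
micro = total_correct / total_n                                  # dominated by English
macro = sum(acc for acc, _ in results.values()) / len(results)   # each language counts equally

print(f"pooled (micro) accuracy:      {micro:.3f}")
print(f"macro-average over languages: {macro:.3f}")
```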
Talk: Towards Trustworthy Evaluation and Interpretation for Summarization and Dialogue [Slides]
In this talk, we first introduce the SummEval library, which re-evaluates 23 text summarizers with 14 automatic metrics and investigates their correlation with human judgement. We then introduce the SummVis library, a visualization toolkit for interacting with models, data, and predictions, and show a few case studies demonstrating its use for identifying hallucinations in text generation. Lastly, we briefly discuss recent trends and challenges in factual-consistency evaluation for summarization and dialogue tasks.
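The kind of meta-evaluation SummEval performs can be sketched as correlating metric scores with human judgements across systems. The snippet below is not the SummEval API, just a minimal illustration with invented system-level scores:

```python
# Minimal sketch (hypothetical scores, not SummEval code): correlating an
# automatic metric with human judgements at the system level.
from scipy.stats import kendalltau, pearsonr

# Invented system-level scores for six summarizers.
human  = [4.1, 3.6, 3.9, 2.8, 3.2, 4.4]        # e.g. averaged expert ratings
metric = [0.38, 0.33, 0.36, 0.29, 0.31, 0.40]  # e.g. some automatic metric

tau, _ = kendalltau(human, metric)
r, _   = pearsonr(human, metric)
print(f"Kendall tau = {tau:.2f}, Pearson r = {r:.2f}")
```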
Talk: Temporal effects on NLP models
How does the performance of models trained to perform language-related tasks change over time? Proper experimental design to study this question is trickier than for most other tasks. I will present a set of experiments with systems powered by large neural pretrained representations for English to demonstrate that temporal model deterioration is not as big a concern as one might expect, with some models in fact improving when tested on data drawn from later time periods. Temporal domain adaptation is, however, beneficial: better performance on a given time period is possible when the system is trained on temporally more recent data. I will highlight the difficulties in evaluating temporal effects and how these can potentially be mitigated.
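The experimental design in question can be sketched as training on data from different time periods and testing on a later, held-out period. The snippet below is not the talk's actual setup; it uses synthetic data purely to illustrate the protocol of comparing an older model against a temporally adapted one.

```python
# Minimal sketch (synthetic data): a temporal-split experiment comparing a
# model trained on older data with one trained on more recent data, both
# evaluated on a later, held-out time period.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def period(drift, n=4000):
    """Toy classification data whose input distribution drifts over time."""
    X = rng.normal(loc=drift, scale=1.0, size=(n, 20))
    y = (X[:, :5].sum(axis=1) > 5 * drift).astype(int)
    return X, y

X_old, y_old   = period(drift=0.0)   # e.g. data collected in an earlier year
X_new, y_new   = period(drift=0.3)   # e.g. more recent data
X_test, y_test = period(drift=0.4)   # held-out data from a later period

old_model = LogisticRegression(max_iter=1000).fit(X_old, y_old)
new_model = LogisticRegression(max_iter=1000).fit(X_new, y_new)

print("trained on older data :", round(old_model.score(X_test, y_test), 3))
print("trained on recent data:", round(new_model.score(X_test, y_test), 3))
```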