|Aug 31, 2021||As the CodaLab competition system is unstable, the organizers have decided to extend the submission deadline to September 3, 2021.|
|Aug 20, 2021||The test phase of our shared task begins now! The test data is here. Please don't forget to join our Google Group for the latest updates.|
|Aug 10, 2021||The Submission Format has been updated: we now require an additional metadata.txt file (see the Submission Format section below).|
|Jun 30, 2021||The CodaLab competition of our shared task is now live!|
|Jun 16, 2021||Please join our Google Group for posting questions related to the shared task.|
|Jun 10, 2021||The annotation guidelines of our shared task are now available.|
|Jun 09, 2021||The baseline of our shared task is now available.|
|May 24, 2021||The shared task is announced.|
All deadlines are 11:59 pm UTC-12 (“Anywhere on Earth”).
Recent Natural Language Processing (NLP) systems based on pre-trained representations from Transformer language models, such as BERT and XLM-RoBERTa, have achieved outstanding results in a variety of tasks. This boost in performance, however, comes at the cost of efficiency and interpretability. Interpretability is a major concern in modern Artificial Intelligence (AI) and NLP research, as black-box models undermine users’ trust in new technologies.
In this shared task, we focus on evaluating machine translation (MT) as an example of this problem. Specifically, we look at the task of quality estimation (QE), a.k.a. reference-free evaluation, where the aim is to predict the quality of MT output at inference time, without access to reference translations. Translation quality can be assessed at different levels of granularity: at the sentence level, i.e. predicting the overall quality of translated sentences, and at the word level, i.e. highlighting specific errors in the MT output. These have traditionally been treated as two separate tasks, each requiring dedicated training data.
In this shared task, we propose to address translation error identification as an explainability task. Explainability is a broad area aimed at explaining predictions of machine learning models. Rationale extraction methods achieve this by selecting a portion of the input that justifies model output for a given data point. In translation, human perception of quality is guided by the number and severity of translation errors. By framing error identification as rationale extraction for sentence-level quality estimation systems, this shared task offers an opportunity to study whether such systems behave in the same way as humans would do.
Explanations can be obtained either by building inherently interpretable models (Yu et al., 2019) or by using post-hoc explanation methods that extract explanations from an existing model. In this shared task, we will provide both sentence-level training data and strong sentence-level models, and thus encourage participants to explore both approaches.
The task will consist of building a quality estimation system that (i) predicts the quality score for an input pair of source text and MT hypothesis, and (ii) provides word-level evidence for its predictions.
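Concretely, such a system can be thought of as exposing an interface like the sketch below. This is purely illustrative: the names (`QEOutput`, `predict`) are hypothetical and not part of any released code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QEOutput:
    sentence_score: float       # predicted quality of the MT hypothesis
    source_scores: List[float]  # one importance score per source token
    target_scores: List[float]  # one importance score per target token

def predict(source_tokens: List[str], target_tokens: List[str]) -> QEOutput:
    # Placeholder scorer: a real system would run a trained QE model here
    # and derive token-level evidence from it.
    return QEOutput(
        sentence_score=0.0,
        source_scores=[0.0] * len(source_tokens),
        target_scores=[0.0] * len(target_tokens),
    )
```

A submission thus pairs every sentence-level prediction with aligned token-level evidence for both sides of the translation.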
The repository linked below contains datasets, evaluation scripts, and instructions on how to produce baseline results.
The training and development data for this shared task are the Estonian-English (Et-En) and Romanian-English (Ro-En) partitions of the MLQE-PE dataset (Fomicheva et al., 2020). Sentence-level QE systems can be trained using the sentence-level quality scores. Word-level labels derived from post-editing can be used for development purposes. However, we discourage participants from using the word-level data for training, as the goal of the shared task is to explore word-level quality estimation in an unsupervised setting, i.e. as a rationale extraction task.
As test data, we will collect sentence-level quality scores and word-level error annotations for these two language pairs. We will also provide a zero-shot test set for the German-Chinese (De-Zh) and Russian-German (Ru-De) language pairs, for which no sentence-level or word-level annotations will be available at training time. Human annotators will be asked to indicate translation errors as an explanation for the overall sentence scores, as well as the corresponding words in the source sentence.
Below we provide an example of the test data and the output that is expected from the participants:
The aim of the evaluation is to assess the quality of explanations, not sentence-level predictions. Therefore, the main evaluation metrics will be AUC (area under the ROC curve) and AUPRC (area under the precision-recall curve) scores for word-level explanations.
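To illustrate how these metrics score a system's explanations against gold word-level error labels, here is a small sketch using scikit-learn. The labels and scores are made up for illustration; the official evaluation scripts are in the repository linked above.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical example: gold binary error labels (1 = error) for the
# target tokens of one sentence, and a system's unnormalized importance
# scores for the same tokens.
gold_labels = [0, 0, 1, 1, 0]
explanation_scores = [0.1, 0.6, 0.9, 0.4, 0.3]

auc = roc_auc_score(gold_labels, explanation_scores)
auprc = average_precision_score(gold_labels, explanation_scores)
print(f"AUC:   {auc:.3f}")
print(f"AUPRC: {auprc:.3f}")
```

Because both metrics are ranking-based, the raw scores need not be normalized, only ordered sensibly: tokens that are more likely to be errors should receive higher scores.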
The participants can submit "only target explanations" or "both source explanations and target explanations".
In the repository linked above, we provide links to the TransQuest sentence-level QE models (Ranasinghe et al., 2020), which were among the top-performing submissions at the WMT2020 QE Shared Task. The models are based on fine-tuning multilingual pre-trained representations for the sentence-level QE task on the direct assessment (DA) quality scores from the MLQE-PE dataset. Both the models and the code are freely available. Participants can use these models and explore post-hoc approaches to rationale extraction. Participants are also free to train their own QE models and explore architectures that allow word-level interpretation of model predictions. As a baseline, we will use TransQuest as the QE model and LIME (Ribeiro et al., 2016), a model-agnostic explanation method, for rationale extraction.
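To give a flavor of the perturbation idea behind model-agnostic methods such as LIME, here is a minimal leave-one-out occlusion sketch. The `toy_qe` scorer is an assumption standing in for a real sentence-level QE model (the actual baseline uses TransQuest and LIME, not this simplification):

```python
from typing import Callable, List

def occlusion_scores(tokens: List[str],
                     score_fn: Callable[[List[str]], float]) -> List[float]:
    """Leave-one-out token importance: how much does the sentence-level
    quality prediction change when each token is removed?"""
    base = score_fn(tokens)
    importances = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + tokens[i + 1:]
        # A large change in predicted quality suggests the token matters
        # for the prediction (e.g. it may be a translation error).
        importances.append(abs(base - score_fn(perturbed)))
    return importances

# Toy scorer (assumption): quality drops when the token "bad" is present.
def toy_qe(tokens: List[str]) -> float:
    return 1.0 - 0.5 * tokens.count("bad") / max(len(tokens), 1)

print(occlusion_scores(["this", "is", "bad"], toy_qe))
```

LIME refines this idea by sampling many random perturbations and fitting a local linear model whose weights serve as token importances, rather than removing one token at a time.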
We use CodaLab as a platform for participants to submit their predictions for the test dataset. The link to our CodaLab competition is https://competitions.codalab.org/competitions/33038. The competition consists of two main phases.
metadata.txt must have exactly three non-empty lines.
constrained or unconstrained, indicating the submission track.
constrained means that you did not train your system on word-level labels, whereas unconstrained means that you trained your system on word-level labels.
sentence.submission with sentence-level scores, one score per line.
target.submission with target token-level scores. Each line must contain a sequence of scores separated by white space. The number of scores must correspond to the number of target tokens.
source.submission with source token-level scores. Each line must contain a sequence of scores separated by white space. The number of scores must correspond to the number of source tokens.
Token-level scores must represent the importance of each token towards the sentence-level prediction, where a higher score means the token is more likely to be an error (in the case of target tokens) or related to an error (in the case of source tokens). The scores do not need to be normalized.
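As a sketch, the three files can be written as follows; the scores below are made up for illustration, and the file names follow the format described above:

```python
# Hypothetical predictions for a two-sentence test set.
sentence_scores = [0.71, 0.42]
target_scores = [[0.1, 0.8, 0.3], [0.2, 0.9]]   # one list per sentence
source_scores = [[0.4, 0.1], [0.7, 0.2, 0.1]]   # lengths match token counts

# One sentence-level score per line.
with open("sentence.submission", "w") as f:
    for s in sentence_scores:
        f.write(f"{s}\n")

def write_token_scores(path, rows):
    # One line per sentence; token scores separated by white space.
    with open(path, "w") as f:
        for row in rows:
            f.write(" ".join(str(x) for x in row) + "\n")

write_token_scores("target.submission", target_scores)
write_token_scores("source.submission", source_scores)
```

Note that the number of scores on each line of target.submission and source.submission must match the number of tokens in the corresponding MT hypothesis and source sentence, respectively.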
Examples of the submission files for the two development phases can be found here.
We encourage each team to submit a paper describing their system to the workshop in order to be included in the workshop proceedings. It could be either a long paper (8 pages) or a short paper (4 pages) following the EMNLP 2021 templates and formatting requirements. The deadline for system paper submissions is September 17, 2021. We will announce the instructions for system paper submission as well as the submission system soon.
The authors of the best submissions will receive monetary prizes, kindly sponsored by the Artificial Intelligence Journal and Salesforce Research. Submissions will be judged according to several criteria: the scores on the leaderboards, the paper describing the method, and the resourcefulness and creativity of the method.