Shared Task

The 2nd Workshop on "Evaluation & Comparison of NLP Systems" Co-located at EMNLP 2021

Explainable Quality Estimation

Latest News

Aug 31, 2021: As the CodaLab competition system is unstable, the organizers have decided to extend the submission deadline to September 3, 2021.
Aug 20, 2021: The test phase of our shared task begins now! The test data is here. Please don't forget to join our Google Group for the latest updates.
Aug 10, 2021: The Submission Format has been updated. We require an additional metadata.txt file to be included in the submission zip file of the test phases.
Jun 30, 2021: The CodaLab competition of our shared task is now live!
Jun 16, 2021: Please join our Google Group for posting questions related to the shared task.
Jun 10, 2021: The annotation guidelines of our shared task are now available.
Jun 09, 2021: The baseline of our shared task is now available.
May 24, 2021: The shared task is announced.

Important Dates

All deadlines are 11:59 pm UTC-12 (“Anywhere on Earth”).

  • Training and development data release: May 24, 2021
  • Test data release: August 20, 2021
  • Submission deadline: September 3, 2021 (extended from September 1, 2021)
  • System paper submission deadline: September 17, 2021
  • Workshop day: November 10 or 11, 2021


Recent Natural Language Processing (NLP) systems based on pre-trained representations from Transformer language models, such as BERT and XLM-RoBERTa, have achieved outstanding results in a variety of tasks. This boost in performance, however, comes at the cost of efficiency and interpretability. Interpretability is a major concern in modern Artificial Intelligence (AI) and NLP research, as black-box models undermine users’ trust in new technologies.

In this shared task, we focus on evaluating machine translation (MT) as an example of this problem. Specifically, we look at the task of quality estimation (QE), a.k.a. reference-free evaluation, where the aim is to predict the quality of MT output at inference time, without access to reference translations. Translation quality can be assessed at different levels of granularity: the sentence level, i.e. predicting the overall quality of translated sentences, and the word level, i.e. highlighting specific errors in the MT output. These have traditionally been treated as two separate tasks, each requiring dedicated training data.

In this shared task, we propose to address translation error identification as an explainability task. Explainability is a broad area aimed at explaining the predictions of machine learning models. Rationale extraction methods achieve this by selecting the portion of the input that justifies the model output for a given data point. In translation, human perception of quality is guided by the number and severity of translation errors. By framing error identification as rationale extraction for sentence-level quality estimation systems, this shared task offers an opportunity to study whether such systems behave in the same way as humans do.

Explanations can be obtained either by building inherently interpretable models (Yu et al., 2019) or by using post-hoc explanation methods, which extract explanations from an existing model. In the shared task, we will provide both sentence-level training data and strong sentence-level models, and we thus encourage participants to explore both approaches.

Goals


  • We would like to foster progress in the plausibility aspect of explanations, i.e., how similar generated explanations are to human explanations, by proposing a new challenging task and a test set with manually annotated rationales.
  • Word-level MT error annotation is hard and time-consuming. This shared task will encourage research on unsupervised or semi-supervised methods for error identification.
  • This task offers an opportunity to study how current NLP evaluation systems arrive at their predictions and to what extent this process is aligned with human reasoning.

Task Description

The task consists of building a quality estimation system that (i) predicts the quality score for an input pair of source text and MT hypothesis, and (ii) provides word-level evidence for its predictions.

The repository linked below contains datasets, evaluation scripts, and instructions on how to produce baseline results.


The training and development data for this shared task are the Estonian-English (Et-En) and Romanian-English (Ro-En) partitions of the MLQE-PE dataset (Fomicheva et al. 2020). Sentence-level QE systems can be trained using the sentence-level quality scores. Word-level labels derived from post-editing can be used for development purposes. However, we discourage participants from using the word-level data for training, as the goal of the shared task is to explore word-level quality estimation in an unsupervised setting, i.e. as a rationale extraction task.

As test data, we will collect sentence-level quality scores and word-level error annotations for these two language pairs. We will also provide a zero-shot test set for the German-Chinese (De-Zh) and Russian-German (Ru-De) language pairs, for which no sentence-level or word-level annotation is available at training time. Human annotators will be asked to indicate translation errors as an explanation for the overall sentence scores, as well as the corresponding words in the source sentence.

Below we provide an example of the test data and the output that is expected from the participants:

  • Source: Pronksiajal võeti kasutusele pronksist tööriistad , ent käepidemed valmistati ikka puidust
  • MT: Bronking tools were introduced during the long term, but handholds were still made up of wood .
  • Gold Explanations Source: 1 0 0 1 0 0 0 1 0 0 0
  • Gold Explanations MT: 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0
  • Model Explanations Source: 0.8 0.5 0.6 0.7 0.4 0.2 0.3 0.6 0.1 0.2 0.2
  • Model Explanations MT: 0.9 0.6 0.6 0.8 0.5 0.5 0.6 0.7 0.2 0.1 0.9 0.2 0.1 0.3 0.5 0.6 0.1 0.5
  • Gold Sentence Level Score: 58
  • Predicted Sentence Level Score: 44

  • Highlighted tokens in the MT output represent various major errors that distort the meaning of the source sentence and explain the low sentence-level score.
  • Highlighted tokens in the source sentence correspond to translation errors in the target.
  • Gold Explanations Source and Gold Explanations MT are the binary scores that will be used as ground truth for evaluation, where 1 represents tokens that are relevant for the overall quality score, as they indicate why the translation is imperfect.
  • Model Explanations Source and Model Explanations MT are continuous scores that need to be provided by the participants, where the tokens with the highest scores are expected to correspond to the tokens considered relevant by human annotators. Thus, participants are expected to provide a continuous score for each token indicating its importance for the model prediction.
  • Gold Sentence Level Score is the ground truth sentence score in the range [0..100], where a higher score means a better translation. These scores will be available at training time and will be used to assess the overall performance of the sentence-level model at test time.
  • Predicted Sentence Level Score is the predicted sentence score from the QE model that is expected to be provided by the participants.


The aim of the evaluation is to assess the quality of explanations, not of sentence-level predictions. Therefore, the main evaluation metrics will be AUC (Area Under the ROC Curve) and AUPRC (Area Under the Precision-Recall Curve) scores for word-level explanations.

  • Since the explanations are required to correspond to translation errors, these statistics will be computed for the subset of translations that contain errors according to human annotation.
  • We also ask the participants to provide the sentence-level predictions of their models and compute Pearson correlation with human judgments to measure the overall performance of the system.
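
For concreteness, the two word-level metrics can be sketched in plain Python. This is an illustrative re-implementation with hypothetical helper names (`auc_score`, `average_precision`); the official evaluation scripts in the shared-task repository are authoritative.

```python
def auc_score(gold, scores):
    """AUC as the probability that an error token outranks a correct one.

    gold: binary labels per token (1 = error); scores: model importances.
    """
    pos = [s for g, s in zip(gold, scores) if g == 1]
    neg = [s for g, s in zip(gold, scores) if g == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(gold, scores):
    """AUPRC computed as average precision over the ranked token list."""
    order = sorted(range(len(gold)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if gold[i] == 1:
            hits += 1
            total += hits / rank  # precision at each relevant position
    return total / hits
```

Both statistics are computed per sentence, only for sentences that contain at least one error token and at least one correct token, and then averaged over the test set.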

The participants can submit "only target explanations" or "both source explanations and target explanations".

  • For target only evaluation, participants are only expected to provide scores for the target words. Those scores will be evaluated against error labels resulting from manual annotation. Missing word errors will be ignored in this track.
  • For source and target evaluation, participants are expected to provide scores for both source and target tokens. The scores must capture errors in the target sentence as well as the corresponding words in the source sentence. Also, if a source word is missing in the translation, it is expected to receive a high score.
In both cases, the predicted sentence-level scores are also required.


In the repository linked above, we provide links to the TransQuest sentence-level QE models (Ranasinghe et al. 2020), which were among the top-performing submissions at the WMT2020 QE Shared Task. The models are based on fine-tuning multilingual pre-trained representations for the sentence-level QE task on the direct assessment (DA) quality scores from the MLQE-PE dataset. Both models and code are freely available. Participants can use these models and explore post-hoc approaches to rationale extraction. Participants are also free to train their own QE models and explore architectures that allow word-level interpretation of model predictions. As a baseline, we will use TransQuest as the QE model and LIME (Ribeiro et al. 2016), a model-agnostic explanation method, for rationale extraction.
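
As a minimal illustration of post-hoc rationale extraction, the sketch below uses simple leave-one-out occlusion in place of LIME: each target token is deleted in turn and the change in the predicted sentence score is taken as its importance. The `qe_score` callable is a hypothetical stand-in for any sentence-level QE model, e.g. TransQuest.

```python
def occlusion_explanations(source, mt_tokens, qe_score):
    """Leave-one-out token importances for a sentence-level QE model.

    qe_score(source, mt_text) -> float is assumed to return a quality
    score where higher means better, so a token whose removal raises
    the score is likely an error.
    """
    base = qe_score(source, " ".join(mt_tokens))
    importances = []
    for i in range(len(mt_tokens)):
        occluded = mt_tokens[:i] + mt_tokens[i + 1:]
        importances.append(qe_score(source, " ".join(occluded)) - base)
    return importances
```

Unlike LIME, which fits a local surrogate model over many random perturbations, this needs exactly one model call per token, at the cost of ignoring interactions between tokens.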


Submission Website

We use CodaLab as a platform for participants to submit their predictions for the test dataset. The link to our CodaLab competition is

The competition consists of two main phases.
  • DEVELOPMENT PHASE: Submit your predictions and explanations on the dev set. (For each language pair, max submissions per day = 999; max submissions overall = 999)
    • Estonian-English (Et-En)
    • Romanian-English (Ro-En)
  • TEST PHASE: Submit your predictions and explanations on the test set. (For each language pair, max submissions per day = 5; max submissions overall = 30)
    • Estonian-English (Et-En)
    • Romanian-English (Ro-En)
    • German-Chinese (De-Zh)
    • Russian-German (Ru-De)

Submission Format

For each language pair, a submission is a zip file consisting of three or four files.
  • metadata.txt must have exactly three non-empty lines.
    • The first line contains your team name. You may use your CodaLab username as your team name.
    • The second line must be either constrained or unconstrained, indicating the submission track. constrained means that you did not train your system on word-level labels, whereas unconstrained means that you trained your system on word-level labels.
    • The third line contains a short description (2-3 sentences) of the system you used to generate the results. This description will not be shown to other participants.
  • sentence.submission with sentence-level scores, one score per line.
  • target.submission with target token-level scores. Each line must contain a sequence of scores separated by white space. The number of scores must correspond to the number of target tokens.
  • (Optional) source.submission with source token-level scores. Each line must contain a sequence of scores separated by white space. The number of scores must correspond to the number of source tokens.

Token-level scores must represent the importance of each token towards the sentence-level prediction, where a higher score means the token is more likely to be an error (in the case of target tokens) or related to an error (in the case of source tokens). The scores do not need to be normalized.
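
A submission archive containing the files above can be assembled in a few lines. The file names and contents follow the format described here; `write_submission` itself is just an illustrative helper.

```python
import zipfile

def write_submission(path, team, track, description,
                     sentence_scores, target_scores, source_scores=None):
    """Package QE predictions into the expected submission zip."""
    with zipfile.ZipFile(path, "w") as zf:
        # metadata.txt: team name, track (constrained/unconstrained), description
        zf.writestr("metadata.txt", "\n".join([team, track, description]))
        # sentence.submission: one sentence-level score per line
        zf.writestr("sentence.submission",
                    "\n".join(str(s) for s in sentence_scores))
        # target.submission: one whitespace-separated row of token scores per line
        zf.writestr("target.submission",
                    "\n".join(" ".join(str(x) for x in row)
                              for row in target_scores))
        # source.submission is optional (source-and-target track only)
        if source_scores is not None:
            zf.writestr("source.submission",
                        "\n".join(" ".join(str(x) for x in row)
                                  for row in source_scores))
```

Remember that the number of scores in each row of target.submission (and source.submission) must match the tokenization of the corresponding MT (and source) sentence.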

Examples of the submission files for the two development phases can be found here.

System Paper Submission

We encourage each team to submit a paper describing their system to the workshop in order to be included in the workshop proceedings. It can be either a long paper (8 pages) or a short paper (4 pages) following the EMNLP 2021 templates and formatting requirements. The deadline for system paper submissions is September 17, 2021. We will announce the instructions for system paper submission as well as the submission system soon.

Awards for Best Submissions

The authors of the best submissions will receive monetary awards, kindly sponsored by the Artificial Intelligence Journal and Salesforce Research. Submissions will be judged according to several criteria: the scores on the leaderboards, the paper describing the method, and the resourcefulness and creativity of the method.

Recommended Resources

Quality Estimation Systems
Post-Hoc Explainability Tools


  • Scott M. Lundberg, Su-In Lee (2017). A Unified Approach to Interpreting Model Predictions. NIPS 2017.
  • Tharindu Ranasinghe, Constantin Orasan, Ruslan Mitkov (2020). TransQuest at WMT2020: Sentence-Level Direct Assessment.
  • Tharindu Ranasinghe, Constantin Orasan, Ruslan Mitkov (2020). TransQuest: Translation Quality Estimation with Cross-lingual Transformers.
  • Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier.
  • Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, Steffen Eger (2020). On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation.
  • Marina Fomicheva, Shuo Sun, Erick Fonseca, Frédéric Blain, Vishrav Chaudhary, Francisco Guzmán, Nina Lopatina, Lucia Specia, André F. T. Martins (2020). MLQE-PE: A Multilingual Quality Estimation and Post-Editing Dataset.
  • Mo Yu, Shiyu Chang, Yang Zhang, Tommi Jaakkola (2019). Rethinking Cooperative Rationalization: Introspective Extraction and Complement Control.
  • Mukund Sundararajan, Ankur Taly, Qiqi Yan (2017). Axiomatic Attribution for Deep Networks.
  • Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, Sameer Singh (2019). AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models.

Contact Information

  • Please join our Google Group for posting questions related to the shared task.
  • If you want to contact the organizers privately, please send an email to