# Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation

This repository contains the data we collected to assess the impact of document-level context on human perception of machine translation quality.

We briefly outline the contents of each file below. Please see our paper for more detailed information:

```bibtex
@inproceedings{laeubli2018parity,
  author = "L{\"a}ubli, Samuel and Sennrich, Rico and Volk, Martin",
  title = "Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  year = "2018",
  address = "Brussels, Belgium",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1808.07048"
}
```

## `participants.csv`

Metadata on all participants: professional translators recruited from ProZ. They produced the ratings available in `ratings.csv`.

## `documents.csv`

55 full articles randomly sampled from the WMT 2017 Chinese–English test set, considering only the 123 articles originally written in Chinese. Only the Chinese source texts come from WMT; the human translations (Reference-HT; the `human` column) and machine translations (Combo-6; the `mt` column) were obtained from the data released by Microsoft.

Documents E-1 to E-55 and I-1 to I-55 contain the same articles, but in each set a different random subset of 5 articles was converted to control items (spam).
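
As a quick-start sketch, the documents can be loaded with pandas. Only the `human` and `mt` columns are named above; any other column names are assumptions about the file layout.

```python
import pandas as pd

# Load the 55 articles with their human and machine translations.
docs = pd.read_csv("documents.csv")

# The README names the translation columns: `human` (Reference-HT)
# and `mt` (Combo-6). Inspect the remaining columns to find the
# document identifiers (e.g. E-1 ... E-55), which are not documented here.
print(docs.columns.tolist())
print(docs[["human", "mt"]].head())
```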

## `sentences.csv`

2 × 120 sentences randomly sampled from the WMT 2017 Chinese–English test set, again considering only the 123 articles originally written in Chinese. As above, the human translations (Reference-HT; the `human` column) and machine translations (Combo-6; the `mt` column) were obtained from the data released by Microsoft.

Sentences U-1 to U-120 overlap with the full documents in `documents.csv`. In each set, 16 random sentences were converted to control items (spam).

## `ratings.csv`

Ratings produced by the participants (see `participants.csv`). In the paper, we excluded ratings for sentences U-1 to U-120 from the analysis because they overlap with the full documents (see above).
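
A minimal sketch of the exclusion described above, assuming the ratings carry an item identifier column (called `item` here, with values like `U-17`) and a participant key (called `rater` here) matching `participants.csv`; both column names are hypothetical:

```python
import pandas as pd

ratings = pd.read_csv("ratings.csv")
participants = pd.read_csv("participants.csv")

# Exclude ratings for sentences U-1 to U-120, which overlap with the
# full documents; the paper leaves these out of the analysis.
# The `item` column name is an assumption -- adjust to the actual header.
analysis = ratings[~ratings["item"].astype(str).str.startswith("U-")]

# Attach rater metadata; the `rater` join key is likewise hypothetical.
analysis = analysis.merge(participants, on="rater", how="left")
```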

## `ratings.with-spam.csv`

The same as `ratings.csv`, but additionally including the control items (spam).
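
To recover the spam-free ratings from this file, one could filter on whatever column marks the control items; the boolean `spam` column below is a hypothetical name, not confirmed by the repository:

```python
import pandas as pd

all_ratings = pd.read_csv("ratings.with-spam.csv")

# Drop control items; the boolean `spam` flag is an assumed column name.
genuine = all_ratings[~all_ratings["spam"].astype(bool)]
```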