evavnmssnhv/Europarl-Speaker-Information

!!! IMPORTANT NOTICE !!! Some people have informed me of issues downloading the files via GitHub. I am trying to resolve this as soon as possible. If you want to access the dataset right now, please send an email to the following e-mail address:

Europarl-Speaker-Information

Europarl parallel corpora with per-sentence speaker annotations for 20 language pairs.

The need for more personalized Machine Translation (MT) is highlighted in Mirkin et al. (2015). Nevertheless, research on personalized neural MT (NMT) remains scarce. One of the main challenges for more personalized MT systems is finding annotated parallel datasets that are large enough. Rabinovich et al. (2017) published an annotated parallel dataset for EN-FR and EN-DE. For many other language pairs, however, no sufficiently large annotated datasets are available.

To address this problem, we compiled a collection of parallel corpora for 20 language pairs. We annotated parallel sentences from Europarl (Koehn, 2005) with speaker information (name, gender, age, date of birth, euroID and date of the session) based on the monolingual Europarl source files, which contain speaker names at the paragraph level. To retrieve the demographic annotations, we used the meta-information on the members of the European Parliament (MEPs) released by Rabinovich et al. (2017), which includes, among others, name, country, date of birth and gender predictions per MEP. An example of the annotations used:

An overview of the language pairs and the number of annotated parallel sentences we retrieved per language pair is given in the table below:

EN-BG:      306.380
EN-CS:      491.848
EN-DA:    1.421.197
EN-DE:    1.296.843
EN-EL:      921.540
EN-ES:    1.419.507
EN-ET:      494.645
EN-FI:    1.393.572
EN-FR:    1.440.620
EN-HU:      251.833
EN-IT:    1.297.635
EN-LT:      477.358
EN-LV:      487.287
EN-NL:    1.419.359
EN-PL:      478.008
EN-PT:    1.426.043
EN-RO:      303.396
EN-SK:      488.351
EN-SL:      479.313
EN-SV:    1.349.472

The data is provided in zip files per language. Every zip file contains 3 files:

[1] SOURCE.source-target.txt: Tokenized source data

  Example:
    Is the Commission taking the initiative going to be a question of weeks or a question of months ?
    I fully support this report and welcome its adoption by this Parliament .

[2] TARGET.source-target.txt: Tokenized parallel target data

  Example:
    Est-ce une question de semaines ou de mois ?
    Je soutiens pleinement ce rapport et je me réjouis de son adoption par ce Parlement .

[3] Tags.source-target.txt: Tags corresponding to the SOURCE and TARGET sentences, providing information about the speaker, i.e. euroID, name, gender, date of birth, date of the session and age of the speaker at the time of the session.

  Example:
    <LINECOUNT="5" EUROID="28257" NAME="ivo belet" LANGUAGE="UNK" GENDER="MALE" DATE_OF_BIRTH="1959-6-7" SESSION_DATE="11-02-15" AGE="52"/>
    <LINECOUNT="6" EUROID="1289" NAME="proinsias de rossa" LANGUAGE="UNK" GENDER="MALE" DATE_OF_BIRTH="1940-5-15" SESSION_DATE="04-03-11" AGE="64"/>
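
The three files can be read side by side. The following is a minimal Python sketch, not part of the released dataset tooling. It assumes that line i of the SOURCE, TARGET and Tags files belong together, that the files are UTF-8 encoded, and that every tag line consists of KEY="value" pairs as in the example above. The function names parse_tag and read_annotated are hypothetical.

```python
import re

# Matches KEY="value" pairs inside one tag line (assumed format, see example above).
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_tag(line):
    """Return the attributes of one tag line as a dict of strings."""
    return dict(ATTR_RE.findall(line))


def read_annotated(source_path, target_path, tags_path):
    """Yield (source, target, tag_dict) triples.

    Assumes the three files are line-aligned and UTF-8 encoded.
    zip() stops at the shortest file, so a length mismatch goes unnoticed.
    """
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, encoding="utf-8") as tgt, \
         open(tags_path, encoding="utf-8") as tags:
        for s, t, tag in zip(src, tgt, tags):
            yield s.rstrip("\n"), t.rstrip("\n"), parse_tag(tag)


# Tag line copied from the example above.
example = ('<LINECOUNT="5" EUROID="28257" NAME="ivo belet" LANGUAGE="UNK" '
           'GENDER="MALE" DATE_OF_BIRTH="1959-6-7" SESSION_DATE="11-02-15" AGE="52"/>')
tag = parse_tag(example)
print(tag["NAME"], tag["GENDER"], tag["AGE"])
```

All values are returned as strings, so a field such as AGE has to be converted with int() before numeric comparisons. Calling read_annotated with the paths of the three files from one unpacked zip file would then allow filtering sentence pairs by, for example, the GENDER attribute.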

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

If you use our dataset, please cite:

Vanmassenhove, Eva, Christian Hardmeier, and Andy Way. "Getting Gender Right in Neural MT." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). November 2-4, Brussels, Belgium. 2018.
    
Vanmassenhove, Eva, and Christian Hardmeier. "Europarl Datasets with Demographic Speaker Information." Proceedings of the 2018 Conference of the European Association for Machine Translation (EAMT). May 28-30, Alicante, Spain. 2018.

References:

    Ella Rabinovich, Shachar Mirkin, Raj Nath Patel, Lucia Specia and Shuly Wintner, 2017. Personalized Machine Translation: Preserving Original Author Traits. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), April 3-7, Valencia, Spain, 1074--1084.

    Philipp Koehn, 2005. Europarl: A parallel corpus for statistical machine translation. Proceedings of MT Summit, September 12-16, Phuket, Thailand, 79--86.

    Shachar Mirkin, Scott Nowson, Caroline Brun and Julien Perez, 2015. Motivating personality-aware machine translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, September 17-21, Lisbon, Portugal, 1102--1108.
