From 80f3b541c31edcdd1f0e49ece55ca1b512711f92 Mon Sep 17 00:00:00 2001
From: Vivek Yadav <126377225+Taskmaster-1@users.noreply.github.com>
Date: Thu, 3 Oct 2024 01:44:32 +0530
Subject: [PATCH] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 85153ec..088f504 100644
--- a/README.md
+++ b/README.md
@@ -67,7 +67,7 @@ The target language is English(en).
 
 ## Data Preparation and Preprocessing
 
-Please note that these data preparation steps have to be done manually as we are dealing with a Multilingual system and each language pair might have different sources of data. For instance, I used many different data sources like europarl, newscommentary, commoncrawl & other opern source datasets. One can have a look at shared task on Machine Translation i.e. WMT, to get better datasets. I wrote a bash script which can be used to process & prepare dataset for MT. The following steps can be used to prepare dataset for MT:
+Please note that these data preparation steps have to be done manually as we are dealing with a Multilingual system and each language pair might have different sources of data. For instance, I used many different data sources like europarl, newscommentary, commoncrawl & other open source datasets. One can have a look at shared task on Machine Translation i.e. WMT, to get better datasets. I wrote a bash script which can be used to process & prepare dataset for MT. The following steps can be used to prepare dataset for MT:
 1) First copy the raw dataset files in the language($src-$tgt) subdirectory of the data directory in the following format:
 * train.$src-$tgt.$src
 * train.$src-$tgt.$tgt