
parent	title	description	featured
Customisation	Filtering	Filtering training data for machine translation	true

Filtering for machine translation is the process of cleaning parallel data for training a machine translation system.

Parallel data can be filtered manually or automatically.

Filtering either drops or fixes sentence pairs that contain risky translations or obvious noise.

Risky translations include, for example:

Creative human translations
Translations that are structured differently from the original

Obvious noise includes, for example:

Sentences with mismatched URLs, names, or numbers
Translated sentences that are identical to their original
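
The noise checks listed above can be approximated with a few simple rules. The following is a minimal sketch, not the method of any tool named on this page: the function name `is_obvious_noise` and both regular expressions are assumptions made for illustration. It covers identical pairs, mismatched URLs, and mismatched numbers. Mismatched names are not covered, since detecting them would need a named-entity recogniser.

```python
import re

# Illustrative patterns (assumptions, not taken from any specific filtering tool).
URL_RE = re.compile(r"https?://\S+")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")


def _urls(text: str) -> list[str]:
    # Strip trailing punctuation so "example.com." and "example.com" compare equal.
    return sorted(u.rstrip(".,;:!?)") for u in URL_RE.findall(text))


def _numbers(text: str) -> list[str]:
    # Keep digits only, so "1,000" and "1.000" are treated as the same number.
    return sorted(re.sub(r"\D", "", n) for n in NUM_RE.findall(text))


def is_obvious_noise(source: str, target: str) -> bool:
    """Return True if the pair shows one of the obvious noise patterns."""
    src, tgt = source.strip(), target.strip()
    if src.casefold() == tgt.casefold():
        return True  # translation is identical to the original
    if _urls(src) != _urls(tgt):
        return True  # URLs do not match
    return _numbers(src) != _numbers(tgt)  # numbers do not match
```

Rules like these are deliberately conservative. Pairs in which a number is legitimately spelled out on one side ("two" versus "2") would be flagged, so in practice flagged pairs might be reviewed or fixed instead of dropped outright.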

Tools

ModelFront
Zipporah
Bicleaner
LASER

Techniques

Normalisation
Tokenisation
Removing duplicate segments
Removing non-alphabetical symbols
Removing irrelevant languages
Spelling out or collapsing acronyms
Replacing named entities with placeholders
Matching the punctuation of the original and translated sentences
Fixing typos and spelling mistakes
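
Two of the techniques above, normalisation and removing duplicate segments, can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: the helper names `normalise` and `deduplicate` are hypothetical, Unicode NFKC plus whitespace collapsing stands in for whatever normalisation a real pipeline would use, and duplicates are detected case-insensitively.

```python
import unicodedata


def normalise(text: str) -> str:
    """Apply Unicode NFKC normalisation and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def deduplicate(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the first occurrence of each normalised (source, target) pair."""
    seen: set[tuple[str, str]] = set()
    kept: list[tuple[str, str]] = []
    for source, target in pairs:
        src, tgt = normalise(source), normalise(target)
        key = (src.casefold(), tgt.casefold())  # case-insensitive comparison
        if key not in seen:
            seen.add(key)
            kept.append((src, tgt))
    return kept
```

Normalising before comparing means that pairs differing only in spacing, letter case, or compatibility characters such as ligatures count as duplicates. Whether case should be ignored depends on the language pair and the data, so that choice is left as an assumption here.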