parent: Customisation
title: Filtering
description: Filtering training data for machine translation
featured: true

Filtering for machine translation is the process of cleaning parallel data for training a machine translation system.

Parallel data can be filtered manually or automatically.

Filtering either drops or fixes risky translations and obvious noise.

Risky translations include, for example:

  • Creative human translations
  • Translations that are structured differently from the original

Obvious noise includes, for example:

  • Sentences with mismatched URLs, names, or numbers
  • Translated sentences that are identical to their original
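
The noise checks listed above can be approximated with a few simple rules. The following is a minimal sketch, not a production filter: the function names `is_obvious_noise` and `filter_pairs` and both regular expressions are illustrative assumptions, and mismatched names are not checked because that would need a named-entity recogniser.

```python
import re

# Assumed patterns for this sketch; real data may need stricter ones.
URL_RE = re.compile(r"https?://\S+")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")


def _urls(text: str) -> list[str]:
    # Strip trailing punctuation so "https://a.com." equals "https://a.com".
    return sorted(u.rstrip(".,;:!?)") for u in URL_RE.findall(text))


def _numbers(text: str) -> list[str]:
    # Keep digits only, so "1,000" and "1.000" count as the same number.
    return sorted(re.sub(r"\D", "", n) for n in NUM_RE.findall(text))


def is_obvious_noise(source: str, target: str) -> bool:
    """Return True if the pair shows one of the obvious noise patterns."""
    src, tgt = source.strip(), target.strip()
    if src.casefold() == tgt.casefold():
        return True  # the target is a copy of the source
    if _urls(src) != _urls(tgt):
        return True  # mismatched URLs
    if _numbers(src) != _numbers(tgt):
        return True  # mismatched numbers
    return False


def filter_pairs(pairs):
    """Drop noisy (source, target) pairs and keep the rest in order."""
    return [(s, t) for s, t in pairs if not is_obvious_noise(s, t)]
```

This sketch only drops pairs. Fixing them instead, for example by copying the source URL into the target, would be a separate step.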

Tools

  • ModelFront
  • Zipporah
  • Bicleaner
  • LASER

Techniques

  • Normalisation
  • Tokenisation
  • Removing duplicate segments
  • Removing non-alphabetical symbols
  • Removing irrelevant languages
  • Spelling out or collapsing acronyms
  • Replacing named entities with placeholders
  • Matching the punctuation of the original and translated sentences
  • Fixing typos and spelling mistakes
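
Two of the techniques above, normalisation and removing duplicate segments, can be sketched in a few lines. This example assumes that normalisation means Unicode NFKC normalisation plus whitespace collapsing, and that duplicates are compared case-insensitively. Both are illustrative choices, and the helper names `normalise` and `deduplicate` are hypothetical.

```python
import unicodedata


def normalise(text: str) -> str:
    """Apply Unicode NFKC normalisation and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def deduplicate(pairs):
    """Normalise each (source, target) pair and keep its first occurrence."""
    seen = set()
    unique = []
    for source, target in pairs:
        pair = (normalise(source), normalise(target))
        # Assumption: pairs that differ only in letter case are duplicates.
        key = (pair[0].casefold(), pair[1].casefold())
        if key in seen:
            continue
        seen.add(key)
        unique.append(pair)
    return unique
```

Normalising before comparing lets pairs that differ only in spacing or in compatibility characters, such as ligatures or non-breaking spaces, collapse into one entry.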