paupowpow edited this page Nov 30, 2018 · 1 revision

requirements for the text processing

assumptions:

  • we know the language of the text at hand: English or German
  • we have a set of common word endings for each language, e.g. German ["heit", "keit", "ung"] and English ["ery", "cation", "ed"]

steps:

  1. separate the text into words, i.e. take out spaces and special characters
  2. split the remaining characters at random into units of the following sizes:
     • 20% 1-character units
     • 50% 2-character units
     • 30% 3-character units
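
The two steps above can be sketched roughly as follows. This is only an illustration under several assumptions that the notes do not settle: the percentages are read as per-unit sampling weights (not exact quotas), splitting is done word by word, digits count as special characters, and the last unit of a word may come out shorter than the length drawn. The names `split_into_words`, `random_units` and `process` are hypothetical. The word-ending lists are left out because the notes do not say how they are used.

```python
import random
import re

# Assumption: 20% / 50% / 30% are sampling weights for each unit drawn.
UNIT_LENGTHS = [1, 2, 3]
UNIT_WEIGHTS = [0.2, 0.5, 0.3]


def split_into_words(text):
    """Step 1: keep runs of letters; drop spaces, digits and punctuation."""
    # [^\W\d_] matches any Unicode letter, so umlauts and ß are kept.
    return re.findall(r"[^\W\d_]+", text)


def random_units(word, rng):
    """Step 2: cut one word into random 1-, 2- or 3-character units."""
    units = []
    i = 0
    while i < len(word):
        n = rng.choices(UNIT_LENGTHS, weights=UNIT_WEIGHTS)[0]
        units.append(word[i:i + n])  # the final unit may be shorter than n
        i += n
    return units


def process(text, seed=None):
    """Run both steps; returns one list of units per word."""
    rng = random.Random(seed)  # pass a seed for reproducible splits
    return [random_units(word, rng) for word in split_into_words(text)]
```

Calling `process("Die Schönheit der Sprache", seed=42)` would return one list of units per word; joining the units of each list gives back the original word. If exact 20/50/30 proportions are required across the whole text, weighted sampling like this only approximates them.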