Overview of the project

Our project consists of the following components:

  1. Data generation
  2. Language identification
  3. Evaluation

Data generation

This component generates multilingual documents by composing texts from our corpora. Our primary resource is Wikipedia, but we are not necessarily limited to it. Currently, our methodology is largely based on the ALTW-2010 shared task dataset, but we're planning to extend it.

In detail, this task is carried out in three steps:

  1. Text extraction from the Wikipedia dump.
  2. Text indexing, which speeds up the later stages.
  3. Multilingual text composition (sketched below).
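
To make the composition step concrete, here is a minimal sketch. It assumes per-language sentence lists have already been extracted and indexed from the Wikipedia dump; the function and variable names are hypothetical and not part of the project's code.

```python
# Minimal sketch of step 3 (multilingual text composition).
# Assumes per-language sentence lists were already extracted and indexed;
# all names here are hypothetical illustrations.
import random

def compose_document(sentences_by_lang, langs, n_spans=4, span_len=(1, 3)):
    """Concatenate randomly sized monolingual spans, alternating between
    languages, and record a gold language label for every word."""
    parts, labels = [], []
    for i in range(n_spans):
        lang = langs[i % len(langs)]
        pool = sentences_by_lang[lang]
        k = min(random.randint(*span_len), len(pool))
        for sent in random.sample(pool, k):
            parts.append(sent)
            labels.extend([lang] * len(sent.split()))
    return " ".join(parts), labels

if __name__ == "__main__":
    sentences = {
        "en": ["This is an English sentence.", "Another English sentence."],
        "de": ["Dies ist ein deutscher Satz.", "Noch ein deutscher Satz."],
    }
    doc, gold = compose_document(sentences, ["en", "de"])
    print(doc)
    print(gold)
```

Composing documents this way gives us gold word-level language labels for free, which is what the evaluation component relies on.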

Language identification

This component identifies the exact span of each language in a given document. Currently we use a CRF as the learning model and CRFsuite as the implementation. Broadly, this task has two parts:

  1. Feature engineering.
  2. Similar language clustering.

These parts should work together. For a detailed explanation, see the proposal.
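
To illustrate how word-level features might feed into CRFsuite, here is a hedged sketch using the python-crfsuite bindings. The character n-gram and word-shape features shown are assumptions made for this example, not the project's actual feature set (see the proposal for that), and the file names are hypothetical.

```python
# Sketch of a CRF language-identification pipeline with python-crfsuite.
# Feature set and data below are illustrative assumptions only.
import pycrfsuite

def word_features(word):
    """Character n-grams plus simple shape features for one word."""
    feats = ["lower=" + word.lower(), "len=%d" % len(word)]
    padded = "^" + word.lower() + "$"
    for n in (2, 3):
        for i in range(len(padded) - n + 1):
            feats.append("ng%d=%s" % (n, padded[i:i + n]))
    return feats

def doc_features(words):
    return [word_features(w) for w in words]

# Toy training data: token sequences with gold language labels.
train_docs = [
    (["This", "is", "English", "und", "das", "ist", "Deutsch"],
     ["en", "en", "en", "de", "de", "de", "de"]),
]

trainer = pycrfsuite.Trainer(verbose=False)
for words, labels in train_docs:
    trainer.append(doc_features(words), labels)
trainer.train("langid.crfsuite")

tagger = pycrfsuite.Tagger()
tagger.open("langid.crfsuite")
print(tagger.tag(doc_features(["das", "ist", "English"])))
```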

Evaluation

This component assesses the performance of our language identifier. Our plan is to use word-level accuracy for assessment.
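
Word-level accuracy here means the fraction of words whose predicted language label matches the gold label produced during data generation. A minimal sketch:

```python
# Word-level accuracy: fraction of words with a correct language label.
def word_accuracy(gold_labels, pred_labels):
    assert len(gold_labels) == len(pred_labels)
    correct = sum(g == p for g, p in zip(gold_labels, pred_labels))
    return correct / len(gold_labels)

if __name__ == "__main__":
    gold = ["en", "en", "de", "de", "de"]
    pred = ["en", "de", "de", "de", "de"]
    print(word_accuracy(gold, pred))  # 0.8
```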