Overview of the project

Our project consists of the following components:

  1. Data generation
  2. Language identification
  3. Evaluation

Data generation

This component generates multilingual documents by composing texts from our corpora. Our primary resource is Wikipedia, but we are not necessarily limited to it. Currently, our methodology is largely based on the ALTW-2010 shared task dataset, but we're planning to extend it.

In detail, this task is carried out in three steps:

  1. Text extraction from the Wikipedia dump.
  2. Text indexing, which speeds up the later stages.
  3. Multilingual text composition (sketched below).
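
To make the composition step concrete, here is a minimal sketch. It assumes per-language sentence lists have already been extracted and indexed from the Wikipedia dump; the function and variable names are hypothetical and not part of the project's code.

```python
# Minimal sketch of step 3 (multilingual text composition).
# Assumes per-language sentence lists were already extracted and indexed;
# all names here are hypothetical illustrations.
import random

def compose_document(sentences_by_lang, langs, n_spans=4, span_len=(1, 3)):
    """Concatenate randomly sized monolingual spans, alternating between
    languages, and record a gold language label for every word."""
    parts, labels = [], []
    for i in range(n_spans):
        lang = langs[i % len(langs)]
        pool = sentences_by_lang[lang]
        k = min(random.randint(*span_len), len(pool))
        for sent in random.sample(pool, k):
            parts.append(sent)
            labels.extend([lang] * len(sent.split()))
    return " ".join(parts), labels

if __name__ == "__main__":
    sentences = {
        "en": ["This is an English sentence.", "Another English sentence."],
        "de": ["Dies ist ein deutscher Satz.", "Noch ein deutscher Satz."],
    }
    doc, gold = compose_document(sentences, ["en", "de"])
    print(doc)
    print(gold)
```

Composing documents this way gives us gold word-level language labels for free, which is what the evaluation component relies on.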

Language identification

This component identifies the exact span of each language in a given document. Currently we use a CRF as the learning model and CRFsuite as the implementation. Broadly, this task has two parts:

  1. Feature engineering.
  2. Similar language clustering.

These parts should work together. For a detailed explanation, see the proposal.
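
To illustrate how word-level features might feed into CRFsuite, here is a hedged sketch using the python-crfsuite bindings. The character n-gram and word-shape features shown are assumptions made for this example, not the project's actual feature set (see the proposal for that), and the file names are hypothetical.

```python
# Sketch of a CRF language-identification pipeline with python-crfsuite.
# Feature set and data below are illustrative assumptions only.
import pycrfsuite

def word_features(word):
    """Character n-grams plus simple shape features for one word."""
    feats = ["lower=" + word.lower(), "len=%d" % len(word)]
    padded = "^" + word.lower() + "$"
    for n in (2, 3):
        for i in range(len(padded) - n + 1):
            feats.append("ng%d=%s" % (n, padded[i:i + n]))
    return feats

def doc_features(words):
    return [word_features(w) for w in words]

# Toy training data: token sequences with gold language labels.
train_docs = [
    (["This", "is", "English", "und", "das", "ist", "Deutsch"],
     ["en", "en", "en", "de", "de", "de", "de"]),
]

trainer = pycrfsuite.Trainer(verbose=False)
for words, labels in train_docs:
    trainer.append(doc_features(words), labels)
trainer.train("langid.crfsuite")

tagger = pycrfsuite.Tagger()
tagger.open("langid.crfsuite")
print(tagger.tag(doc_features(["das", "ist", "English"])))
```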

Evaluation

This component assesses the performance of our language identifier. Our plan is to use word-level accuracy for assessment.
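
Word-level accuracy here means the fraction of words whose predicted language label matches the gold label produced during data generation. A minimal sketch:

```python
# Word-level accuracy: fraction of words with a correct language label.
def word_accuracy(gold_labels, pred_labels):
    assert len(gold_labels) == len(pred_labels)
    correct = sum(g == p for g, p in zip(gold_labels, pred_labels))
    return correct / len(gold_labels)

if __name__ == "__main__":
    gold = ["en", "en", "de", "de", "de"]
    pred = ["en", "de", "de", "de", "de"]
    print(word_accuracy(gold, pred))  # 0.8
```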