Overview of the project
Our project consists of the following components:
- Data generation
- Language identification
- Evaluation
The data generation component generates multilingual documents by composing texts from our corpora. Our primary resource is Wikipedia, but we are not limited to it. Currently, our methodology is largely based on the ALTW-2010 shared task set, but we plan to extend it.
In detail, this task is done in three steps:
- Text extraction from Wikipedia dump.
- Text indexing, which speeds up the later stage.
- Multilingual text composition.
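The composition step above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual generator: the function name `compose_document`, the `corpora` layout (a dict from language code to a list of sentences), and the whitespace tokenization are all hypothetical.

```python
import random


def compose_document(corpora, n_segments=3, seed=0):
    """Build one multilingual document from monolingual sentences.

    corpora: hypothetical layout, dict mapping a language code to a list
             of sentences already extracted from that language's text.
    Returns (tokens, labels), where labels[i] is the language of tokens[i].
    """
    rng = random.Random(seed)  # seeded so the same document can be regenerated
    # Pick distinct languages; sorted() keeps the choice independent of dict order.
    langs = rng.sample(sorted(corpora), k=min(n_segments, len(corpora)))
    tokens, labels = [], []
    for lang in langs:
        sentence = rng.choice(corpora[lang])
        words = sentence.split()  # assumption: whitespace tokenization
        tokens.extend(words)
        labels.extend([lang] * len(words))  # one gold label per word
    return tokens, labels
```

Keeping one label per word means the generated documents can serve directly as gold data for a span-level identifier and for word-level evaluation.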
The language identification component tries to identify the exact span of each language in a given document. We currently use a conditional random field (CRF) as the learning model and CRFsuite as the toolkit. Broadly, this task has two parts:
- Feature engineering.
- Similar language clustering.
These two parts should work together. For a detailed explanation, see the proposal.
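As a rough illustration of what feature engineering for a word-level CRF might look like, the sketch below turns each token into a dictionary of features. The specific features (character bigrams, prefixes, suffixes, neighboring words) and the names `token_features` and `document_features` are assumptions made for this example; the features actually used are described in the proposal.

```python
def token_features(tokens, i):
    """Return a feature dict for tokens[i] (hypothetical feature set)."""
    word = tokens[i]
    lower = word.lower()
    feats = {
        "bias": 1.0,
        "lower": lower,
        "prefix2": lower[:2],
        "suffix2": lower[-2:],
        "is_ascii": word.isascii(),  # crude script cue
    }
    # Character bigrams carry much of the signal for language identification.
    for k in range(len(lower) - 1):
        feats["bigram=" + lower[k:k + 2]] = 1.0
    # Context features let the sequence model favor contiguous language spans.
    if i == 0:
        feats["BOS"] = True
    else:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i == len(tokens) - 1:
        feats["EOS"] = True
    else:
        feats["next_lower"] = tokens[i + 1].lower()
    return feats


def document_features(tokens):
    """Feature sequence for a whole document, one dict per token."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A sequence of per-token feature dictionaries like this is the general shape of input that CRF toolkits consume; how it is passed to CRFsuite depends on the binding used.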
The evaluation component assesses the performance of our language identifier. Our plan is to use word-level accuracy for assessment.
We will keep adding meeting minutes and upcoming meetings to this page.
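Word-level accuracy is the fraction of words whose predicted language matches the gold language. A minimal sketch, assuming gold and predicted labels are aligned lists with one language code per word (the function name `word_level_accuracy` is hypothetical):

```python
def word_level_accuracy(gold, predicted):
    """Fraction of words whose predicted language equals the gold language."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have one label per word")
    if not gold:
        return 0.0  # assumption: an empty document scores 0 rather than raising
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)
```

For example, `word_level_accuracy(["en", "en", "fr", "fr"], ["en", "fr", "fr", "fr"])` returns `0.75`, since three of the four words are labeled correctly.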