Processing words and the connections between them is a key requirement for many language processing tasks, and some languages, like Chinese, do not separate words in written text. The objective of this work is to create a model that takes a text with no spaces and generates the corresponding segmentation annotation, in BIES format, using deep learning tools, especially LSTM layers.
This repository provides an already trained model to perform the task on a given file, as well as the training code in case further tuning or a new dataset is desired.
BIES format: (B)eginning (I)nside (E)nd (S)ingle
Output example:
Language | Input | Output | Result |
---|---|---|---|
English (example) | theansweris42! | BIEBIIIIEBEBES | the answer is 42 ! |
Chinese (this repo) | 分佈: 主要分佈在熱帶及亞熱帶水域, 溫帶水域亦有其活動記錄。 | BES BEBESBESBIEBES BEBESSSBEBES | 分佈 : 主要 分佈 在 熱帶 及 亞熱帶 水域 , 溫帶 水域 亦 有 其 活動 記錄 。 |
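For reference, BIES labels can be derived mechanically from any segmented sentence. A minimal sketch (the `to_bies` helper below is illustrative, not part of this repo's code):

```python
def to_bies(words):
    """Map a list of segmented words to a BIES tag string."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # multi-character word: Beginning, zero or more Insides, End
            tags.append("B" + "I" * (len(word) - 2) + "E")
    return "".join(tags)

print(to_bies(["the", "answer", "is", "42", "!"]))  # BIEBIIIIEBEBES
```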
Much of the preprocessing and training was performed in Google Colab, with notebooks that explain the whole process step by step. Feel free to explore them in `resources/FilesUsedForTraining`.
Download the dataset from http://sighan.cs.uchicago.edu/bakeoff2005/. The full dataset contains four smaller datasets (a preprocessing sketch is shown after the list):
- AS (Traditional Chinese)
- CITYU (Traditional Chinese)
- MSR (Simplified Chinese)
- PKU (Simplified Chinese)
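As a sketch of the preprocessing, each whitespace-segmented line of a training file can be turned into an (unspaced input, BIES label) pair. The file name below is an assumption based on the SIGHAN bakeoff distribution, and the helper is illustrative:

```python
def make_training_pair(line):
    """Turn a whitespace-segmented line into (raw text, BIES labels)."""
    words = line.strip().split()
    text = "".join(words)
    labels = "".join(
        "S" if len(w) == 1 else "B" + "I" * (len(w) - 2) + "E" for w in words
    )
    return text, labels

# Assumed file name from the SIGHAN bakeoff distribution.
with open("msr_training.utf8", encoding="utf-8") as f:
    pairs = [make_training_pair(line) for line in f if line.strip()]
```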
Based on the reference model of (Ma et al., 2018), *State-of-the-art Chinese Word Segmentation with Bi-LSTMs*: https://aclweb.org/anthology/D18-1529
Figure: Bi-LSTM models: (a) non-stacking, (b) stacking. Blue circles are input (char and char-bigram) embeddings. Red squares are LSTM cells. BIES is a 4-way softmax.
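For orientation, here is a rough Keras sketch of the stacking variant described in the paper. The vocabulary sizes, embedding dimensions, and hidden sizes are illustrative assumptions, not the exact configuration used in this repo:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000    # assumed unigram vocabulary size
BIGRAM_SIZE = 50000  # assumed bigram vocabulary size
EMB_DIM = 64
HIDDEN = 256

# Two inputs: character unigrams and character bigrams.
unigrams = layers.Input(shape=(None,), name="unigrams")
bigrams = layers.Input(shape=(None,), name="bigrams")

# Embed and concatenate the two views of the sentence.
u_emb = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(unigrams)
b_emb = layers.Embedding(BIGRAM_SIZE, EMB_DIM, mask_zero=True)(bigrams)
x = layers.Concatenate()([u_emb, b_emb])

# Stacked bidirectional LSTMs, then a 4-way softmax over {B, I, E, S}.
x = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True, recurrent_dropout=0.1))(x)
x = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True, recurrent_dropout=0.1))(x)
out = layers.TimeDistributed(layers.Dense(4, activation="softmax"), name="bies")(x)

model = models.Model([unigrams, bigrams], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```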
## Training
Training was performed with 30% of the merged dataset, and the final model was trained on 100% after tuning parameters. The following figures depict one of the grid search variations performed: a static learning rate of 0.0005 and variable dropout in the recurrent units, with values [0, 0.1, 0.4, 0.6]. The model shows signs of overfitting. Read the complete report in `NLP_HW1_Report.pdf`.
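The dropout sweep could be reproduced along these lines. This is a sketch: `build_model` is a hypothetical factory for the network above, and `x_train`/`y_train`/`x_dev`/`y_dev` are assumed to be already-prepared index arrays:

```python
from tensorflow.keras.optimizers import Adam

histories = {}
for dropout in [0.0, 0.1, 0.4, 0.6]:  # recurrent-dropout values from the grid search
    model = build_model(recurrent_dropout=dropout)  # hypothetical factory
    model.compile(optimizer=Adam(learning_rate=0.0005),  # static learning rate
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    histories[dropout] = model.fit(x_train, y_train,
                                   validation_data=(x_dev, y_dev),
                                   epochs=10, batch_size=32)
```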
## Pretrained model
A pretrained model and vocabulary set are provided in this folder:
https://drive.google.com/drive/folders/1WBGOIS-VK7E8vpzOEUWD5vzXjQLjWM2g?usp=sharing
The files should be located in `resources/Model` to be used by the `predict.py` and `score.py` scripts.
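If you want to inspect the model outside of the provided scripts, and assuming it is stored as a Keras HDF5 file with a pickled vocabulary (the file names below are hypothetical; check `resources/Model` for the actual ones), loading could look like:

```python
import pickle
from tensorflow.keras.models import load_model

# Hypothetical file names; check resources/Model for the actual ones.
model = load_model("resources/Model/model.h5")
with open("resources/Model/vocab.pkl", "rb") as f:
    vocab = pickle.load(f)
```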
To process a file through the segmenting model:

```bash
cd ChineseWordSegmentation
python code/predict.py input_path ________ output_path __________ resources_path resources
```
After running, the file at `output_path` will contain the desired separation indication.
If a `gold_file` is available, the accuracy of the prediction can be tested with the `score` function. It returns the precision of the model's predictions w.r.t. the gold standard (i.e. the tags of the correct word segmentation).
Example:

```python
predictions_iter = ["BEBESBIIE", "BIIIEBEBESS"]
gold_iter = ["BEBIEBIES", "BIIESBEBESS"]
# output: 0.7
```
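That value is the fraction of positions where the predicted tag matches the gold tag; a minimal sketch of the computation (the repo's `score.py` may differ in details):

```python
def bies_precision(predictions_iter, gold_iter):
    """Fraction of predicted tags matching the gold tags, position by position."""
    correct = total = 0
    for pred, gold in zip(predictions_iter, gold_iter):
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total

print(bies_precision(["BEBESBIIE", "BIIIEBEBESS"],
                     ["BEBIEBIES", "BIIESBEBESS"]))  # 0.7
```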
## Usage
```bash
python code/score.py prediction_file ___________ gold_file ______________
```