RNNSharp

RNNSharp is a toolkit for deep recurrent neural networks that can be used for many kinds of tasks, such as sequence labeling and sequence-to-sequence. It is written in C# and requires .NET Framework 4.6 or above.

This page introduces what RNNSharp is, how it works and how to use it. To get the demo package, see the release page.

Overview

RNNSharp supports many different types of deep recurrent neural network (aka DeepRNN) structures.

For network structure, it supports forward RNN and bi-directional RNN. A forward RNN considers only the historical information before the current token, whereas a bi-directional RNN considers both historical and future information.

For hidden layer structure, it supports LSTM and Dropout. Compared to BPTT, LSTM is very good at keeping long-term memory, since it has gates that control information flow. Dropout adds noise during training in order to avoid overfitting.

For output layer structure, simple, softmax, sampled softmax and recurrent CRFs[1] are supported. Softmax is the traditional type and is widely used in many kinds of tasks. Sampled softmax is intended for tasks with a large output vocabulary, such as sequence generation (sequence-to-sequence models). The simple type is usually used together with recurrent CRF. Recurrent CRF computes the CRF output for the entire sequence from the simple outputs and the tag transitions. For offline sequence labeling tasks, such as word segmentation and named entity recognition, recurrent CRF performs better than softmax, sampled softmax and linear CRF.
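To illustrate how a CRF output layer can combine per-token scores with tag transitions over a whole sequence, here is a minimal Viterbi decoding sketch in Python. It is not RNNSharp's implementation; the function name `viterbi` and the list-of-lists score layout are assumptions made for this example.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag sequence (illustrative sketch only).

    emissions[t][j]   : raw ("simple") output score of tag j at token t
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    # best[j] = best score of any path that ends in tag j at the current token
    best = list(emissions[0])
    backpointers: List[List[int]] = []
    for scores in emissions[1:]:
        new_best, pointers = [], []
        for j in range(num_tags):
            # choose the previous tag that gives the best path into tag j
            prev = max(range(num_tags), key=lambda i: best[i] + transitions[i][j])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][j] + scores[j])
        best = new_best
        backpointers.append(pointers)
    # trace back from the best final tag
    tag = max(range(num_tags), key=lambda j: best[j])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


if __name__ == "__main__":
    scores = [[2.0, 1.0], [1.0, 1.5], [0.5, 2.0]]
    # A strong penalty for switching tags changes the decision for the first token.
    print(viterbi(scores, [[0.0, -5.0], [-5.0, 0.0]]))
```

With all transition scores set to zero the result reduces to a per-token argmax, which is one way to see what the transition scores contribute.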

Here is an example of deep bi-directional RNN-CRF network. It contains 3 hidden layers, 1 native RNN output layer and 1 CRF output layer.

Here is the inner structure of one bi-directional hidden layer.

Here is the neural network for the sequence-to-sequence task. "TokenN" are tokens from the source sequence, and "ELayerX-Y" are the auto-encoder's hidden layers. The auto-encoder is defined in the feature configuration file. <s> is always the beginning of the target sentence, and "DLayerX-Y" are the decoder's hidden layers. The decoder generates one token at a time until </s> is generated.

Supported Feature Types

RNNSharp supports many different feature types. The following sections introduce how these features work.

Template Features

Template features are generated from templates. Given templates and a corpus, these features are generated automatically. In RNNSharp, template features are sparse features: if the feature exists for the current token, its value is 1 (or the feature frequency), otherwise it is 0. This is similar to CRFSharp features. In RNNSharp, TFeatureBin.exe is the console tool that generates this type of feature.

In the template file, each line describes one template, which consists of a prefix, an id and a rule-string. The prefix indicates the template type. So far, RNNSharp supports only U-type features, so the prefix is always "U". The id is used to distinguish different templates, and the rule-string is the feature body.

# Unigram
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[-1,0]/%x[0,0]
U05:%x[0,0]/%x[1,0]
U06:%x[-1,0]/%x[1,0]
U07:%x[-1,1]
U08:%x[0,1]
U09:%x[1,1]
U10:%x[-1,1]/%x[0,1]
U11:%x[0,1]/%x[1,1]
U12:%x[-1,1]/%x[1,1]
U13:C%x[-1,0]/%x[-1,1]
U14:C%x[0,0]/%x[0,1]
U15:C%x[1,0]/%x[1,1]

A rule-string consists of two kinds of parts: constant strings and variables. The simplest variable format is "%x[row,col]". Row specifies the row offset between the current focus token and the token the feature is generated from. Col specifies the absolute column position in the corpus. Variables can also be combined, for example "%x[row1,col1]/%x[row2,col2]". When the feature set is built, each variable is expanded to a specific string. Here is an example of training data for a named entity task.

Word Pos Tag
! PUN S
Tokyo NNP S_LOCATION
and CC S
New NNP B_LOCATION
York NNP E_LOCATION
are VBP S
major JJ S
financial JJ S
centers NNS S
. PUN S
---empty line---
! PUN S
p FW S
' PUN S
y NN S
h FW S
44 CD S
University NNP B_ORGANIZATION
of IN M_ORGANIZATION
Texas NNP M_ORGANIZATION
Austin NNP E_ORGANIZATION

According to the above templates, assuming the current focus token is "York NNP E_LOCATION", the following features are generated:

U01:New
U02:York
U03:are
U04:New/York
U05:York/are
U06:New/are
U07:NNP
U08:NNP
U09:VBP
U10:NNP/NNP
U11:NNP/VBP
U12:NNP/VBP
U13:CNew/NNP
U14:CYork/NNP
U15:Care/VBP

Although U07 and U08, and likewise U11 and U12, expand to the same strings, they can still be distinguished by their id strings.
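The expansion of "%x[row,col]" variables can be sketched in a few lines of Python. This is an illustration of the rule described above, not the TFeatureBin implementation; the function name `expand_template` and the `<PAD>` placeholder used when a row offset falls outside the sentence are assumptions.

```python
import re
from typing import List

# Matches variables such as %x[-1,0] and captures the row offset and column index.
VARIABLE = re.compile(r"%x\[\s*(-?\d+)\s*,\s*(\d+)\s*\]")


def expand_template(template: str, rows: List[List[str]], pos: int) -> str:
    """Expand one "Uxx:rule-string" template for the token at index pos."""

    def substitute(match):
        row = pos + int(match.group(1))  # row is relative to the current token
        col = int(match.group(2))        # col is an absolute column index
        if 0 <= row < len(rows):
            return rows[row][col]
        return "<PAD>"  # hypothetical placeholder for out-of-range rows

    return VARIABLE.sub(substitute, template)


if __name__ == "__main__":
    sentence = [
        ["!", "PUN"], ["Tokyo", "NNP"], ["and", "CC"], ["New", "NNP"],
        ["York", "NNP"], ["are", "VBP"], ["major", "JJ"],
    ]
    focus = 4  # index of "York"
    for template in ["U04:%x[-1,0]/%x[0,0]", "U13:C%x[-1,0]/%x[-1,1]"]:
        print(expand_template(template, sentence, focus))
```

Constant parts of the rule-string, such as the "C" in U13 and the "/" separator, pass through unchanged because only the variables are substituted.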

Context Template Features

Context template features are based on template features and combined with context. For example, if the context setting is "-1,0,1", the feature combines the features of the current token with those of its previous and next tokens. For instance, if the sentence is "how are you", the generated feature set for "are" will be {Feature("how"), Feature("are"), Feature("you")}.
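As a rough sketch of the context setting, the Python snippet below gathers the per-token features at each context offset. The function name `context_features` is hypothetical, and skipping offsets that fall outside the sentence is an assumption; RNNSharp may handle sentence boundaries differently.

```python
from typing import List, Sequence


def context_features(token_features: Sequence[str], position: int,
                     context: Sequence[int]) -> List[str]:
    """Collect the features of the tokens at position + offset for each offset."""
    combined = []
    for offset in context:
        index = position + offset
        # Assumption: offsets outside the sentence contribute nothing.
        if 0 <= index < len(token_features):
            combined.append(token_features[index])
    return combined


if __name__ == "__main__":
    features = ['Feature("how")', 'Feature("are")', 'Feature("you")']
    # Context "-1,0,1" around the middle token "are"
    print(context_features(features, 1, [-1, 0, 1]))
```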

Pretrained Features

RNNSharp supports two types of pretrained features: embedding features and auto-encoder features. Both represent a given token as a fixed-length vector. These are dense features in RNNSharp.

Embedding features are trained from an unlabeled corpus by the Txt2Vec project, and RNNSharp uses them as static features for each given token. Auto-encoder features, by contrast, are trained by RNNSharp itself and can then be used as dense features in other trainings. Note that the token granularity of the pretrained features should be consistent with the training corpus of the main training, otherwise some tokens will not match any pretrained feature.

Like template features, embedding features also support context. All features of the given contexts can be combined into a single embedding feature. Auto-encoder features do not support context yet.

Run Time Features

Unlike the other features, which are generated offline, this feature is generated at run time. It uses the results of previous tokens as the run time feature for the current token. This feature is only available for forward RNN; bi-directional RNN does not support it.

Source Sequence Encoding Feature

This feature is only for the sequence-to-sequence task. In this task, RNNSharp encodes the given source sequence into a fixed-length vector and then passes it as a dense feature to generate the target sequence.

Configuration File

The configuration file describes the model structure and features. In the console tool, use the -cfgfile parameter to specify this file. Here is an example for a sequence labeling task:

#Working directory. It is the parent directory of the relative paths below.
CURRENT_DIRECTORY = .

#Network type. Four types are supported:
#For sequence labeling tasks, we could use: Forward, BiDirectional, BiDirectionalAverage
#For sequence-to-sequence tasks, we could use: ForwardSeq2Seq
#BiDirectional type concatenates outputs of forward layer and backward layer as final output
#BiDirectionalAverage type averages outputs of forward layer and backward layer as final output
NETWORK_TYPE = BiDirectional

#Model file path
MODEL_FILEPATH = Data\Models\ParseORG_CHS\model.bin

#Hidden layers settings. LSTM and Dropout are supported. Here are examples of these layer types.
#Dropout: Dropout:0.5 -- Drop out ratio is 0.5 and layer size is the same as previous layer.
#If the model has more than one hidden layer, each layer settings are separated by comma. For example:
#"LSTM:300, LSTM:200" means the model has two LSTM layers. The first layer size is 300, and the second layer size is 200.
HIDDEN_LAYER = LSTM:200

#Output layer settings. Simple, Softmax and sampled softmax are supported. Here is an example of sampled softmax:
#"SampledSoftmax:20" means the output layer is sampled softmax layer and its negative sample size is 20.
#"Simple" means the output is raw result from output layer. "Softmax" means the result is based on "Simple" result and run softmax.
OUTPUT_LAYER = Simple

#CRF layer settings
#If this option is true, output layer type must be "Simple" type.
CRF_LAYER = True

#The file name for template feature set
TFEATURE_FILENAME = Data\Models\ParseORG_CHS\tfeatures
#The context range for template feature set. Below, the context is the current token, the next token and the token after next
TFEATURE_CONTEXT = 0,1,2
#The feature weight type. Binary and Freq are supported
TFEATURE_WEIGHT_TYPE = Binary

#Pretrained features type: 'Embedding' and 'Autoencoder' are supported.
#For 'Embedding', the pretrained model is trained by Txt2Vec, which looks like word embedding model.
#For 'Autoencoder', the pretrained model is trained by RNNSharp itself. For sequence-to-sequence task, "Autoencoder" is required, since source sequence needs to be encoded by this model at first, and then target sequence would be generated by decoder.
PRETRAIN_TYPE = Embedding

#The following settings are for pretrained model in 'Embedding' type.
#The embedding model generated by Txt2Vec (https://github.com/zhongkaifu/Txt2Vec). If it is raw text format, we should use WORDEMBEDDING_RAW_FILENAME instead of WORDEMBEDDING_FILENAME as keyword
WORDEMBEDDING_FILENAME = Data\WordEmbedding\wordvec_chs.bin
#The context range of word embedding. In below example, the context is current token, previous token and next token
#If more than one token is combined, this feature will use a lot of memory.
WORDEMBEDDING_CONTEXT = -1,0,1
#The column index applied word embedding feature
WORDEMBEDDING_COLUMN = 0

#The following setting is for pretrained model in 'Autoencoder' type.
#The feature configuration file for pretrained model.
AUTOENCODER_CONFIG = D:\RNNSharpDemoPackage\config_autoencoder.txt

#The following setting is the configuration file for source sequence encoder which is only for sequence-to-sequence task that MODEL_TYPE equals to SEQ2SEQ.
#In this example, since MODEL_TYPE is SEQLABEL, so we comment it out.
#SEQ2SEQ_AUTOENCODER_CONFIG = D:\RNNSharpDemoPackage\config_seq2seq_autoencoder.txt

#The context range of run time feature. In below example, RNNSharp will use the output of previous token as run time feature for current token
#Note that, bi-directional model does not support run time feature, so we comment it out.
#RTFEATURE_CONTEXT = -1
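For readers who want to inspect such a file from a script, the following Python sketch reads the "KEY = VALUE" lines and skips "#" comments. It only mirrors the layout shown above and is not RNNSharp's own parser; the helper names `parse_config`, `parse_layers` and `parse_context` are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_config(text: str) -> Dict[str, str]:
    """Read "KEY = VALUE" lines, ignoring blank lines and "#" comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


def parse_layers(value: str) -> List[Tuple[str, str]]:
    """Split "LSTM:300, LSTM:200" into [("LSTM", "300"), ("LSTM", "200")]."""
    layers = []
    for item in value.split(","):
        kind, _, parameter = item.strip().partition(":")
        layers.append((kind, parameter))
    return layers


def parse_context(value: str) -> List[int]:
    """Split a context range such as "-1,0,1" into integer offsets."""
    return [int(offset) for offset in value.split(",")]


if __name__ == "__main__":
    sample = """
    #Network type
    NETWORK_TYPE = BiDirectional
    HIDDEN_LAYER = LSTM:300, LSTM:200
    WORDEMBEDDING_CONTEXT = -1,0,1
    #RTFEATURE_CONTEXT = -1
    """
    config = parse_config(sample)
    print(config["NETWORK_TYPE"])
    print(parse_layers(config["HIDDEN_LAYER"]))
    print(parse_context(config["WORDEMBEDDING_CONTEXT"]))
```

Commented-out settings such as RTFEATURE_CONTEXT simply do not appear in the resulting dictionary, which matches how the example file disables them.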

Training file format

In the training file, each sequence is represented as a feature matrix and ends with an empty line. In the matrix, each row holds one token of the sequence and its features, and each column holds one feature type. The number of columns must be the same throughout the entire training corpus.

Sequence labeling tasks and sequence-to-sequence tasks use different training corpus formats.

Sequence labeling corpus

For sequence labeling tasks, the first N-1 columns are input features for training, and the Nth (last) column is the answer for the current token. Here is an example for a named entity recognition task (the full training file can be downloaded from the release section):

Word Pos Tag
! PUN S
Tokyo NNP S_LOCATION
and CC S
New NNP B_LOCATION
York NNP E_LOCATION
are VBP S
major JJ S
financial JJ S
centers NNS S
. PUN S
---empty line---
! PUN S
p FW S
' PUN S
y NN S
h FW S
44 CD S
University NNP B_ORGANIZATION
of IN M_ORGANIZATION
Texas NNP M_ORGANIZATION
Austin NNP E_ORGANIZATION

It has two records separated by a blank line. Each token has three columns. The first two columns are the input feature set: the word and the POS tag of the token. The third column is the ideal output of the model, which is the named entity type of the token.
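A minimal Python sketch of reading this format is shown below. It assumes columns are separated by whitespace and that there is no header row in the data passed in; the function names `read_sequences` and `split_features_and_tags` are hypothetical and not part of RNNSharp.

```python
from typing import Iterable, List, Tuple


def read_sequences(lines: Iterable[str]) -> List[List[List[str]]]:
    """Group rows into sequences; an empty line ends the current sequence."""
    sequences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sequences.append(current)
                current = []
            continue
        current.append(line.split())  # assumption: whitespace-separated columns
    if current:
        sequences.append(current)
    return sequences


def split_features_and_tags(sequence: List[List[str]]) -> Tuple[List[List[str]], List[str]]:
    """The first N-1 columns are input features, the last column is the answer."""
    features = [row[:-1] for row in sequence]
    tags = [row[-1] for row in sequence]
    return features, tags


if __name__ == "__main__":
    corpus = [
        "New NNP B_LOCATION",
        "York NNP E_LOCATION",
        "",
        "University NNP B_ORGANIZATION",
        "of IN M_ORGANIZATION",
    ]
    for sequence in read_sequences(corpus):
        print(split_features_and_tags(sequence))
```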

The named entity type looks like "Position_NamedEntityType". "Position" is the word position in the named entity, and "NamedEntityType" is the type of the entity. If "NamedEntityType" is empty, that means this is a common word, not a named entity. In this example, "Position" has four values:
S : the single word of the named entity
B : the first word of the named entity
M : the word is in the middle of the named entity
E : the last word of the named entity

"NamedEntityType" has two values:
ORGANIZATION : the name of one organization
LOCATION : the name of one location
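To show how the "Position_NamedEntityType" scheme maps back to entities, here is a small Python sketch that groups tagged words into named entities. It is an illustration of the tag scheme only; the function name `extract_entities` is hypothetical, and discarding malformed spans (for example an "M" tag without a preceding "B") is an assumption.

```python
from typing import List, Tuple


def extract_entities(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Return (entity text, entity type) pairs from S/B/M/E position tags."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        position, _, entity_type = tag.partition("_")
        if not entity_type:
            # An empty entity type marks a common word, not a named entity.
            current_words, current_type = [], None
        elif position == "S":
            entities.append((word, entity_type))
            current_words, current_type = [], None
        elif position == "B":
            current_words, current_type = [word], entity_type
        elif position in ("M", "E") and current_type == entity_type:
            current_words.append(word)
            if position == "E":
                entities.append((" ".join(current_words), entity_type))
                current_words, current_type = [], None
        else:
            # Assumption: malformed spans are dropped.
            current_words, current_type = [], None
    return entities


if __name__ == "__main__":
    words = ["Tokyo", "and", "New", "York", "are", "major"]
    tags = ["S_LOCATION", "S", "B_LOCATION", "E_LOCATION", "S", "S"]
    print(extract_entities(words, tags))
```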

Sequence-to-sequence corpus

For the sequence-to-sequence task, the training corpus format is different. Each sequence pair has two sections: the source sequence and the target sequence. Here is an example:

Word
What
is
your
name
?
---empty line---
I
am
Zhongkai
Fu

In the above example, "What is your name ?" is the source sentence, and "I am Zhongkai Fu" is the target sentence to be generated by the RNNSharp seq-to-seq model. In the source sentence, besides word features, other features can also be used for training, such as the POS tag feature in the sequence labeling example above.

Test file format

The test file has a format similar to the training file. For the sequence labeling task, the only difference between them is the last column: in the test file, all columns are features for model decoding. For the sequence-to-sequence task, the test file contains only the source sequence, and the target sentence is generated by the model.

Tag (Output Vocabulary) File

For sequence labeling task, this file contains output tag set. For sequence-to-sequence task, it's output vocabulary file.

Console Tool

RNNSharpConsole

RNNSharpConsole.exe is a console tool for encoding (training) and decoding recurrent neural network models. The tool has two running modes: "train" mode is for model training, and "test" mode is for predicting output tags from a test corpus with a given encoded model.

Encode Model

In this mode, the console tool encodes an RNN model from a given feature set and training/validation corpus. The usage is as follows:

RNNSharpConsole.exe -mode train
Parameters for training RNN based model.
-trainfile : Training corpus file
-validfile : Validated corpus for training
-cfgfile : Configuration file
-tagfile : Output tag or vocabulary file
-inctrain : Incremental training. Starting from output model specified in configuration file. Default is false
-alpha : Learning rate, Default is 0.1
-maxiter : Maximum iteration for training. 0 is no limit, Default is 20
-savestep : Save temporary model after every sentence, Default is 0
-vq : Model vector quantization, 0 is disable, 1 is enable. Default is 0
-minibatch : Updating weights every sequence. Default is 1

Example: RNNSharpConsole.exe -mode train -trainfile train.txt -validfile valid.txt -cfgfile config.txt -tagfile tags.txt -alpha 0.1 -maxiter 20 -savestep 200K -vq 0 -grad 15.0 -minibatch 128

Decode Model

In this mode, given a test corpus file, RNNSharp predicts output tags for a sequence labeling task or generates a target sequence for a sequence-to-sequence task.

RNNSharpConsole.exe -mode test
Parameters for predicting iTagId tag from given corpus
-testfile : test corpus file
-tagfile : output tag or vocabulary file
-cfgfile : configuration file
-outfile : result output file

Example: RNNSharpConsole.exe -mode test -testfile test.txt -tagfile tags.txt -cfgfile config.txt -outfile result.txt

TFeatureBin

TFeatureBin generates a template feature set from given template and corpus files. For fast access and low memory cost, the indexed feature set is built as a float array in a trie-tree by AdvUtils. The tool supports three modes:

TFeatureBin.exe
The tool is to generate template feature from corpus and index them into file
-mode : support extract,index and build modes
extract : extract features from corpus and save them as raw text feature list
index : build indexed feature set from raw text feature list
build : extract features from corpus and generate indexed feature set

Build mode

This mode extracts features from a given corpus according to the templates and then builds an indexed feature set. The usage of this mode is as follows:

TFeatureBin.exe -mode build
This mode is to extract feature from corpus and generate indexed feature set
-template : feature template file
-inputfile : file used to generate features
-ftrfile : generated indexed feature file
-minfreq : min-frequency of feature

Example: TFeatureBin.exe -mode build -template template.txt -inputfile train.txt -ftrfile tfeature -minfreq 3

In the above example, the feature set is extracted from train.txt and built into the tfeature file as an indexed feature set.

Extract mode

This mode only extracts features from a given corpus and saves them into a raw text file. The difference between build mode and extract mode is that extract mode saves the feature set in raw text format rather than indexed binary format. The usage of extract mode is as follows:

TFeatureBin.exe -mode extract
This mode is to extract features from corpus and save them as text feature list
-template : feature template file
-inputfile : file used to generate features
-ftrfile : generated feature list file in raw text format
-minfreq : min-frequency of feature

Example: TFeatureBin.exe -mode extract -template template.txt -inputfile train.txt -ftrfile features.txt -minfreq 3

In the above example, the feature set is extracted from train.txt according to the templates and saved into features.txt in raw text format. Each line of the output raw text file has the format "feature string \t frequency in corpus". Here are a few examples:

U01:仲恺 \t 123
U01:仲文 \t 10
U01:仲秋 \t 12

U01:仲恺 is the feature string and 123 is the frequency of this feature in the corpus.

Index mode

This mode only builds an indexed feature set from given templates and a feature set in raw text format. The usage of this mode is as follows:
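The raw text feature list can be inspected with a few lines of Python. This sketch assumes each line holds a feature string and a frequency separated by a tab character; the function name `read_feature_list` and its `min_freq` filter are hypothetical helpers for this example and not part of TFeatureBin.

```python
from typing import Dict, Iterable


def read_feature_list(lines: Iterable[str], min_freq: int = 1) -> Dict[str, int]:
    """Parse "feature<TAB>frequency" lines, keeping features seen at least min_freq times."""
    features = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the last tab so the feature string itself stays intact.
        feature, _, frequency = line.rpartition("\t")
        count = int(frequency)
        if count >= min_freq:
            features[feature.strip()] = count
    return features


if __name__ == "__main__":
    sample = ["U01:仲恺\t123", "U01:仲文\t10", "U01:仲秋\t12"]
    print(read_feature_list(sample, min_freq=12))
```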

TFeatureBin.exe -mode index
This mode is to build indexed feature set from raw text feature list
-template : feature template file
-inputfile : feature list in raw text format
-ftrfile : indexed feature set

Example: TFeatureBin.exe -mode index -template template.txt -inputfile features.txt -ftrfile features.bin

In the above example, the raw text feature set features.txt is indexed according to the templates and saved as features.bin in binary format.

Performance

Here are quality results on a Chinese named entity recognition task. Corpus, configuration and parameter files are available in the RNNSharp demo package in the release section. The results are based on a bi-directional model. The first hidden layer size is 200, and the second hidden layer size is 100. Here are the test results:

| Parameter | Token Error | Sentence Error |
|---|---|---|
| 1-hidden layer | 5.53% | 15.46% |
| 1-hidden layer-CRF | 5.51% | 13.60% |
| 2-hidden layers | 5.47% | 14.23% |
| 2-hidden layers-CRF | 5.40% | 12.93% |
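The README does not define the two metrics. One common reading, assumed here, is that token error is the fraction of tokens with a wrong tag and sentence error is the fraction of sentences containing at least one wrong tag. Under that assumption, they could be computed as in the Python sketch below; the function name `error_rates` is hypothetical.

```python
from typing import List, Tuple


def error_rates(gold: List[List[str]], predicted: List[List[str]]) -> Tuple[float, float]:
    """Return (token error rate, sentence error rate) under the assumed definitions."""
    token_total = token_wrong = sentence_wrong = 0
    for gold_tags, predicted_tags in zip(gold, predicted):
        wrong = sum(1 for g, p in zip(gold_tags, predicted_tags) if g != p)
        token_total += len(gold_tags)
        token_wrong += wrong
        if wrong:
            sentence_wrong += 1  # one wrong tag makes the whole sentence wrong
    return token_wrong / token_total, sentence_wrong / len(gold)


if __name__ == "__main__":
    gold = [["S", "B_LOCATION", "E_LOCATION"], ["S", "S"]]
    predicted = [["S", "B_LOCATION", "S"], ["S", "S"]]
    token_error, sentence_error = error_rates(gold, predicted)
    print(f"token error {token_error:.2%}, sentence error {sentence_error:.2%}")
```

Under these definitions the sentence error is never lower than the token error, which is consistent with the gap between the two columns in the table.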

Run on Linux/Mac

RNNSharp is a pure C# project, so it can be compiled with .NET Core or Mono and runs without modification on Linux/Mac.

APIs

RNNSharp also provides APIs that developers can use in their own projects. Download the source code package and open the RNNSharpConsole project to see how to use the APIs to encode and decode RNN models. Note that before using the RNNSharp APIs, you should add RNNSharp.dll as a reference to your project.

RNNSharp referenced by the following published papers

  1. Project-Team IntuiDoc: Intuitive user interaction for document
  2. A New Pre-training Method for Training Deep Learning Models with Application to Spoken Language Understanding
  3. Long Short-Term Memory
  4. Deep Learning