Special token formatting #3

tdozat · 2017-06-18T17:57:14Z

The config file currently dictates which special tokens each vocabulary uses. This is because the parser needs to know which token in the training file(s) is the root. In SD and UD this is root, but in CTB and some CoNLL 2009 treebanks it's ROOT, so we can't just hardcode the label string that indicates the root relation. In a previous implementation of the parser, the root string could be specified directly; in this one, you specify the format of all special tokens for consistency. However, this opens up the possibility of leaving out special tokens that the code assumes are there, or including ones that the code never uses.

A better approach is to hardcode what the special tokens are for each vocabulary but let the configuration file specify only their format, allowing for the following possibilities:

Upper (e.g. ROOT)
Proper (e.g. Root)
Lower (e.g. root)
Upper HTML (e.g. <ROOT>)
Proper HTML (e.g. <Root>)
Lower HTML (e.g. <root>)

Replacing the special_tokens option with two options, special_token_case and special_token_html, should fix this, but it'll break older models.
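
As a rough sketch of how the two proposed options could combine, something like the following might work. The function name, the option values ("upper", "proper", "lower"), and the token list are assumptions made for illustration; none of them exist in the parser yet.

```python
# Hypothetical sketch: these names are illustrative, not the parser's actual API.
# Assumed hardcoded special tokens for one vocabulary, stored in lowercase.
SPECIAL_TOKENS = ("pad", "root", "unk")


def format_special_token(token, case="upper", html=False):
    """Render one hardcoded special token in the configured style.

    case: assumed values "upper", "proper", or "lower" (special_token_case)
    html: wrap the token in angle brackets if True (special_token_html)
    """
    if case == "upper":
        formatted = token.upper()
    elif case == "proper":
        formatted = token.capitalize()
    elif case == "lower":
        formatted = token.lower()
    else:
        raise ValueError(
            "special_token_case must be 'upper', 'proper', or 'lower', got %r" % (case,)
        )
    return "<%s>" % formatted if html else formatted


def format_special_tokens(case="upper", html=False):
    """Format every hardcoded special token the same way."""
    return [format_special_token(tok, case, html) for tok in SPECIAL_TOKENS]
```

With this shape, a CTB-style config would use case="upper" with html=False to get ROOT, and a UD-style config would use case="lower" to get root. Because the token list itself is fixed in code, the config can no longer omit a token the code relies on or add one it never uses.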

tdozat added the enhancement label Jun 18, 2017

tdozat self-assigned this Jun 18, 2017