The config file dictates which special tokens each vocabulary uses. This is because the parser needs to know which token in the training file(s) is the root. In SD and UD, this is `root`, but in CTB and some CoNLL 2009 treebanks, it's `ROOT`. This means we can't just hardcode which label string indicates the root relation. In a previous implementation of the parser, the root string could be specified directly, but in this one you specify the format of all special tokens to allow for consistency. However, this opens up the possibility of leaving out special tokens that the code assumes are there, or including ones that the code never uses.
A better approach is to hardcode which special tokens each vocabulary uses, but let the configuration file specify their format, allowing for the following possibilities:
- Upper (e.g. `ROOT`)
- Proper (e.g. `Root`)
- Lower (e.g. `root`)
- Upper HTML (e.g. `<ROOT>`)
- Proper HTML (e.g. `<Root>`)
- Lower HTML (e.g. `<root>`)
Replacing the `special_tokens` option with two options, `special_token_case` and `special_token_html`, should fix this, but it'll break older models.
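
A rough sketch of how the two proposed options could map onto the six formats listed above. Only the option names `special_token_case` and `special_token_html` come from this issue; the function names, the accepted values (`upper`, `proper`, `lower`), and the `SPECIAL_TOKEN_NAMES` tuple are assumptions made for illustration, not the parser's actual code.

```python
# Hypothetical: the token names a vocabulary hardcodes. The real set
# depends on the vocabulary and is not specified in this issue.
SPECIAL_TOKEN_NAMES = ("pad", "root", "unk")

# Assumed values for the proposed special_token_case option.
_CASE_TRANSFORMS = {
    "upper": str.upper,
    "proper": str.capitalize,
    "lower": str.lower,
}


def format_special_token(name, special_token_case="upper", special_token_html=False):
    """Render one hardcoded token name in the configured style."""
    if special_token_case not in _CASE_TRANSFORMS:
        raise ValueError(
            "special_token_case must be one of %s, got %r"
            % (sorted(_CASE_TRANSFORMS), special_token_case)
        )
    token = _CASE_TRANSFORMS[special_token_case](name)
    # Wrap in angle brackets only when the HTML style is requested.
    return "<%s>" % token if special_token_html else token


def build_special_tokens(special_token_case="upper", special_token_html=False):
    """Format every hardcoded token name the same way."""
    return [
        format_special_token(name, special_token_case, special_token_html)
        for name in SPECIAL_TOKEN_NAMES
    ]
```

Because the set of names is fixed in code and only the style comes from the config, a config file can no longer omit a token the code expects or add one it never uses.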