Special token formatting #3

Open
tdozat opened this issue Jun 18, 2017 · 0 comments

tdozat commented Jun 18, 2017

The config file currently dictates which special tokens each vocabulary uses. This is because the parser needs to know which label in the training file(s) marks the root. In SD and UD this is root, but in CTB and some CoNLL 2009 treebanks it's ROOT, so we can't just hardcode which label string indicates the root relation. In a previous implementation of the parser only the root string could be specified; in this one you specify the format of all special tokens, for consistency. However, this opens up the possibility of leaving out special tokens that the code assumes are there, or of including ones that the code never uses.

A better approach is to hardcode which special tokens each vocabulary has, but let the configuration file specify their format, allowing for the following possibilities:

  1. Upper (e.g. ROOT)
  2. Proper (e.g. Root)
  3. Lower (e.g. root)
  4. Upper HTML (e.g. <ROOT>)
  5. Proper HTML (e.g. <Root>)
  6. Lower HTML (e.g. <root>)

Replacing the special_tokens option with two options, special_token_case and special_token_html, should fix this, but it'll break older models.
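A minimal sketch of how the two proposed options could map onto the six formats listed above. Everything other than the option names special_token_case and special_token_html is an assumption made for illustration: the function name, the accepted values "upper"/"proper"/"lower", and the example token set (pad, root, unk) are hypothetical and not taken from the parser's code.

```python
# Hypothetical sketch; names and accepted values are illustrative assumptions,
# not the parser's actual API.

# Assumed hardcoded special tokens for one vocabulary (illustrative only).
SPECIAL_TOKENS = ("pad", "root", "unk")

# Maps a special_token_case value to the string transform it implies.
_CASE_TRANSFORMS = {
    "upper": str.upper,        # ROOT
    "proper": str.capitalize,  # Root
    "lower": str.lower,        # root
}


def format_special_token(name, special_token_case="upper", special_token_html=False):
    """Render a hardcoded special token name in the configured style."""
    if special_token_case not in _CASE_TRANSFORMS:
        raise ValueError("unknown special_token_case: %r" % (special_token_case,))
    token = _CASE_TRANSFORMS[special_token_case](name)
    # Wrap in angle brackets when the HTML-style format is requested.
    return "<%s>" % token if special_token_html else token


if __name__ == "__main__":
    # Print all six combinations for the assumed token set.
    for case in ("upper", "proper", "lower"):
        for html in (False, True):
            print(case, html, [format_special_token(t, case, html) for t in SPECIAL_TOKENS])
```

With this shape, the set of special tokens stays fixed in code, so a config file can change how they are written but cannot drop one the code relies on or add one it never uses.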

@tdozat tdozat self-assigned this Jun 18, 2017