The code is for our paper: Automated Concatenation of Embeddings for Structured Prediction
ACE is a framework for automatically searching for a good concatenation of embeddings for structured prediction tasks, achieving state-of-the-art accuracy. The code is based on flair version 0.4.3 with substantial modifications.
Task | Language | Dataset | ACE | Previous best |
---|---|---|---|---|
Named Entity Recognition | English | CoNLL 03 (sentence-level) | 93.6 (F1) | 93.5 (Yu et al., 2020) |
Named Entity Recognition | English | CoNLL 03 (document-level) | 94.1 (F1) | 93.5 (Yu et al., 2020) |
Named Entity Recognition | German | CoNLL 03 (document-level) | 88.0 (F1) | 86.4 (Yu et al., 2020) |
Named Entity Recognition | German | CoNLL 03 (06 Revision) (document-level) | 91.4 (F1) | 90.3 (Yu et al., 2020) |
Named Entity Recognition | Dutch | CoNLL 02 (document-level) | 95.5 (F1) | 93.7 (Yu et al., 2020) |
Named Entity Recognition | Spanish | CoNLL 02 (document-level) | 95.6 (F1) | 90.3 (Yu et al., 2020) |
POS Tagging | English | Ritter's | 93.4 (Acc) | 90.1 (Nguyen et al., 2020) |
POS Tagging | English | TweeBank v2 | 95.6 (Acc) | 95.2 (Nguyen et al., 2020) |
Aspect Extraction | English | SemEval 2014 Laptop | 85.0 (F1) | 84.3 (Xu et al., 2019) |
Aspect Extraction | English | SemEval 2016 Restaurant | 81.2 (F1) | 78.0 (Xu et al., 2019) |
Dependency Parsing | English | PTB | 95.7 (LAS) | 95.3 (Wang et al., 2020) |
Semantic Dependency Parsing | English | DM ID | 95.3 (LF1) | 94.4 (Fernández-González and Gómez-Rodríguez, 2020) |
Semantic Dependency Parsing | English | DM OOD | 92.6 (LF1) | 91.0 (Fernández-González and Gómez-Rodríguez, 2020) |
Semantic Dependency Parsing | English | PAS ID | 95.3 (LF1) | 95.1 (Fernández-González and Gómez-Rodríguez, 2020) |
Semantic Dependency Parsing | English | PAS OOD | 93.9 (LF1) | 93.4 (Fernández-González and Gómez-Rodríguez, 2020) |
Semantic Dependency Parsing | English | PSD ID | 83.6 (LF1) | 82.6 (Fernández-González and Gómez-Rodríguez, 2020) |
Semantic Dependency Parsing | English | PSD OOD | 83.2 (LF1) | 82.0 (Fernández-González and Gómez-Rodríguez, 2020) |
The project is based on PyTorch 1.1+ and Python 3.6+. To run our code, install the requirements:

```bash
pip install -r requirements.txt
```

The following requirement should be satisfied:
- transformers: 3.0.0
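If a different version of transformers is already installed in your environment, you can pin the expected version explicitly with a standard pip command:

```bash
pip install transformers==3.0.0
```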
In our code, most of the embeddings can be downloaded automatically (except ELMo for non-English languages). You can also download the embeddings manually; please check Table 8 in our paper for the download links.
We provide pretrained models for Named Entity Recognition (Sentence-/Document-Level) and Dependency Parsing (PTB) on OneDrive. You can find the corresponding config files in `config/`. For the zip files named `doc*.zip`, you need to extract document-level embeddings first; please check (Optional) Extract Document Features.

To use a pretrained model:
- Download the model.
- `unzip` the zip file.
- Put the extracted directory in `resources/taggers/`.
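A minimal sketch of these steps, assuming the downloaded archive is named `ace_ner_model.zip` (a placeholder; substitute the actual file from OneDrive):

```bash
# Placeholder archive name; substitute the file downloaded from OneDrive.
unzip ace_ner_model.zip
mkdir -p resources/taggers
# The extracted directory name should match the model directory expected by the config file.
mv ace_ner_model resources/taggers/
```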
To check the accuracy of the model, run:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml --test
```
where `--config $config_file` specifies the configuration file. Here we take CoNLL 2003 English NER as an example; the `$config_file` is `config/conll_03_english.yaml`.
To train the model, run:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config $config_file
```

To set the dataset manually, specify it in the `$config_file`:
Sequence Labeling:

```yaml
targets: ner
ner:
  Corpus: ColumnCorpus-1
  ColumnCorpus-1:
    data_folder: datasets/conll_03_new
    column_format:
      0: text
      1: pos
      2: chunk
      3: ner
    tag_to_bioes: ner
  tag_dictionary: resources/taggers/your_ner_tags.pkl
```
Parsing:

```yaml
targets: dependency
dependency:
  Corpus: UniversalDependenciesCorpus-1
  UniversalDependenciesCorpus-1:
    data_folder: datasets/ptb
    add_root: True
  tag_dictionary: resources/taggers/your_parsing_tags.pkl
```
The `tag_dictionary` is a path to the tag dictionary for the task. If the path does not exist, the code will generate a tag dictionary at that path automatically. The dataset format is `Corpus: $CorpusClassName-$id`, where `$id` is the name of the dataset (anything you like). You can train on multiple datasets jointly. For example:
```yaml
Corpus: ColumnCorpus-1:ColumnCorpus-2:ColumnCorpus-3
ColumnCorpus-1:
  data_folder: ...
  column_format: ...
  tag_to_bioes: ...
ColumnCorpus-2:
  data_folder: ...
  column_format: ...
  tag_to_bioes: ...
ColumnCorpus-3:
  data_folder: ...
  column_format: ...
  tag_to_bioes: ...
```
Please refer to Config File for more details.
You need to modify the embedding paths in the `$config_file` to change the embeddings for concatenation. For example, to add `bert-large-cased` to `config/conll_03_english.yaml`:
```yaml
embeddings:
  TransformerWordEmbeddings-0:
    layers: '-1'
    pooling_operation: first
    model: xlm-roberta-large-finetuned-conll03-english
  TransformerWordEmbeddings-1:
    model: bert-base-cased
    layers: -1,-2,-3,-4
    pooling_operation: mean
  TransformerWordEmbeddings-2:
    model: bert-base-multilingual-cased
    layers: -1,-2,-3,-4
    pooling_operation: mean
  TransformerWordEmbeddings-3: # New embeddings
    model: bert-large-cased
    layers: -1,-2,-3,-4
    pooling_operation: mean
  ...
```
To achieve state-of-the-art accuracy, one optional approach is to fine-tune the transformer-based embeddings on the target task. For NER we use fine-tuned embeddings available on Hugging Face, while embeddings for the other tasks are fine-tuned by ourselves. The fine-tuned embeddings are then used as embedding candidates for ACE. Taking fine-tuning BERT on PTB parsing as an example, run:
```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config config/en-bert-finetune-ptb.yaml
```
After the model is fine-tuned, you will find a tuned BERT model at:

```bash
ls resources/taggers/en-bert_10epoch_0.5inter_2000batch_0.00005lr_20lrrate_ptb_monolingual_nocrf_fast_warmup_freezing_beta_weightdecay_finetune_saving_nodev_dependency16/bert-base-cased
```
Then, replace `bert-base-cased` with `resources/taggers/en-bert_10epoch_0.5inter_2000batch_0.00005lr_20lrrate_ptb_monolingual_nocrf_fast_warmup_freezing_beta_weightdecay_finetune_saving_nodev_dependency16/bert-base-cased` in the `$config_file` of the ACE model (for example, `config/ptb_parsing_model.yaml`).
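For example, assuming `bert-base-cased` was configured as `TransformerWordEmbeddings-1` as in the embeddings example above, the entry would become (a sketch, not the full config):

```yaml
embeddings:
  TransformerWordEmbeddings-1:
    model: resources/taggers/en-bert_10epoch_0.5inter_2000batch_0.00005lr_20lrrate_ptb_monolingual_nocrf_fast_warmup_freezing_beta_weightdecay_finetune_saving_nodev_dependency16/bert-base-cased
    layers: -1,-2,-3,-4
    pooling_operation: mean
```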
The config `config/en-bert-finetune-ptb.yaml` can be adapted to fine-tune other embeddings for parsing tasks. Here is an example config for fine-tuning NER (sequence labeling tasks): `config/en-bert-finetune-ner.yaml`.
To achieve state-of-the-art accuracy for NER, one optional approach is to extract document-level features from the transformer-based embeddings and use them as embedding candidates for ACE. We follow the embedding extraction approach of Yu et al. (2020) and use sentences containing the single word `-DOCSTART-` to split the documents. CoNLL 2002 Spanish has no `-DOCSTART-` sentences, so we add a `-DOCSTART-` sentence every 25 sentences (a preprocessing sketch for this case follows the example below). In CoNLL 2002 Dutch, `-DOCSTART-` appears inside the first sentence of each document, so please split the `-DOCSTART-` token into its own sentence. For example:
```
-DOCSTART- -DOCSTART- O
De Art O
tekst N O
van Prep O
het Art O
arrest N O
is V O
nog Adv O
niet Adv O
schriftelijk Adj O
beschikbaar Adj O
maar Conj O
het Art O
bericht N O
werd V O
alvast Adv O
bekendgemaakt V O
door Prep O
een Art O
communicatiebureau N O
dat Conj O
Floralux N B-ORG
inhuurde V O
. Punc O
...
```
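For the CoNLL 2002 Spanish case mentioned above, a minimal preprocessing sketch that inserts a `-DOCSTART-` sentence every 25 sentences could look like the following (a hypothetical helper, not part of this repository; file names are placeholders):

```python
# Hypothetical helper, not part of this repository: insert a -DOCSTART-
# sentence every 25 sentences into a blank-line-separated CoNLL file.
def add_docstart(in_path, out_path, every=25):
    with open(in_path, encoding='utf-8') as f:
        # Sentences in CoNLL-style files are separated by blank lines.
        sentences = [s for s in f.read().split('\n\n') if s.strip()]
    with open(out_path, 'w', encoding='utf-8') as f:
        for i, sentence in enumerate(sentences):
            if i % every == 0:
                # CoNLL 2002 Spanish files have two columns (token, NER tag);
                # adjust this dummy line if your file has a different format.
                f.write('-DOCSTART- O\n\n')
            f.write(sentence.strip('\n') + '\n\n')

# Example usage with placeholder paths:
add_docstart('datasets/conll_02_spanish/esp.train',
             'datasets/conll_02_spanish/esp.train.docstart')
```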
Taking the English BERT model on CoNLL English NER as an example, run:

```bash
CUDA_VISIBLE_DEVICES=0 python extract_features.py --config config/en-bert-extract.yaml --batch_size 32
```
If you want to parse a certain file, include `train` in the file name and put the file in a directory `$dir` (for example, `parse_file_dir/train.your_file_name`). Then run:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config $config_file --parse --target_dir $dir --keep_order
```

The parsed results will be written to `outputs/`. The file format should be `column_format={0: 'text', 1:'ner'}` for sequence labeling, or you can modify line 232 in `train.py`.
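Under this `column_format`, the input file has one token per line with two whitespace-separated columns, `text` and `ner`. An illustrative snippet (the label column can contain placeholder tags such as `O` if you only need predictions):

```
Obama B-PER
visited O
Paris B-LOC
. O
```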
The config files are in YAML format.

- `targets`: The target task.
  - `ner`: named entity recognition
  - `upos`: part-of-speech tagging
  - `chunk`: chunking
  - `ast`: aspect extraction
  - `dependency`: dependency parsing
  - `enhancedud`: semantic dependency parsing/enhanced universal dependency parsing
- `ner`: An example for the `targets`. If `targets: ner`, then the code will read the values with the key of `ner`.
  - `Corpus`: The training corpora for the model; use `:` to split different corpora.
  - `tag_dictionary`: A path to the tag dictionary for the task. If the path does not exist, the code will generate a tag dictionary at the path automatically.
- `target_dir`: Save directory.
- `model_name`: The trained models will be saved in `$target_dir/$model_name`.
- `model`: The model to train, depending on the task.
  - `FastSequenceTagger`: Sequence labeling model. The values are the parameters.
  - `SemanticDependencyParser`: Syntactic/semantic dependency parsing model. The values are the parameters.
- `embeddings`: The embeddings for the model. Each key is the class name of the embedding and the values of the key are the parameters; see `flair/embeddings.py` for more details. For each embedding, use `$classname-$id` to represent the class. For example, if you want to use BERT and M-BERT for a single model, you can name them `TransformerWordEmbeddings-0` and `TransformerWordEmbeddings-1`.
- `trainer`: The trainer class.
  - `ModelFinetuner`: The trainer for fine-tuning embeddings or simply training a task model without ACE.
  - `ReinforcementTrainer`: The trainer for training ACE.
- `train`: The parameters for the `train` function in `trainer` (for example, `ReinforcementTrainer.train()`).
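Putting these keys together, here is a minimal sketch of a config for training a NER model without ACE. Parameter names and values inside `model` and `train`, as well as `model_name`, are illustrative only; see the files under `config/` for the settings actually used in our experiments.

```yaml
targets: ner
ner:
  Corpus: ColumnCorpus-1
  ColumnCorpus-1:
    data_folder: datasets/conll_03_new
    column_format:
      0: text
      1: pos
      2: chunk
      3: ner
    tag_to_bioes: ner
  tag_dictionary: resources/taggers/your_ner_tags.pkl
target_dir: resources/taggers/
model_name: example_ner_model      # saved in $target_dir/$model_name
model:
  FastSequenceTagger:
    hidden_size: 256               # illustrative parameter
embeddings:
  TransformerWordEmbeddings-0:
    model: bert-base-cased
    layers: -1,-2,-3,-4
    pooling_operation: mean
trainer: ModelFinetuner
train:
  learning_rate: 0.1               # illustrative parameters; see config/
  mini_batch_size: 32              # for the real settings
  max_epochs: 150
```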
- Knowledge Distillation with ACE: Wang et al., 2020
If you find the code helpful, please cite:
```bibtex
@article{wang2020automated,
  title={Automated Concatenation of Embeddings for Structured Prediction},
  author={Wang, Xinyu and Jiang, Yong and Bach, Nguyen and Wang, Tao and Huang, Zhongqiang and Huang, Fei and Tu, Kewei},
  journal={arXiv preprint arXiv:2010.05006},
  year={2020}
}
```
Please email your questions or comments to Xinyu Wang.