Source code of the paper "E2CNN: Entity-Type-Enriched Cascaded Neural Network for Financial Relation Extraction"
In this work, we present a cascaded pointer network approach for financial hyper-relation extraction, which models relations as functions that map subjects to objects and incorporates entity type information to address the overlapping triple problem. Our approach contains two components:
- NER model: Adopts a span-level entity extraction strategy that predicts an entity type for every candidate span, effectively handling nested entities in financial texts.
- RE model: Adopts a cascaded pointer network fused with entity tags to make full use of entity type features. The network first extracts the relation subject, then jointly extracts the relation object and relation type, which effectively handles complex relation types.
Please find more details of this work in our paper.
Please install all the dependency packages using the following command:
pip install -r requirements.txt
Pretrained base transformer models need to be downloaded and placed in ./pretrain_models. We use chinese-roberta-wwm-ext as the default base model for the Chinese datasets.
Our pre-trained model is based on the Chinese financial corpus FinCorpus.CN. The other open datasets mentioned in the paper are available at:
DUIE: http://ai.baidu.com/broad/introduction
CLUENER2020: https://github.com/CLUEbenchmark/CLUENER2020
The input data format of the NER model is JSONL. Each line of the input file contains one document in the following format.
{
# document ID (please make sure the ID of each document is unique)
"doc_key": "2",
# sentences in the document, each sentence is a list of tokens.
"sentences": [
[...],
[...],
["报", "告", "期", "内", ",", ...],
...
],
# entities (boundaries and entity type) in each sentence
"ner": [
[...],
[...],
[[5, 8, "公司企业"], [13, 24, "公司企业"], ...], # the boundary positions are indexed at the document level
...,
],
# relations (two spans and relation type) in each sentence
"relations": [
[...],
[...],
[[5, 8, 13, 24, "投资"], ...],
...
]
}
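Because entity boundaries are indexed at the document level (as noted in the example above), sentence-local offsets must be recovered by subtracting the cumulative token count of preceding sentences. A minimal sketch, assuming the field names shown above (the helper names are ours, not from the repo):

```python
import json

def load_ner_docs(path):
    """Load one document per line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def to_sentence_level(doc):
    """Convert document-level entity boundaries to sentence-local offsets."""
    offset = 0
    per_sentence = []
    for tokens, entities in zip(doc["sentences"], doc["ner"]):
        per_sentence.append([[s - offset, e - offset, t] for s, e, t in entities])
        offset += len(tokens)
    return per_sentence
```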
You can use run_entity.py with --do_train to train a NER model and with --do_eval to evaluate a NER model.
The following commands can be used to train NER models on FinCorpus.CN:
CUDA_VISIBLE_DEVICES=0 \
python run_entity.py \
--do_train --do_eval --eval_test \
--learning_rate=1e-5 \
--task finance \
--model ./pretrain_models/chinese-roberta-wwm-ext \
--data_dir ./entity_data/FinCorpus.CN \
--output_dir entity_output/FinCorpus.CN \
--context_window 0 --num_epoch 100 --max_span_length 26
Arguments:
--task: determines the entity and relation types that appear in the dataset. All entity types for the corresponding task must be defined in ./shared/const.py.
--model: the base transformer model. We use chinese-roberta-wwm-ext for the Chinese datasets.
--data_dir: the input directory of the dataset. The prediction files (ent_pred_dev.json or ent_pred_test.json) of the entity model will be saved in this directory.
--output_dir: the output directory of the entity model and logs.
The input data format of the relation model is JSONL. Each line of the input file contains one document in the following format.
{
# text: input sentence
"text": "报告期内,亿利洁能投资设立宁波氢能创新中心有限公司(公司持股10%),积极布局氢能源产业。",
# Relations(predicate), entity types(subject_type and object_type) and entity(subject and object) information
"spo_list": [
{
"predicate": "投资",
"subject_type": "公司企业",
"object_type": "公司企业",
"subject": "亿利洁能",
"object": "宁波氢能创新中心有限公司"
}
],
# predicted entities (boundaries and entity type) in the sentence. (If there is no entity prediction file, the predicted entities are set to all the gold entities involved in the sentence.)
"predict_entity": [
[5, 8, "公司企业"],
[13, 24, "公司企业"]
]
}
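A small sanity check on this format can catch conversion mistakes early. The sketch below (field names taken from the example above; the helper name is ours) verifies that each subject/object string occurs in the text and that predicted spans are in range:

```python
def check_spo_record(record):
    """Sanity-check one relation-model record (illustrative sketch only)."""
    text = record["text"]
    problems = []
    for spo in record.get("spo_list", []):
        for role in ("subject", "object"):
            if spo[role] not in text:
                problems.append(f"{role} '{spo[role]}' not found in text")
    # Boundaries are inclusive character offsets into the text.
    for start, end, _etype in record.get("predict_entity", []):
        if not (0 <= start <= end < len(text)):
            problems.append(f"span ({start}, {end}) out of range")
    return problems
```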
You can use the following command to convert data in the NER model's format into data for the RE model.
python entity_data/analyze_data/prediction2relation.py \
--pred_dir entity_data/FinCorpus.CN/ \
--output_dir relation_data/FinCorpus.CN/
Arguments:
--pred_dir: the directory of entity model prediction files. To classify relations based on predicted entities, ent_pred_test.json and ent_pred_dev.json should be included.
--output_dir: the output directory of the converted dataset, which will be the input directory of the relation model. In addition, the entity types and relation types involved in the dataset must be predefined in JSONL files (entity.json, rel.json) placed in the dataset folder.
You can train the relation model with the default configuration using run_relation.py. A training command template is as follows:
python run_relation.py \
--lr 1e-5 \
--batch_size 4 \
--max_epoch 10 \
--max_len 300 \
--dataset FinCorpus.CN \
--bert_name './pretrain_models/chinese-roberta-wwm-ext' \
--bert_dim 768
Arguments:
--dataset: the dataset name. The dataset needs to be placed in the corresponding folder under ./relation_data/.
--bert_name: the base transformer model. We use chinese-roberta-wwm-ext for the Chinese datasets. Other configurations can be viewed and modified in ./relation/config.py.
The prediction results will be stored in the file result.json in the folder ./results/.
You can use the following command to evaluate the trained model:
python Evaluate.py \
--dataset FinCorpus.CN
If you use our code in your research, please cite our work:
E2CNN is developed at the National Engineering Research Center for Big Data Technology and System, Cluster and Grid Computing Lab, Services Computing Technology and System Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China by Mengfan Li ([email protected]), Xuanhua Shi ([email protected]), Chenqi Qiao ([email protected]), Weihao Wang ([email protected]), Yao Wan ([email protected]), and Hai Jin ([email protected]), and at the Department of Computing, The Hong Kong Polytechnic University by Xiao Huang ([email protected]).