Golden Touchstone is a simple, effective, and systematic benchmark for bilingual (Chinese-English) financial large language models. Like a touchstone, it is intended to drive the research and implementation of financial large language models. We have also trained and open-sourced Touchstone-GPT as a baseline for subsequent community research.
Our inference is based on the llama-factory framework, and eval_benchmark.sh is our inference script. Before running it, register the template and dataset in llama-factory and download the specified open-source model.
cd eval_benchmark/
bash eval_benchmark.sh
All files of our llama-factory framework will be uploaded later.
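Registering a dataset in llama-factory usually means adding an entry to its data/dataset_info.json file. The sketch below shows one way to do that from Python. The helper `register_dataset`, the dataset name `touchstone_demo`, the file name, and the column mapping are hypothetical placeholders, and the entry layout is an assumption that should be checked against the llama-factory version in use.

```python
import json
from pathlib import Path


def register_dataset(info_path, name, file_name, columns=None):
    """Add or overwrite one dataset entry in a dataset_info.json-style file."""
    info_path = Path(info_path)
    # Start from the existing registry if the file is already there.
    if info_path.exists():
        registry = json.loads(info_path.read_text(encoding="utf-8"))
    else:
        registry = {}
    entry = {"file_name": file_name}
    if columns:
        # Optional mapping from llama-factory field names to the dataset's own keys.
        entry["columns"] = columns
    registry[name] = entry
    info_path.write_text(
        json.dumps(registry, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return registry
```

A call such as `register_dataset("data/dataset_info.json", "touchstone_demo", "touchstone_demo.json")` would add one entry and leave the existing entries untouched.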
evaluate_all.py is an evaluation program that operates on the files generated by llama-factory inference. It takes three main parameters:
Model, eval_dataset_path, output_dir
Model specifies the model you use; its name is embedded in the two file paths eval_dataset_path and output_dir.
eval_dataset_path is the path to the files generated once the llama-factory framework completes inference; it should contain the output folder of each dataset.
output_dir is the path where the results of all dataset tasks are written, in JSON format.
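As an illustration of how the model name can be embedded in both paths, the following sketch derives them from a single value. The directory names `inference_output` and `eval_results` and the helper `build_eval_paths` are assumptions made for this example; the actual layout used by evaluate_all.py may differ.

```python
from pathlib import Path


def build_eval_paths(model, inference_root="inference_output", results_root="eval_results"):
    """Return (eval_dataset_path, output_dir) for a given model name."""
    # Both paths share the model name as their last component (assumed layout).
    eval_dataset_path = Path(inference_root) / model
    output_dir = Path(results_root) / model
    return eval_dataset_path, output_dir


if __name__ == "__main__":
    eval_dataset_path, output_dir = build_eval_paths("TouchstoneGPT-7B-Instruct")
    print(eval_dataset_path)
    print(output_dir)
```

Deriving both paths from one variable means that switching models requires changing only a single value.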
After setting these three variables, run
cd eval_benchmark/
python evaluate_all.py
All evaluation results are then available in output_dir.
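Since the results are written as JSON, they can be gathered for comparison with a few lines of Python. This is a minimal sketch that assumes one .json file per dataset task directly under output_dir. The file naming and the contents of each file are not specified here, so the hypothetical helper `load_results` simply loads whatever it finds.

```python
import json
from pathlib import Path


def load_results(output_dir):
    """Return {file stem: parsed JSON} for every .json file in output_dir."""
    results = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            results[path.stem] = json.load(f)
    return results
```

The returned dictionary is keyed by file name without the extension, which makes it straightforward to print or tabulate results per task.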
The paper presents an evaluation of the diversity, systematicness, and LLM adaptability of each open-source benchmark.
By collecting and selecting representative task datasets, we built our own Chinese-English bilingual Touchstone Benchmark, which includes 22 datasets.
We extensively evaluated GPT-4o, Llama 3, Qwen2, FinGPT, and our own trained Touchstone-GPT, analyzed the strengths and weaknesses of these models, and provided directions for subsequent research on financial large language models.
The following code snippet uses apply_chat_template to show how to load the tokenizer and model and how to generate content.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model; device_map="auto" places the weights on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    "IDEA-FinAI/TouchstoneGPT-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("IDEA-FinAI/TouchstoneGPT-7B-Instruct")

prompt = "What is the sentiment of the following financial post: Positive, Negative, or Neutral?\nsees #Apple at $150/share in a year (+36% from today) on growing services business."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Render the chat messages into a single prompt string using the model's template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Move the inputs to the same device as the model
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Pass input_ids and attention_mask together
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
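The generated response is free-form text, so a scoring step typically needs to map it to one of the labels named in the prompt. The helper below is a hypothetical sketch of such a normalization, not the benchmark's actual scoring logic: it returns the label that appears earliest in the response, or None when no label is found.

```python
import re

LABELS = ("positive", "negative", "neutral")


def extract_sentiment(response):
    """Return the first of Positive/Negative/Neutral found in response, else None."""
    matches = []
    for label in LABELS:
        # Whole-word, case-insensitive search for each candidate label.
        m = re.search(rf"\b{label}\b", response, flags=re.IGNORECASE)
        if m:
            matches.append((m.start(), label))
    if not matches:
        return None
    # Pick the label with the smallest starting position.
    return min(matches)[1].capitalize()
```

Calling `extract_sentiment(response)` on the decoded output yields a label that can be compared directly with the reference answer. Responses that name no label map to None and can be counted separately.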
@misc{wu2024goldentouchstonecomprehensivebilingual,
  title={Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models},
  author={Xiaojun Wu and Junxi Liu and Huanyi Su and Zhouchi Lin and Yiyan Qi and Chengjin Xu and Jiajun Su and Jiajie Zhong and Fuwei Wang and Saizhuo Wang and Fengrui Hua and Jia Li and Jian Guo},
  year={2024},
  eprint={2411.06272},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.06272},
}