Golden-Touchstone

Golden Touchstone is a simple, effective, and systematic benchmark for bilingual (Chinese-English) financial large language models, intended to drive the research and deployment of financial LLMs much as a touchstone tests gold. We have also trained and open-sourced Touchstone-GPT as a baseline for subsequent community research.

Evaluation of Touchstone Benchmark

Our inference is based on the llama-factory framework, and eval_benchmark.sh is our inference script. Register the template and dataset in llama-factory and download the specified open-source model before running it:

cd eval_benchmark/
bash eval_benchmark.sh

All files of our llama-factory framework will be uploaded later.

Quick Eval Use

evaluate_all.py is an evaluation program that runs on the files generated by llama-factory inference. It takes three main parameters: model, eval_dataset_path, and output_dir.

model specifies the model being evaluated; its name is embedded in both the eval_dataset_path and output_dir paths.

eval_dataset_path is the path to the files generated after the llama-factory framework completes inference; it should contain one output folder per dataset.

output_dir is the path where the results of all dataset tasks are written, in JSON format.

After specifying these three path variables, run

cd eval_benchmark/
python evaluate_all.py

and all evaluation results will be written to output_dir.
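As a rough sketch of how the per-task results could be collected afterwards (the flat `<dataset>.json` layout and the metric keys below are assumptions for illustration, not the script's documented output format):

```python
import json
import tempfile
from pathlib import Path

def collect_results(output_dir):
    """Load every per-dataset result JSON under output_dir into one dict.

    Assumes one <dataset>.json file per task, each holding a flat
    metric -> score mapping (an assumption, not a documented contract).
    """
    results = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            results[path.stem] = json.load(f)
    return results

# Demo with a throwaway directory standing in for a real output_dir.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "fpb.json").write_text(json.dumps({"accuracy": 0.81, "f1": 0.79}))
(demo_dir / "finqa.json").write_text(json.dumps({"accuracy": 0.55}))

summary = collect_results(demo_dir)
print(sorted(summary))  # ['finqa', 'fpb']
```

A helper like this makes it easy to compare metrics across all 22 datasets in one place.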

Introduction

The paper evaluates each open-source benchmark on its diversity, systematicness, and LLM adaptability.


By collecting and selecting representative task datasets, we built our own Chinese-English bilingual Touchstone Benchmark, which includes 22 datasets.


We extensively evaluated GPT-4o, Llama-3, Qwen2, FinGPT, and our own Touchstone-GPT, analyzed the strengths and weaknesses of these models, and provided directions for subsequent research on financial large language models.


Usage of Touchstone-GPT

Here is a code snippet using apply_chat_template that shows how to load the tokenizer and model and how to generate content.

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

# Load the model and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(
    "IDEA-FinAI/TouchstoneGPT-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("IDEA-FinAI/TouchstoneGPT-7B-Instruct")

prompt = "What is the sentiment of the following financial post: Positive, Negative, or Neutral?\nsees #Apple at $150/share in a year (+36% from today) on growing services business."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Apply the chat template to build the prompt format the model expects.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated reply remains.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
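For this prompt the decoded response should name one of the three labels, but replies are free-form text; a small normalization step makes them safe to consume downstream (the label set comes from the prompt above, while the matching rule and the Neutral fallback are our own choices for illustration):

```python
def parse_sentiment(reply: str) -> str:
    """Map a free-form model reply onto one of the three prompt labels."""
    text = reply.strip().lower()
    # Check each label in a fixed order; first match wins (our choice).
    for label in ("positive", "negative", "neutral"):
        if label in text:
            return label.capitalize()
    return "Neutral"  # fallback when the reply names no label (our choice)

print(parse_sentiment("The sentiment of this post is Positive."))  # Positive
```

This keeps downstream accuracy computations from breaking when the model wraps its answer in a full sentence.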

Citation

@misc{wu2024goldentouchstonecomprehensivebilingual,
      title={Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models}, 
      author={Xiaojun Wu and Junxi Liu and Huanyi Su and Zhouchi Lin and Yiyan Qi and Chengjin Xu and Jiajun Su and Jiajie Zhong and Fuwei Wang and Saizhuo Wang and Fengrui Hua and Jia Li and Jian Guo},
      year={2024},
      eprint={2411.06272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.06272}, 
}