CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments
Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personalized service. However, deploying and evaluating these agents is challenging due to the lack of realistic benchmarks that reflect the complexity of real-world CRM tasks. To address this issue, we introduce CRMArena, a novel benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. Following guidance from CRM experts and industry best practices, we designed CRMArena with nine customer service tasks distributed across three personas: service agent, analyst, and manager. The benchmark includes 16 commonly used industrial objects (e.g., account, order, knowledge article, case) with high interconnectivity, along with latent variables (e.g., complaint habits, policy violations) to simulate realistic data distributions. Experimental results reveal that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities. Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments. CRMArena is an open challenge to the community: systems that can reliably complete tasks showcase direct business value in a popular work environment.
Note: This repository is for research purposes only and not for commercial use.
pip install -e .
To access our org, use the following credentials.
SALESFORCE_USERNAME=[email protected]
SALESFORCE_PASSWORD=crmarenatest
SALESFORCE_SECURITY_TOKEN=ugvBSBv0ArI7dayfqUY0wMGu
To access the GUI of our Org, follow the steps below:
- Head to https://login.salesforce.com/.
- Type in the user name and password using the above credentials.
- You can now see the GUI of our Org.
First, store your Salesforce org credentials and your OpenAI / AWS Bedrock / TogetherAI API keys in a .env file:
OPENAI_API_KEY=...
...
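Note that os.getenv only sees variables already present in the process environment, so the contents of .env have to be loaded first. A tool such as python-dotenv can do this; the stdlib-only sketch below illustrates the same idea. `parse_env_text` and `load_env_file` are hypothetical helper names, not part of this repository, and the parser assumes plain KEY=VALUE lines only.

```python
import os


def parse_env_text(text):
    """Parse simple KEY=VALUE lines; skip blanks, '#' comments, and lines without '='."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env"):
    """Copy variables from a .env file into os.environ, keeping any already set."""
    with open(path, encoding="utf-8") as handle:
        for key, value in parse_env_text(handle.read()).items():
            os.environ.setdefault(key, value)
```

Calling `load_env_file()` once at startup, before any credentials are read, is enough for this sketch; existing environment variables take precedence over the file.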
Then, you can use Simple Salesforce to connect to our Org.
from simple_salesforce import Salesforce
import os
sf = Salesforce(username=os.getenv("SALESFORCE_USERNAME"), password=os.getenv("SALESFORCE_PASSWORD"), security_token=os.getenv("SALESFORCE_SECURITY_TOKEN"))
sf.query_all(...)  # pass a SOQL query string here
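A query result from Simple Salesforce is a dictionary with `totalSize`, `done`, and `records` keys, and each record carries an `attributes` metadata entry alongside the requested fields. The sketch below shows one way to drop that metadata before further processing. `strip_attributes` is a hypothetical helper, the sample response is made up for illustration, and only top-level records are handled (nested relationship records would keep their own `attributes`).

```python
def strip_attributes(result):
    """Return the records of a query result without the 'attributes' metadata."""
    rows = []
    for record in result.get("records", []):
        rows.append({k: v for k, v in record.items() if k != "attributes"})
    return rows


# Hypothetical response shaped like a Salesforce query result (values are placeholders).
sample = {
    "totalSize": 1,
    "done": True,
    "records": [
        {
            "attributes": {"type": "Case", "url": "/services/data/..."},
            "Id": "500xx",
            "Subject": "Example",
        },
    ],
}
rows = strip_attributes(sample)
print(rows)
```

With a live connection, the same helper could be applied to the return value of a `query_all` call.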
To run experiments, you need to download the CRMArena queries and schema from Hugging Face:
from datasets import load_dataset
queries = load_dataset("Salesforce/CRMArena", "CRMArena")
schema = load_dataset("Salesforce/CRMArena", "schema")
Please refer to crm_sandbox/data/assets.py for more details.
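When called without a `split` argument, `load_dataset` returns a mapping from split names to datasets, so a quick sanity check after downloading is to count the rows in each split. The sketch below is a minimal illustration; `summarize_splits` is a hypothetical helper, and the stand-in data uses made-up split names rather than the actual CRMArena splits.

```python
def summarize_splits(dataset_dict):
    """Map each split name to its number of rows."""
    return {name: len(split) for name, split in dataset_dict.items()}


# Stand-in for a loaded dataset: plain lists play the role of splits here.
fake_queries = {"split_a": [{"row": 1}, {"row": 2}], "split_b": [{"row": 3}]}
summary = summarize_splits(fake_queries)
print(summary)
```

The same call should work on the `queries` and `schema` objects loaded above, since they expose `items()` and their splits support `len()`.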
Alternatively, we have prepared evaluation scripts. Configure your setup in run_tasks.sh and launch experiments:
bash run_tasks.sh
If you find this work useful, please consider citing:
@misc{huang-etal-2024-crmarena,
title = "CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments",
author = "Huang, Kung-Hsiang and
Prabhakar, Akshara and
Dhawan, Sidharth and
Mao, Yixin and
Wang, Huan and
Savarese, Silvio and
Xiong, Caiming and
Laban, Philippe and
Wu, Chien-Sheng",
year = "2024",
archivePrefix = "arXiv",
eprint={2411.02305},
primaryClass={cs.CL}
}
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.