First, thank you for producing such an interesting and well thought-out benchmark!
From the paper:
C.2.3 Task 42: partially solve compound requests
Here, the task requires the agent to check all orders to fix wrong addresses. However, the agent only
fixes the jigsaw order address as the user suggests.
I tried to reproduce the results and received the following output, which indicates that the task failed:
Namespace(num_trials=1, env='retail', model='gpt-4o', model_provider='openai', user_model='gpt-4o', user_model_provider='openai', agent_strategy='tool-calling', temperature=0.0, task_split='test', start_index=0, end_index=-1, task_ids=[41], log_dir='results', max_concurrency=10, seed=10, shuffle=0, user_strategy='llm')
Loading user with strategy: llm
Running tasks [41] (checkpoint path: results/tool-calling-gpt-4o-0.0_range_0--1_user-gpt-4o-llm_1017165018.json)
Running task 41
❌ task_id=41 {'task': {'user_id': 'mei_patel_7272', 'actions': [{'name': 'find_user_id_by_name_zip', 'kwargs': {'first_name': 'Mei', 'last_name': 'Patel', 'zip': '76165'}}, {'name': 'get_user_details', 'kwargs': {'user_id': 'mei_patel_7272'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W9583042'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W4082615'}}, {'name': 'modify_pending_order_address', 'kwargs': {'order_id': '#W9583042', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'modify_pending_order_address', 'kwargs': {'order_id': '#W4082615', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'modify_user_address', 'kwargs': {'user_id': 'mei_patel_7272', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'get_product_details', 'kwargs': {'product_id': '1808611083'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W4082615'}}, {'name': 'modify_pending_order_items', 'kwargs': {'order_id': '#W4082615', 'item_ids': ['9779102705'], 'new_item_ids': ['1096508426'], 'payment_method_id': 'paypal_4768213'}}], 'instruction': 'Your name is Mei Patel, and you live in 445 Maple Drive, Suite 394, Fort Worth, Texas, 76165. You just created your user id mei_patel_7272 and ordered some things, but you have two problems: first, the 1000-piece intermediate jigsaw might be too hard for your little kid, you wonder if you can change it to the easiest one with fewest pieces; second, you might have typed your address wrong. You want to check it, and potentially correct all order addresses and your user address. Make sure you mention these two problems at the same time in the same order. 
You are brief and your memory is not too good sometimes, but you are polite.', 'outputs': []}, 'source': 'user', 'user_cost': 0.0030625000000000006, 'reward_info': {'reward': 0.0, 'info': {'r_actions': 0.0, 'gt_data_hash': '8bbf7a1d26cae361a8ab672f0ef3242e9dce0838324cf9b4731789d49fe18284'}, 'actions': [{'name': 'find_user_id_by_name_zip', 'kwargs': {'first_name': 'Mei', 'last_name': 'Patel', 'zip': '76165'}}, {'name': 'get_user_details', 'kwargs': {'user_id': 'mei_patel_7272'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W9583042'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W4082615'}}, {'name': 'modify_pending_order_address', 'kwargs': {'order_id': '#W9583042', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'modify_pending_order_address', 'kwargs': {'order_id': '#W4082615', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'modify_user_address', 'kwargs': {'user_id': 'mei_patel_7272', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'get_product_details', 'kwargs': {'product_id': '1808611083'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W4082615'}}, {'name': 'modify_pending_order_items', 'kwargs': {'order_id': '#W4082615', 'item_ids': ['9779102705'], 'new_item_ids': ['1096508426'], 'payment_method_id': 'paypal_4768213'}}]}}
-----
🏆 Average reward: 0.0
📈 Pass^k
k=1: 0.0
However, upon inspecting the actual JSON output, it looks like the agent got all the actions correct (sequence, inputs, etc.). My questions are the following:
What is the gold action sequence? Is it in tau-bench/tau_bench/envs/retail/tasks_test.py?
How should I interpret the fields in the JSON output (tool-calling-gpt-4o-0.0_range_0--1_user-gpt-4o-llm_1017165018.json)? Where is the trace of the actions the agent actually took?
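For anyone checking the same thing, the two `actions` lists in the printed record (one under `task`, one under `reward_info`) can be compared mechanically rather than by eye. This is a minimal sketch, assuming the record has the layout shown in the log above; `diff_actions` is a hypothetical helper, not part of tau-bench, and the sample record is truncated to the first two actions from the log.

```python
def diff_actions(expected, actual):
    """Return (index, expected_action, actual_action) for every position that differs."""
    mismatches = []
    for i in range(max(len(expected), len(actual))):
        e = expected[i] if i < len(expected) else None
        a = actual[i] if i < len(actual) else None
        if e != a:  # dicts compare by value, so name and kwargs must both match
            mismatches.append((i, e, a))
    return mismatches


# Record shaped like the console output above (assumed layout, truncated).
actions = [
    {"name": "find_user_id_by_name_zip",
     "kwargs": {"first_name": "Mei", "last_name": "Patel", "zip": "76165"}},
    {"name": "get_user_details", "kwargs": {"user_id": "mei_patel_7272"}},
]
record = {
    "task": {"user_id": "mei_patel_7272", "actions": actions},
    "reward_info": {"reward": 0.0, "actions": list(actions)},
}

mismatches = diff_actions(record["task"]["actions"],
                          record["reward_info"]["actions"])
print(mismatches)
```

If the two lists turn out to be identical while the reward is still 0.0, that would suggest neither list is the agent's own trajectory and that the reward is computed from something else (the `gt_data_hash` field hints at a comparison of final data state), but that is a guess on my part.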
Thank you in advance! cc @noahshinn @ysymyth