Confusion around reproducing Task 42 #9

Open
mattzh72 opened this issue Oct 18, 2024 · 0 comments

First, thank you for producing such an interesting and well thought-out benchmark!

From the paper:

C.2.3 Task 42: partially solve compound requests
Here, the task requires the agent to check all orders to fix wrong addresses. However, the agent only
fixes the jigsaw order address as the user suggests.

I tried to reproduce the results by running:

python3 run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --user-model gpt-4o --user-model-provider openai --user-strategy llm --max-concurrency 10 --task-ids 41

And received the following output indicating it failed:

Namespace(num_trials=1, env='retail', model='gpt-4o', model_provider='openai', user_model='gpt-4o', user_model_provider='openai', agent_strategy='tool-calling', temperature=0.0, task_split='test', start_index=0, end_index=-1, task_ids=[41], log_dir='results', max_concurrency=10, seed=10, shuffle=0, user_strategy='llm')
Loading user with strategy: llm
Running tasks [41] (checkpoint path: results/tool-calling-gpt-4o-0.0_range_0--1_user-gpt-4o-llm_1017165018.json)
Running task 41
❌ task_id=41 {'task': {'user_id': 'mei_patel_7272', 'actions': [{'name': 'find_user_id_by_name_zip', 'kwargs': {'first_name': 'Mei', 'last_name': 'Patel', 'zip': '76165'}}, {'name': 'get_user_details', 'kwargs': {'user_id': 'mei_patel_7272'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W9583042'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W4082615'}}, {'name': 'modify_pending_order_address', 'kwargs': {'order_id': '#W9583042', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'modify_pending_order_address', 'kwargs': {'order_id': '#W4082615', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'modify_user_address', 'kwargs': {'user_id': 'mei_patel_7272', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'get_product_details', 'kwargs': {'product_id': '1808611083'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W4082615'}}, {'name': 'modify_pending_order_items', 'kwargs': {'order_id': '#W4082615', 'item_ids': ['9779102705'], 'new_item_ids': ['1096508426'], 'payment_method_id': 'paypal_4768213'}}], 'instruction': 'Your name is Mei Patel, and you live in 445 Maple Drive, Suite 394, Fort Worth, Texas, 76165. You just created your user id mei_patel_7272 and ordered some things, but you have two problems: first, the 1000-piece intermediate jigsaw might be too hard for your little kid, you wonder if you can change it to the easiest one with fewest pieces; second, you might have typed your address wrong. You want to check it, and potentially correct all order addresses and your user address. Make sure you mention these two problems at the same time in the same order. 
You are brief and your memory is not too good sometimes, but you are polite.', 'outputs': []}, 'source': 'user', 'user_cost': 0.0030625000000000006, 'reward_info': {'reward': 0.0, 'info': {'r_actions': 0.0, 'gt_data_hash': '8bbf7a1d26cae361a8ab672f0ef3242e9dce0838324cf9b4731789d49fe18284'}, 'actions': [{'name': 'find_user_id_by_name_zip', 'kwargs': {'first_name': 'Mei', 'last_name': 'Patel', 'zip': '76165'}}, {'name': 'get_user_details', 'kwargs': {'user_id': 'mei_patel_7272'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W9583042'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W4082615'}}, {'name': 'modify_pending_order_address', 'kwargs': {'order_id': '#W9583042', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'modify_pending_order_address', 'kwargs': {'order_id': '#W4082615', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'modify_user_address', 'kwargs': {'user_id': 'mei_patel_7272', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'get_product_details', 'kwargs': {'product_id': '1808611083'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W4082615'}}, {'name': 'modify_pending_order_items', 'kwargs': {'order_id': '#W4082615', 'item_ids': ['9779102705'], 'new_item_ids': ['1096508426'], 'payment_method_id': 'paypal_4768213'}}]}}
-----
🏆 Average reward: 0.0
📈 Pass^k
  k=1: 0.0

However, when I inspect the actual JSON output, the agent appears to have gotten every action right (sequence, inputs, etc.). My questions are the following:

  1. What is the gold action sequence? Is it in tau-bench/tau_bench/envs/retail/tasks_test.py?
  2. How should I interpret the fields in the JSON output? Where is the trace of the actions the agent actually took?
  3. More generally, why did this task fail?
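
For reference, here is a minimal sketch of how the comparison described above could be done programmatically rather than by eye. It assumes the two lists being compared are the ones printed under `task['actions']` and `reward_info['actions']` in the output above. The helper name `first_mismatch` is hypothetical and not part of tau-bench, and the sample data is just the first two actions copied from the log.

```python
from typing import Any, Dict, List, Optional

Action = Dict[str, Any]


def first_mismatch(expected: List[Action], actual: List[Action]) -> Optional[int]:
    """Return the index of the first differing action, or None if the lists match.

    Two actions match when both the tool name and the keyword arguments are equal.
    """
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp["name"] != act["name"] or exp["kwargs"] != act["kwargs"]:
            return i
    # One list is a strict prefix of the other: the mismatch is at the shorter length.
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


# First two actions copied from the printed 'task' entry (illustrative subset).
task_actions = [
    {"name": "find_user_id_by_name_zip",
     "kwargs": {"first_name": "Mei", "last_name": "Patel", "zip": "76165"}},
    {"name": "get_user_details", "kwargs": {"user_id": "mei_patel_7272"}},
]

# The same two actions as printed under 'reward_info' (illustrative subset).
reward_actions = [
    {"name": "find_user_id_by_name_zip",
     "kwargs": {"first_name": "Mei", "last_name": "Patel", "zip": "76165"}},
    {"name": "get_user_details", "kwargs": {"user_id": "mei_patel_7272"}},
]

print(first_mismatch(task_actions, reward_actions))  # prints None when the lists match
```

Running this on the full lists from the output would confirm whether the two printed sequences are identical element by element, which is what they look like on inspection.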

tool-calling-gpt-4o-0.0_range_0--1_user-gpt-4o-llm_1017165018.json

Thank you in advance! cc @noahshinn @ysymyth

mattzh72 changed the title from "Confusion around Task 42" to "Confusion around reproducing Task 42" on Oct 18, 2024