Confusion around reproducing Task 42 #9

mattzh72 · 2024-10-18T00:09:38Z

First, thank you for producing such an interesting and well-thought-out benchmark!

From the paper:

C.2.3 Task 42: partially solve compound requests
Here, the task requires the agent to check all orders to fix wrong addresses. However, the agent only
fixes the jigsaw order address as the user suggests.

I tried to reproduce the results by running:

python3 run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --user-model gpt-4o --user-model-provider openai --user-strategy llm --max-concurrency 10 --task-ids 41

I received the following output, which indicates that the task failed:

Namespace(num_trials=1, env='retail', model='gpt-4o', model_provider='openai', user_model='gpt-4o', user_model_provider='openai', agent_strategy='tool-calling', temperature=0.0, task_split='test', start_index=0, end_index=-1, task_ids=[41], log_dir='results', max_concurrency=10, seed=10, shuffle=0, user_strategy='llm')
Loading user with strategy: llm
Running tasks [41] (checkpoint path: results/tool-calling-gpt-4o-0.0_range_0--1_user-gpt-4o-llm_1017165018.json)
Running task 41
❌ task_id=41 {'task': {'user_id': 'mei_patel_7272', 'actions': [{'name': 'find_user_id_by_name_zip', 'kwargs': {'first_name': 'Mei', 'last_name': 'Patel', 'zip': '76165'}}, {'name': 'get_user_details', 'kwargs': {'user_id': 'mei_patel_7272'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W9583042'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W4082615'}}, {'name': 'modify_pending_order_address', 'kwargs': {'order_id': '#W9583042', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'modify_pending_order_address', 'kwargs': {'order_id': '#W4082615', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'modify_user_address', 'kwargs': {'user_id': 'mei_patel_7272', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'get_product_details', 'kwargs': {'product_id': '1808611083'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W4082615'}}, {'name': 'modify_pending_order_items', 'kwargs': {'order_id': '#W4082615', 'item_ids': ['9779102705'], 'new_item_ids': ['1096508426'], 'payment_method_id': 'paypal_4768213'}}], 'instruction': 'Your name is Mei Patel, and you live in 445 Maple Drive, Suite 394, Fort Worth, Texas, 76165. You just created your user id mei_patel_7272 and ordered some things, but you have two problems: first, the 1000-piece intermediate jigsaw might be too hard for your little kid, you wonder if you can change it to the easiest one with fewest pieces; second, you might have typed your address wrong. You want to check it, and potentially correct all order addresses and your user address. Make sure you mention these two problems at the same time in the same order. 
You are brief and your memory is not too good sometimes, but you are polite.', 'outputs': []}, 'source': 'user', 'user_cost': 0.0030625000000000006, 'reward_info': {'reward': 0.0, 'info': {'r_actions': 0.0, 'gt_data_hash': '8bbf7a1d26cae361a8ab672f0ef3242e9dce0838324cf9b4731789d49fe18284'}, 'actions': [{'name': 'find_user_id_by_name_zip', 'kwargs': {'first_name': 'Mei', 'last_name': 'Patel', 'zip': '76165'}}, {'name': 'get_user_details', 'kwargs': {'user_id': 'mei_patel_7272'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W9583042'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W4082615'}}, {'name': 'modify_pending_order_address', 'kwargs': {'order_id': '#W9583042', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'modify_pending_order_address', 'kwargs': {'order_id': '#W4082615', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'modify_user_address', 'kwargs': {'user_id': 'mei_patel_7272', 'address1': '445 Maple Drive', 'address2': 'Suite 394', 'city': 'Fort Worth', 'state': 'TX', 'country': 'USA', 'zip': '76165'}}, {'name': 'get_product_details', 'kwargs': {'product_id': '1808611083'}}, {'name': 'get_order_details', 'kwargs': {'order_id': '#W4082615'}}, {'name': 'modify_pending_order_items', 'kwargs': {'order_id': '#W4082615', 'item_ids': ['9779102705'], 'new_item_ids': ['1096508426'], 'payment_method_id': 'paypal_4768213'}}]}}
-----
🏆 Average reward: 0.0
📈 Pass^k
  k=1: 0.0

However, on inspecting the actual JSON output, the agent appears to have gotten all the actions right (sequence, inputs, etc.). My questions are the following:

What is the gold action sequence? Is it in tau-bench/tau_bench/envs/retail/tasks_test.py?
How should I interpret the fields in the JSON output? Where is the trace of the actions the agent actually took?
More generally, why did this fail?
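
One way to check the observation above is to diff the two `actions` lists in the printed record. This is a minimal sketch, assuming the record has the shape shown in the output (`task.actions` and `reward_info.actions`). The helper name `first_action_mismatch` is hypothetical and not part of tau-bench, and the sample record is abbreviated to two of the ten actions.

```python
from typing import Any, Optional


def first_action_mismatch(record: dict[str, Any]) -> Optional[int]:
    """Return the index of the first differing action, or None if both lists match.

    Hypothetical helper: it only assumes the record layout printed above.
    """
    expected = record["task"]["actions"]
    reported = record["reward_info"]["actions"]
    for i, (e, r) in enumerate(zip(expected, reported)):
        if e != r:
            return i
    # If one list is a strict prefix of the other, the first difference
    # is at the end of the shorter list.
    if len(expected) != len(reported):
        return min(len(expected), len(reported))
    return None


# Abbreviated copy of the printed record (two of the ten actions kept).
actions = [
    {"name": "get_user_details", "kwargs": {"user_id": "mei_patel_7272"}},
    {"name": "get_order_details", "kwargs": {"order_id": "#W9583042"}},
]
record = {
    "task": {"user_id": "mei_patel_7272", "actions": actions},
    "reward_info": {"reward": 0.0, "actions": [dict(a) for a in actions]},
}

mismatch = first_action_mismatch(record)
print("lists identical" if mismatch is None else f"first difference at index {mismatch}")
```

In the output above the two lists are identical. The output does not say whether `reward_info.actions` records the agent's own calls or repeats the task's expected actions, so a match here does not by itself establish that the agent followed the gold sequence.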

tool-calling-gpt-4o-0.0_range_0--1_user-gpt-4o-llm_1017165018.json

Thank you in advance! cc @noahshinn @ysymyth

Fixed retail task sierra-research#9, sierra-research#16, sierra-research#20, sierra-research#35

mattzh72 changed the title ~~Confusion around Task 42~~ Confusion around reproducing Task 42 Oct 18, 2024

Ephibbs pushed a commit to Ephibbs/big-tau that referenced this issue Dec 4, 2024

Fixed retail task sierra-research#9, sierra-research#16, sierra-research#20, sierra-research#35

60f0b28

Ephibbs pushed a commit to Ephibbs/big-tau that referenced this issue Dec 4, 2024

Merge pull request sierra-research#4 from dayyyyyyyyyy/main

23d2a48

Fixed retail task sierra-research#9, sierra-research#16, sierra-research#20, sierra-research#35
