Prompt format for multi-step set up #11

Mayer123 · 2025-01-23T07:38:14Z

Hi there,

Congratulations on the great work!
How should one format the prompt for agent evaluation, i.e. when there are multiple turns of user-provided observations and agent actions?
I tried the format below on a few OSWorld tasks, but the results don't look good. PROMPT_FOR_COMPUTER is just the prompt provided in the README. So basically I use only the single most recent screenshot and condense all previous actions into the same user turn.

previous_actions = "\n".join([f"Step {i+1}: {action}" for i, action in enumerate(self.actions)]) if self.actions else "None"
messages = []
messages.append({
    "role": "system",
    "content": [{"type": "text", "text": "You are a helpful assistant."}]
})
messages.append({
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": PROMPT_FOR_COMPUTER + f"{instruction}\nPrevious Actions:\n{previous_actions}"
        },
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(obs['screenshot'])}"}
        }
    ],
})

Could you please share some insights here? Thank you!

pooruss · 2025-01-23T10:17:47Z

Hi! Here is pseudocode for the multi-step prompt logic:

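# To predict the third action: interleave each earlier screenshot with the
# action taken on it, all inside a single user turn.
# The screenshot_* values are placeholders, assumed here to be image URLs
# (e.g. base64 data URLs) wrapped in the OpenAI-style {"url": ...} form.
messages = []
messages.append({
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": PROMPT_FOR_COMPUTER + f"{instruction}"
        },
        {
            "type": "image_url",
            "image_url": {"url": screenshot_from_init}
        },
        {
            "type": "text",
            "text": previous_actions[0]
        },
        {
            "type": "image_url",
            "image_url": {"url": screenshot_from_state_0}
        },
        {
            "type": "text",
            "text": previous_actions[1]
        },
        {
            "type": "image_url",
            "image_url": {"url": screenshot_from_state_1}
        }
    ],
})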

Note that we apply the 'history 5' logic for multi-step online tasks, as discussed in the report.
We will also share our inference code later. Stay tuned!
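
Until the official inference code is available, here is a minimal sketch of how the interleaved format could be built with a sliding window. It rests on one assumption about 'history 5' that the thread does not confirm: keep at most the five most recent screenshots, plus the actions taken between them. The names build_user_message, to_data_url, and max_images are hypothetical and do not come from the UI-TARS repository.

```python
import base64


def to_data_url(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a base64 data URL."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")


def build_user_message(prompt, instruction, screenshots, actions, max_images=5):
    """Build one user turn that interleaves screenshots and past actions.

    screenshots[i] is the screen on which actions[i] was taken, so there is
    always exactly one more screenshot than there are actions.
    Assumption: 'history 5' means keeping the last `max_images` screenshots.
    """
    if max_images < 1:
        raise ValueError("max_images must be at least 1")
    if len(screenshots) != len(actions) + 1:
        raise ValueError("expected exactly one more screenshot than actions")

    # Drop the oldest screenshots and the actions that were taken on them.
    start = max(0, len(screenshots) - max_images)
    kept_shots = screenshots[start:]
    kept_actions = actions[start:]

    def image_part(png_bytes):
        return {"type": "image_url", "image_url": {"url": to_data_url(png_bytes)}}

    content = [{"type": "text", "text": prompt + instruction}]
    content.append(image_part(kept_shots[0]))
    for action, shot in zip(kept_actions, kept_shots[1:]):
        content.append({"type": "text", "text": action})
        content.append(image_part(shot))
    return {"role": "user", "content": content}
```

With three screenshots and two actions this produces the same layout as the pseudocode above: prompt, initial screenshot, action, screenshot, action, screenshot.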

llajan · 2025-01-23T19:13:37Z

Congrats on the great work and thanks for the comments.
When trying the prompt format above, the 72B DPO model complains that "More than 1 image is unsupported". Could you kindly comment on this?

korbinian-hoermann · 2025-01-26T17:56:55Z

Hi @llajan

Did you try:

python -m vllm.entrypoints.openai.api_server --served-model-name ui-tars --model <path to your model> --limit-mm-per-prompt image=5 -tp <tp>

from:
https://github.com/bytedance/UI-TARS?tab=readme-ov-file#start-an-openai-api-service
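
If it helps, a small client-side check can catch an oversized prompt before it reaches the server. This is only a sketch: count_images and check_image_limit are hypothetical helpers, and the default limit of 5 simply mirrors the image=5 value in the command above.

```python
def count_images(messages):
    """Count image_url parts across all messages in an OpenAI-style chat list."""
    total = 0
    for message in messages:
        content = message.get("content")
        # Plain-string content carries no images; only list content can.
        if isinstance(content, list):
            total += sum(1 for part in content if part.get("type") == "image_url")
    return total


def check_image_limit(messages, limit=5):
    """Raise before sending if the prompt holds more images than the limit."""
    n = count_images(messages)
    if n > limit:
        raise ValueError(f"{n} images in prompt, but the configured limit is {limit}")
    return n
```

Calling check_image_limit(messages) right before the request makes the failure explicit on the client instead of surfacing as a server error.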

llajan · 2025-01-27T14:42:43Z

That seems to do the job. Thank you!

korbinian-hoermann mentioned this issue Jan 29, 2025

Clarification on long-term memory #28

Open
