Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
# VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs | ||
|
||
Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki, **NeurIPS 2024** | ||
|
||
## Summary | ||
|
||
The paper introduces In-Context Abstraction Learning (ICAL), a new method to improve the ability of large language and vision-language models (LLMs and VLMs) to learn from a few demonstrations. Traditionally, LLMs and VLMs rely on high-quality demonstrations for in-context learning. The paper proposes a method for these models to learn from sub-optimal demonstrations by generating their own in-context examples with multi-modal abstractions. | ||
|
||
## Contributions | ||
|
||
1. **In-Context Abstraction Learning (ICAL):** A novel method that enables LLMs and VLMs to generate their own examples from generic, sub-optimal demonstrations. | ||
|
||
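2. **Multimodal Abstractions:** ICAL corrects errors in the demonstrations and annotates them with four types of cognitive abstractions: task and causal relationships, state changes, task decomposition into subgoals, and task construals. | ||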
|
||
|
||
## Method and Pipeline | ||
|
||
1. ICAL starts with a noisy trajectory in a new task domain $D$: a sequence of observations and actions collected from non-expert humans or generated by an agent, denoted $\xi_{noisy} = \{ o_0, a_0, \dots, o_T, a_T \}$. | ||
|
||
2. **Abstraction Phase:** The VLM identifies and corrects errors in the trajectory and enriches it with four types of language abstractions: | ||
|
||
- **Task and Causal Abstractions:** Explain the fundamental principles or actions needed to achieve a goal. Example: "Since the box is already open, there is no need to close it after placing the watches inside." | ||
|
||
- **State Changes:** Describe how actions affect the form and conditions of objects in the scene. | ||
|
||
- **Task Decomposition and Subgoals:** Break down a complex task into intermediate steps. | ||
|
||
- **Task Construals:** Highlight the visual details that are essential to the task. | ||
|
||
   Mathematically, | ||
|
||
   $F_{abstract} : (\xi_{noisy}, I, \{e_1, \dots, e_k\}) \rightarrow (\xi_{optimized}, L)$ | ||
|
||
   Where: | ||
- $\xi_{noisy}$: Noisy Trajectory | ||
- $I$: Task Instruction | ||
- $\{e_1, \dots, e_k\}$: Top-k previously successful in-context examples | ||
- $\xi_{optimized}$: Optimized Trajectory | ||
- $L$: Language Abstractions | ||
|
||
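The signature of $F_{abstract}$ can be made concrete with a small Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `Step`, `Abstractions`, `Example`, `build_prompt`, and `f_abstract` are hypothetical, observations are reduced to strings, and the VLM is modeled as any callable that maps a prompt to a corrected trajectory plus abstractions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    observation: str  # stand-in for an image or a textual state description
    action: str


@dataclass
class Abstractions:
    # One list per abstraction type named in the summary (L in the formula).
    causal: List[str] = field(default_factory=list)
    state_changes: List[str] = field(default_factory=list)
    subgoals: List[str] = field(default_factory=list)
    construals: List[str] = field(default_factory=list)


@dataclass
class Example:
    instruction: str
    trajectory: List[Step]
    abstractions: Abstractions


# Hypothetical VLM interface: prompt in, (optimized trajectory, abstractions) out.
VLM = Callable[[str], Tuple[List[Step], Abstractions]]


def build_prompt(noisy: List[Step], instruction: str, examples: List[Example]) -> str:
    """Serialize the inputs of F_abstract into a single text prompt."""
    shown = "\n".join(f"- {ex.instruction}" for ex in examples)
    steps = "\n".join(f"{s.observation} -> {s.action}" for s in noisy)
    return f"Instruction: {instruction}\nPrior examples:\n{shown}\nTrajectory:\n{steps}"


def f_abstract(vlm: VLM, noisy: List[Step], instruction: str,
               examples: List[Example]) -> Example:
    """(xi_noisy, I, {e_1..e_k}) -> (xi_optimized, L), bundled as one Example."""
    optimized, abstractions = vlm(build_prompt(noisy, instruction, examples))
    return Example(instruction, optimized, abstractions)
```

Bundling the optimized trajectory and its abstractions in one record mirrors how they are later stored and retrieved together as a single in-context example.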
3. The optimized trajectory is executed in the environment, and a human observer provides natural language feedback when the agent fails. The VLM then uses this feedback to revise the trajectory and the abstractions. | ||
|
||
   Mathematically, | ||
|
||
   $\Xi_{update}(\xi_{optimized}, H(a_t, o_t), L, I, \{e_1, \dots, e_k\}) \rightarrow \xi'_{optimized}, L'$ | ||
|
||
   Where: | ||
- $\Xi_{update}$: Update function | ||
- $\xi_{optimized}$: Current trajectory | ||
- $H(a_t, o_t)$: Human feedback on action $a_t$ at observation $o_t$ | ||
- $L$: Current annotations | ||
- $I$: Task Instruction | ||
- $\{e_1, \dots, e_k\}$: Top-k retrieved examples | ||
- $\xi'_{optimized}$: Revised Trajectory | ||
- $L'$: Updated annotations | ||
|
||
4. If the execution is successful, the revised trajectory and abstractions are added to the agent's memory. | ||
|
||
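Steps 3 and 4 together form an execute, revise, store loop, which might be sketched as follows. Everything here is an assumption made for illustration: `learn_example`, `execute`, `revise`, and the `max_attempts` budget are hypothetical names, trajectories are reduced to lists of action strings, and `execute` returns `None` on success or a feedback string standing in for $H(a_t, o_t)$ on failure.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = List[str]     # actions only, for brevity
Abstractions = List[str]   # language annotations L
Memory = List[Tuple[Trajectory, Abstractions]]


def learn_example(
    trajectory: Trajectory,
    abstractions: Abstractions,
    execute: Callable[[Trajectory], Optional[str]],
    revise: Callable[[Trajectory, Abstractions, str], Tuple[Trajectory, Abstractions]],
    memory: Memory,
    max_attempts: int = 3,  # assumed budget; not taken from the paper
) -> bool:
    """Run the trajectory; on failure, revise it using feedback and try again."""
    for _ in range(max_attempts):
        feedback = execute(trajectory)  # None signals a successful execution
        if feedback is None:
            # Step 4: only successful examples are added to memory.
            memory.append((trajectory, abstractions))
            return True
        # Step 3: the update function revises both trajectory and abstractions.
        trajectory, abstractions = revise(trajectory, abstractions, feedback)
    return False
```

In this sketch, failed examples are simply discarded once the budget runs out, so memory only ever holds trajectories that were verified in the environment.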
<img src='../images/ical1.jpg'> | ||
|
||
5. **Retrieval Augmented Generation at Deployment:** When presented with new instructions, the agent retrieves similar examples from its memory and uses them as context to generate actions. | ||
|
||
  Mathematically, | ||
|
||
  $s = \lambda_I \cdot s_I + \lambda_{textual} \cdot s_{textual} + \lambda_{visual} \cdot s_{visual}$ | ||
|
||
  Where: | ||
- $s$: Aggregated similarity score | ||
- $s_I$, $s_{textual}$, $s_{visual}$: Similarity scores for instruction, textual state, and visual state respectively, computed via cosine similarity | ||
- $\lambda_I$, $\lambda_{textual}$, and $\lambda_{visual}$: Weighting hyperparameters | ||
|
||
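The aggregated score can be computed directly from the formula above. The sketch below is a minimal illustration, assuming each query and each stored example is a dictionary of embedding vectors keyed by `"instruction"`, `"textual"`, and `"visual"`; the function names, the keys, and the weight values used in the usage example are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]
Embeddings = Dict[str, Vector]  # keys: "instruction", "textual", "visual"


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score(query: Embeddings, example: Embeddings, weights: Dict[str, float]) -> float:
    """s = lambda_I * s_I + lambda_textual * s_textual + lambda_visual * s_visual."""
    return sum(w * cosine(query[key], example[key]) for key, w in weights.items())


def retrieve(query: Embeddings, memory: List[Embeddings],
             weights: Dict[str, float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring examples in memory."""
    ranked = sorted(range(len(memory)),
                    key=lambda i: score(query, memory[i], weights),
                    reverse=True)
    return ranked[:k]
```

For example, with illustrative weights `{"instruction": 0.5, "textual": 0.25, "visual": 0.25}`, an example that matches the query on textual and visual state but has an orthogonal instruction embedding scores 0.5, while an exact match on all three scores 1.0.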
<img src='../images/ical2.jpg'> | ||
|
||
## Results | ||
|
||
1. On the TEACh household instruction-following benchmark (validation-unseen split), ICAL examples significantly improve on the state of the art in goal-condition success, outperforming agents that use the raw visual demonstrations as in-context examples without abstraction learning. | ||
|
||
2. ICAL outperforms the previous state of the art on the VisualWebArena benchmark. | ||
|
||
3. On Ego4D action anticipation, ICAL outperforms GPT-4V prompted with hand-written few-shot examples and chain of thought. | ||
|
||
## Our Two Cents | ||
|
||
The paper introduces a promising approach for enhancing the ability of LLMs and VLMs to learn from sub-optimal demonstrations. The concept of generating multimodal abstractions is highly applicable to real-world situations. However, the approach relies on a fixed action API, which may restrict the adaptability of the agent. A valuable direction for future research would be to integrate richer forms of human feedback, such as demonstrations or direct corrections to the abstractions, to further improve the learning process. |