From 3bc9144c7803e0d427911e9d8eed7dc8c2e5aed0 Mon Sep 17 00:00:00 2001
From: Shivank Garg <128126577+shivank21@users.noreply.github.com>
Date: Sat, 9 Nov 2024 14:02:17 +0530
Subject: [PATCH] :zap: Add Summary for ICAL
---
images/ical1.jpg | Bin 0 -> 190 bytes
images/ical2.jpg | Bin 0 -> 76 bytes
summaries/ICAL.md | 84 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 84 insertions(+)
create mode 100644 images/ical1.jpg
create mode 100644 images/ical2.jpg
create mode 100644 summaries/ICAL.md
diff --git a/images/ical1.jpg b/images/ical1.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..f98f546cbc23df460562a9370cedd926613dc997
GIT binary patch
literal 190
zcmaFAfA4!3Vi53h^K@fi1OgpK9R>!71XGHqvvYhvkiVO!k1J4;fq^L`Cpov6#ejh!
zH?O29$lXaHA~H&W9jKgvg+YWtkU@ZfF)_Kgz%d}ehXE&mn9cw+4{Vbz$V{Ayu}g3R
E0H7c!e*gdg
literal 0
HcmV?d00001
diff --git a/images/ical2.jpg b/images/ical2.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..7eddb854d49b0aff9a0311e4f0de0fbbf2f4b89e
GIT binary patch
literal 76
zcmaFAfA4!3Vi53h^K@fi1OgpK9R>!71cNgJGZP4aNM;~lVPgRS4mLJcHck#+E?zE9
K4o*&9E*=2+W)U_3
literal 0
HcmV?d00001
diff --git a/summaries/ICAL.md b/summaries/ICAL.md
new file mode 100644
index 0000000..7336091
--- /dev/null
+++ b/summaries/ICAL.md
@@ -0,0 +1,84 @@
+# VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs
+
+Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki, **NeurIPS 2024**
+
+## Summary
+
+The paper introduces In-Context Abstraction Learning (ICAL), a method that improves how large language and vision-language models (LLMs and VLMs) learn from a handful of demonstrations. Whereas in-context learning traditionally relies on high-quality expert demonstrations, ICAL enables these models to learn from sub-optimal demonstrations by generating their own in-context examples enriched with multimodal abstractions.
+
+## Contributions
+
+1. **In-Context Abstraction Learning (ICAL):** A novel method that enables LLMs and VLMs to generate their own examples from generic, sub-optimal demonstrations.
+
+2. **Multimodal Abstractions:** ICAL corrects errors in the demonstrations and enriches them with four types of cognitive abstractions, detailed in the pipeline below.
+
+
+## Method and Pipeline
+
+1. ICAL starts with a noisy trajectory, a sequence of observations and actions collected from non-expert humans or generated by an agent in a new task domain $D$, denoted $\xi_{noisy} = \{ o_0, a_0, \dots, o_T, a_T \}$.
+
+2. **Abstraction Phase:** The VLM identifies and corrects errors in the trajectory and enriches it with four types of language abstractions:
+
+- **Task and Causal Abstractions:** Explain the fundamental principles or actions needed to achieve a goal. Example: "Since the box is already open, there is no need to close it after placing the watches inside."
+
+- **State Changes:** Describe how actions affect the form and conditions of objects in the scene.
+
+- **Task Decomposition and Subgoals:** Break down a complex task into intermediate steps.
+
+- **Task Construals:** Highlight essential visual details within a task.
+
+ Mathematically,
+
+ $F_{abstract} : (\xi_{noisy}, I, \{e_1, \dots, e_k\}) \rightarrow (\xi_{optimized}, L)$
+
+ Where:
+- $\xi_{noisy}$: Noisy Trajectory
+- $I$: Task Instruction
+- $\{e_1, \dots, e_k\}$: Top-k previously successful in-context examples
+- $\xi_{optimized}$: Optimized Trajectory
+- $L$: Language Abstractions
+
+3. The optimized trajectory is executed in the environment, and a human observer provides natural language feedback when the agent fails. The VLM then uses this feedback to revise the trajectory and the abstractions.
+
+ Mathematically,
+
+   $\Xi_{update} : (\xi_{optimized}, H(a_t, o_t), L, I, \{e_1, \dots, e_k\}) \rightarrow (\xi'_{optimized}, L')$
+
+ Where:
+- $\Xi_{update}$: Update function
+- $\xi_{optimized}$: Current trajectory
+- $H(a_t, o_t)$: Human feedback on action $a_t$ at observation $o_t$
+- $L$: Current annotations
+- $I$: Task Instruction
+- $\{e_1, \dots, e_k\}$: Top-k retrieved examples
+- $\xi'_{optimized}$: Revised Trajectory
+- $L'$: Updated annotations
+
+4. If the execution is successful, the revised trajectory and abstractions are added to the agent's memory (a minimal code sketch of this learning loop appears after the pipeline).
+
+
+
+5. **Retrieval-Augmented Generation at Deployment:** When presented with new instructions, the agent retrieves similar examples from its memory and uses them as context to generate actions (see the retrieval sketch after the pipeline).
+
+ Mathematically,
+
+ $s = \lambda_I \cdot s_I + \lambda_{textual} \cdot s_{textual} + \lambda_{visual} \cdot s_{visual}$
+
+ Where:
+- $s$: Aggregated similarity score
+- $s_I$, $s_{textual}$, $s_{visual}$: Similarity scores for instruction, textual state, and visual state respectively, computed via cosine similarity
+- $\lambda_I$, $\lambda_{textual}$, and $\lambda_{visual}$: Weighting hyperparameters
+
+
+
+## Results
+
+1. On the TEACh household instruction-following benchmark (validation unseen split), agents using ICAL examples significantly improve on the state of the art in goal-condition success, outperforming agents that use raw visual demonstrations as in-context examples without abstraction learning.
+
+2. ICAL outperforms the previous state of the art on the VisualWebArena benchmark.
+
+3. ICAL demonstrates superior performance on Ego4D action anticipation compared to a hand-written few-shot GPT-4V baseline that uses chain-of-thought prompting.
+
+## Our Two Cents
+
+The paper presents a promising approach to enhancing the learning capabilities of LLMs and VLMs from sub-optimal demonstrations. The idea of generating multimodal abstractions is highly applicable to real-world settings. However, a limitation of the approach is its reliance on a fixed action API, which may restrict the agent's adaptability. A valuable direction for future work would be integrating richer forms of human feedback, such as demonstrations or direct corrections to the abstractions, to further improve the learning process.