Automated Prompt Engineering #82

Open · wants to merge 21 commits into base: main
29 changes: 29 additions & 0 deletions bibliography.bib
@@ -210,6 +210,35 @@ @misc{zhou2022large
primaryClass={cs.LG}
}

@misc{zhang2022tempera,
title={TEMPERA: Test-Time Prompting via Reinforcement Learning},
author={Tianjun Zhang and Xuezhi Wang and Denny Zhou and Dale Schuurmans and Joseph E. Gonzalez},
year={2022},
eprint={2211.11890},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{deng2022rlprompt,
title={RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning},
author={Mingkai Deng and Jianyu Wang and Cheng-Ping Hsieh and Yihan Wang and Han Guo and Tianmin Shu and Meng Song and Eric P. Xing and Zhiting Hu},
year={2022},
eprint={2205.12548},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{guo2021efficient,
title={Efficient (Soft) Q-Learning for Text Generation with Limited Good Data},
author={Han Guo and Bowen Tan and Zhengzhong Liu and Eric P. Xing and Zhiting Hu},
year={2021},
eprint={2106.07704},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

% Models

% Language Model Guides

@book{jurafsky2009,
8 changes: 8 additions & 0 deletions docs/automated_pe/_category_.json
@@ -0,0 +1,8 @@
{
"label": "⚙️ Automated Prompting",
"position": 70,
"link": {
"type": "generated-index",
"description": "Methods that automate prompt engineering"
}
}
47 changes: 47 additions & 0 deletions docs/automated_pe/ape.md
@@ -0,0 +1,47 @@
---
sidebar_position: 1
---

# 🟢 APE

Automatic Prompt Engineer (APE)(@zhou2022large) is an approach to automating the generation and
selection of prompts. The basic idea of APE is to give an LLM a prompt containing
a few exemplars and ask it to generate a prompt that could have produced those exemplars.

## Example

For example, if we give the LLM the following prompt:

```text
Is a banana a fruit?
Yes
Is a tomato a fruit?
No
Is a fish a fruit?
No

What would be a good prompt to generate an answer to the above questions?
```

The LLM might then generate a prompt like the highlighted line in the following example (shown here with a simpler exemplar format):

```text
banana
Yes

tomato
No

fish
No

watermelon
Yes

What would be a good prompt to generate an answer to the above questions?
// highlight-start
Is the following item a fruit:
// highlight-end
```
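
To make the generate-and-score idea concrete, here is a minimal sketch of an APE-style search loop. The `generate` function is a hypothetical placeholder for any LLM completion call; candidate prompts are scored by how often they reproduce the exemplar answers.

```python
# Minimal APE-style sketch; `generate` is a placeholder for any LLM completion call.
exemplars = [("banana", "Yes"), ("tomato", "No"), ("fish", "No")]

def generate(prompt: str) -> str:
    """Placeholder for an LLM API call (plug in your own client here)."""
    raise NotImplementedError

def propose_prompts(n_candidates: int = 5) -> list[str]:
    """Ask the LLM to propose instructions that explain the exemplars."""
    demo = "\n".join(f"{x}\n{y}" for x, y in exemplars)
    meta_prompt = (
        demo + "\n\nWhat would be a good prompt to generate an answer "
        "to the above questions?"
    )
    return [generate(meta_prompt) for _ in range(n_candidates)]

def score(prompt: str) -> float:
    """Score a candidate prompt by how many exemplars it answers correctly."""
    hits = sum(generate(f"{prompt} {x}").strip() == y for x, y in exemplars)
    return hits / len(exemplars)

# APE keeps the highest-scoring candidate (uncomment once `generate` is implemented):
# best_prompt = max(propose_prompts(), key=score)
```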

## Notes

Another simple automated prompt engineering strategy is to give GPT-3 your prompt and ask it to improve the prompt.
File renamed without changes.
7 changes: 7 additions & 0 deletions docs/automated_pe/more.md
@@ -0,0 +1,7 @@
---
sidebar_position: 200
---

# More

Other methods exist, such as AutoPrompt(@shin2020autoprompt), which uses gradient-based search to build prompts for masked language models (MLMs).
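
As a rough sketch of the gradient-based search idea (a HotFlip-style first-order approximation, not AutoPrompt's exact procedure): given the gradient of the loss with respect to a trigger token's embedding, candidate replacement tokens can be ranked by how much they are predicted to lower the loss.

```python
import torch

def top_candidate_tokens(embedding_matrix: torch.Tensor,
                         trigger_grad: torch.Tensor,
                         k: int = 5) -> torch.Tensor:
    """First-order estimate of loss change for swapping in each vocabulary token.

    embedding_matrix: (vocab, dim) token embeddings
    trigger_grad:     (dim,) gradient of the loss w.r.t. one trigger token's embedding
    Returns the k token ids predicted to decrease the loss the most.
    """
    predicted_change = embedding_matrix @ trigger_grad   # (vocab,)
    return torch.topk(-predicted_change, k).indices

# Toy usage with random tensors standing in for a real MLM's embeddings and gradients.
E = torch.randn(1000, 64)
g = torch.randn(64)
print(top_candidate_tokens(E, g))
```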
7 changes: 7 additions & 0 deletions docs/automated_pe/overview.md
@@ -0,0 +1,7 @@
---
sidebar_position: 0
---

# Overview

Can prompt engineering really be automated? Sometimes.
68 changes: 68 additions & 0 deletions docs/automated_pe/rl.md
@@ -0,0 +1,68 @@
---
sidebar_position: 130
---

# 🟣 Reinforcement Learning

This section covers reinforcement learning methods that optimize discrete prompts (not soft prompts). <br/>These methods are quite complex.

## RLPrompt

RLPrompt(@deng2022rlprompt) is a method that takes an input and trains a language model (the policy)
to generate a good prompt for that input.

More formally, given an input sequence $x$, the policy builds a prompt $\hat{z} = [z_1, z_2, \ldots, z_T]$ by selecting tokens from the vocabulary one at a time.

After creating the prompt, RLPrompt combines it with $x$ and uses another language model to
generate the completion. The output of the LM for $x$ prompted by $\hat{z}$ is written $y_{LM}(\hat{z}, x)$.

The policy then receives a reward based on this output: $R(y_{LM}(\hat{z}, x))$.

### Example

Suppose we have partially trained RLPrompt on classifying movie reviews, and the next
training example is `x = "I hate this movie."`. RLPrompt might generate a prompt like
`z = "Movie review bad or good:"`. It then combines the prompt with the input to get
`x' = "Movie review bad or good: I hate this movie."` and uses a language model
to generate the completion. Say the LM generates `bad`. The reward is then computed as
$R(y_{LM}(\hat{z}, x))$. Note that Deng et al. do not use a simple 0/1 reward.
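
As a rough illustration (not the authors' implementation), one step of this loop might look like the following. All helper names here are hypothetical stand-ins for the policy network, the frozen downstream LM, and the reward function.

```python
# Illustrative RLPrompt-style step; all helpers are simplified, hypothetical stand-ins.

def policy_generate_prompt(x: str) -> str:
    """Stand-in for the policy LM that emits prompt tokens z_1..z_T."""
    return "Movie review bad or good:"  # a trained policy would generate this

def lm_generate(prompt_and_input: str) -> str:
    """Stand-in for the frozen downstream LM producing y_LM(z, x)."""
    return "bad"

def compute_reward(completion: str, label: str) -> float:
    """Simplified reward; Deng et al. use a shaped reward, not a plain 0/1 score."""
    return 1.0 if completion.strip() == label else -1.0

def rlprompt_step(x: str, label: str) -> float:
    z = policy_generate_prompt(x)        # prompt z chosen by the policy
    y = lm_generate(f"{z} {x}")          # completion y_LM(z, x) from the frozen LM
    return compute_reward(y, label)      # R(y_LM(z, x)), used to update the policy

print(rlprompt_step("I hate this movie.", "bad"))  # -> 1.0
```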

## Training

RLPrompt embeds a task-specific MLP inside a frozen LM. The MLP is trained with Soft Q-Learning(@guo2021efficient).
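
A minimal sketch of that architecture, assuming PyTorch and a generic frozen backbone: only the small MLP adapter receives gradients, while the backbone and output head stay frozen. This is illustrative only; RLPrompt's actual architecture differs in detail.

```python
import torch
import torch.nn as nn

class PromptPolicy(nn.Module):
    """Frozen LM backbone with a small trainable MLP inserted before the output head."""

    def __init__(self, backbone: nn.Module, lm_head: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone          # frozen module producing hidden states
        self.lm_head = lm_head            # frozen projection to vocabulary logits
        for p in self.backbone.parameters():
            p.requires_grad = False
        for p in self.lm_head.parameters():
            p.requires_grad = False
        self.adapter = nn.Sequential(     # the only trainable part
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)        # (batch, seq, hidden)
        hidden = hidden + self.adapter(hidden)   # residual MLP adjustment
        return self.lm_head(hidden)              # logits over the next prompt token

# Toy usage with a dummy backbone (an embedding layer standing in for a transformer).
vocab, hidden = 100, 32
policy = PromptPolicy(nn.Embedding(vocab, hidden), nn.Linear(hidden, vocab), hidden)
logits = policy(torch.randint(0, vocab, (1, 5)))  # (1, 5, vocab)
```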

## TEMPERA

**TE**st-ti**M**e **P**rompt **E**diting using **R**einforcement le**A**rning
(TEMPERA)(@zhang2022tempera) is a method for automatically generating
interpretable prompts.

At a high level, instead of building a prompt from scratch like RLPrompt, TEMPERA starts from an initial prompt and edits different parts of it, learning which edits improve performance the most.

## Action Space

TEMPERA is allowed to edit 3 parts of the prompt:

### 1) The instruction

Given the instruction $i$, TEMPERA parses it with `nltk.tokenize.treebank` into a set of phrases; the actions then allow swapping, adding, and deleting phrases within that set. For example, the sentence `"Given text, classify whether it is good or bad."` is first parsed into `["Given text", "classify", "whether", "it is", "good", "or", "bad"]`. We can then apply different editing strategies (e.g., swapping two phrases, deleting a phrase, or repeating a phrase) to this set of phrases.
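
A rough sketch of those phrase-level edits (the phrase list is hard-coded here for illustration; TEMPERA derives it with `nltk.tokenize.treebank`):

```python
import random

phrases = ["Given text", "classify", "whether", "it is", "good", "or", "bad"]

def swap(p: list[str], i: int, j: int) -> list[str]:
    out = p.copy()
    out[i], out[j] = out[j], out[i]
    return out

def delete(p: list[str], i: int) -> list[str]:
    return p[:i] + p[i + 1:]

def repeat(p: list[str], i: int) -> list[str]:
    return p[:i + 1] + [p[i]] + p[i + 1:]

# One random edit applied to the instruction, as the RL agent might propose.
edit = random.choice([lambda p: swap(p, 0, 1),
                      lambda p: delete(p, 2),
                      lambda p: repeat(p, 4)])
print(" ".join(edit(phrases)))
```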

### 2) In-context examples

Given an example pool of $K$ examples (aka %%exemplars|exemplars%%), we want to select $k$ of them to formulate the final prompt. The action space allows swapping the positions of examples $i$ and $j$ with $1 \le i < j \le k$. It also supports replacing example $i$ ($1 \le i \le k$) with any candidate $j$ from the pool ($k < j \le K$).
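
For instance, the two kinds of actions could be sketched as follows (indices are 0-based here for code clarity; the pool and example names are hypothetical):

```python
# Sketch of the in-context example actions.
pool = [f"example_{n}" for n in range(8)]   # K = 8 candidate exemplars
selected = list(range(4))                   # indices of the k = 4 in-context examples

def swap_positions(sel: list[int], i: int, j: int) -> list[int]:
    out = sel.copy()
    out[i], out[j] = out[j], out[i]
    return out

def replace_example(sel: list[int], i: int, pool_idx: int) -> list[int]:
    out = sel.copy()
    out[i] = pool_idx                       # pool_idx is a candidate outside `sel`
    return out

selected = swap_positions(selected, 0, 2)   # reorder two in-context examples
selected = replace_example(selected, 1, 6)  # swap in a candidate from the pool
print([pool[i] for i in selected])
```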

### 3) The verbalizers

The editing space simply allows changing the current verbalizer to any other verbalizer from the `promptsource` collections. For example, changing from `["positive", "negative"]` to `["great", "terrible"]`.

## Reward

The reward is the difference in score between the prompt before and after an edit.

TEMPERA's reward is dense: at each edit step, the reward is computed from the accuracy improvement of the current prompt (after editing) over the previous prompt (before editing).
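
In code, the per-step reward could be sketched as the accuracy delta between consecutive prompts. Here `classify` is a hypothetical placeholder for querying the downstream LM, and `evaluate` scores a prompt on a small labeled batch.

```python
def classify(prompt: str, x: str) -> str:
    """Placeholder for querying the downstream LM with `prompt` + `x`."""
    raise NotImplementedError("plug in an LM call here")

def evaluate(prompt: str) -> float:
    """Accuracy of a prompt on a small batch of labeled examples."""
    batch = [("I hate this movie.", "bad"), ("Loved every minute.", "good")]
    return sum(classify(prompt, x) == y for x, y in batch) / len(batch)

def step_reward(prompt_before: str, prompt_after: str) -> float:
    # Dense reward: score difference caused by this single edit.
    return evaluate(prompt_after) - evaluate(prompt_before)
```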

## Training

TEMPERA uses a GPT architecture and is trained with Proximal Policy Optimization (PPO).

File renamed without changes.
6 changes: 5 additions & 1 deletion docs/bibliography.md
@@ -56,7 +56,11 @@ cite them as such.

#### AutoPrompt(@shin2020autoprompt) 🔵

#### Automatic Prompt Engineer(@zhou2022large)
#### Automatic Prompt Engineer(@zhou2022large) 🔵

#### TEMPERA(@zhang2022tempera) 🔵

#### RLPrompt(@deng2022rlprompt)

## Models

8 changes: 0 additions & 8 deletions docs/trainable/_category_.json

This file was deleted.