Contents

- Overview
- Problem statement
- Datasets
- Bibliography

Overview

Chain-of-Thought

It has been noticed that LLMs exhibit the Chain-of-Thought prompting property when trained long enough. For instance, if one asks a model to perform a reasoning task, it might fail:

Model Input:
Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many
tennis balls does he have now?
A: The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to
make lunch and bought 6 more, how many apples
do they have?

Model Output:
A: The answer is 27.
But with an additional Chain-of-Thought prompt, the model is able to predict the correct answer:
Model Input:
Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many
tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls
each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to
make lunch and bought 6 more, how many apples
do they have?

Model Output:
A: The cafeteria had 23 apples originally. They used
20 to make lunch. So they had 23 - 20 = 3. They
bought 6 more apples, so they have 3 + 6 = 9. The
answer is 9.
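
As a concrete illustration, a prompt like the one above can be fed to any causal language model. Below is a minimal sketch using the Hugging Face transformers API; the model name is a placeholder (CoT behaviour is typically only observed in much larger models), and the generation settings are illustrative assumptions, not part of the original experiments.

```python
# Minimal CoT prompting sketch (model name and settings are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; CoT usually emerges only in far larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One worked exemplar (the chain of thought) followed by the target question.
prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?\n"
    "A:"
)

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Print only the newly generated continuation.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```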
Feature extraction
Neural networks extract low-level features close to the input of a network and high-level features close to the output layer. Moreover, it has been shown that it is possible to reconstruct a syntax tree from BERT's features, indicating that BERT extracts some "higher-level" knowledge from the training data. Unlike BERT, today's LLMs are trained to predict the next token, but it is reasonable to expect them to exhibit a similar property.
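
To make this framing concrete, the per-layer hidden states of a decoder-only LM can be collected and probed layer by layer, in the same spirit as the BERT probing work. A small sketch follows; the model name is a placeholder, and any transformers causal LM exposing output_hidden_states would do.

```python
# Sketch: collect per-layer hidden states from a causal LM for probing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "The cafeteria had 23 apples. They used 20 to make lunch."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: the embedding output plus one tensor per layer,
# each of shape (batch, seq_len, hidden_size). A probe (e.g. a linear classifier)
# can then be trained on each layer to see where a given property becomes decodable.
for layer_idx, h in enumerate(out.hidden_states):
    print(layer_idx, h.shape)
```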
Problem Statement
Given a CoT prompt P, a model generates a sequence of steps or facts (usually, the model expresses a single fact as one sentence):
P ==> F_1
F_1 ==> F_2
F_2 ==> F_3
...
F_(n-1) ==> F_n
Given that each fact depends on the previous one, can we see a similar dependency in the activations of the network? Namely, can we see that in most cases the weights for fact F_i are activated deeper in the network than the weights for fact F_j, when j < i?
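
The write-up does not fix how "activated deeper" would be measured. One possible operationalization, purely as an assumption for illustration: pool the hidden states over the token span of each fact F_i and record the earliest layer whose pooled representation is already close to the final layer's. The function name, the mean-pooling, and the cosine-similarity threshold below are all hypothetical choices; hidden_states is the tuple returned with output_hidden_states=True, as in the previous sketch.

```python
# Hypothetical proxy for the "depth" at which a fact is represented.
import torch.nn.functional as F

def fact_depth(hidden_states, span, threshold=0.9):
    """hidden_states: tuple of (1, seq_len, hidden) tensors, one per layer.
    span: (start, end) token indices of a fact F_i in the sequence."""
    start, end = span
    # Mean-pool the fact's tokens at every layer.
    pooled = [h[0, start:end].mean(dim=0) for h in hidden_states]
    final = pooled[-1]
    for layer, vec in enumerate(pooled):
        if F.cosine_similarity(vec, final, dim=0) >= threshold:
            return layer  # earliest layer already close to the final representation
    return len(pooled) - 1

# Comparing fact_depth across facts: the hypothesis predicts the depth of F_i
# tends to grow (or at least not shrink) with i.
```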
Datasets

StrategyQA Dataset

StrategyQA is a question-answering benchmark focusing on open-domain questions where the required reasoning steps are implicit in the question and should be inferred using a strategy. StrategyQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs.
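
Assuming the officially released JSON files are used (the file and field names below are assumptions based on the public release and should be checked against the actual download), a single example can be inspected like this:

```python
# Sketch: read the StrategyQA training file and inspect one example.
import json

with open("strategyqa_train.json") as f:
    data = json.load(f)

example = data[0]
print(example["question"])       # the strategy question
print(example["answer"])         # yes/no answer
print(example["decomposition"])  # implicit reasoning steps as sub-questions
```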
Bibliography
Wei et al. -- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chi et al. -- Finding Universal Grammatical Relations in Multilingual BERT