Suggested Approach
Feb 10 2025
Author: Andrew Lee
Perhaps the rapid improvement in the reasoning abilities of language models (o1, o3, DeepSeek variants, etc.) has caught many of us by surprise. It certainly caught me. As someone interested in interpretability, my immediate questions were all something along the lines of "How does it [blank]!?"
I suspect I am not the only one, which is perhaps why we are launching an initiative like Arbor in the first place. However, a large, distributed effort certainly requires careful coordination. How do we minimize duplicated effort? How do we identify the most important research questions? The lowest-hanging fruit? How do we scale gracefully?
By no means am I the right person to answer these questions, but I would like to offer a starting point. Namely, I suggest a few ways to leverage open collaboration and scale:
- Divide-and-Conquer: One approach is to attempt a "merge sort" -- to decompose and distribute research questions and tasks, and "merge" findings until we reach a point at which we understand how reasoning models work. One way of distributing in this manner is to build a suite of "toy models" that each exhibit a task-specific reasoning behavior.
- "Atlas" of R1: Another approach is to aggregate and widely release a dissection of R1. Researchers have done this before, with genome browsers, Activation Atlas, and most recently Neuronpedia with SAEs. Aggregating and releasing a similar atlas of R1 can allow anyone to find patterns and form ideas.
- Repository of resources - data, software: Finally, researchers can work together to construct useful resources. This is not limited to model checkpoints and experiments. For instance, together we could craft new datasets with rich annotations, or even more broadly, an open-source annotation tool. With enough contributors, something that would have taken weeks can hopefully take a few days.
How does R1 do reasoning?
In my humble opinion, a good research project typically has a clear research question that serves as the "north star". When designing an experiment, or envisioning a plot, the north star serves as a sanity check: What new information do we gain? How does it take us closer to the north star?
Perhaps our ultimate north star question is "How does R1 (and variants) do reasoning?" But this is too broad a question, and importantly, it does not lend itself to distributed, organized open collaboration.
A more practical approach may be to decompose the question into core components, such that when pieced together, a (nearly) complete explanation is reached. In the context of reasoning, perhaps there are obvious components that many of us have already asked:
- CoT tokens == inner mechanisms? Do R1's chain-of-thought tokens reflect its internal computations?
- Search? Is R1 doing search? If so, what kind of search? Can we find representations of a search tree?
- Verification? R1 and its variants seem to verify their outputs, which also seems to serve as a way of knowing when to stop "thinking" and return an answer.
To be clear, I am not necessarily claiming that these should be our set of decomposed north star questions, but I think they are a reasonable starting point. I also imagine that the rest of the community will shape and evolve this set, hopefully in a highly organized manner.
So we can decompose our research question. But we might be able to decompose (and distribute efforts) even further! Given a mysterious phenomenon, one potential strategy to better understand it is to reproduce the phenomenon in a simplified yet faithful setting, and to understand the simplified setting deeply.
A good, faithful abstraction should result in an explanation that generalizes to the original phenomenon. Personally, I love toy models. But I've been guilty of studying toy models without a north star (i.e., a non-faithful abstraction) and reaching a dead end.
Researchers (who are clearly better than myself) have demonstrated successful toy models in the past. An example that might resonate with the interpretability community is Toy Models of Superposition, which arguably set off the craze around SAEs. Personally, my favorite example is OthelloGPT, which convincingly demonstrated that language models are not always stochastic parrots, but that they can indeed learn "world models".
Coming back to reasoning models - what does a "toy model" look like in our case? Given a reasoning behavior of interest (say, search), I believe we can design reasoning-behavior-specific tasks, train a model on such a task in order to reproduce the behavior of interest, and study that task-specific model in detail.
Jiayi-Pan's reproduction of R1 on a "countdown" task is a great example:
Given a set of numbers (19, 36, 55, 7) and a target number (65), the task is to find an arithmetic expression that combines the four numbers to reach the target -- which is a search problem.
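To make the search space concrete, here is a minimal brute-force sketch of the countdown task. This is my own illustration, not Jiayi-Pan's code: the function name, the left-to-right expression building, and the handling of division are assumptions made for brevity (the actual task allows arbitrary parenthesizations).

```python
from itertools import permutations, product

def solve_countdown(numbers, target, eps=1e-6):
    # Hypothetical helper, not from the original repo.
    # Map operator symbols to functions; division returns None when the
    # divisor is (numerically) zero, pruning that branch of the search.
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b if abs(b) > eps else None,
    }
    # Enumerate every ordering of the numbers and every operator sequence,
    # applying operators left to right. (This covers only one slice of the
    # full search space, since other parenthesizations are skipped.)
    for nums in permutations(numbers):
        for op_seq in product(ops, repeat=len(nums) - 1):
            value, expr = nums[0], str(nums[0])
            for op, n in zip(op_seq, nums[1:]):
                value = ops[op](value, n)
                if value is None:
                    break
                expr = f"({expr} {op} {n})"
            else:
                if abs(value - target) < eps:
                    return expr
    return None

print(solve_countdown([19, 36, 55, 7], 65))  # a valid expression such as (((36 - 19) + 55) - 7)
```

Presumably, an RL-trained model has to carry out something like this enumeration implicitly within its chain of thought -- which is exactly the behavior we would want to interpret.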
Studying this task-specific model alone may lead to numerous insights! The resulting model already demonstrates search-like CoTs, as well as verification. Hopefully, findings from this model generalize to R1.
Though I would also like to point out that if the inner mechanisms and representations of task-specific models and R1 differ, that is an interesting finding in and of itself, naturally leading to the fascinating question of why.
There are two arguments for task-specific models that I'll list:
- We may want to disentangle a specific reasoning behavior from R1's behavior. Some of the reasoning capabilities demonstrated by R1 necessitate a mix of skills, such as knowledge retrieval. Being able to compute the force between two objects requires the model to know the right equation for gravitational force in the first place, in addition to the correct gravitational constant (spelled out in the equation after this list). In a setting where we just want to know what a "search mechanism" looks like, studying a model trained on a single search task may be sufficient (this last point is an open question itself, btw - do all "search mechanisms" look alike across RL'ed reasoning models?).
- Not all of us are math olympians. To Chris Wendler's point - reverse-engineering a model is hard enough. I don't want to debug how a model solves a math question that I don't even understand!
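To spell out the retrieval step mentioned in the first bullet: computing the gravitational force between two objects already presupposes recalling both the equation and the constant,

$$
F = \frac{G\, m_1 m_2}{r^2}, \qquad G \approx 6.674 \times 10^{-11}\ \mathrm{N\,m^2\,kg^{-2}},
$$

whereas a dedicated search task like countdown involves no such retrieval, which is presumably part of why a single-task model can be easier to study.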
To be clear, not every question necessitates a toy model. And every finding from a toy model should be compared to that of a general reasoning model like R1.
Concretely, what I am suggesting is that per (a set of) research question(s), we design and build a (set of) task(s) that can be distributed and studied in collaboration. The resulting process is akin to merge-sort: each of the intermediate steps can be distributed (designing suitable tasks, training models, interpreting models) and findings can be "merged". While "merging" findings, we may run into discrepancies that must be resolved -- which can also lead to a set of distributed tasks.
(Credit: Chris Wendler) Our suggested approach puts the scalability of open collaboration at its heart, which is not without its risks. In particular, our decomposed, distributed effort may result in a reenactment of the parable of the blind men and an elephant.
(Image source: Wikipedia)
The parable of the blind men and an elephant is a story of a group of blind men who have never come across an elephant before and who learn and imagine what the elephant is like by touching it. Each blind man feels a different part of the animal's body, but only one part, such as the side or the tusk. They then describe the animal based on their limited experience, and their descriptions differ from one another. In some versions, they come to suspect that the others are dishonest, and they come to blows. The moral of the parable is that humans have a tendency to claim absolute truth based on their limited, subjective experience, while ignoring other people's limited, subjective experiences, which may be equally true.
Certainly, we have to be mindful of how our toy settings relate to the north star, how we might deal with discrepancies, what caveats and assumptions took place, and so on.
(Credit: Jacob Andreas) Is the proposed set of decomposed research questions the correct set of questions to be asking? When pitching a question to a broad audience (which may include, say, junior researchers), there's always the risk of misleading an entire sub-community. However, this is why we believe in the open-source, open-collaboration community. Research questions should be constantly re-evaluated and re-prioritized -- not by just a select few, but by the entire community.