
Commit

Update index.html
schwettmann authored Apr 14, 2024
1 parent c6c469e commit 8fbe2e8
Showing 1 changed file with 1 addition and 1 deletion.
index.html: 2 changes (1 addition & 1 deletion)
@@ -117,7 +117,7 @@ <h1 class="title is-1 publication-title">A Multimodal Automated Interpretability
<p><h3 class="title is-4">How can AI systems help us understand other AI systems?</h3></p>
<p>Understanding an AI system can take many forms. For instance, we might want to know when and how the system relies on sensitive or spurious features, identify systematic errors in its predictions, or learn how to modify the training data and model architecture to improve accuracy and robustness. Today, answering these types of questions often involves significant human effort—researchers must formalize their question, formulate hypotheses about a model’s decision-making process, design datasets on which to evaluate model behavior, then use these datasets to refine and validate hypotheses. As a result, this type of understanding is slow and expensive to obtain, even about the most widely used models.</p><br>
<p>Automated Interpretability approaches have begun to address the issue of scale. Recently, such approaches have used pretrained language models like GPT-4 (in <a href="https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html" target="_blank">Bills et al. 2023</a>) or Claude (in <a href="https://transformer-circuits.pub/2023/monosemantic-features" target="_blank">Bricken et al. 2023</a>) to generate feature explanations. In earlier work, we introduced MILAN (<a href="https://arxiv.org/abs/2201.11114" target="_blank">Hernandez et al. 2022</a>), a captioner model trained on human feature annotations that takes as input a feature visualization and outputs a description of that feature. But automated approaches that use learned models to label features leave something to be desired: they are primarily tools for hypothesis generation (Huang et al. 2023), they characterize behavior on a limited set of inputs, and they are often low precision.</p><br>
- <p> Our current line of research aims to build tools that help users understand models, while combining the flexibility of human experimentation with the scalability of automated techniques. We take an approach based on automating the scientific experimentation involved in understanding models. In <a href="https://arxiv.org/abs/2309.03886" target="_blank">Schwettmann et al. 2023</a>, we introduced the <em>Automated Interpretability Agent</em> (AIA) paradigm, where an LM-based agent interactively probes systems to explain their behavior. We now introduce a multimodal AIA with a vision-language model backbone and an API of tools that the agent can use to design experiments on other systems. The same modular system fields "macroscopic" questions like identifying systematic biases in model predictions (see the tench example above), as well as "microscopic" questions like describing individual features (see example below, and many more examples in our <b>neuron viewer</b>), simply by modifying the user prompt to the agent.</p><br>
+ <p> Our current line of research aims to build tools that help users understand models, while combining the flexibility of human experimentation with the scalability of automated techniques. We take an approach based on automating the scientific experimentation involved in understanding models. In <a href="https://arxiv.org/abs/2309.03886" target="_blank">Schwettmann et al. 2023</a>, we introduced the <em>Automated Interpretability Agent</em> (AIA) paradigm, where an LM-based agent interactively probes systems to explain their behavior. We now introduce a multimodal AIA, with a vision-language model backbone and an API of tools for designing experiments on other systems. Simply by modifying the user prompt to the agent, the same modular system can field both "macroscopic" questions like identifying systematic biases in model predictions (see the tench example above) and "microscopic" questions like describing individual features (see example below, and many more examples in our <b>neuron viewer</b>).</p><br>
</div>
</div>
</section>
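
The paragraph changed in this commit describes the AIA setup at a high level: a vision-language model backbone, a tool API for running experiments on another system, and a user prompt that sets the question. The sketch below is a minimal, illustrative rendering of that loop under those assumptions only; every class, method, and tool name here is a hypothetical placeholder, not MAIA's actual API.

# A rough sketch of an AIA-style loop: an LM-based agent repeatedly proposes an
# experiment, executes it through a tool API, and reads back the result.
# All names below are illustrative placeholders, not MAIA's real interface.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class InterpretabilityAgent:
    """Probes another system by designing experiments with a small tool API."""

    backbone: Callable[[str], str]          # e.g. a vision-language model queried with text
    tools: Dict[str, Callable[[str], str]]  # experiment tools the agent is allowed to call
    history: List[str] = field(default_factory=list)

    def run(self, user_prompt: str, max_steps: int = 10) -> str:
        """Iterate: propose an action, execute it, record the result, repeat."""
        self.history.append(f"USER: {user_prompt}")
        for _ in range(max_steps):
            # The backbone sees the full experiment history and proposes the next action,
            # either a tool call ("<tool> <argument>") or a final answer ("ANSWER: ...").
            action = self.backbone("\n".join(self.history))
            self.history.append(f"AGENT: {action}")
            if action.startswith("ANSWER:"):
                return action[len("ANSWER:"):].strip()
            tool_name, _, argument = action.partition(" ")
            if tool_name in self.tools:
                result = self.tools[tool_name](argument)
                self.history.append(f"RESULT: {result}")
            else:
                self.history.append(f"RESULT: unknown tool '{tool_name}'")
        return "No conclusion reached within the step budget."


# Only the user prompt changes between "macroscopic" and "microscopic" questions, e.g.:
#   agent.run("Find systematic biases in the classifier's 'tench' predictions.")
#   agent.run("Describe what unit 17 in layer 4 of the vision model responds to.")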
