🤖 Black Box Psych Experiments

🤯 Conducting psychology experiments on black box language models. Warning: Unstructured repo.

What we have tested (so far)

Anchoring (directory)

We originally replicated the anchoring paper with the format "Is Z higher or lower than X? {answer_1}\nWhat is the height of Z?". Initially, we see no consistent, replicable anchoring effect that corresponds to the original, i.e. your estimate is pushed in the direction of the anchor. Then we observe an effect where it anchors to numbers that are close to the right answer. A prompt can e.g. look like this:

Random number: 1002.
Q: How many meters are in a kilometer?
1: 1000
2: 1002
A:

...and the models will consistently respond with 1002 despite its usually correct responses. We also test this in for inverse scaling and see that large models are more susceptible to this effect. This is inherently very interesting and we have several hypotheses for why this might be.

Describing Black Swan events outside of its training time

In black-swan-future, we test how the language models describe long tail probability events within its dataset bounds versus outside. An example might be "What happened on January 3rd 2018" vs. "What happened on January 3rd 2022". Since Jan 3rd 2022 is not within its training dataset's bounds, it predicts wildly inaccurate things with very high certainty.

Political bias
Saliency effect

Project results

We hope to release a paper detailing cognitive biases in large language models and what it means for generalization of human features. Additionally, we participate in the inverse scaling prize with some of our results from this project and hope to release results in association with their team.

As an added way of sharing the work, we will release videos about our results on our YouTube channel about the safe development of AI. Check out our website at Apart Research.

How to join the project

Create a pull request to this repository
Join our Discord
Join our hackathons
Check out aisafetyideas
Read up at RWWC or on Jacob Hilton's opinionated deep learning reading list

Name		Name	Last commit message	Last commit date
Latest commit History 144 Commits
anchoring		anchoring
black-swan-future		black-swan-future
data		data
inverse-scaling		inverse-scaling
src		src
tests		tests
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md
generate_saliency_n_shots.py		generate_saliency_n_shots.py
saliency.py		saliency.py
saliency_bias-code.zip		saliency_bias-code.zip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🤖 Black Box Psych Experiments

What we have tested (so far)

Project results

How to join the project

About

Releases

Packages

Contributors 2

Languages

apartresearch/blackbox-psych

Folders and files

Latest commit

History

Repository files navigation

🤖 Black Box Psych Experiments

What we have tested (so far)

Project results

How to join the project

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages