This repository contains the official implementation for the paper Self-Recognition in Language Models [1].
In our paper, we propose assessing self-recognition in language models (LMs) using model-generated security questions. The approach proceeds in three steps:
- ❓ generate a set of questions;
- 💬 generate a set of answers to these questions;
- ⚖️ generate "verdicts" by showing LMs questions with *n* answers, and prompting them to select their own.
This repository contains code to reproduce the experiments of the paper and is structured as follows:
```
.
├── src/
│   ├── models/                 (currently supported: Anthropic, Cohere, Google, Microsoft, OpenAI, TogetherAI)
│   ├── configs/                (configurations to create (i) questions, (ii) answers, (iii) verdicts)
│   └── *.py
├── data/
│   ├── api_settings/
│   ├── model_settings/
│   ├── prompts/
│   ├── questions/
│   ├── responses/
│   ├── verdicts/
│   └── llm_model_details.yaml
├── secrets.json                (to be created)
└── gcp_secrets.json            (optional)
```
A limited set of example questions, answers, and verdicts is provided in the `data/` directory.
We use hydra to manage configurations. The main entry point is `src/run.py`, which takes a configuration file as input. Configuration files are stored in `src/configs/` and are used to generate questions, answers, and verdicts:
```
src/
└── configs/
    ├── generate_questions.yaml
    ├── generate_responses.yaml
    └── generate_verdicts.yaml
```
Simply navigate to these files to specify the model(s) you want to use. LM wrappers for most leading providers are included in `src/models/`. After generating questions, responses, and verdicts, hydra will save the output to a specified directory, `logs/` by default.
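If you want to redirect where a run is saved, hydra's standard run-directory override should work. This is a minimal sketch, assuming the project does not replace hydra's default `hydra.run.dir` handling; the directory name is made up:

```bash
# Write this run's outputs to a custom directory instead of the default logs/ location
python src/run.py defaults.experiments=generate_questions hydra.run.dir=logs/my_questions_run
```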
questions:
- to generate: `python src/run.py defaults.experiments=generate_questions`
- saves a `questions.csv` file to `logs/<your-experiment>`
- copy this file to `data/questions/` for the next step
responses:
- to generate: `python src/run.py defaults.experiments=generate_responses`
- saves a `responses.csv` file to `logs/<your-experiment>`
- copy this file to `data/responses/` for the next step
verdicts:
- to generate: `python src/run.py defaults.experiments=generate_verdicts`
- saves a `verdicts.csv` file to `logs/<your-experiment>`
- copy this file to `data/verdicts/` for the next step
evaluation:
- to process verdicts and make sure they are correctly formatted: `python src/verdict_evaluation.py --base_folder=<path-to-verdicts>`
- this creates `verdicts_extracted.csv` in the same directory
- to evaluate the performance of the model: `python src/evaluations.py --base_folder=<path-to-extracted-verdicts>` (an end-to-end sketch of all four steps follows this list)
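Putting the pieces together, the full pipeline looks roughly as follows. This is a sketch: the exact `logs/` subdirectory is determined by your hydra run configuration, and the placeholder paths need to be replaced with your own.

```bash
# 1. Generate questions, then stage them for the response step
python src/run.py defaults.experiments=generate_questions
cp logs/<your-experiment>/questions.csv data/questions/

# 2. Generate responses to the staged questions
python src/run.py defaults.experiments=generate_responses
cp logs/<your-experiment>/responses.csv data/responses/

# 3. Generate verdicts from the staged responses
python src/run.py defaults.experiments=generate_verdicts
cp logs/<your-experiment>/verdicts.csv data/verdicts/

# 4. Extract and evaluate the verdicts
python src/verdict_evaluation.py --base_folder=<path-to-verdicts>
python src/evaluations.py --base_folder=<path-to-extracted-verdicts>
```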
Having run these steps, you can use various tools to analyze the results. For example, see the files:
- `src/analys.py`
- `src/visualization.py`
The simplest way to get started is to:
- clone this repository, then
- create a `secrets.json` file in the root directory with the following structure:
```json
{
  "openai": {
    "api_key": "<your-key>"
  },
  "azure": {
    "api_key": "<your-key>"
  },
  "anthropic": {
    "api_key": "<your-key>"
  },
  "google": {
    "api_key": "<your-key>"
  },
  "cohere": {
    "api_key": "<your-key>"
  },
  "together_ai": {
    "api_key": "<your-key>"
  }
}
```
In this file, insert your own API key for each of the providers you want to use: {Anthropic, Cohere, Google, OpenAI, Microsoft, TogetherAI}. This `secrets.json` file is listed in `.gitignore`, to prevent you from accidentally pushing your raw keys to GitHub :). (See the note below if you are using Google or Azure models.)
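Before running anything, it can be worth checking that the file parses as valid JSON, for example with Python's built-in `json.tool` module:

```bash
# Pretty-prints the file if it is valid JSON, errors out otherwise
python -m json.tool secrets.json
```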
Next, create a virtual environment and install the packages listed in `requirements.txt`. Once this is done, you're all set.
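For reference, a typical setup using Python's built-in `venv` module (assuming a Unix-like shell; on Windows the activation command differs):

```bash
python -m venv .venv               # create the virtual environment
source .venv/bin/activate          # activate it
pip install -r requirements.txt    # install the project's dependencies
```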
For any questions, feel free to open a ticket or reach out directly to Tim :).
If you are using Google or MSFT Azure, you also need to update the relevant endpoints in `data/api_settings/apis.yaml`. At the time of release, the Google Vertex API does not support simple API keys in all regions. To get around this, you have to (1) create a service account, (2) set some permissions, and (3) download a .json key file. Save the exported .json file as `gcp_secrets.json` in the root directory of this project (also in `.gitignore`). See the following docs for a walkthrough.
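If you prefer the command line over the Cloud Console UI, one way to export such a key is sketched below; this assumes `gcloud` is installed and the service account (a placeholder name here) already exists with the required permissions:

```bash
# Replace the service-account email with your own account and project
gcloud iam service-accounts keys create gcp_secrets.json \
  --iam-account=<your-service-account>@<your-project>.iam.gserviceaccount.com
```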
This repository is released under the MIT license.
Please cite our work using one of the following if you end up using the repository - thanks!
[1] T.R. Davidson, V. Surkov, V. Veselovsky, G. Russo, R. West, C. Gulcehre.
Self-Recognition in Language Models. arXiv preprint, arXiv:2407.06946, 2024.
BibTeX format:
```bibtex
@article{davidson2024selfrecognitionlanguagemodels,
  title   = {Self-Recognition in Language Models},
  author  = {Tim R. Davidson and Viacheslav Surkov and Veniamin Veselovsky and
             Giuseppe Russo and Robert West and Caglar Gulcehre},
  year    = {2024},
  journal = {EMNLP},
  url     = {https://arxiv.org/abs/2407.06946}
}
```