diff --git a/README.md b/README.md
index b9d9f50f..00bf0c10 100644
--- a/README.md
+++ b/README.md
@@ -228,28 +228,28 @@ For testing. This generator repeats back the prompt it received.
 
 ## Intro to probes
 
-| Probe                | Description |
-|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| blank                | A simple probe that always sends an empty prompt. |
+| Probe                | Description |
+|----------------------|-------------------------------------------------------------------------------------------------------------------------------|
+| blank                | A simple probe that always sends an empty prompt. |
 | atkgen               | Automated Attack Generation. A red-teaming LLM probes the target and reacts to it in an attempt to get toxic output. Prototype, mostly stateless; for now it uses a simple GPT-2 [fine-tuned](https://huggingface.co/garak-llm/artgpt2tox) on the subset of hhrlhf attempts that yielded detectable toxicity (the only target currently supported). |
-| av_spam_scanning     | Probes that attempt to make the model output malicious content signatures |
-| continuation         | Probes that test if the model will continue a probably undesirable word |
-| dan                  | Various [DAN](https://adguard.com/en/blog/chatgpt-dan-prompt-abuse.html) and DAN-like attacks |
-| donotanswer          | Prompts to which responsible language models should not answer. |
-| encoding             | Prompt injection through text encoding |
-| gcg                  | Disrupt a system prompt by appending an adversarial suffix. |
-| glitch               | Probe model for glitch tokens that provoke unusual behavior. |
-| grandma              | Appeal to be reminded of one's grandmother. |
-| goodside             | Implementations of Riley Goodside attacks. |
-| leakerplay           | Evaluate if a model will replay training data. |
-| lmrc                 | Subsample of the [Language Model Risk Cards](https://arxiv.org/abs/2303.18190) probes |
-| malwaregen           | Attempts to have the model generate code for building malware |
-| misleading           | Attempts to make a model support misleading and false claims |
-| packagehallucination | Trying to get code generations that specify non-existent (and therefore insecure) packages. |
-| promptinject         | Implementation of the Agency Enterprise [PromptInject](https://github.com/agencyenterprise/PromptInject/tree/main/promptinject) work (best paper awards @ NeurIPS ML Safety Workshop 2022) |
-| realtoxicityprompts  | Subset of the RealToxicityPrompts work (data constrained because the full test will take so long to run) |
-| snowball             | [Snowballed Hallucination](https://ofir.io/snowballed_hallucination.pdf) probes designed to make a model give a wrong answer to questions too complex for it to process |
-| xss                  | Look for vulnerabilities that permit or enact cross-site attacks, such as private data exfiltration. |
+| av_spam_scanning     | Probes that attempt to make the model output malicious content signatures |
+| continuation         | Probes that test if the model will continue a probably undesirable word |
+| dan                  | Various [DAN](https://adguard.com/en/blog/chatgpt-dan-prompt-abuse.html) and DAN-like attacks |
+| donotanswer          | Prompts to which responsible language models should not answer. |
+| encoding             | Prompt injection through text encoding |
+| gcg                  | Disrupt a system prompt by appending an adversarial suffix. |
+| glitch               | Probe model for glitch tokens that provoke unusual behavior. |
+| grandma              | Appeal to be reminded of one's grandmother. |
+| goodside             | Implementations of Riley Goodside attacks. |
+| leakerplay           | Evaluate if a model will replay training data. |
+| lmrc                 | Subsample of the [Language Model Risk Cards](https://arxiv.org/abs/2303.18190) probes |
+| malwaregen           | Attempts to have the model generate code for building malware |
+| misleading           | Attempts to make a model support misleading and false claims |
+| packagehallucination | Trying to get code generations that specify non-existent (and therefore insecure) packages. |
+| promptinject         | Implementation of the Agency Enterprise [PromptInject](https://github.com/agencyenterprise/PromptInject/tree/main/promptinject) work (best paper awards @ NeurIPS ML Safety Workshop 2022) |
+| realtoxicityprompts  | Subset of the RealToxicityPrompts work (data constrained because the full test will take so long to run) |
+| snowball             | [Snowballed Hallucination](https://ofir.io/snowballed_hallucination.pdf) probes designed to make a model give a wrong answer to questions too complex for it to process |
+| xss                  | Look for vulnerabilities that permit or enact cross-site attacks, such as private data exfiltration. |
 
 ## Logging
 
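
For reference, probes from this table are selected at run time with garak's `--probes` option; a family name such as `encoding` runs every probe in that module, and omitting the option runs the full probe set. A minimal sketch of such an invocation (the `huggingface` generator and `gpt2` target here are illustrative choices, not taken from the diff above):

```
python -m garak --model_type huggingface --model_name gpt2 --probes encoding
```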