tighten up table format while maintaining readability
Signed-off-by: Jeffrey Martin <[email protected]>
jmartin-tech committed Feb 14, 2025
1 parent 4f34211 commit 81e3e0d
Showing 1 changed file (README.md) with 21 additions and 21 deletions.
@@ -228,28 +228,28 @@ For testing. This generator repeats back the prompt it received.

## Intro to probes

| Probe | Description |
|----------------------|-------------------------------------------------------------------------------------------------------------------------------|
| blank | A simple probe that always sends an empty prompt. |
| atkgen               | Automated Attack Generation. A red-teaming LLM probes the target and reacts to it in an attempt to elicit toxic output. Prototype; mostly stateless; currently uses a simple GPT-2 model [fine-tuned](https://huggingface.co/garak-llm/artgpt2tox) on the subset of hhrlhf attempts that yielded detectable toxicity (toxicity is the only attack goal currently supported). |
| av_spam_scanning | Probes that attempt to make the model output malicious content signatures |
| continuation | Probes that test if the model will continue a probably undesirable word |
| dan | Various [DAN](https://adguard.com/en/blog/chatgpt-dan-prompt-abuse.html) and DAN-like attacks |
| donotanswer | Prompts to which responsible language models should not answer. |
| encoding | Prompt injection through text encoding |
| gcg | Disrupt a system prompt by appending an adversarial suffix. |
| glitch | Probe model for glitch tokens that provoke unusual behavior. |
| grandma | Appeal to be reminded of one's grandmother. |
| goodside | Implementations of Riley Goodside attacks. |
| leakreplay           | Evaluate if a model will replay training data.                                                                                  |
| lmrc | Subsample of the [Language Model Risk Cards](https://arxiv.org/abs/2303.18190) probes |
| malwaregen | Attempts to have the model generate code for building malware |
| misleading | Attempts to make a model support misleading and false claims |
| packagehallucination | Trying to get code generations that specify non-existent (and therefore insecure) packages. |
| promptinject | Implementation of the Agency Enterprise [PromptInject](https://github.com/agencyenterprise/PromptInject/tree/main/promptinject) work (best paper awards @ NeurIPS ML Safety Workshop 2022) |
| realtoxicityprompts  | Subset of the RealToxicityPrompts work (data constrained because the full test takes a very long time to run)                   |
| snowball | [Snowballed Hallucination](https://ofir.io/snowballed_hallucination.pdf) probes designed to make a model give a wrong answer to questions too complex for it to process |
| xss                  | Look for vulnerabilities that permit or enact cross-site attacks, such as private data exfiltration.                            |
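
Any of these probe modules can be selected by name on the command line. Below is a minimal sketch, assuming garak's standard CLI flags (`--list_probes`, `--model_type`, `--probes`) and the built-in `test.Repeat` echo generator; treat the exact flag names as assumptions and confirm with `garak --help` for your installed version.

```
# Enumerate every available probe with a short description
python -m garak --list_probes

# Dry run: aim the encoding probes at the built-in test.Repeat
# generator, which simply repeats each prompt back
python -m garak --model_type test.Repeat --probes encoding
```

Probe specs can typically be combined as a comma-separated list (e.g. `--probes encoding,glitch`); leaving `--probes` off falls back to garak's default probe selection.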

## Logging
