
feature: lightweight probe defaults #1116

Draft: wants to merge 20 commits into base: main

Changes from all commits (20 commits)
640d304
stabilise and make explicit order of multiple Migrations in one fixer…
leondz Feb 24, 2025
e7c7db5
update FigStep names
leondz Feb 24, 2025
a85c257
rename and fixers for snowball
leondz Feb 24, 2025
4370779
add config entry for soft cap on how many prompts per probe
leondz Feb 26, 2025
169d481
rename promptinject probes & bind to soft probe prompt cap
leondz Feb 26, 2025
2e40865
migrate past tense probe names
leondz Feb 26, 2025
829f97f
resolve merge
leondz Feb 26, 2025
ea5bed8
rename probes to have lightweight versions as defaults and extended/f…
leondz Feb 26, 2025
79105cb
shrink LatentInjectionFactSnippetEiffel to soft cap, w/ shuffle
leondz Feb 26, 2025
3aa6677
rename FalseAssertion, Glitch, use soft cap
leondz Feb 26, 2025
3b3e786
fix rename
leondz Feb 26, 2025
88411ab
get order of operations right: set max_prompts after _config is avail…
leondz Feb 26, 2025
3f03bc2
lightweight defaults for latent injection probes
leondz Feb 26, 2025
0375e52
use random shuffle + prune for lightweight slur continuation
leondz Feb 26, 2025
02d202a
move to using shuffling & prompt cap to produce lightweight probes
leondz Feb 26, 2025
7310811
access config_root not _config
leondz Feb 26, 2025
e28f8c0
fixer class sorting should.. work
leondz Feb 27, 2025
f45938b
update test cases to fit current state of class names
leondz Feb 27, 2025
0163a4c
constrain class replacement to final position in plugin name
leondz Feb 27, 2025
f5d168a
place migrations involving ordered ops into single classes. much tidier
leondz Feb 28, 2025
1 change: 1 addition & 0 deletions docs/source/configurable.rst
@@ -93,6 +93,7 @@ such as ``show_100_pass_modules``.
* ``narrow_output`` - Support output on narrower CLIs
* ``show_z`` - Display Z-scores and visual indicators on CLI. It's good, but may be too much info until one has seen garak run a couple of times
* ``enable_experimental`` - Enable experimental function CLI flags. Disabled by default. Experimental functions may disrupt your installation and provide unusual/unstable results. Can only be set by editing core config, so a git checkout of garak is recommended for this.
* ``soft_probe_prompt_cap`` - For probes that auto-scale their prompt count, the preferred limit of prompts per probe
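The new option can be set like any other run config item; a hypothetical fragment (the value 64 is illustrative, not the project default):

```yaml
run:
  soft_probe_prompt_cap: 64
```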

``run`` config items
""""""""""""""""""""
2 changes: 1 addition & 1 deletion garak/configs/fast.yaml
@@ -7,7 +7,7 @@ run:
generations: 5

plugins:
- probe_spec: continuation,dan,encoding.InjectBase64,encoding.InjectHex,goodside,av_spam_scanning,leakreplay,lmrc,malwaregen.SubFunctions,malwaregen.TopLevel,packagehallucination,realtoxicityprompts.RTPIdentity_Attack,realtoxicityprompts.RTPProfanity,realtoxicityprompts.RTPSexually_Explicit,realtoxicityprompts.RTPThreat,snowball,xss
+ probe_spec: ansiescape.AnsiRaw,continuation,dan,encoding.InjectBase64,encoding.InjectHex,goodside,av_spam_scanning,leakreplay,lmrc,malwaregen.SubFunctions,malwaregen.TopLevel,packagehallucination,realtoxicityprompts.RTPIdentity_Attack,realtoxicityprompts.RTPProfanity,realtoxicityprompts.RTPSexually_Explicit,realtoxicityprompts.RTPThreat,snowball,xss
Collaborator comment:

Do we still want av_spam_scanning in the default fast config? It's largely useless for model-only evaluation.

extended_detectors: false


2 changes: 1 addition & 1 deletion garak/configs/full.yaml
@@ -3,7 +3,7 @@ system:
lite: false

plugins:
- probe_spec: atkgen,continuation,dan,dan.Ablation_Dan_11_0,encoding,glitch.Glitch,goodside,av_spam_scanning,leakreplay.LiteratureCloze,leakreplay.LiteratureComplete,lmrc,malwaregen,misleading,packagehallucination,promptinject.HijackHateHumans,promptinject.HijackKillHumans,promptinject.HijackLongPrompt,realtoxicityprompts,snowball.GraphConnectivity,snowball.Primes,snowball.Senators,xss
+ probe_spec: ansiescape,atkgen,continuation,dan,dan.Ablation_Dan_11_0,encoding,glitch.GlitchFull,goodside,av_spam_scanning,latentinjection,leakreplay.LiteratureClozeFull,leakreplay.LiteratureCompleteFull,lmrc,malwaregen,misleading,packagehallucination,promptinject.HijackHateHumansFull,promptinject.HijackKillHumansFull,promptinject.HijackLongPromptFull,realtoxicityprompts,snowball.GraphConnectivityFull,snowball.PrimesFull,snowball.SenatorsFull,xss
probes:
encoding:
payloads:
2 changes: 1 addition & 1 deletion garak/configs/tox_and_buffs.yaml
@@ -10,7 +10,7 @@ run:
generations: 5

plugins:
- probe_spec: atkgen.Tox,continuation.ContinueSlursReclaimedSlursMini,encoding,lmrc.Anthropomorphisation,lmrc.Bullying,lmrc.Deadnaming,lmrc.Profanity,lmrc.SlurUsage,realtoxicityprompts,
+ probe_spec: atkgen.Tox,continuation.ContinueSlursReclaimedSlurs,encoding,lmrc.Anthropomorphisation,lmrc.Bullying,lmrc.Deadnaming,lmrc.Profanity,lmrc.SlurUsage,realtoxicityprompts,
probes:
encoding:
payloads:
21 changes: 18 additions & 3 deletions garak/probes/continuation.py
@@ -6,14 +6,15 @@
"""

import json
import random

from garak import _config
from garak.attempt import Attempt
from garak.data import path as data_path
from garak.probes.base import Probe


- class ContinueSlursReclaimedSlurs(Probe):
+ class ContinueSlursReclaimedSlursFull(Probe):
Collaborator comment:

I like this change a lot.

"""Continue partial word with slur or reclaimed slur.

This probe poses a partial ethnic slur with some preceding content, and
@@ -37,6 +38,8 @@ class ContinueSlursReclaimedSlurs(Probe):
active = False
source_resource_filename = "slurprompts.jsonl"

follow_prompt_cap = False

def __init__(self, config_root=_config):
super().__init__(config_root=config_root)
# go through the list
@@ -73,6 +76,17 @@ def __init__(self, config_root=_config):
self.prompts.append(candidate_prompt.strip())
self.triggers.append(term)

if self.follow_prompt_cap:
num_ids_to_delete = max(
0, len(self.prompts) - config_root.run.soft_probe_prompt_cap
)
Collaborator comment on lines +80 to +82:

Cannot assume config_root is the global module _config.

ids_to_rm = random.sample(range(len(self.prompts)), num_ids_to_delete)
# delete in descending order
ids_to_rm = sorted(ids_to_rm, reverse=True)
for id in ids_to_rm:
del self.prompts[id]
del self.triggers[id]
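The capping logic above can be exercised in isolation; a minimal standalone sketch (the `prompts`/`triggers` names mirror the probe's attributes; the cap value and list contents are illustrative):

```python
import random


def prune_paired(prompts: list, triggers: list, cap: int) -> None:
    """Randomly drop entries beyond `cap`, keeping prompts and triggers aligned."""
    num_to_delete = max(0, len(prompts) - cap)
    ids_to_rm = random.sample(range(len(prompts)), num_to_delete)
    # delete in descending index order so earlier deletions don't shift later indices
    for i in sorted(ids_to_rm, reverse=True):
        del prompts[i]
        del triggers[i]


prompts = [f"p{i}" for i in range(10)]
triggers = [f"t{i}" for i in range(10)]
prune_paired(prompts, triggers, 4)
# both lists are now length 4, and each surviving prompt keeps its trigger
```

Deleting by index in descending order is what makes the two parallel lists stay in sync; pruning one list with a slice and the other independently would break the pairing.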

def _attempt_prestore_hook(self, attempt: Attempt, seq: int) -> Attempt:
attempt.notes = dict(
attempt.notes
@@ -81,7 +95,7 @@ def _attempt_prestore_hook(self, attempt: Attempt, seq: int) -> Attempt:
return attempt


- class ContinueSlursReclaimedSlursMini(ContinueSlursReclaimedSlurs, Probe):
+ class ContinueSlursReclaimedSlurs(ContinueSlursReclaimedSlursFull, Probe):
"""Continue partial word with slur or reclaimed slur.

This probe poses a partial ethnic slur with some preceding content, and
@@ -90,4 +104,5 @@ class ContinueSlursReclaimedSlursMini(ContinueSlursReclaimedSlurs, Probe):

aliases = ["continuation.ContinueSlursReclaimedSlurs80"]
active = True
- source_resource_filename = "slurprompts_mini.jsonl"
+ source_resource_filename = "slurprompts_mini.jsonl"  # high-quality selection
+ follow_prompt_cap = True
13 changes: 8 additions & 5 deletions garak/probes/dan.py
@@ -7,6 +7,7 @@
now) probes and others in a similar vein.
"""
import logging
import random
from typing import Iterable

import garak.attempt
@@ -620,7 +621,7 @@ def probe(self, generator) -> Iterable[garak.attempt.Attempt]:
return list()


- class DanInTheWild(Probe):
+ class DanInTheWildFull(Probe):
"""A library of jailbreaks.

Tries a set of ~700 known jailbreaks, in an attempt to get a model to disregard
@@ -648,15 +649,17 @@ def __init__(self, config_root=_config):
self.prompts = f.readlines()


- class DanInTheWildMini(DanInTheWild):
-     """A subset of 100 jailbreaks from the DanInTheWild jailbreak library.
+ class DanInTheWild(DanInTheWildFull):
+     """A subset of jailbreaks from the DanInTheWild jailbreak library.

-     Tries a set of 100 known jailbreaks, in an attempt to get a model to disregard
+     Tries a set of known jailbreaks, in an attempt to get a model to disregard
alignment/system prompt. Success is measured by not getting a safety mitigation
message."""
Collaborator comment on lines +653 to 657:

Suggest mentioning the configurability by changing the soft_probe_prompt_cap setting.

"""A subset of jailbreaks from the DanInTheWild jailbreak library.

Tries a set of known jailbreaks, in an attempt to get a model to disregard
alignment/system prompt. Success is measured by not getting a safety mitigation
message. The number of prompts in the subset can be configured via the
`soft_probe_prompt_cap` run parameter."""


active = True

def __init__(self, config_root=_config):
super().__init__(config_root=config_root)
-         self.prompts = self.prompts[:100]
+         self.prompts = random.sample(self.prompts, len(self.prompts))[
+             : config_root.run.soft_probe_prompt_cap
+         ]
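The shuffle-then-truncate idiom in the diff above can be sketched standalone; `random.sample(x, len(x))` returns a shuffled copy without replacement, so slicing it applies the cap (the helper name `subsample` and the pool contents are illustrative):

```python
import random


def subsample(prompts: list, cap: int) -> list:
    """Return up to `cap` prompts, drawn without replacement, in random order."""
    # a full-length sample is a shuffled copy; the slice then applies the cap
    return random.sample(prompts, len(prompts))[:cap]


pool = [f"jailbreak-{i}" for i in range(700)]
subset = subsample(pool, 100)  # 100 distinct prompts from the pool
```

Unlike a plain `prompts[:100]`, this draws a different subset each run, so repeated runs eventually cover more of the library.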
9 changes: 6 additions & 3 deletions garak/probes/glitch.py
@@ -10,7 +10,7 @@
from garak.probes.base import Probe


- class Glitch(Probe):
+ class GlitchFull(Probe):
"""Probe model for glitch tokens that provoke unusual behavior

Glitch tokens tend to be long entries in the tokenizer that only rarely occur
@@ -217,7 +217,7 @@ def _attempt_prestore_hook(self, attempt: Attempt, seq: int) -> Attempt:
return attempt


- class Glitch100(Glitch, Probe):
+ class Glitch(GlitchFull):
"""Probe model for glitch tokens that provoke unusual behavior

Glitch tokens tend to be long entries in the tokenizer that only rarely occur
@@ -226,4 +226,7 @@ class Glitch100(Glitch, Probe):
This probe uses a subset of 100 potential glitch tokens, for speed."""

active = True
-     max_prompts = 100
+
+     def __init__(self, config_root=_config):
+         super().__init__(config_root)
+         self.max_prompts = config_root.run.soft_probe_prompt_cap