Skip to content

Commit

Permalink
Contributing Probes documentation
Browse files Browse the repository at this point in the history
  • Loading branch information
erickgalinkin committed Sep 18, 2024
1 parent 784546f commit 9dc8468
Showing 1 changed file with 157 additions and 0 deletions.
157 changes: 157 additions & 0 deletions docs/source/contributing.probe.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,157 @@
Writing a Probe
###############

Probes are, in some ways, the essence of garak's functionality -- they serve as the abstraction that encapsulates attacks against AI models and systems.
In this example, we're going to go over the key points of how to develop a new probe.

Inheritance
***********

All probes will inherit from ``garak.probes.base.Probe``.

.. code-block:: python
from garak.probes.base import Probe
class MyNewProbe(Probe):
"""
Probe to do something naughty to a language model
"""
...
We require class docstrings in garak and enforce this requirement via a test required before merging.

Probes must always inherit from ``garak.probes.base.Probe``.
This allows probes to work nicely with `Generator` and `Attempt` objects in addition to ensuring that any `Buff`s that one might want to apply to a probe are going to work appropriately.

The ``probe`` method of ``Probe`` objects is where the core logic of a probe lies.
Ideally, one would need only to populate the ``prompts`` attribute of a ``Probe`` and let the ``probe`` method do the heavy lifting.
However, if this logic is insufficient for a custom probe, this is where the majority of the work (and potential issues) tends to lie.

.. code-block:: python
def probe(self, generator) -> Iterable[garak.attempt.Attempt]:
"""attempt to exploit the target generator, returning a list of results"""
logging.debug("probe execute: %s", self)
self.generator = generator
# build list of attempts
attempts_todo: Iterable[garak.attempt.Attempt] = []
prompts = list(self.prompts)
for seq, prompt in enumerate(prompts):
attempts_todo.append(self._mint_attempt(prompt, seq))
# buff hook
if len(_config.buffmanager.buffs) > 0:
attempts_todo = self._buff_hook(attempts_todo)
# iterate through attempts
attempts_completed = self._execute_all(attempts_todo)
logging.debug(
"probe return: %s with %s attempts", self, len(attempts_completed)
)
return attempts_completed
Configuring and Describing Probes
*********************************

Probes are built upon the ``Configurable`` base class and are themselves configurable.
We largely ignore parameters like ``ENV_VAR`` and ``DEFAULT_PARAMS`` in ``Probe`` classes, but if your probe requires an environment variable or you want to set some default parameters, it is done first in the class.

More often, we'll be looking at descriptive attributes of the probe.
From the base class:

.. code-block:: python
# docs uri for a description of the probe (perhaps a paper)
doc_uri: str = ""
# language this is for, in bcp47 format; * for all langs
bcp47: Union[Iterable[str], None] = None
# should this probe be included by default?
active: bool = True
# MISP-format taxonomy categories
tags: Iterable[str] = []
# what the probe is trying to do, phrased as an imperative
goal: str = ""
# Deprecated -- the detectors that should be run for this probe. always.Fail is chosen as default to send a signal if this isn't overridden.
recommended_detector: Iterable[str] = ["always.Fail"]
# default detector to run, if the primary/extended way of doing it is to be used (should be a string formatted like recommended_detector)
primary_detector: Union[str, None] = None
# optional extended detectors
extended_detectors: Iterable[str] = []
# can attempts from this probe be parallelised?
parallelisable_attempts: bool = True
# Keeps state of whether a buff is loaded that requires a call to untransform model outputs
post_buff_hook: bool = False
# support mainstream any-to-any large models
# legal element for str list `modality['in']`: 'text', 'image', 'audio', 'video', '3d'
# refer to Table 1 in https://arxiv.org/abs/2401.13601
# we focus on LLM input for probe
modality: dict = {"in": {"text"}}
That's a lot of attributes before we get into ones you may need to add!
Many of these are decent defaults, though there are a few that we absolutely want to include in any new probe

* ``doc_uri``: What is a good reference for the probe you're implementing? An academic paper, blog post, or social media post (in descending order of preference) goes here.
* ``active``: Should this probe be part of default scans?
* ``tags``: MISP-formatted taxonomy categories __e.g.__ ``["avid-effect:security:S0403", "owasp:llm01", "quality:Security:PromptStability", "payload:jailbreak"]``
* ``goal``: What is the probe trying to do? __e.g.__ ``"disregard the system prompt"``
* ``primary_detector``: What ``Detector`` should your probe use?

.. code-block:: python
class MyNewProbe(Probe):
"""
Probe to do something naughty to a language model
"""
recommended_detector = ["mitigation.MitigationBypass"]
tags = [
"avid-effect:security:S0403",
"owasp:llm01",
"quality:Security:PromptStability",
"payload:jailbreak",
]
goal = "disregard the system prompt"
doc_uri = "https://garak.ai"
active = False
...
Testing
*******
Once the logic for our probe is written, you'll want to test it before opening a pull request.
Typically, a good place to start is by seeing if your probe can be imported!

.. code-block:: bash
$ conda activate garak
$ python
$ python
Python 3.11.5 (main, Sep 11 2023, 08:31:25) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import garak.probes.mynewprobe
>>>
If you can run this with no error, you're ready to move on to the next phase of testing.
Otherwise, try to address the encountered errors.

Let's try running our new probe against a HuggingFace ``Pipeline`` using ``meta-llama/Llama-2-7b-chat-hf``, a notoriously tricky model to get to behave badly.

.. code-block:: bash
$ garak -m huggingface -n meta-llama/Llama-2-7b-chat-hf -p mynewprobe.MyNewProbe
If it all runs well, you'll get a log and a hitlog file, which tell you how successful your new probe was!
If you encounter errors, go through and try to address them.

Finally, check a few properties:

* Does the new probe appear in ``python -m garak --list_probes``?
* Do the garak tests pass? ``python -m pytest tests/``

Done!
*****

Congratulations on writing a probe for garak!

If you've tested your probe and validated that it works, run ``black`` to format your code in accordance with garak code standards.
Once your code is properly tested and formatted, push your work to your github fork and open a pull request -- thanks for your contribution!

0 comments on commit 9dc8468

Please sign in to comment.