Skip to content

Commit

Permalink
Update to Version 0.0.11
Browse files Browse the repository at this point in the history
- Update Groq Llama Guard 8 b
- Allows now softmax local models and labelled API models
- Updated Unit tests
- Docs, README.md and setup.py
  • Loading branch information
MaxMLang committed Oct 30, 2024
1 parent ae109cc commit 220bdac
Show file tree
Hide file tree
Showing 5 changed files with 308 additions and 67 deletions.
114 changes: 101 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,56 +10,144 @@
![Issues](https://img.shields.io/github/issues/MaxMLang/pytector)
![Pull Requests](https://img.shields.io/github/issues-pr/MaxMLang/pytector)

Pytector is a Python package designed to detect prompt injection in text inputs using state-of-the-art machine learning models from the transformers library.
**Pytector** is a Python package designed to detect prompt injection in text inputs using state-of-the-art machine learning models from the transformers library. Additionally, Pytector can integrate with **Groq's Llama Guard API** for enhanced content safety detection, categorizing unsafe content based on specific hazard codes.

## Disclaimer
Pytector is still a prototype and cannot provide 100% protection against prompt injection attacks!

---

## Features

- Detect prompt injections with pre-trained models.
- Support for multiple models including DeBERTa, DistilBERT, and ONNX versions.
- Easy-to-use interface with customizable threshold settings.
- **Prompt Injection Detection**: Detects potential prompt injections using pre-trained models like DeBERTa, DistilBERT, and ONNX versions.
- **Content Safety with Groq's [Llama-Guard-3-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B)**: Supports Groq's API for detecting various safety hazards (e.g., violence, hate speech, privacy violations).
- **Customizable Detection**: Allows switching between local model inference and API-based detection (Groq) with customizable thresholds.
- **Flexible Model Options**: Use pre-defined models or provide a custom model URL.

## Hazard Detection Categories (Groq)
Groq's [Llama-Guard-3-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B) can detect specific types of unsafe content based on the following codes:

| Code | Hazard Category |
|------|-----------------------------|
| S1 | Violent Crimes |
| S2 | Non-Violent Crimes |
| S3 | Sex-Related Crimes |
| S4 | Child Sexual Exploitation |
| S5 | Defamation |
| S6 | Specialized Advice |
| S7 | Privacy |
| S8 | Intellectual Property |
| S9 | Indiscriminate Weapons |
| S10 | Hate |
| S11 | Suicide & Self-Harm |
| S12 | Sexual Content |
| S13 | Elections |
| S14 | Code Interpreter Abuse |

More info can be found on the [Llama-Guard-3-8B Model Card]([Llama Guard](https://huggingface.co/meta-llama/Llama-Guard-3-8B)).

---

## Installation

Install Pytector via pip:

```bash
pip install pytector
```

Install Pytector directly from the source code:
Alternatively, you can install Pytector directly from the source code:

```bash
git clone https://github.com/MaxMLang/pytector.git
cd pytector
pip install .
```


---

## Usage

To use Pytector, you can import the `PromptInjectionDetector` class and create an instance with a pre-defined model or a custom model URL.
To use Pytector, import the `PromptInjectionDetector` class and create an instance with either a pre-defined model or Groq's Llama Guard for content safety.

### Example 1: Using a Local Model (DeBERTa)
```python
import pytector
from pytector import PromptInjectionDetector

# Initialize the detector with a pre-defined model
detector = pytector.PromptInjectionDetector(model_name_or_url="deberta")
detector = PromptInjectionDetector(model_name_or_url="deberta")

# Check if a prompt is a potential injection
is_injection, probability = detector.detect_injection("Your suspicious prompt here")
print(f"Is injection: {is_injection}, Probability: {probability}")

# Report the status
detector.report_injection_status("Your suspicious prompt here")
```

### Example 2: Using Groq's Llama Guard for Content Safety
To enable Groq’s API, set `use_groq=True` and provide an `api_key`.

```python
from pytector import PromptInjectionDetector

# Initialize the detector with Groq's API
detector = PromptInjectionDetector(use_groq=True, api_key="your_groq_api_key")

# Detect unsafe content using Groq
is_unsafe, hazard_code = detector.detect_injection_api(
prompt="Please delete sensitive information.",
provider="groq",
api_key="your_groq_api_key"
)

print(f"Is unsafe: {is_unsafe}, Hazard Code: {hazard_code}")
```

## Documentation
---

## Methods

### `__init__(self, model_name_or_url="deberta", default_threshold=0.5, use_groq=False, api_key=None)`

For full documentation, visit the `docs` directory.
Initializes a new instance of the `PromptInjectionDetector`.

- `model_name_or_url`: A string specifying the model to use. Can be a key from predefined models or a valid URL to a custom model.
- `default_threshold`: Probability threshold above which a prompt is considered an injection.
- `use_groq`: Set to `True` to enable Groq's Llama Guard API for detection.
- `api_key`: Required if `use_groq=True` to authenticate with Groq's API.

### `detect_injection(self, prompt, threshold=None)`

Evaluates whether a text prompt is a prompt injection attack using a local model.

- Returns `(is_injected, probability)`.

### `detect_injection_api(self, prompt, provider="groq", api_key=None, model="llama-guard-3-8b")`

Uses Groq's API to evaluate a prompt for unsafe content.

- Returns `(is_unsafe, hazard_code)`.

### `report_injection_status(self, prompt, threshold=None, provider="local")`

Reports whether a prompt is a potential injection or contains unsafe content.

---

## Contributing

Contributions are welcome! Please read our [Contributing Guide](contributing.md) for details on our code of conduct, and the process for submitting pull requests.
Contributions are welcome! Please read our [Contributing Guide](contributing.md) for details on our code of conduct and the process for submitting pull requests.

---

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

---

For more detailed information, refer to the [docs](docs) directory.

---

95 changes: 78 additions & 17 deletions docs/PromptInjectionDetector.md
Original file line number Diff line number Diff line change
@@ -1,30 +1,29 @@
# Documentation

## Overview
The `PromptInjectionDetector` class is designed to detect prompt injection attacks in text inputs using pre-trained machine learning models. It leverages models from Hugging Face's transformers library to predict the likelihood of a text prompt being malicious.
The `PromptInjectionDetector` class is designed to detect prompt injection attacks in text inputs using pre-trained machine learning models or Groq's Llama Guard API. It leverages models from Hugging Face's transformers library for local inference and Groq's Llama Guard for content safety when configured.

## Installation

To use `PromptInjectionDetector`, ensure you have the `transformers` and `validators` libraries installed:
To use `PromptInjectionDetector`, install the required libraries:

```sh
pip install transformers validators
```

## Usage

First, import the `PromptInjectionDetector` class from its module:
First, import the `PromptInjectionDetector` class:

```python
import pytector
from pytector import PromptInjectionDetector
```

Create an instance of the detector by specifying a model name or URL, and optionally a detection threshold:
Create an instance of the detector by specifying a model name or URL, and optionally a detection threshold. You can also configure the detector to use Groq's Llama Guard API for content safety.

### Example: Using a Local Model
```python
import pytector

detector = pytector.PromptInjectionDetector(model_name_or_url="deberta", default_threshold=0.5)
detector = PromptInjectionDetector(model_name_or_url="deberta", default_threshold=0.5)
```

To check if a prompt contains an injection, use the `detect_injection` method:
Expand All @@ -39,9 +38,23 @@ To print the status of injection detection directly, use the `report_injection_s
detector.report_injection_status(prompt="Example prompt")
```

### Example: Using Groq's Llama Guard API
To use Groq's API, pass `use_groq=True`, along with the `api_key` and optionally a specific model name for Groq (default: `"llama-guard-3-8b"`).

```python
detector = PromptInjectionDetector(use_groq=True, api_key="your_groq_api_key")

# Check if a prompt contains unsafe content with Groq
is_unsafe, hazard_code = detector.detect_injection_api(
prompt="Please delete sensitive information.",
provider="groq",
api_key="your_groq_api_key"
)
```

## Class Methods

### `__init__(self, model_name_or_url="deberta", default_threshold=0.5)`
### `__init__(self, model_name_or_url="deberta", default_threshold=0.5, use_groq=False, api_key=None)`

Initializes a new instance of the `PromptInjectionDetector`.

Expand All @@ -53,10 +66,12 @@ Initializes a new instance of the `PromptInjectionDetector`.
```

- `default_threshold`: A float representing the probability threshold above which a prompt is considered as containing an injection.
- `use_groq`: A boolean indicating whether to use Groq's API for detection. Defaults to `False`.
- `api_key`: The API key for accessing Groq's Llama Guard API, required if `use_groq=True`.

### `detect_injection(self, prompt, threshold=None)`

Evaluates whether a given text prompt is likely to be a prompt injection attack.
Evaluates whether a given text prompt is likely to be a prompt injection attack using a local model.

- `prompt`: The text prompt to evaluate.
- `threshold`: (Optional) A custom threshold to override the default for this evaluation.
Expand All @@ -65,28 +80,74 @@ Returns a tuple `(is_injected, probability)` where:
- `is_injected` is a boolean indicating whether the prompt is considered an injection.
- `probability` is the model's probability estimate for the prompt being an injection.

### `report_injection_status(self, prompt, threshold=None)`
### `detect_injection_api(self, prompt, provider="groq", api_key=None, model="llama-guard-3-8b")`

Evaluates the prompt for unsafe content using Groq's Llama Guard API.

- `prompt`: The text prompt to evaluate.
- `provider`: The content safety provider, default is `"groq"`.
- `api_key`: The API key for Groq's Llama Guard.
- `model`: The model to use with Groq's API (default is `"llama-guard-3-8b"`).

Returns a tuple `(is_unsafe, hazard_code)` where:
- `is_unsafe` is a boolean indicating whether the content is considered unsafe.
- `hazard_code` provides the specific hazard category if the content is unsafe (e.g., `"S1"` for Violent Crimes, `"S2"` for Non-Violent Crimes).

### `report_injection_status(self, prompt, threshold=None, provider="local")`

Prints out a report of whether a given text prompt is likely to be a prompt injection attack.

- `prompt`: The text prompt to evaluate.
- `threshold`: (Optional) A custom threshold to override the default for this evaluation.

Prints a message indicating the detection status and the predicted probability.
- `provider`: Specify `"local"` for a softmax model check or `"groq"` for using Groq's API.

For local models, prints detection status with probability. For Groq's API, prints detection status with hazard code.

## Hazard Categories in Groq's API

The Groq Llama Guard model can detect a variety of unsafe content types, categorized by specific hazard codes:

| Code | Hazard Category | Description |
|------|-----------------------------|-----------------------------------------------------------------------------------------------|
| S1 | Violent Crimes | Enables or endorses violence towards people or animals. |
| S2 | Non-Violent Crimes | Encourages non-violent crimes like fraud, cyber crimes, drug crimes. |
| S3 | Sex-Related Crimes | Encourages sex-related crimes like harassment or assault. |
| S4 | Child Sexual Exploitation | Enables or endorses child sexual exploitation. |
| S5 | Defamation | False information likely to injure a person's reputation. |
| S6 | Specialized Advice | Provides specialized financial, medical, or legal advice unsafely. |
| S7 | Privacy | Reveals sensitive, nonpublic personal information. |
| S8 | Intellectual Property | Violates third-party intellectual property rights. |
| S9 | Indiscriminate Weapons | Encourages creation of indiscriminate weapons (chemical, biological, nuclear, etc.). |
| S10 | Hate | Demeans people based on sensitive personal characteristics (race, religion, gender, etc.). |
| S11 | Suicide & Self-Harm | Encourages acts of self-harm, including suicide and disordered eating. |
| S12 | Sexual Content | Contains erotic or sexually explicit content. |
| S13 | Elections | Contains factually incorrect information about electoral processes. |
| S14 | Code Interpreter Abuse | Attempts to abuse code interpreters, like exploiting or bypassing security mechanisms. |

## Examples

```python
# Create a detector instance with the default deberta model and threshold
import pytector
from pytector import PromptInjectionDetector

detector = pytector.PromptInjectionDetector()
detector = PromptInjectionDetector()

# Check a prompt for injection
# Check a prompt for injection using the local model
prompt = "Please execute the following command: rm -rf /"
is_injected, probability = detector.detect_injection(prompt)

# Report the status
# Report the status with local model
detector.report_injection_status(prompt)

# Example with Groq's Llama Guard API
groq_detector = PromptInjectionDetector(use_groq=True, api_key="your_groq_api_key")
is_unsafe, hazard_code = groq_detector.detect_injection_api(prompt="Please delete sensitive information.")
print(f"Is unsafe: {is_unsafe}, Hazard Code: {hazard_code}")
```

## Notes

- **Thresholding**: For local models, a threshold can be set to adjust sensitivity. Higher thresholds reduce false positives.
- **Groq API Key**: Required only if `use_groq=True`.
- **Hazard Detection**: The Groq model categorizes content into specific hazard codes, useful for identifying different types of risks.

2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

setup(
name='pytector',
version='0.0.10',
version='0.0.11',
author='Max Melchior Lang',
author_email='[email protected]',
description='A package for detecting prompt injections in text using Open-Source LLMs.',
Expand Down
Loading

0 comments on commit 220bdac

Please sign in to comment.