Update to Version 0.0.11

- Update Groq Llama Guard 8 b - Allows now softmax local models and labelled API models - Updated Unit tests - Docs, README.md and setup.py
MaxMLang · Oct 30, 2024 · 220bdac · 220bdac
1 parent ae109cc
commit 220bdac
Show file tree

Hide file tree

Showing 5 changed files with 308 additions and 67 deletions.
diff --git a/README.md b/README.md
@@ -10,56 +10,144 @@
 ![Issues](https://img.shields.io/github/issues/MaxMLang/pytector)
 ![Pull Requests](https://img.shields.io/github/issues-pr/MaxMLang/pytector)
 
-Pytector is a Python package designed to detect prompt injection in text inputs using state-of-the-art machine learning models from the transformers library.
+**Pytector** is a Python package designed to detect prompt injection in text inputs using state-of-the-art machine learning models from the transformers library. Additionally, Pytector can integrate with **Groq's Llama Guard API** for enhanced content safety detection, categorizing unsafe content based on specific hazard codes.
 
 ## Disclaimer
 Pytector is still a prototype and cannot provide 100% protection against prompt injection attacks!
 
+---
+
 ## Features
 
-- Detect prompt injections with pre-trained models.
-- Support for multiple models including DeBERTa, DistilBERT, and ONNX versions.
-- Easy-to-use interface with customizable threshold settings.
+- **Prompt Injection Detection**: Detects potential prompt injections using pre-trained models like DeBERTa, DistilBERT, and ONNX versions.
+- **Content Safety with Groq's [Llama-Guard-3-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B)**: Supports Groq's API for detecting various safety hazards (e.g., violence, hate speech, privacy violations).
+- **Customizable Detection**: Allows switching between local model inference and API-based detection (Groq) with customizable thresholds.
+- **Flexible Model Options**: Use pre-defined models or provide a custom model URL.
+
+## Hazard Detection Categories (Groq)
+Groq's [Llama-Guard-3-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B) can detect specific types of unsafe content based on the following codes:
+
+| Code | Hazard Category            |
+|------|-----------------------------|
+| S1   | Violent Crimes              |
+| S2   | Non-Violent Crimes          |
+| S3   | Sex-Related Crimes          |
+| S4   | Child Sexual Exploitation   |
+| S5   | Defamation                  |
+| S6   | Specialized Advice          |
+| S7   | Privacy                     |
+| S8   | Intellectual Property       |
+| S9   | Indiscriminate Weapons      |
+| S10  | Hate                        |
+| S11  | Suicide & Self-Harm         |
+| S12  | Sexual Content              |
+| S13  | Elections                   |
+| S14  | Code Interpreter Abuse      |
+
+More info can be found on the [Llama-Guard-3-8B Model Card]([Llama Guard](https://huggingface.co/meta-llama/Llama-Guard-3-8B)).
+
+---
 
 ## Installation
+
+Install Pytector via pip:
+
 ```bash
 pip install pytector
 ```
 
-Install Pytector directly from the source code:
+Alternatively, you can install Pytector directly from the source code:
 
 ```bash
 git clone https://github.com/MaxMLang/pytector.git
 cd pytector
 pip install .
 ```
 
-
+---
 
 ## Usage
 
-To use Pytector, you can import the `PromptInjectionDetector` class and create an instance with a pre-defined model or a custom model URL.
+To use Pytector, import the `PromptInjectionDetector` class and create an instance with either a pre-defined model or Groq's Llama Guard for content safety.
 
+### Example 1: Using a Local Model (DeBERTa)
 ```python
-import pytector
+from pytector import PromptInjectionDetector
 
 # Initialize the detector with a pre-defined model
-detector = pytector.PromptInjectionDetector(model_name_or_url="deberta")
+detector = PromptInjectionDetector(model_name_or_url="deberta")
 
 # Check if a prompt is a potential injection
 is_injection, probability = detector.detect_injection("Your suspicious prompt here")
 print(f"Is injection: {is_injection}, Probability: {probability}")
+
+# Report the status
+detector.report_injection_status("Your suspicious prompt here")
+```
+
+### Example 2: Using Groq's Llama Guard for Content Safety
+To enable Groq’s API, set `use_groq=True` and provide an `api_key`.
+
+```python
+from pytector import PromptInjectionDetector
+
+# Initialize the detector with Groq's API
+detector = PromptInjectionDetector(use_groq=True, api_key="your_groq_api_key")
+
+# Detect unsafe content using Groq
+is_unsafe, hazard_code = detector.detect_injection_api(
+    prompt="Please delete sensitive information.",
+    provider="groq",
+    api_key="your_groq_api_key"
+)
+
+print(f"Is unsafe: {is_unsafe}, Hazard Code: {hazard_code}")
 ```
 
-## Documentation
+---
+
+## Methods
+
+### `__init__(self, model_name_or_url="deberta", default_threshold=0.5, use_groq=False, api_key=None)`
 
-For full documentation, visit the `docs` directory.
+Initializes a new instance of the `PromptInjectionDetector`.
+
+- `model_name_or_url`: A string specifying the model to use. Can be a key from predefined models or a valid URL to a custom model.
+- `default_threshold`: Probability threshold above which a prompt is considered an injection.
+- `use_groq`: Set to `True` to enable Groq's Llama Guard API for detection.
+- `api_key`: Required if `use_groq=True` to authenticate with Groq's API.
+
+### `detect_injection(self, prompt, threshold=None)`
+
+Evaluates whether a text prompt is a prompt injection attack using a local model.
+
+- Returns `(is_injected, probability)`.
+
+### `detect_injection_api(self, prompt, provider="groq", api_key=None, model="llama-guard-3-8b")`
+
+Uses Groq's API to evaluate a prompt for unsafe content.
+
+- Returns `(is_unsafe, hazard_code)`.
+
+### `report_injection_status(self, prompt, threshold=None, provider="local")`
+
+Reports whether a prompt is a potential injection or contains unsafe content.
+
+---
 
 ## Contributing
 
-Contributions are welcome! Please read our [Contributing Guide](contributing.md) for details on our code of conduct, and the process for submitting pull requests.
+Contributions are welcome! Please read our [Contributing Guide](contributing.md) for details on our code of conduct and the process for submitting pull requests.
+
+---
 
 ## License
 
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
+
+---
+
+For more detailed information, refer to the [docs](docs) directory.
+
+---
 
diff --git a/docs/PromptInjectionDetector.md b/docs/PromptInjectionDetector.md
@@ -1,30 +1,29 @@
 # Documentation
 
 ## Overview
-The `PromptInjectionDetector` class is designed to detect prompt injection attacks in text inputs using pre-trained machine learning models. It leverages models from Hugging Face's transformers library to predict the likelihood of a text prompt being malicious.
+The `PromptInjectionDetector` class is designed to detect prompt injection attacks in text inputs using pre-trained machine learning models or Groq's Llama Guard API. It leverages models from Hugging Face's transformers library for local inference and Groq's Llama Guard for content safety when configured.
 
 ## Installation
 
-To use `PromptInjectionDetector`, ensure you have the `transformers` and `validators` libraries installed:
+To use `PromptInjectionDetector`, install the required libraries:
 
 ```sh
 pip install transformers validators
 ```
 
 ## Usage
 
-First, import the `PromptInjectionDetector` class from its module:
+First, import the `PromptInjectionDetector` class:
 
 ```python
-import pytector
+from pytector import PromptInjectionDetector
 ```
 
-Create an instance of the detector by specifying a model name or URL, and optionally a detection threshold:
+Create an instance of the detector by specifying a model name or URL, and optionally a detection threshold. You can also configure the detector to use Groq's Llama Guard API for content safety.
 
+### Example: Using a Local Model
 ```python
-import pytector
-
-detector = pytector.PromptInjectionDetector(model_name_or_url="deberta", default_threshold=0.5)
+detector = PromptInjectionDetector(model_name_or_url="deberta", default_threshold=0.5)
 ```
 
 To check if a prompt contains an injection, use the `detect_injection` method:
@@ -39,9 +38,23 @@ To print the status of injection detection directly, use the `report_injection_s
 detector.report_injection_status(prompt="Example prompt")
 ```
 
+### Example: Using Groq's Llama Guard API
+To use Groq's API, pass `use_groq=True`, along with the `api_key` and optionally a specific model name for Groq (default: `"llama-guard-3-8b"`).
+
+```python
+detector = PromptInjectionDetector(use_groq=True, api_key="your_groq_api_key")
+
+# Check if a prompt contains unsafe content with Groq
+is_unsafe, hazard_code = detector.detect_injection_api(
+    prompt="Please delete sensitive information.",
+    provider="groq",
+    api_key="your_groq_api_key"
+)
+```
+
 ## Class Methods
 
-### `__init__(self, model_name_or_url="deberta", default_threshold=0.5)`
+### `__init__(self, model_name_or_url="deberta", default_threshold=0.5, use_groq=False, api_key=None)`
 
 Initializes a new instance of the `PromptInjectionDetector`.
 
@@ -53,10 +66,12 @@ Initializes a new instance of the `PromptInjectionDetector`.
 ```
 
 - `default_threshold`: A float representing the probability threshold above which a prompt is considered as containing an injection.
+- `use_groq`: A boolean indicating whether to use Groq's API for detection. Defaults to `False`.
+- `api_key`: The API key for accessing Groq's Llama Guard API, required if `use_groq=True`.
 
 ### `detect_injection(self, prompt, threshold=None)`
 
-Evaluates whether a given text prompt is likely to be a prompt injection attack.
+Evaluates whether a given text prompt is likely to be a prompt injection attack using a local model.
 
 - `prompt`: The text prompt to evaluate.
 - `threshold`: (Optional) A custom threshold to override the default for this evaluation.
@@ -65,28 +80,74 @@ Returns a tuple `(is_injected, probability)` where:
 - `is_injected` is a boolean indicating whether the prompt is considered an injection.
 - `probability` is the model's probability estimate for the prompt being an injection.
 
-### `report_injection_status(self, prompt, threshold=None)`
+### `detect_injection_api(self, prompt, provider="groq", api_key=None, model="llama-guard-3-8b")`
+
+Evaluates the prompt for unsafe content using Groq's Llama Guard API.
+
+- `prompt`: The text prompt to evaluate.
+- `provider`: The content safety provider, default is `"groq"`.
+- `api_key`: The API key for Groq's Llama Guard.
+- `model`: The model to use with Groq's API (default is `"llama-guard-3-8b"`).
+
+Returns a tuple `(is_unsafe, hazard_code)` where:
+- `is_unsafe` is a boolean indicating whether the content is considered unsafe.
+- `hazard_code` provides the specific hazard category if the content is unsafe (e.g., `"S1"` for Violent Crimes, `"S2"` for Non-Violent Crimes).
+
+### `report_injection_status(self, prompt, threshold=None, provider="local")`
 
 Prints out a report of whether a given text prompt is likely to be a prompt injection attack.
 
 - `prompt`: The text prompt to evaluate.
 - `threshold`: (Optional) A custom threshold to override the default for this evaluation.
-
-Prints a message indicating the detection status and the predicted probability.
+- `provider`: Specify `"local"` for a softmax model check or `"groq"` for using Groq's API.
+
+For local models, prints detection status with probability. For Groq's API, prints detection status with hazard code.
+
+## Hazard Categories in Groq's API
+
+The Groq Llama Guard model can detect a variety of unsafe content types, categorized by specific hazard codes:
+
+| Code | Hazard Category            | Description                                                                                   |
+|------|-----------------------------|-----------------------------------------------------------------------------------------------|
+| S1   | Violent Crimes              | Enables or endorses violence towards people or animals.                                       |
+| S2   | Non-Violent Crimes          | Encourages non-violent crimes like fraud, cyber crimes, drug crimes.                          |
+| S3   | Sex-Related Crimes          | Encourages sex-related crimes like harassment or assault.                                     |
+| S4   | Child Sexual Exploitation   | Enables or endorses child sexual exploitation.                                                |
+| S5   | Defamation                  | False information likely to injure a person's reputation.                                     |
+| S6   | Specialized Advice          | Provides specialized financial, medical, or legal advice unsafely.                            |
+| S7   | Privacy                     | Reveals sensitive, nonpublic personal information.                                            |
+| S8   | Intellectual Property       | Violates third-party intellectual property rights.                                            |
+| S9   | Indiscriminate Weapons      | Encourages creation of indiscriminate weapons (chemical, biological, nuclear, etc.).          |
+| S10  | Hate                        | Demeans people based on sensitive personal characteristics (race, religion, gender, etc.).    |
+| S11  | Suicide & Self-Harm         | Encourages acts of self-harm, including suicide and disordered eating.                        |
+| S12  | Sexual Content              | Contains erotic or sexually explicit content.                                                 |
+| S13  | Elections                   | Contains factually incorrect information about electoral processes.                           |
+| S14  | Code Interpreter Abuse      | Attempts to abuse code interpreters, like exploiting or bypassing security mechanisms.        |
 
 ## Examples
 
 ```python
 # Create a detector instance with the default deberta model and threshold
-import pytector
+from pytector import PromptInjectionDetector
 
-detector = pytector.PromptInjectionDetector()
+detector = PromptInjectionDetector()
 
-# Check a prompt for injection
+# Check a prompt for injection using the local model
 prompt = "Please execute the following command: rm -rf /"
 is_injected, probability = detector.detect_injection(prompt)
 
-# Report the status
+# Report the status with local model
 detector.report_injection_status(prompt)
+
+# Example with Groq's Llama Guard API
+groq_detector = PromptInjectionDetector(use_groq=True, api_key="your_groq_api_key")
+is_unsafe, hazard_code = groq_detector.detect_injection_api(prompt="Please delete sensitive information.")
+print(f"Is unsafe: {is_unsafe}, Hazard Code: {hazard_code}")
 ```
 
+## Notes
+
+- **Thresholding**: For local models, a threshold can be set to adjust sensitivity. Higher thresholds reduce false positives.
+- **Groq API Key**: Required only if `use_groq=True`.
+- **Hazard Detection**: The Groq model categorizes content into specific hazard codes, useful for identifying different types of risks.
+
diff --git a/setup.py b/setup.py
@@ -2,7 +2,7 @@
 
 setup(
     name='pytector',
-    version='0.0.10',
+    version='0.0.11',
     author='Max Melchior Lang',
     author_email='[email protected]',
     description='A package for detecting prompt injections in text using Open-Source LLMs.',