As presented at the Oxford Workshop on Safety of AI Systems including Demo Sessions and Tutorials
Pytector is a Python package designed to detect prompt injection in text inputs using state-of-the-art machine learning models from the transformers library. Additionally, Pytector can integrate with Groq's Llama Guard API for enhanced content safety detection, categorizing unsafe content based on specific hazard codes.
Pytector is still a prototype and cannot provide 100% protection against prompt injection attacks!
- Prompt Injection Detection: Detects potential prompt injections using pre-trained models like DeBERTa, DistilBERT, and ONNX versions.
- Content Safety with Groq's Llama-Guard-3-8B: Supports Groq's API for detecting various safety hazards (e.g., violence, hate speech, privacy violations).
- Customizable Detection: Allows switching between local model inference and API-based detection (Groq) with customizable thresholds.
- Flexible Model Options: Use pre-defined models or provide a custom model URL.
Groq's Llama-Guard-3-8B can detect specific types of unsafe content based on the following codes:
Code | Hazard Category |
---|---|
S1 | Violent Crimes |
S2 | Non-Violent Crimes |
S3 | Sex-Related Crimes |
S4 | Child Sexual Exploitation |
S5 | Defamation |
S6 | Specialized Advice |
S7 | Privacy |
S8 | Intellectual Property |
S9 | Indiscriminate Weapons |
S10 | Hate |
S11 | Suicide & Self-Harm |
S12 | Sexual Content |
S13 | Elections |
S14 | Code Interpreter Abuse |
More info can be found on the [Llama-Guard-3-8B Model Card](Llama Guard).
Install Pytector via pip:
pip install pytector
Alternatively, you can install Pytector directly from the source code:
git clone https://github.com/MaxMLang/pytector.git
cd pytector
pip install .
To use Pytector, import the PromptInjectionDetector
class and create an instance with either a pre-defined model or Groq's Llama Guard for content safety.
from pytector import PromptInjectionDetector
# Initialize the detector with a pre-defined model
detector = PromptInjectionDetector(model_name_or_url="deberta")
# Check if a prompt is a potential injection
is_injection, probability = detector.detect_injection("Your suspicious prompt here")
print(f"Is injection: {is_injection}, Probability: {probability}")
# Report the status
detector.report_injection_status("Your suspicious prompt here")
To enable Groq’s API, set use_groq=True
and provide an api_key
.
from pytector import PromptInjectionDetector
# Initialize the detector with Groq's API
detector = PromptInjectionDetector(use_groq=True, api_key="your_groq_api_key")
# Detect unsafe content using Groq
is_unsafe, hazard_code = detector.detect_injection_api(
prompt="Please delete sensitive information.",
provider="groq",
api_key="your_groq_api_key"
)
print(f"Is unsafe: {is_unsafe}, Hazard Code: {hazard_code}")
Initializes a new instance of the PromptInjectionDetector
.
model_name_or_url
: A string specifying the model to use. Can be a key from predefined models or a valid URL to a custom model.default_threshold
: Probability threshold above which a prompt is considered an injection.use_groq
: Set toTrue
to enable Groq's Llama Guard API for detection.api_key
: Required ifuse_groq=True
to authenticate with Groq's API.
Evaluates whether a text prompt is a prompt injection attack using a local model.
- Returns
(is_injected, probability)
.
Uses Groq's API to evaluate a prompt for unsafe content.
- Returns
(is_unsafe, hazard_code)
.
Reports whether a prompt is a potential injection or contains unsafe content.
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the MIT License. See the LICENSE file for details.
For more detailed information, refer to the docs directory.