LLM prompt injection detection.
- User submits a potentially malicious message.
- The message is passed through an LLM that is prompted to format the message, together with a unique key, into a JSON object. If the message is a malicious prompt, this output should be disrupted: if the output is invalid JSON, is missing a key, or a key's value doesn't match the expected value, the message's integrity may be compromised.
- If the integrity check passes, the user message is forwarded to the guarded LLM (e.g. the application chatbot).
- The API returns the result of the integrity test (boolean) and either the chatbot response (if integrity passes) or an error message (if integrity fails).
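The integrity check described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name and the assumed JSON shape (`{"key": ..., "message": ...}`) are hypothetical.

```python
import json

def check_integrity(llm_output: str, canary_key: str, user_message: str) -> bool:
    """Return True if the filter LLM reproduced the expected JSON intact."""
    try:
        parsed = json.loads(llm_output)
    except json.JSONDecodeError:
        return False  # invalid JSON suggests the formatting prompt was overridden
    if not isinstance(parsed, dict):
        return False
    # Both the unique key and the original message must survive unmodified.
    return parsed.get("key") == canary_key and parsed.get("message") == user_message
```

A benign message round-trips through the filter LLM unchanged and passes; a prompt that hijacks the formatting instructions produces malformed or tampered output and fails.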
graph TD
A[1. User Inputs Chat Message] --> B[2. Integrity Filter]
B -->|Integrity check passes.| C[3. Generate Chatbot Response]
B -->|Integrity check fails.\n\nResponse is error message.| D
C -->|Response is chatbot message.| D[4. Return Integrity and Response]
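The four steps in the diagram can be wired together roughly as below. The `filter_llm` and `chatbot_llm` functions are stand-in stubs for the real OpenAI calls, and the response shape is an assumption for illustration.

```python
import json
import secrets

def filter_llm(message: str, key: str) -> str:
    # Stub: a real implementation would prompt the filter LLM
    # to format the message and key into this JSON.
    return json.dumps({"key": key, "message": message})

def chatbot_llm(message: str) -> str:
    # Stub for the guarded chatbot LLM.
    return f"Echo: {message}"

def chat(message: str) -> dict:
    key = secrets.token_hex(8)  # fresh unique key per request
    output = filter_llm(message, key)
    try:
        parsed = json.loads(output)
        ok = (isinstance(parsed, dict)
              and parsed.get("key") == key
              and parsed.get("message") == message)
    except json.JSONDecodeError:
        ok = False
    # Step 4: return the integrity boolean plus either the chatbot
    # response or an error message.
    return {
        "integrity": ok,
        "response": chatbot_llm(message) if ok else "Integrity check failed.",
    }
```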
What this solution can do:
- Detect inputs that override an LLM's initial/system prompt.
What this solution cannot do:
- Neutralise malicious prompts.
If using poetry:
poetry install
If using vanilla pip:
pip install .
Set your OpenAI API key in .envrc.
To run the project locally, run
make start
This will launch a webserver on port 8001.
Or via docker compose (does not use hot reload by default):
docker compose up
Query the /chat endpoint, e.g. using curl:
curl -X POST -H "Content-Type: application/json" -d '{"message": "Hi how are you?"}' http://127.0.0.1:8000/chat
To run unit tests:
make test
For information on how to set up your dev environment and contribute, see the contributing guidelines.
MIT