Fengqing Jiang1,* ,
Zhangchen Xu1,* ,
Luyao Niu1,* ,
Zhen Xiang2 ,
Bhaskar Ramasubramanian3 ,
Bo Li4 ,
Radha Poovendran1
1University of Washington 2University of Illinois Urbana-Champaign
3Western Washington University 4University of Chicago
*Equal Contribution
Warning: This project contains model outputs that may be considered offensive
ACL 2024
We provide a demo prompt in the notebook `demo.ipynb` (also available as `demo_prompt.txt`) to show the effectiveness of ArtPrompt. This prompt successfully jailbreaks `gpt-4-0613`.
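As a minimal illustration of the masking idea behind ArtPrompt, the ASCII art for a masked word can be rendered with the python-art package that this project builds on; the word below is a placeholder, not the exact configuration used in the demo prompt.

```bash
# Illustrative sketch only: render a placeholder word as ASCII art with the
# python-art package (pip install art). The demo prompt embeds art like this
# in place of a masked word.
python -c "from art import text2art; print(text2art('MASK'))"
```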
- Make sure to set up your API key in `utils/model.py` (or via an environment variable) before running the experiments (see the example below).
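For example, if `utils/model.py` reads an OpenAI-style key from the environment (the exact variable name it expects is an assumption here, so check the file), you could export it before running anything:

```bash
# Assumed variable name; verify against utils/model.py before relying on it.
export OPENAI_API_KEY="your-key-here"
```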
Run the evaluation on the `vitc-s` dataset. For more details, please refer to `benchmark.py`.

```bash
# run from the ArtPrompt directory
python benchmark.py --model gpt-4-0613 --task s
```
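The paper also evaluates on a larger dataset, `vitc-l`; assuming `benchmark.py` exposes it through the same `--task` flag (the exact flag value is not confirmed in this README), the invocation would look like:

```bash
# Hypothetical: run the vitc-l evaluation, assuming --task l selects it.
python benchmark.py --model gpt-4-0613 --task l
```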
Run the jailbreak attack with ArtPrompt. For more details, please refer to `baseline.py`.

```bash
cd jailbreak
python baseline.py --model gpt-4-0613 --tmodel gpt-3.5-turbo-0613
```
- You can use the `--mp` argument to speed up inference by parallelizing across the available CPU cores on your machine.
- The `--ps` flag sets the font used by our method. The font-name sets used for evaluation are listed in our paper; please refer to Appendix A.3 for details. Specifically, the top-1 setup in Table 3 uses `vitc-h-gen`, and the ensemble uses `vitc-h-gen`/`alphabet`/`keyboard`/`cards`/`letters`/`puzzle`; run each font to generate its individual result, then use the `ensemble_eval.ipynb` notebook to obtain the ensemble evaluation results (see the sketch after this list). The top-1 setup may change for different victim-model setups; we determined the top-1 font based on the average performance across our set of victim models.
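A sketch of the ensemble workflow, assuming the font names above are passed verbatim to `--ps` (the other flags follow the `baseline.py` command shown earlier; the exact strings accepted by `--ps` are an assumption):

```bash
# Sketch: run the jailbreak once per ensemble font, then aggregate the
# per-font results with ensemble_eval.ipynb. Font-name strings are assumed
# to match the --ps values expected by baseline.py.
cd jailbreak
for ps in vitc-h-gen alphabet keyboard cards letters puzzle; do
    python baseline.py --model gpt-4-0613 --tmodel gpt-3.5-turbo-0613 --ps "$ps"
done
```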
Our project builds upon work from python-art, llm-attack, AutoDan, PAIR, DeepInception, LLM-Finetuning-Safety, and BPE-Dropout. We appreciate these open-source contributions from the community.
If you find our project useful in your research, please consider citing:
```bibtex
@inproceedings{jiang-etal-2024-artprompt,
title = "{A}rt{P}rompt: {ASCII} Art-based Jailbreak Attacks against Aligned {LLM}s",
author = "Jiang, Fengqing and
Xu, Zhangchen and
Niu, Luyao and
Xiang, Zhen and
Ramasubramanian, Bhaskar and
Li, Bo and
Poovendran, Radha",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.809",
doi = "10.18653/v1/2024.acl-long.809",
pages = "15157--15173",
abstract = "Safety is critical to the usage of large language models (LLMs). Multiple techniques such as data filtering and supervised fine-tuning have been developed to strengthen LLM safety. However, currently known techniques presume that corpora used for safety alignment of LLMs are solely interpreted by semantics. This assumption, however, does not hold in real-world applications, which leads to severe vulnerabilities in LLMs. For example, users of forums often use ASCII art, a form of text-based art, to convey image information. In this paper, we propose a novel ASCII art-based jailbreak attack and introduce a comprehensive benchmark Vision-in-Text Challenge (ViTC) to evaluate the capabilities of LLMs in recognizing prompts that cannot be solely interpreted by semantics. We show that five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts provided in the form of ASCII art. Based on this observation, we develop the jailbreak attack ArtPrompt, which leverages the poor performance of LLMs in recognizing ASCII art to bypass safety measures and elicit undesired behaviors from LLMs. ArtPrompt only requires black-box access to the victim LLMs, making it a practical attack. We evaluate ArtPrompt on five SOTA LLMs, and show that ArtPrompt can effectively and efficiently induce undesired behaviors from all five LLMs.",
}
```