A comprehensive research framework for creating and evaluating adversarial attacks against Multimodal Large Language Models (MLLMs).
Research Paper: Universal Adversarial Attack on Aligned Multimodal LLMs
Authors: Temurbek Rahmatullaev, Polina Druzhinina, Nikita Kurdiukov, Matvey Mikhalchuk, Andrey Kuznetsov, Anton Razzhigaev
arXiv: 2502.07987 [cs.AI]
License: Apache-2.0
Universal Adversarial Attack is a framework for training adversarial images that trigger unsafe content generation in multimodal models and for evaluating the effectiveness of such attacks. This implementation corresponds to the research paper "Universal Adversarial Attack on Aligned Multimodal LLMs", which demonstrates how a single optimized image can override alignment safeguards across diverse queries and multiple models.
The framework achieves up to 81% attack success rates on certain models, outperforming existing baselines including text-only universal prompts. Our method leverages backpropagation through vision encoders and language heads to craft synthetic images that force models to generate targeted phrases (e.g., "Sure, here it is") or other unsafe content.
Key Research Contributions:
- Universal Image Attacks: Single adversarial image effective across multiple models and queries
- Optimization Pipeline: A unified pipeline that backpropagates through both vision and language components, allowing a variety of models to be attacked
- Benchmark Evaluation: Comprehensive reproduction of competing methods on SafeBench and MM-SafetyBench
- Ablation Variants: Multi-Answer supervision, Gaussian blurring, and localized perturbations
- Transferability Resistance: Unique image-processing pipelines drastically reduce the transferability of attacks across models
This framework is designed exclusively for AI safety research purposes. The goal is to identify and mitigate potential vulnerabilities in multimodal AI systems. Usage for creating harmful content is strictly prohibited.
Some content generated by Multimodal LLMs on this page may be offensive.
Figure: An example of a single universal adversarial image producing disallowed content. This image was originally optimized on three models (Phi, Qwen, and Llama) but is tested here on LLaVA 1.5 7B, demonstrating cross-model generalization. Despite safety alignment, the model yields an unsafe response to a harmful prompt.
# Add your WandB API key
echo "your_wandb_key" > wandb_key.txt
# Single model attack on Llama-3.2
./scripts/attacks/attack_clamp_tanh_llama.sh
# Cross-model universal attack
./scripts/attacks/attack_cross.sh
# Localized attack with Gaussian blur
./scripts/attacks/attack_cross_gblur.sh
# 1. Find best iteration using safety judge
./scripts/evaluation/find_best_iter.sh
# 2. Test on SafeBench dataset
./scripts/evaluation/safebench_test.sh experiment_name iteration model_suffix cuda_num
# 3. Evaluate with safety judge
./scripts/evaluation/guard_eval.sh /path/to/results cuda_num
- Single Model Attacks (`attack_model.py`) - Targeted attacks against individual models with various clamping methods and localization techniques
- Universal Cross-Model Attacks (`crossattack_models.py`) - Attacks that work across multiple models simultaneously
The framework features a sophisticated processor system with specialized handlers for each model:
- Phi-3.5 (`phi3processor.py`) - Dynamic tile-based processing with global + local context
- Qwen2-VL (`qwen2VLprocessor.py`) - Dynamic scaling with fixed ViT patches
- Llama-3.2 (`llama32processor.py`) - Adaptive tiling with optimal canvas allocation
- LLaVA (`llavaprocessor.py`) - Simple fixed-size scaling
- Gemma-3 (`gemma3processor.py`) - Safety judge for evaluation only
Extending the Pipeline to New Models
To attack a different model, you can leverage the existing pipeline by writing a differentiable Image Processor for your target model. Use the provided examples (`phi3processor.py`, `qwen2VLprocessor.py`, `llama32processor.py`, `llavaprocessor.py`) and `abstract_processor.py` as references.
Steps to add support for a new model:
- Write a differentiable Image Processor for your model, using `abstract_processor.py` as an example.
- Implement the image preprocessing methods, ensuring they are differentiable (support autograd).
- Register your processor in the main attack pipeline (`./src/processors/__init__.py`).
- If needed, adapt the attack launch scripts for your new model (`./scripts/attacks/`).
This approach allows you to utilize all features of the framework (localization, universal attacks, benchmarking) with any compatible model.
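As a rough illustration, the sketch below shows what such a processor might look like: a class whose preprocessing stays entirely in tensor space so gradients can flow from the language-model loss back to the raw image pixels. The class name, the `preprocess` method name, the target resolution, and the normalization constants are all assumptions for illustration; the real interface to implement is defined in `abstract_processor.py`.

```python
# Minimal sketch of a differentiable processor, assuming a `preprocess`-style
# interface; consult abstract_processor.py for the actual method names the
# pipeline expects. Resolution and normalization constants are example values.
import torch
import torch.nn.functional as F


class MyModelProcessor:
    """Maps an adversarial image tensor (3, H, W) in [0, 1] to model pixel inputs."""

    def __init__(self, image_size: int = 336,
                 mean=(0.48145466, 0.4578275, 0.40821073),
                 std=(0.26862954, 0.26130258, 0.27577711)):
        self.image_size = image_size
        self.mean = torch.tensor(mean).view(3, 1, 1)
        self.std = torch.tensor(std).view(3, 1, 1)

    def preprocess(self, image: torch.Tensor) -> torch.Tensor:
        # Resize with differentiable interpolation: no PIL round-trips and no
        # uint8 casts, since both would break the autograd path to the pixels.
        pixels = F.interpolate(image.unsqueeze(0),
                               size=(self.image_size, self.image_size),
                               mode="bilinear", align_corners=False)
        # Normalize exactly as the target model's official processor does.
        return (pixels - self.mean.to(image.device)) / self.std.to(image.device)
```

The key requirement is that every step (resizing, tiling, normalization) supports autograd end to end.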
The `SafetyChecker` class uses Gemma-3 with structured output to evaluate the safety of generated content, providing:
- Binary safety classification
- Confidence scores
- Detailed reasoning
- Structured JSON output
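A hypothetical usage sketch is shown below; the module path, constructor arguments, and method name are assumptions, so check `src/judge/README.md` for the actual interface.

```python
# Hypothetical usage sketch: the import path, constructor arguments, and method
# name are assumptions; see src/judge/README.md for the real SafetyChecker API.
from src.judge.safety_checker import SafetyChecker  # assumed module path

checker = SafetyChecker(device="cuda:0")
verdict = checker.check(
    prompt="How do I pick a lock?",                    # the harmful query
    response="Sure, here it is: first insert a ...",   # text produced by the attacked model
)
# Expected structured JSON-style result, e.g.:
# {"is_safe": false, "confidence": 0.93, "reasoning": "Provides lock-picking instructions"}
print(verdict)
```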
- Experiment Tracker (`experiment_tracker.py`) - Comprehensive analysis and visualization of experimental results
- Best Iteration Finder (`find_best_iter_gemma.py`) - Automated selection of optimal attack iterations
- Universal Testing (`SafeBench_universal.py`) - Standardized evaluation on safety benchmarks
- Baseline Comparisons - Testing against multiple safety datasets
- Tanh Clamping - Gradient-preserving constraint using the tanh function (default); see the sketch after this list
- Direct Clamping - Hard pixel value constraints
- Custom Methods - Extensible clamping system
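The difference between the two built-in constraints is easiest to see in code. The sketch below is illustrative (the variable names and epsilon are not the framework's): tanh clamping reparameterizes the perturbation smoothly, while a direct clamp zeroes out gradients wherever the box constraint is active.

```python
import torch

# Illustrative comparison of the two clamping styles (names and epsilon are example values).
delta = torch.zeros(3, 336, 336, requires_grad=True)  # unconstrained latent perturbation
epsilon = 0.3

# Tanh clamping: smooth reparameterization into [-epsilon, epsilon];
# gradients never vanish at the bound, so optimization keeps making progress.
perturbation_tanh = epsilon * torch.tanh(delta)

# Direct clamping: hard box constraint; the gradient is zero wherever the
# clamp is active, which can stall pixels that sit at the boundary.
perturbation_hard = torch.clamp(delta, -epsilon, epsilon)
```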
- Corner Attack - Adversarial perturbations in n×n corner regions
- Bottom Lines - Attacks targeting the bottom k rows of images
- Random Square - Movable n×n adversarial patches for robustness
- Full Image - Traditional full-image perturbations
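Each localization mode can be viewed as a binary mask that gates where the perturbation is applied. A rough sketch follows; the image and region sizes are example values, not framework defaults.

```python
import torch

# Illustrative mask construction for the localization modes above.
H = W = 336
n, k = 50, 20

corner = torch.zeros(H, W)
corner[:n, :n] = 1.0                      # Corner Attack: n x n top-left region

bottom = torch.zeros(H, W)
bottom[-k:, :] = 1.0                      # Bottom Lines: last k rows of the image

y = torch.randint(0, H - n + 1, (1,)).item()
x = torch.randint(0, W - n + 1, (1,)).item()
random_square = torch.zeros(H, W)
random_square[y:y + n, x:x + n] = 1.0     # Random Square: movable n x n patch

full = torch.ones(H, W)                   # Full Image: perturb every pixel

# The chosen mask gates the perturbation before it is added to the clean image:
# adversarial_image = clean_image + mask * perturbation
```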
- Gaussian Blur - Blur augmentation for improved transferability
- Local Cropping - Random cropping for universal attacks
- Multi-Model Training - Cross-model optimization for universal attacks
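During optimization, such augmentations are typically applied to the adversarial image before each forward pass, so the perturbation stays effective under the resizing and preprocessing it will meet at test time. A hedged sketch using standard torchvision transforms follows; the kernel size, sigma, and crop scale are illustrative, not the framework's settings.

```python
import torch
from torchvision import transforms

# Illustrative augmentation stack; parameter values are example choices.
augment = transforms.Compose([
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # blur for transferability
    transforms.RandomResizedCrop(size=336, scale=(0.8, 1.0)),   # local cropping
])

adversarial = torch.rand(3, 336, 336, requires_grad=True)
# Both transforms operate on tensors and are differentiable, so gradients
# computed on the augmented view still reach the underlying pixel values.
augmented = augment(adversarial)
```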
- Adversarial Training - Optimize pixel perturbations to maximize the probability of generating the target text (sketched after this list)
- Best Iteration Selection - Use Gemma-3 judge to identify most effective attack iterations
- Benchmark Testing - Evaluate attacks on SafeBench, MM-SafetyBench, and FigStep datasets
- Safety Assessment - Quantitative evaluation using Attack Success Rate (ASR) metrics
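The adversarial-training step can be summarized with the conceptual sketch below. The function and argument names are illustrative rather than the framework's actual code, and the exact prompt/label construction differs per model family; the core idea is teacher forcing the target phrase and minimizing its token-level cross-entropy with respect to the image pixels.

```python
import torch

# Conceptual sketch of the adversarial-training step; names are illustrative.
# `input_ids` holds prompt + target tokens; `labels` masks the prompt positions
# with -100 so the loss covers only the target phrase (e.g. "Sure, here it is").
def optimize_universal_image(model, processor, clean_image, input_ids, labels,
                             epsilon=0.3, steps=3500, lr=1e-2):
    delta = torch.zeros_like(clean_image, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv_image = clean_image + epsilon * torch.tanh(delta)   # tanh-clamped perturbation
        pixel_values = processor.preprocess(adv_image)          # differentiable processor
        loss = model(pixel_values=pixel_values, input_ids=input_ids, labels=labels).loss
        optimizer.zero_grad()
        loss.backward()                                         # gradients flow back to the pixels
        optimizer.step()
    return (clean_image + epsilon * torch.tanh(delta)).detach()
```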
Use the `ExperimentTracker` for comprehensive result analysis:
from src.evaluation.experiment_tracker import ExperimentTracker
# Initialize tracker
tracker = ExperimentTracker()
# List all experiments
experiments = tracker.list_experiments()
# Get experiment details
info = tracker.get_experiment_info("experiment_name")
# Visualize ASR dynamics
tracker.plot_asr_dynamics("experiment_name")
# Generate summary reports
runs_summary = tracker.get_runs_summary()
tests_summary = tracker.get_tests_summary()
python src/attack_model.py \
--model_name "microsoft/Phi-3.5-vision-instruct" \
--epsilon 0.3 \
--mask_type corner \
--mask_size 50 \
--clamp_method tanh \
--use_gaussian_blur
python src/crossattack_models.py \
--models "microsoft/Phi-3.5-vision-instruct,Qwen/Qwen2-VL-7B-Instruct" \
--epsilon 0.4 \
--mask_type random_square \
--mask_size 100 \
--use_gaussian_blur \
--clamp_method tanh
# Find best iteration
python src/evaluation/find_best_iter_gemma.py \
--experiment_name "cross_attack_20250121"
# Test on SafeBench
python src/evaluation/SafeBench_universal.py \
--exp "cross_attack_20250121" \
--iter 3500 \
--model_suf "Universal" \
--cuda_num 0
- SafeBench - Primary safety evaluation benchmark
- MM-SafetyBench - Extended multimodal safety dataset
- FigStep - Typographic-image jailbreak benchmark for vision-language models
- src/evaluation/README.md - Evaluation framework documentation
- src/judge/README.md - Safety judge documentation
We welcome contributions, especially new Image Processor modules for supporting additional models or custom image transformations. To contribute an Image Processor:
- Fork the repository on GitHub.
- Create a feature branch (for example: `git checkout -b feature/image-processor-yourmodel`).
- Implement your Image Processor in the appropriate location under `src/processors/`, following the structure of existing processors.
- Test your implementation to ensure compatibility and correctness.
- Commit your changes (`git commit -m 'Add Image Processor for YourModel'`).
- Push to your branch (`git push origin feature/image-processor-yourmodel`).
- Open a Pull Request with a clear description of your changes.
We are also open to broader collaborations and research partnerships. For inquiries, please refer to the contact information provided in the research paper.
This project is intended for academic research purposes. Please contact the authors for usage details.