A comprehensive research framework for creating and evaluating adversarial attacks against Multimodal Large Language Models (MLLMs).
Research Paper: Universal Adversarial Attack on Aligned Multimodal LLMs
Authors: Temurbek Rahmatullaev, Polina Druzhinina, Nikita Kurdiukov, Matvey Mikhalchuk, Andrey Kuznetsov, Anton Razzhigaev
arXiv: 2502.07987 [cs.AI]
License: Apache-2.0
Universal Adversarial Attack is a framework for training adversarial images that trigger unsafe content generation in multimodal models and for evaluating the effectiveness of such attacks. This implementation corresponds to the research paper "Universal Adversarial Attack on Aligned Multimodal LLMs", which demonstrates how a single optimized image can override alignment safeguards across diverse queries and multiple models.
The framework achieves up to 81% attack success rates on certain models, outperforming existing baselines including text-only universal prompts. Our method leverages backpropagation through vision encoders and language heads to craft synthetic images that force models to generate targeted phrases (e.g., "Sure, here it is") or other unsafe content.
Key Research Contributions:
- Universal Image Attacks: Single adversarial image effective across multiple models and queries
- Optimization Pipeline: A unified pipeline that backpropagates through both vision and language components, allowing a variety of models to be attacked
- Benchmark Evaluation: Comprehensive reproduction of competing methods on SafeBench and MM-SafetyBench
- Ablation Variants: Multi-Answer supervision, Gaussian blurring, and localized perturbations
- Transferability Resistance: Unique image-processing pipelines drastically reduce the transferability of attacks across models
This framework is designed exclusively for AI safety research purposes. The goal is to identify and mitigate potential vulnerabilities in multimodal AI systems. Usage for creating harmful content is strictly prohibited.
Some content generated by Multimodal LLMs on this page may be offensive.
Figure: An example of a single universal adversarial image producing disallowed content. This image was originally optimized on three models (Phi, Qwen, and Llama) but is tested here on LLaVA 1.5 7B, demonstrating cross-model generalization. Despite safety alignment, the model yields an unsafe response to a harmful prompt.
# Add your WandB API key
echo "your_wandb_key" > wandb_key.txt
# Single model attack on Llama-3.2
./scripts/attacks/attack_clamp_tanh_llama.sh
# Cross-model universal attack
./scripts/attacks/attack_cross.sh
# Localized attack with Gaussian blur
./scripts/attacks/attack_cross_gblur.sh
# 1. Find best iteration using safety judge
./scripts/evaluation/find_best_iter.sh
# 2. Test on SafeBench dataset
./scripts/evaluation/safebench_test.sh experiment_name iteration model_suffix cuda_num
# 3. Evaluate with safety judge
./scripts/evaluation/guard_eval.sh /path/to/results cuda_num
- Single Model Attacks (`attack_model.py`) - Targeted attacks against individual models with various clamping methods and localization techniques
- Universal Cross-Model Attacks (`crossattack_models.py`) - Attacks that work across multiple models simultaneously
The framework features a sophisticated processor system with specialized handlers for each model:
- Phi-3.5 (`phi3processor.py`) - Dynamic tile-based processing with global + local context
- Qwen2-VL (`qwen2VLprocessor.py`) - Dynamic scaling with fixed ViT patches
- Llama-3.2 (`llama32processor.py`) - Adaptive tiling with optimal canvas allocation
- LLaVA (`llavaprocessor.py`) - Simple fixed-size scaling
- Gemma-3 (`gemma3processor.py`) - Safety judge for evaluation only
Extending the Pipeline to New Models
To attack a different model, you can leverage the existing pipeline by writing a differentiable Image Processor for your target model. Use the provided examples (`phi3processor.py`, `qwen2VLprocessor.py`, `llama32processor.py`, `llavaprocessor.py`) and `abstract_processor.py` as references.
Steps to add support for a new model:
- Write a differentiable Image Processor for your model, using `abstract_processor.py` as an example.
- Implement the image preprocessing methods, ensuring they are differentiable (support autograd).
- Register your processor in the main attack pipeline (`./src/processors/__init__.py`).
- If needed, adapt the attack launch scripts for your new model (`./scripts/attacks/`).
This approach allows you to utilize all features of the framework (localization, universal attacks, benchmarking) with any compatible model.
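As a rough illustration, the sketch below shows what such a processor might look like: a class whose preprocessing stays entirely in tensor space so gradients can flow from the language-model loss back to the raw image pixels. The class name, the `preprocess` method name, the target resolution, and the normalization constants are all assumptions for illustration; the real interface to implement is defined in `abstract_processor.py`.

```python
# Minimal sketch of a differentiable processor, assuming a `preprocess`-style
# interface; consult abstract_processor.py for the actual method names the
# pipeline expects. Resolution and normalization constants are example values.
import torch
import torch.nn.functional as F


class MyModelProcessor:
    """Maps an adversarial image tensor (3, H, W) in [0, 1] to model pixel inputs."""

    def __init__(self, image_size: int = 336,
                 mean=(0.48145466, 0.4578275, 0.40821073),
                 std=(0.26862954, 0.26130258, 0.27577711)):
        self.image_size = image_size
        self.mean = torch.tensor(mean).view(3, 1, 1)
        self.std = torch.tensor(std).view(3, 1, 1)

    def preprocess(self, image: torch.Tensor) -> torch.Tensor:
        # Resize with differentiable interpolation: no PIL round-trips and no
        # uint8 casts, since both would break the autograd path to the pixels.
        pixels = F.interpolate(image.unsqueeze(0),
                               size=(self.image_size, self.image_size),
                               mode="bilinear", align_corners=False)
        # Normalize exactly as the target model's official processor does.
        return (pixels - self.mean.to(image.device)) / self.std.to(image.device)
```

The key requirement is that every step (resizing, tiling, normalization) supports autograd end to end.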
The `SafetyChecker` class uses Gemma-3 with structured output to evaluate the safety of generated content, providing:
- Binary safety classification
- Confidence scores
- Detailed reasoning
- Structured JSON output
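A hypothetical usage sketch is shown below; the module path, constructor arguments, and method name are assumptions, so check `src/judge/README.md` for the actual interface.

```python
# Hypothetical usage sketch: the import path, constructor arguments, and method
# name are assumptions; see src/judge/README.md for the real SafetyChecker API.
from src.judge.safety_checker import SafetyChecker  # assumed module path

checker = SafetyChecker(device="cuda:0")
verdict = checker.check(
    prompt="How do I pick a lock?",                    # the harmful query
    response="Sure, here it is: first insert a ...",   # text produced by the attacked model
)
# Expected structured JSON-style result, e.g.:
# {"is_safe": false, "confidence": 0.93, "reasoning": "Provides lock-picking instructions"}
print(verdict)
```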
- Experiment Tracker (`experiment_tracker.py`) - Comprehensive analysis and visualization of experimental results
- Best Iteration Finder (`find_best_iter_gemma.py`) - Automated selection of optimal attack iterations
- Universal Testing (`SafeBench_universal.py`) - Standardized evaluation on safety benchmarks
- Baseline Comparisons - Testing against multiple safety datasets
- Tanh Clamping - Gradient-preserving constraint using the tanh function (default); see the sketch after this list
- Direct Clamping - Hard pixel value constraints
- Custom Methods - Extensible clamping system
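The difference between the two built-in constraints is easiest to see in code. The sketch below is illustrative (the variable names and epsilon are not the framework's): tanh clamping reparameterizes the perturbation smoothly, while a direct clamp zeroes out gradients wherever the box constraint is active.

```python
import torch

# Illustrative comparison of the two clamping styles (names and epsilon are example values).
delta = torch.zeros(3, 336, 336, requires_grad=True)  # unconstrained latent perturbation
epsilon = 0.3

# Tanh clamping: smooth reparameterization into [-epsilon, epsilon];
# gradients never vanish at the bound, so optimization keeps making progress.
perturbation_tanh = epsilon * torch.tanh(delta)

# Direct clamping: hard box constraint; the gradient is zero wherever the
# clamp is active, which can stall pixels that sit at the boundary.
perturbation_hard = torch.clamp(delta, -epsilon, epsilon)
```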
- Corner Attack - Adversarial perturbations in n×n corner regions
- Bottom Lines - Attacks targeting the bottom k rows of images
- Random Square - Movable n×n adversarial patches for robustness
- Full Image - Traditional full-image perturbations
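Each localization mode can be viewed as a binary mask that gates where the perturbation is applied. A rough sketch follows; the image and region sizes are example values, not framework defaults.

```python
import torch

# Illustrative mask construction for the localization modes above.
H = W = 336
n, k = 50, 20

corner = torch.zeros(H, W)
corner[:n, :n] = 1.0                      # Corner Attack: n x n top-left region

bottom = torch.zeros(H, W)
bottom[-k:, :] = 1.0                      # Bottom Lines: last k rows of the image

y = torch.randint(0, H - n + 1, (1,)).item()
x = torch.randint(0, W - n + 1, (1,)).item()
random_square = torch.zeros(H, W)
random_square[y:y + n, x:x + n] = 1.0     # Random Square: movable n x n patch

full = torch.ones(H, W)                   # Full Image: perturb every pixel

# The chosen mask gates the perturbation before it is added to the clean image:
# adversarial_image = clean_image + mask * perturbation
```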
- Gaussian Blur - Blur augmentation for improved transferability
- Local Cropping - Random cropping for universal attacks
- Multi-Model Training - Cross-model optimization for universal attacks
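During optimization, such augmentations are typically applied to the adversarial image before each forward pass, so the perturbation stays effective under the resizing and preprocessing it will meet at test time. A hedged sketch using standard torchvision transforms follows; the kernel size, sigma, and crop scale are illustrative, not the framework's settings.

```python
import torch
from torchvision import transforms

# Illustrative augmentation stack; parameter values are example choices.
augment = transforms.Compose([
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # blur for transferability
    transforms.RandomResizedCrop(size=336, scale=(0.8, 1.0)),   # local cropping
])

adversarial = torch.rand(3, 336, 336, requires_grad=True)
# Both transforms operate on tensors and are differentiable, so gradients
# computed on the augmented view still reach the underlying pixel values.
augmented = augment(adversarial)
```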
- Adversarial Training - Optimize pixel perturbations to maximize the probability of generating the target text (sketched after this list)
- Best Iteration Selection - Use Gemma-3 judge to identify most effective attack iterations
- Benchmark Testing - Evaluate attacks on SafeBench, MM-SafetyBench, and FigStep datasets
- Safety Assessment - Quantitative evaluation using Attack Success Rate (ASR) metrics
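The adversarial-training step can be summarized with the conceptual sketch below. The function and argument names are illustrative rather than the framework's actual code, and the exact prompt/label construction differs per model family; the core idea is teacher forcing the target phrase and minimizing its token-level cross-entropy with respect to the image pixels.

```python
import torch

# Conceptual sketch of the adversarial-training step; names are illustrative.
# `input_ids` holds prompt + target tokens; `labels` masks the prompt positions
# with -100 so the loss covers only the target phrase (e.g. "Sure, here it is").
def optimize_universal_image(model, processor, clean_image, input_ids, labels,
                             epsilon=0.3, steps=3500, lr=1e-2):
    delta = torch.zeros_like(clean_image, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv_image = clean_image + epsilon * torch.tanh(delta)   # tanh-clamped perturbation
        pixel_values = processor.preprocess(adv_image)          # differentiable processor
        loss = model(pixel_values=pixel_values, input_ids=input_ids, labels=labels).loss
        optimizer.zero_grad()
        loss.backward()                                         # gradients flow back to the pixels
        optimizer.step()
    return (clean_image + epsilon * torch.tanh(delta)).detach()
```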
Use the `ExperimentTracker` for comprehensive result analysis:
from src.evaluation.experiment_tracker import ExperimentTracker
# Initialize tracker
tracker = ExperimentTracker()
# List all experiments
experiments = tracker.list_experiments()
# Get experiment details
info = tracker.get_experiment_info("experiment_name")
# Visualize ASR dynamics
tracker.plot_asr_dynamics("experiment_name")
# Generate summary reports
runs_summary = tracker.get_runs_summary()
tests_summary = tracker.get_tests_summary()
python src/attack_model.py \
--model_name "microsoft/Phi-3.5-vision-instruct" \
--epsilon 0.3 \
--mask_type corner \
--mask_size 50 \
--clamp_method tanh \
--use_gaussian_blur
python src/crossattack_models.py \
--models "microsoft/Phi-3.5-vision-instruct,Qwen/Qwen2-VL-7B-Instruct" \
--epsilon 0.4 \
--mask_type random_square \
--mask_size 100 \
--use_gaussian_blur \
--clamp_method tanh
# Find best iteration
python src/evaluation/find_best_iter_gemma.py \
--experiment_name "cross_attack_20250121"
# Test on SafeBench
python src/evaluation/SafeBench_universal.py \
--exp "cross_attack_20250121" \
--iter 3500 \
--model_suf "Universal" \
--cuda_num 0
- SafeBench - Primary safety evaluation benchmark
- MM-SafetyBench - Extended multimodal safety dataset
- FigStep - Typographic-image jailbreak benchmark for vision-language models
- src/evaluation/README.md - Evaluation framework documentation
- src/judge/README.md - Safety judge documentation
We welcome contributions, especially new Image Processor modules for supporting additional models or custom image transformations. To contribute an Image Processor:
- Fork the repository on GitHub.
- Create a feature branch (for example: `git checkout -b feature/image-processor-yourmodel`).
- Implement your Image Processor in the appropriate location under `src/processors/`, following the structure of existing processors.
- Test your implementation to ensure compatibility and correctness.
- Commit your changes (`git commit -m 'Add Image Processor for YourModel'`).
- Push to your branch (`git push origin feature/image-processor-yourmodel`).
- Open a Pull Request with a clear description of your changes.
We are also open to broader collaborations and research partnerships. For inquiries, please refer to the contact information provided in the research paper.
This project is intended for academic research purposes. Please contact the authors for usage details.