This project provides the open-source implementation of the robust safe RL methods introduced in the ICLR 2023 paper "On the Robustness of Safe Reinforcement Learning under Observational Perturbations" (Liu et al., 2023).
Safe RL trains a policy to maximize the reward while satisfying constraints. While prior work focuses on performance optimality, we find that the optimal solutions of many safe RL problems are neither robust nor safe against carefully designed observational perturbations. We propose two adversarial attacks: one maximizes the cost and the other maximizes the reward. One interesting and counter-intuitive finding is that the maximum reward attack is strong, as it can both induce unsafe behaviors and keep the attack stealthy by maintaining the reward. We further propose a defense method based on adversarial training, which keeps the agent safe under attacks. Video demos are available at the project webpage.
If you find this code useful, please consider citing:
@article{liu2022robustness,
title={On the robustness of safe reinforcement learning under observational perturbations},
author={Liu, Zuxin and Guo, Zijian and Cen, Zhepeng and Zhang, Huan and Tan, Jie and Li, Bo and Zhao, Ding},
journal={arXiv preprint arXiv:2205.14691},
year={2022}
}
The structure of this repo is as follows:
Robust safe RL libraries
├── rsrl # package folder
│ ├── policy # core algorithm implementation
│ │ ├── model # stores the actor critic model architecture
│ │ ├── policy_name # algorithm implementations
│ ├── util # logger and pytorch utils
│ ├── runner.py # training logic of the algorithms
│ ├── evaluator.py # evaluation logic of trained agents
├── script # stores the training scripts.
│ ├── config # stores some configs of the env and policy
│ ├── run.py # launch a single experiment
│ ├── experiment.py # launch multiple experiments in parallel with ray
│ ├── eval.py # evaluate script of trained agents
├── data # stores experiment results
- The repo is tested on Ubuntu 20.04 and should also work on Ubuntu 18.04
- We recommend using Anaconda3 for Python environment management
- Activate a Python 3.7+ Anaconda virtual environment, then install the bullet_safety_gym simulation environment:
cd envs/Bullet-Safety-Gym
pip install -e .
cd ../..
- After switching back to the repo root folder, install the dependencies listed in requirement.txt and the rsrl library:
pip install -r requirement.txt
pip install -e .
- Then install pytorch based on your system configuration, following the official PyTorch installation instructions. For example, to install a CPU-only version of pytorch via Anaconda3, run:
conda install pytorch cpuonly -c pytorch
- The MAD attacker requires the pysgmcmc library for optimization. Install it with:
pip install git+https://github.com/MFreidank/pysgmcmc@pytorch
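After the steps above, a quick check that the packages are visible to the active environment can save a failed first run. The following is a minimal sketch, not part of the repo. It assumes the import names match the names used in this README (bullet_safety_gym, rsrl, torch, pysgmcmc); pip package names and import names can differ, so adjust the list if a module is reported missing even though its installation succeeded.

```python
import importlib.util

# Assumed import names, taken from the package names mentioned in this README.
REQUIRED_MODULES = ["bullet_safety_gym", "rsrl", "torch", "pysgmcmc"]


def check_modules(names):
    """Return {module_name: bool}, True if the module can be located."""
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for malformed or half-initialized modules.
            status[name] = False
    return status


if __name__ == "__main__":
    for name, ok in check_modules(REQUIRED_MODULES).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

The check only locates the modules; it does not import them, so it will not catch a broken build of a package that is present on disk.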
To run a single experiment:
python script/run.py --rs_mode vanilla --policy robust_ppo
To run multiple experiments in parallel:
python script/experiment.py -e experiment_name
To evaluate a trained model, run:
python script/eval.py -d path_to_model
To evaluate multiple trained models in parallel:
python script/evaluation.py -d path_to_model -e env_name
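When these commands are driven from another Python script, building them as argument lists avoids shell-quoting mistakes. The sketch below is a hypothetical helper, not part of the repo. It uses only the script paths and flags shown above; the function names are made up for illustration, and the placeholder values (experiment_name, path_to_model, env_name) must be replaced with real ones.

```python
import shlex
import sys


def run_cmd(rs_mode="vanilla", policy="robust_ppo"):
    """Command for a single training run (flags as shown in this README)."""
    return [sys.executable, "script/run.py", "--rs_mode", rs_mode, "--policy", policy]


def experiment_cmd(experiment_name):
    """Command for launching multiple experiments in parallel."""
    return [sys.executable, "script/experiment.py", "-e", experiment_name]


def eval_cmd(model_dir):
    """Command for evaluating one trained model."""
    return [sys.executable, "script/eval.py", "-d", model_dir]


def parallel_eval_cmd(model_dir, env_name):
    """Command for evaluating multiple trained models in parallel."""
    return [sys.executable, "script/evaluation.py", "-d", model_dir, "-e", env_name]


def to_shell(cmd):
    """Render an argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    print(to_shell(run_cmd()))
    print(to_shell(eval_cmd("path_to_model")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repo root, since the script paths are relative.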
The complete hyper-parameters can be found in script/config/config_robust_ppo.yaml.
In particular, PPO-Lagrangian has different robust training modes, which are specified by the rs_mode parameter. We detail the modes in the following table.
Algorithm | PPOL | ADV-PPOL(MC) | ADV-PPOL(MR) | PPOL-random | SA-PPOL | SA-PPOL(MC) | SA-PPOL(MR) |
---|---|---|---|---|---|---|---|
Mode | vanilla | max_cost | max_reward | uniform | kl | klmc | klmr |
- The proposed adversarial training methods correspond to the max_cost and max_reward modes.
- For the SA-PPOL series, the modes are kl, klmc, and klmr. SA-PPOL with the original MAD attacker is the kl mode; SA-PPOL with the MC and MR attackers corresponds to klmc and klmr, respectively.
- Note that FOCOPS also supports the adversarial training modes max_cost and max_reward, as well as uniform and vanilla.
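For scripting sweeps over the PPO-Lagrangian variants, the table can be written down as a lookup from algorithm name to rs_mode value. This is a minimal sketch, not part of the repo. The dictionary contents are copied from the table, while the names RS_MODES and rs_mode_for are hypothetical.

```python
# Algorithm name -> rs_mode value, copied from the table in this README.
RS_MODES = {
    "PPOL": "vanilla",
    "ADV-PPOL(MC)": "max_cost",
    "ADV-PPOL(MR)": "max_reward",
    "PPOL-random": "uniform",
    "SA-PPOL": "kl",
    "SA-PPOL(MC)": "klmc",
    "SA-PPOL(MR)": "klmr",
}


def rs_mode_for(algorithm):
    """Return the rs_mode string for an algorithm name from the table."""
    try:
        return RS_MODES[algorithm]
    except KeyError:
        known = ", ".join(RS_MODES)
        raise ValueError(f"unknown algorithm {algorithm!r}; expected one of: {known}") from None


if __name__ == "__main__":
    for algorithm, mode in RS_MODES.items():
        print(f"{algorithm}: --rs_mode {mode}")
```

Failing on an unknown name, rather than falling back to vanilla, keeps a typo from silently launching the wrong training mode.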
The pretrained weights are available here.
Part of the code is based on several public repos:
- https://github.com/SvenGronauer/Bullet-Safety-Gym. Note that our BulletSafetyGym is a modified version of the original. The main change is a larger simulation step, which reduces the total training time without sacrificing too much accuracy.
- https://github.com/openai/spinningup