This codebase implements a deployment-efficient algorithm, BREMEN, proposed in Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization.
We modified the ME-TRPO repository for deployment-efficient and offline settings.
We recommend using Docker.
You can use Python 3.6. You must download MuJoCo 1.31 from https://www.roboti.us/ and then install the package dependencies:
pip install -r requirements.txt
You must use the Behavior Regularized Offline Reinforcement Learning codebase for data collection.
Follow its instructions and collect 1M transitions with each noise strategy (pure, eps1, eps3, gaussian1, gaussian3).
If you are only interested in deployment-efficient settings, it is enough to collect transitions with the pure strategy.
After the data collection, put data.data-00000-of-00001 and data.index in ./data/<Agent name>/pure/
e.g. ./data/Ant/pure/data.data-00000-of-00001, ./data/Ant/pure/data.index
Note: This procedure is required for offline experiments. If you only run deployment-efficient experiments, you can skip it. However, it is also required if you want to save a video of your policy, because the offline data is used for the normalization of state and action.
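Before running offline experiments or saving a video, it can help to confirm that the two files are where the scripts expect them. The following is a minimal sketch, not part of this repository: missing_data_files is a hypothetical helper, and it only assumes the ./data/<Agent name>/pure/ layout and the file names given above.

```python
from pathlib import Path

# File names taken from the data-placement instructions above.
CHECKPOINT_FILES = ("data.data-00000-of-00001", "data.index")


def missing_data_files(agent_name, strategy="pure", root="./data"):
    """Return the expected data files that are not present on disk.

    Hypothetical helper: it only checks the <root>/<agent_name>/<strategy>/
    layout described in this README and does not inspect file contents.
    """
    data_dir = Path(root) / agent_name / strategy
    return [
        str(data_dir / name)
        for name in CHECKPOINT_FILES
        if not (data_dir / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_data_files("Ant")
    if missing:
        print("Missing offline data files:")
        for path in missing:
            print("  " + path)
    else:
        print("Offline data for Ant (pure) is in place.")
```

An empty list means both files were found; anything else lists the paths that still need to be copied.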
This repository contains pre-trained policies of BREMEN in deployment-efficient settings with batch size 200k (top row in Figure 2). Save a video to visualize the results using the following command:
e.g.
python save_video.py --env ant --param_path configs/params_ant_offline.json --video_dir <relative path to the video save dir> --restore_path ./weights/Ant/policy.ckpt --restore_policy_variables --n_train 50000
You can use four pre-trained policies of BREMEN: ant, half_cheetah, hopper, walker2d.
(This process requires offline data for the normalization of state and action.)
Run BREMEN in deployment-efficient experiments using the following command:
python recursive.py --env <env_name> --exp_name <experiment_name> --sub_exp_name <exp_save_dir> --param_path configs/params_<env_name>_offline.json --bc_init --random_seeds 0 --target_kl 0.01 --max_path_length 1000
env_name: ant, half_cheetah, hopper, walker2d, cheetah_run
exp_name: what you want to call your experiment
sub_exp_name: partial path for saving experiment logs and results
param_path: path to config json file
target_kl: delta in TRPO objective
max_path_length: length of an imaginary rollout
bc_init: enable behavior-initialization
alpha: coefficient of explicit KL value penalty (0 is the default)
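For context on target_kl: in the standard TRPO formulation, delta is the bound on the average KL divergence between the old and the updated policy. The block below states that textbook constrained problem in the usual notation; the symbols are the conventional ones and are not taken from this repository's code.

```latex
\max_{\theta} \;
\mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}
\left[
  \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}
  \, A^{\pi_{\theta_{\mathrm{old}}}}(s, a)
\right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\mathrm{old}}}}
\left[
  D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)
  \right)
\right]
\le \delta
```

A larger target_kl allows each policy update to move further from the previous policy.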
Experiment results will be logged to ./log/<env_name>/<exp_save_dir>/<experiment_name>/<experiment_name><seed>/
e.g.
python recursive.py --env ant --exp_name recursive_example --sub_exp_name BREMEN_demo --param_path configs/params_ant_offline.json --bc_init --random_seeds 0 --target_kl 0.05 --max_path_length 250 --gaussian 0.1 --const_sampling
python recursive.py --env half_cheetah --exp_name recursive_example --sub_exp_name BREMEN_demo --param_path configs/params_half_cheetah_offline.json --bc_init --random_seeds 0 --target_kl 0.1 --max_path_length 250 --gaussian 0.1 --const_sampling
python recursive.py --env cheetah_run --exp_name recursive_example --sub_exp_name BREMEN_demo --param_path configs/params_cheetah_run_offline.json --bc_init --random_seeds 0 --target_kl 0.1 --max_path_length 250 --gaussian 0.1 --const_sampling
python recursive.py --env hopper --exp_name recursive_example --sub_exp_name BREMEN_demo --param_path configs/params_hopper_offline.json --bc_init --random_seeds 0 --target_kl 0.05 --max_path_length 1000 --gaussian 0.1 --const_sampling --n_train 2000000 --onpol_iters 2400 --interval 240
python recursive.py --env walker2d --exp_name recursive_example --sub_exp_name BREMEN_demo --param_path configs/params_walker2d_offline.json --bc_init --random_seeds 0 --target_kl 0.05 --max_path_length 1000 --gaussian 0.1 --const_sampling --n_train 2000000 --onpol_iters 800
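The example commands above differ only in a few per-environment values, so they can be generated from one table. This is a sketch under that assumption: ENV_SETTINGS and recursive_command are hypothetical names, the values are copied from the examples above, and only flags that appear in those examples are used.

```python
import shlex

# Per-environment values copied from the example commands above.
ENV_SETTINGS = {
    "ant": {"target_kl": "0.05", "max_path_length": "250"},
    "half_cheetah": {"target_kl": "0.1", "max_path_length": "250"},
    "cheetah_run": {"target_kl": "0.1", "max_path_length": "250"},
    "hopper": {
        "target_kl": "0.05",
        "max_path_length": "1000",
        "extra": ["--n_train", "2000000", "--onpol_iters", "2400",
                  "--interval", "240"],
    },
    "walker2d": {
        "target_kl": "0.05",
        "max_path_length": "1000",
        "extra": ["--n_train", "2000000", "--onpol_iters", "800"],
    },
}


def recursive_command(env_name, exp_name="recursive_example",
                      sub_exp_name="BREMEN_demo", seed=0):
    """Return the argument list for one deployment-efficient run."""
    settings = ENV_SETTINGS[env_name]
    argv = [
        "python", "recursive.py",
        "--env", env_name,
        "--exp_name", exp_name,
        "--sub_exp_name", sub_exp_name,
        "--param_path", "configs/params_{}_offline.json".format(env_name),
        "--bc_init",
        "--random_seeds", str(seed),
        "--target_kl", settings["target_kl"],
        "--max_path_length", settings["max_path_length"],
        "--gaussian", "0.1",
        "--const_sampling",
    ]
    return argv + settings.get("extra", [])


if __name__ == "__main__":
    # Print one shell-ready command per environment.
    for env in ENV_SETTINGS:
        print(" ".join(shlex.quote(arg) for arg in recursive_command(env)))
```

The returned list can also be passed directly to subprocess.run if you prefer to launch the runs from Python.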
Run BREMEN in offline experiments using the following command:
python offline.py --env <env_name> --exp_name <experiment_name> --sub_exp_name <exp_save_dir> --param_path configs/params_<env_name>_offline.json --bc_init --random_seeds 0 --target_kl 0.01 --max_path_length 1000
env_name: ant, half_cheetah, hopper, walker2d
exp_name: what you want to call your experiment
sub_exp_name: partial path for saving experiment logs and results
param_path: path to config json file
target_kl: delta in TRPO objective
max_path_length: length of an imaginary rollout
bc_init: enable behavior-initialization
alpha: coefficient of explicit KL value penalty (0 is the default)
onpol_iters: number of outer iterations (the number of inner iterations is set to 25)
noise: pure, eps1, eps3, gaussian1, gaussian3, random (default is pure)
Experiment results will be logged to ./log/<env_name>/<exp_save_dir>/<experiment_name>/<experiment_name><seed>/
e.g.
python offline.py --env ant --exp_name offline_example --sub_exp_name BREMEN_demo --param_path configs/params_ant_offline.json --bc_init --random_seeds 0 --target_kl 0.05 --max_path_length 250 --gaussian 0.1 --const_sampling --onpol_iters 250
python offline.py --env half_cheetah --exp_name offline_example --sub_exp_name BREMEN_demo --param_path configs/params_half_cheetah_offline.json --bc_init --random_seeds 0 --target_kl 0.1 --max_path_length 250 --gaussian 0.1 --const_sampling --onpol_iters 250
python offline.py --env cheetah_run --exp_name offline_example --sub_exp_name BREMEN_demo --param_path configs/params_cheetah_run_offline.json --bc_init --random_seeds 0 --target_kl 0.1 --max_path_length 250 --gaussian 0.1 --const_sampling --onpol_iters 250
python offline.py --env hopper --exp_name offline_example --sub_exp_name BREMEN_demo --param_path configs/params_hopper_offline.json --bc_init --random_seeds 0 --target_kl 0.05 --max_path_length 1000 --gaussian 0.1 --const_sampling --onpol_iters 250
python offline.py --env walker2d --exp_name offline_example --sub_exp_name BREMEN_demo --param_path configs/params_walker2d_offline.json --bc_init --random_seeds 0 --target_kl 0.05 --max_path_length 1000 --gaussian 0.1 --const_sampling --onpol_iters 250
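To locate the results of a run afterwards, the log layout described above can be rebuilt from the command-line values. This is only a sketch: log_dir is a hypothetical helper that is not part of this repository, and it assumes that the seed is appended to the experiment name without a separator, exactly as the path template is written.

```python
def log_dir(env_name, sub_exp_name, exp_name, seed, root="./log"):
    """Build the log directory following the layout in this README.

    Template: <root>/<env_name>/<exp_save_dir>/<experiment_name>/<experiment_name><seed>/
    """
    run_name = "{}{}".format(exp_name, seed)  # seed appended directly (assumption)
    return "/".join([root, env_name, sub_exp_name, exp_name, run_name]) + "/"


if __name__ == "__main__":
    # Values taken from the first offline example command above.
    print(log_dir("ant", "BREMEN_demo", "offline_example", 0))
    # prints ./log/ant/BREMEN_demo/offline_example/offline_example0/
```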
Please use the following bibtex for citations:
@inproceedings{matsushima2020deploy,
title={Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization},
author={Tatsuya Matsushima and Hiroki Furuta and Yutaka Matsuo and Ofir Nachum and Shixiang Shane Gu},
year={2021},
booktitle={International Conference on Learning Representations},
}