Merge pull request #32 from araffin/HER-support
HER support + time wrappers
araffin authored Jun 16, 2019
2 parents d260c92 + db6bf1c commit b0f43e1
Showing 27 changed files with 487 additions and 106 deletions.
11 changes: 5 additions & 6 deletions README.md
@@ -58,7 +58,7 @@ Note: when training TRPO, you have to use `mpirun` to enable multiprocessing:
mpirun -n 16 python train.py --algo trpo --env BreakoutNoFrameskip-v4
```

## Hyperparameter Optimization
## Hyperparameter Tuning

We use [Optuna](https://optuna.org/) for optimizing the hyperparameters.

@@ -68,7 +68,7 @@ when using SuccessiveHalvingPruner ("halving"), you must specify `--n-jobs > 1`
Budget of 1000 trials with a maximum of 50000 steps:

```
python -m train.py --algo ppo2 --env MountainCar-v0 -n 50000 -optimize --n-trials 1000 --n-jobs 2 \
python train.py --algo ppo2 --env MountainCar-v0 -n 50000 -optimize --n-trials 1000 --n-jobs 2 \
--sampler random --pruner median
```
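
The `--sampler` and `--pruner` flags map onto Optuna's samplers and pruners. A minimal sketch of roughly what the optimization loop does under the hood (the objective function below is a stand-in, not the repository's actual code):

```python
import optuna

def objective(trial):
    # Stand-in objective: sample two hyperparameters and return a score.
    # A real objective would train the agent with them and return the
    # mean evaluation reward, reporting intermediate values for pruning.
    learning_rate = trial.suggest_loguniform('learning_rate', 1e-5, 1e-2)
    gamma = trial.suggest_uniform('gamma', 0.9, 0.9999)
    return -((learning_rate - 1e-3) ** 2 + (gamma - 0.99) ** 2)

# --sampler random -> RandomSampler, --pruner median -> MedianPruner
study = optuna.create_study(
    sampler=optuna.samplers.RandomSampler(seed=0),
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=5),
    direction='maximize',
)
study.optimize(objective, n_trials=1000, n_jobs=2)
print(study.best_params)
```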

@@ -120,7 +120,7 @@ Additional Atari Games (to be completed):
| PPO2 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |:heavy_check_mark: |
| DQN | :heavy_check_mark: | :heavy_check_mark: |:heavy_check_mark:| N/A | N/A |
| DDPG | N/A | N/A | N/A| :heavy_check_mark: | :heavy_check_mark: |
| SAC | N/A | N/A | N/A| :heavy_check_mark: | |
| SAC | N/A | N/A | N/A| :heavy_check_mark: | :heavy_check_mark: |
| TRPO | :heavy_check_mark: | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: |


@@ -181,13 +181,12 @@ Note that you need to specify --gym-packages gym_minigrid with enjoy.py and train.py

```
pip install gym-minigrid
python -m train.py --algo ppo2 --env MiniGrid-DoorKey-5x5-v0 \
--gym-packages gym_minigrid
python train.py --algo ppo2 --env MiniGrid-DoorKey-5x5-v0 --gym-packages gym_minigrid
```

This does the same thing as:

```
```python
import gym_minigrid
```
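
The `--gym-packages` flag achieves this by importing each listed module at runtime (see the `importlib.import_module` call in `enjoy.py` further down), which lets the package register its environments with Gym before `gym.make()` is called. A minimal sketch of the mechanism, assuming `gym-minigrid` is installed:

```python
import argparse
import importlib

import gym

parser = argparse.ArgumentParser()
parser.add_argument('--gym-packages', type=str, nargs='+', default=[],
                    help='Additional external Gym environment packages to import')
args = parser.parse_args(['--gym-packages', 'gym_minigrid'])

# Importing the package runs its gym.envs registration calls,
# so gym.make() can then resolve the new environment ids
for env_module in args.gym_packages:
    importlib.import_module(env_module)

env = gym.make('MiniGrid-DoorKey-5x5-v0')
obs = env.reset()
```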

8 changes: 5 additions & 3 deletions benchmark.md
@@ -45,6 +45,7 @@
|acktr|SeaquestNoFrameskip-v4 | 1672.239| 105.092| 149148| 67|
|acktr|SpaceInvadersNoFrameskip-v4 | 738.045| 306.756| 149714| 156|
|ddpg |AntBulletEnv-v0 | 2070.790| 74.253| 150000| 150|
|ddpg |BipedalWalker-v2 | 94.202| 142.679| 149647| 240|
|ddpg |HalfCheetahBulletEnv-v0 | 2549.452| 37.652| 150000| 150|
|ddpg |LunarLanderContinuous-v2 | 244.566| 75.617| 149531| 660|
|ddpg |MountainCarContinuous-v0 | 91.858| 1.350| 149945| 1818|
@@ -70,7 +71,7 @@
|ppo2 |BreakoutNoFrameskip-v4 | 228.594| 141.964| 150921| 101|
|ppo2 |CartPole-v1 | 500.000| 0.000| 150000| 300|
|ppo2 |EnduroNoFrameskip-v4 | 643.824| 205.988| 149683| 17|
|ppo2 |HalfCheetahBulletEnv-v0 | 2037.586| 59.480| 150000| 150|
|ppo2 |HalfCheetahBulletEnv-v0 | 3195.326| 115.730| 150000| 150|
|ppo2 |HopperBulletEnv-v0 | 1944.588| 612.994| 149157| 176|
|ppo2 |HumanoidBulletEnv-v0 | 1285.814| 918.715| 149544| 244|
|ppo2 |InvertedDoublePendulumBulletEnv-v0 | 7702.750| 2888.815| 149089| 181|
@@ -89,15 +90,16 @@
|ppo2 |SeaquestNoFrameskip-v4 | 1782.687| 80.883| 150535| 67|
|ppo2 |SpaceInvadersNoFrameskip-v4 | 689.631| 202.143| 150081| 176|
|ppo2 |Walker2DBulletEnv-v0 | 1276.848| 504.586| 149959| 179|
|sac |AntBulletEnv-v0 | 2354.785| 42.501| 150000| 150|
|sac |AntBulletEnv-v0 | 3485.228| 29.964| 150000| 150|
|sac |BipedalWalker-v2 | 307.198| 1.055| 149794| 175|
|sac |BipedalWalkerHardcore-v2 | 100.802| 117.769| 148974| 84|
|sac |HalfCheetahBulletEnv-v0 | 2021.599| 261.582| 150000| 150|
|sac |HalfCheetahBulletEnv-v0 | 3330.911| 95.575| 150000| 150|
|sac |HopperBulletEnv-v0 | 2438.152| 335.284| 149232| 155|
|sac |HumanoidBulletEnv-v0 | 2048.187| 829.776| 149886| 172|
|sac |InvertedDoublePendulumBulletEnv-v0 | 9357.406| 0.504| 150000| 150|
|sac |InvertedPendulumSwingupBulletEnv-v0| 891.508| 0.963| 150000| 150|
|sac |LunarLanderContinuous-v2 | 269.783| 57.077| 149852| 709|
|sac |MountainCarContinuous-v0 | 90.421| 0.997| 149989| 1311|
|sac |Pendulum-v0 | -159.669| 86.665| 150000| 750|
|sac |ReacherBulletEnv-v0 | 17.529| 9.860| 150000| 1000|
|sac |Walker2DBulletEnv-v0 | 2052.646| 13.631| 150000| 150|
56 changes: 43 additions & 13 deletions enjoy.py
@@ -8,12 +8,18 @@
# For pybullet envs
warnings.filterwarnings("ignore")
import gym
import pybullet_envs
try:
import pybullet_envs
except ImportError:
pybullet_envs = None
import numpy as np

try:
import highway_env
except ImportError:
highway_env = None
import stable_baselines
from stable_baselines.common import set_global_seeds
from stable_baselines.common.vec_env import VecNormalize, VecFrameStack
from stable_baselines.common.vec_env import VecNormalize, VecFrameStack, VecEnv

from utils import ALGOS, create_test_env, get_latest_run_id, get_saved_hyperparams

@@ -40,17 +46,19 @@ def main():
help='Do not render the environment (useful for tests)')
parser.add_argument('--deterministic', action='store_true', default=False,
help='Use deterministic actions')
parser.add_argument('--stochastic', action='store_true', default=False,
help='Use stochastic actions (for DDPG/DQN/SAC)')
parser.add_argument('--norm-reward', action='store_true', default=False,
help='Normalize reward if applicable (trained with VecNormalize)')
parser.add_argument('--seed', help='Random generator seed', type=int, default=0)
parser.add_argument('--reward-log', help='Where to log reward', default='', type=str)
parser.add_argument('--gym-packages', type=str, nargs='+', default=[], help='Additional external Gym environment package modules to import (e.g. gym_minigrid)')
args = parser.parse_args()

# Going through custom gym packages to let them register in the global registry
for env_module in args.gym_packages:
importlib.import_module(env_module)

env_id = args.env
algo = args.algo
folder = args.folder
@@ -94,11 +102,14 @@ def main():

obs = env.reset()

# Force deterministic for DQN and DDPG
deterministic = args.deterministic or algo in ['dqn', 'ddpg', 'sac']
# Force deterministic actions for DQN, DDPG, SAC and HER (which is a wrapper around one of them)
deterministic = args.deterministic or algo in ['dqn', 'ddpg', 'sac', 'her'] and not args.stochastic

running_reward = 0.0
episode_reward = 0.0
episode_rewards = []
ep_len = 0
# For HER, monitor success rate
successes = []
for _ in range(args.n_timesteps):
action, _ = model.predict(obs, deterministic=deterministic)
# Random Agent
@@ -109,7 +120,8 @@
obs, reward, done, infos = env.step(action)
if not args.no_render:
env.render('human')
running_reward += reward[0]

episode_reward += reward[0]
ep_len += 1

if args.n_envs == 1:
@@ -121,17 +133,35 @@
print("Atari Episode Score: {:.2f}".format(episode_infos['r']))
print("Atari Episode Length", episode_infos['l'])

if done and not is_atari and args.verbose >= 1:
if done and not is_atari and args.verbose > 0:
# NOTE: for envs using VecNormalize, the mean reward
# is a normalized reward when the `--norm-reward` flag is passed
print("Episode Reward: {:.2f}".format(running_reward))
print("Episode Reward: {:.2f}".format(episode_reward))
print("Episode Length", ep_len)
running_reward = 0.0
episode_rewards.append(episode_reward)
episode_reward = 0.0
ep_len = 0

# Reset also when the goal is achieved when using HER
if done or infos[0].get('is_success', False):
if args.algo == 'her' and args.verbose > 1:
print("Success?", infos[0].get('is_success', False))
# Alternatively, you can add a check to wait for the end of the episode
# if done:
obs = env.reset()
if args.algo == 'her':
successes.append(infos[0].get('is_success', False))
episode_reward, ep_len = 0.0, 0

if args.verbose > 0 and len(successes) > 0:
print("Success rate: {:.2f}%".format(100 * np.mean(successes)))

if args.verbose > 0 and len(episode_rewards) > 0:
print("Mean reward: {:.2f}".format(np.mean(episode_rewards)))

# Workaround for https://github.com/openai/gym/issues/893
if not args.no_render:
if args.n_envs == 1 and 'Bullet' not in env_id and not is_atari:
if args.n_envs == 1 and 'Bullet' not in env_id and not is_atari and isinstance(env, VecEnv):
# DummyVecEnv
# Unwrap env
while isinstance(env, VecNormalize) or isinstance(env, VecFrameStack):
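
With these changes, replaying a trained HER agent also reports the success rate over the evaluation episodes. An illustrative invocation (the environment id and log folder are assumptions, not taken from this PR):

```
python enjoy.py --algo her --env parking-v0 --folder logs/ --n-timesteps 5000 --no-render
```
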
70 changes: 41 additions & 29 deletions hyperparams/ddpg.yml
@@ -19,56 +19,68 @@ Pendulum-v0:
noise_std: 0.1
memory_limit: 50000

# To be tuned
# Tuned
BipedalWalker-v2:
n_timesteps: !!float 5e6
policy: 'LnMlpPolicy'
normalize_observations: True
n_timesteps: !!float 1e6
policy: 'MlpPolicy'
noise_type: 'adaptive-param'
noise_std: 0.2
memory_limit: 50000
noise_std: 0.287
memory_limit: 100000
normalize_observations: True
normalize_returns: False
gamma: 0.999
actor_lr: !!float 0.000527
batch_size: 256
random_exploration: 0.0
policy_kwargs: 'dict(layer_norm=True)'

# To be tuned
# Tuned
HalfCheetahBulletEnv-v0:
n_timesteps: !!float 1e6
n_timesteps: !!float 2e6
policy: 'LnMlpPolicy'
gamma: 0.99
memory_limit: 100000
gamma: 0.95
memory_limit: 1000000
noise_type: 'normal'
noise_std: 0.024
batch_size: 64
noise_std: 0.22
batch_size: 256
normalize_observations: True
normalize_returns: True
normalize_returns: False

# Tuned
Walker2DBulletEnv-v0:
n_timesteps: !!float 2e6
policy: 'LnMlpPolicy'
gamma: 0.99
memory_limit: 100000
gamma: 0.95
memory_limit: 1000000
noise_type: 'normal'
noise_std: 0.024
batch_size: 64
noise_std: 0.15
batch_size: 128
normalize_observations: True
normalize_returns: True

# To be tuned
AntBulletEnv-v0:
env_wrapper: utils.wrappers.TimeFeatureWrapper
n_timesteps: !!float 2e6
policy: 'LnMlpPolicy'
policy: 'MlpPolicy'
gamma: 0.99
memory_limit: 100000
memory_limit: 1000000
noise_type: 'normal'
noise_std: 0.024
batch_size: 64
noise_std: 0.22
batch_size: 256
normalize_observations: True
normalize_returns: True
normalize_returns: False

# To be tuned
HopperBulletEnv-v0:
n_timesteps: !!float 2e6
policy: 'LnMlpPolicy'
gamma: 0.99
memory_limit: 100000
noise_type: 'normal'
noise_std: 0.024
batch_size: 64
policy: 'MlpPolicy'
gamma: 0.98
memory_limit: 1000000
noise_type: 'ornstein-uhlenbeck'
noise_std: 0.652
batch_size: 256
actor_lr: 0.00156
critic_lr: 0.00156
normalize_observations: True
normalize_returns: True
normalize_returns: False
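
The `env_wrapper: utils.wrappers.TimeFeatureWrapper` entry above refers to the time wrappers added in this PR. A minimal sketch of what such a wrapper typically does, appending the normalized remaining time to the observation so the agent can tell states near the time limit apart (the repository's actual implementation may differ in details):

```python
import gym
import numpy as np
from gym.spaces import Box


class TimeFeatureWrapper(gym.Wrapper):
    """Append the remaining time (scaled to [0, 1]) to a Box observation."""

    def __init__(self, env, max_steps=1000):
        super(TimeFeatureWrapper, self).__init__(env)
        assert isinstance(env.observation_space, Box)
        low = np.append(env.observation_space.low, 0.0)
        high = np.append(env.observation_space.high, 1.0)
        # One extra dimension for the time feature
        self.observation_space = Box(low=low, high=high, dtype=np.float32)
        self._max_steps = max_steps
        self._current_step = 0

    def reset(self, **kwargs):
        self._current_step = 0
        return self._get_obs(self.env.reset(**kwargs))

    def step(self, action):
        self._current_step += 1
        obs, reward, done, info = self.env.step(action)
        return self._get_obs(obs), reward, done, info

    def _get_obs(self, obs):
        # 1.0 at the start of the episode, 0.0 when the time limit is reached
        time_feature = 1.0 - (self._current_step / self._max_steps)
        return np.append(obs, time_feature).astype(np.float32)
```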