0.4.1
API Change
- Add observation normalization in BaseVectorEnv (`norm_obs`, `obs_rms`, `update_obs_rms` and `RunningMeanStd`) (#308); see the sketch after this list
- Add `policy.map_action` to bound the raw action (e.g., map from (-inf, inf) to [-1, 1] by clipping or tanh squashing); the mapped action is not stored in the replay buffer (#313)
- Add `lr_scheduler` in on-policy algorithms, typically for `LambdaLR` (#318); demonstrated in the PPO sketch at the end of this section
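A minimal sketch of the new observation normalization, assuming Gym's `Pendulum-v0` as the environment; the variable names and the train/test statistics-sharing pattern are illustrative:

```python
import gym

from tianshou.env import DummyVectorEnv

env_fns = [lambda: gym.make("Pendulum-v0") for _ in range(4)]

# norm_obs=True makes the vector env maintain a RunningMeanStd over
# observations (exposed as obs_rms) and return normalized observations.
train_envs = DummyVectorEnv(env_fns, norm_obs=True)

# Reuse the training statistics at evaluation time and freeze them, so the
# test envs apply the same normalization without updating it.
test_envs = DummyVectorEnv(
    env_fns,
    norm_obs=True,
    obs_rms=train_envs.obs_rms,
    update_obs_rms=False,
)

obs = train_envs.reset()  # normalized observations, shape (4, obs_dim)
```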
Note
To adapt to this version, change `action_range=...` to `action_space=env.action_space` in policy initialization, as in the sketch below.
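A before/after migration sketch; `DDPGPolicy` stands in for any continuous-control policy, and its network/optimizer arguments are elided, so the policy lines are shown as comments:

```python
import gym

env = gym.make("Pendulum-v0")

# Before 0.4.1, the action bound was passed as a raw (low, high) tuple:
# policy = DDPGPolicy(actor, actor_optim, critic, critic_optim,
#                     action_range=(env.action_space.low[0],
#                                   env.action_space.high[0]))

# From 0.4.1 on, pass the action space itself:
# policy = DDPGPolicy(actor, actor_optim, critic, critic_optim,
#                     action_space=env.action_space)
```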
Bug Fix
- Fix incorrect behaviors with on-policy algorithms (an error when `n/ep == 0`, and the reward shown in tqdm) (#306, #328)
- Fix the q-value `mask_action` error for `obs_next` (#310)
Enhancement
- Release SOTA Mujoco benchmark (DDPG/TD3/SAC: #305, REINFORCE: #320, A2C: #325, PPO: #330) and add corresponding notes in /examples/mujoco/README.md
- Fix `numpy>=1.20` typing issue (#323)
- Add cross-platform unittest (#331)
- Add a test on how to deal with finite env (#324)
- Add value normalization in on-policy algorithms (#319, #321)
- Separate advantage normalization and value normalization in PPO (#329); see the sketch after this list
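A hedged PPO sketch that ties these pieces together: it passes `action_space` (per the Note above), the `lr_scheduler` hook (#318), and the two normalization switches. The keyword names `reward_normalization` and `advantage_normalization`, the network helpers, and all hyperparameters are assumptions based on the 0.4.1 API; check the `PPOPolicy` signature of your version:

```python
import gym
import torch
from torch.distributions import Independent, Normal
from torch.optim.lr_scheduler import LambdaLR

from tianshou.policy import PPOPolicy
from tianshou.utils.net.common import Net
from tianshou.utils.net.continuous import ActorProb, Critic

env = gym.make("Pendulum-v0")

# Simple MLP actor/critic; sizes are illustrative.
net_a = Net(env.observation_space.shape, hidden_sizes=[64, 64])
actor = ActorProb(net_a, env.action_space.shape)
net_c = Net(env.observation_space.shape, hidden_sizes=[64, 64])
critic = Critic(net_c)
optim = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

# Turn the actor's (mu, sigma) output into a torch distribution.
def dist_fn(*logits):
    return Independent(Normal(*logits), 1)

max_update_num = 100  # illustrative: total number of policy updates

policy = PPOPolicy(
    actor, critic, optim, dist_fn,
    action_space=env.action_space,  # replaces action_range (see Note)
    # decay the learning rate linearly to zero over training (#318)
    lr_scheduler=LambdaLR(optim, lr_lambda=lambda n: 1 - n / max_update_num),
    reward_normalization=True,      # value normalization (#319, #321)
    advantage_normalization=True,   # per-minibatch advantage norm (#329)
)
```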