Callback collection, cleanup and bug fixes
Breaking Changes:
- `evaluate_policy` now returns the standard deviation of the reward per episode as second return value (instead of `n_steps`)
- `evaluate_policy` now returns as second return value a list of the episode lengths when `return_episode_rewards` is set to `True` (instead of `n_steps`); see the sketch after this list
- Callbacks are now called after each `env.step()` for consistency (they were called every `n_steps` before in algorithms like `A2C` or `PPO2`)
- Removed unused code in `common/a2c/utils.py` (`calc_entropy_softmax`, `make_path`)
- Refactoring, including removed files and moving functions:
  - Algorithms no longer import from each other, and `common` does not import from algorithms.
  - `a2c/utils.py` removed and split into other files:
    - `common/tf_util.py`: `sample`, `calc_entropy`, `mse`, `avg_norm`, `total_episode_reward_logger`, `q_explained_variance`, `gradient_add`, `check_shape`, `seq_to_batch`, `batch_to_seq`.
    - `common/tf_layers.py`: `conv`, `linear`, `lstm`, `_ln`, `lnlstm`, `conv_to_fc`, `ortho_init`.
    - `a2c/a2c.py`: `discount_with_dones`.
    - `acer/acer_simple.py`: `get_by_index`, `EpisodeStats`.
    - `common/schedules.py`: `constant`, `linear_schedule`, `middle_drop`, `double_linear_con`, `double_middle_drop`, `SCHEDULES`, `Scheduler`.
  - `trpo_mpi/utils.py` functions moved (`traj_segment_generator` moved to `common/runners.py`, `flatten_lists` to `common/misc_util.py`).
  - `ppo2/ppo2.py` functions moved (`safe_mean` to `common/math_util.py`, `constfn` and `get_schedule_fn` to `common/schedules.py`).
  - `sac/policies.py` function `mlp` moved to `common/tf_layers.py`.
  - `sac/sac.py` function `get_vars` removed (replaced with `tf_util.get_trainable_vars`).
  - `deepq/replay_buffer.py` renamed to `common/buffers.py`.
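As a quick illustration of the new `evaluate_policy` return values, here is a minimal sketch (the `PPO2`/`CartPole-v1` setup is only an example):

```python
from stable_baselines import PPO2
from stable_baselines.common.evaluation import evaluate_policy

model = PPO2('MlpPolicy', 'CartPole-v1').learn(10000)

# Second return value is now the std of the per-episode reward, not n_steps
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)

# With return_episode_rewards=True, the second return value is now the list
# of episode lengths, not n_steps
episode_rewards, episode_lengths = evaluate_policy(
    model, model.get_env(), n_eval_episodes=10, return_episode_rewards=True)
```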
New Features:
- Parallelized updating and sampling from the replay buffer in DQN. (@flodorner)
- Docker build script, `scripts/build_docker.sh`, can push images automatically.
- Added callback collection (see the example below)
- Added `unwrap_vec_normalize` and `sync_envs_normalization` in the `vec_env` module to synchronize two `VecNormalize` environments (see the example below)
- Added a seeding method for vectorized environments. (@NeoExtended)
- Added `extend` method to store batches of experience in `ReplayBuffer` (see the sketch below). (@solliet)
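A minimal sketch of the new callback collection, assuming the `CheckpointCallback` and `EvalCallback` signatures described in the callbacks documentation (paths and hyperparameters below are illustrative):

```python
import gym

from stable_baselines import SAC
from stable_baselines.common.callbacks import CheckpointCallback, EvalCallback

# Save a checkpoint every 1000 calls to env.step()
checkpoint_callback = CheckpointCallback(save_freq=1000, save_path='./logs/',
                                         name_prefix='rl_model')

# Periodically evaluate on a separate environment and keep the best model
eval_env = gym.make('Pendulum-v0')
eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/best_model',
                             log_path='./logs/results', eval_freq=500)

model = SAC('MlpPolicy', 'Pendulum-v0')
# learn() accepts a single callback or a list of callbacks
model.learn(total_timesteps=5000, callback=[checkpoint_callback, eval_callback])
```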
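Similarly, a sketch of keeping an evaluation environment's normalization statistics in sync with the training environment using the new helper (the `(training_env, eval_env)` argument order is an assumption):

```python
import gym

from stable_baselines import PPO2
from stable_baselines.common.vec_env import (DummyVecEnv, VecNormalize,
                                             sync_envs_normalization)

make_env = lambda: gym.make('Pendulum-v0')
train_env = VecNormalize(DummyVecEnv([make_env]))
# Evaluation env: do not update the running statistics, do not normalize rewards
eval_env = VecNormalize(DummyVecEnv([make_env]), training=False, norm_reward=False)

model = PPO2('MlpPolicy', train_env).learn(10000)

# Copy the running observation/reward statistics from the training env
# to the evaluation env before evaluating
sync_envs_normalization(train_env, eval_env)
```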
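For the new `ReplayBuffer.extend` method, a sketch assuming it mirrors the argument order of `ReplayBuffer.add` but takes batched arrays (the exact signature is an assumption):

```python
import numpy as np

from stable_baselines.common.buffers import ReplayBuffer

buffer = ReplayBuffer(size=50000)

# Batch of 32 transitions for a toy 4-dimensional observation space
obs_t = np.random.randn(32, 4)
actions = np.random.randint(0, 2, size=32)
rewards = np.random.randn(32)
obs_tp1 = np.random.randn(32, 4)
dones = np.zeros(32, dtype=bool)

# Assumed to store the whole batch at once instead of looping over add()
buffer.extend(obs_t, actions, rewards, obs_tp1, dones)
```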
Bug Fixes:
- Fixed Docker images via `scripts/build_docker.sh` and `Dockerfile`: GPU image now contains `tensorflow-gpu`, and both images have `stable_baselines` installed in developer mode at the correct directory for mounting.
- Fixed Docker GPU run script, `scripts/run_docker_gpu.sh`, to work with the new NVidia Container Toolkit.
- Repeated calls to `RLModel.learn()` now preserve internal counters for some episode logging statistics that used to be zeroed at the start of every call.
- Fixed `DummyVecEnv.render` for `num_envs > 1`. This used to print a warning and then not render at all. (@shwang)
- Fixed a bug in PPO2, ACER, A2C, and ACKTR where repeated calls to `learn(total_timesteps)` reset the environment on every call, potentially biasing samples toward early episode timesteps (see the sketch after this list). (@shwang)
- Fixed by adding lazy property `ActorCriticRLModel.runner`. Subclasses now use the lazily-generated `self.runner` instead of reinitializing a new Runner every time `learn()` is called.
- Fixed a bug in `check_env` where it would fail on high-dimensional action spaces.
- Fixed `Monitor.close()` that was not calling the parent method.
- Fixed a bug in `BaseRLModel` when seeding vectorized environments. (@NeoExtended)
- Fixed `num_timesteps` computation to be consistent between algorithms (updated after `env.step()`). Only `TRPO` and `PPO1` update it differently (after synchronization) because they rely on MPI.
- Fixed a bug in `TRPO` with NaN standardized advantages. (@richardwu)
- Fixed partial minibatch computation in `ExpertDataset`. (@richardwu)
- Fixed normalization (with `VecNormalize`) for off-policy algorithms.
- Fixed `sync_envs_normalization` to sync the reward normalization too.
- Bumped minimum Gym version (>=0.11).
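To illustrate the fixed behavior of repeated `learn()` calls, a minimal sketch (the `A2C`/`CartPole-v1` setup is only an example; `reset_num_timesteps` is an existing argument of `learn()`):

```python
from stable_baselines import A2C

model = A2C('MlpPolicy', 'CartPole-v1')

# First training phase
model.learn(total_timesteps=10000)

# Second phase: the environment is no longer reset at the start of the call,
# so samples are not biased toward early episode timesteps, and episode
# logging statistics are preserved across calls.
# reset_num_timesteps=False also keeps the internal timestep counter running.
model.learn(total_timesteps=10000, reset_num_timesteps=False)
```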
Others:
- Removed redundant return value from `a2c.utils::total_episode_reward_logger`. (@shwang)
- Cleanup and refactoring in `common/identity_env.py`. (@shwang)
- Added a Makefile to simplify common development tasks (build the doc, type check, run the tests).