- TensorFlow 1.05
- NumPy
- OpenAI Gym
- Discrete policy
- Continuous policy
- Different losses
- LSTM support
- Parallelism
- Penalty for going out of action bounds (a possible form of this penalty is sketched below)
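The out-of-bounds penalty can be pictured as a hinge-style term scaled by the `eta` hyperparameter described further down. The following is only a minimal sketch of that idea in TensorFlow 1.x; the function name and exact form are assumptions, not the repository's actual code.

```python
# Hedged sketch (not the repo's exact code): one way to penalize a policy
# whose mean action drifts outside the environment's action bounds, scaled
# by a coefficient analogous to `eta` from the hyperparameter list below.
import tensorflow as tf

def action_bound_penalty(mean, low, high, eta):
    """Hinge-style penalty: zero inside [low, high], growing linearly
    with the size of the violation outside the bounds."""
    below = tf.nn.relu(low - mean)   # how far the mean is under the lower bound
    above = tf.nn.relu(mean - high)  # how far the mean is over the upper bound
    return eta * tf.reduce_mean(below + above)
```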
To start the training process:

```
python train_parallel.py -env %ENV_NAME%
```

To play a trained model:

```
python play.py -env %ENV_NAME%
```

To start TensorBoard while training:

```
tensorboard --logdir=logs/%ENV_NAME%
```
All trained models are located at `models/%ENV_NAME%/`.

If you want to override hyperparameters, create or modify the `props/%ENV_NAME%.properties` file.
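As an illustration, an override file for a hypothetical environment might look like the snippet below. The key names come from the hyperparameter list that follows; the environment name, the chosen values, and the plain `key=value` syntax are assumptions for illustration only.

```
# props/MyEnv-v0.properties (hypothetical environment name and values)
clip_eps=0.1
grad_step=0.0003
hidden_layer_size=256
recurrent=True
```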
Available hyperparameters:

| Parameter | Description | Default |
| --- | --- | --- |
| `clip_eps` | Clipping bound (if using the clipped surrogate objective) | 0.2 |
| `grad_step` | Learning rate for Adam | 0.0001 |
| `discount_factor` | Discount factor | 0.99 |
| `gae_factor` | Lambda for Generalized Advantage Estimation | 0.98 |
| `batch_size` | Batch size for training. Each gathering worker collects `batch_size / gather_per_worker` episode steps | 4096 |
| `rollout_size` | Full size of a rollout. A worker trains on `rollout_size / batch_size` batches | 8192 |
| `epochs` | Number of epochs for training on `rollout_size` timesteps | 10 |
| `entropy_coef` | Entropy penalty coefficient | 0 |
| `hidden_layer_size` | Common hidden layer size (LSTM unit size) | 128 |
| `timestep_size` | (LSTM only) Number of timesteps per LSTM batch. Input data will have shape `(batch_size, timestep_size, state_dim)` | |
| `kl_target` | Target KL divergence (only for `kl_loss`) | 0.01 |
| `use_kl_loss` | Use `kl_loss` instead of the clipped objective | False |
| `init_beta` | Initial multiplier for `kl_loss` (only for `kl_loss`) | 1.0 |
| `eta` | Multiplier for the hinge loss | |
| `recurrent` | Use a recurrent NN | False |
| `worker_num` | Number of workers. After each optimization step, every worker sends its gradients to the master process | |
| `gather_per_worker` | Number of "experience gathering" workers per optimizing worker | |
| `nn_std` | Whether to estimate the policy variance with the neural network or with a separate set of trainable variables | |
| `reward_transform` | Reward transformation function (allowed values: `scale`, `positive`, `identity`) | |
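For orientation, here is a hedged sketch of how the two loss variants selected by `use_kl_loss` are usually combined with `clip_eps`, `init_beta`/`kl_target`, and `entropy_coef` in TensorFlow 1.x. Variable and function names are assumptions and do not mirror the repository's code.

```python
# Hedged sketch (assumed names, not the repo's exact graph): how the clipped
# surrogate objective and the KL-penalty alternative selected by `use_kl_loss`
# typically fit together.
import tensorflow as tf

def ppo_loss(log_prob, old_log_prob, advantage, kl, entropy,
             clip_eps=0.2, use_kl_loss=False, beta=1.0, kl_target=0.01,
             entropy_coef=0.0):
    """Build the scalar policy loss to be minimized by Adam (`grad_step`)."""
    ratio = tf.exp(log_prob - old_log_prob)  # pi_new(a|s) / pi_old(a|s)
    if use_kl_loss:
        # KL-penalty objective: surrogate minus beta * KL(pi_old || pi_new).
        # `beta` would start at `init_beta` and be adapted toward `kl_target`.
        policy_loss = -tf.reduce_mean(ratio * advantage - beta * kl)
    else:
        # Clipped surrogate objective, bounded by `clip_eps`.
        clipped = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        policy_loss = -tf.reduce_mean(
            tf.minimum(ratio * advantage, clipped * advantage))
    # Optional entropy bonus controlled by `entropy_coef`.
    return policy_loss - entropy_coef * tf.reduce_mean(entropy)
```

Similarly, `discount_factor` and `gae_factor` play the role of gamma and lambda in Generalized Advantage Estimation. A minimal NumPy sketch, assuming a single non-terminating rollout with one bootstrap value appended to `values`:

```python
# Hedged GAE sketch: `values` has len(rewards) + 1 entries (bootstrap at the end).
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.98):
    advantages = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        last = delta + gamma * lam * last                        # discounted sum
        advantages[t] = last
    return advantages
```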
Many of the implementation details are taken from here.