Recurrent PPO #53

araffin · 2021-11-29T12:04:45Z

Description

Experimental version of PPO with LSTM policy.

Current status: usable but not polished, see #53 (comment)

Missing:

benchmark https://wandb.ai/sb3/no-vel-envs/reports/PPO-vs-RecurrentPPO-aka-PPO-LSTM-on-environments-with-masked-velocity--VmlldzoxOTI4NjE4 (and models in https://wandb.ai/openrlbenchmark/sb3?workspace=user-araffin )
more tests (reproducibility + identity learning/callback)
reduce code duplication in buffer
document buffer + rest
more testing of the different architectures (shared vs separate LSTM for actor/critic)

Known issue: if the model was trained on GPU and is tested on CPU, a warning is issued because the LSTM initial states cannot be unpickled. This is fine: they are reset anyway in setup_model(), so prediction is not affected.

Context

I have raised an issue to propose this change (required)
closes recurrent policy implementation in ppo [feature-request] DLR-RM/stable-baselines3#18

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (update in the documentation)

Checklist:

Note: we are using a maximum length of 127 characters per line

araffin · 2022-04-12T22:35:00Z

@andrewwarrington

When doing the backprop through the action probabilities (here), you just take a single step in the RNN, i.e. the gradient with respect to the RNN parameters given just the previous RNN state and input?

Actually not, see the shape in the buffer https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/feat/ppo-lstm/sb3_contrib/common/recurrent/buffers.py#L164
I just pushed proper masking for the padded sequences. We do backprop for the sequences collected during the n_steps of the rollout.
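
To illustrate what masking padded sequences means here, the following is a minimal standard-library sketch, not the code from buffers.py. It assumes a flat rollout of per-step values plus episode-start flags; the helper names `split_and_pad` and `masked_mean` are hypothetical. The rollout is cut into one sequence per episode, shorter sequences are padded to a common length, and a boolean mask keeps the padding out of the averaged loss.

```python
from typing import List, Tuple


def split_and_pad(
    values: List[float], episode_starts: List[bool], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[List[bool]]]:
    """Split a non-empty flat rollout into per-episode sequences and pad them."""
    sequences: List[List[float]] = []
    for value, is_start in zip(values, episode_starts):
        # A new sequence begins at the first step and at every episode start
        if is_start or not sequences:
            sequences.append([])
        sequences[-1].append(value)

    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # mask[i][t] is True for real steps and False for padding
    mask = [[t < len(seq) for t in range(max_len)] for seq in sequences]
    return padded, mask


def masked_mean(padded: List[List[float]], mask: List[List[bool]]) -> float:
    """Average only over the entries that correspond to real steps."""
    total = sum(v for row, m_row in zip(padded, mask) for v, m in zip(row, m_row) if m)
    count = sum(m for m_row in mask for m in m_row)
    return total / count


# Hypothetical per-step losses for a rollout of n_steps = 6 containing two episodes
losses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
episode_starts = [True, False, False, False, True, False]
padded, mask = split_and_pad(losses, episode_starts)
loss = masked_mean(padded, mask)
```

Without the mask, the two padded zeros in the second sequence would be averaged in and bias the loss toward zero.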

HamiltonWang · 2022-05-02T04:00:26Z

For the time being, is there any way to feed older data back to the trainer as input, to mimic a crude version of an LSTM? Is there any sample code for that?

araffin · 2022-05-03T08:05:22Z

For the time being, is there any way to feed older data back to the trainer as input, to mimic a crude version of an LSTM? Is there any sample code for that?

Hello,
you can already use this PR if you need to use recurrent PPO (see install from source in our doc), otherwise you can use frame stacking or history wrapper (see code in the RL Zoo).
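
For readers who want to see the idea behind the frame-stacking or history alternative, here is a minimal standard-library sketch. It is not the RL Zoo wrapper: the class name `ObservationHistory` and its parameters are hypothetical, and it only assumes that observations are flat lists of floats. The last few observations are kept in a fixed-size buffer and concatenated, so a feed-forward policy sees a short window of the past.

```python
from collections import deque
from typing import Deque, List


class ObservationHistory:
    """Keep the last `horizon` observations and expose them as one flat vector."""

    def __init__(self, obs_dim: int, horizon: int = 2) -> None:
        self.obs_dim = obs_dim
        self.horizon = horizon
        self._frames: Deque[List[float]] = deque(maxlen=horizon)
        self.reset()

    def reset(self) -> None:
        """Fill the history with zeros; call this at the start of each episode."""
        self._frames.clear()
        for _ in range(self.horizon):
            self._frames.append([0.0] * self.obs_dim)

    def push(self, obs: List[float]) -> List[float]:
        """Add the newest observation and return the stacked history, oldest first."""
        self._frames.append(list(obs))
        return [x for frame in self._frames for x in frame]


history = ObservationHistory(obs_dim=2, horizon=3)
first = history.push([1.0, 2.0])
second = history.push([3.0, 4.0])
```

Unlike an LSTM, the window length is fixed, so anything older than `horizon` steps is lost.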

henrydeclety · 2022-05-11T16:46:38Z

when LSTM for A2C?

araffin · 2022-05-11T19:43:33Z

when LSTM for A2C?

a2c is a special case of ppo ;) (cc @vwxyzjn )

vwxyzjn · 2022-05-11T23:01:49Z

@henrydeclety see https://github.com/vwxyzjn/a2c_is_a_special_case_of_ppo. We have a paper coming out soon...

vwxyzjn · 2022-05-20T02:30:41Z

The preprint of the paper is out at https://arxiv.org/abs/2205.09123 @henrydeclety :)
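
One ingredient of that claim can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the linked repository: the function names are hypothetical and it looks at a single action. On the first gradient step after a rollout, the new and old policies coincide, so the probability ratio is 1 and the PPO clipped objective has the same gradient with respect to the log-probability as the A2C objective. The remaining conditions for full equivalence are covered in the linked repository and preprint.

```python
import math
from typing import Callable


def ppo_clip_objective(
    log_prob_new: float, log_prob_old: float, advantage: float, clip_range: float = 0.2
) -> float:
    """PPO clipped surrogate for one action."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def a2c_objective(log_prob_new: float, advantage: float) -> float:
    """A2C policy-gradient objective for one action."""
    return log_prob_new * advantage


def derivative(fn: Callable[[float], float], x: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of d fn / d x."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


log_prob_old = math.log(0.3)  # hypothetical action probability
gradients = []
for advantage in (1.5, -0.7):
    # Evaluate both gradients where the new policy equals the old one (ratio = 1)
    ppo_grad = derivative(lambda lp: ppo_clip_objective(lp, log_prob_old, advantage), log_prob_old)
    a2c_grad = derivative(lambda lp: a2c_objective(lp, advantage), log_prob_old)
    gradients.append((ppo_grad, a2c_grad))
```

In both cases the two gradients agree up to finite-difference error and equal the advantage.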

HamiltonWang · 2022-05-20T02:46:11Z

Hello, you can already use this PR if you need to use recurrent PPO (see install from source in our doc), otherwise you can use frame stacking or history wrapper (see code in the RL Zoo).

I’ll give it a try

EloyAnguiano · 2023-03-15T14:34:04Z

How could I configure the maximum sequence length for the LSTM?

philippkiesling · 2023-03-15T15:24:21Z

@EloyAnguiano As far as I could tell from the code, the implementation in SB3 does not expose a sequence length. Instead, it saves the hidden state between steps of your environment and feeds it back in as input at the next step. So the maximum sequence length for the LSTM would be the number of steps (n_steps) collected before you update your policy.

This way you only need to compute each input once, instead of refeeding it every new step.
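
The behaviour described above can be sketched with a toy recurrent cell. This is a standard-library illustration under stated assumptions, not the SB3 implementation: `toy_cell` is a hypothetical one-unit stand-in for an LSTM, and `collect_rollout` is a hypothetical helper. The hidden state is carried from one step to the next, reset whenever an episode starts, and returned at the end so that the next rollout can continue from it.

```python
import math
from typing import List, Tuple


def toy_cell(hidden: float, obs: float) -> float:
    """Hypothetical one-unit recurrent cell standing in for an LSTM."""
    return math.tanh(0.5 * hidden + obs)


def collect_rollout(
    observations: List[float], episode_starts: List[bool], hidden: float = 0.0
) -> Tuple[List[float], float]:
    """Process each observation once, carrying the hidden state across steps."""
    hidden_per_step: List[float] = []
    for obs, is_start in zip(observations, episode_starts):
        if is_start:
            # A new episode must not see memory from the previous one
            hidden = 0.0
        hidden = toy_cell(hidden, obs)
        hidden_per_step.append(hidden)
    # The final state is returned so the next rollout can resume from it
    return hidden_per_step, hidden


observations = [0.1, 0.2, 0.3, 0.1]
episode_starts = [True, False, False, True]
states, last_state = collect_rollout(observations, episode_starts)

# The next rollout resumes from the saved state instead of re-feeding old inputs
next_states, _ = collect_rollout([0.4], [False], hidden=last_state)
```

With this scheme the longest sequence seen during an update is bounded by the rollout length, which matches the n_steps reading given above.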

araffin and others added 23 commits November 22, 2021 18:04

Running (not working yet) version of recurrent PPO

85c9a50

Fixes for multi envs

b92da74

Save WIP, rework the sampling

d9f9c4e

Add Box support

97ec8ec

Fix sample order

a890976

Being cleanup, code is broken (again)

7fecd9f

First working version (no shared lstm)

0ddc3f6

Start cleanup

5ef313b

Try rnn with value function

c803ac9

Re-enable batch size

0c8ab15

Deactivate vf rnn

eb1e6c1

Allow any batch size

f013346

Add support for evaluation

a14f2ce

Add CNN support

362dec4

Fix start of sequence

5b162db

Allow shared LSTM

954e6dd

Rename mask to episode_start

832093d

Fix type hint

2a9c956

Enable LSTM for critic

15c080a

Clean code

0d304aa

Fix for CNN LSTM

1dc78b4

Fix sampling with n_layers > 1

deaa7b4

Add std logger

ced6aee

This was referenced Nov 29, 2021

Episode start signal not used in RNN for on-policy algorithms thu-ml/tianshou#486

Open

Highlights over existing PyTorch RL repos DLR-RM/stable-baselines3#20

Closed

Update wording

b81fdff

Miffyli mentioned this pull request Dec 1, 2021

reccurent policy DLR-RM/stable-baselines3#210

Closed

araffin added 3 commits December 1, 2021 19:16

Merge branch 'master' into feat/ppo-lstm

a2a201f

Merge branch 'master' into feat/ppo-lstm

754e0a3

Rename and add dict obs support

c9c0b4e

Update default in perf test

1cd27da

araffin mentioned this pull request Apr 14, 2022

Add RecurrentPPO support DLR-RM/rl-baselines3-zoo#190

Merged

15 tasks

araffin added 3 commits April 15, 2022 20:25

Remove TODO, mask is now working

c52959b

Merge branch 'master' into feat/ppo-lstm

18e6230

Add helper to remove duplicated code, remove hack for padding

673d23a

vineetvermait approved these changes May 3, 2022

araffin added 2 commits May 8, 2022 15:23

Enable LSTM critic and raise threshold for cartpole with no vel

e271d03

Fix tests

73bb89c

araffin added 2 commits May 18, 2022 23:39

Update doc and tests

efa6181

Doc fix

564d428

Fix for new Sphinx version

408ed24

araffin mentioned this pull request May 29, 2022

observation_space does not match reset() observation, though I confirmed they are identical. DLR-RM/stable-baselines3#921

Closed

araffin added 4 commits May 29, 2022 22:02

Merge branch 'master' into feat/ppo-lstm

d917487

Fix doc note

6acb64a

Switch to batch first, no more additional swap

5fd8be7

Add comments and mask entropy loss

7a1d3e8

araffin merged commit 75b2de1 into master May 30, 2022

araffin deleted the feat/ppo-lstm branch May 30, 2022 02:31

vwxyzjn mentioned this pull request Sep 19, 2022

Are you interested in PRs for improvements in performance of PPO LSTM script? vwxyzjn/cleanrl#276

Open

araffin mentioned this pull request Oct 21, 2022

Questions regarding BPTT (backpropagation through time) #110

Closed
