
Recurrent PPO #53

Merged (60 commits) on May 30, 2022

Conversation

@araffin (Member) commented Nov 29, 2021

Description

Experimental version of PPO with LSTM policy.

Current status: usable but not polished, see #53 (comment)

Missing:

Known issue: if the model was trained on GPU and is tested on CPU, a warning will be issued because the LSTM initial states cannot be unpickled. This is fine, as they are reset anyway in setup_model() and it does not affect prediction.
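For reference, a minimal usage sketch (assuming the API exposed by this PR: RecurrentPPO, the MlpLstmPolicy alias, and the state/episode_start arguments of predict()). The LSTM states passed to predict() start from None and are reset whenever an episode starts, which is also why the unpickling warning above is harmless:

```python
import numpy as np

from sb3_contrib import RecurrentPPO

model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", verbose=1)
model.learn(5_000)

env = model.get_env()
obs = env.reset()
# Hidden and cell states of the LSTM, initialized on the first call to predict()
lstm_states = None
num_envs = 1
# Episode start signals are used to reset the LSTM states
episode_starts = np.ones((num_envs,), dtype=bool)
for _ in range(500):
    action, lstm_states = model.predict(
        obs, state=lstm_states, episode_start=episode_starts, deterministic=True
    )
    obs, rewards, dones, infos = env.step(action)
    episode_starts = dones
```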

Context

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • The functionality/performance matches that of the source (required for new training algorithms or training-related features).
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have included an example of using the feature (required for new features).
  • I have included baseline results (required for new training algorithms or training-related features).
  • I have updated the documentation accordingly.
  • I have updated the changelog accordingly (required).
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)

Note: we are using a maximum length of 127 characters per line

@araffin (Member, Author) commented Apr 12, 2022

@andrewwarrington

When doing the backprop through the action probabilities (here), you just take a single step in the RNN, i.e. the gradient with respect to the RNN parameters given just the previous RNN state and input?

Actually no, see the shape in the buffer: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/feat/ppo-lstm/sb3_contrib/common/recurrent/buffers.py#L164
I just pushed proper masking for the padded sequences. We backpropagate through the whole sequences collected during the n_steps of the rollout.
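For readers following along, here is an illustrative sketch of what masking padded sequences means (the names and shapes are made up for the example; this is not the actual buffer code linked above): sequences from the rollout are padded to a common length, and a boolean mask excludes the padded steps from the loss.

```python
import torch as th

# Toy shapes: 8 sequences padded to length 16 (hypothetical numbers)
n_seq, max_len = 8, 16
padded_log_prob = th.randn(n_seq, max_len)    # log pi(a|s) per step, padded
padded_advantages = th.randn(n_seq, max_len)  # advantages per step, padded
seq_lengths = th.randint(1, max_len + 1, (n_seq,))

# Boolean mask: True for real steps, False for padding
mask = th.arange(max_len).unsqueeze(0) < seq_lengths.unsqueeze(1)

# Only the real (non-padded) steps contribute to the loss
policy_loss = -(padded_advantages * padded_log_prob)[mask].mean()
```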

@HamiltonWang

For the time being, is there any way we can feed older data back to the trainer as input, to mimic a crude version of an LSTM? Any sample code to do that?

@araffin (Member, Author) commented May 3, 2022

For the time being, is there any way we can feed older data back to the trainer as input, to mimic a crude version of an LSTM? Any sample code to do that?

Hello,
you can already use this PR if you need recurrent PPO (see "install from source" in our documentation); otherwise, you can use frame stacking or a history wrapper (see the code in the RL Zoo).
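For completeness, a minimal frame-stacking sketch using standard SB3 components (VecFrameStack; the history wrapper mentioned above lives in the RL Zoo and is not shown here):

```python
import gym

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

# Stack the last 4 observations so the policy sees a short history
env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
env = VecFrameStack(env, n_stack=4)

model = PPO("MlpPolicy", env, verbose=1)
model.learn(10_000)
```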

@henrydeclety

when LSTM for A2C?

@araffin (Member, Author) commented May 11, 2022

when LSTM for A2C?

A2C is a special case of PPO ;) (cc @vwxyzjn)

@vwxyzjn (Contributor) commented May 11, 2022

@henrydeclety see https://github.com/vwxyzjn/a2c_is_a_special_case_of_ppo. We have a paper coming out soon...
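For the curious, the gist (as I understand it from the linked repo and paper, so treat the exact settings as an approximation rather than a verbatim copy) is that PPO reduces to A2C when it performs a single epoch over the whole rollout without advantage normalization (the probability ratio is then exactly 1, so clipping never triggers) and uses A2C-style hyperparameters and optimizer:

```python
import torch as th

from stable_baselines3 import PPO

# Rough sketch of the correspondence; see the linked repo/paper for the exact settings.
model = PPO(
    "MlpPolicy",
    "CartPole-v1",
    n_steps=5,                  # A2C-style short rollouts
    n_epochs=1,                 # single update per rollout -> ratio == 1, clipping inactive
    batch_size=5,               # one minibatch = the whole rollout (single env assumed)
    gae_lambda=1.0,             # no GAE smoothing, as in A2C
    normalize_advantage=False,  # A2C does not normalize advantages by default
    learning_rate=7e-4,
    policy_kwargs=dict(
        optimizer_class=th.optim.RMSprop,
        optimizer_kwargs=dict(alpha=0.99, eps=1e-5, weight_decay=0),
    ),
    verbose=1,
)
model.learn(10_000)
```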

@vwxyzjn (Contributor) commented May 20, 2022

The preprint of the paper is out at https://arxiv.org/abs/2205.09123 @henrydeclety :)

@HamiltonWang

Hello, you can already use this PR if you need recurrent PPO (see "install from source" in our documentation); otherwise, you can use frame stacking or a history wrapper (see the code in the RL Zoo).

I’ll give it a try

@EloyAnguiano

How could I configure the maximum sequence length for the LSTM?

@philippkiesling commented Mar 15, 2023

@EloyAnguiano As far as I can tell from the code, the implementation in SB3 does not use a fixed sequence length; it saves the LSTM hidden state between steps of your environment and then feeds it back in as input. So the maximum sequence length for the LSTM is the number of steps (n_steps) collected before the policy update.

This way each input only needs to be computed once, instead of being re-fed at every new step.
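In other words, the rollout length is the knob that bounds the training sequence length. A hedged sketch, assuming the argument names used in this PR (lstm_hidden_size and n_lstm_layers in policy_kwargs):

```python
from sb3_contrib import RecurrentPPO

# The hidden state is carried across environment steps during rollout collection,
# and training backpropagates through the sequences inside each rollout, so
# n_steps effectively bounds the sequence length seen by the LSTM during updates.
model = RecurrentPPO(
    "MlpLstmPolicy",
    "CartPole-v1",
    n_steps=128,  # rollout length per environment
    policy_kwargs=dict(lstm_hidden_size=64, n_lstm_layers=1),
    verbose=1,
)
model.learn(10_000)
```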

Development

Successfully merging this pull request may close these issues:

  • recurrent policy implementation in ppo [feature-request]