
Question about the paper/implementation #3

Open
araffin opened this issue Aug 28, 2022 · 8 comments

araffin (Contributor) commented Aug 28, 2022

Hello,
thanks for sharing and open sourcing the work.
After a quick read of the paper, I had several questions:

I have a working implementation of TQC + DroQ using Stable-Baselines3 that I can also share ;) (I can do a PR on request, and it will probably be part of SB3 soon)
SB3 branch: https://github.com/DLR-RM/stable-baselines3/tree/feat/dropq
SB3 contrib branch: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/tree/feat/dropq
Training script: https://github.com/araffin/walk_in_the_park/blob/feat/sb3/train_sb3.py

EDIT: SBX = SB3 + Jax is available here: https://github.com/araffin/sbx (with TQC, DroQ and SAC-N)

W&B example run: https://wandb.ai/araffin/a1/runs/2ln32rqx?workspace=user-araffin
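
For anyone who wants to try it, here is a minimal sketch of what a DroQ-style configuration could look like with the SB3-contrib API. The `dropout_rate` / `layer_norm` policy kwargs are assumptions for illustration and may not match the exact API of the branches linked above:

```python
# Hypothetical sketch only: TQC with a DroQ-style configuration.
# The `dropout_rate` / `layer_norm` policy kwargs are assumptions and may not
# match the exact keyword names used in the feat/dropq branches.
from sb3_contrib import TQC

model = TQC(
    "MlpPolicy",
    "Pendulum-v1",
    learning_starts=1_000,
    gradient_steps=20,  # DroQ-style high update-to-data (UTD) ratio
    policy_kwargs=dict(dropout_rate=0.01, layer_norm=True),  # assumed kwargs
    verbose=1,
)
model.learn(total_timesteps=10_000)
```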

ikostrikov (Owner) commented

Hello,

  • We ablated over different UTD ratios and found that UTD=20 works best. See this figure. (A generic sketch of what the UTD ratio means in an off-policy loop is shown after this list.)
  • TQC is an exciting algorithm. However, we didn't try it specifically for this work.
  • We ran experiments with and without a low-pass filter; however, in our specific setup, we didn't notice a significant difference, probably due to larger damping values. At the same time, I think in many scenarios the low-pass filter can be useful.
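
As a generic illustration (not the training loop from this repository), a UTD ratio of 20 simply means 20 gradient updates per environment step. The `env` / `agent` / `replay_buffer` interfaces below are placeholders:

```python
# Generic off-policy training loop illustrating a UTD (update-to-data) ratio:
# for every environment step, perform `utd_ratio` gradient updates.
# Illustrative sketch only; not the training loop used in this repository.
def train_off_policy(env, agent, replay_buffer, num_env_steps,
                     batch_size=256, utd_ratio=20):
    obs, _ = env.reset()
    for _ in range(num_env_steps):
        # Collect one transition in the (real) environment.
        action = agent.sample_actions(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        replay_buffer.insert(obs, action, reward, terminated, next_obs)
        obs = next_obs
        if terminated or truncated:
            obs, _ = env.reset()

        # UTD ratio: several critic/actor updates per collected transition.
        for _ in range(utd_ratio):
            batch = replay_buffer.sample(batch_size)
            agent = agent.update(batch)
    return agent
```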

Results for TQC+DroQ look interesting! However, we do not plan to expand this repository and intend to keep it frozen to ensure the reproducibility of the results reported in the paper.

araffin (Contributor, Author) commented Aug 28, 2022

Thanks for the swift answer =)

We ablated over different UTD ratios and found that UTD=20 works best. See this figure.

Given how fast the implementation is, it would make sense to even try UTD > 20, no?

Btw, what makes it so fast? jax only or additional special tricks?

Did you consider running the training for longer than 20 minutes, or does it plateau/break? (let's say 1h for the easiest setup)
Because the learned policies walk forward, but one can tell it's an RL controller... (the gaits are not so natural/good-looking)

ikostrikov (Owner) commented

Our laptop could only run training with UTD=20 in real time, so we didn't try larger values :)

Yes, it's just jax.jit. Otherwise, it's a vanilla implementation without any additional engineering.
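
As a toy illustration of the pattern (not the actual update function from this repo): compiling the whole update step with `jax.jit` is where most of the speed comes from.

```python
# Toy illustration of the jax.jit pattern (not the actual update from this
# repo): the whole update step is compiled once, then every later call runs
# the fused XLA program.
import jax
import jax.numpy as jnp


@jax.jit
def mse_update(params, batch, lr=1e-3):
    def loss_fn(p):
        pred = batch["x"] @ p["w"] + p["b"]
        return jnp.mean((pred - batch["y"]) ** 2)

    grads = jax.grad(loss_fn)(params)
    # One SGD step, applied to every leaf of the parameter pytree.
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)


params = {"w": jnp.zeros((3, 1)), "b": jnp.zeros((1,))}
batch = {"x": jnp.ones((256, 3)), "y": jnp.ones((256, 1))}
params = mse_update(params, batch)  # first call compiles, later calls are fast
```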

In the wild, we were constrained by the battery capacity :) With more training it gets better and better.

araffin (Contributor, Author) commented Aug 28, 2022

In the wild, we were constrained by the battery capacity :) With more training it gets better and better.

Alright... still curious to see what it could do in the simplest setting (indoor, no battery constraint, flat ground).

fyi, I created a small report for the runs I did today with TQC ;) https://wandb.ai/araffin/a1/reports/TQC-with-DropQ-config-on-walk-in-the-park-env--VmlldzoyNTQxMzgz
After minor tuning of the discount factor (gamma=0.98), it consistently reaches a return > 3700 in only 8k env interactions =) (sometimes in only 5k)

araffin (Contributor, Author) commented Sep 23, 2022

As a follow-up, I've got a working version of TQC + DroQ in Jax here (I borrowed some code from your implementation ;)): vwxyzjn/cleanrl#272
(also a version of TQC + TD3 + DroQ, I still need to polish everything)
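
For anyone following along, the core DroQ change to the critic is essentially dropout plus layer normalization on each hidden layer. Here is a hedged flax sketch of that idea (illustrative only, not the code from the CleanRL PR):

```python
# Hedged sketch of a DroQ-style Q-network in flax: dropout + LayerNorm on each
# hidden layer. Illustrative only; not the implementation from the CleanRL PR.
from typing import Sequence

import flax.linen as nn
import jax.numpy as jnp


class DroQCritic(nn.Module):
    hidden_dims: Sequence[int] = (256, 256)
    dropout_rate: float = 0.01

    @nn.compact
    def __call__(self, obs, action, training: bool = True):
        x = jnp.concatenate([obs, action], axis=-1)
        for size in self.hidden_dims:
            x = nn.Dense(size)(x)
            # DroQ: dropout followed by layer normalization before the activation.
            x = nn.Dropout(rate=self.dropout_rate, deterministic=not training)(x)
            x = nn.LayerNorm()(x)
            x = nn.relu(x)
        return nn.Dense(1)(x)  # Q(s, a)
```

(At apply time with `training=True`, flax needs an RNG for dropout, e.g. `rngs={"dropout": key}`.)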

i1Cps commented Jul 2, 2024

@araffin Morning, I was reading the paper and trying to understand where the simulations played a part in the training. I understood the paper as promoting the idea of training in the real environment rather than in a simulation, cutting the sim -> real step out.

But it then mentions using MuJoCo and modelling the A1. I am assuming they perhaps initialised the training in MuJoCo and then fine-tuned in the real environment?

araffin (Contributor, Author) commented Jul 2, 2024

No no, they trained on the real robot only. I managed to reproduce the experiment: https://araffin.github.io/slides/design-real-rl-experiments/#/13/2

i1Cps commented Jul 2, 2024

Nice work! But I'm then confused as to how they incorporated the simulation (specifically MuJoCo) into their training/research. Was it simply to draw a comparison?
