Does the walker run reproduce correctly? #4

Open
letusfly85 opened this issue Sep 12, 2020 · 8 comments

@letusfly85

Hi, thank you for the cool repository!

I tried several tasks, such as walker walk and cheetah run. They seem to work fine.

But when I run walker run, the episode_reward does not reach around 700.
Is there any problem...? 🤔

[Screenshot 2020-09-12: episode_reward curve for walker run]

The original paper seems to show walker run reaching around 700 at 1M steps (page 7, Figure 7):

https://arxiv.org/pdf/1912.01603.pdf

Thank you.

@yusukeurakami (Owner)

I know it is too late to comment on this issue, but could you tell me the hyperparameters you used for this experiment?

@coderlemon17

@yusukeurakami Hi, I also find that the results for walker-run are weird. I ran the experiment with 5 different seeds, and here's what I got:
[image: walker-run training curves over 5 seeds]

And the hyperparameters I used are:

Hyperparameters
action_noise: 0.3
action_repeat: 2
actor_lr: 8.0e-05
adam_epsilon: 1.0e-07
algo: dreamer
batch_size: 50
belief_size: 200
bit_depth: 5
candidates: 1000
checkpoint_interval: 50
chunk_size: 50
cnn_activation_function: relu
collect_interval: 100
comment: ''
config: dm_control/dreamer/walker-run.yaml
dense_activation_function: elu
device: cuda:3
embedding_size: 1024
env: walker-run
episodes: 1000
exp_ckpt: ''
experience_size: 1000000
free_nats: 3
gamma: 0.99
global_kl_beta: 0.0
grad_clip_norm: 100.0
hidden_size: 200
id: dreamer
lambda_: 0.95
max_episode_length: 1000
model_ckpt: ''
model_lr: 0.001
model_lr_schedule: 0
optimisation_iters: 10
overshooting_distance: 50
overshooting_kl_beta: 0.0
overshooting_reward_scale: 0.0
planning_horizon: 15
render: false
save_experience_buffer: false
seed: 0
seed_episodes: 5
state_size: 30
symbolic_env: false
test: false
test_episodes: 10
test_interval: 10
top_candidates: 100
torch_deterministic: true
value_lr: 8.0e-05
worldmodel_LogProbLoss: false

@yingchengyang

I have the same question. Hoping for your reply. Thanks.

@sumwailiu

sumwailiu commented Jan 8, 2025

I think the root cause is that there are some minor mistakes in the reward computation (as discussed in Issue #5). After fixing them (see pull request #12), I found the walker-run task reproduces correctly.

@coderlemon17

> I think the root cause is that there are some minor mistakes in the reward computation (as discussed in Issue #5). After fixing them (see pull request #12), I found the walker-run task reproduces correctly.

Hi, thanks for the explanation. However, I'm a little confused about the fix.
Assume the sequence is $(s_1, a_1, r_1, \cdots)$. To my understanding, you are trying to predict $r_t$ with $(s_t, h_t)$, i.e. $(s_{\leq t}, a_{<t})$. However, you need $(s_{\leq t+1}, a_{<t+1})$ to predict $r_t$, is that correct?

@sumwailiu

sumwailiu commented Jan 8, 2025

> Hi, thanks for the explanation. However, I'm a little confused about the fix. Assume the sequence is $(s_1, a_1, r_1, \cdots)$. To my understanding, you are trying to predict $r_t$ with $(s_t, h_t)$, i.e. $(s_{\leq t}, a_{<t})$. However, you need $(s_{\leq t+1}, a_{<t+1})$ to predict $r_t$, is that correct?

It is correct that I try to predict $r_t$ with $(s_t, h_t)$, and I don't use $(s_{t+1}, a_t)$ to predict $r_t$.

Actually, the original implementation of dreamer-torch (i.e., `reward_loss = -reward_dist.log_prob(rewards[:-1]).mean(dim=(0, 1))`) tries to predict $r_{t-1}$ with $(s_t, h_t)$. In other words, it needs $(s_{\leq t+1}, a_{<t+1})$ to predict $r_t$.
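
Below is a minimal sketch of that indexing difference, assuming time-major `(T, B)` reward chunks and `T-1` rolled-out latents; the tensor names and the `rewards[1:]` target are my own illustration of the alignment being discussed here, not the literal change made in #12/#13.

```python
import torch
import torch.distributions as D

# Illustrative shapes only: a chunk of T steps, batch size B.
T, B = 50, 4
rewards = torch.randn(T, B)          # r_1 .. r_T from the replay chunk
# Stand-in for reward_model(beliefs, posterior_states): one prediction per
# rolled-out latent (h_t, s_t), assumed here to cover t = 2 .. T (T-1 entries).
reward_mean = torch.randn(T - 1, B)
reward_dist = D.Normal(reward_mean, 1.0)

# Original indexing: the prediction from (h_t, s_t) is scored against r_{t-1},
# so recovering r_t would also need the next latent, i.e. (s_{<=t+1}, a_{<t+1}).
reward_loss_old = -reward_dist.log_prob(rewards[:-1]).mean(dim=(0, 1))

# Shifted target: the same prediction is scored against r_t, so (h_t, s_t)
# alone is asked for the reward of its own step.
reward_loss_new = -reward_dist.log_prob(rewards[1:]).mean(dim=(0, 1))
```

In this sketch only the target slice changes; the reward head and the rolled-out latents are untouched.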

@yusukeurakami (Owner)

Thanks @sumwailiu for your PR.
Let me test it on my side with your PR. I need to recall the details first; I forgot how it's implemented in the paper and the original repo.

@sumwailiu

@coderlemon17 @yusukeurakami There are some mistakes in the first version of the fix (#12), although the reward computation there is indeed correct. Please refer to the second version (#13).
