
Please check if the reward is correct #13

Open
gaya1thri opened this issue Dec 2, 2024 · 2 comments

Comments

@gaya1thri
Contributor

We got the results below for the reward after training the RL model and the RL+LLM model (we used Llama instead of GPT-4, as you suggested).
When running with the LLM, it executed until step 41 and then produced the final reward, but the LLM accepted every decision of the RL agent, always considering it reasonable.
The reward went from -187165.88 (RL) to -2.29 (RL+LLM).
Is the reward we are getting correct?
Also, how do we get the mean travel time, mean waiting time, and mean speed values?
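
A minimal sketch of one way such metrics can be collected, assuming the environment is SUMO-based and reachable through TraCI; only the `traci.*` calls are standard TraCI API, while the accumulation loop and helper names are hypothetical and not taken from this repository:

```python
# Sketch (assumption): per-episode traffic metrics gathered through TraCI while
# a SUMO simulation is stepping. Call record_step() after every simulationStep().
import traci

depart_times = {}     # vehicle id -> simulation time at departure
travel_times = []     # durations of completed trips
waiting_samples = []  # per-step waiting time of every running vehicle
speed_samples = []    # per-step speed of every running vehicle

def record_step():
    now = traci.simulation.getTime()
    for veh in traci.simulation.getDepartedIDList():
        depart_times[veh] = now
    for veh in traci.simulation.getArrivedIDList():
        if veh in depart_times:
            travel_times.append(now - depart_times.pop(veh))
    for veh in traci.vehicle.getIDList():
        waiting_samples.append(traci.vehicle.getWaitingTime(veh))
        speed_samples.append(traci.vehicle.getSpeed(veh))

def summarize():
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {
        "mean_travel_time": mean(travel_times),
        "mean_waiting_time": mean(waiting_samples),
        "mean_speed": mean(speed_samples),
    }
```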

@pangay
Member

pangay commented Dec 4, 2024

The final output of the program is the cumulative reward, so I believe the value of -2.29 is problematic. Regarding the training, the code I provided does not incorporate the large model into the training process. The idea behind this code is to first train the RL model, and then, during usage, attach the LLM. In other words, you need to train the RL model weights first and then combine the trained RL model with the LLM.
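
A minimal sketch of that workflow, assuming the RL agent is a Stable-Baselines3 PPO model with VecNormalize statistics (consistent with the .pkl file mentioned below); `make_env` and `query_llm` are hypothetical stand-ins, not the repository's actual code:

```python
# Sketch under assumptions: Stable-Baselines3 PPO + VecNormalize.
# make_env() and query_llm() are placeholders for the repo's env and LLM call.
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

def make_env():
    # Stand-in environment; the repo's traffic env would be constructed here.
    return gym.make("CartPole-v1")

def query_llm(obs, rl_action):
    # Hypothetical LLM step: ask whether the RL action is reasonable and return
    # either the RL action or an override. Here it simply accepts it.
    return rl_action

# Step 1: train the RL model on its own -- the LLM is NOT part of training.
venv = VecNormalize(DummyVecEnv([make_env]))
model = PPO("MlpPolicy", venv, verbose=1)
model.learn(total_timesteps=300_000)            # e.g. the 3e5 steps mentioned below
model.save("rl_model")
venv.save("vec_normalize.pkl")                  # save the normalization statistics too

# Step 2: at usage time, load the trained weights and attach the LLM on top.
eval_env = VecNormalize.load("vec_normalize.pkl", DummyVecEnv([make_env]))
eval_env.training = False                       # freeze the running statistics
eval_env.norm_reward = False                    # report the raw reward
model = PPO.load("rl_model", env=eval_env)

obs = eval_env.reset()
cumulative_reward, done = 0.0, False
while not done:
    rl_action, _ = model.predict(obs, deterministic=True)
    action = query_llm(obs, rl_action)          # LLM reviews / overrides the RL action
    obs, rewards, dones, infos = eval_env.step(action)
    cumulative_reward += float(rewards[0])
    done = bool(dones[0])
print("cumulative reward:", cumulative_reward)  # the value being compared in this issue
```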

@gaya1thri
Contributor Author

gaya1thri commented Dec 5, 2024

> The final output of the program is the cumulative reward, so I believe the value of -2.29 is problematic. Regarding the training, the code I provided does not incorporate the large model into the training process. The idea behind this code is to first train the RL model, and then, during usage, attach the LLM. In other words, you need to train the RL model weights first and then combine the trained RL model with the LLM.

Thank you very much for replying.

[Screenshot: IMG_6335]

Since we got 187555, we changed the number of training steps from 3e5 to 10e6.
I updated last_vec_normalize.pkl with vec_normalize_10e6model.
With 10e6 steps we got a reward of -1014 (with just RL).
Using the RL model we trained, combined with Llama, we got a reward of -21.5.
Are these values appropriate now (better than before)?

[Screenshot: IMG_6336]
