Performance drop between offline and online #29
-
Thanks so much for your work, I find it very helpful. I am confused by a problem: I trained QMIX+CQL on the SMAC v1 3m scenario, first offline and then online. I commented out the training part in the online training function, so it is actually just evaluating the performance. But I see a big performance discrepancy: offline training achieves an episode return of around 20, but when I simply evaluate it in train_online, the return is much lower, roughly in [2, 4]. I am very confused, and hope you could provide some help or insight into it. Thanks so much! Below are my episode return curves for offline and online. I also paste my online training code and part of my main() function.
def main(_):
-
Hi @zhonghai1995, this is very cool. It looks like you are trying to do some offline-to-online training. Let me take some time to look into why it doesn't seem to be working. I'll get back to you as soon as possible; I am working on it now.
-
I think I know what is going on. The QMIX system (qmix.py) has an argument called eps_decay_timesteps=50_000. This means that the qmix_cql.py system will use epsilon-greedy action selection for the first 50,000 timesteps. That means your system is choosing random actions when it goes online. Try setting that value to zero. I see in qmix_cql.py I did not expose the eps_decay_timesteps argument, so you may want to modify the code a bit so that you can change it. By the way, I have found IDRQN+CQL works better than QMIX+CQL: https://instadeepai.github.io/og-marl/baselines/smac_v1/
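For intuition, here is a minimal sketch of a linearly decaying epsilon-greedy schedule of the kind described above. Only eps_decay_timesteps is taken from the discussion; the function name and the eps_start/eps_min values are illustrative, not the exact OG-MARL API.

```python
# Sketch of a linearly decaying epsilon-greedy schedule.
# With eps_decay_timesteps=50_000, epsilon stays high for a long time,
# so a freshly "resumed" online run acts mostly at random at first.

def get_epsilon(env_step: int, eps_decay_timesteps: int = 50_000,
                eps_start: float = 1.0, eps_min: float = 0.05) -> float:
    if eps_decay_timesteps <= 0:
        return eps_min  # no exploration schedule: act (near-)greedily from the start
    frac = min(env_step / eps_decay_timesteps, 1.0)
    return eps_start + frac * (eps_min - eps_start)

if __name__ == "__main__":
    for step in [0, 10_000, 25_000, 50_000]:
        print(step, round(get_epsilon(step), 3))  # 1.0, 0.81, 0.525, 0.05
```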
-
I see, and it solves the problem. Thanks again!
-
I am glad it's working. We actually also did a research project on offline-to-online MARL which you might find interesting. You can find it here:
-
By the way, did you try to use OMAR for discrete actions? I tried the Gumbel-max trick in the SMAC environments, and the performance is bad.
-
I have also not successfully implemented OMAR for discrete actions. I have seen other people run into the same challenge. See here: thu-rllab/CFCQL#1
-
Hi @jcformanek, I tried OMAR on 2ant in MAMuJoCo with the Good dataset. I used Adam instead of RMSProp and increased the hidden sizes to 256, and the performance seems better; it roughly matches the performance of BC and ITD3+BC in Table D.5 of your paper. Please have a look.
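For reference, the kind of change described here might look as follows in TF2. This is a sketch only; the learning rate, layer sizes and variable names are assumptions, not the OMAR or OG-MARL configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Swap RMSProp for Adam (learning rate is a placeholder value).
# optimizer = tf.keras.optimizers.RMSprop(learning_rate=5e-4)  # previous choice
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4)

# Widen the hidden layers to 256 units.
hidden_size = 256
critic_network = tf.keras.Sequential([
    layers.Dense(hidden_size, activation="relu"),
    layers.Dense(hidden_size, activation="relu"),
    layers.Dense(1),  # Q-value head
])
```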
-
I ran more seeds; the result is now across 20 seeds. It looks like it achieves an average of roughly 1700 mean reward on the Good dataset in the 2ant scenario, worse than the BC-based methods, but still better than reported in the table.
-
Oh that is great, thank you for sharing. We will work on updating all of the benchmark results.
-
og-marl/og_marl/tf2/systems/idrqn.py, lines 115 to 116 in 68db0c0. One more question: why do you increase the env step here? Thanks!
-
That's used to control the epsilon-greedy exploration. It only has an effect if you train online.
-
og-marl/og_marl/tf2/systems/base.py, lines 103 to 127 in 68db0c0. But you also increase the environment step here. And since the default value of the explore argument for the select-actions function is True, you would increase the environment step counter twice for a single environment step. Is this expected?
-
Oh I see. I think you are right! That would result in exploration decreasing 2x faster than I expected. You are welcome to open a PR to fix it if you like. Alternatively, I can attend to it.
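To illustrate the effect numerically (a standalone sketch using the same linear schedule as above; the function and parameter names are made up, not the OG-MARL code): if the counter is incremented twice per real environment step, epsilon reaches its floor after only half the intended number of environment steps.

```python
# If the environment-step counter is incremented both inside action selection
# and again in the environment loop, epsilon decays twice as fast as intended.

def epsilon_at(env_steps: int, increments_per_step: int,
               eps_decay_timesteps: int = 50_000,
               eps_start: float = 1.0, eps_min: float = 0.05) -> float:
    counter = env_steps * increments_per_step
    frac = min(counter / eps_decay_timesteps, 1.0)
    return eps_start + frac * (eps_min - eps_start)

print(epsilon_at(25_000, increments_per_step=1))  # 0.525: halfway through the decay
print(epsilon_at(25_000, increments_per_step=2))  # 0.05: already fully decayed
```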
-
og-marl/og_marl/tf2/systems/qmix_cql.py, lines 191 to 192 in 68db0c0. I also find that the CQL loss is not multiplied by its weight here. Is this expected? If not, please fix it.
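Presumably the intended combination weights the regulariser before adding it, roughly like the toy sketch below (variable names and the weight value are illustrative, not the ones in qmix_cql.py).

```python
import tensorflow as tf

# Illustrative only: weight the CQL regulariser before adding it to the TD loss.
td_loss = tf.constant(0.8)
cql_loss = tf.constant(1.5)
cql_weight = 2.0  # placeholder value

loss = td_loss + cql_weight * cql_loss  # rather than td_loss + cql_loss
print(float(loss))  # 3.8
```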
-
I have just merged (#28) in a fix for this and for the
-
Hi @jcformanek, I see you added more benchmark results for datasets from previous works; thanks for this, it is really helpful. If I want to convert OMAR's MPE datasets (other than simple spread), how can I do it? Also, do I need to calculate the normalized score myself? If so, where can I find the expert and random scores for the dataset itself? Thanks so much.
-
I am glad you find it helpful. I'll upload the datasets for the other scenarios; we already converted them. The challenge we faced on those scenarios is that the MPE environment code they used depended on loading a pre-trained (PyTorch) model for the adversaries. If you can properly instantiate the environment for evaluation, then everything should work fine. With regards to normalisation, the CFCQL paper says they normalise in one way, but if you inspect the code you can see they simply normalise by dividing by the mean episode return of the dataset. You need to do the normalisation yourself, yes.
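As a concrete, illustrative sketch of that simple normalisation (assuming you have per-episode returns for both the dataset and your evaluation runs; all numbers below are made up):

```python
import numpy as np

def normalised_score(eval_returns, dataset_returns):
    """Normalise evaluation returns by the dataset's mean episode return.

    This mirrors the simple scheme described above (dividing by the mean
    episode return of the dataset); it is not the D4RL-style
    (score - random) / (expert - random) normalisation.
    """
    dataset_mean = float(np.mean(dataset_returns))
    return float(np.mean(eval_returns)) / dataset_mean

# Toy example with made-up numbers.
dataset_returns = [95.0, 110.0, 100.0, 95.0]   # mean = 100.0
eval_returns = [120.0, 130.0]                  # mean = 125.0
print(normalised_score(eval_returns, dataset_returns))  # 1.25
```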
-
Thanks! I am trying to run the simple spread environment from offline to online. During online training I need the state, but the infos obtained from step are just info_n = [{}, {}, {}]. What is the state for MPE simple spread? I could extract it myself, but I do not know how it is composed. Please help me. Thanks so much!
-
I think I figured it out: the state is just the concatenation of the three agents' observations.
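For illustration, a minimal sketch of building such a state, assuming the per-agent observations come back as a dict of NumPy arrays (the function and agent-id names here are hypothetical, not OG-MARL's API):

```python
import numpy as np

def build_global_state(observations: dict) -> np.ndarray:
    """Concatenate per-agent observations into a single global state vector.

    `observations` maps agent ids to 1-D observation arrays, e.g. the three
    agents in MPE simple spread. Sorting the keys keeps the ordering fixed.
    """
    return np.concatenate([observations[agent] for agent in sorted(observations)])

# Toy example: three agents with 18-dimensional observations (simple spread's
# usual observation size), giving a 54-dimensional state.
obs = {f"agent_{i}": np.random.rand(18).astype(np.float32) for i in range(3)}
state = build_global_state(obs)
print(state.shape)  # (54,)
```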
-
Yes, I think you are correct! Also @callumtilbury is uploading the other MPE vaults now. We will add the download link to the file
-
Hi @zhonghai1995 👋🏻 Here are the MPE vaults from OMAR:
Note that for the
The vault conversion code can be found here: https://bit.ly/vault-conversion-notebook. For OMAR's MPE datasets, see Example 4. Please let us know if you have any further questions or problems! 🚀
-
@callumtilbury This is super helpful for me! Thanks so much!
-
I am going to convert this "issue" into a "discussion" and then we can continue discussing using OG-MARL for offline-to-online MARL. 🚀
-
Hi @jcformanek, what do you think of adding TD(λ) for QMIX and QMIX+CQL? Is this good practice or not? I see pymarl2 (https://github.com/hijkzzz/pymarl2) uses it and there is a performance improvement, but I am not sure whether we should use it. What is your opinion?
-
Hey @zhonghai1995. We did try this, but the limited early results were somewhat lacklustre. That being said, we did not perform a thorough analysis, and it's quite possible that there is value to be gained here, e.g. the MAICQ paper mentions it as a core element of their algorithm. Please keep us in the loop if you do any further investigations 🚀
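For anyone following along, here is a rough standalone sketch of the λ-return target this discussion refers to, computed backwards over one episode. This is illustrative NumPy, not the pymarl2 or OG-MARL implementation, and the toy numbers are made up.

```python
import numpy as np

def td_lambda_targets(rewards, next_q_values, terminals, gamma=0.99, lam=0.8):
    """Compute TD(lambda) targets G_t for one episode, working backwards.

    G_t = r_t + gamma * [ (1 - lam) * Q(s_{t+1}) + lam * G_{t+1} ],
    with the bootstrap dropped on terminal steps.
    """
    T = len(rewards)
    targets = np.zeros(T)
    next_return = 0.0
    for t in reversed(range(T)):
        bootstrap = (1.0 - lam) * next_q_values[t] + lam * next_return
        targets[t] = rewards[t] + gamma * (1.0 - terminals[t]) * bootstrap
        next_return = targets[t]
    return targets

# Toy episode with made-up numbers.
rewards = np.array([0.0, 0.0, 1.0])
next_q = np.array([0.5, 0.7, 0.0])   # Q(s_{t+1}) estimates from the target network
terminals = np.array([0.0, 0.0, 1.0])
print(td_lambda_targets(rewards, next_q, terminals))
```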
-
Hi @jcformanek @callumtilbury, I have a question about the CQL term in QMIX+CQL: in the implementation, why do you choose to minimise the Q-values under the uniform distribution, instead of using the logsumexp as in the original CQL paper? Thanks! og-marl/og_marl/tf2/systems/qmix_cql.py Lines 164 to 185 in e99a480
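For readers of this thread, a small standalone sketch of the two variants being contrasted (illustrative only, not the qmix_cql.py code): the logsumexp form from the CQL paper penalises log Σ_a exp Q(s, a), while the uniform variant penalises the mean Q-value over all actions, i.e. the expectation under a uniform action distribution.

```python
import tensorflow as tf

# q_values: Q(s, a) for every discrete action, shape [batch, num_actions].
# chosen_q: Q(s, a_data) for the actions actually in the dataset, shape [batch].

def cql_logsumexp(q_values, chosen_q):
    # CQL(H)-style regulariser: logsumexp over all actions minus the dataset Q-value.
    return tf.reduce_mean(tf.reduce_logsumexp(q_values, axis=-1) - chosen_q)

def cql_uniform(q_values, chosen_q):
    # Variant discussed above: push down the mean Q-value under a uniform
    # distribution over actions (every action weighted equally) instead of
    # the soft maximum given by logsumexp.
    return tf.reduce_mean(tf.reduce_mean(q_values, axis=-1) - chosen_q)

# Toy batch: 4 states, 5 actions, dataset actions [0, 1, 2, 3].
q = tf.random.normal((4, 5))
actions = tf.constant([[0], [1], [2], [3]])
chosen = tf.squeeze(tf.gather(q, actions, batch_dims=1), axis=-1)
print(float(cql_logsumexp(q, chosen)), float(cql_uniform(q, chosen)))
```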