Performance drop between offline and online #29
-
Thanks so much for your work, I find it very helpful. I am confused by a problem: I trained QMIX+CQL on the SMAC v1 3m scenario, first offline and then online. I commented out the training part in the online training function, so it is actually just evaluating the performance. But I see a big performance discrepancy: offline training achieves an episode return of around 20, but when I simply evaluate it in train_online, the return is much lower, roughly in [2, 4]. I am very confused, and hope you could provide some help or insight into it. Thanks so much! Below are my episode return curves for offline and online. I also paste my online training code and part of my main() function.
def main(_):
-
Hi @zhonghai1995, this is very cool. It looks like you are trying to do some offline-to-online training. Let me take some time to look into why it doesn't seem to be working. I'll get back to you as soon as possible; I am working on it now.
-
I think I know what is going on. The QMIX system (qmix.py) has an argument called eps_decay_timesteps=50_000. This means that the qmix_cql.py system will use epsilon-greedy action selection for the first 50,000 timesteps. That means your system is choosing random actions when it goes online. Try setting that value to zero. I see in qmix_cql.py I did not expose the eps_decay_timesteps argument, so you may want to modify the code a bit so that you can change it. By the way, I have found IDRQN+CQL works better than QMIX+CQL: https://instadeepai.github.io/og-marl/baselines/smac_v1/
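For intuition, here is a minimal sketch of a linearly decaying epsilon-greedy schedule of the kind described above. Only eps_decay_timesteps is taken from the discussion; the function name and the eps_start/eps_min values are illustrative, not the exact OG-MARL API.

```python
# Sketch of a linearly decaying epsilon-greedy schedule.
# With eps_decay_timesteps=50_000, epsilon stays high for a long time,
# so a freshly "resumed" online run acts mostly at random at first.

def get_epsilon(env_step: int, eps_decay_timesteps: int = 50_000,
                eps_start: float = 1.0, eps_min: float = 0.05) -> float:
    if eps_decay_timesteps <= 0:
        return eps_min  # no exploration schedule: act (near-)greedily from the start
    frac = min(env_step / eps_decay_timesteps, 1.0)
    return eps_start + frac * (eps_min - eps_start)

if __name__ == "__main__":
    for step in [0, 10_000, 25_000, 50_000]:
        print(step, round(get_epsilon(step), 3))  # 1.0, 0.81, 0.525, 0.05
```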
-
I see, and it solves the problem. Thanks again!
-
I am glad it's working. We actually also did a research project on offline-to-online MARL which you might find interesting. You can find it here:
-
By the way, did you try to use OMAR for discrete actions? I tried the Gumbel-max trick in the SMAC environments, and the performance is bad.
-
I have also not successfully implemented OMAR for discrete actions. I have seen other people run into the same challenge. See here: thu-rllab/CFCQL#1
-
Hi @jcformanek, I tried OMAR on 2ant in MAMuJoCo with the Good dataset. I used Adam instead of RMSProp and increased the hidden sizes to 256, and the performance seems better; it roughly matches the performance of BC and ITD3+BC in Table D.5 of your paper. Please have a look.
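For reference, the kind of change described here might look as follows in TF2. This is a sketch only; the learning rate, layer sizes and variable names are assumptions, not the OMAR or OG-MARL configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Swap RMSProp for Adam (learning rate is a placeholder value).
# optimizer = tf.keras.optimizers.RMSprop(learning_rate=5e-4)  # previous choice
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4)

# Widen the hidden layers to 256 units.
hidden_size = 256
critic_network = tf.keras.Sequential([
    layers.Dense(hidden_size, activation="relu"),
    layers.Dense(hidden_size, activation="relu"),
    layers.Dense(1),  # Q-value head
])
```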
-
I ran more seeds; the result is now across 20 seeds. It looks like it achieves an average of roughly 1700 mean reward on the Good dataset in the 2ant scenario, worse than the BC-based methods, but still better than reported in the table.
-
Oh that is great, thank you for sharing. We will work on updating all of the benchmark results.
-
og-marl/og_marl/tf2/systems/idrqn.py, lines 115 to 116 in 68db0c0. One more question: why do you increase the env step here? Thanks!
-
That's used to control the epsilon-greedy exploration. It only has an effect if you train online.
-
og-marl/og_marl/tf2/systems/base.py, lines 103 to 127 in 68db0c0. But you also increase the environment step here. And since the default value of the explore argument for the select-actions function is True, you would increase the environment step counter twice for a single environment step. Is this expected?
-
Oh I see. I think you are right! That would result in exploration decreasing 2x faster than I expected. You are welcome to open a PR to fix it if you like. Alternatively, I can attend to it.
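To illustrate the effect numerically (a standalone sketch using the same linear schedule as above; the function and parameter names are made up, not the OG-MARL code): if the counter is incremented twice per real environment step, epsilon reaches its floor after only half the intended number of environment steps.

```python
# If the environment-step counter is incremented both inside action selection
# and again in the environment loop, epsilon decays twice as fast as intended.

def epsilon_at(env_steps: int, increments_per_step: int,
               eps_decay_timesteps: int = 50_000,
               eps_start: float = 1.0, eps_min: float = 0.05) -> float:
    counter = env_steps * increments_per_step
    frac = min(counter / eps_decay_timesteps, 1.0)
    return eps_start + frac * (eps_min - eps_start)

print(epsilon_at(25_000, increments_per_step=1))  # 0.525: halfway through the decay
print(epsilon_at(25_000, increments_per_step=2))  # 0.05: already fully decayed
```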
-
og-marl/og_marl/tf2/systems/qmix_cql.py, lines 191 to 192 in 68db0c0. I also find that the CQL loss is not multiplied by its weight here. Is this expected? If not, please fix it.
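Presumably the intended combination weights the regulariser before adding it, roughly like the toy sketch below (variable names and the weight value are illustrative, not the ones in qmix_cql.py).

```python
import tensorflow as tf

# Illustrative only: weight the CQL regulariser before adding it to the TD loss.
td_loss = tf.constant(0.8)
cql_loss = tf.constant(1.5)
cql_weight = 2.0  # placeholder value

loss = td_loss + cql_weight * cql_loss  # rather than td_loss + cql_loss
print(float(loss))  # 3.8
```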
-
I have just merged (#28) in a fix for this and for the
-
Hi @jcformanek, I see you added more benchmark results for datasets from previous works; thanks for this, it is really helpful. If I want to convert OMAR's MPE datasets (other than simple spread), how can I do it? Also, do I need to calculate the normalized score myself? If so, where can I find the expert and random scores for the dataset itself? Thanks so much.
-
I am glad you find it helpful. I'll upload the datasets for the other scenarios; we already converted them. The challenge we faced on those scenarios is that the MPE environment code they used depended on loading a pre-trained (PyTorch) model for the adversaries. If you can properly instantiate the environment for evaluation, then everything should work fine. With regards to normalisation, the CFCQL paper says they normalise in one way, but if you inspect the code you can see they simply normalise by dividing by the mean episode return of the dataset. You need to do the normalisation yourself, yes.
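As a concrete, illustrative sketch of that simple normalisation (assuming you have per-episode returns for both the dataset and your evaluation runs; all numbers below are made up):

```python
import numpy as np

def normalised_score(eval_returns, dataset_returns):
    """Normalise evaluation returns by the dataset's mean episode return.

    This mirrors the simple scheme described above (dividing by the mean
    episode return of the dataset); it is not the D4RL-style
    (score - random) / (expert - random) normalisation.
    """
    dataset_mean = float(np.mean(dataset_returns))
    return float(np.mean(eval_returns)) / dataset_mean

# Toy example with made-up numbers.
dataset_returns = [95.0, 110.0, 100.0, 95.0]   # mean = 100.0
eval_returns = [120.0, 130.0]                  # mean = 125.0
print(normalised_score(eval_returns, dataset_returns))  # 1.25
```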
-
Thanks! I am trying to run the simple spread environment from offline to online. During online training I need the state, but the infos obtained from step are just info_n = [{}, {}, {}]. What is the state for MPE simple spread? I could extract it myself, but I do not know how it is composed. Please help me. Thanks so much!
-
I think I figured it out: the state is just the concatenation of the three agents' observations.
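For illustration, a minimal sketch of building such a state, assuming the per-agent observations come back as a dict of NumPy arrays (the function and agent-id names here are hypothetical, not OG-MARL's API):

```python
import numpy as np

def build_global_state(observations: dict) -> np.ndarray:
    """Concatenate per-agent observations into a single global state vector.

    `observations` maps agent ids to 1-D observation arrays, e.g. the three
    agents in MPE simple spread. Sorting the keys keeps the ordering fixed.
    """
    return np.concatenate([observations[agent] for agent in sorted(observations)])

# Toy example: three agents with 18-dimensional observations (simple spread's
# usual observation size), giving a 54-dimensional state.
obs = {f"agent_{i}": np.random.rand(18).astype(np.float32) for i in range(3)}
state = build_global_state(obs)
print(state.shape)  # (54,)
```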
-
Yes, I think you are correct! Also @callumtilbury is uploading the other MPE vaults now. We will add the download link to the file
-
Hi @zhonghai1995 👋🏻 Here are the MPE vaults from OMAR:
Note that for the
The vault conversion code can be found here: https://bit.ly/vault-conversion-notebook. For OMAR's MPE datasets, see Example 4. Please let us know if you have any further questions or problems! 🚀
-
@callumtilbury This is super helpful for me! Thanks so much!
-
I am going to convert this "issue" into a "discussion" and then we can continue discussing using OG-MARL for offline-to-online MARL. 🚀
-
Hi @jcformanek, what do you think of adding TD(λ) for QMIX and QMIX+CQL? Is this good practice or not? I see pymarl2 (https://github.com/hijkzzz/pymarl2) uses it and there is a performance improvement, but I am not sure whether we should use it. What is your opinion?
-
Hey @zhonghai1995. We did try this, but the limited early results were somewhat lacklustre. That being said, we did not perform a thorough analysis, and it's quite possible that there is value to be gained here, e.g. the MAICQ paper mentions it as a core element of their algorithm. Please keep us in the loop if you do any further investigations 🚀
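For anyone following along, here is a rough standalone sketch of the λ-return target this discussion refers to, computed backwards over one episode. This is illustrative NumPy, not the pymarl2 or OG-MARL implementation, and the toy numbers are made up.

```python
import numpy as np

def td_lambda_targets(rewards, next_q_values, terminals, gamma=0.99, lam=0.8):
    """Compute TD(lambda) targets G_t for one episode, working backwards.

    G_t = r_t + gamma * [ (1 - lam) * Q(s_{t+1}) + lam * G_{t+1} ],
    with the bootstrap dropped on terminal steps.
    """
    T = len(rewards)
    targets = np.zeros(T)
    next_return = 0.0
    for t in reversed(range(T)):
        bootstrap = (1.0 - lam) * next_q_values[t] + lam * next_return
        targets[t] = rewards[t] + gamma * (1.0 - terminals[t]) * bootstrap
        next_return = targets[t]
    return targets

# Toy episode with made-up numbers.
rewards = np.array([0.0, 0.0, 1.0])
next_q = np.array([0.5, 0.7, 0.0])   # Q(s_{t+1}) estimates from the target network
terminals = np.array([0.0, 0.0, 1.0])
print(td_lambda_targets(rewards, next_q, terminals))
```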
-
Hi @jcformanek @callumtilbury, I have a question about the CQL term in QMIX+CQL: in the implementation, why do you choose to minimise the Q-values under the uniform distribution, instead of using the logsumexp as in the original CQL paper? Thanks! og-marl/og_marl/tf2/systems/qmix_cql.py Lines 164 to 185 in e99a480
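For readers of this thread, a small standalone sketch of the two variants being contrasted (illustrative only, not the qmix_cql.py code): the logsumexp form from the CQL paper penalises log Σ_a exp Q(s, a), while the uniform variant penalises the mean Q-value over all actions, i.e. the expectation under a uniform action distribution.

```python
import tensorflow as tf

# q_values: Q(s, a) for every discrete action, shape [batch, num_actions].
# chosen_q: Q(s, a_data) for the actions actually in the dataset, shape [batch].

def cql_logsumexp(q_values, chosen_q):
    # CQL(H)-style regulariser: logsumexp over all actions minus the dataset Q-value.
    return tf.reduce_mean(tf.reduce_logsumexp(q_values, axis=-1) - chosen_q)

def cql_uniform(q_values, chosen_q):
    # Variant discussed above: push down the mean Q-value under a uniform
    # distribution over actions (every action weighted equally) instead of
    # the soft maximum given by logsumexp.
    return tf.reduce_mean(tf.reduce_mean(q_values, axis=-1) - chosen_q)

# Toy batch: 4 states, 5 actions, dataset actions [0, 1, 2, 3].
q = tf.random.normal((4, 5))
actions = tf.constant([[0], [1], [2], [3]])
chosen = tf.squeeze(tf.gather(q, actions, batch_dims=1), axis=-1)
print(float(cql_logsumexp(q, chosen)), float(cql_uniform(q, chosen)))
```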