
Revert "Fix argument order & renaming" #236

Closed · wants to merge 1 commit

Conversation

EricDinging (Contributor) commented Sep 6, 2023

@fanlai0990 @AmberLJC
This reverts commit 035e0e3.

Why are these changes needed?

The previous fix might be buggy. I found that over the long term (>=20 rounds) the accuracy does not increase with time. I think there may be other bugs besides the one in torch_module_adapter, which did not cause visible issues while coexisting with the bug that was fixed. I will take a look later. Sorry for not checking thoroughly before making the last PR.

Related issue number

#235

Checks

[Screenshot: 2023-09-06 01-08-07]

  • I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
  • I've made sure the following tests are passing.
  • Testing Configurations
    • Dry Run (20 training rounds & 1 evaluation round)
    • Cifar 10 (20 training rounds & 1 evaluation round)
    • Femnist (20 training rounds & 1 evaluation round)

fanlai0990 (Member) commented Sep 6, 2023

I think we need to check whether the model reloaded the updated weights.
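One way to check this is to diff the server model's parameters before and after aggregation. A minimal sketch, with state dicts modeled as plain name-to-array dicts (the helper name `weights_changed` is hypothetical, not part of the FedScale API):

```python
import numpy as np

def weights_changed(state_before, state_after, atol=1e-8):
    """Return True if any parameter differs between two state dicts
    (modeled here as {name: np.ndarray})."""
    assert state_before.keys() == state_after.keys()
    return any(
        not np.allclose(state_before[name], state_after[name], atol=atol)
        for name in state_before
    )

# Example: an aggregation step should produce detectably different weights.
before = {"fc.weight": np.zeros((2, 2), dtype=np.float32)}
after = {"fc.weight": np.full((2, 2), 0.1, dtype=np.float32)}
print(weights_changed(before, before))  # False: weights were not updated
print(weights_changed(before, after))   # True: updated weights are present
```

If this returns False across rounds, the model never reloaded the aggregated weights, which would match the flat accuracy curve.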

EricDinging (Contributor, Author) commented Sep 10, 2023

@fanlai0990 @AmberLJC
While I was using FedScale backends to run experiments in my system (thank you Fan for making my life easier...), I came across an error that was not exposed in previous FedScale runs: TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

The error happens when I try to do an aggregation using FedYogi. I figured out there is a minor bug in torch_module_adapter or optimizers.

https://github.com/SymbioticLab/FedScale/blob/faab2832de4d8e32d39c379cc3cd7999992f8dd3/fedscale/cloud/aggregation/optimizers.py#L46C17-L46C17

Here I think it should be (I'm not an expert in PyTorch):

new_state_dict = {
    name: torch.from_numpy(np.array(last_model[idx].cpu() + diff_weight[idx].cpu(), dtype=np.float32))
    for idx, name in enumerate(target_model.state_dict().keys())
}

I changed the code as above, tested it on my end, and haven't found any errors. I think this is also why my previous FedYogi run using FedScale ended up with a non-increasing time-to-accuracy curve: no actual aggregation was taking place. WDYT?
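For reference, the aggregation being fixed is just an element-wise add of the server update onto the previous weights, keyed by parameter name. A numpy-only sketch of that logic (parameter names and shapes are illustrative; in the actual fix, .cpu() first moves CUDA tensors to host memory so the numpy conversion succeeds):

```python
import numpy as np

# Hypothetical per-parameter weights and deltas, ordered like a state_dict.
last_model = [np.ones((2, 2), dtype=np.float32),
              np.zeros(3, dtype=np.float32)]
diff_weight = [np.full((2, 2), 0.5, dtype=np.float32),
               np.array([1.0, 2.0, 3.0], dtype=np.float32)]
param_names = ["fc.weight", "fc.bias"]  # illustrative names only

# Element-wise add of the aggregated update onto the previous weights.
new_state_dict = {
    name: np.asarray(last_model[idx] + diff_weight[idx], dtype=np.float32)
    for idx, name in enumerate(param_names)
}
print(new_state_dict["fc.weight"][0, 0])  # 1.5
```

Without the add actually taking effect, the server keeps serving the old weights each round, which is consistent with the flat accuracy plot.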

AmberLJC (Member) commented

I think it makes sense. And target_model itself is on the CPU, right?

fanlai0990 (Member) commented

1. Try decreasing the server-side learning rate for YoGi. 2. Try changing the experiment model.
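For context, the server-side learning rate here is the eta in the FedYogi server update from Reddi et al., "Adaptive Federated Optimization". A minimal sketch of one server step (hyperparameter names and defaults are illustrative, not FedScale's actual configuration):

```python
import numpy as np

def fedyogi_update(x, delta, m, v, eta=0.01, beta1=0.9, beta2=0.99, tau=1e-3):
    """One server-side FedYogi step.

    x: current server weights, delta: averaged client update,
    m/v: first and second moment accumulators.
    """
    m = beta1 * m + (1 - beta1) * delta
    # Yogi-style second moment: moves toward delta**2 only when v overshoots.
    v = v - (1 - beta2) * delta**2 * np.sign(v - delta**2)
    x = x + eta * m / (np.sqrt(v) + tau)
    return x, m, v

x = np.zeros(3)
m = np.zeros(3)
v = np.full(3, 1e-6)
delta = np.array([0.1, -0.2, 0.3])
x, m, v = fedyogi_update(x, delta, m, v, eta=0.01)
```

Lowering eta shrinks each server step, which can stabilize training when aggregation is otherwise correct but diverging.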
