
Revert "Fix argument order & renaming" #236

Closed · wants to merge 1 commit

Conversation

EricDinging (Contributor) commented Sep 6, 2023

@fanlai0990 @AmberLJC
This reverts commit 035e0e3.

Why are these changes needed?

The previous fix might be buggy. I found that over the long term (>=20 rounds) the accuracy does not increase with time. I think there may be other bugs besides the one in torch_module_adapter, which did not cause visible issues while coexisting with the bug that was fixed. I will take a look later. Sorry for not checking thoroughly before making the last PR.

Related issue number

#235

Checks

[Screenshot: 2023-09-06 01-08-07]

  • I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
  • I've made sure the following tests are passing.
  • Testing Configurations
    • Dry Run (20 training rounds & 1 evaluation round)
    • Cifar 10 (20 training rounds & 1 evaluation round)
    • Femnist (20 training rounds & 1 evaluation round)

fanlai0990 (Member) commented Sep 6, 2023

I think we need to check whether the model reloaded the updated weights.
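One way to check this is to diff the server model's parameters before and after aggregation. A minimal sketch, with state dicts modeled as plain name-to-array dicts (the helper name `weights_changed` is hypothetical, not part of the FedScale API):

```python
import numpy as np

def weights_changed(state_before, state_after, atol=1e-8):
    """Return True if any parameter differs between two state dicts
    (modeled here as {name: np.ndarray})."""
    assert state_before.keys() == state_after.keys()
    return any(
        not np.allclose(state_before[name], state_after[name], atol=atol)
        for name in state_before
    )

# Example: an aggregation step should produce detectably different weights.
before = {"fc.weight": np.zeros((2, 2), dtype=np.float32)}
after = {"fc.weight": np.full((2, 2), 0.1, dtype=np.float32)}
print(weights_changed(before, before))  # False: weights were not updated
print(weights_changed(before, after))   # True: updated weights are present
```

If this returns False across rounds, the model never reloaded the aggregated weights, which would match the flat accuracy curve.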

EricDinging (Contributor, Author) commented Sep 10, 2023

@fanlai0990 @AmberLJC
While I was using FedScale backends to run experiments in my system (thank you Fan for making my life easier...), I came across an error that was not exposed in previous FedScale runs: TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

The error happens when I try to do an aggregation using FedYogi. I figured out there is a minor bug in torch_module_adapter or optimizers.

https://github.com/SymbioticLab/FedScale/blob/faab2832de4d8e32d39c379cc3cd7999992f8dd3/fedscale/cloud/aggregation/optimizers.py#L46C17-L46C17

Here I think it should be (I'm not an expert in PyTorch):

new_state_dict = {
    name: torch.from_numpy(np.array(last_model[idx].cpu() + diff_weight[idx].cpu(), dtype=np.float32))
    for idx, name in enumerate(target_model.state_dict().keys())
}

I changed the code as above, tested it on my end, and haven't found any errors. I think this is also why my previous FedYogi run using FedScale ended up with a non-increasing time-to-accuracy curve: no actual aggregation was taking place. WDYT?
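For reference, the aggregation being fixed is just an element-wise add of the server update onto the previous weights, keyed by parameter name. A numpy-only sketch of that logic (parameter names and shapes are illustrative; in the actual fix, .cpu() first moves CUDA tensors to host memory so the numpy conversion succeeds):

```python
import numpy as np

# Hypothetical per-parameter weights and deltas, ordered like a state_dict.
last_model = [np.ones((2, 2), dtype=np.float32),
              np.zeros(3, dtype=np.float32)]
diff_weight = [np.full((2, 2), 0.5, dtype=np.float32),
               np.array([1.0, 2.0, 3.0], dtype=np.float32)]
param_names = ["fc.weight", "fc.bias"]  # illustrative names only

# Element-wise add of the aggregated update onto the previous weights.
new_state_dict = {
    name: np.asarray(last_model[idx] + diff_weight[idx], dtype=np.float32)
    for idx, name in enumerate(param_names)
}
print(new_state_dict["fc.weight"][0, 0])  # 1.5
```

Without the add actually taking effect, the server keeps serving the old weights each round, which is consistent with the flat accuracy plot.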

AmberLJC (Member) commented

I think it makes sense. And target_model itself is on the CPU, right?

fanlai0990 (Member) commented

1. Try decreasing the server-side learning rate for YoGi. 2. Try changing the experiment model.
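For context, the server-side learning rate here is the eta in the FedYogi server update from Reddi et al., "Adaptive Federated Optimization". A minimal sketch of one server step (hyperparameter names and defaults are illustrative, not FedScale's actual configuration):

```python
import numpy as np

def fedyogi_update(x, delta, m, v, eta=0.01, beta1=0.9, beta2=0.99, tau=1e-3):
    """One server-side FedYogi step.

    x: current server weights, delta: averaged client update,
    m/v: first and second moment accumulators.
    """
    m = beta1 * m + (1 - beta1) * delta
    # Yogi-style second moment: moves toward delta**2 only when v overshoots.
    v = v - (1 - beta2) * delta**2 * np.sign(v - delta**2)
    x = x + eta * m / (np.sqrt(v) + tau)
    return x, m, v

x = np.zeros(3)
m = np.zeros(3)
v = np.full(3, 1e-6)
delta = np.array([0.1, -0.2, 0.3])
x, m, v = fedyogi_update(x, delta, m, v, eta=0.01)
```

Lowering eta shrinks each server step, which can stabilize training when aggregation is otherwise correct but diverging.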
