Can't reproduce #1

Open
THU-syh opened this issue Jul 13, 2022 · 7 comments

THU-syh commented Jul 13, 2022

Sorry, I'm having some problems reproducing your work. Following the README and the code here, I can't get the same results as in your paper (https://doi.org/10.1016/j.patter.2022.100521).

Collaborator

zwvews commented Jul 13, 2022

Hi, thanks for your interest. Could you please provide your experimental results? Just FYI, you need to tune the hyperparameters following the experimental setup described in our paper.

Author

THU-syh commented Jul 19, 2022

Thanks for your reply. I just ran the sample command given in the README file:
python main.py -dataset esol -fedmid avg -part_alpha 0.1
but the result is as follows:
image

Surprisingly, this result is better than the one you gave in the article (even better than the FLIT(+) results, which the article marks as the best federated-learning results).
image
image

However, when we tried FLIT/FLIT+,
python main.py -dataset esol -fedmid oursvatFLITPLUS -tmpFed 0.5 -lambdavat 0.01 -part_alpha 0.1
we got worse results than FedAvg
image

Collaborator

zwvews commented Jul 19, 2022

As I mentioned, you need to tune the hyperparameters for FLIT(+) following our paper. We did not find a single set of hyperparameters that fits all datasets. However, it is weird to see FedAvg perform this well. I will check our experiments and get back to you soon.
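Since no single setting fits all datasets, a small sweep over the FLIT(+) hyperparameters is one way to approach the tuning. The following is a minimal sketch that only builds and prints command lines; it reuses the flags from the command quoted earlier in this thread, while the helper name `build_sweep` and the grid values are hypothetical and are not the values from the paper.

```python
import itertools
import shlex


def build_sweep(dataset, part_alpha, tmp_fed_values, lambda_vat_values,
                fedmid="oursvatFLITPLUS"):
    """Return one main.py command line per (tmpFed, lambdavat) pair.

    The flag names are copied from the command quoted in this thread;
    this helper itself is hypothetical and not part of the repository.
    """
    commands = []
    for tmp_fed, lambda_vat in itertools.product(tmp_fed_values,
                                                 lambda_vat_values):
        args = [
            "python", "main.py",
            "-dataset", dataset,
            "-fedmid", fedmid,
            "-tmpFed", str(tmp_fed),
            "-lambdavat", str(lambda_vat),
            "-part_alpha", str(part_alpha),
        ]
        commands.append(shlex.join(args))
    return commands


# Hypothetical grid for illustration; the search ranges actually used
# are described in the paper and are not reproduced here.
commands = build_sweep("esol", 0.1, [0.1, 0.5, 1.0], [0.001, 0.01, 0.1])
for cmd in commands:
    print(cmd)
```

Printing the commands rather than executing them keeps the sketch free of side effects; each line can then be run by hand or handed to a job scheduler.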

Collaborator

zwvews commented Jul 21, 2022

Hi, I have checked our previous experimental results and also re-run the experiments.
First, I did obtain the reported results for FedAvg on the ESOL dataset, as shown below.
image
I also admit that I cannot reproduce those results with our current code for this dataset. However, I should note that ESOL is extremely small and its training/testing performance is quite unstable. I suggest trying our code on larger datasets, e.g. Lipo. Anyway, thanks very much for pointing out the problem, and let me know if you have any other questions.
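When results on a small dataset are unstable, a common practice is to repeat the run several times and compare the mean and standard deviation rather than a single number. Below is a minimal sketch of that aggregation step; the function name `summarize_runs` is hypothetical, and the list of scores is a placeholder, not a measured result.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-run test scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder values for illustration only, not experimental results.
rmse_per_run = [1.0, 2.0, 3.0]
mean_rmse, std_rmse = summarize_runs(rmse_per_run)
print(f"RMSE: {mean_rmse:.3f} +/- {std_rmse:.3f} over {len(rmse_per_run)} runs")
```

If the spread across runs is comparable to the gap between two methods, a single run is not enough to rank them.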

Author

THU-syh commented Jul 22, 2022

Thanks for your prompt response. Following your suggestion, I also ran the FedAvg code on other datasets, but got surprising results on some of them, as follows.

  1. Freesolv: As the degree of data heterogeneity increases, the test metric on the Freesolv dataset also increases. Note that for this dataset, lower is better.

image

Note: This problem also occurs on the Lipo dataset (lower is better) and the SIDER dataset (higher is better).

image

image

  2. ClinTox: As with ESOL, the FedAvg results on this dataset significantly outperform the state-of-the-art results reported in the paper for all methods.

image

Collaborator

zwvews commented Jul 22, 2022

As for problem 1, we state in our paper that our current heterogeneity simulation method is not perfect and may not actually produce heterogeneous datasets. We discuss this in the main results section and in the conclusions section. More research is needed in this direction.
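One way to check whether a simulated partition is actually heterogeneous is to compare simple statistics of the targets held by each client. The sketch below is generic and does not reflect how the repository partitions data; the function names and the `client_targets` mapping (client id to a list of regression targets) are assumptions made for illustration.

```python
import statistics


def client_label_summary(client_targets):
    """Map each client id to (number of samples, mean target value)."""
    return {
        client_id: (len(targets), statistics.mean(targets))
        for client_id, targets in client_targets.items()
    }


def between_client_spread(client_targets):
    """Population standard deviation of the per-client target means.

    A value near zero suggests the clients hold similar label
    distributions, i.e. the partition is close to homogeneous.
    """
    means = [statistics.mean(t) for t in client_targets.values()]
    return statistics.pstdev(means)


# Toy partition for illustration only, not data from any benchmark.
toy_partition = {"client_a": [1.0, 1.0], "client_b": [3.0, 3.0]}
print(client_label_summary(toy_partition))
print(between_client_spread(toy_partition))
```

Comparing this spread across different `-part_alpha` settings would indicate whether changing the parameter really changes how different the clients are.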

As for problem 2, I believe there may be some small differences between our current code and the version we used when we ran the experiments. I am really sorry for this. Our results on these two datasets are consistent across all methods, so I believe they can still serve as a reference for comparison. I am also pasting the experimental records for FedAvg on ClinTox here.
image

Author

THU-syh commented Jul 22, 2022

Thanks for your reply. In view of the current problems, I suggest that you carefully check the current open-source code for errors and publish a corrected version. If the current FedAvg code has no errors, then the paper evidently did not use the optimal FedAvg baseline. I also recommend that you re-run the relevant FLIT(+) experiments (especially on the ESOL and ClinTox datasets) to ensure that the conclusions in the paper are correct.
