Results for Freeze Modules are higher than the ones reported (CIFAR-100 20 tasks) #1

Open
AlbinSou opened this issue Oct 1, 2021 · 3 comments

@AlbinSou

AlbinSou commented Oct 1, 2021

After running this command:

python lifelong_experiment.py -d CIFAR -T 20 -e 100 -alg fm_compositional --initial_seed 0 -l 4 -arc cnn -b 64 -i random_onehot -n 1 --init_tasks 4 -l 4 -s 50

I believe the only differences from the original setup are that I set dropout to 0.2 and batch size to 64, but these should not change the outcome this much. The reported result for that baseline on CIFAR is 48%. However, when I run the command above to reproduce it, I get an average accuracy of 71%.

I also obtain similar results in my own code, while trying to reproduce the experiments.

Did I write the wrong command to reproduce these results?

@jorge-a-mendez
Collaborator

Hi, thank you for your interest in our work and for raising this issue.

The command as written looks okay to me. I implemented the dropout change and ran the experiment with the command you provided, and found the average accuracy on a single random seed to be 49.78%.

One note on interpreting results: the average accuracy is computed from the log file of the last task trained (task 19), by averaging the performance on all tasks (tasks 0 through 19) in the last epoch (epoch 100). I bring this up because when I ran the evaluation, the final accuracy on the last task alone was 71.8%, but the number we care about is the average across all tasks after all tasks have been trained.
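To make the distinction concrete, here is a minimal sketch of the two numbers, assuming the per-task accuracies from the task 19 log have already been parsed into a dictionary. The `acc_by_task` structure and both function names are hypothetical and not part of the repository, and the toy values are illustrative only.

```python
from statistics import mean


def final_average_accuracy(acc_by_task):
    """Average the last-epoch accuracy over every task.

    acc_by_task maps a task id to its list of per-epoch accuracies,
    as evaluated while training the final task (hypothetical format).
    """
    return mean(history[-1] for history in acc_by_task.values())


def last_task_accuracy(acc_by_task):
    """Last-epoch accuracy of the most recently trained task only."""
    last_task = max(acc_by_task)
    return acc_by_task[last_task][-1]


# Toy values for three tasks and two epochs (not real results).
toy = {0: [30.0, 40.0], 1: [35.0, 50.0], 2: [70.0, 90.0]}

average = final_average_accuracy(toy)   # mean of 40.0, 50.0, 90.0
last_only = last_task_accuracy(toy)     # accuracy of task 2 alone
print(average, last_only)
```

In this toy case the last task alone scores well above the average, which is the kind of gap that can appear when earlier tasks perform worse than the one trained most recently.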

I hope this helps.

@AlbinSou
Author

AlbinSou commented Oct 15, 2021

Hello, thank you for your answer.

The measure I took is the mean of the accuracies over all the individual tasks, after training on all the tasks (the values in the files of the task19/ folder, in the last epoch). I believe my computation is correct on this point.

One more change that I had to make, and only now remember: I changed the load_data method for the SplitCIFAR100 dataset so that it loads data through the torchvision CIFAR100 dataset instead of reading it from directories that I did not have. Could this have an effect on the results?

@jorge-a-mendez
Collaborator

Hi,

The computation of accuracies sounds correct.

It is possible that this small change makes a difference. Could you try downloading the raw data directly from the repo here and re-running your experiments? This might matter if torchvision applies any sort of transformation to the images.
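One quick way to test for a preprocessing mismatch is to compare basic pixel statistics of the two data sources before training. The sketch below is a minimal example under assumptions: `pixel_summary` and `summaries_match` are hypothetical helpers, not part of the repository or of torchvision, and the toy inputs stand in for real images. It only illustrates a rescaling difference, such as the one introduced by `torchvision.transforms.ToTensor`, which maps uint8 pixels in [0, 255] to floats in [0.0, 1.0].

```python
def flatten(values):
    """Yield every scalar from an arbitrarily nested list or tuple."""
    for value in values:
        if isinstance(value, (list, tuple)):
            yield from flatten(value)
        else:
            yield value


def pixel_summary(images):
    """Return (min, max, mean) over all pixel values in `images`."""
    flat = list(flatten(images))
    return min(flat), max(flat), sum(flat) / len(flat)


def summaries_match(images_a, images_b, tol=1e-6):
    """True when both sources agree on min, max, and mean within `tol`."""
    return all(
        abs(a - b) <= tol
        for a, b in zip(pixel_summary(images_a), pixel_summary(images_b))
    )


# Toy stand-ins: one 1x3 "image" stored as raw byte values, and the same
# image rescaled to [0, 1]. Real data would come from the two loaders.
raw = [[[0, 128, 255]]]
scaled = [[[v / 255 for v in row] for row in image] for image in raw]

print(pixel_summary(raw))
print(pixel_summary(scaled))
print(summaries_match(raw, scaled))
```

With NumPy arrays, calling `.tolist()` on a small batch is enough to feed these helpers. Matching summaries do not prove the two pipelines are identical (the class-to-task split could still differ, for example), but a mismatch points directly at a scaling or normalization difference.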
