I noticed something a little odd while evaluating an ensemble using baselines/cifar/ensemble.py: evaluation seems to be performed only on the test set rounded down to a multiple of the batch size, rather than on the full set. I noticed this because the numpy arrays that store the predictions have shape (9984, 10) (that script has an effective batch size of 64, and 64 * 156 = 9984).
I believe this may be the case in the other training/eval scripts as well; as I read it, the test iterator is only called for the first TEST_IMAGES // BATCH_SIZE batches, leaving a partial batch behind whenever the batch size doesn't evenly divide the test set size (see the sketch below).
Please let me know if I'm mistaken about this. If it is accurate, do the reported results need to be re-evaluated? If they were run with the current default effective batch size of 512, I believe 272 test examples out of 10000 were missed.
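For concreteness, here is a minimal sketch of the looping pattern I'm describing (the names test_dataset, TEST_IMAGES, and BATCH_SIZE are illustrative stand-ins, not necessarily the exact identifiers in the repo): iterating only TEST_IMAGES // BATCH_SIZE times silently skips the final partial batch.

```python
import numpy as np
import tensorflow as tf

TEST_IMAGES = 10000
BATCH_SIZE = 64

# Stand-in for the CIFAR-10 test split: 10000 examples.
images = np.random.rand(TEST_IMAGES, 32, 32, 3).astype(np.float32)
test_dataset = tf.data.Dataset.from_tensor_slices(images).batch(BATCH_SIZE)

test_iterator = iter(test_dataset)
steps = TEST_IMAGES // BATCH_SIZE  # 156, covering only 156 * 64 = 9984 examples

probs = []
for _ in range(steps):
    batch = next(test_iterator)
    # Dummy "predictions" with the right shape; a real eval would call the model here.
    probs.append(np.full((batch.shape[0], 10), 0.1, dtype=np.float32))

probs = np.concatenate(probs, axis=0)
print(probs.shape)  # (9984, 10): the final 16 examples were never evaluated
```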
Yeah, this is an unfortunate issue. The papers that were used to implement this codebase all dropped the last partial batch, so that convention was kept. That said, we would like to pad the last partial batch so that evals cover the full test set in the near future (we mostly run on TPUs internally, and they require a fixed batch size, which makes handling the final partial batch nontrivial).
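A rough sketch of what padded evaluation could look like, assuming a tf.data pipeline (the helper pad_to_batch and all names here are hypothetical, not part of the codebase): pad the last partial batch up to the fixed batch size, carry a per-example validity mask, and pass it as sample_weight so padded rows don't contribute to the metrics.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 512
NUM_CLASSES = 10

def pad_to_batch(images, labels):
    """Pads a (possibly partial) batch to BATCH_SIZE and returns a validity mask."""
    num_real = tf.shape(images)[0]
    pad = BATCH_SIZE - num_real
    images = tf.pad(images, [[0, pad], [0, 0], [0, 0], [0, 0]])
    labels = tf.pad(labels, [[0, pad]])
    # 1.0 for real examples, 0.0 for padding.
    mask = tf.concat([tf.ones([num_real]), tf.zeros([pad])], axis=0)
    return images, labels, mask

# Toy test split: 10000 examples and 10 classes, like CIFAR-10.
images = np.random.rand(10000, 32, 32, 3).astype(np.float32)
labels = np.random.randint(0, NUM_CLASSES, size=10000).astype(np.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(BATCH_SIZE)

accuracy = tf.keras.metrics.Accuracy()
for batch_images, batch_labels, batch_mask in dataset.map(pad_to_batch):
    # Dummy predictions; a real eval would run the model on batch_images here.
    preds = tf.random.uniform([BATCH_SIZE, NUM_CLASSES])
    accuracy.update_state(
        batch_labels, tf.argmax(preds, axis=-1), sample_weight=batch_mask)
print(accuracy.result().numpy())
```

With the default effective batch size of 512 and 10000 test examples, the last batch would then hold the 272 remaining real examples plus 240 padded rows, all of which would be masked out of the metrics.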