I noticed something a little odd while evaluating an ensemble using baselines/cifar/ensemble.py: evaluation seems to be performed only on the test set rounded down to a multiple of the batch size, rather than on the full set. I noticed this because the numpy arrays that store the predictions have shape (9984, 10) (that script has an effective batch size of 64, and 64 * 156 = 9984).
I believe this may be the case in the other training/eval scripts as well; as I read it, the test iterator is only called for the first TEST_IMAGES // BATCH_SIZE batches, leaving a partial batch behind whenever the batch size doesn't evenly divide the test set size (see the sketch below).
Please let me know if I'm mistaken about this. If it is accurate, do the reported results need to be re-evaluated? If they were run with the current default effective batch size of 512, I believe 272 test examples out of 10000 were missed.
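For concreteness, here is a minimal sketch of the looping pattern I'm describing (the names test_dataset, TEST_IMAGES, and BATCH_SIZE are illustrative stand-ins, not necessarily the exact identifiers in the repo): iterating only TEST_IMAGES // BATCH_SIZE times silently skips the final partial batch.

```python
import numpy as np
import tensorflow as tf

TEST_IMAGES = 10000
BATCH_SIZE = 64

# Stand-in for the CIFAR-10 test split: 10000 examples.
images = np.random.rand(TEST_IMAGES, 32, 32, 3).astype(np.float32)
test_dataset = tf.data.Dataset.from_tensor_slices(images).batch(BATCH_SIZE)

test_iterator = iter(test_dataset)
steps = TEST_IMAGES // BATCH_SIZE  # 156, covering only 156 * 64 = 9984 examples

probs = []
for _ in range(steps):
    batch = next(test_iterator)
    # Dummy "predictions" with the right shape; a real eval would call the model here.
    probs.append(np.full((batch.shape[0], 10), 0.1, dtype=np.float32))

probs = np.concatenate(probs, axis=0)
print(probs.shape)  # (9984, 10): the final 16 examples were never evaluated
```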
Yeah, this is an unfortunate issue. The papers that were used to implement this codebase all dropped the last partial batch, so that convention was kept. That said, we would like to pad the last partial batch so that evals cover the full test set in the near future (we mostly run on TPUs internally, and they require a fixed batch size, which makes handling the final partial batch nontrivial).
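A rough sketch of what padded evaluation could look like, assuming a tf.data pipeline (the helper pad_to_batch and all names here are hypothetical, not part of the codebase): pad the last partial batch up to the fixed batch size, carry a per-example validity mask, and pass it as sample_weight so padded rows don't contribute to the metrics.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 512
NUM_CLASSES = 10

def pad_to_batch(images, labels):
    """Pads a (possibly partial) batch to BATCH_SIZE and returns a validity mask."""
    num_real = tf.shape(images)[0]
    pad = BATCH_SIZE - num_real
    images = tf.pad(images, [[0, pad], [0, 0], [0, 0], [0, 0]])
    labels = tf.pad(labels, [[0, pad]])
    # 1.0 for real examples, 0.0 for padding.
    mask = tf.concat([tf.ones([num_real]), tf.zeros([pad])], axis=0)
    return images, labels, mask

# Toy test split: 10000 examples and 10 classes, like CIFAR-10.
images = np.random.rand(10000, 32, 32, 3).astype(np.float32)
labels = np.random.randint(0, NUM_CLASSES, size=10000).astype(np.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(BATCH_SIZE)

accuracy = tf.keras.metrics.Accuracy()
for batch_images, batch_labels, batch_mask in dataset.map(pad_to_batch):
    # Dummy predictions; a real eval would run the model on batch_images here.
    preds = tf.random.uniform([BATCH_SIZE, NUM_CLASSES])
    accuracy.update_state(
        batch_labels, tf.argmax(preds, axis=-1), sample_weight=batch_mask)
print(accuracy.result().numpy())
```

With the default effective batch size of 512 and 10000 test examples, the last batch would then hold the 272 remaining real examples plus 240 padded rows, all of which would be masked out of the metrics.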