There are studies that claim adaptive overfitting is not a problem on ImageNet. My initial intuition was that, since the same test set has been used for many years, there was ample opportunity for overfitting, so it must have occurred. But, alas, it doesn't seem to have happened - I find that quite puzzling. Do you have any thoughts?
See for example: Recht, B., Roelofs, R., Schmidt, L., Shankar, V. (2019). Do ImageNet Classifiers Generalize to ImageNet? https://proceedings.mlr.press/v97/recht19a.html
Thanks for the question, Nico.
I've come across that paper before, and it's an interesting study, since the hypothesis that adaptive overfitting doesn't occur goes against statistical wisdom. However, I would be wary of accepting their claims at face value, or at least be cautious about their generality. For instance, although they ultimately argue that adaptive overfitting didn't occur, they did find that accuracy decreased significantly when a group of models was reevaluated on new data collected in the same way as CIFAR-10 and ImageNet. They attribute this to a change in data distribution rather than overfitting, though it seems to me that there's real uncertainty here. I suspect model diversity and model selection could also be significant factors: they only considered a small number of generally well-known CNN models, and given that a number of these were developed by Google etc., it seems likely that their architectures were not designed just to do well on CIFAR-10 and/or ImageNet, which would also explain a lack of overfitting.
Anyway, I'd be interested to hear of any other similar studies you've come across.
Thanks for your take. I guess my rationalist leanings make me want to resist their conclusion too.
Regarding follow-ups, there is a paper that did some statistical modelling around the generation of Recht et al.'s new test set. After their reanalysis, they conclude that the accuracy gap drops from ~11% to ~4%. I guess this means there would be even less overfitting to the test set? But they remain silent on what causes the remaining gap. See: Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Steinhardt, J., Madry, A. (2020). Identifying Statistical Bias in Dataset Replication. https://proceedings.mlr.press/v119/engstrom20a.html
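As far as I understand their argument (this is just my toy paraphrase in Python, with made-up numbers, not their actual model), the bias comes from filtering candidate images on a noisy estimate of a per-image statistic such as selection frequency: images that clear the cutoff tend to be ones whose estimate overshot their true value, so the replicated set ends up systematically harder than intended.

```python
# Toy sketch of the kind of bias Engstrom et al. analyse: filtering on a
# *noisy* estimate of selection frequency skews the selected pool relative
# to filtering on the true (unobservable) frequency. All numbers invented.
import numpy as np

rng = np.random.default_rng(0)
n_images, n_annotators, cutoff = 50_000, 10, 0.7

true_freq = rng.beta(5, 2, size=n_images)        # per-image true selection frequency
votes = rng.binomial(n_annotators, true_freq)    # annotator votes per image
est_freq = votes / n_annotators                  # noisy estimate used for filtering

noisy_pool = est_freq >= cutoff                  # what a replication pipeline can do
ideal_pool = true_freq >= cutoff                 # what it would like to do

print("mean true frequency, filtered on noisy estimate:", true_freq[noisy_pool].mean())
print("mean true frequency, filtered on true frequency:", true_freq[ideal_pool].mean())
# The first mean is lower: the noisy filter admits images whose estimate
# overshot their true frequency, i.e. the replicated set is "harder".
```

Correcting for this kind of selection bias is, as I understand it, how they shrink the estimated gap.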
In the end there are two perhaps conflicting observations by Recht et al.:
1) evaluating the models on the new test set drops absolute accuracy
2) the "accuracy order" of models is preserved on the new test set.
Observation 1) would indicate overfitting to the test set. Observation 2) is interpreted by them to mean that there are real improvements not attributable to overfitting (under the assumption that later models would have experienced more overfitting). They explicitly caveat against constant overfitting across all models: "...it could be that any test set adaptivity leads to a roughly constant drop in accuracy. Then all models are affected equally and we would see no diminishing returns since later models could still be better."
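To make the two observations concrete, here is how I would quantify them in Python, with invented accuracy numbers (not the values reported by Recht et al.): the mean accuracy drop for 1) and a rank correlation for 2).

```python
# Illustrative only: the accuracies below are made up.
import numpy as np
from scipy.stats import spearmanr

acc_original = np.array([0.57, 0.72, 0.76, 0.77, 0.83])  # original test set
acc_new      = np.array([0.44, 0.60, 0.65, 0.66, 0.72])  # new test set

# Observation 1): absolute accuracy drops on the new test set.
print("mean accuracy drop:", (acc_original - acc_new).mean())

# Observation 2): the ranking of models is (nearly) preserved,
# i.e. the rank correlation between the two columns is close to 1.
rho, _ = spearmanr(acc_original, acc_new)
print("Spearman rank correlation:", rho)
```

A large drop together with a rank correlation near 1 is exactly the combination that makes 1) and 2) feel like they pull in different directions.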
That later models would suffer from more overfitting seems like a reasonable assumption to me, though, at least if we consider the standard mechanism of adaptive overfitting, where information leaks from one model generation to the next through the optimization of hyperparameters on the same test set.
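This leakage is easy to simulate. The sketch below is a toy setup of my own (nothing from the papers): among many equally good "models", picking the winner on one fixed test set inflates its score on that set, while its score on a fresh set falls back to the true accuracy, which is the adaptive-overfitting gap in miniature.

```python
# Toy simulation of adaptive overfitting through selection on a fixed test set.
import numpy as np

rng = np.random.default_rng(0)
n_test, n_models, true_acc = 2_000, 200, 0.70

# Each "model" is just a Bernoulli(true_acc) predictor: a correct/incorrect
# outcome per test example. All models have identical true accuracy.
fixed_set = rng.random((n_models, n_test)) < true_acc   # reused test set
fresh_set = rng.random((n_models, n_test)) < true_acc   # newly collected test set

fixed_scores = fixed_set.mean(axis=1)
best = fixed_scores.argmax()          # "hyperparameter search" against the fixed set

print("winner on the fixed test set:", fixed_scores[best])      # inflated above 0.70
print("same model on a fresh set   :", fresh_set[best].mean())  # back near 0.70
```

Chaining this selection step across model generations is how I picture the leak accumulating.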
So yes, in the end I'm confused.
P.S.: arXiv:2006.07159 might be relevant too, although I'm not completely sure I follow their methodology.