A bit of a rant today.
Peer review is supposedly the process that ensures quality in science. The idea is that authors submit their paper to a journal or conference. The journal editor or programme chair then identifies a group of experts in the area, and sends it off to them for comment. The editor or chair then uses these comments to decide whether to publish the paper, or, in the case of a journal, whether to ask the authors to address any issues and resubmit for a further round of reviews.
Or at least that’s the idea. In practice there are some quite serious problems with peer review, and I’d say this is particularly the case within ML.
I’m a journal editor, a reviewer and an author, so I’ve seen the system from all angles. As a journal editor, I’ve noticed it’s increasingly difficult to find competent reviewers who are willing to review papers. Sometimes it feels like an impossible task, and it’s a real relief when someone finally agrees. As a reviewer, I struggle to find time to do reviews, so I understand why people are unwilling to take on this unpaid and often time-consuming role. As an author, I’ve seen the quality of reviews drop dramatically, and the time to get decisions from journals increase to troubling levels.
A big part of the problem [1] is that those who have the most experience in a field also have the least time. This means that reviewing tasks are often picked up by those early in their careers, who inevitably have a less holistic view. And this is particularly evident in the area of AI and ML, where the newest generation of researchers are — understandably — knee-deep in deep learning, and sometimes unaware of the merits of simpler, more time-tested approaches. Which means that AI and ML are in danger of disappearing up their own, erm, output layer.
The reason I’m particularly grumpy about peer review at the moment is because I [2] just had a paper rejected. This was unexpected, because the first-round reviews were favourable. But then, the editor sent it to a different reviewer in the second round [3]. This reviewer slated it, largely on the grounds that the paper had limited coverage of the most recent deep learning methods. Which would be a fair comment, except (1) the paper is largely about how a problem can be solved well without needing to use deep learning models, and (2) there was relatively little deep learning work in this problem domain two years ago when we submitted the paper. So, we were stymied both by the unreasonable length of the review process and by an increasingly myopic focus on deep learning within applied ML. Oh the injustice!
To misquote Casablanca, perhaps the suffering of academics doesn’t amount to a hill of beans in this crazy world. But I think this particular hill of beans is part of a broader leguminous landscape resulting from poor peer review in ML. As you’ll know if you’ve read my ramblings, there’s a big issue of quality in ML. That is, lots of people who do it don’t know what they’re doing. These people then publish papers, which often contain fundamental errors, and because they’ve published papers, they’re then asked to review other papers.
And this is a particular problem in deep learning, where the opportunity to overfit data through errors in the ML pipeline is huge [4]. This means that there are lots of papers out there saying that deep learning does an awesome job, but scratch the surface, and you might find all sorts of data leaks that make the resulting models meaningless. Yes, they do an awesome job on the data they were developed on, but many would probably fail if they were reevaluated on previously unseen data. And then the people who did this work come along and review a paper that says that deep learning is not the best tool for a job. And of course they slate it.
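To make concrete the kind of pipeline error I mean, here’s a minimal, hypothetical sketch using scikit-learn; the synthetic data and choice of model are placeholders for illustration only, not taken from any particular paper. It shows one of the most common leaks: fitting a preprocessing step on the whole dataset before splitting, so that statistics from the test rows quietly leak into the training data.

```python
# A minimal sketch of one common data leak: fitting a scaler on the full
# dataset before splitting, so test-set statistics leak into training.
# The data and model here are purely illustrative placeholders.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Leaky version: the scaler sees every row, including the future test rows.
X_leaky = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)

# Safer version: fit the scaler on the training split only, then apply it.
X_tr_raw, X_te_raw, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr_raw)
X_tr, X_te = scaler.transform(X_tr_raw), scaler.transform(X_te_raw)

model = LogisticRegression().fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```

With a single scaler the damage is usually modest, but make the same mistake with feature selection, oversampling or hyperparameter tuning and the reported performance can be inflated enough to make a paper’s headline result meaningless.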
But in another respect, this is nothing new. Papers which challenge a status quo have always been more difficult to publish, despite their greater potential value to a community. It’s sadly much easier to do what everyone else is doing, even if it doesn’t really make sense to do so.
Another problem within the ML community is the speed at which it’s grown and the sheer volume of papers being produced. It’s no longer possible to have a relatively small number of experts in a field who can quality check everything that’s being produced. Imagine, for example, asking Yann LeCun [5] to review every deep learning paper that’s written. He wouldn’t be happy. Instead, papers have to be farmed out to a huge range of reviewers, many of whom have limited experience.
Things are particularly tough within ML conferences. There are a small number of top-tier ML conferences, including the likes of AAAI, CVPR and NeurIPS. Given the popularity of deep learning, these each receive a huge number of submissions, but the physical space for the conference remains the same, meaning that most submissions will be rejected. And when acceptance rates get this low, noise and bias become big issues. That is, whether or not a paper gets accepted becomes a lottery, and papers which challenge the status quo are particularly unlikely to make it through.
But wait a minute — do we still need peer review? If you work in ML, you’ve probably noticed that most ML papers appear on the preprint server arXiv well before they’re published. Many of the papers on arXiv are never formally published, yet still manage to garner a huge number of citations. That is, there are plenty of people reading and using preprints, even without the quality stamp of peer review. For instance, I didn’t even try to publish my ML pitfalls guide until several years after I first put it on arXiv, yet 90-or-so people still trusted it enough to cite it from their papers [6].
It’s also no longer the case that people read journals or conference proceedings. In the olden days, academics would schmooze on down to the library, find a nice quiet spot, and then spend a pleasant morning leafing through their favourite journal. But nowadays, papers are much more numerous, they’re published across a huge number of journals and conferences, and we find them using tools like Google Scholar. The only reason we look at which journal or conference a paper was published in is to get a rough idea of how respectable it is — but even that makes less sense these days, given how noisy peer review has become.
Yet we still do need something like peer review precisely because there is so much poor practice in ML. Whether or not peer review in its current form is effective at separating the wheat from the chaff is another matter, but we do need some way of recognising which papers we should pay attention to and which we shouldn’t touch with a bargepole. One approach that’s gained some attention is post-publication peer review — publish everything, but then let anyone comment on it. The theory is that dodgy papers will be flagged by the community, though in practice this can suffer from the same issues as social media when peer review is done anonymously [7].
Other people have suggested using LLMs to replace human reviewers. I can see some logic to this. In a way, it’s removing the middleman, since a number of reviewers already seem to be delegating their task to their favourite LLM. Sadly, it’s also likely to produce better reviews than many human reviewers manage. However, there is a real danger of it reinforcing existing biases. For instance, how could an LLM evaluate a fundamentally new idea that wasn’t in its training set?
In conclusion, there is no conclusion. Yes, there’s broad recognition that peer review is broken — and not just within ML — but there’s little agreement on what we can do to fix it. In practice, I suspect we’re going to muddle on with a mixture of traditional peer review, preprints, increasing use of LLMs, and maybe a splash of post-publication peer review. And yes, I am grumpy because my paper was rejected, but I think justifiably so.
1. But not the only one. Even worse is when reviews are done by what I might call “parasitic reviewers” — people who do it only to inflate their citation scores (and consequently their careers) by asking authors to cite their own papers. It’s a sad reality that appointment and promotion committees often care more about the number of citations to a paper than about their quality. A common hallmark of these kinds of review is generic text, e.g. “the presentation and language of this paper could be improved”, and increasingly the use of LLM-generated text.
2. The royal I, which in this case includes three other people.
3. Or at least I assume so, based on a step change in the style and quantity of text.
4. Worth noting that this doesn’t apply to all work in deep learning. There’s plenty of good work out there too, though it seems thin on the ground within the particular problem domain — computer security — that the rejected paper focused on. Daniel Arp et al. have written a nice review of poor ML practice in this area.
5. The convolutional neural network guy.
6. Suckers!
7. Which it tends to be, since reviewers don’t want authors to chase them down the street with a bargepole.