The details of ML do matter. It’s easy to make mistakes when applying ML, and these mistakes can invalidate the outcomes. So it’s important that the details are fully explained when writing up an ML-based study. And this is especially the case when the uses of ML have high stakes.
Which brings us to medicine. Arguably one of the highest-stakes fields in which ML is applied, because the consequences can literally be life and death. And so you’d think that reporting of ML methods would be particularly good in medicine, yes?
Well, no. It often feels like an afterthought. That is, the most impactful studies in medicine are usually published in top medical journals, like The Lancet. These journals care deeply about the details of the medical study, e.g. how many patients were involved, what were their demographics, what were the inclusion and exclusion criteria, and so on. But then the methods section often has the meagrest of details about what ML technique was used and how it was applied¹.
And it doesn’t help that the popular media doesn’t seem to care about details. Reports rarely say what kind of ML was used, or to what extent it was used, and there’s almost never any checking of whether it was used correctly. Yes, journalists might understandably assume the scientific process ensured that things were done correctly, but peer reviewers of medical journals are rarely experts in ML, and can easily miss fundamental errors.
As an arbitrary example of media reporting of ML models, recently I was reading about a new ML-based approach that can predict heart failure years in advance from information in heart scans that isn’t readily visible to humans. The report in the Guardian described it as “Groundbreaking AI”. I had to dig down into the original paper (in The Lancet, not linked from the article) to get any idea of how it worked, and then only by traversing appendices and previous publications by the authors.
Which is not to say this is a bad study. Quite the opposite, in fact. Unlike many AI-based studies, they actually followed a group of patients from their initial AI-led prognosis through to their actual outcomes, and the AI did pretty well at predicting them. This also means the results weren’t subject to the usual worries about misleading data leakage. However, given its importance, it would have been good to see an upfront description of the ML model and how it was used, rather than having to dig down (a long way) to find this. I did eventually work out that it was using a CNN to do segmentation (groundbreaking AI?) and some other model to tie the output of the CNN together with more conventional clinical features.
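To give a rough idea of what that kind of pipeline looks like, here’s a purely illustrative sketch in Python. It is not the study’s actual code, and every feature name and number below is invented; it just shows the general pattern of segmentation-derived measurements being combined with routine clinical variables in a simple downstream classifier.

```python
# Illustrative only: the general "CNN segmentation feeds a downstream model"
# pattern, NOT the pipeline from the study discussed above. All names and
# numbers here are invented for the sake of the example.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_patients = 200

# Pretend a CNN has already segmented each scan and its output has been
# reduced to a few scalar measurements per patient (e.g. chamber volumes).
segmentation_features = rng.normal(loc=[120.0, 60.0, 35.0], scale=15.0,
                                   size=(n_patients, 3))

# Conventional clinical features (e.g. age, systolic blood pressure),
# also synthetic here.
clinical_features = rng.normal(loc=[65.0, 130.0], scale=[10.0, 15.0],
                               size=(n_patients, 2))

# Synthetic outcome label: heart failure within some follow-up window (0/1).
outcome = (rng.random(n_patients) < 0.2).astype(int)

# "Tie together" the CNN-derived measurements and the clinical features
# with a simple downstream classifier.
X = np.hstack([segmentation_features, clinical_features])
model = LogisticRegression(max_iter=1000).fit(X, outcome)
predicted_risk = model.predict_proba(X)[:, 1]  # predicted risk per patient
```

The sketch isn’t meant to capture any of the study’s actual detail; it’s just the shape of the thing I had to reverse-engineer from the appendices.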
Most people understandably don’t care about these details. But there is a big space between the low-level details that I like to see and a popular media summary like “groundbreaking AI”. People should, for example, care about how much of the prediction is due to AI, and how much is due to human involvement or expertise. And if it’s entirely done by AI, then they should care about how much effort was put into trying to understand the AI’s predictions. If the whole system is just a black box, then they should be worried about this. After all, I’m sure few people want to put their health in the hands of a system that nobody understands.
It may be another of my pipe dreams, but I’d really like to see the popular media engage with the problems facing ML, rather than just feeding the hype. Telling people that AI is going to change the world does little to ensure it changes the world in ways their readers would actually welcome.
Beyond the correctness and appropriateness of the methods used by AI practitioners, which are perhaps a hard sell for the average reader, there’s also plenty of scope for discussing issues of fairness. For instance, was the patient data used in an AI study really representative of the end users? The failure of pulse oximeters to work as well on people with darker skin is a salutary lesson, and throwing AI into the mix only makes such problems harder to spot.
However, it’s not all on the popular media. Authors of published papers could make it easier for journalists (and readers more generally) to pick up on the relevant points. For instance, they might consider including a model card: a box of information that concisely describes which ML model was used, how it was trained and evaluated, and any limitations of the model or data that might affect real-world performance. Or (and I may be a bit biased here) fill in the REFORMS checklist and include it as an appendix.
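For what it’s worth, here’s a minimal sketch of the kind of fields a model card might cover, again in Python and with entirely invented placeholder values; the exact format matters far less than having the information stated up front.

```python
# A minimal sketch of the kind of information a model card might capture.
# Not a prescribed format; all field values are invented placeholders.
model_card = {
    "model": "CNN for cardiac scan segmentation + downstream risk model",
    "intended_use": "Estimate risk of heart failure from routine heart scans",
    "training_data": "N scans from hospitals A and B, collected 2015-2020",
    "evaluation": "Held-out cohort of M patients; discrimination and calibration reported",
    "human_involvement": "Predictions reviewed by a clinician before use",
    "limitations": [
        "Patient demographics may not match the deployment population",
        "Performance on scanners not represented in the training data is unknown",
    ],
}
```

Even a compact box like this, sitting alongside the abstract, would answer most of the questions above without using up much of a journal’s word limit.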
¹ Worth noting that this has a lot to do with the tight word limits in medical journals, e.g. 3500 words in the case of The Lancet. This doesn’t give much space to discuss ML aspects.
A timely post. I've also been trying to look under the bonnet of some recent medical advances that credit AI; it's not at all easy.