When machine learning doesn't learn
Machine learning is a branch of AI concerned with predicting and classifying things. Do I have cancer? Will my stock portfolio be worth more or less tomorrow? Will that asteroid collide with the Earth? Many of these things are quite important to our lives and livelihoods, but one of the big under-reported issues in machine learning is that many machine learning systems just don’t work.
In a recent article in The Gradient, I delve into the reasons why machine learning models can fail to work well in practice despite appearing to work well during development. I’ll say a little about this here too, and then I’ll talk about how we ended up in this situation, and what we might do to get out of it.
Why does machine learning fail?
Machine learning models are typically evaluated using a bunch of data points they haven’t seen before, known as a test set. For instance, if the aim of a model is to discriminate cats from dogs, then the test set will contain pictures of cats and dogs which were not used to train the model. Since the model hasn’t seen these data points before, evaluating a model using a test set should give a good idea of how well it works in practice — but often it doesn’t.
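To make this concrete, here’s a minimal sketch of that train/test workflow using scikit-learn. The data and model choice are stand-ins of my own for the cats-and-dogs example, not anything taken from a real study.

```python
# Minimal sketch of train/test evaluation with scikit-learn.
# The feature matrix X and labels y stand in for cat/dog image data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Pretend these are image features labelled cat (0) or dog (1)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# The test-set score is only a trustworthy estimate if the test data
# resembles the data the model will meet in the real world
print(accuracy_score(y_test, model.predict(X_test)))
```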
Sometimes this is due to issues with the data. For instance, if the data in the test set doesn’t reflect the real world, then the model’s performance on the test set won’t be informative. An example of this is using studio photographs to train and test a machine learning model. In a studio, everything is controlled and consistent. For example, Fluffy the cat may be held down in a sitting position, glaring at the camera whilst illuminated by a professional lighting rig. Out in the real world, Fluffy may be jumping from a tree under the glare of a streetlight. A machine learning model trained on studio photographs of cats is very unlikely to work on real world images, but this failure will not be detected if the test set only contains studio photographs.
But, a lot of the time, it’s not due to issues with the data. Rather, it’s due to issues with the machine learning pipeline — that is, the process used to train and test models. A common example of this is a data leak. This occurs when information about the test set somehow finds its way into the training of the machine learning model. When this happens, the model has the opportunity to overfit the test data, meaning that the test data no longer provides a reliable means of evaluating how well the model works. For example, if the model somehow finds out that most of the images in the test set are of ginger cats, then it might be able to do well on the test set just by looking at the colour of an image — whereas in practice this would be a poor way of recognising cats.
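As a rough illustration of how a leak can creep in, here’s a sketch (again scikit-learn, with synthetic noise data of my own devising, not anything from the article): features are selected using the whole dataset before splitting, so information about the test labels leaks into training, and the test score looks far better than it should.

```python
# A common leak: choosing features using ALL the data (including the
# test portion) before splitting. With pure noise there is nothing to
# learn, yet the leaky version often scores well above chance.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))      # noise "features"
y = rng.integers(0, 2, size=200)      # random labels: nothing to learn

# Leaky: feature selection sees the test labels via the full dataset
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
Xtr, Xte, ytr, yte = train_test_split(X_leaky, y, random_state=0)
print(LogisticRegression().fit(Xtr, ytr).score(Xte, yte))  # misleadingly high

# Safer: keep selection inside a pipeline fitted on training data only
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
print(pipe.fit(Xtr, ytr).score(Xte, yte))                  # around chance level
```

Keeping every data-dependent step inside a pipeline that is fitted only on the training data is one simple structural defence against this kind of leak.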
Unfortunately there are many ways [1] that deficiencies in the machine learning pipeline can result in models that seem good but are actually bad. And this is a big problem. In science, where machine learning is becoming the tool of choice, results have been invalidated due to issues in machine learning pipelines. Even in our day-to-day lives, systems built around machine learning are starting to tell us to do inappropriate things — like bathe in fetid water. As AI permeates more and more corners of our existence, the consequences of poor machine learning will only become greater.
How did we get here?
So how did we end up in a place where a technology increasingly integral to our lives appears to be so fragile? I’d argue that a lot of this comes down to a mismatch between the ease with which machine learning can be applied and the expertise required to apply it well. That is, it’s not hard to get some data, pick up a machine learning toolkit, and start training models. Unlike other fields that underpin the fabric of our lives, such as architecture, engineering and medicine, you don’t require a degree, or many years of experience, before you can start to use and deploy machine learning. However, you do require a lot of education or experience before you can start to do it reliably.
This situation has a lot to do with the recency [2] of machine learning and its rapid uptake. This has resulted in more demand for machine learning practitioners than there are trained practitioners, meaning that inexperienced practitioners can end up in inappropriate roles. That’s not to mention that there’s more demand for machine learning education than there are experienced educators, meaning that people leaving universities may not have learnt the right things. It’s also true that the dangers of technologies are often not understood until after they’ve entered common use [3].
And on top of this, there’s currently very little oversight of machine learning, and therefore few regulatory barriers to prevent all of these things from happening. Why is there so little oversight? Partly this is due to the inevitable gap between the deployment of a new technology and the development of regulations. However, it’s also because there are still few people in both government and wider society who really understand machine learning and its potential impact on our lives — and that brings us back to education.
Down with that sort of thing!
So perhaps this comes down to computer science education. If there were more understanding of computer science, then there would be more understanding of how computers work, what their limitations are, why we need governance over them, and what level of expertise practitioners and educators ought to have. With sufficient education of sufficient quality, maybe this whole situation will resolve itself.
But until then, we need other options. One approach is to encourage people to think more carefully about what they’re doing. This is partly the route we’re taking with the REFORMS checklist, which we hope will eventually become a gatekeeper to scientific journals that publish machine learning results. However, it’s not just a stick: there are also carrots, in the form of educating people who already practise machine learning about the potential pitfalls of doing so.
Another option is to try and remove these pitfalls. At the moment, machine learning frameworks let users do pretty much whatever they want. Perhaps they shouldn’t. Certainly there’s no reason why tools can’t restrict what their users are able to do, and they could potentially prevent a lot of common pitfalls by doing so. We could even train machine learning models to do this for us — assuming they work correctly!
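As a toy illustration of what a more restrictive tool might look like, here’s a hypothetical wrapper; the class and its behaviour are my own sketch, not a feature of any existing framework. It simply refuses to report a score when the test rows overlap with the rows it was trained on:

```python
# Hypothetical sketch of a "restrictive" tool: an estimator wrapper that
# refuses to evaluate on rows it has already been trained on. Purely
# illustrative; not part of scikit-learn or any other real framework.
import numpy as np
from sklearn.linear_model import LogisticRegression

class GuardedEstimator:
    def __init__(self, estimator):
        self.estimator = estimator
        self._seen = None

    def fit(self, X, y):
        # Remember the raw bytes of every training row
        self._seen = {row.tobytes() for row in np.asarray(X)}
        self.estimator.fit(X, y)
        return self

    def score(self, X, y):
        # Refuse to score if any test row was also used for training
        overlap = sum(row.tobytes() in self._seen for row in np.asarray(X))
        if overlap:
            raise ValueError(
                f"{overlap} test rows were also used for training: "
                "refusing to report a score on leaked data")
        return self.estimator.score(X, y)

# Usage (with stand-in arrays X_train, y_train, X_test, y_test):
# GuardedEstimator(LogisticRegression()).fit(X_train, y_train).score(X_test, y_test)
```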
[1] I talk about many of these in my article in The Gradient. Also see my longer review, How to Avoid Machine Learning Pitfalls: A Guide for Academic Researchers.
[2] Okay, machine learning got going way back in the mid-1900s, but it’s only recently that it’s hit the mainstream.
[3] Asbestosis and concrete cancer, to name a couple.