ML pitfalls guide updated
Just a quickie to say that I’ve updated my ML pitfalls guide on arXiv to sync it with a new version just published in Patterns — a data science journal from Cell Press that’s become a good place to find work in the area of ML practice.
Although the guide has now been formally published, this is not the end of its road. I intend to keep updating it until there’s no longer a need for such a thing. Whilst it’s good to have a peer-reviewed version, pitfalls don’t stay still, so the usefulness of this version will gradually wane over time. And since it’s unusual to continue working on a paper after publication, I’d like to thank the Editor-in-Chief of Patterns for supporting my plan to do so.
This new version extends the previous one in a number of ways. Most notably, I’ve added a few new topics: baselines, data cleaning, and fairness. I’ll say a bit about each of these below. The last two were suggested by the Patterns reviewers, so thanks to them for prompting me to fill in these important holes.
Baselines are the models that any new modelling approach should be compared against in order to get an objective view of how well it works. Depending on the context, these may be the current state of the art, they may be a simpler version of the new approach, or they may be a simplistic naïve model. Whichever it is, they’re basically there to prevent people from drawing misleading conclusions.
A classic example can be seen in time series prediction, where predicting the next value to be the same as the current value — a naïve model — often beats seemingly more intelligent ML models. An illustration of this can be found in Hewamalage et al.’s paper on ML pitfalls in time series forecasting, where they found that a transformer model published at a top conference didn’t work as well as this naïve model. Failing to recognise this (as the original authors of the transformer model presumably did) risks encouraging other people to apply inappropriate models.
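To make this concrete, here’s a quick sketch of the kind of comparison I mean. The random-walk series and the “fancier model” below are made up purely for illustration (they’re not the data or the transformer from that paper); the point is simply that the persistence baseline sets the bar any model needs to clear.

```python
# A persistence ("naive") baseline for one-step-ahead forecasting, compared
# against a placeholder model's predictions using mean absolute error.
import numpy as np

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))   # synthetic random-walk series

train, test = series[:150], series[150:]

# Persistence baseline: predict the next value to be the same as the current one.
naive_preds = np.concatenate(([train[-1]], test[:-1]))

# Stand-in for a fancier model's one-step-ahead predictions.
model_preds = naive_preds + rng.normal(scale=0.5, size=len(test))

mae = lambda y, p: np.mean(np.abs(y - p))
print(f"naive MAE: {mae(test, naive_preds):.3f}")
print(f"model MAE: {mae(test, model_preds):.3f}")
```

On a random walk like this, the persistence baseline is about as good as it gets, so any model that can’t match it shouldn’t be making headlines.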
And baselines are not just required for fancy new transformer models. They should be used whenever new results are being presented, to reassure readers that what is being presented is noteworthy. Unfortunately there are many ML papers that just say things like “we achieved an awesome accuracy of N%” without realising that an off-the-shelf algorithm could have done better. Inspired by this misleading reporting, other people then come and follow in their footsteps, and ultimately civilisation collapses (okay, maybe a slight exaggeration).
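For classification tasks, scikit-learn makes this kind of sanity check almost free via its DummyClassifier. Here’s a minimal sketch using a stock dataset; the particular dataset and model are just stand-ins, and the point is the comparison rather than the numbers.

```python
# Checking a model's accuracy against an off-the-shelf baseline that simply
# predicts the majority class.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print(f"baseline accuracy: {baseline.score(X_test, y_test):.2f}")
print(f"model accuracy:    {model.score(X_test, y_test):.2f}")
```

If your shiny new model only just beats the majority-class guesser, that “awesome accuracy of N%” suddenly looks rather less awesome.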
Another topic which I’ve added is data cleaning. Not an exciting topic to most people, but an important one. Whether data is collected by people or machines, all sorts of things can go wrong. Failing to remedy these things before building models using the data can be a serious pitfall, and has led to the failure of many projects. And so has remedying them inappropriately.
An example is missing values. These are often easy to spot, especially when they result in visible holes in the data. It can be tempting to just throw an off-the-shelf imputation algorithm at them (imputation = filling in holes). However, care should be taken here, since imputation is yet another opportunity for a data leak to occur. Specifically, this occurs when the parameters of the imputation model are fit using the whole dataset. So, for example, if you’re doing mean imputation, then the mean should be calculated using just the train data, with the test data kept locked up in the cupboard under the stairs.
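Here’s a minimal sketch of how to keep mean imputation honest with scikit-learn. The synthetic data and the logistic regression are placeholders; the important bit is that the imputer is fitted on the training split only.

```python
# Mean imputation without a data leak: the imputer's means are computed from
# the training split only, then applied to both splits. Wrapping it in a
# Pipeline keeps this discipline automatic (including under cross-validation).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # simple synthetic target
X[rng.random(X.shape) < 0.1] = np.nan          # punch some holes in the features

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
pipe.fit(X_train, y_train)                     # means come from the train data only
print(f"test accuracy: {pipe.score(X_test, y_test):.2f}")
```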
The final new topic is fairness, which has become increasingly significant as ML models have been rolled out to make important decisions. There have been many reported cases of ML models being unfairly biased against people of particular genders, ethnicities or backgrounds. ML systems that make decisions on things like loans, probation and medical treatments have all been found wanting in this regard. In response, fairness in ML is about ensuring that people with different characteristics will be treated equally by a model. It has been a rapidly growing area of study, so there are quite a few ways of achieving and measuring fairness — see the guide for some pointers.
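As a very simple illustration, here’s one common check, demographic parity, which compares a model’s positive decision rate across groups. The groups and decisions below are entirely made up to show the calculation; real fairness auditing involves a lot more than this, as the guide discusses.

```python
# Demographic parity: compare the rate of positive decisions across groups.
# A large gap suggests the model may be treating the groups differently.
import numpy as np

rng = np.random.default_rng(0)
group = rng.choice(["A", "B"], size=1000)                     # hypothetical sensitive attribute
preds = rng.random(1000) < np.where(group == "A", 0.6, 0.4)   # hypothetical model decisions

rates = {g: preds[group == g].mean() for g in ("A", "B")}
print("positive decision rates:", rates)
print("demographic parity difference:", abs(rates["A"] - rates["B"]))
```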
Related to fairness is the topic of explainability, which I’ve also added more about. This is because it’s hard to prove a model is fair if you have no idea how it works. Explainability in ML attempts to address this by providing some insight into the workings of opaque models such as transformers. However, explanations of complex models will always be incomplete, and potentially misleading, so care should be taken when using them.
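By way of example, here’s a sketch of one widely used, model-agnostic explainability technique, permutation importance, which measures how much a model’s held-out score drops when each feature is shuffled. The model and dataset are just placeholders, and, as noted above, the resulting explanation is partial at best.

```python
# Permutation importance: shuffle each feature in turn and see how much the
# model's held-out score degrades. Bigger drops suggest more influential features.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

ranked = sorted(zip(result.importances_mean, data.feature_names), reverse=True)
print("most influential features:",
      [(name, round(float(score), 3)) for score, name in ranked[:5]])
```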
My hope is that this guide will remain useful to the ML community, so please do send me feedback if there’s something missing that you’d like me to cover. I’m particularly eager to keep up with any new pitfalls that are emerging alongside the growing use of LLMs within ML. I’ve already touched upon some of these in the guide — including the need to do multiple evaluations at inference time, the need to take into account service or hosting costs, and the potential for data leaks due to the opacity of their training processes. However, I’m sure that new ones will appear as LLMs push further into ML practice.
Check out the pitfalls guide for more of these than you can shake a stick at.