ML Pitfalls #1: Classification metrics
I’ve not talked about machine learning pitfalls for a while. I’d hoped by now that everyone would have read my guide to avoiding them and be living in a blissful pitfall-free state. But, alas, the evidence1 suggests otherwise. So I thought I’d start a series focusing on particular pitfalls. And this is the first one — woohoo!
I’m going to start with a basic yet important topic: the metrics used to measure the performance of classifiers.
There’s a substantial bunch of classification metrics to choose from. In addition to the well-known accuracy, there’s also (in no particular order) precision and recall, their harmonic mean the F1 score, and their weighted harmonic mean the Fβ score. Then there’s the similar-but-subtly-different specificity and sensitivity, plus their mean, balanced accuracy, and their geometric mean, G-mean. If a classifier has a continuous output, you can also generate AUC-ROC (the area under the ROC curve) from a bunch of specificity-sensitivity values taken at different thresholds; or, if you’re more inclined that way, AUC-PR, the area under the precision-recall curve. And if that’s not enough, there’s also the Matthews correlation coefficient (MCC), plus a pile of more obscure metrics like Cohen’s kappa and gamma.
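If you want to play with these, here’s a minimal sketch (assuming scikit-learn, with made-up toy labels and scores) of how most of them can be computed:

```python
# A minimal sketch: computing most of the metrics above with scikit-learn.
# The labels and scores are made-up toy values, purely for illustration.
import numpy as np
from sklearn import metrics

y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])   # ground-truth labels
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7,    # continuous outputs
                    0.2, 0.55, 0.8, 0.9])
y_pred  = (y_score >= 0.5).astype(int)                # thresholded labels

print("accuracy         ", metrics.accuracy_score(y_true, y_pred))
print("precision        ", metrics.precision_score(y_true, y_pred))
print("recall           ", metrics.recall_score(y_true, y_pred))
print("F1               ", metrics.f1_score(y_true, y_pred))
print("F-beta (beta=2)  ", metrics.fbeta_score(y_true, y_pred, beta=2))
print("balanced accuracy", metrics.balanced_accuracy_score(y_true, y_pred))

# Sensitivity is recall on the positive class; specificity is recall on the negative class.
sens = metrics.recall_score(y_true, y_pred, pos_label=1)
spec = metrics.recall_score(y_true, y_pred, pos_label=0)
print("G-mean           ", np.sqrt(sens * spec))

# These two need the continuous scores, not the thresholded labels.
print("AUC-ROC          ", metrics.roc_auc_score(y_true, y_score))
print("AUC-PR (approx.) ", metrics.average_precision_score(y_true, y_score))

print("MCC              ", metrics.matthews_corrcoef(y_true, y_pred))
print("Cohen's kappa    ", metrics.cohen_kappa_score(y_true, y_pred))
```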
I’m not going to walk you through their definitions — you can find plenty of that elsewhere. However, I do want to talk about some of their properties and limitations, and how this might influence your choice of metrics in practice.
I’m also not going to say much about accuracy, since I’ve already said plenty elsewhere2, and so have lots of other people. But basically, if your data set is substantially imbalanced (i.e. some classes are bigger than others) and accuracy is the only metric you’re using, then you should definitely consider other metrics.
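To illustrate the point, here’s a toy sketch (made-up data, assuming scikit-learn) of how a do-nothing classifier can still look impressive if accuracy is all you measure:

```python
# A toy illustration of the accuracy pitfall: on a 95/5 class split, a "classifier"
# that always predicts the majority class scores 95% accuracy yet never finds a positive.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

y_true = np.array([0] * 95 + [1] * 5)   # imbalanced ground truth
y_pred = np.zeros_like(y_true)          # always predict the majority class

print(accuracy_score(y_true, y_pred))             # 0.95 -- looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- positive class never found
print(matthews_corrcoef(y_true, y_pred))          # 0.0  -- no better than a constant guess
```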
It’s also worth noting that metrics are not always chosen for their properties. Sometimes they’re chosen because they’re widely used in a particular domain — sometimes for good reason, sometimes not. So don’t exclude a metric just because no one else is using it; but equally, don’t ignore ones that are commonly used, since ignoring them makes it harder to compare your results with other people’s.
And bear in mind that there’s no such thing as a perfect metric. They all have their issues. And this is always going to be the case when you’re trying to squeeze a whole lot of classifications into one number — they don’t fit, and some information will get lost. Different metrics lose (and emphasise) different information. This is why you should always use multiple metrics, since they often compensate for one another’s weaknesses, and together give a much more complete picture.
I guess the first thing to clarify is the difference between unity metrics and pairwise metrics. Unity metrics (like accuracy, F1 and MCC) are intended to give a complete summary of performance, whereas pairwise ones (like precision-recall and specificity-sensitivity) are incomplete by themselves, but when used together give a more complete picture. This is important because the same classifier can have, for example, a very high precision and a very low recall, so presenting just precision is misleading.
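Here’s a contrived little example of that (made-up numbers, assuming scikit-learn): a very cautious classifier that only flags the one positive it’s sure about can score perfect precision while missing most of the positive class:

```python
# Contrived example: a very conservative classifier flags only one positive,
# giving perfect precision but poor recall -- quoting precision alone would mislead.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # only one positive prediction

print(precision_score(y_true, y_pred))  # 1.0   -- every positive prediction was right
print(recall_score(y_true, y_pred))     # 0.2   -- but 4 of the 5 positives were missed
print(f1_score(y_true, y_pred))         # ~0.33 -- the harmonic mean punishes the gap
```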
A distinction can also be made between metrics used for binary and multiclass classification. However, this is a fuzzy area, since many metrics have been adapted for use in both of these settings. For example, MCC and the AUCs, which are typically used for binary classifiers, also have multiclass versions — in fact, both have more than one multiclass version. And this raises another common issue in the land of metrics: referring to different things by the same name.
A common example of this is the holy triad of precision, recall and F1 score. These are probably the most widely used metrics in modern-day machine learning. F1, in particular, tends to be the metric that people focus on when comparing classifiers. However, the term F1 can be used to refer to a bunch of related metrics, partly because precision and recall are class-specific concepts.
Recall, if you recall3, is the proportion of positive examples of something that are correctly classified as being positive, and precision is the proportion of positive classifications that are actually positive examples. Both are dependent on the term “positive”, which usually refers to a specific class. If you use it to refer to another class, then you’d likely get different values for precision and recall. So, in binary classification, for example, you can generate two different pairs of precision and recall values for the two different classes, and these can then be used to generate two different F1 values — usually known as the per-class F1 scores.
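As a quick sketch of this (toy labels, assuming scikit-learn), the same binary predictions give different precision, recall and F1 values depending on which class you treat as positive:

```python
# The same binary predictions give different precision/recall/F1 values depending on
# which class is treated as "positive" -- i.e. the per-class scores.
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

print(f1_score(y_true, y_pred, pos_label=1))  # F1 with class 1 as "positive"
print(f1_score(y_true, y_pred, pos_label=0))  # F1 with class 0 as "positive"

# Or get the per-class values all at once (one entry per class, in label order).
prec, rec, f1, support = precision_recall_fscore_support(y_true, y_pred)
print(prec, rec, f1, support)
```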
Sometimes it makes sense to focus on per-class F1 scores. Typically this is where the positive class is either very small compared to other classes4, or where the positive class is particularly important. Both of these apply in fields such as medicine and fraud detection. But the downside of per-class F1 scores is that they don’t give a single overall value that captures a classifier’s performance across all classes.
Consequently, it is common to use other forms of the F1 score that do summarise performance in a more complete way. Macro F1 is the most widely used of these, and is simply the mean of all the per-class F1 scores. Often when people say “F1 score”, this is the one they’re reporting. However, macro F1 has the opposite problem to per-class F1 — it treats all classes as being of equal worth, and this can be misleading in certain situations. For example, if a classifier does especially well or especially badly on rare classes that are not typical of the data it will see in practice, then this can lead to an overly optimistic or overly pessimistic measure of performance.
A solution to this5 is to use a weighted F1 score, which calculates a weighted mean of the per-class F1 scores, weighted by the size of each class. This solves the above problem, but is troublesome in situations where you don’t want the performance on common classes to mask the performance on rarer classes. It can also be troublesome if the data distribution in your test set doesn’t match the real-world data distribution, which can happen in some fields due to biases in data collection6. So, if you’re planning to use F1 scores, think about which formulation is most appropriate, and maybe use more than one.
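To make the differences concrete, here’s a small sketch (made-up imbalanced labels, assuming scikit-learn) comparing the per-class, macro and weighted variants on the same predictions, plus the micro variant mentioned in the footnotes:

```python
# How the different F1 averages can tell different stories on imbalanced data:
# the classifier below does well on the common class 0 and poorly on the rare class 1.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 88 + [1] * 2 + [1] * 3 + [0] * 7)

print(f1_score(y_true, y_pred, average=None))        # per-class F1 scores
print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average="weighted"))  # mean weighted by class size
print(f1_score(y_true, y_pred, average="micro"))     # micro F1 (here equal to accuracy)
```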
Class imbalance is an important issue for classification metrics more generally. I mentioned accuracy up top, which is particularly sensitive to this, but every metric has some degree of sensitivity to imbalance. Even the ones that are reputed to be good for imbalanced data, such as F1, MCC, AUC-ROC and AUC-PR, are known to break down at some point. And it doesn’t help that people disagree about which metric is best for a particular type of imbalance — for instance, I thought that AUC-PR was a better choice than AUC-ROC when faced with seriously imbalanced data, but then a recent paper said the opposite. So, always be wary of the results you get in imbalanced scenarios, and use multiple metrics to mitigate their varying failure modes.
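As a small illustration of why the choice matters (synthetic data, assuming scikit-learn and NumPy), a classifier whose scores carry no signal at all can look mediocre-but-harmless on AUC-ROC, while AUC-PR collapses towards the positive-class prevalence:

```python
# With a 1% positive class and random scores, AUC-ROC sits near 0.5 regardless of
# the imbalance, while AUC-PR drops towards the prevalence of the positive class.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 4950 + [1] * 50)   # 1% positive class
y_score = rng.random(5000)                 # scores carry no real signal

print(roc_auc_score(y_true, y_score))            # somewhere near 0.5
print(average_precision_score(y_true, y_score))  # close to the 0.01 prevalence
```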
That said, MCC7 is an under-appreciated metric. Unlike AUC-ROC and AUC-PR, it works on classifiers with discrete (rather than continuous) outputs, and also has multi-class versions. Many people believe it’s more reliable than F1 and the AUCs when used with imbalanced data, though this is contested8.
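If you want to try it, here’s a minimal sketch (toy labels, assuming scikit-learn) showing MCC on both binary and multiclass predictions:

```python
# MCC works directly on discrete predictions and has a multiclass generalisation
# built into scikit-learn. Values range from -1 to +1, with 0 meaning no better than chance.
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Binary case.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])
print(matthews_corrcoef(y_true, y_pred))

# Multiclass case: the same function applies the multiclass generalisation.
y_true_mc = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred_mc = np.array([0, 1, 1, 1, 2, 2, 0, 2])
print(matthews_corrcoef(y_true_mc, y_pred_mc))
```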
But reliability isn’t the only concern when using metrics. Interpretability, i.e. understanding what a number is trying to tell you, is also important. This is where accuracy shines: it’s simply the proportion of classifications that the classifier got right. AUC-ROC is also a strong contender from an interpretability perspective, since it gives the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Other commonly-used unity metrics are rather less transparent, though MCC and Cohen’s kappa are arguably interpretable, in that they measure correlation and chance-corrected agreement respectively, both of which are widely-understood concepts.
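To see that ranking interpretation in action, here’s a small sketch (toy scores, assuming scikit-learn) that computes the pairwise rank probability directly and checks it against roc_auc_score:

```python
# AUC-ROC equals the proportion of (positive, negative) pairs in which the positive
# example gets the higher score, with ties counted as half.
import numpy as np
from itertools import product
from sklearn.metrics import roc_auc_score

y_true  = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.3])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = [1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg)]

print(np.mean(pairs))                  # pairwise rank probability
print(roc_auc_score(y_true, y_score))  # the same value
```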
I’m not going to dig any further into the properties of metrics, since it would soon wander into mathematical territory. However, if you want to know more, Juri Opitz recently published a relatively accessible article on this topic. For anyone working with image data specifically, I’d also recommend a recent pair9 of papers published in Nature about image-related metrics, and their companion website Metrics Reloaded.
But to sum up, always use multiple metrics. That way, even if you make a suboptimal choice of individual metrics, or don’t fully understand what each metric is telling you, there’s still plenty of useful information available. And equally, don’t rely on a single metric when deciding which classifier is best. Different classifiers may do well on different metrics, and this might reflect different trade-offs in their behaviour.
Well, the majority of ML papers I’ve been reading and reviewing have left me feeling sad.
Sorry.
This is actually the original use case of precision and recall, which came out of the field of information retrieval.
Another potential solution is the micro F1 score, which treats all samples equally and is essentially blind to class labels. However, in most cases, micro F1 ends up being equivalent to accuracy, so it’s usually not a great option.
For example, in medicine, a clinician who is collecting data for a study may have easier access to people with a condition than people without — potentially leading to the positive class being bigger than the negative class, despite the opposite being true in the wild.
MCC is equivalent to the phi coefficient.
Specifically in Qiuming Zhu’s 2020 paper On the performance of Matthews correlation coefficient (MCC) for imbalanced dataset. Though Davide Chicco and co-authors have written a bunch of papers asserting the opposite, including their 2023 paper The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification.