ML Pitfalls #3: Data Contamination in LLMs
Apologies for the unplanned hiatus — things have been a bit full-on recently! I’m aiming to get back to my regular posting schedule, but life does tend to get in the way of things…
In this third post in the ML Pitfalls series, I’m going to talk about what I think is one of the most important emerging risks in contemporary machine learning — data contamination in LLMs.
Tasks that were once the domain of decision trees, plain neural networks, and myriad other traditional models are, like everything else in ML, increasingly being offloaded to LLMs. A good example is sentiment analysis, where an LLM is given some text and asked to say whether the text has a positive or negative sentiment — a classification task. Just like in traditional ML, a test set is used to measure how well it does at this.
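To make the setup concrete, here's a minimal sketch of what evaluating an LLM as a sentiment classifier against a held-out test set might look like. The query_llm() function is a hypothetical stand-in for whatever API or local model you're using; it isn't part of any particular library.

```python
# Minimal sketch: evaluating an LLM as a sentiment classifier on a test set.
# query_llm() is a hypothetical stand-in for a call to your LLM of choice.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your LLM API or local model")

def classify_sentiment(review: str) -> str:
    prompt = (
        "Classify the sentiment of the following review as exactly one word, "
        f"'positive' or 'negative'.\n\nReview: {review}\n\nSentiment:"
    )
    return query_llm(prompt).strip().lower()

def accuracy(test_set: list[tuple[str, str]]) -> float:
    """test_set is a list of (review_text, gold_label) pairs."""
    correct = sum(classify_sentiment(text) == label for text, label in test_set)
    return correct / len(test_set)
```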
But LLMs, unlike most traditional ML models, already know a lot of things. And this might include elements of the test set. For example, if you’re measuring an LLM’s performance on sentiment analysis using one of the many Amazon reviews datasets out there, it’s quite likely that the whole dataset (test set included) has already been ingested by the LLM. So maybe you’re not measuring its ability to do sentiment analysis, but just its ability to recall information it’s already learnt[1]. This is known as data contamination.
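One rough way to probe for this kind of memorisation is to give the model the first half of a test sample and check whether it reproduces the second half near-verbatim; if it does, that sample has quite possibly been ingested during training. Below is a minimal sketch of the idea, reusing the hypothetical query_llm() from above. The 0.8 similarity threshold is an arbitrary illustrative choice, not an established cut-off.

```python
import difflib

def completion_overlap(sample_text: str, query_llm) -> float:
    """Ask the LLM to continue the first half of a test sample, and measure
    how closely its continuation matches the true second half."""
    words = sample_text.split()
    prefix = " ".join(words[: len(words) // 2])
    true_suffix = " ".join(words[len(words) // 2:])
    prompt = f"Continue the following text exactly as it originally appeared:\n\n{prefix}"
    generated = query_llm(prompt)
    # A ratio close to 1.0 suggests the sample may have been memorised.
    return difflib.SequenceMatcher(None, generated, true_suffix).ratio()

def flag_possible_contamination(test_samples, query_llm, threshold=0.8):
    # The threshold is illustrative rather than an established standard.
    return [s for s in test_samples if completion_overlap(s, query_llm) > threshold]
```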
It’s worth mentioning here that “already knowing a lot of things” is exactly why LLMs are useful to ML. Most traditional ML models don’t know anything before they are trained, and hence they need a lot of training data to learn stuff. LLMs, on the other hand, can often be used for ML tasks that they were not explicitly trained to solve[2], without requiring task-specific training data. In turn, this can make it possible to do ML-related tasks in areas where data is not readily available.
But a big problem is that we don’t know (in most cases) what data an LLM was exposed to during its original training process. So we don’t know whether it’s really generalising, or merely querying its training data. And even if we do know what data an LLM was trained on[3], the sheer volume of data makes it challenging to practically identify data contamination and its consequences[4].
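When the training corpus is available (as with OLMo, see the footnotes), one common but crude check is n-gram overlap: flag any test sample that shares a long word n-gram with the training data. Here's a minimal sketch, assuming the corpus fits in memory; real training corpora generally don't, so in practice you'd need hashed n-gram indices or purpose-built tooling.

```python
def ngrams(text: str, n: int = 13):
    """Yield word n-grams from a piece of text (13 words is a commonly used length)."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])

def contaminated_samples(test_samples, training_docs, n=13):
    # Build the set of all n-grams seen in the training corpus.
    # (Assumes everything fits in memory; real corpora need hashing or sharding.)
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams.update(ngrams(doc, n))
    # Flag any test sample that shares at least one long n-gram with the training data.
    return [s for s in test_samples if any(g in train_ngrams for g in ngrams(s, n))]
```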
Generally speaking, data contamination is an issue for any dataset that has been made publicly available. Unfortunately, this means it’s an issue whenever someone uses a publicly-shared benchmark dataset, or creates their own dataset from publicly-available material. And given that recent LLMs have multimodal capabilities, it is not just a problem when working with text data. For example, it is likely that most of the commonly-used LLMs have ingested data from computer vision benchmarks like CIFAR and ImageNet. This not only makes it harder to compare ML models meaningfully, but also reduces the time over which benchmarks remain useful, and hence the incentive for people to build them.
But more generally, data contamination can cause overconfidence in LLMs. There have been numerous studies showing that the reasoning abilities of current LLMs are fragile, and closely tied to the alignment between what they’re asked to do and the data they were originally trained on. Small misalignments can lead to large drops in performance. This means that if the real-world data you’re hoping to apply an LLM to is in any way different to the data it was originally trained on, then there’s a risk it won’t work.
And from this perspective, it’s not surprising that in many application domains data contamination has been found to boost an LLM’s measured capabilities, artificially exaggerating its performance. Although, counterintuitively, there are also published examples where data contamination has reduced the performance of an LLM[5]. Either way, data contamination is something we want to avoid.
But how do we avoid it? With existing datasets, avoidance isn’t really a possibility[6]. If you want to use public datasets with an LLM, then the onus is really on you to demonstrate the absence of contamination. Or, at the very least, to not compare the performance of possibly-contaminated LLMs against traditional ML — this is something I’ve seen done, and it gives the LLM an unfair and possibly ungeneralisable advantage.
If you’re putting together a new dataset, and wish to share it, then there are options. One of these is to encrypt the dataset and only share the decryption key in plain text[7]. This would probably stop current LLMs in their tracks, but it wouldn’t be surprising if future LLMs[8] could work out how to use the key to decrypt the data themselves. So this is likely to be a time-limited solution.
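As a rough illustration, here's a minimal sketch using symmetric encryption from the third-party cryptography package. The idea is that the encrypted file and its key are published side by side, so a human can decrypt the data deliberately, but a scraping pipeline won't ingest it as plain text. This is my own simplification of the idea rather than the exact scheme described in the referenced paper.

```python
# Sketch: publish an encrypted dataset plus its key, so the data never sits
# on the web in plain text for crawlers to ingest.
# Requires the third-party 'cryptography' package (pip install cryptography).
from cryptography.fernet import Fernet

def encrypt_dataset(in_path: str, out_path: str, key_path: str) -> None:
    key = Fernet.generate_key()
    with open(in_path, "rb") as f:
        ciphertext = Fernet(key).encrypt(f.read())
    with open(out_path, "wb") as f:
        f.write(ciphertext)
    with open(key_path, "wb") as f:  # published alongside the encrypted data
        f.write(key)

def decrypt_dataset(enc_path: str, key_path: str) -> bytes:
    with open(key_path, "rb") as f:
        key = f.read()
    with open(enc_path, "rb") as f:
        return Fernet(key).decrypt(f.read())
```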
Another is to use a dynamic dataset that changes over time[9], though this is more challenging to implement. This could involve periodically (or perhaps continuously) updating the data so that it only uses recent samples that can reasonably be assumed not to have been ingested by an LLM yet. Alternatively, some form of synthetic data could be periodically regenerated from a model of the underlying data distribution.
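As a toy illustration of the synthetic-data variant, the sketch below regenerates a fresh two-class test set from fixed Gaussian class distributions, with the random seed derived from the current month, so each evaluation period uses samples that no model can have seen before. The Gaussian model is obviously a placeholder for a proper model of your own data.

```python
import numpy as np
from datetime import date

def monthly_test_set(n_samples: int = 1000, n_features: int = 10):
    """Regenerate a synthetic two-class test set each month.

    The class distributions stay fixed (two Gaussians with different means),
    but the samples themselves are redrawn every month, so they cannot
    already be present in any model's training data.
    """
    today = date.today()
    seed = today.year * 100 + today.month  # e.g. 202511: a new seed each month
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)
    X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n_samples, n_features))
    return X, y
```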
A third option is to set up some kind of private benchmarking service that controls access to data. This is potentially even more challenging to implement, though see this paper for some idea of what such a system could look like.
Of course, it would be easier if we could ask LLMs not to ingest public datasets in the first place, and therefore avoid all these overheads. In theory you can do this by adding directives for AI crawlers (such as GPTBot or ClaudeBot) to the robots.txt file in the root directory of the website hosting a dataset. Some LLM developers say their crawlers will abide by the instructions in such files. But in practice, there’s nothing to stop badly-behaved scrapers from ingesting your content anyway. And given the importance of benchmarks in marketing new LLM releases, can we really rely on the honesty of the “good guys”?
So, in a nutshell, be aware that data contamination is a big issue, try to avoid it when doing your own ML studies, and understand that a lot of published studies have probably reported over-optimistic results due to data contamination.
[1] Or, if the class labels of the test set weren’t readily available during ingestion, the results will at least be biased by the contributions of these samples to the LLM’s underlying language model. It’s a bit like a human taking an exam that they already saw whilst revising, even if the answers weren’t given.
[2] Sentiment analysis is also a good example of this. You can ask an LLM whether a piece of text has a positive or negative sentiment without having to fine-tune the model on any examples of classifying text. Fine-tuning might improve its behaviour, but you’d likely get good performance right out of the box.
[3] For example, the fully open-sourced OLMo model provides its training data.
[4] This is made harder because the provenance of data in datasets is often unclear. For example, the dataset itself might not have been used in training, but some of its data sources may have been.
[5] For example, this paper.
[6] Well, mostly. You could try to remove the knowledge from the model, and there are techniques like ROME (Rank-One Model Editing) that aim to achieve this. However, they do tend to lobotomise models to some extent, reducing their overall abilities.
[7] As described in this paper.
[8] And perhaps current agentic approaches.
[9] For a review of this idea, see this paper.