GenAI in machine learning: what could possibly go wrong?
A new guide to the risks of putting LLMs into machine learning workflows
First of all, apologies for the recent gap in posting. It’s been a busy time work-wise, not helped by the enthusiastic company of an overly friendly virus. But one of the less disease-ridden things to happen to me recently was the publication of my new guide Pitfalls and risks of generative AI in machine learning in Cell Press’ excellent data science journal Patterns. You can read this open access article here.
Here’s a taster of some of these pitfalls and risks:
Regulatory compliance may be impossible to achieve
Model updates can silently and radically change behaviour
Data contamination can invalidate evaluation
Synthetic data can carry hidden bias and legal risk
Generative AI dramatically expands the attack surface
Sensitive information can leak via tools and explanations
Long-term costs may be unpredictable or uncontrollable
And beyond this, the use of LLMs also amplifies existing pitfalls found within ML workflows. You know, the kind of things I’ve been banging on about for the last two years: data leaks, overfitting, bias, spurious correlations, blah blah blah1.
This all matters because people are already using generative AI within ML workflows. You’ve probably seen some of this: product recommendations online, spam detection, automated routing of customer support tickets. These kinds of things are now quite routine, and the consequences of mistakes are quite limited: “Argh, you recommended me a product I wasn’t interested in; that’s 2 seconds of my life wasted!”
But the more worrying examples are those we’re not so aware of. In the guide, I frame the discussion around two hypotheticals: loan approval decisions and medical triage. But to avoid treading the same ground, let’s take a related example – insurance. AI in insurance already causes issues, particularly within US healthcare, where the use of AI models to make ongoing care decisions has become routine and controversial2. And this was before generative AI became a thing.
From an insurer’s point of view, it’s not hard to see why throwing generative AI into the mix might be appealing. LLMs can in principle make better decisions than conventional ML by processing unstructured data and bringing in information beyond locally-collected data. This information might include general understanding of a particular context, e.g. what kind of intervention is appropriate within a particular clinical care situation. Which in turn might lead to more robust decisions.
However, this assumes that LLMs are reliable, trustworthy and unbiased, which is not currently the case. Bias is a well-known issue; by training on vast amounts of data, modern LLMs have learnt a significant proportion of human knowledge, but they’ve also learnt all of humanity’s biases and historical unfairness. And they’re quite capable of applying these biases to decisions they’re asked to make. Which is a big problem for insurance, since it could lead to certain communities being unfairly treated.
Hallucination, or the tendency to make things up, is another well-known issue. In conventional machine learning, trained models are pretty much forced to look at their input data when making decisions. But this is not true of LLMs. You can give an LLM some data, and ask it to make a decision based on this data, but in practice there’s nothing stopping it from making up its own data and basing its decision on that.
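To make that a bit more concrete, here’s a minimal sketch of what spotting made-up data might look like. It assumes, purely for illustration, that the LLM has been asked to return its decision along with the facts it claims to have relied on. The Decision class, the field names and the example claim are all hypothetical and not taken from the guide. And note that this only catches invented facts the model admits to; it says nothing about what actually drove the decision.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical structured output from an LLM-based decision step."""
    outcome: str
    cited_facts: dict  # field name -> value the model says it relied on


def find_ungrounded_facts(record: dict, decision: Decision) -> dict:
    """Return cited facts that are absent from, or disagree with, the input record."""
    ungrounded = {}
    for field, cited_value in decision.cited_facts.items():
        if field not in record or record[field] != cited_value:
            ungrounded[field] = cited_value
    return ungrounded


# Made-up input record and made-up model response, for illustration only.
claim = {"age": 54, "procedure": "knee replacement", "prior_claims": 1}
response = Decision(
    outcome="deny",
    cited_facts={"age": 54, "prior_claims": 4, "smoker": True},
)

# Flags the altered value (prior_claims) and the invented field (smoker).
print(find_ungrounded_facts(claim, response))
```

A conventional classifier has no equivalent failure mode, since it can only compute on the feature vector it’s handed.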
Another potential benefit of modern LLMs is the opportunity to engage in agentic behaviour; things like going off and searching databases or the web to find relevant information. In theory, this could further increase the robustness of decisions, but in practice there’s a real risk of inappropriate knowledge or misinformation influencing decisions. And agentic behaviour also increases the risks of hallucination – why go to the effort of searching a database when you could just make something up?
All of this is made worse by the opacity of LLMs. It’s very hard to figure out how they make their decisions, and this becomes even harder when they start carrying out agentic behaviour or – as is increasingly the case – interacting with other LLMs. This in turn presents a major challenge to regulatory compliance, which in many jurisdictions requires explanations of how AI systems reach important decisions.
And it’s not just direct involvement of LLMs within ML systems that’s a problem. They also indirectly influence ML workflows. For instance, developers consult them on design decisions, and integrate code generated by LLMs. The latter is particularly risky, since it can open up all manner of vulnerabilities, something I already discussed in Emerging security risks of GenAI in ML. And there are also risks around using LLMs to generate synthetic data and within analytical roles.
But I won’t take up more time you could be spending reading the guide. I’d just like to end by saying I’m not a luddite. I’m an AI researcher, and I see the potential. And when generative AI systems reach human-level intelligence, I’ll happily retire to the Swiss Alps3 and let them take over. But in the meantime I’m going to keep giving them constructive feedback.
Oh, and many thanks to Reviewer #2, who made some excellent suggestions about how to improve the guide. To quote from the colourfully-titled paper Dear Reviewer 2: Go F’ Yourself, “Reviewer #2 is not the problem. Reviewer #3 is.”4
1. See my previous guide Avoiding common machine learning pitfalls.
2. See this article for some context.
3. Assuming universal basic income exists and stretches that far.
4. Luckily there were only two reviewers.


