Today’s post is about the use of LLMs for analysing text, specifically whether encoder-only models or decoder-only models are better at text classification.
But first a quick rant. A couple of months back, I read an article in The Guardian alleging that university courses are devoid of purpose and that students primarily spend their time partying. I found this infuriating. I’m not saying it isn’t true for some courses at some universities, but tarring the entire university sector with the correspondent’s own narrow experiences is grossly unfair. It’s insulting to academics, most of whom work hard to provide a good education, despite decreasing pay and increasingly constrained resources. And it’s really insulting to students, many of whom experience significant hardship in the pursuit of their degrees.
But what’s this got to do with using LLMs to analyse text? Well, one of the great joys of being an academic is working with students during their final year projects, in which they apply what they’ve been learning to some real-world problem. I’m fortunate to have worked with a lot of capable and hard-working students over the years. And over the last couple of years, a few of these students have been looking at whether encoder-only or decoder-only models are better for classifying text.
I’ve touched on encoders and decoders before, but here’s a quick refresher. The original LLM was developed primarily for text translation, and comprised an encoder, which turned a passage of text into a bunch of numbers, and a decoder, which turned this bunch of numbers back into text, but in a different language.
Soon after, people realised they could use the encoder part of an LLM as a way of getting text into classical machine learning models. This led to a bunch of models, mostly in the BERT1 family, that could turn text into a set of numbers that somehow captures the text’s meaning. Given that they don’t use a decoder, these are sometimes referred to as encoder LLMs.
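To make that concrete, here’s a minimal sketch of the text-to-numbers step. The library (Hugging Face transformers), the bert-base-uncased checkpoint and the mean-pooling step are just illustrative choices on my part, not the only way to do it:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load an encoder-only model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

texts = ["I loved this product", "Utterly useless, avoid"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Average the token embeddings (ignoring padding) to get one vector per text.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings.shape)  # e.g. torch.Size([2, 768]) for bert-base
```

Each passage of text comes out as a fixed-length vector, whatever its original length, which is exactly the sort of input a classical machine learning model expects.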
But more familiar to most people these days are decoder-only LLMs like GPT, which have become the backbone of generative AI. They’re decoders in the sense that they’re in the business of generating text, rather than turning things into numbers. Strictly speaking, they still encode text into numbers2, but let’s not muddy the waters.
Encoder models have an obvious mapping to text analysis, and classification in particular. You give them some text, and they give you some numbers. You can then take these numbers, alongside a label for the text, and train a classifier in the normal way. This could be a standalone classifier, or it could be some extra layers bunged on top of an existing BERT model — something known as a classification head.
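Continuing the sketch above, the two routes look something like this. The labels are made up purely for illustration, and scikit-learn’s LogisticRegression stands in for whatever standalone classifier you prefer:

```python
# Route 1: a standalone classifier trained on the encoder's numbers
# (reusing the `embeddings` tensor from the previous sketch).
from sklearn.linear_model import LogisticRegression

X = embeddings.numpy()
y = [1, 0]  # hypothetical labels: 1 = positive, 0 = negative
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))

# Route 2: extra layers on top of the BERT model itself (a classification
# head), which you would then fine-tune on labelled text in the usual way.
from transformers import AutoModelForSequenceClassification

head_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
```

In the second route, the extra layers and (usually) the underlying encoder are trained together on the labelled examples.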
But BERT and its ilk are considered prehistoric these days. The original model was released in 2018. At a mere 300 million or so parameters, BERT models are tiny compared to the billions and trillions of parameters found in modern decoder models. Which has got many people thinking: why not just use decoder models to do text classification? That is, give the model some text and ask it to output a class label as text.
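In code, that decoder route might look something like the sketch below. The model name is just a placeholder for whichever instruction-tuned decoder you have access to, and the prompt wording and keyword matching are my own illustrative choices:

```python
# Prompt a decoder-only model for a class label, then map whatever text
# it decides to generate back onto the label set.
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-135M-Instruct")

review = "The battery died after two days. Very disappointed."
prompt = (
    "Classify the sentiment of the following product review, answering with "
    f"exactly one word, positive or negative.\n\nReview: {review}\nSentiment:"
)

output = generator(prompt, max_new_tokens=5)[0]["generated_text"]
completion = output[len(prompt):].strip().lower()

# The model isn't obliged to stick to the label set, so sanitise its output.
label = "positive" if "positive" in completion else "negative"
print(label)
```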
Seems obvious, but as work by my students has shown, this approach tends to work less well than using a BERT-style transformer. For instance, in work3 recently presented at UKCI, one of my undergraduate students talked about how the encoder-based models he used consistently outperformed decoder models at the task of identifying bot accounts from their social media posts. And similar things were found by students who were looking at product sentiment analysis4.
Further evidence of this can be found in other published work where people have compared BERT and GPT-style transformers. In much of the work I’ve seen5, BERT was the winner. Which is surprising. You’d think the immense size of modern LLMs would give them the edge over their tiny ancestors, but the evidence suggests otherwise.
I don’t really know why this is the case. Perhaps it has something to do with the extra processing required to produce a text label, rather than just numbers. That is, text to numbers to text might involve more work and understanding than just text to numbers. But it’s a good thing, since encoder models are much cheaper to deploy than decoder models, and numeric output is much easier to deal with than text6.
It’s also another lesson for those who think that newer or bigger is necessarily better. Yes, sometimes it is, but there are plenty of examples where it isn’t.
But back to the rant. An undergraduate student being able to publish their work in a scientific conference is just one example of students working hard to get a degree. And if The Guardian hates universities so much, why do they put so much effort into producing their own league table each year? But don’t get me started on league tables – that would be a very long rant!7
BERT stands for Bidirectional Encoder Representations from Transformers. Other members of the family include RoBERTa, DistilBERT and ALBERT.
And actually you can use open decoder models to give you a bunch of numbers too, though this is not necessarily a good idea, due to rather opaque biases.
Given a product review on <insert favourite online retailer here>, say whether its sentiment is positive or negative.
Examples include emotion recognition and text classification in political science.
Notably, you can’t force decoder models to produce just a class label. They’ll generate whatever text they feel like, and it’s then your job to sanitise it.
Okay, a few thoughts. Measuring quality based on whichever metrics are easy to measure is bound to introduce bias. Publishing these metrics means that universities will attempt to overfit them without necessarily improving quality. The fact that universities regularly move tens of places each year shows how noisy this measurement system is. And in my experience, league tables amplify differences between universities that aren’t really there.


