Everything in the land of LLMs is subject to hype. Everything must be an exciting, new and groundbreaking advance that will revolutionise AI. Otherwise, AI influencers won’t be able to write posts with exuberant emoji-encrusted titles whenever some company releases a new model. But back in the real world, most progress is incremental, involving a relatively minor tweak that adds some interesting new dimension to an LLM’s behaviour. And I would argue this is mostly still the case with reasoning models, which are currently the focal point of the world’s hype-sphere.
So what is a reasoning model? Basically, it’s an LLM that spews out a lot of chain-of-thought text before it answers your question. Kind of like how an engineer mumbles to themself as they go about solving a knotty design problem. As I mentioned in Deep Dips #5: Prompt engineering, LLMs have been going in the chain-of-thought direction for some time, and talking about how they’re solving a problem does seem to help them come up with more reasonable answers. And you can easily get them to indulge in this behaviour just by adding things like “think step by step” to a prompt.
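In case you haven’t tried this, here’s roughly what it looks like in practice. This is just a sketch using the OpenAI Python client; the model name is a placeholder, and you’d need an API key set up in your environment.

```python
# A minimal sketch of nudging an LLM into chain-of-thought via the prompt.
# Assumes the OpenAI Python client; the model name is just a placeholder.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "A train leaves at 2pm travelling at 80 km/h. "
                   "How far has it gone by 3:45pm? Think step by step.",
    }],
)
print(response.choices[0].message.content)  # typically includes the working, not just the answer
```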
Reasoning models take this chain-of-thought to a new level, spewing out increasingly dense reams of their inner ruminations. This is sometimes referred to as inference-time scaling, which just goes to show that there’s a confusing term for everything. But basically it means that LLMs are being forced to do more in response to your query, scaling up the number of times they drive their outputs back through themselves, rather than increasing the size of the underlying transformer model [1].
One way of increasing the amount of reasoning behaviour is simply to discourage LLMs from generating stop tokens [2], the things they use to signify when they’ve finished their current response. If you don’t let them generate a stop token, then they’re forced to keep mulling over the output they’ve already generated. This can cause them to, for instance, start fact-checking what they’ve already said, which can improve the correctness of answers; much as humans sometimes [3] like to sanity-check their own working in order to avoid mistakes.
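To make that concrete, here’s a minimal sketch of one way to do it with Hugging Face transformers: a logits processor that simply forbids the end-of-sequence token, so generation only halts at the length limit. GPT-2 is just a stand-in here; a real reasoning model might also splice in prompts like “wait” rather than blocking the token outright.

```python
# A rough sketch of suppressing the stop (end-of-sequence) token so the model
# keeps going. GPT-2 stands in for a real model, purely for illustration.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class BlockEOS(LogitsProcessor):
    """Sets the end-of-sequence logit to -inf so it can never be chosen."""
    def __init__(self, eos_token_id: int):
        self.eos_token_id = eos_token_id

    def __call__(self, input_ids, scores):
        scores[:, self.eos_token_id] = -float("inf")
        return scores

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Is 143 a prime number? Let's check:", return_tensors="pt").input_ids
out = model.generate(
    ids,
    do_sample=True,
    max_new_tokens=200,  # length, not the stop token, now bounds the "thinking"
    logits_processor=LogitsProcessorList([BlockEOS(tok.eos_token_id)]),
)
print(tok.decode(out[0]))
```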
Instead of increasing the length of an LLM’s output, another approach to promoting reasoning behaviour is to use some kind of organised process to gather and coordinate multiple outputs into one coherent stream of reasoning, potentially using more than one LLM in the process. I’m not going to talk about these in any detail, but they often involve hybridising LLMs with more classic AI approaches such as beam search and Monte Carlo tree search.
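Still, to give a flavour of what “gathering and coordinating multiple outputs” means, here’s a toy best-of-n sketch. The LLM call and the scorer are hypothetical stubs; real systems replace the flat loop with something cleverer like beam search or tree search, and the scorer with a verifier or reward model.

```python
# A toy sketch of coordinating multiple sampled outputs: generate several
# candidate chains of thought, score each one, and keep the best.
# generate_cot and score_candidate are hypothetical stand-ins, not a real API.
import random

def generate_cot(problem: str) -> str:
    # Stand-in for sampling one chain of thought from an LLM.
    return f"reasoning about '{problem}' (variant {random.randint(0, 999)})"

def score_candidate(cot: str) -> float:
    # Stand-in for a verifier or reward model judging the candidate.
    return random.random()

def best_of_n(problem: str, n: int = 8) -> str:
    candidates = [generate_cot(problem) for _ in range(n)]
    return max(candidates, key=score_candidate)

print(best_of_n("How many primes are there below 100?"))
```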
However, more typical these days is to use some kind of training to promote reasoning behaviour. Unlike the above approaches, this involves changing the underlying weights of the LLM in a systematic way so that it’s more likely to generate text that contains appropriate reasoning. A popular approach is to use reinforcement learning [4], which involves getting the LLM to generate chain-of-thought answers to lots of problems whose correct answers are already known. In a nutshell, this involves checking whether, for each given problem, the LLM outputs the correct answer [5]. The pathways within the LLM which most contributed to these outputs then get strengthened or decayed depending on whether the answer was correct, causing the model to churn out more of the good stuff in response to future prompts.
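If that sounds abstract, the toy example below captures the gist of the update rule (a REINFORCE-style policy gradient). Obviously the “model” here is just a preference between two canned answers rather than an LLM, but the mechanics are the same: sample an answer, check it against the known correct one, and nudge the weights so that whatever produced a correct answer becomes more likely in future.

```python
# A self-contained toy illustrating the strengthen-or-decay update described
# above. The "model" is just a two-way preference over answers, not an LLM.
import torch

logits = torch.zeros(2, requires_grad=True)    # stand-in for the LLM's weights
optimizer = torch.optim.SGD([logits], lr=0.1)
correct_answer = 0                             # index of the known-good answer

for step in range(100):
    probs = torch.softmax(logits, dim=0)
    answer = torch.multinomial(probs, 1).item()          # the model "generates" an answer
    reward = 1.0 if answer == correct_answer else -1.0   # check it against the known answer
    loss = -reward * torch.log(probs[answer])             # reinforce or decay that pathway
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))  # the mass shifts towards the correct answer
```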
As a result of this kind of thing, these reasoning models work better than vanilla LLMs when they’re applied to certain problems that require reasoning. And this is why people have been getting excited about them, since it’s enabled LLMs to solve problems they couldn’t solve before, particularly problems that involve mathematical proofs, puzzles and writing code. Not coincidentally, these are also the kind of things they were trained to do; whether they can consistently solve problems they weren’t trained to solve remains somewhat of an open question.
But reasoning prowess can come at the expense of more mundane behaviour. For example, reasoning LLMs have been observed to over-think problems that don’t require complex reasoning, tying themselves in knots and chewing through the world’s power supply in the process. Also — as is often the case when an AI model is trained to do something specific — they inevitably lose some of their general abilities. And this can cause them to struggle on problems that vanilla LLMs can easily solve. Which is why we still, at present, need vanilla LLMs in addition to reasoning LLMs. That, plus the exorbitant costs of the latter.
However, the main reason I’m reluctant to file them under “huge leap forward” is that reasoning LLMs haven’t really changed anything about how LLMs function; they’ve just tweaked it. Sure, this has come with some benefits, but they still basically behave in an autoregressive fashion; that is, at each step, they take the text they’ve produced, and shove it back through their inputs to generate the next word.
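To spell that out, here’s what autoregression boils down to, again with GPT-2 as a stand-in for something fancier; the model only ever predicts one next token, and the loop around it feeds everything back in each time.

```python
# A minimal sketch of autoregressive (greedy) decoding with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The answer to the riddle is", return_tensors="pt").input_ids
for _ in range(20):
    logits = model(ids).logits                 # run everything produced so far through the model
    next_id = logits[0, -1].argmax()           # pick the most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)  # shove it back through the input
print(tok.decode(ids[0]))
```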
Don’t get me wrong — the way in which a reasoning LLM uses its own past output to guide its future outputs is a departure from earlier models, and has brought LLMs closer to what many humans consider to resemble thinking. But there remain huge differences. We humans don’t jot down everything we think about and then constantly look back at it in order to choose what to think next. Instead, our brains have a complex dynamical inner state which produces chain-of-thought as a byproduct, rather than a direct medium of thought [6]. And I’d argue that this simply isn’t possible using the feed-forward architecture of a transformer [7], so I suspect that any true leap forward will require us to revisit how we construct the neural networks underpinning LLMs, rather than just tweaking how we train or interact with them.
But even if the advent of reasoning LLMs is a small step in the great scheme of things, there are a lot of small steps happening at once, and collectively these amount to some pretty fast developments in AI. For example, look at the way in which reasoning LLMs are already getting together with agentic approaches. This involves setting LLMs loose within the world, giving them (virtual) appendages with which to interact with and manipulate online infrastructure. The combination of this with multi-stage problem solving (thanks to the reasoning models) means they can get a lot done — and/or wreak havoc, depending on how you look at these things. OpenAI’s deep research is one example, using their reasoning LLM to search for information on the web, which it then attempts to unify into a cohesive report.
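In case it helps, the loop below is a toy sketch of what “agentic” means here: the LLM proposes an action, a tool carries it out, and the observation gets fed back in until the model says it’s done. The LLM call and the web-search tool are hypothetical stubs, not any particular product’s API.

```python
# A toy sketch of an agentic loop: propose an action, execute it with a tool,
# feed the observation back in, and repeat until the model declares it's done.
# call_llm and web_search are hypothetical stubs, not a real API.
def call_llm(transcript: str) -> str:
    # Stand-in for a reasoning LLM deciding what to do next.
    return "FINISH: placeholder report"

def web_search(query: str) -> str:
    # Stand-in for a real tool, e.g. a search API.
    return f"results for '{query}'"

def run_agent(task: str, max_steps: int = 10) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        action = call_llm(transcript)
        if action.startswith("FINISH:"):
            return action.removeprefix("FINISH:").strip()
        transcript += f"Action: {action}\nObservation: {web_search(action)}\n"
    return transcript  # ran out of steps

print(run_agent("Write a short report on reasoning LLMs"))
```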
While reasoning has contributed significantly to the range of problems that LLMs can solve, it hasn’t addressed fundamental underlying problems like hallucination, limited generalisation beyond training data, and a lack of common sense. So let’s not get carried away with ourselves.
1. Though this tends to happen as well.
2. Also commonly known as end-of-sequence tokens. A well-known example of this approach is this paper [https://arxiv.org/abs/2501.19393], where they replaced stop tokens with the word “wait”.
3. And, increasingly, sometimes not.
4. Largely thanks to DeepSeek — see my previous post on this.
5. Sometimes with the correct intermediate results, and sometimes with the right formatting too.
6. Well, as far as we know. How our brains engage in thought is still somewhat of a mystery.
7. Internal dynamics require feedback connections, something we pretty much did away with when we moved from LSTMs to transformers, because they don’t readily scale.