There’s been a lot of fuss about DeepSeek’s new R1 LLM. I don’t want to add to the noise, but the reporting has been somewhat shallow and misleading, so I’d like to make some of the facts a bit clearer.
First of all, there’s nothing particularly novel about the architecture of R1. It’s a GPT-style transformer model that uses a mixture-of-experts approach, routing each token through only a small subset of its parameters, to get more capability out of a given compute budget. So, it’s just like many of its contemporaries in this regard.
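For readers who haven’t come across the idea, here’s a minimal sketch of what a mixture-of-experts feed-forward layer looks like. It’s purely illustrative (the dimensions, expert count and routing details below are placeholder choices, not DeepSeek’s), but it shows the basic trick: a small router picks a couple of “expert” MLPs for each token, so only a fraction of the model’s parameters are active at any one time.

```python
# Illustrative sketch of a mixture-of-experts feed-forward layer.
# Not DeepSeek's implementation: sizes, top-k and routing details are
# placeholders chosen only to show the general idea.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1) # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                  # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out
```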
Instead, the main novelty lies in the training of the LLM. Specifically, there’s a lot more reinforcement learning (RL) and a lot less supervised fine-tuning. RL itself is nothing new [1]; it’s been used to improve the behaviour of LLMs for some time. But usually it’s a relatively small step that’s done after the main pre-training (from masses of text) and then fine-tuning (from smaller amounts of curated text) of the model.
Fine-tuning is usually seen as the main driver behind aligning the behaviour of an LLM to the expectations of its users, but DeepSeek’s paper suggests that RL is the more important component. In particular, they show that a pretty good LLM (which they name R1-Zero) can be trained using RL alone. This, whilst interesting from a technical perspective, is not particularly newsworthy, so has stayed below the radar in mainstream reports.
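To give a flavour of what training with RL alone can look like, here’s a toy sketch along the lines the paper describes: sample several completions for a prompt, score them with simple rule-based rewards (is the answer right? is the reasoning wrapped in the expected format?), and favour the samples that score above the group average. The paper’s actual method (Group Relative Policy Optimisation) is considerably more involved, so treat this as an illustration of the reward signal rather than their training code.

```python
# Toy illustration of rule-based rewards turned into group-relative
# advantages over several sampled completions. A simplified sketch,
# not DeepSeek's training code.
import re
import statistics

def reward(completion: str, reference_answer: str) -> float:
    """Rule-based reward: correctness plus a small bonus for using an
    (assumed) <think>...</think> reasoning format."""
    r = 0.0
    if completion.strip().endswith(reference_answer):
        r += 1.0                                   # accuracy reward (crude check)
    if re.search(r"<think>.*</think>", completion, re.DOTALL):
        r += 0.1                                   # format reward
    return r

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each sampled completion relative to the group mean, so the
    policy is pushed towards above-average samples."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# Example: four sampled completions for one prompt, two of them correct.
samples = [
    "<think>2+2 is 4</think> 4",
    "I think the answer is 5",
    "<think>guessing</think> 3",
    "4",
]
rs = [reward(s, "4") for s in samples]
print(group_relative_advantages(rs))
```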
Much of the reporting has mentioned model size, stating that R1 is a lot smaller and less resource intensive than the LLMs used by ChatGPT etc., yet is equally capable. There’s some truth here, but at more than half a trillion parameters, R1 is still pretty meaty. The idea that smaller models can do impressive things has been gaining ground for the last year or so, so I don’t find it surprising that a smaller LLM can perform as well as a larger one. Frankly, we don’t know how big an LLM needs to be to get a particular capability. Experimenting with LLM architectures and training procedures is expensive, so a lot of the design choices that go into these things are necessarily ad hoc, and it takes time to find the sweet spot. It’s also worth noting that, in the case of Chinese LLM developers, US restrictions on chip exports have pretty much forced them towards exploring smaller model sizes.
Nevertheless, the release of an LLM that is competitive against the market leader despite being significantly smaller (if not small) is a welcome (if unsurprising) development. It’s also good that it’s an open model — though not, as some media outlets have been reporting, open source, since the all-important training data has not been published. However, to me, the more interesting observation in the DeepSeek paper is that some of the capabilities of R1 can be distilled [2] into much smaller LLM architectures [3], producing better performance than when these were trained using more conventional means. This provides a potential route to producing better performing compact LLMs.
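For the curious, here’s a minimal sketch of knowledge distillation (see footnote [2]): a smaller “student” model is trained to match a larger “teacher” model’s output distribution, alongside the usual loss against ground-truth labels. The models, sizes and temperature below are illustrative placeholders rather than anything taken from the DeepSeek paper.

```python
# Minimal sketch of knowledge distillation: the student is trained to
# mimic the teacher's softened output distribution. All models and
# hyperparameters here are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of (a) KL divergence to the teacher's softened distribution
    and (b) ordinary cross-entropy against the ground-truth labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: tiny random "teacher" and "student" classifiers over 10 classes.
teacher = nn.Linear(32, 10)
student = nn.Linear(32, 10)
x = torch.randn(8, 32)
labels = torch.randint(0, 10, (8,))
with torch.no_grad():
    t_logits = teacher(x)          # teacher is frozen; only the student learns
loss = distillation_loss(student(x), t_logits, labels)
loss.backward()
```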
Yet the real driver behind media reports has been the narrative that momentum in LLM innovation is moving from the USA to China. That may or may not be the case, but I’m not sold on the idea that R1 is a specific sign of this, and I think that narrative reflects technological naivety amongst the political classes. DeepSeek used an approach that could have been devised and followed by pretty much any well-funded lab anywhere in the world, largely using techniques that were developed elsewhere. Like most work on LLMs, this builds heavily on work done by both academics and companies around the world. Kudos to DeepSeek for finding a good model, but I’m not convinced it’s a sign of inevitable Western decline.
However, media reports have helped highlight a factor that does need more debate — censorship by LLMs. The emphasis in these reports has been on alignment to the world view of the Chinese government, but there’s really nothing to prevent any LLM developer from baking in their own world view. Given the increasing pervasiveness of LLMs, combined with the increasing politicisation of big tech, this is a significant concern. And in a similar vein, what happens to the data you type into an LLM? Media concerns have again centred around the Chinese government, but there’s little to stop any company from doing what it pleases with this data. Current events in the US, for example, show how quickly any regulatory framework can be unravelled.
[1] For a good intro to RL and its use in training LLMs, see this post by Cameron Wolfe.
[2] Distillation here refers to a technique called knowledge distillation, which involves training a smaller neural network to mimic the behaviour of a larger one — again, nothing new in itself.
[3] Specifically Meta’s Llama and Alibaba’s Qwen.