I thought it would be nice to install an LLM on my computer, so that I could satiate my thirst for knowledge (and/or “alternative facts”) without having to deal with commercial services like OpenAI’s ChatGPT and Google’s Gemini. So, I installed the awesome LM Studio, which lets you download open source LLMs from Hugging Face and run them locally on your own computer. Then I pulled up its list of available open source LLMs.
Turns out there’s a lot of them.
Okay, let’s go back a step. What is an open source LLM? Basically, it’s the same sort of thing as the closed LLMs (large language models) behind ChatGPT and friends, but unlike these, you can download them, modify them, and do pretty much what you like with them[1] — all without ponying up your cash. Almost all of them are transformer models of some kind, each trained on stupidly large amounts of text[2] using equally stupid amounts of computational resources.
Given that they don’t involve ponying up cash, but do require a lot of resources to develop, it’s interesting that most of the prominent open source LLMs were developed by companies. What is the motivation for this? In part, it offloads some of the development of the models to the open source community. It can also help to build up confidence in the models, and perhaps enable some kind of up-selling in the future. It may even be to interfere with the business models of competitors. But regardless of the motivation, it’s fortunate for the rest of us, given that few not-for-profit organisations have the resources required to train LLMs.
The modern era[3] of take-home LLMs really started with Meta’s Llama model. It wasn’t actually open source (it didn’t allow commercial use) and it wasn’t actually released to the public (the model parameters were leaked), but it did whet people’s appetite for installing LLMs on their own machine. Since then, other tech giants have got in on the game and released open source LLMs. These include Google’s Gemma, Microsoft’s Phi, Apple’s OpenELM, Alibaba’s Qwen and Nvidia’s Nemotron. Beyond the giants, there are also companies set up explicitly to develop LLMs, and these have also released open models; current notables are Mistral’s Mixtral and Cohere’s Command R+. And on top of these, there have been efforts by not-for-profits, most recently OLMo from the Allen Institute for AI[4].
So, you might fire up LM Studio and search for one of these. If you do, you may be surprised at how many options there are.
In part, this comes down to different model sizes. LLMs are transformers: large neural networks whose behaviour is configured by setting the values of numeric parameters that represent weights and biases. Generally, larger transformers contain more parameters, which lets them learn more from their training data. However, more parameters mean that an LLM takes up more space in memory and has longer inference times. That is, it takes longer to spew out text.
Neither of these is a big problem for commercial LLMs running on huge, power-hungry server farms. They can deal with larger models by spinning up more processors with more memory. But for the likes of you and me, running an LLM on our own computers, they are a problem. And so we’re faced with a trade-off: we want to run the largest, most powerful model, but we also want it to fit in our computer’s memory and generate text at a reasonable speed.
At present, common model sizes amongst open source LLMs vary between 2 and 9 billion parameters. For a typical laptop of recent specification, you’d be looking at a sweet spot around 7-9 billion parameters. If you only have a tablet or a mobile phone, then 2 billion is probably your limit. There are also larger models for those with meatier machines, and truly massive ones for people who own a server farm. However, do bear in mind that there is not a direct relationship between model size and model ability — it also depends on the model architecture, with recent LLMs able to achieve more with less. Apple’s recent OpenELM models, for instance, seem to get a lot done with less than a billion parameters[5].
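To get a feel for the numbers, here’s a back-of-envelope sketch (my own arithmetic, not taken from any particular tool) of how much memory the weights alone occupy at full 32-bit precision:

```python
def model_memory_gb(n_params_billions, bits_per_param=32):
    """Rough size of the model weights alone, in decimal gigabytes.
    Ignores activations, context (KV cache) and runtime overhead."""
    total_bytes = n_params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

# The common 2-9 billion parameter range at full 32-bit precision:
print(model_memory_gb(2))  # 8.0 GB
print(model_memory_gb(9))  # 36.0 GB
```

So even a mid-sized 7B model at full precision wants around 28 GB for its weights alone, which is more RAM than most laptops have.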
And on top of that, there’s also the matter of how individual parameters are represented. In a full-fat model, each of these is represented by a 32-bit floating-point number. But you can reduce memory consumption by half by using 16-bit numbers instead — and there are open source LLMs that do this. Another way to make a model leaner is to use quantisation. That is, rather than using the full range of floating-point numbers, you can use a smaller selection of values to represent parameters, and this allows the values to be stored more efficiently. However, reduced precision and quantisation are both thought to reduce the capacity of the model, often in poorly understood ways. But in practice, you’ll likely have to accept a reduction in performance if you want to get a larger model to run on your machine.
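As an illustration of the idea (a toy, uniform scheme; real formats such as llama.cpp’s block-wise GGUF quantisation are more sophisticated), here’s what squeezing a handful of weights down to 4 bits looks like:

```python
def quantise(weights, bits=4):
    """Uniform quantisation: map each weight to the nearest of
    2**bits evenly spaced levels spanning the weight range."""
    levels = 2 ** bits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (levels - 1)
    # Store each weight as a small integer code (4 bits each here)...
    codes = [round((w - lo) / step) for w in weights]
    # ...and reconstruct approximate weights when running the model.
    return [lo + c * step for c in codes]

weights = [0.013, -0.472, 0.255, 0.891, -0.304]
approx = quantise(weights)
```

The stored 4-bit codes take an eighth of the space of 32-bit floats, but each reconstructed weight is only an approximation of the original, and that accumulated error is where the loss of model quality comes from.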
So, the same model will often come in different sizes (usually shown as “B” in its name), different precisions (“F”), and different degrees of quantisation (“Q”), and this explains a lot of the options you’ll see when selecting a model in LM Studio. But it doesn’t explain everything you’ll see. Other ways in which the same model may differ include context length, whether it’s been fine-tuned on extra data, and whether someone has removed its protections[6].
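If you want to pick those tags out of a model’s name programmatically, a quick sketch (the file name below is made up, but follows the common community convention of a “7B” size tag and a “Q4_K_M”-style quantisation tag):

```python
import re

# A hypothetical file name in the style you'll see in LM Studio's listings.
name = "some-model-7B-instruct.Q4_K_M.gguf"

# Parameter count, in billions ("7B"), and quantisation scheme ("Q4_K_M").
size = re.search(r"(\d+(?:\.\d+)?)[Bb]", name)
quant = re.search(r"Q\d+(?:_[A-Z0-9]+)*", name)
print(size.group(1), quant.group(0))  # → 7 Q4_K_M
```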
Beyond having something you can use offline without paying money, these last two points really get to the heart of why many people are interested in open source LLMs — they can be configured to do whatever you want them to do[7]. You want to fine-tune them to respond to whatever your cat types on your computer’s keyboard? Sure, go ahead! You want them to be narcissistic and sweary? Erm, okay. Or, more realistically, perhaps you run a small company that wants to adapt an LLM to the requirements of your organisation without paying a tech giant for the privilege.
Either way, you’re currently spoilt for choice!
[1] Well, mostly. Meta’s Llama 2, for instance, does place some restrictions on commercial use. Technically, this means it is not open source, at least according to the usual definition.
[2] Some of them say what text was used, and how it was used. Some don’t. This might be a concern if you’re worried about a model spewing out copyrighted material.
[3] The field of transformer-based LLMs actually goes a few years further back, but it wasn’t until more recent models that they became useful to the general public. A number of the earlier models were open source — BLOOM is a good example.
[4] And this one is truly open source, including publishing the data used to train it.
[5] The thinking is that these have been developed for deployment on resource-constrained devices like iPhones, where you really want to minimise the size and energy usage of a model.
[6] Most LLMs are configured to prevent what their originators deem inappropriate use. This may include asking them about illegal activities or asking for offensive output.
[7] Within the limits of the underlying transformer model, of course, including any size/precision/quantisation constraints that are baked into it.