<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Fetch Decode Execute: Deep Dips]]></title><description><![CDATA[A series of posts that introduce concepts around neural networks, deep learning and LLMs in an approachable manner, but without skimping on the facts.]]></description><link>https://fetchdecodeexecute.substack.com/s/deep-dips</link><image><url>https://substackcdn.com/image/fetch/$s_!7wHN!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341c95b5-dfcc-423b-a7a5-872e7d2363c0_980x980.png</url><title>Fetch Decode Execute: Deep Dips</title><link>https://fetchdecodeexecute.substack.com/s/deep-dips</link></image><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 22:28:00 GMT</lastBuildDate><atom:link href="https://fetchdecodeexecute.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Michael Lones]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[fetchdecodeexecute@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[fetchdecodeexecute@substack.com]]></itunes:email><itunes:name><![CDATA[Michael Lones]]></itunes:name></itunes:owner><itunes:author><![CDATA[Michael Lones]]></itunes:author><googleplay:owner><![CDATA[fetchdecodeexecute@substack.com]]></googleplay:owner><googleplay:email><![CDATA[fetchdecodeexecute@substack.com]]></googleplay:email><googleplay:author><![CDATA[Michael Lones]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Deep Dips #6: Mechanistic interpretability]]></title><description><![CDATA[It&#8217;s a while since I last wrote a Deep Dips post, so I&#8217;m going to broach another topic in the area of deep learning and LLMs that is becoming increasingly talked about &#8212; Mechanistic Interpretability, or MI to its friends.]]></description><link>https://fetchdecodeexecute.substack.com/p/deep-dips-6-mechanistic-interpretability</link><guid isPermaLink="false">https://fetchdecodeexecute.substack.com/p/deep-dips-6-mechanistic-interpretability</guid><dc:creator><![CDATA[Michael Lones]]></dc:creator><pubDate>Fri, 06 Feb 2026 12:01:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CAHq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4665a463-0696-4f93-be0b-e99fb327411c_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CAHq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4665a463-0696-4f93-be0b-e99fb327411c_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CAHq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4665a463-0696-4f93-be0b-e99fb327411c_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!CAHq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4665a463-0696-4f93-be0b-e99fb327411c_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CAHq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4665a463-0696-4f93-be0b-e99fb327411c_1536x1024.png" width="1456" height="971" class="sizing-normal" alt=""></picture></div></a></figure></div><p>It&#8217;s a while since I last wrote a <em>Deep Dips</em> post, so I&#8217;m going to broach another topic in the area of deep learning and LLMs that is becoming increasingly talked about &#8212; <em>Mechanistic Interpretability</em>, or <em>MI</em> to its friends. But first here&#8217;s a quick reminder of previous posts in the series:</p><ul><li><p><a href="https://fetchdecodeexecute.substack.com/p/deep-dips-1-multi-layer-perceptrons">Deep Dips #1: Multi-layer perceptrons</a></p></li><li><p><a href="https://fetchdecodeexecute.substack.com/p/deep-dips-2-embeddings-and-latent">Deep Dips #2: Embeddings and latent spaces</a></p></li><li><p><a href="https://fetchdecodeexecute.substack.com/p/deep-dips-3-transformers">Deep Dips #3: Transformers</a></p></li><li><p><a href="https://fetchdecodeexecute.substack.com/p/deep-dips-4-training-neural-networks">Deep Dips #4: Training neural networks</a></p></li><li><p><a href="https://fetchdecodeexecute.substack.com/p/deep-dips-5-prompt-engineering">Deep Dips #5: Prompt Engineering</a></p></li></ul><p>For those new to this, the idea is to introduce concepts in an approachable manner, with each post generally building on the previous ones.</p><h4>Why we need MI</h4><p>LLMs give the impression of being intelligent, but scratch the surface and you&#8217;ll find all manner of hallucinations, shortcut learning, fragile heuristics and prompt sensitivities. All of which makes taking their output at face value a risky proposition. Wouldn&#8217;t it be nice if we could gain some insight into their actual reasoning, to give us confidence that they&#8217;re not deceiving us? Well, we can, to an extent, using MI.</p><p>But wait a minute, don&#8217;t we already have reasoning LLMs which tell us the thinking behind what they do? Yeah, in theory, but the thinking they produce is just tokens produced by the model, which may (and often do) have little to do with how the LLM is actually reasoning<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. In order to understand how an LLM is actually reasoning, you have to go beyond their outputs and see what they&#8217;re doing inside. You know, in the billions, or sometimes trillions, of neuron activations that lead to each output token. Sounds challenging? Well, it is, but the folk who work on MI have come up with some pretty interesting tools for probing this low-level behaviour.</p><h4>The linear representation hypothesis</h4><p>The first thing to know is the <em>linear representation hypothesis</em>. This theorises that human-interpretable concepts captured within the internal state of LLMs are most likely to be encoded as linear vectors: straight-line directions within the model&#8217;s latent space. 
Without going into any detail, this follows from the observation that linear representations are the easiest, and therefore most likely, to be learnt<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>When I talk about an LLM&#8217;s <em>internal state</em>, in general I&#8217;m not referring to the LLM&#8217;s entire internal state, i.e. every single activation of every single neuron in every single layer. That would be a lot of activations. Instead, most MI techniques focus on the state at the outputs of a particular transformer block<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. Sometimes this is the last block before the transformer&#8217;s outputs, since this has already extracted everything that is needed to predict the next token. Sometimes it&#8217;s the middle block, which some people think offers a better trade-off between being too focused on next token prediction and having done enough useful processing of the input<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> . Either way, this state is basically a long list of numbers.</p><p>Anyway, the linear representation hypothesis basically says that this list of numbers is a linear superposition of all the concepts that the LLM has derived from its context window, all firing at different magnitudes. And the good thing about linearity is that it makes it relatively easy to break down this list of numbers and extract information about the magnitude of each concept.</p><h4>Training linear probes</h4><p>But first you need to identify the vectors associated with each concept. And this is where the various MI techniques come in, since they&#8217;re designed to help us find them. The simplest MI approach uses a linear classifier: a very basic kind of classifier whose output is based on a weighted combination of its inputs. It&#8217;s pretty much the same thing as a Perceptron (see <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-1-multi-layer-perceptrons">Deep Dips #1</a>), i.e. a single neuron in an MLP layer.</p><p>The idea is to use this linear classifier &#8212; sometimes referred to as a <em>probe</em> &#8212; to identify how a single concept is encoded within the LLM&#8217;s state. So, let&#8217;s say we want to do this for the concept of <em>cheese</em>. We assemble a bunch of cheese-themed prompts, e.g. &#8220;What is your favourite cheese?&#8221;, &#8220;How many Babybels would it take to fill the Albert Hall?&#8221; and so on. We then collect the corresponding set of internal states that occur when the LLM is fed with these prompts, and these become the positive class in our training data. And we repeat this process with prompts that don&#8217;t mention the theme of cheese in any way, collect the internal states, and use these for our negative class. The linear classifier is then trained to separate these two sets of states. And since we&#8217;re assuming the underlying signal is linear, this should identify the vector in the LLM&#8217;s representation space that corresponds to <em>cheese</em>.</p><p>But  doing this to map out all the concepts learned by an LLM would be onerous. 
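</p><p>In code, training one such probe might look something like the sketch below. It&#8217;s purely illustrative: it assumes you&#8217;ve already collected the internal states for the cheese and non-cheese prompts as arrays (random numbers stand in for them here), and in reality you&#8217;d have to repeat the whole exercise for every concept you care about.</p><pre><code># A toy linear probe for a single concept ("cheese"), assuming we've already
# collected internal states (e.g. residual-stream activations) for a set of
# cheese prompts and a set of unrelated prompts. Shapes and data are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

d_model = 2048                                  # width of the LLM's internal state
cheese_states = np.random.randn(200, d_model)   # stand-ins for real activations
other_states = np.random.randn(200, d_model)

X = np.vstack([cheese_states, other_states])
y = np.array([1] * len(cheese_states) + [0] * len(other_states))

probe = LogisticRegression(max_iter=1000).fit(X, y)

# The learnt weight vector is the candidate "cheese direction" in latent space.
cheese_direction = probe.coef_[0]

# Projecting a new internal state onto this direction gives a cheese score.
new_state = np.random.randn(d_model)
score = new_state @ cheese_direction + probe.intercept_[0]
print("cheese logit:", score)</code></pre><p>In a real experiment, it&#8217;s the probe&#8217;s accuracy on held-out prompts, rather than the training prompts, that tells you whether the concept really is linearly decodable.</p><p>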
We first have to come up with a list of concepts that we think are relevant, generate an appropriate set of prompts for each one, collect the corresponding states for each of these, and then train a fresh linear model each time. That&#8217;s a lot of work, and a lot rests on our ability to come up with a sensible list of concepts in the first place, especially given that our idea of what represents a concept might not neatly align to the LLM&#8217;s. I know it&#8217;s hard to imagine, but the LLM might not even have a concept for cheese. It may instead have a bunch of concepts for <em>yellow things</em>, <em>stinky things</em> and <em>things that are often round</em>, which fire in concert when a prompt mentions cheese, but not one specifically for cheese.</p><h4>Sparse autoencoders</h4><p>But MI comes to our rescue again, using something called sparse autoencoders (SAEs). I talked about autoencoders in <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-2-embeddings-and-latent">Deep Dips #2: Embeddings and latent spaces</a>. An autoencoder is a kind of neural network that is trained to reconstruct its input at its output, via a hidden bottleneck layer. Its outputs are equal in number to its inputs, and its loss function measures how well the outputs match the inputs. The purpose of the bottleneck layer is to force it to project the inputs into a lower-dimensional latent space, i.e. to compress them.</p><p>But in the case of an SAE, the hidden layer is no longer a bottleneck. Instead, it&#8217;s substantially bigger than the input and output layers<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. And the loss function doesn&#8217;t just maximise reconstruction accuracy; it also minimises how many neurons in the hidden layer fire for each set of inputs. In effect, it&#8217;s trying to train a whole bunch of linear models (one for each hidden neuron) at the same time, and do so in an unsupervised manner. To train it, you just need to churn the internal states from a lot of prompts containing a lot of concepts through it, and over time it will learn to separate and characterise the vectors corresponding to each concept. Pretty neat.</p><h4>Cataloguing semantic concepts</h4><p>Using this approach, various people have discovered how commonly-used open weight LLMs, such as Meta&#8217;s Llama and Google&#8217;s Gemma, encode concepts within their internal states. A lot of these have been recorded on the website <a href="https://www.neuronpedia.org">Neuronpedia</a>. And it&#8217;s fair to say that the concepts are diverse. For example, they include concrete concepts like <em>coffee</em>, but also more abstract things like <em>positivity</em> or <em>uncertainty</em>, plus much more obscure things that are hard to put a label on<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. The approach has also been applied to Anthropic&#8217;s commercial model, Claude Sonnet, and there&#8217;s a nice write-up of what they discovered <a href="https://transformer-circuits.pub/2024/scaling-monosemanticity/">on their website</a>.</p><p>So, using an SAE, you can learn how a whole bunch of concepts are encoded within the state of an LLM. 
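</p><p>If it helps to see what an SAE actually is, here&#8217;s a stripped-down sketch in PyTorch. It&#8217;s purely illustrative; the sizes, the sparsity weight and the training data (random numbers standing in for captured LLM states) are all made up, but it shows the two competing terms in the loss:</p><pre><code># A minimal sparse autoencoder, trained on stand-in "internal states".
# Real work would use activations captured from an actual LLM.
import torch
import torch.nn as nn

d_model, d_hidden = 512, 4096     # hidden layer much wider than the input
encoder = nn.Linear(d_model, d_hidden)
decoder = nn.Linear(d_hidden, d_model)
optimiser = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

states = torch.randn(2_000, d_model)   # stand-in for collected LLM states

for epoch in range(10):
    features = torch.relu(encoder(states))           # how strongly each "concept" fires
    reconstruction = decoder(features)
    mse = ((reconstruction - states) ** 2).mean()    # reconstruct the input...
    sparsity = features.abs().mean()                 # ...using as few features as possible
    loss = mse + 0.01 * sparsity
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

# Each column of the decoder's weight matrix (one per hidden feature) is a
# candidate concept direction, and relu(encoder(state)) tells you how strongly
# each feature fires for a given internal state.</code></pre><p>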
Given a new prompt, you can then measure how much each of these concepts is being triggered, and this gives some insight into how the LLM is actually interpreting the prompt.</p><h4>Steering LLM behaviour</h4><p>An example of a practical use of semantic concepts is detecting jailbreaks. A jailbreak occurs when a user manages to convince an LLM to do something it&#8217;s explicitly trained not to do, e.g. use bad language or provide illegal information. By collecting a bunch of prompts containing jailbreaks, and a bunch of prompts that don&#8217;t contain jailbreaks, it&#8217;s possible to identify the concept (or concepts) that are triggered when a user attempts to jailbreak an LLM. These concepts can then be monitored during use.</p><p>But more importantly, if a concept is triggered, it can also be <em>steered</em>. This involves dampening the neuron activations that underlie the concept, i.e. replacing the current outputs with lower values, which then continue their journey through the model. In various published studies<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>, this kind of thing has proved quite effective at preventing jailbreak attacks, and could be similarly applied to any other behaviour that you want to actively prevent from happening.</p><h4>Limitations of MI</h4><p>However, the optimism generated by these kinds of studies should be tempered with an understanding of the limitations of current MI approaches. The semantic concepts discovered by MI can sometimes be fragile, they don&#8217;t always generalise beyond collections of specific prompts, and they sometimes don&#8217;t align with human-interpretable concepts. All of this can limit their practical utility, especially when it comes to steering LLM behaviour. As an illustrative example, the authors of <a href="https://arxiv.org/abs/2601.03047">this paper</a> described a <em>coffee</em> feature present in Meta&#8217;s Llama model. Although it does often trigger upon the mention of coffee across various languages, it doesn&#8217;t always trigger, and it also sometimes triggers for unrelated concepts, such as the word &#8220;coffin&#8221;.</p><p>A more fundamental limitation of MI is that it requires access to the internal activations of an LLM. This is not a big problem if you&#8217;re hosting an open weight LLM on your own machine, but it&#8217;s an insurmountable obstacle if you&#8217;re using a remotely-hosted commercial model. There&#8217;s nothing stopping commercial developers applying MI to their models and then sharing the information with us, but in practice they&#8217;re not likely to do this directly with users. More likely they&#8217;ll use MI to better understand their own models and improve their behaviour.</p><h4>Too long; didn&#8217;t read</h4><p>The only way of getting reliable insight into an LLM&#8217;s behaviour is to probe its internal state. MI is a group of techniques for doing this. Central to these is the linear representation hypothesis, which says that semantic concepts are encoded as linear vectors. These can be extracted individually by training linear classifiers, or they can be extracted en masse using sparse autoencoders. Their activation can then be monitored, or manipulated in order to steer the LLM&#8217;s behaviour. Semantic concepts extracted through MI methods are not always robust, generalisable or meaningful to humans. 
Nevertheless, they still provide valuable insight into the workings of the LLM black box, and are an important tool in trying to make LLMs more dependable. Which matters, because LLMs are finding their way into all manner of contexts where dependability is essential.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Also, most commercial models hide the thinking tokens, presumably to stop their competitors from using them to train models.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>See <a href="http://arxiv.org/abs/2507.08017">this paper</a> for an approachable introduction.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>For a refresher on the architecture of a transformer, see <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-3-transformers">Deep Dips #3: Transformers</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Anthropic focused on the middle block when applying MI to Sonnet.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Exactly how much bigger is another design decision, and requires some appreciation beforehand of how many concepts you need/want to learn. Anthropic used up to 34M.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Which may make you wonder: how do these concepts get labelled? One approach is to get a human to look at what the prompts that trigger a particular concept have in common. 
But this is tedious work, so it&#8217;s increasingly being offloaded to LLMs to do.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p><a href="https://www.usenix.org/system/files/conference/usenixsecurity25/sec25cycle1-prepub-341-zhang-shenyi.pdf">Here</a>&#8217;s a recent example, in which the success of jailbreak attacks was reduced from 61% to 2%.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Dips #5: Prompt engineering]]></title><description><![CDATA[Previously in this series, I&#8217;ve discussed neural networks, transformers, and how you train them, covering pretty much all you need to know to build and train an LLM.]]></description><link>https://fetchdecodeexecute.substack.com/p/deep-dips-5-prompt-engineering</link><guid isPermaLink="false">https://fetchdecodeexecute.substack.com/p/deep-dips-5-prompt-engineering</guid><dc:creator><![CDATA[Michael Lones]]></dc:creator><pubDate>Wed, 11 Dec 2024 12:20:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7wHN!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341c95b5-dfcc-423b-a7a5-872e7d2363c0_980x980.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Previously in this series, I&#8217;ve discussed <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-1-multi-layer-perceptrons">neural networks</a>, <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-3-transformers">transformers</a>, and <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-4-training-neural-networks">how you train them</a>, covering pretty much all you need to know to build and train an LLM. But most people neither want nor need to build and train their own LLM, since pre-trained models are readily available. These can be downloaded, or accessed through services such as ChatGPT, and are ready to be applied to downstream tasks without the need for further training<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><p>However, it&#8217;s not straightforward to apply an LLM to a particular task. Unlike classical machine learning approaches, there is no simple mapping between the information you provide to an LLM and the response it produces. Instead, the manner in which you deliver input to the model &#8212; that is, how you <em>prompt</em> it &#8212; can have a profound effect on its output, and consequently its ability to solve the task you&#8217;re applying it to. LLMs care about which words you use, they care about the order in which you say things, they care how many prompts you use to say things, they care about what you said in previous prompts, they even care about punctuation. Minor alterations to all these things can lead to major alterations in their responses, a phenomenon known as <em>prompt sensitivity</em>.</p><p>The process of finding how to deliver the input in the right format to get the task done well has become known as <em>prompt engineering</em>. But don&#8217;t let the term <em>engineering</em> fool you. This is not engineering in the sense of a well-developed and well-understood process that uses mature tools. It&#8217;s more akin to tinkering, and this reflects the current Wild West nature of LLMs. 
Unlike the topics I&#8217;ve discussed earlier in this series, pretty much everything in LLMs is new, and we&#8217;re only just starting to understand how to use them productively.</p><p>In this post, I&#8217;m going to talk through some of the main techniques for doing prompt engineering. As usual, I&#8217;m going to focus on the big picture rather than try to be exhaustive.</p><h4>Keep asking the same thing</h4><p>I&#8217;m going to start with something obvious, but often overlooked. Ask your LLM the same thing multiple times. If you read <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-3-transformers">my post about transformers</a>, you&#8217;ll know that the output layer is probabilistic &#8212; essentially, for each word it knows, it will return the probability that this should be the next word in the output. These probabilities are then turned into an actual word through a sampling process which favours words with high probability, but occasionally picks words with lower probabilities. By tweaking a setting known as the <em>temperature</em>, you can change how sharply peaked this distribution is.</p><p>Which means that if you give the same LLM the same prompt in two different sessions, you&#8217;ll likely get a different answer. If the temperature is set low, these answers will be similar. If it&#8217;s set high, they may be quite different. In most LLMs, the temperature is set to an intermediate level by default, so there will be a fair amount of diversity in their responses to the same prompt. This means that you can easily collect a bunch of different answers to the same prompt, and doing so can lead to more robust insights. For example, when asking an LLM a yes/no question, you can use the majority response over several repeats &#8212; something akin to ensemble or wisdom-of-crowds models in machine learning.</p><p>By way of example, here are the responses to the same prompt in five fresh sessions using a pocket-sized version of Gemma 2<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, Google&#8217;s <a href="https://fetchdecodeexecute.substack.com/p/install-your-own-llm-so-many-choices">open LLM</a>:</p><pre><code><strong>Prompt</strong>: Tell me precisely how many ones there are in the string 100101111000.

<strong>Response 1</strong>: There are **four** ones in the string "100101111000".
<strong>Response 2</strong>: There are **8** ones in this string.
<strong>Response 3</strong>: There are **5** ones.
<strong>Response 4</strong>: There are **7** ones in this string. 
<strong>Response 5</strong>: There are **7** ones in this string.</code></pre><p>So, it got it wrong every time, but if you sum up the responses and divide by 5, you get 6.2, which is not far from the truth.</p><p>Variants of this technique include ensembling the responses from slightly different prompts, e.g. with different sentence ordering or word usage, and ensembling the responses from multiple LLMs, i.e. giving the same prompt to different LLMs. Both have been found to significantly increase robustness, where applicable.</p><h4>Big up your LLM</h4><p>One of the surprising findings of prompt engineering is that you can significantly improve the quality of a response by telling the LLM that it knows what it&#8217;s doing. In retrospect, it&#8217;s not entirely surprising. After all, LLMs were not explicitly trained to give good answers to questions; they were trained to learn the patterns underlying a large amount of training data, some of which is likely to be wrong or of dubious quality. So, telling your LLM that it&#8217;s competent at some task may help point it towards the more reliable parts of its knowledge base. This is akin to telling a dog &#8220;Who&#8217;s a good boy? You&#8217;re a good boy!&#8221; prior to asking it to complete an assault course. Though in the case of an LLM, it&#8217;s typically done by saying &#8220;You are an expert in &lt;task domain&gt;&#8221; before giving it the task you want it to fulfil. However, as usual, the exact wording does matter, and small changes to these encouraging statements might lead to significant differences in the outcome. But such is the way of LLMs.</p><p>Here&#8217;s an example of this simple tweak in action:</p><pre><code><strong>Prompt</strong>: You are an expert in binary numbers. Tell me precisely how many ones there are in the string 100101111000.

<strong>Response</strong>: There are **6** ones in this string.</code></pre><p>Correct, but in 5 repeats, it only got it right this one time,  with an average of 7.4. So, not great, but still better than when I tried &#8220;Who&#8217;s a good boy? You&#8217;re a good boy!&#8221;, which resulted in no correct answers and an average of 9.4.</p><h4>Putting things in context</h4><p>This idea of pointing an LLM towards appropriate parts of its knowledge base is also leveraged by a group of approaches called <em>in-context learning</em>. These work by providing examples within the prompt, and this has been shown to improve the accuracy of an LLM&#8217;s responses. Though again, things like where you place the example in the prompt, and the words you use to introduce the example can have a significant effect.</p><p>A simple example:</p><pre><code><strong>Prompt</strong>:<strong> </strong>Tell me precisely how many ones there are in the string 000001111001.
Answer: 5

Tell me precisely how many ones there are in the string 100101111000.
Answer: 

<strong>Response:</strong> 6</code></pre><p>When I repeated this 5 times, it got it right 4 times, so this seems to be a more robust approach for this particular task.</p><p>There&#8217;s been a fair amount of research on how many examples you should give, and this has led to a distinction between <em>zero shot</em>, <em>one shot </em>and<em> few shot learning</em>. As you might guess, these involve giving the LLM no examples, one example, or more than one. Generally speaking, there&#8217;s a sweet spot somewhere, but this is likely to be task-specific, and you&#8217;ll have to work to find where it is. For some tasks, it makes sense to give positive and negative examples, but there&#8217;s also work to do in finding an appropriate balance between these, since both the balance and the order in which you present them can have a significant impact on how well this works.</p><p>There&#8217;s also been research on which examples you should include, and this has led to an approach called <em>retrieval augmented generation</em>, or RAG. This involves an external database of examples that are queried to find those which are most similar to text in the prompt. The matches are then added to the prompt as examples, and this seems to work better than using generic examples. RAG is also used more generally to align LLMs to particular domains. A typical approach is to gather a bunch of domain-specific documents (manuals, technical papers, etc.), split these into fragments of text to store in the RAG database, and then use matching algorithms to identify appropriate fragments of text that can be embedded in prompts.</p><h4>Explain yourself!</h4><p>As an academic, when I&#8217;m writing exam questions, I won&#8217;t just ask students what the answer is; I&#8217;ll also ask them to explain why it&#8217;s the answer. This gives me more confidence in their reasoning abilities, and also helps to rule out guessing. And, it turns out, the same is true of LLMs.  If you ask then to reason about their answer, then they&#8217;re more likely to give you a correct answer. The simplest way to do this is to slap &#8220;<a href="https://arxiv.org/abs/2205.11916">Let&#8217;s think step by step</a>&#8221; on the end of your prompt &#8212; this encourages the LLM to not only give an answer, but also explain how it got there.</p><p>By way of example:</p><pre><code><strong>Prompt</strong>: Tell me precisely how many ones there are in the string 100101111000. Let&#8217;s think step by step.

<strong>Response</strong>: Here's how to count the ones:

1. **Initialization:** We start with a counter initialized to zero. This will keep track of the number of ones we find.
2. **Scanning the string:**  We carefully go through each digit in the string 100101111000.
3. **Identifying Ones:** For each digit, we check if it is a '1'. If it is, we increment our counter.
4. **Counting:** Let's apply this to your example:
    *  100101111000 
    * We encounter the digit '1' at positions 1, 2, 3, 6, 7, and 8. 
    * So, we increment our counter a total of 6 times.
5. **Final result:**  Our counter should now reflect the number of ones in the string.

**Therefore, there are 6 ones in the string 100101111000.**</code></pre><p>However, this simple approach doesn&#8217;t always work. In this particular task, it only worked this once &#8212; the other 4 times it just told me how to solve the problem, not what the answer was.</p><p>To address this, people have explored a range of more involved ways of encouraging LLMs to think about what they&#8217;re doing. Currently the most popular of these is <em><a href="https://arxiv.org/abs/2201.11903">chain of thought</a></em> (or CoT) reasoning, which builds on the idea of in-context learning described above by presenting the LLM with an example of how to think. This typically takes the form of a question, a reasoning process, and the correct output, all embedded in a single prompt. The LLM is then asked a related question.</p><p>Here&#8217;s an example:</p><pre><code><strong>Prompt</strong>: Tell me precisely how many ones there are in the string 100101010010.

Answer: To calculate the number of ones in a string, you have to read it left to right, character by character, and sum up the characters that are 1. In this case, the first character is a 1, so the sum is initially 1. The second character is a 0, so the sum stays as 1. The third character is a 0, so the sum stays as 1. The fourth character is a 1, so the sum increases to 2. And so on. The answer is 5.

Tell me precisely how many ones there are in the string 100101111000.

Answer: 

<strong>Response</strong>: Here's how to count the ones in the string "100101111000":

1. **Read Left to Right:**  Just like before, we analyze each character individually.
2. **Sum Up Ones:** For each character, if it's a '1', add 1 to our running total.

Let's break down the process:

* First Character: 1  - Add 1 to our sum.
* Second Character: 0 - No change in the sum.
* Third Character: 1 - Add 1 to the sum.
* Fourth Character: 1 - Add 1 to the sum.
* Fifth Character: 1 - Add 1 to the sum.
* Sixth Character: 1 - Add 1 to the sum.
* Seventh Character: 1 - Add 1 to the sum.
* Eighth Character: 0 - No change in the sum.
* Ninth Character: 0 - No change in the sum.
* Tenth Character: 0 - No change in the sum.

**Answer:**  The total number of ones is **8**.</code></pre><p>As you can see, this didn&#8217;t work out. All 5 attempts gave the wrong answer and an average of 8.4. On the bright side, it did seem to pick up on the general idea described in the reasoning process &#8212; it just wasn&#8217;t very good at following it!</p><p>But beyond not always working, the limitation of CoT is that the prompt writer needs to come up with a good example of the thought process behind answering a similar question, and also needs to know the answer to this similar question &#8212; which may defy the point of asking an LLM in the first place. A range of techniques have been dreamt up to at least partially address these problems. One approach is to automatically mine for similar questions. <a href="https://arxiv.org/abs/2210.03493">Auto-CoT</a> for example, does this by grabbing a bunch of similar questions from some dataset or knowledge base and then using the &#8220;Let&#8217;s think step by step&#8221; hack to generate answers and reasoning chains. <a href="https://arxiv.org/abs/2309.11392">Other approaches</a> blend CoT with RAG in order to mine external knowledge sources for relevant question-answer pairs.</p><h4>Employ a fact-checker</h4><p>As the last example illustrated, LLMs don&#8217;t always get things right, even with gratuitous use of prompt engineering. One way to address this is to use an LLM to fact check its own output:</p><pre><code><strong>Prompt</strong>: Consider the following prompt:

<em>&lt;Prompt from the last example&gt;</em>

Now consider the following response from an LLM:

<em>&lt;Response from the last example</em>&gt;

Was the LLM&#8217;s conclusion correct? 

<strong>Response</strong>: No, the LLM's conclusion was incorrect. Here's why:

The LLM correctly performs the steps described but fails to accurately count the ones.  

There are 7 ones in the string "100101111000".

The key takeaway is that the LLM needs a better understanding of what it means to "count" or "sum." It should recognize the pattern and not just mechanically apply the steps for each character. </code></pre><p>That&#8217;s some tough self-criticism. But as you can see, although it correctly identified some of what was wrong, it still couldn&#8217;t work out the right answer.</p><p>So, let&#8217;s give another LLM the same query &#8212; in this case GPT-4o:</p><pre><code><strong>Response</strong>: Let's verify the response step by step:

Read Left to Right: We analyze each character individually.
Sum Up Ones: For each character, if it's a '1', add 1 to our running total.
Here's the breakdown:

First Character: 1 - Add 1 to the sum (Total: 1)
Second Character: 0 - No change in the sum (Total: 1)
Third Character: 0 - No change in the sum (Total: 1)
Fourth Character: 1 - Add 1 to the sum (Total: 2)
Fifth Character: 0 - No change in the sum (Total: 2)
Sixth Character: 1 - Add 1 to the sum (Total: 3)
Seventh Character: 1 - Add 1 to the sum (Total: 4)
Eighth Character: 1 - Add 1 to the sum (Total: 5)
Ninth Character: 1 - Add 1 to the sum (Total: 6)
Tenth Character: 0 - No change in the sum (Total: 6)
Eleventh Character: 0 - No change in the sum (Total: 6)
Twelfth Character: 0 - No change in the sum (Total: 6)
The correct total number of ones in the string "100101111000" is 6, not 8. The response provided by the LLM contains an error in counting the ones.</code></pre><p>And this shows the benefit of using more than one LLM. Or perhaps just the benefit of using a much bigger LLM<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><h4>Don&#8217;t do it yourself</h4><p>So far, I&#8217;ve covered a range of different prompt optimisation techniques. These can be used individually, or combined, and each of them has various degrees of freedom, i.e. different ways you can tweak them, each of which may have a significant effect on the outcome. This quickly adds up to a lot of choices. You could spend a lot of time working your way through these in order to find the one that works best. Or you could attempt to automate this process, using your computer to do the hard work for you &#8212; or at least some of the hard work.</p><p>Various methods have been developed for this. I&#8217;ll refer to them as a<em>utomatic prompt optimisation</em>, but in practice they&#8217;ve been branded in a number of different ways. The general idea is that you provide an initial prompt, and some optimisation process then kicks in and explores alternative prompts, returning the best one. The way in which this exploration is done varies markedly between approaches. Some require detailed knowledge of the LLM&#8217;s underlying transformer model and/or its training data, and use techniques like gradient descent and reinforcement learning to tweak the initial prompt<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> in directions that minimise some error function. That is, they leverage the same kind of techniques that are used to train transformers in the first place.</p><p>More recent approaches tend to have a lot less dependence on the underlying model and instead use the LLM itself to assist with prompt refinement. <a href="https://github.com/Eladlev/AutoPrompt?tab=readme-ov-file">AutoPrompt</a>, for example, requires no knowledge of the underlying transformer model, and instead uses the host LLM to generate new prompts and evaluate these on synthetic data which it also generates. However, it still requires some involvement from the user.</p><h4>The best is yet to come&#8230;?</h4><p>Prompt engineering is still in its infancy, and this is reflected in the hit-and-miss results from the example prompts shown above. In the future, it seems likely that LLMs will move away from manual design of prompts to more automated means. At the same time, they will probably become less and less sensitive to how prompts are formulated. We&#8217;ve already reached the stage where LLMs are starting to become integrated within operating systems, and this will put more pressure on LLM developers to produce a straight-forward and consistent user experience. For example, you can already see this happening <a href="https://github.com/Explosion-Scratch/apple-intelligence-prompts">behind the scenes in Apple Intelligence</a>, where the developers use careful wording to massage user queries before they reach an LLM. 
Another important direction, although a difficult nut to crack, is a better understanding of how LLMs actually work, which has the potential to give guiding insights into how the wording of prompts influences behaviour.</p><h4>Too long; didn&#8217;t read</h4><p>LLMs are very sensitive to the way in which you prompt them. Prompt engineering is the process of improving a prompt so that an LLM carries out its task more effectively. It&#8217;s an art rather than a science. Popular approaches include chain-of-thought reasoning and retrieval augmented generation. Both of these rest on the idea of in-context learning, which involves embedding carefully chosen information within the prompt. The LLM then uses this to better align its response with the required task. However, a challenge of these approaches is how to source appropriate information. Simpler approaches include asking an LLM the same thing multiple times, using an encouraging statement that tells the LLM it is an expert in the task, and asking it to reason about or fact-check its response. These can be surprisingly effective. Automated prompt optimisation is an emerging approach that aims to remove the manual work involved in optimising prompts, offloading this to an optimisation algorithm or an LLM. This is an active area of research, so watch this space.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Although it is possible to train them further, through various fine-tuning methods. 
Check out Sebastian Raschka&#8217;s <a href="https://magazine.sebastianraschka.com/">Ahead of AI</a> substack for extensive info on this.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>With 2 billion parameters and 8-bit quantization.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>OpenAI haven&#8217;t told us how many parameters there are in GPT-4o, but it&#8217;s widely believed to be over 1 trillion.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>These approaches tend to work directly in the embedding space, so they basically turn the initial prompt into a bunch of numbers which is no longer human-readable.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Dips #4: Training neural networks]]></title><description><![CDATA[In previous posts in this series, I&#8217;ve covered the basics of neural network models, talked about embedding models, and explained how transformers work. So, if you&#8217;ve read these, then you should already have an idea of what neural networks look like and how they&#8217;re used.]]></description><link>https://fetchdecodeexecute.substack.com/p/deep-dips-4-training-neural-networks</link><guid isPermaLink="false">https://fetchdecodeexecute.substack.com/p/deep-dips-4-training-neural-networks</guid><dc:creator><![CDATA[Michael Lones]]></dc:creator><pubDate>Wed, 25 Sep 2024 13:25:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7wHN!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341c95b5-dfcc-423b-a7a5-872e7d2363c0_980x980.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In previous posts in this series, I&#8217;ve covered the <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-1-multi-layer-perceptrons">basics of neural network models</a>, <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-2-embeddings-and-latent">talked about embedding models</a>, and <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-3-transformers">explained how transformers work</a>. So, if you&#8217;ve read these, then you should already have an idea of what neural networks look like and how they&#8217;re used.</p><p>Next I want to cover the training and configuration of neural networks. I&#8217;m going to start off with <em>backpropagation</em> and <em>gradient descent</em>, which are the bread-and-butter of most neural network training. Then I&#8217;ll say a bit about <em>reinforcement learning</em>, <em>gradient-free methods</em>, and <em>neural architecture search</em>. Finally, a few words on approaches which are biologically more plausible.</p><h4>Measuring performance</h4><p>When a neural network is first created, its weights and biases (also known as its <em>parameters</em>) are set to random values. So at this point it&#8217;s unlikely to do anything useful. 
Training a neural network involves working out more appropriate values for these parameters, so that the neural network does do something useful.</p><p>You can tell how useful a neural network is by doing a <em>forward pass</em>. That is, you provide input, you propagate this through the network, and you get an output. Assuming you know what the output should be for a particular input, you can then work out how close it got to the right answer. Doing this for a whole bunch of input-output examples provides a measure of how well the neural network is doing. The accumulated difference between the correct outputs and the outputs the neural network actually gave is known as the <em>loss</em>, and the specific method used to determine this is known as the <em>loss function</em>.</p><p>In most cases, training a neural network amounts to attempting to minimise this loss, getting it as close to zero as possible. To use an unnecessarily complicated and opaque term &#8212; which, alas, people often do &#8212; this is known as <em>empirical risk minimisation</em>. There are various ways of doing this, but the most popular is something commonly referred to as <em>backpropagation</em>, or just backprop. However, it really consists of two different things: backpropagation and gradient descent, so I&#8217;ll cover these separately.</p><h4>Backpropagation</h4><p>Backpropagation is a method to determine how you need to tweak each parameter of a neural network in order to reduce the loss. It achieves this through the magic of <em>differentiation</em>, which is a mathematical procedure that takes a function and tells you how its output will change if you change one of its inputs.</p><p>Basically, backpropagation moves backwards through a neural network (hence the name), starting at the outputs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> and moving towards the inputs. As it does this, it uses differentiation<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> to calculate values called <em>gradients</em>, and attaches one of these to each component of the neural network. This procedure rests on something called the <em>chain rule</em>, which allows you to work out the gradient for a particular component based on gradients you&#8217;ve already calculated further downstream &#8212; hence why backpropagation moves from the outputs to the inputs.</p><p>Gradients are just numbers that capture how much the loss of the neural network will change if you tweak the value at that point. And we&#8217;re really just interested in the gradients that are associated with weights and biases. So, if a weight has a high gradient associated with it, then we know a small change to that weight would have a large effect on the loss. Conversely, if a weight has a low gradient, then you&#8217;d need a larger tweak in order to have the same effect. Gradients can be positive or negative. For weights with positive gradients, increasing the weight would increase the loss &#8212; so this means we want to reduce these weights in order to improve the behaviour of the neural network. Likewise, we want to increase the values of weights for those with negative gradients. And similarly for biases.</p><h4>Gradient descent</h4><p>And this is where gradient descent comes in. 
Having completed backpropagation, you&#8217;ll now have a gradient value for each parameter in the neural network, which tells you how you need to tweak each weight and bias in order to reduce the overall loss. Gradient descent is just an iterative process for doing this tweaking.</p><p>An iteration in gradient descent is known as an <em>epoch</em> (strictly speaking, an epoch is one complete pass through the training data, which is what each iteration involves in the basic form described here). During each epoch, you carry out a forward pass to calculate the loss, you then do backpropagation (also referred to as the <em>backward pass</em> in this context) to calculate the gradients, and you then tweak every parameter. The size of the tweak applied to each parameter is determined by its gradient multiplied by a (typically constant) small positive value called the <em>learning rate</em>, and the tweak is applied in the opposite direction to the gradient, so that the loss goes down. Over a series of epochs, gradient descent will move the parameters from their initial random values to values which minimise the loss &#8212; eventually leading to a trained neural network.</p><p>Or at least that&#8217;s the aim. In practice, you need to find an appropriate learning rate for the problem you&#8217;re trying to solve. Too high, and you&#8217;ll find it overshoots the optimal parameter values. Too low, and it&#8217;ll take ages to converge. And this is assuming that it does converge, since gradient descent is not guaranteed to find optimal values.</p><h4>Tweaking gradient descent</h4><p>There are a bunch of methods which have been developed to improve the behaviour of gradient descent in one way or another.</p><p>Some of these are concerned with how much training data you pass through the neural network at each epoch. In the original formulation, forward passes were made for every sample in the training data, and then gradient values were calculated across all of these before any tweaking was done. But this was very time consuming if you had a large training set. So one simple innovation was to use less data during each epoch, and this generally leads to faster learning. If you use only one item of data each time, it&#8217;s known as <em>stochastic gradient descent</em>. If you use more than one, but not all the data, then it&#8217;s called <em>mini-batching</em>. The amount you use is known as the <em>batch size</em>.</p><p>Another issue with vanilla gradient descent is that the learning rate is a constant. But as you approach the optimum, it can actually be useful to perform smaller tweaks. Conversely, at the beginning of training, larger tweaks can be better to speed up convergence. This is addressed through the use of <em>adaptive learning rates</em>. The simplest approach is to gradually decrease the learning rate using a <em>learning schedule</em>.</p><p>But a more successful approach is used by <em>Adam</em>, which is short for <em>adaptive moment estimation</em>, and is currently the most popular form of gradient descent used for neural networks. Its main point of novelty is that it adds a fraction of a gradient&#8217;s value from the previous epoch onto its value in the current epoch. This is known as <em>momentum</em>, and helps to make gradient values more stable across epochs, particularly when using small batch sizes. However, Adam also uses this information to individually adapt the learning rate for each parameter.</p><p>Another common extension is the catchily-named <em>L2 regularisation</em>. Like most things in neural networks, this is not as complicated as it sounds, and basically involves adding a penalty term to the loss function that grows with the size of the weights and biases. 
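</p><p>To make the mechanics concrete, here&#8217;s a toy Python example of the basic update rule, with an L2-style penalty bolted onto the loss. The numbers (learning rate, penalty strength, data) are made up purely for illustration:</p><pre><code># Gradient descent on a single linear neuron y = w*x + b, with an L2 penalty.
# Everything here is illustrative rather than realistic.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=100)   # data we want to fit

w, b = 0.0, 0.0          # parameters start at arbitrary values
lr = 0.1                 # learning rate
lam = 0.01               # strength of the L2 penalty

for epoch in range(200):
    pred = w * x + b
    loss = np.mean((pred - y) ** 2) + lam * (w ** 2 + b ** 2)   # loss + penalty
    grad_w = np.mean(2 * (pred - y) * x) + 2 * lam * w           # d(loss)/dw
    grad_b = np.mean(2 * (pred - y)) + 2 * lam * b                # d(loss)/db
    w -= lr * grad_w     # step against the gradient...
    b -= lr * grad_b     # ...scaled by the learning rate

print(round(w, 2), round(b, 2), round(loss, 4))   # w and b should end up close to 3 and 1</code></pre><p>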
<h4>Tweaking gradient descent</h4><p>There are a bunch of methods which have been developed to improve the behaviour of gradient descent in one way or another.</p><p>Some of these are concerned with how much training data you pass through the neural network before each round of tweaking. In the original formulation, forward passes were made for every sample in the training data, and then gradient values were calculated across all of these before any tweaking was done. But this was very time consuming if you had a large training set. So one simple innovation was to use less data for each update, and this generally leads to faster learning. If you use only one item of data each time, it&#8217;s known as <em>stochastic gradient descent</em>. If you use more than one, but not all the data, then it&#8217;s called <em>mini-batching</em>. The amount you use is known as the <em>batch size</em>.</p><p>Another issue with vanilla gradient descent is that the learning rate is a constant. But as you approach the optimum, it can actually be useful to perform smaller tweaks. Conversely, at the beginning of training, larger tweaks can be better to speed up convergence. This is addressed through the use of <em>adaptive learning rates</em>. The simplest approach is to gradually decrease the learning rate using a <em>learning rate schedule</em>.</p><p>But a more successful approach is used by <em>Adam</em>, which is short for <em>adaptive moment estimation</em>, and is currently the most popular form of gradient descent used for neural networks. It combines two ideas. The first is <em>momentum</em>, which adds a fraction of the gradient values from previous updates onto their values in the current update; this helps to make gradient values more stable, particularly when using small batch sizes. The second, which is Adam&#8217;s main point of novelty, is that it keeps running estimates of how large each parameter&#8217;s gradients have recently been, and uses these to individually adapt the learning rate for each parameter.</p><p>Another common extension is the catchily-named <em>L2 regularisation</em>. Like most things in neural networks, this is not as complicated as it sounds, and basically involves adding a penalty term to the loss function that grows with the size of the weights (specifically, with the sum of their squared values). So, if lots of weights take on large values, then the loss is artificially increased, applying pressure to keep them small. This helps to prevent overfitting by limiting how strongly the network can rely on any individual parameter. A related concept is <em>weight decay</em>, which is used in a variant of Adam called <em>AdamW</em>.</p>
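<p>To give a feel for how these tweaks change the basic update, here&#8217;s an illustrative sketch of an Adam-style step for a single parameter (simplified: real Adam also applies a bias correction to its running averages, which is omitted here, and the hyperparameter values shown are just common defaults):</p><pre><code>
# A simplified Adam-style update for one parameter, with optional
# AdamW-style weight decay.

beta1, beta2 = 0.9, 0.999    # how much of each running average to keep
eps = 1e-8                   # avoids dividing by zero
learning_rate = 0.001
weight_decay = 0.01          # set to 0.0 for plain Adam

def adam_step(param, grad, m, v):
    m = beta1 * m + (1 - beta1) * grad           # momentum: smoothed gradient
    v = beta2 * v + (1 - beta2) * grad ** 2      # smoothed squared gradient
    step = learning_rate * m / (v ** 0.5 + eps)  # per-parameter adaptive step
    param = param - step - learning_rate * weight_decay * param  # weight decay
    return param, m, v

# Usage: keep m and v alongside the parameter, updating them every step
w, m, v = 0.5, 0.0, 0.0
w, m, v = adam_step(w, grad=-36.0, m=m, v=v)
</code></pre><p>Because <code>v</code> tracks how large a parameter&#8217;s recent gradients have been, parameters with consistently large gradients end up taking proportionally smaller steps, which is the per-parameter adaptation mentioned above.</p>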
<h4>Reinforcement learning</h4><p>This all assumes that after each forward pass, you can measure how well the neural network did, and use this information to tweak its parameters. Whilst this is true in many cases, there are situations where a measure of performance is not available until much later, or is in some way incomplete. In these circumstances it&#8217;s common to use some kind of <em>reinforcement learning</em>.</p><p>A typical example is playing a computer game. The goal may be to complete a level, but to do so may involve carrying out a lot of other actions that contribute in some way towards whether and how quickly the level is completed. These are actions like avoiding monsters, not falling into pits, and collecting power-ups. How does each of these actions contribute towards the eventual score? Or, at an even lower level, how does each move made by the player contribute towards these? Well, this is unclear. But if you&#8217;re using a neural network&#8217;s forward pass to determine the next move, then you need to know this information in order to correctly tweak its parameters during the subsequent backwards pass.</p><p>Reinforcement learning, in a nutshell, involves remembering all the actions that contributed to the outcome, and then apportioning reward to all those actions that led to it<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. And this reward may be positive or negative depending on whether the goal was achieved. So, if a neural network &#8212;&nbsp;through a series of moves generated by forward passes &#8212; completed a level and received points as a consequence, these points will be apportioned back to all the individual moves that contributed to receiving them. And these rewards can then be used to drive the backward passes and update the neural network&#8217;s parameters<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p><h4>Optimisation without gradients</h4><p>Backpropagation isn&#8217;t the only fish in the sea. It&#8217;s a pretty big one, and could feed you for much of your neural network training life, but there are also some pretty tasty minnows. An important group of minnows is the gradient-free optimisers &#8212; which, as the name suggests, do not calculate or use gradients when learning neural network parameters.</p><p>Optimisation is a pretty meaty subject in itself, so I&#8217;m not going to say much about how it works. But in a nutshell, it treats the parameters of a neural network as a list of numbers that have to be correctly determined in order to minimise some loss function. Any optimiser that works with numbers can be applied to this task, but a common approach is to use an <em>evolutionary algorithm</em>. This is based upon an analogue of natural selection &#8212; start with a population of random solutions (i.e. a bunch of neural networks with random weights and biases) and then iteratively kill off the weak ones and breed new solutions using the best ones. Breeding is done using <em>mutation</em> (which randomly changes a small number of parameter values) and <em>crossover</em> (which splices together parts of two existing solutions).</p><p>Evolutionary algorithms tend to be better at avoiding local optima than gradient descent, which means they can in principle train better neural networks. However, gradient-free optimisers don&#8217;t scale anywhere near as well as backpropagation, which means in practice they can only optimise the parameters of relatively small neural networks. Anything beyond a few thousand parameters would be a push, and this is quite a limitation given that nowadays we&#8217;re routinely dealing with billions of parameters. But they do still have niches, and one of these is neural architecture search.</p><h4>Neural architecture search</h4><p>Before training a neural network, you have to choose an architecture. This includes things like how many layers there are, how many neurons there are in each layer, how the layers are connected, whether there are any residual connections, and which activation functions are used in each layer. Wouldn&#8217;t it be great if there were a way of choosing these things for you?</p><p>This is where <em>neural architecture search</em> (or NAS) comes in. It&#8217;s a way of learning the neural network architecture that is optimal for a particular task. There are various ways of doing this, including flavours of reinforcement learning, but the most successful approach is arguably evolutionary algorithms. This again rests on their ability to find good solutions and not get stuck in local optima. Scalability is much less of an issue here, because they&#8217;re only being used to learn a relatively small number of architectural parameters, rather than a huge number of weights and biases.</p><p>If you&#8217;re interested in doing a bit of NAS, it&#8217;s worth mentioning that it can be a very expensive procedure, since it basically involves generating a large number of candidate architectures and then training all their weights and biases each time (using backpropagation) in order to find out how good they are. So, it&#8217;s only really advisable for smaller neural networks, or for people with a huge pile of GPUs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><h4>Biologically more plausible approaches</h4><p>And that&#8217;s pretty much it for training neural networks. However, I don&#8217;t want to leave without mentioning that although neural networks are modelled upon the structure of the brain, backpropagation takes pretty much no inspiration from biology. Yes, it works well enough, but I wouldn&#8217;t assume this is the final destination. In the longer term, it seems plausible that we could improve neural network training by learning more about how biological brains actually learn.</p><p>But currently we don&#8217;t really know how biological brains learn. There are theories, and these theories underlie some of the more biologically-plausible approaches to training neural networks. Leading amongst these is the idea of <em>Hebbian learning</em>, often summarised as &#8220;neurons that fire together, wire together&#8221;. This can be seen within a method called <em>spike-timing dependent plasticity</em>, which is used to train <em>spiking neural networks</em>, which are themselves a more biologically-plausible model of brains.</p>
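<p>To give a flavour of how different this is from backpropagation, here&#8217;s a sketch of a basic Hebbian update for a single connection (an illustration of the general idea rather than a faithful model of any biological mechanism):</p><pre><code>
# A basic Hebbian weight update: the connection between two neurons is
# strengthened in proportion to how active they both are at the same time.
# Note that no loss function and no gradients are involved.

learning_rate = 0.1

def hebbian_update(weight, pre_activity, post_activity):
    # "Neurons that fire together, wire together"
    return weight + learning_rate * pre_activity * post_activity

w = 0.2
w = hebbian_update(w, pre_activity=0.9, post_activity=0.8)  # both active: w grows
w = hebbian_update(w, pre_activity=0.9, post_activity=0.0)  # one silent: w unchanged
print(w)  # roughly 0.272
</code></pre><p>Spike-timing dependent plasticity refines this idea by also taking into account <em>when</em> the two neurons fire relative to one another, rather than just whether they are active together.</p><p>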
However, in its current form, Hebbian learning seems to be in no danger of challenging backpropagation as the go-to neural network trainer.</p><h4>Too long; didn&#8217;t read</h4><p>Neural networks are typically trained using backpropagation and gradient descent. Backpropagation is a way of working out how you need to tweak the network&#8217;s parameters in order to reduce the loss &#8212; which is a measure of how close a neural network is to its target behaviour. Gradient descent is an iterative procedure for carrying out these tweaks, with the aim of minimising the loss. There&#8217;s no guarantee it will lead to an optimal neural network, but more recent innovations like mini-batching, momentum and weight decay improve its behaviour, and are central to modern optimisers like Adam. Evolutionary algorithms are also sometimes used to train neural networks, and are particularly effective at optimising the architecture of a neural network. In the future, we might expect further improvements by looking more closely at how brains learn &#8212; that is, once we work out how they learn!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://fetchdecodeexecute.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Fetch Decode Execute! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Well, actually at the output of the loss function, which also gets involved in the whole differentiation thing.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Specifically, partial differentiation and the chain rule. For an accessible introduction to this process, see <a href="https://www.youtube.com/watch?v=VMj-3S1tku0">this video by Andrej Karpathy</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Typically with some kind of strategy that assigns more points to more recent moves. This whole area is known as <em>credit assignment</em>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Although exactly how this is done varies significantly. If you want to know more, you could look into <em>deep Q-learning</em>. 
This is the approach that DeepMind famously used to solve a bunch of Atari games problems.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>So you won&#8217;t be surprised to learn that early work in this area was done by Google.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Dips #3: Transformers]]></title><description><![CDATA[I initially struggled to get my head around transformers.]]></description><link>https://fetchdecodeexecute.substack.com/p/deep-dips-3-transformers</link><guid isPermaLink="false">https://fetchdecodeexecute.substack.com/p/deep-dips-3-transformers</guid><dc:creator><![CDATA[Michael Lones]]></dc:creator><pubDate>Wed, 22 May 2024 08:47:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9Syr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9a97e4-e6ee-423f-b8fc-f200e3bbe091_4445x2543.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I initially struggled to get my head around transformers. In an ideal world, it would be possible to read the paper that introduced them, <a href="https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html">Attention is All You Need</a>, and absorb the relevant facts. But in practice this is a densely-written paper which assumes a lot of knowledge<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. So, I also trawled through the introductions on the web, but these mostly frustrated me. Many of them only present the low-level details of the model, which are admittedly not that complicated, but which by themselves are not very revealing. Others attempt to describe how transformers work, but often in a way that&#8217;s hand-wavy and light on facts.</p><p>At the heart of it, I think there are several reasons why transformers are confusing:</p><ul><li><p>They have a lot of moving parts, and these interact in complex ways that no one really understands.</p></li><li><p>Some of the design decisions are rather arbitrary, and informed by experiment rather than design &#8212; that is: <em>oh, this works rather well, let&#8217;s keep it, </em>rather than:<em> ah, yes, that&#8217;s the obvious way of doing it</em>.</p></li><li><p>The self-attention process, which is at the core of how transformers work, is not particularly intuitive, and is widely misunderstood.</p></li></ul><p>In this post, I&#8217;m going to try to introduce transformers in a way that captures the key details but doesn&#8217;t require an understanding of linear algebra. To keep things simple, I&#8217;m going to stick to the kind of transformers that we&#8217;ve grown to know and love: GPT models (such as ChatGPT) which are used to generate text. The basic idea behind these is that you provide it with some text &#8212; in what is referred to as a <em>context window</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> &#8212; and it predicts the next word. 
This word then gets added to the end of the text, and the augmented text gets passed through the transformer again<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, resulting in another word being produced. And so on<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. In most cases this series of words will be the answer to a question you asked<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><p>For those who want more detail, take a look at the footnotes. I also recommend Jay Alammar&#8217;s &#8220;<a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a>&#8221; if you want more depth than I provide here, though this does require understanding of basic linear algebra. If you&#8217;re more comfortable thinking in code, I&#8217;d also recommend Sebastian Raschka&#8217;s article on <a href="https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html">coding the self-attention mechanism</a>.</p><h4>Overall architecture</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Syr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9a97e4-e6ee-423f-b8fc-f200e3bbe091_4445x2543.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Syr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9a97e4-e6ee-423f-b8fc-f200e3bbe091_4445x2543.png 424w, https://substackcdn.com/image/fetch/$s_!9Syr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9a97e4-e6ee-423f-b8fc-f200e3bbe091_4445x2543.png 848w, https://substackcdn.com/image/fetch/$s_!9Syr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9a97e4-e6ee-423f-b8fc-f200e3bbe091_4445x2543.png 1272w, https://substackcdn.com/image/fetch/$s_!9Syr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9a97e4-e6ee-423f-b8fc-f200e3bbe091_4445x2543.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Syr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9a97e4-e6ee-423f-b8fc-f200e3bbe091_4445x2543.png" width="1456" height="833" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce9a97e4-e6ee-423f-b8fc-f200e3bbe091_4445x2543.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:654101,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!9Syr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9a97e4-e6ee-423f-b8fc-f200e3bbe091_4445x2543.png 424w, https://substackcdn.com/image/fetch/$s_!9Syr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9a97e4-e6ee-423f-b8fc-f200e3bbe091_4445x2543.png 848w, https://substackcdn.com/image/fetch/$s_!9Syr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9a97e4-e6ee-423f-b8fc-f200e3bbe091_4445x2543.png 1272w, https://substackcdn.com/image/fetch/$s_!9Syr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce9a97e4-e6ee-423f-b8fc-f200e3bbe091_4445x2543.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I think it helps to be aware of the overall architecture before looking at the details. These are some of the key points:</p><ul><li><p>Transformers are organised into blocks. In each block is a <em>self-attention</em> layer followed by a <em>multilayer perceptron </em>(MLP for short, which I talked about in the <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-1-multi-layer-perceptrons">first post in this series</a>). Each block also contains some other, arguably less important, bits and pieces<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p></li><li><p>The transformer block is repeated multiple times, with the outputs of each one feeding into the following one<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. 
Each of these blocks has its own independent learnable weights, which are trained by a standard neural network optimiser in order to configure its behaviour (see <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-4-training-neural-networks">Deep Dips #4: Training neural networks</a>).</p></li><li><p>Importantly, the self-attention layer in each block is <em>multi-headed</em>. This means that the self-attention process (which I discuss below) is repeated multiple times, in parallel, with different weights, and then the outputs from each of these get combined before passing through the block&#8217;s MLP.</p></li></ul><p>To give you some idea of scale, a self-attention layer in GPT-3 has 96 heads, and the whole thing comprises 96 transformer blocks. So that&#8217;s 96x96 = 9216 repeats of the self-attention process, each with different trainable weights. GPT-4, which underlies the most recent release of ChatGPT, presumably has a lot more<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. This is worth bearing in mind before I get into the details of self-attention.</p><p>GPT is often referred to as a decoder-only model. However, this terminology only really makes sense within the context of the original transformer model, which was designed to translate one sequence of words into another. See my <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-2-embeddings-and-latent">previous post on embedding models</a> for more on this. It&#8217;s worth being aware that there are also encoder-only models like BERT, but in terms of how they work, encoder- and decoder-only models are very similar. They only really differ in their output layers.</p><h4>Input embedding</h4><p>The way in which inputs to a transformer are represented is a key part of the puzzle of how they work, so I&#8217;m going to start here.</p><p>Neural networks process numbers, not text, so when a transformer receives text, each word<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> must first be turned into a list of numbers &#8212; known as an <em>embedding vector</em>. A key idea underlying text-based transformers is that related words are represented by related embedding vectors. For example, if two words mean much the same thing, then they should have much the same embedding vector. If they have meanings that are different, but related, this is likely to be reflected in parts of their embedding vectors being similar and other bits being different.</p><p>The embedding process &#8212; turning words into embedding vectors &#8212; is not traditionally done by the transformer itself<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. Rather, it&#8217;s done by a <em>word embedding</em>, which in many cases is another neural network. I talked about these in the <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-2-embeddings-and-latent">previous post in this series</a>, so I won&#8217;t repeat it here. Once the embedding process is complete, each of the input words will have been replaced by a corresponding embedding vector.</p>
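<p>As a toy illustration of this idea (with completely made-up three-number embeddings, far shorter than anything a real model would use), related words end up with vectors that point in similar directions:</p><pre><code>
import math

# Toy word embeddings: each word maps to a short list of numbers.
# The values are invented for illustration; real embedding vectors are
# learnt from data and are much, much longer.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point in exactly the same direction;
    # values near 0.0 mean they have little in common.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # ~0.99: related
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # ~0.30: not so related
</code></pre><p>Nothing this tidy is going on inside a real model, of course, but the underlying intuition is the same: words that are used in similar ways end up with similar embedding vectors.</p>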
<p>In the case of GPT-3, for example, each of these embedding vectors is a list of 12,288 numbers. Multiplying that by the number of words in the context window means there&#8217;s a whole lot of numbers going into a transformer.</p><p>One thing to bear in mind is that because the word embedding used by a transformer is learnt from data, we don&#8217;t really know how it works. People have hypothesised that it might pick up on the same kind of relationships that humans recognise between words, but it&#8217;s quite possible that it has an entirely different outlook on things.</p><p>Oh, and there&#8217;s one more source of complexity in the input encoding, because transformers also impose a positional encoding on top of the embedding vectors. In the interest of not loading too much on you at once, I&#8217;ll come back to this later.</p><h4>And let&#8217;s not forget the outputs</h4><p>It sometimes helps to think of a transformer as a set of parallel pipelines (depicted as thick black lines in the diagram above), each one working at the same time on one word within the transformer&#8217;s context window. At the bottom of each pipeline enters an embedding vector representing the word. These embeddings then move up through each block of the transformer, and in each block they get transformed in some way. When they reach the top of the last transformer block, only the final embedding in the final pipeline (corresponding to the most recent word in the input context) gets used to predict the next word in the sequence.</p><p>I said earlier that GPT-style transformers generate a single word each time you use them. Well, that&#8217;s not entirely true. GPT-style transformers actually generate a probability distribution of next words, and this is then used to pick the next word. A temperature setting is often used to determine how this is done. A low temperature means that it will almost always pick the most likely next word (at a temperature of zero, it always will), and a high temperature means it will more often pick less likely words. That is, by twiddling the temperature knob on ChatGPT and friends you can vary the diversity of text generated.</p><p>So, how does the final embedding of the final word in the input text get turned into a probability distribution of next words? This is done using two more layers on top of the last transformer block. The first of these is a <em>linear layer</em>.</p><p>It&#8217;s worth pausing for a moment to explain what a linear layer is, since this concept will come up again later. A linear layer is essentially a fully-connected neural network layer that doesn&#8217;t use a transfer function. That is, every input is connected to every output, and the outputs are only determined by the layer&#8217;s weights<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. This is different to a standard MLP layer, where the weighted sum of inputs is then fed through a non-linear transfer function. One common use of linear layers is to project data from one representation to another. And this is exactly what&#8217;s happening here.</p>
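<p>Here&#8217;s a rough sketch of that final step, with a comically small vocabulary and made-up numbers, showing how a linear layer plus a softmax (and a temperature setting) turns the final embedding into next-word probabilities:</p><pre><code>
import math
import random

# A made-up final embedding and vocabulary. In a real model the embedding
# has thousands of numbers, the vocabulary has tens of thousands of tokens,
# and the weights below would all be learnt during training.
final_embedding = [0.5, -1.0, 2.0]
vocabulary = ["cat", "sat", "mat"]
weights = [           # one row of weights per word in the vocabulary
    [0.2, 0.1, 1.0],
    [1.5, -0.5, 0.3],
    [0.0, 0.3, 0.8],
]

# Linear layer: each output is just a weighted sum of the embedding's numbers
# (biases are left out to keep the sketch short)
logits = [sum(w * e for w, e in zip(row, final_embedding)) for row in weights]

# Softmax with temperature: turn the raw scores into probabilities
temperature = 0.8
exps = [math.exp(score / temperature) for score in logits]
probs = [e / sum(exps) for e in exps]

# Pick the next word according to those probabilities
next_word = random.choices(vocabulary, weights=probs)[0]
print(list(zip(vocabulary, probs)), next_word)
</code></pre><p>Lowering the temperature sharpens the distribution towards the highest-scoring word, while raising it flattens the distribution out, which is where the extra diversity comes from.</p><p>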
The linear layer contains an output node for each word in the transformer&#8217;s vocabulary<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>, and its inputs are the numbers in the final embedding of the final word, and it is essentially mapping between these two representations.</p><p>After the linear layer is a <em>softmax</em> <em>layer</em>, which is just a standard way of turning raw output numbers from a neural network into a probability distribution<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>.</p><h4>Self-attention</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8opA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccc0b4-5d63-486b-859a-1e97c68e4a8d_4248x2493.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8opA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccc0b4-5d63-486b-859a-1e97c68e4a8d_4248x2493.png 424w, https://substackcdn.com/image/fetch/$s_!8opA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccc0b4-5d63-486b-859a-1e97c68e4a8d_4248x2493.png 848w, https://substackcdn.com/image/fetch/$s_!8opA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccc0b4-5d63-486b-859a-1e97c68e4a8d_4248x2493.png 1272w, https://substackcdn.com/image/fetch/$s_!8opA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccc0b4-5d63-486b-859a-1e97c68e4a8d_4248x2493.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8opA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccc0b4-5d63-486b-859a-1e97c68e4a8d_4248x2493.png" width="1456" height="854" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5ccc0b4-5d63-486b-859a-1e97c68e4a8d_4248x2493.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:854,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:823189,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8opA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccc0b4-5d63-486b-859a-1e97c68e4a8d_4248x2493.png 424w, https://substackcdn.com/image/fetch/$s_!8opA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccc0b4-5d63-486b-859a-1e97c68e4a8d_4248x2493.png 848w, 
https://substackcdn.com/image/fetch/$s_!8opA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccc0b4-5d63-486b-859a-1e97c68e4a8d_4248x2493.png 1272w, https://substackcdn.com/image/fetch/$s_!8opA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccc0b4-5d63-486b-859a-1e97c68e4a8d_4248x2493.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A single self-attention head enriching its first input embedding</figcaption></figure></div><p>At last we&#8217;ve reached the headline act of the transformer architecture, the self-attention layer. But things can get confusing at this point, and I find it helps to (a) not fixate on the term <em>self-attention</em>, which is far from being self-explanatory and (b) only think about the self-attention layer in the first transformer block. We can worry about the higher-ups later on.</p><p>The first self-attention layer receives as input a set of embedding vectors, one for each word in the context window. It then attempts to quantify how relevant each word is to every other word. Essentially, for each pair of embedding vectors (recall that each of these is a list of numbers), it does this by multiplying each pair of numbers within them together and then summing these values up<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>. This may sound a bit weird, but recall the idea from earlier that related words have related embeddings. This means that their embeddings are likely to contain, in places at least, big numbers at the same positions. So, if you multiply these numbers together, you&#8217;ll get even bigger numbers in the positions where they have commonalities, and the presence of these big numbers can be taken as an indication that these words are in some sense compatible with each other. So you can think of multiplying embedding vectors together as a general mechanism for amplifying their commonalities and thereby emphasising their relationships. 
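</p><p>A tiny numerical illustration (with made-up four-number embeddings, and ignoring the learned weights that get involved shortly) shows how this multiply-and-sum step rewards big numbers in the same positions:</p><pre><code>
# Two made-up embedding vectors with large values in the same positions
# (suggesting related words), and one with a very different pattern.
a = [2.0, 0.1, 1.5, 0.0]
b = [1.8, 0.2, 1.2, 0.1]   # similar pattern to a
c = [0.1, 1.9, 0.0, 1.4]   # very different pattern

def relevance(u, v):
    # Multiply the numbers in matching positions, then add them all up
    return sum(x * y for x, y in zip(u, v))

print(relevance(a, b))  # about 5.4: the big values line up, so the score is large
print(relevance(a, c))  # about 0.4: nothing much lines up, so the score is small
</code></pre><p>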
Summing up all these multiplied-together numbers then produces a single number that captures how relevant two embeddings are to one another.</p><p>But it&#8217;s not quite that simple, because before they&#8217;re multiplied together, the embedding vectors are first <em>altered</em> by a bunch of learned weights, the purpose of which is to focus the search for relationships on particular parts of the embeddings. I use the word <em>altered</em> here to simplify things, but what actually happens is that each word embedding vector is multiplied by a weight matrix. However, this is not multiplication in the conventional arithmetic sense, but rather in a more specific linear algebra sense. But since I promised to avoid linear algebra, you can equivalently think of this as applying a linear layer (see above) to the embedding, in which the linear layer&#8217;s inputs are the numeric components of the word embedding, its outputs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a> are the numeric components of the altered word embedding, and the weights in the linear layer correspond to the weights in the matrix.</p><p>And to further complicate matters, there are actually two separate weight matrices involved in this process, known as the <em>key</em> and <em>query</em> weight matrices. For each parallel pipeline, the key is applied to the pipeline&#8217;s own embedding vector and the query is applied to each of the others that it is being compared against &#8212; see the diagram above for a visual depiction of this. However, I wouldn&#8217;t worry too much about the distinction, since it&#8217;s really one of those <em>oh, this works rather well, let&#8217;s keep it</em> things I mentioned earlier, rather than being really integral to understanding what&#8217;s going on. </p><p>So, there are lots of learned weights involved in the self-attention process. To put the need for these in context, recall that each attention layer has multiple heads, i.e. multiple versions that are executed in parallel on the same set of embedding vectors. Since each of these has different learned weights, this means that each head can use its particular weight matrices to hone in on different parts of the embeddings, which in turn may correspond to particular aspects of language. So, one head might focus on what verbs are doing, another might be more concerned with adjectives, and others might be looking at more exotic relationships between words. Exotic here is code for &#8220;we have no clue what they might be doing.&#8221;</p><p>Essentially, the process described so far tells the self-attention layer how relevant each embedding vector is to each of the other embedding vectors in the context window. Now, recall earlier I mentioned that embedding vectors move up the transformer architecture in parallel pipelines. Well, the only place that information actually gets transferred between these pipelines is within the self-attention layers. And this is what happens next. For each pipeline, a new embedding vector is created by melding the existing embedding vector at that position with information taken from the other embedding vectors that the above process says are most relevant to it. 
</p><p>Specifically, this melding is done by creating a weighted sum of all the embedding vectors, with the weight assigned to each one determined by the strength of its relationship &#8212; as determined by the pairwise weight-multiply-and-sum process plus a bit of normalisation<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a>. So, if embedding vector A has a strong relationship with embedding vector B, then quite a lot of A is going to end up in B, and vice versa. More generally, all the embedding vectors will gain information from the other embedding vectors that are most relevant to them.</p><p>But again it&#8217;s not quite that simple, because before the embedding vectors are weighted and summed, they each get multiplied by another learned weight matrix, known as the <em>value</em> weight matrix. Just like the key and query weight matrices earlier, this can be used to emphasise or de-emphasise certain parts of the embedding vectors during this integration process. And since different heads have different value weight matrices, this gives them yet another opportunity to specialise in particular aspects of language.</p><h4>But don&#8217;t forget not self-attention</h4><p>Despite its tendency to hog the limelight, there&#8217;s more to a transformer block than just self-attention. Two other important bits are the linear layer and the MLP layers.</p><p>The linear layer (yes, another one) is responsible for turning the embedding vectors generated by multiple heads back into a single embedding vector. That is, after the self-attention layer, at each word position, each self-attention head will have constructed a new embedding vector. So, a single original embedding vector will have become many, and to stop things getting out of hand, these need to become one again. It works just like the linear layers we came across earlier by taking the multiple embedding vectors (concatenated into one list of numbers) generated by each of the heads and mapping these to output values equal in number to the size of a single embedding vector, again configured using learnable weights.</p><p>This unified embedding vector then goes through the transformer block&#8217;s MLP. No one is entirely sure what the MLP does &#8212; again it depends on learned weights<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a> &#8212; but it&#8217;s basically the only opportunity for a transformer block to do something non-linear, since all the other operations described so far have been linear, and therefore limited in their ability to do interesting things. Despite the focus on self-attention within transformers, the majority of its parameters are actually in these MLP layers, so they&#8217;re an important, if not very well understood, part of the puzzle. And it&#8217;s worth noting that at this stage the embedding vectors are very much back in their parallel pipelines, and the same MLP is applied independently to each of them. 
So whatever it&#8217;s doing, it does it to each unified embedding vector separately, resulting in a bunch of altered embedding vectors that then become the inputs to the next transformer block.</p><h4>Moving on up</h4><p>So that&#8217;s how a transformer block is formulated: first carry out multi-headed self-attention, where information moves between related embedding vectors in different ways in each head, then apply a linear layer where the outputs of all the heads get munged together, and then do MLP, where something non-linear happens.</p><p>The first transformer block transforms the set of word embedding vectors in the context window into a new set of embedding vectors. This new set of embeddings are said to be <em>enriched</em>; that is, they&#8217;ve gained information by going through the self-attention and MLP process. People who know about these things reckon that the first transformer block tends to capture information about each word&#8217;s contextual relationship with nearby words &#8212; known as the <em>local context</em>.</p><p>These enriched embedding vectors then become the inputs to the second transformer block. Since these inputs already contain information about the local context, it is thought that the second block captures wider contextual relationships within the text, adding more nuance to the understanding of each word. Subsequent blocks then build upon this, further refining contextual understanding and finding broader meaning within the text in the context window.</p><p>However, much of this is speculation based upon observing particular trained transformers processing particular text samples, and in practice it&#8217;s very hard to know how transformers actually work, given their size and considerable complexity.</p><h4>But what about position?</h4><p>Oh yes, I promised to say something about positional encodings.</p><p>Something to be aware of is that transformers are by default ignorant when it comes to the position of words within the context window. This is because the self-attention process has no direct way of taking position into account &#8212; it just looks at every pair of embedding vectors, and treats them all equally. And this is a problem, because even us mere humans know that the position of a word within a sentence is a very important indicator of its role and meaning.</p><p>Positional encodings are a solution to this problem. They basically involve overlaying a positional encoding on top of each word&#8217;s embedding vector. That is, adding some numeric pattern to each embedding vector to give some indication of its position within the text. There are various ways of doing this, but one common way is to add some kind of repeating sinusoidal pattern to capture relative positions.</p><p>In theory, this means that the transformer has access to information about the positions of each pair of words when it is performing self-attention, and it can therefore take this context into account when enriching embedding vectors. In practice, it&#8217;s unclear (well, to me at least!) how the transformer then acts upon this information and integrates it with all the other information it&#8217;s processing. But I guess this is true of transformers in general.</p><h4>And train</h4><p>So that&#8217;s the main bits and pieces that make up a transformer. The final piece of the puzzle is how you configure it.</p><p>At this point it&#8217;s worth noting that the <em>GPT</em> in ChatGPT stands for <em>Generative Pre-trained Transformer</em>. 
That is, it isn&#8217;t just a transformer, but rather a transformer that has already been pre-trained by setting its learned weights to particular values. And it&#8217;s this particular configuration of weights that underlies its ability to generate meaningful text.</p><p>I&#8217;m not going to go into detail about how these weights are trained, since it basically involves using a standard neural network optimiser, which is something I plan to talk about in a future post. However, in a nutshell, these optimisers work by training on input-output examples, in which for a given input, the correct output is already known. The optimiser therefore knows what the output should be, and it can therefore determine how far the neural network missed it by. This information is then used to tweak the values of each weight, so that if the same input was provided again, it would get closer to the correct output. Doing this lots of times with lots of different input-output examples eventually leads to sensible weight values.</p><p>Exactly the same thing happens when training transformers. That is, they&#8217;re given some input, in the form of text. The correct output &#8212; the next word &#8212; is already known, so can be compared against whichever word the transformer predicts, and this error can then be used to tweak the weights. The only real difference is that transformers are a lot larger than other types of neural networks, and this basically means that a huge amount of input-output examples are required to train the weights correctly. For example, GPT-3 has 175 billion learnable weights and required about 500 billion words of text to train these weights<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a>.</p><p>The end result is known as a <em>large language model</em>, or LLM. This is because, having absorbed a substantial fraction of all the text on the internet, the transformer has essentially learnt a generalised model of how humans use language. And it&#8217;s really this model, embedded in the weights of the transformer, that underlies the power of ChatGPT and its ilk.</p><h4>Too long; didn&#8217;t read</h4><p>Yes, it was quite long wasn&#8217;t it? I like to make these posts short, but transformers are just so darn complex. But I&#8217;ll attempt to make them simpler: A transformer consists of parallel pipelines within vertically-stacked blocks, with one pipeline working on each word in the transformer&#8217;s input text. Words are transformed into embedding vectors when they enter at the bottom, and in each block a combination of self-attention and multilayer perceptron layers enrich each embedding vector to capture contextual information and meaning from the other embedding vectors. The core operation of transformers, called self-attention, basically involves multiplying embedding vectors together in a pairwise fashion. This brings out their relationships, guided by a bunch of learned weights. A multilayer perceptron then increases the processing capacity of the block by adding non-linearity. Once it reaches the top of the transformer, the enriched embedding vector of the most recent word in the input should have gained all the context and meaning it needs to predict the next word in the text, and this is what happens. This predicted word gets added to the input, and the process repeats, generating subsequent words. 
It all works pretty well, but no one really knows how it works.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://fetchdecodeexecute.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Fetch Decode Execute! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>On top of this, the encoder-decoder transformer model described in this paper is mostly of historical interest.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Transformers can only look at so many words at once. The upper limit of this is known as the context window. For current transformers, this is typically in the 1000s of tokens, but some can handle much more than this.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Well, kind of. Most implementations have some way of maintaining internal state from previous iterations in order to improve efficiency.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This is known as <em>autoregressive</em> behaviour. It comes to an end when the transformer produces a special symbol called the <em>end of sequence</em> token.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Transformers like ChatGPT actually undergo two phases of training. In the first, and most important, phase, they&#8217;re trained to produce the next word in a sequence of text. In the second, they&#8217;re taught to interact in a question-and-answer style.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Including normalisation layers, which help to handle issues with gradients exploding during training.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>There are also residual (or skip) connections between transformer blocks. 
This means a block can potentially feed into any other block higher up in the structure.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p> Its developer OpenAI is more like ClosedAI when it comes to releasing details.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Though generally this is done with <em>tokens</em> rather than words, since some words will be split into multiple tokens. For example, the word &#8220;working&#8221; might be split into &#8220;work&#8221; and &#8220;ing&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Although some recent transformers train the embedding model and the transformer weights at the same time.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>And, at least for the output layer, biases.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Typically in the order of tens of thousands.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Actually a pseudo-probability distribution. They&#8217;re not quite as mathematically well-behaved as real probability distributions, but they seem to work well in practice.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>In linear algebra, this is known as the dot product between two vectors.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>Typically the number of outputs is less than the number of inputs, so the altered embedding tends to be shorter than the original one. This is generally done for reasons of efficiency, so I wouldn&#8217;t worry too much about it. However, it helps to explain why the matrix multiplication operation in self-attention is sometimes described as projecting a word embedding into a representation subspace.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>Actually two bits of normalisation, the first to control gradients, and the second to apply a softmax function so that the weights all sum to one. 
The details aren&#8217;t overly important.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>Note that the weights of the linear layer and the MLP layers are both learned separately for each block of the transformer.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p><a href="https://en.wikipedia.org/wiki/GPT-3">Wikipedia</a> has a good account of this. As it notes: &#8220;Lambdalabs estimated a hypothetical cost of around $4.6 million US dollars and 355 years to train GPT-3 on a single GPU in 2020&#8221;.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Dips #2: Embeddings and latent spaces]]></title><description><![CDATA[Last time in this series I talked about multilayer perceptrons, an old idea that still remains relevant to modern deep learning systems. This time I&#8217;m going to talk about another idea that has a been around a while, but which has recently become a key component of transformer models. These are]]></description><link>https://fetchdecodeexecute.substack.com/p/deep-dips-2-embeddings-and-latent</link><guid isPermaLink="false">https://fetchdecodeexecute.substack.com/p/deep-dips-2-embeddings-and-latent</guid><dc:creator><![CDATA[Michael Lones]]></dc:creator><pubDate>Tue, 16 Apr 2024 12:35:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Uo41!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d8d818c-b28a-4355-bee3-990d5be9772a_1998x1448.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last time in this series <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-1-multi-layer-perceptrons">I talked about multilayer perceptrons</a>, an old idea that still remains relevant to modern deep learning systems. This time I&#8217;m going to talk about another idea that has a been around a while, but which has recently become a key component of transformer models. These are <em>embeddings</em>, <em>embedding spaces</em> and <em>latent spaces, </em>which are all different terms for the same thing: a low-dimensional fixed-size representation of high-dimensional and often variable-length data. I&#8217;m going to start with <em>autoencoders, </em>which are the simplest kind of embedding, and then I&#8217;ll move onto <em>word embeddings</em> and also say a little about transformers (though these are covered in depth in the next post in this series, <a href="https://fetchdecodeexecute.substack.com/p/deep-dips-3-transformers">Deep Dips #3: Transformers</a>).</p><h4>Autoencoders</h4><p>Like many great things in life, autoencoders are based on a simple yet effective idea: What happens when you train a neural network to output the same values that it receives at its inputs, whilst at the same time including a hidden layer that is narrower than its input and output layers? 
This narrow hidden layer is known as a <em>bottleneck layer</em>, and it forces the network to learn a compressed representation of its inputs.</p><p>The simplest autoencoders look something like this:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Uo41!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d8d818c-b28a-4355-bee3-990d5be9772a_1998x1448.png" alt="A simple autoencoder with an input layer, a single narrow bottleneck layer and an output layer"></figure></div><p>That is, they have one input layer where the data gets delivered, one output layer where the <em>reconstructed data</em> gets produced, and a single bottleneck layer. Using a standard neural network training algorithm, the parameters of the bottleneck and output layers are trained to minimise the <em>reconstruction loss</em> &#8212; that is, the difference between the values at the inputs and the values at the corresponding outputs. Assuming this difference is small, the reconstructed data will resemble the input data. But more importantly, the activations of the neurons in the bottleneck layer can be read as a compressed representation of the input. The narrower the bottleneck layer, the more compressed the representation becomes<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><p>You may already be familiar with the idea of compression and dimensionality-reduction. In its simplest form, an autoencoder is just another way of doing this. In fact, an autoencoder with a single hidden layer (and linear activation functions) is essentially equivalent to <em>principal component analysis</em> (PCA), which is a popular way of doing dimensionality-reduction. If you start adding more layers either side of the bottleneck, you basically end up with a non-linear (i.e. more expressive) form of PCA.</p>
<p>Here&#8217;s an example of a more complex autoencoder architecture:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!rHWB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76ab6927-bff0-4675-b5d5-72788cc0dc8e_2461x1714.png" alt="A deeper autoencoder with several hidden layers either side of the bottleneck layer"></figure></div><p>You can vary the number and sizes of the hidden layers either side of the bottleneck layer, and it&#8217;s also up to you whether it&#8217;s symmetrical or not. All of these decisions will have some effect on how easy it is to train, and like most things in the world of neural networks, getting this right is a process of trial and error and will depend on the nature of the problem you&#8217;re applying it to.</p><p>The bit before the bottleneck layer is known as the <em>encoder</em> and the bit after it is known as the <em>decoder</em>. Within machine learning, a common use of autoencoders is to do feature reduction &#8212; that is, reduce the number of inputs you need for a machine learning model to a more manageable level. In this case, once the autoencoder has been trained, only the encoder is needed, and it&#8217;s common to just throw away the decoder. However, there are also situations in which the decoder remains useful<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>But more generally, an autoencoder can be viewed as a mapping from one representation of data (the one delivered to the inputs) to another (the one read from the bottleneck layer). Specifically, in the bottleneck layer it learns a target representation which is smaller &#8212; and therefore arguably more fundamental &#8212; than the original representation. This is the sense in which the target representation is commonly referred to as an <em>embedding space</em>, and the process of mapping data into this space is referred to as <em>embedding</em>.</p><h4>Variational autoencoders</h4><p>Let&#8217;s dig a bit further into embedding space and take a peek at a more advanced model known as a <em>variational autoencoder</em>. This is not just concerned with compressing its inputs, but also with creating an embedding space that has particular properties.</p><p>Variational autoencoders are used as <em>generative models</em>. That is, they are used to generate new versions of the things they are trained on. So, imagine an autoencoder was trained on images of cats, i.e.
the input layer receives pixel values<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> and the output layer produces pixel values, and the bottleneck layer contains a compressed representation that somehow captures the key properties of being a cat. You could then use this bottleneck layer not just to compress cats, but also to generate images of cats &#8212; which you could do by picking some random values for the activations in the bottleneck layer and seeing what this generates at the output layer. Given that it was trained on cats, you&#8217;d expect it to produce some sort of random cat image.</p><p>But in practice, using a vanilla autoencoder tends not to work well for this purpose. This is because there&#8217;s nothing in the training process to encourage the embedding space to be organised in a sensible manner. This could mean that all the realistic images of cats end up being embedded within a small part of the space, and sampling from other parts might result in weird images that look nothing like cats. Or they might be distributed in a very uneven way across the space, with most patterns of bottleneck activations producing only a narrow range of output images.</p><p>I won&#8217;t go into detail about how variational autoencoders are trained, but it basically involves encouraging<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> the creation of an embedding that has a sensible and even distribution, so that any random values in the bottleneck layer are likely to generate meaningful outputs. Instead, the thing I want to highlight is that the embedding process is no longer just about compression, but is instead about finding a new representation of the input data that is in some sense meaningful. And this is really at the heart of what most embedding models aim to achieve.</p><h4>Word embeddings</h4><p><em>Word embeddings</em> are all about taking a word and turning it into an efficient numerical representation. Basically, we need word embeddings whenever we want to get text into a machine learning model that only works with numbers. Since this includes most deep learning systems, a particularly common application of word embeddings these days is to use them to get text into transformers like ChatGPT<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><p>The simplest way of representing a word as numbers is to use something called a <em>one-hot encoding</em>. Say you have a vocabulary of 10,000 words; a one-hot encoding would map each word to a vector of 10,000 numbers where one number is a 1 and all the others are 0. For each possible word, the 1 would occur in a different position. Clearly this is not an efficient representation. But more importantly, every word is mapped to a vector with a 1 in an arbitrary position which says nothing about the word or how it relates to other words<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. So it&#8217;s also not a meaningful representation of words.</p><p>Word embeddings are a way of producing more meaningful representations of words, so that words with similar meanings or similar uses generally get mapped to similar embeddings.</p>
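<p>As a toy numerical illustration of the difference (the vectors below are invented for the example), compare one-hot vectors with dense embeddings:</p><pre><code>import numpy as np

vocab = ["cat", "kitten", "car"]              # a tiny made-up vocabulary

# One-hot: each word gets a 1 in its own arbitrary position.
one_hot = np.eye(len(vocab))                  # cat=[1,0,0], kitten=[0,1,0], car=[0,0,1]
print(one_hot[0] @ one_hot[1])                # 0.0 -- "cat" and "kitten" look unrelated

# Dense embeddings: similar words get similar vectors (values invented for illustration).
embed = {"cat":    np.array([0.8, 0.1, 0.6]),
         "kitten": np.array([0.7, 0.2, 0.5]),
         "car":    np.array([0.1, 0.9, 0.0])}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embed["cat"], embed["kitten"]))  # close to 1 -- very similar
print(cosine(embed["cat"], embed["car"]))     # much lower
</code></pre>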
<p>There are various ways of doing this, but one common approach is to use an autoencoder-like architecture which uses a bottleneck layer to provide an encoding of a word received at its inputs. This encoding can then be used to represent the word. Consider the following depiction of <em>word2vec</em>, which is a well-known word embedding method that takes this approach:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!QP84!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8276e58-d12f-42ad-8b3d-8259395a344b_2910x2071.png" alt="Depiction of word2vec, with a one-hot word input, a bottleneck layer providing the embedding, and an output distribution over surrounding words"></figure></div><p>Like most forms of word embedding, word2vec learns an embedding using a large dataset of written text. Each word of each text sample is taken from this dataset in turn, and is input to the model using a one-hot encoding<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. Unlike an autoencoder, the objective during training is not to generate the same values at the output layer as it receives at the input layer. Rather, it is to generate a probability distribution that captures the likelihood of each other word occurring in the surrounding text &#8212; though note that this is not what the trained embedding will actually be used for, and in practice the decoder tends to get thrown away after training.</p><p>During training, this probability distribution can be measured by looking at the actual surrounding text in the current sample, and so a loss function can be formulated that encourages the model to generate this distribution. Doing this over a large number of words in different text samples means that the decoder will eventually learn to generate appropriate distributions over its entire vocabulary. But more importantly, since similar words will tend to occur within similar distributions of surrounding words, these words will end up producing similar activations within the bottleneck layer. And consequently we can expect it to learn an embedding space that in some sense captures the relationships between words in its vocabulary.</p>
<p>In practice, the word embeddings produced by word2vec and other methods tend to be rather opaque, and do not necessarily capture relationships in a way that would be understandable to a human<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. Nevertheless, they enabled a step-change in the ability of machine learning models to process text, and have become a key component within modern approaches to natural language processing &#8212; such as transformers.</p><h4>Transformers</h4><p>I&#8217;ll be covering transformers in a future post, so I won&#8217;t say a lot about them here. However, transformers are relevant to embedding spaces in two different ways. First, they use pre-trained word embeddings, such as word2vec<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>, to encode each word of their input text. Second, they create an embedding of all the text they receive as input.</p><p>This second use is most apparent within the original transformer architecture, which was designed for machine translation (i.e. translating text from one language to another), and consisted of an explicit encoder and an explicit decoder. The encoder was used to create a compressed embedding of the input text, and the decoder was then used to generate equivalent text in a different language. So, the embedding &#8212; which appeared as a pattern of activations within something analogous to a bottleneck layer &#8212; essentially captured the essence of the input text, but within a language-agnostic form which could then be decoded to different languages.</p><p>Modern transformer architectures are often described as being either encoder-based or decoder-based models. That is, they are said to consist of only an encoder or only a decoder, but not both. However, I find this terminology rather misleading, since the goal of the training process for both types of transformer is still essentially to produce an embedding. Take the GPT transformer architecture as an example. This is often referred to as a decoder model, since its goal is to generate text output. Yet it also receives text as input, and prior to its final layer, embeds this text within a numeric feature vector. The final layer then does something akin to word2vec&#8217;s decoder, and turns this embedding into a probability distribution of next words.</p>
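<p>In code terms, that final step looks roughly like this. This is a schematic sketch with invented dimensions, not GPT&#8217;s actual implementation:</p><pre><code>import torch
from torch import nn

vocab_size, d_model = 50_000, 768             # invented sizes, for illustration only

# Stand-in for the embedding of the input text produced by the earlier transformer layers.
text_embedding = torch.randn(1, d_model)

# Final layer: project the embedding onto the vocabulary and apply a softmax,
# giving a (pseudo-)probability for every possible next word.
unembed = nn.Linear(d_model, vocab_size)
next_word_probs = torch.softmax(unembed(text_embedding), dim=-1)

print(next_word_probs.shape)                  # torch.Size([1, 50000]): one value per word
print(next_word_probs.sum())                  # the values sum to one
</code></pre>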
<h4>Too long; didn&#8217;t read</h4><p>Embeddings, aka latent spaces, are used to transform data into a compressed form that captures key information about the data. A common way of generating these embeddings is to use a bottleneck architecture, where a narrow layer within a neural network forces a compressed representation of the input data to be learnt. Embedding models are often trained in a way that encourages this representation to be meaningful in some way. Embeddings are commonly used in generative AI models, including transformers. They&#8217;re also used for feature reduction and anomaly detection, and are generally useful for turning text into numbers.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Though of course there is a trade-off between the amount of compression and the amount of information loss; making the bottleneck too narrow means that it won&#8217;t be able to reconstruct the inputs at the outputs, likely making the model less useful.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>For example, the reconstruction loss &#8212; which is calculated from the decoder&#8217;s output &#8212; can be useful in anomaly detection, since inputs that are less representative of a particular data distribution are more likely to be ignored during training and therefore are less likely to be successfully reconstructed by the autoencoder.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In practice, you&#8217;d probably use a convolutional autoencoder for this, which has convolution layers before and after the hidden layer(s).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In neural network terms, <em>regularising</em>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This is not the only use case. Another common one is sentiment analysis, e.g.
when we want to train a classifier to tell us whether a passage of text is being nice or nasty &#8212;&nbsp;though this too is increasingly being done by transformers like ChatGPT.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>You could try and pick these positions so that related words have 1s in nearby positions, but that&#8217;s not going to be easy, since each word is related to many other words in many different ways.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>That is, there&#8217;s a separate input for each word in the model&#8217;s vocabulary, and only the input corresponding to the input word will have a value of 1; the others will all be 0.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>For a discussion of some of the properties of word embeddings, see <a href="https://arxiv.org/pdf/1804.09692.pdf">this paper</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>These days they generally don&#8217;t use word2vec. The more advanced transformer models tend to include the learning of word embeddings within their broader training process, so they no longer use a separate word embedding model.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Deep Dips #1: Multi-layer perceptrons]]></title><description><![CDATA[A new series exploring the components of deep learning]]></description><link>https://fetchdecodeexecute.substack.com/p/deep-dips-1-multi-layer-perceptrons</link><guid isPermaLink="false">https://fetchdecodeexecute.substack.com/p/deep-dips-1-multi-layer-perceptrons</guid><dc:creator><![CDATA[Michael Lones]]></dc:creator><pubDate>Wed, 27 Mar 2024 09:26:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74be6ba-4e83-4f08-aa4a-7144cb95676b_4555x2190.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Deep learning is quite a thing these days, but it can be hard to find accessible yet factually correct guides to the bits and pieces that make it work. This new series is my attempt to fill this gap. It&#8217;s a deep dip, rather than a deep dive, since I&#8217;m going to focus on the big picture and explain the key ideas, rather than diving all the way down &#8212; though do check out the footnotes if you want to know more.</p><p>I&#8217;m going to start this series by talking about <em>multilayer perceptrons</em> (or MLP), which are fundamental components of most deep learning systems &#8212; or at least those based around neural networks<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Neural networks are composed of artificial neurons. 
A <em>perceptron</em> is a kind of artificial neuron, and an MLP is a neural network architecture created by stacking layers of perceptrons in a feedforward fashion. They were invented by Frank Rosenblatt way back in the middle of the last century<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, yet still play an important role in modern deep learning models such as transformers and convolutional neural networks (CNNs)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><p>To put things in context, this timeline shows just how far back they appear in the history of deep learning &#8212; and indeed how far back many of the things we take for granted first appeared (also see <a href="https://fetchdecodeexecute.substack.com/p/neural-networks-everything-changes">Neural networks: everything changes but you</a>):</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!X86I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc69ca548-fd2c-4985-8041-4f1896898cf6_4317x2131.png" alt="Timeline showing when perceptrons and other key deep learning ideas first appeared"></figure></div><p>I&#8217;ll begin with perceptrons, then move on to MLP, and finally I&#8217;ll say a bit about the <em>universal approximation theorem</em> and what this tells us about the capabilities of MLP.
There&#8217;s a TL;DR at the end for anyone with a limited attention span.</p><p>The following diagram shows how everything fits together, though note that things like the number of hidden layers and the number of inputs to a perceptron are configurable and will depend upon the problem being solved:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!bpnT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74be6ba-4e83-4f08-aa4a-7144cb95676b_4555x2190.png" alt="Diagram showing how perceptrons are arranged into the input, hidden and output layers of an MLP"></figure></div><h4>Perceptrons</h4><p>A perceptron is a simple artificial neuron model that captures only the very basic behaviour of biological neurons.</p><p><strong>Activation functions:</strong> At its heart is an <em>activation function</em> (also known as a <em>transfer function</em>), which is a mathematical function that takes one input and produces one output. This output then becomes the <em>activation</em> of the perceptron. When this is above a certain threshold, the neuron is said to be <em>firing</em>. Activation functions are usually monotonic, meaning that the output never decreases as the input increases, i.e. a larger input will never lead to a smaller output. Most activation functions used these days are non-linear<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, and this non-linearity is an important driver of complexity in neural networks. Most training algorithms also require activation functions to be differentiable<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><p>I&#8217;ll talk about specific activation functions in a future post, but the most common are sigmoidal functions and rectified linear functions. <em>Sigmoidal functions</em> are often described as S-shaped, which basically means that until a certain input level, the output will be low. After this point, the output will grow rapidly &#8212; causing the perceptron to fire &#8212; until <em>saturating</em> at the function&#8217;s maximum output value. Rectified linear functions, by comparison, look like a bent stick; the first part is typically horizontal, and the second part points diagonally upwards. They tend to work better than sigmoidal functions as the number of layers in an MLP increases.</p><p><strong>Weights:</strong> A perceptron receives multiple inputs. An important part of the process is how these multiple inputs are funnelled into the single input expected by its activation function. This is done by calculating a weighted sum. That is, each input has a <em>weight</em> (a number), and this weight is multiplied by the value received at the corresponding input (another number). These weighted inputs are then added together and become the input to the activation function. Weights can be positive or negative.</p>
<p>The process of training neural networks largely involves finding good values for these weights. Usually they start off with small random values<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. Then, over the course of training, they&#8217;re gradually nudged towards more appropriate values &#8212; but more about training in a future post.</p><p><strong>Biases:</strong> Perceptrons also have something called a bias. This is simply a value that&#8217;s added to the weighted sum of the inputs before it&#8217;s fed into the activation function. A bias acts as an offset, increasing or decreasing the input level required for a perceptron to fire. The bias is trained alongside the weights, and the weights and biases of the whole MLP are called its <em>parameters</em>.</p><h4>Multilayer perceptrons</h4><p>MLPs are formed by stacking layers of perceptrons in a feedforward manner. That is, the output activations of one layer of perceptrons typically<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> become the inputs to the next layer. There are various ways of connecting them together, but a common approach is for all the outputs of one layer to become inputs to each perceptron in the next layer, i.e. each perceptron has a number of inputs equal to the number of perceptrons in the previous layer.</p><p><strong>Input layer:</strong> The first layer, known as the <em>input layer</em>, is different. It&#8217;s just there to receive and transmit the MLP&#8217;s inputs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> further into the MLP. In fact, it doesn&#8217;t usually contain perceptrons; it just contains basic nodes that each receive one input and make it available to the perceptrons in the next layer. There are usually no weights, biases or activation functions.</p><p><strong>Output layer:</strong> The final layer is known as the <em>output layer</em>. This contains one or more perceptrons, the number of which depends on the kind of problem that&#8217;s being solved. These perceptrons often use different activation functions to those in other layers. For instance, if the MLP is being used to solve a multi-class<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> classification problem, there would typically be one perceptron in the output layer for each class, and these would use a <em>softmax</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> activation function. If it&#8217;s solving a regression problem, there&#8217;d usually be a single perceptron, and this would use a linear activation function<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>.</p><p><strong>Hidden layers:</strong> The layers in-between are known as the <em>hidden layers</em>, and this is where most of the business of the neural network gets done.
One of the most important decisions when choosing an MLP architecture is the number of hidden layers to use, since this largely determines its capabilities. An MLP is usually considered to be &#8220;deep&#8221; when it has more than one or two hidden layers.</p><p>Once an MLP has been assembled and trained, it is executed in a synchronous manner, i.e. everything in a particular layer gets updated at the same time. The external inputs are first sucked into the input layer. The outputs of the input layer are then copied into the inputs of the first hidden layer. Once this is done, each of the perceptrons in this layer does its thing, and their outputs are then copied to the inputs of the next layer. And so on, until the output layer gets some outputs. This process is often referred to as a <em>forward pass</em>, to distinguish it from the <em>backward pass</em> that is used in training<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>.</p><h4>Universal approximation theorem</h4><p>The success of MLPs has a lot to do with the <em>universal approximation theorem</em>. This says that, given a few conditions, an MLP is capable of approximating any continuous function to an arbitrary degree of accuracy. This entails that they can capture pretty much any relationship between a group of inputs and a group of outputs, which makes MLPs a good basis for doing machine learning.</p><p>The conditions are pretty straightforward:</p><ul><li><p>First, there must be at least one hidden layer &#8212; <em>check!</em></p></li><li><p>Second, the activation functions must be non-linear &#8212; <em>check!</em></p></li><li><p>Third, there must be sufficient neurons in the hidden layer &#8212; <em>erm&#8230;</em></p></li></ul><p>So, how many neurons is sufficient? Well, the theorem says that an infinite number is definitely sufficient, but that&#8217;s not very useful guidance. Beyond that, no one really knows, and an important part of applying neural networks is determining whether you have enough neurons to solve the problem you&#8217;re applying it to. Usually this is a process of trial and error. A related question is how these neurons should be divided into hidden layers. The theorem doesn&#8217;t say you need more than one hidden layer, but in practice it&#8217;s hard to train a complex mapping if you don&#8217;t have more than one<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>.</p><p>And this raises another important limitation of the universal approximation theorem: it doesn&#8217;t say anything about training. Yes, it may say that a large enough MLP architecture with the right parameters can express a solution to the problem you&#8217;re trying to solve, but it doesn&#8217;t say whether a particular training algorithm can actually find the right parameters. But that&#8217;s a topic for another time.</p><h4>Too Long, Didn&#8217;t Read</h4><p>Despite being ancient, perceptrons and multilayer perceptrons (MLP) still play a big part in modern deep learning approaches. The nice thing about them is that they really are quite simple. A perceptron is just a non-linear function with weighted and summed inputs, and an MLP is just a bunch of perceptrons connected together in layers. In theory, we know they are capable of universal function approximation, which makes them a good bet for machine learning.
However, getting the architecture right can be a fiddly process, and there&#8217;s no guarantee that a training algorithm will find the correct parameters.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Which is pretty much all of them, though the term deep learning is sometimes applied to deep versions of other machine learning models, such as decision trees and SVMs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Building upon the earlier work of McCulloch and Pitts in the 1940s.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>After applying convolution to extract features, a CNN normally feeds into an MLP to do the bulk of the decision making. In transformers, each block typically contains a self-attention layer followed by a couple of layers of MLP. More on this in future posts.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>A lot of internet sources (including Wikipedia) state that perceptrons use linear activation functions. This is true of the first perceptron model, but it&#8217;s not true of later ones, including those used in Rosenblatt&#8217;s MLP.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>-ish. Rectified linear functions aren&#8217;t differentiable at the bend.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>There are various strategies for picking the initial values. This is somewhat tied to the choice of activation functions and optimiser, but typically you&#8217;d want the mean to be around zero, with most of the weights small to begin with, say between -0.2 and 0.2.
Biases often start off at 0, but again this depends on activation functions and optimisers.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>It is possible to have skip connections, where an output is routed to a layer beyond the next one. This is a big thing in modern deep learning models.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Which, in modern neural networks, will probably come from another part of the model, e.g. in a CNN, they&#8217;ll be the outputs of the convolution layers, and in a transformer they normally come after self-attention layers.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>For a binary classification problem, only one output perceptron is required, usually with a sigmoidal activation function.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>One softmax function is shared between the perceptrons of the output layer, and transforms the set of inputs that usually go to their individual activation functions (i.e. weighted sum + bias) into a set of pseudo-probability values, which then become the outputs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Often the identity function, i.e. the output is just the weighted sum of inputs plus the bias.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>At least for gradient-based optimisers.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>It&#8217;s generally thought that MLPs build up progressively more complex representations of the input data over each subsequent layer. One other thing to bear in mind is that you do need at least one hidden layer: connecting the inputs directly to the output layer will only allow the MLP to solve linearly-separable problems.</p><p></p></div></div>]]></content:encoded></item></channel></rss>