It’s been a while since I last wrote a Deep Dips post, so I’m going to broach another topic in the area of deep learning and LLMs that is becoming increasingly talked about: Mechanistic Interpretability, or MI to its friends. But first, here’s a quick reminder of previous posts in the series:
For those new to this, the idea is to introduce concepts in an approachable manner, with each post generally building on the previous ones.
Why we need MI
LLMs give the impression of being intelligent, but scratch the surface and you’ll find all manner of hallucinations, shortcut learning, fragile heuristics and prompt sensitivities. All of which makes taking their output at face value a risky proposition. Wouldn’t it be nice if we could gain some insight into their actual reasoning, to give us confidence that they’re not deceiving us? Well, we can, to an extent, using MI.
But wait a minute, don’t we already have reasoning LLMs which tell us the thinking behind what they do? Yeah, in theory, but that thinking is just more tokens generated by the model, which may (and often do) have little to do with how the LLM is actually reasoning1. In order to understand how an LLM is actually reasoning, you have to go beyond its outputs and see what it’s doing inside. You know, in the billions, or sometimes trillions, of neuron activations that lead to each output token. Sounds challenging? Well, it is, but the folk who work on MI have come up with some pretty interesting tools for probing this low-level behaviour.
The linear representation hypothesis
The first thing to know is the linear representation hypothesis. This theorises that human-interpretable concepts captured within the internal state of LLMs are most likely to be encoded as linear vectors: straight-line directions within the model’s latent space. Without going into any detail, this follows from the observation that linear representations are the easiest, and therefore most likely, to be learnt2.
When I talk about an LLM’s internal state, in general I’m not referring to the LLM’s entire internal state, i.e. every single activation of every single neuron in every single layer. That would be a lot of activations. Instead, most MI techniques focus on the state at the outputs of a particular transformer block3. Sometimes this is the last block before the transformer’s outputs, since this has already extracted everything that is needed to predict the next token. Sometimes it’s the middle block, which some people think offers a better trade-off between being too focused on next token prediction and having done enough useful processing of the input4. Either way, this state is basically a long list of numbers.
Anyway, the linear representation hypothesis basically says that this list of numbers is a linear superposition of all the concepts that the LLM has derived from its context window, all firing at different magnitudes. And the good thing about linearity is that it makes it relatively easy to break down this list of numbers and extract information about the magnitude of each concept.
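A toy example makes this concrete. Below, an internal state is built as a weighted sum of concept directions, and because those directions are orthonormal, a simple dot product recovers each concept’s magnitude. Everything here (the dimensions, the concepts, the magnitudes) is invented purely for illustration:

```python
import numpy as np

# Toy illustration of the linear representation hypothesis.
# All vectors and concept names are made up for this example.
rng = np.random.default_rng(0)
dim = 64  # dimensionality of our toy representation space

# Three hypothetical concept directions, made exactly orthonormal via QR
concepts, _ = np.linalg.qr(rng.standard_normal((dim, 3)))
concepts = concepts.T  # shape (3, dim): one unit-length direction per row

# An internal state where "cheese" fires strongly, "yellow" weakly,
# and "stinky" not at all
magnitudes = np.array([2.5, 0.7, 0.0])
state = magnitudes @ concepts  # linear superposition of the concepts

# Because the directions are orthonormal, dot products recover the
# magnitude of each concept from the mixed-together state
recovered = concepts @ state
print(np.round(recovered, 1))  # [2.5 0.7 0. ]
```

In a real LLM the concept directions are not known in advance and are unlikely to be exactly orthogonal, which is precisely why the probing techniques below are needed.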
Training linear probes
But first you need to identify the vectors associated with each concept. And this is where the various MI techniques come in, since they’re designed to help us find them. The simplest MI approach uses a linear classifier: a very basic kind of classifier whose output is based on a weighted combination of its inputs. It’s pretty much the same thing as a Perceptron (see Deep Dips #1), i.e. a single neuron in an MLP layer.
The idea is to use this linear classifier — sometimes referred to as a probe — to identify how a single concept is encoded within the LLM’s state. So, let’s say we want to do this for the concept of cheese. We assemble a bunch of cheese-themed prompts, e.g. “What is your favourite cheese?”, “How many Babybels would it take to fill the Albert Hall?” and so on. We then collect the corresponding set of internal states that occur when the LLM is fed with these prompts, and these become the positive class in our training data. And we repeat this process with prompts that don’t mention the theme of cheese in any way, collect the internal states, and use these for our negative class. The linear classifier is then trained to separate these two sets of states. And since we’re assuming the underlying signal is linear, this should identify the vector in the LLM’s representation space that corresponds to cheese.
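Here’s a minimal sketch of that probe-training process, using synthetic vectors in place of real LLM activations. The “cheese direction”, dimensions, sample counts and noise levels are all invented; the point is just that a logistic-regression probe trained to separate the two classes ends up pointing along the concept direction:

```python
import numpy as np

# Sketch of training a linear probe on synthetic "activations".
rng = np.random.default_rng(0)
dim, n = 32, 200

# A hidden concept direction we'll pretend the LLM uses for "cheese"
cheese_dir = rng.standard_normal(dim)
cheese_dir /= np.linalg.norm(cheese_dir)

# Positive class: states from cheese-themed prompts (concept firing);
# negative class: states from unrelated prompts
pos = rng.standard_normal((n, dim)) * 0.3 + 2.0 * cheese_dir
neg = rng.standard_normal((n, dim)) * 0.3
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic-regression probe trained with plain gradient descent
w, b = np.zeros(dim), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)   # clip logits for numerical safety
    p = 1.0 / (1.0 + np.exp(-z))      # sigmoid
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

# The probe's weight vector should point along the hidden direction
cosine = w @ cheese_dir / np.linalg.norm(w)
print(f"cosine similarity with true direction: {cosine:.2f}")
```

With real activations the recipe is the same: stack the collected states into `X`, label them by whether the prompt was on-theme, and read the concept direction off the trained weights.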
But doing this to map out all the concepts learned by an LLM would be onerous. We first have to come up with a list of concepts that we think are relevant, generate an appropriate set of prompts for each one, collect the corresponding states for each of these, and then train a fresh linear model each time. That’s a lot of work, and a lot rests on our ability to come up with a sensible list of concepts in the first place, especially given that our idea of what represents a concept might not neatly align with the LLM’s. I know it’s hard to imagine, but the LLM might not even have a concept for cheese. It may instead have a bunch of concepts for yellow things, stinky things, and things that are often round, which fire in concert when a prompt mentions cheese, but not one specifically for cheese.
Sparse autoencoders
But MI comes to our rescue again, using something called sparse autoencoders (SAEs). I talked about autoencoders in Deep Dips #2: Embedding and latent spaces. An autoencoder is a kind of neural network that is trained to reconstruct its input at its output, via a hidden bottleneck layer. Its outputs are equal in number to its inputs, and its loss function measures how well the outputs match the inputs. The purpose of the bottleneck layer is to force it to project the inputs into a lower-dimensional latent space, i.e. to compress them.
But in the case of an SAE, the hidden layer is no longer a bottleneck. Instead, it’s substantially bigger than the input and output layers5. And the loss function doesn’t just maximise reconstruction accuracy; it also minimises how many neurons in the hidden layer fire for each set of inputs. In effect, it’s trying to train a whole bunch of linear models (one for each hidden neuron) at the same time, and do so in an unsupervised manner. To train it, you just need to churn the internal states from a lot of prompts containing a lot of concepts through it, and over time it will learn to separate and characterise the vectors corresponding to each concept. Pretty neat.
Cataloguing semantic concepts
Using this approach, various people have discovered how commonly-used open weight LLMs, such as Meta’s Llama and Google’s Gemma, encode concepts within their internal states. A lot of these have been recorded on the website Neuronpedia. And it’s fair to say that the concepts are diverse. For example, they include concrete concepts like coffee, but also more abstract things like positivity or uncertainty, plus much more obscure things that are hard to put a label on6. The approach has also been applied to Anthropic’s commercial model, Claude Sonnet, and there’s a nice write-up of what they discovered on their website.
So, using an SAE, you can learn how a whole bunch of concepts are encoded within the state of an LLM. Given a new prompt, you can then measure how much each of these concepts is being triggered, and this gives some insight into how the LLM is actually interpreting the prompt.
Steering LLM behaviour
An example of a practical use of semantic concepts is detecting jailbreaks. A jailbreak occurs when a user manages to convince an LLM to do something it’s explicitly trained not to do, e.g. use bad language or provide illegal information. By collecting a bunch of prompts containing jailbreaks, and a bunch of prompts that don’t contain jailbreaks, it’s possible to identify the concept (or concepts) that are triggered when a user attempts to jailbreak an LLM. These concepts can then be monitored during use.
But more importantly, if a concept is triggered, it can also be steered. This involves dampening the neuron activations that underlie the concept, i.e. replacing the current outputs with lower values, which then continue their journey through the model. In various published studies7, this kind of thing has proved quite effective at preventing jailbreak attacks, and could be similarly applied to any other behaviour that you want to actively prevent from happening.
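In its simplest form, this kind of steering is just vector arithmetic: subtract some multiple of the state’s component along the concept direction before letting it continue through the model. A sketch with synthetic vectors (the “jailbreak direction” here is invented for illustration):

```python
import numpy as np

# Sketch of activation steering: dampening a state's component along a
# concept direction. All vectors here are synthetic stand-ins.
rng = np.random.default_rng(0)
dim = 64

jailbreak_dir = rng.standard_normal(dim)
jailbreak_dir /= np.linalg.norm(jailbreak_dir)  # unit-length direction

# An internal state in which the concept is firing strongly
state = rng.standard_normal(dim) + 5.0 * jailbreak_dir

def steer(x, direction, strength=1.0):
    # Remove `strength` times the component of x along `direction`
    return x - strength * (x @ direction) * direction

steered = steer(state, jailbreak_dir)
print(state @ jailbreak_dir, steered @ jailbreak_dir)  # second ≈ 0
```

With `strength=1.0` the concept’s component is removed entirely; smaller values merely dampen it, and negative values would amplify it, which is the same mechanism used to boost a behaviour rather than suppress it.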
Limitations of MI
However, the optimism generated by these kinds of studies should be tempered with an understanding of the limitations of current MI approaches. The semantic concepts discovered by MI can sometimes be fragile, they don’t always generalise beyond collections of specific prompts, and they sometimes don’t align with human-interpretable concepts. All of this can limit their practical utility, especially when it comes to steering LLM behaviour. As an illustrative example, the authors of this paper described a coffee feature present in Meta’s Llama model. Although it does often trigger upon the mention of coffee across various languages, it doesn’t always trigger, and it also sometimes triggers for unrelated concepts, such as the word “coffin”.
A more fundamental limitation of MI is that it requires access to the internal activations of an LLM. This is not a big problem if you’re hosting an open weight LLM on your own machine, but it’s an insurmountable obstacle if you’re using a remotely-hosted commercial model. There’s nothing stopping commercial developers from applying MI to their models and then sharing the information with us, but in practice they’re not likely to do this directly with users. More likely they’ll use MI to better understand their own models and improve their behaviour.
Too long; didn’t read
The only way of getting reliable insight into an LLM’s behaviour is to probe its internal state. MI is a group of techniques for doing this. Central to these is the linear representation hypothesis, which says that semantic concepts are encoded as linear vectors. These can be extracted individually by training linear classifiers, or they can be extracted en masse using sparse autoencoders. Their activation can then be monitored, or the behaviour of the LLM can be steered by manipulating their activation. Semantic concepts extracted through MI methods are not always robust, generalisable or meaningful to humans. Nevertheless, they still provide valuable insight into the workings of the LLM black box, and are an important tool in trying to make LLMs more dependable. Which matters, because LLMs are finding their way into all manner of contexts where dependability is essential.
1. Also, most commercial models hide the thinking tokens, presumably to stop their competitors from using them to train models.
2. See this paper for an approachable introduction.
3. For a refresher on the architecture of a transformer, see Deep Dips #3: Transformers.
4. Anthropic focused on the middle block when applying MI to Sonnet.
5. Exactly how much bigger is another design decision, and requires some appreciation beforehand of how many concepts you need/want to learn. Anthropic used up to 34M.
6. Which may make you wonder: how do these concepts get labelled? One approach is to get a human to look at what the prompts that trigger a particular concept have in common. But this is tedious work, so it’s increasingly being offloaded to LLMs.