Previously in this series, I’ve discussed neural networks, transformers, and how you train them, covering pretty much all you need to know to build and train an LLM. But most people neither want nor need to build and train their own LLM, since pre-trained models are readily available. These can be downloaded, or accessed through services such as ChatGPT, and are ready to be applied to downstream tasks without the need for further training¹.
However, it’s not straightforward to apply an LLM to a particular task. Unlike classical machine learning approaches, there is no simple mapping between the information you provide to an LLM and the response it produces. Instead, the manner in which you deliver input to the model — that is, how you prompt it — can have a profound effect on its output, and consequently its ability to solve the task you’re applying it to. LLMs care about which words you use, they care about the order in which you say things, they care about how many prompts you use to say things, they care about what you said in previous prompts, they even care about punctuation. Minor alterations to any of these things can lead to major alterations in their responses, a phenomenon known as prompt sensitivity.
The process of finding how to deliver the input in the right format to get the task done well has become known as prompt engineering. But don’t let the term engineering fool you. This is not engineering in the sense of a well-developed and well-understood process that uses mature tools. It’s more akin to tinkering, and this reflects the current Wild West nature of LLMs. Unlike the topics I’ve discussed earlier in this series, pretty much everything in LLMs is new, and we’re only just starting to understand how to use them productively.
In this post, I’m going to talk through some of the main techniques for doing prompt engineering. As usual, I’m going to focus on the big picture rather than try to be exhaustive.
Keep asking the same thing
I’m going to start with something obvious, but often overlooked. Ask your LLM the same thing multiple times. If you read my post about transformers, you’ll know that the output layer is probabilistic — essentially, for each word it knows, it will return the probability that this should be the next word in the output. These probabilities are then turned into an actual word through a sampling process which favours words with high probability, but occasionally picks words with lower probabilities. By tweaking a setting known as the temperature, you can change the shape of this distribution, controlling how strongly the sampling favours the most probable words.
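To make that concrete, here’s a minimal sketch in Python of how temperature reshapes the distribution before a word is sampled. The three-word vocabulary and logits are made up for illustration; real models do this over tens of thousands of tokens.

```python
import numpy as np

def sample_next_word(logits, vocab, temperature=1.0, rng=None):
    """Sample a word from a model's output logits at a given temperature."""
    rng = rng or np.random.default_rng()
    # Dividing the logits by the temperature sharpens (<1) or flattens (>1)
    # the distribution before the softmax turns it into probabilities.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())   # shifted for numerical stability
    probs /= probs.sum()
    return rng.choice(vocab, p=probs)

# Toy example: three candidate next words with made-up logits.
vocab = ["six", "seven", "eight"]
logits = [2.0, 1.0, 0.5]
print(sample_next_word(logits, vocab, temperature=0.2))  # almost always "six"
print(sample_next_word(logits, vocab, temperature=2.0))  # much more varied
```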
The upshot is that if you give the same LLM the same prompt in two different sessions, you’ll likely get a different answer. If the temperature is set low, these answers will be similar. If it’s set high, they may be quite different. In most LLMs, the temperature is set to an intermediate level by default, so there will be a fair amount of diversity in its responses to the same prompt. This means that you can easily collect a bunch of different answers to the same prompt, and doing so can lead to more robust insights. For example, when asking an LLM a yes/no question, you can use the majority response over several repeats — something akin to ensemble or wisdom-of-crowds models in machine learning.
By way of example, here are the responses to the same prompt in five different fresh sessions using a pocket-sized version of Gemma 2², Google’s open LLM:
Prompt: Tell me precisely how many ones there are in the string 100101111000.
Response 1: There are **four** ones in the string "100101111000".
Response 2: There are **8** ones in this string.
Response 3: There are **5** ones.
Response 4: There are **7** ones in this string.
Response 5: There are **7** ones in this string.
So, it got it wrong every time, but if you sum up the responses and divide by 5, you get 6.2, which is not far from the correct answer of 6.
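If you want the majority-vote version of this rather than an eyeballed average, a minimal sketch looks something like the following, where `query_llm` is a placeholder for whichever model or API you’re actually calling.

```python
from collections import Counter

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (an API or a local model)."""
    raise NotImplementedError

def majority_answer(prompt: str, n_repeats: int = 5) -> str:
    """Ask the same question several times and return the most common response."""
    answers = [query_llm(prompt).strip().lower() for _ in range(n_repeats)]
    return Counter(answers).most_common(1)[0][0]

# e.g. majority_answer("Is there an even number of ones in 100101111000? Answer yes or no.")
```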
Variants of this technique include ensembling the responses from slightly different prompts, e.g. with different sentence ordering or word usage, and ensembling the responses from multiple LLMs, i.e. giving the same prompt to different LLMs. Both have been found to significantly increase robustness, where applicable.
Big up your LLM
One of the surprising findings of prompt engineering is that you can significantly improve the quality of a response by telling the LLM that it knows what it’s doing. In retrospect, it’s not entirely surprising. After all, LLMs were not explicitly trained to give good answers to questions; they were trained to learn the patterns underlying a large amount of training data, some of which is likely to be wrong or of dubious quality. So, telling your LLM that it’s competent at some task may help point it towards the more reliable parts of its knowledge base. This is akin to telling a dog “Who’s a good boy? You’re a good boy!” prior to asking it to complete an assault course. Though in the case of an LLM, it’s typically done by saying “You are an expert in <task domain>” before giving it the task you want it to fulfil. However, as usual, the exact wording does matter, and small changes to these encouraging statements might lead to significant differences in the outcome. But such is the way of LLMs.
Here’s an example of this simple tweak in action:
Prompt: You are an expert in binary numbers. Tell me precisely how many ones there are in the string 100101111000.
Response: There are **6** ones in this string.
Correct, but in 5 repeats, it only got it right this one time, with an average of 7.4. So, not great, but still better than when I tried “Who’s a good boy? You’re a good boy!”, which resulted in no correct answers and an average of 9.4.
Putting things in context
This idea of pointing an LLM towards appropriate parts of its knowledge base is also leveraged by a group of approaches called in-context learning. These work by providing examples within the prompt, and this has been shown to improve the accuracy of an LLM’s responses. Though again, things like where you place the examples in the prompt and the words you use to introduce them can have a significant effect.
A simple example:
Prompt: Tell me precisely how many ones there are in the string 000001111001.
Answer: 5
Tell me precisely how many ones there are in the string 100101111000.
Answer:
Response: 6
When I repeated this 5 times, it got it right 4 times, so this seems to be a more robust approach for this particular task.
There’s been a fair amount of research on how many examples you should give, and this has led to a distinction between zero-shot, one-shot and few-shot learning. As you might guess, these involve giving the LLM no examples, one example, or more than one. Generally speaking, there’s a sweet spot somewhere, but this is likely to be task-specific, and you’ll have to work to find where it is. For some tasks, it makes sense to give positive and negative examples, but there’s also work to do in finding an appropriate balance between these, since both the balance and the order in which you present them can have a significant impact on how well this works.
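As a rough sketch of how these prompts get assembled, the function below stitches however many worked examples you have into a single prompt: an empty list gives you zero-shot, one pair gives you one-shot, and so on. The example pair is the one from the counting task above.

```python
def build_few_shot_prompt(examples, question):
    """Assemble a prompt from (question, answer) examples followed by the real question."""
    parts = [f"{q}\nAnswer: {a}\n" for q, a in examples]
    parts.append(f"{question}\nAnswer:")
    return "\n".join(parts)

examples = [
    ("Tell me precisely how many ones there are in the string 000001111001.", "5"),
]
print(build_few_shot_prompt(
    examples,
    "Tell me precisely how many ones there are in the string 100101111000.",
))
```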
There’s also been research on which examples you should include, and this has led to an approach called retrieval augmented generation, or RAG. This involves an external database of examples that are queried to find those which are most similar to text in the prompt. The matches are then added to the prompt as examples, and this seems to work better than using generic examples. RAG is also used more generally to align LLMs to particular domains. A typical approach is to gather a bunch of domain-specific documents (manuals, technical papers, etc.), split these into fragments of text to store in the RAG database, and then use matching algorithms to identify appropriate fragments of text that can be embedded in prompts.
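Here’s a bare-bones sketch of that retrieval step, assuming an `embed` function that stands in for whatever embedding model you’d actually use: the stored fragments closest to the query are found by cosine similarity and prepended to the prompt.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a real embedding model (e.g. a sentence encoder)."""
    raise NotImplementedError

def retrieve(query: str, fragments: list[str], k: int = 3) -> list[str]:
    """Return the k fragments whose embeddings are most similar to the query."""
    q = embed(query)
    scores = []
    for frag in fragments:
        v = embed(frag)
        scores.append(float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))))
    top = np.argsort(scores)[-k:][::-1]     # indices of the k best matches
    return [fragments[i] for i in top]

def build_rag_prompt(query: str, fragments: list[str]) -> str:
    """Prepend the retrieved fragments to the user's question."""
    context = "\n".join(retrieve(query, fragments))
    return f"Use the following context to answer the question.\n\n{context}\n\nQuestion: {query}"
```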
Explain yourself!
As an academic, when I’m writing exam questions, I won’t just ask students what the answer is; I’ll also ask them to explain why it’s the answer. This gives me more confidence in their reasoning abilities, and also helps to rule out guessing. And, it turns out, the same is true of LLMs. If you ask them to reason about their answer, then they’re more likely to give you a correct answer. The simplest way to do this is to slap “Let’s think step by step” on the end of your prompt — this encourages the LLM to not only give an answer, but also explain how it got there.
By way of example:
Prompt: Tell me precisely how many ones there are in the string 100101111000. Let’s think step by step.
Response: Here's how to count the ones:
1. **Initialization:** We start with a counter initialized to zero. This will keep track of the number of ones we find.
2. **Scanning the string:** We carefully go through each digit in the string 100101111000.
3. **Identifying Ones:** For each digit, we check if it is a '1'. If it is, we increment our counter.
4. **Counting:** Let's apply this to your example:
* 100101111000
* We encounter the digit '1' at positions 1, 2, 3, 6, 7, and 8.
* So, we increment our counter a total of 6 times.
5. **Final result:** Our counter should now reflect the number of ones in the string.
**Therefore, there are 6 ones in the string 100101111000.**
However, this simple approach doesn’t always work. In this particular task, it only worked this once — the other 4 times it just told me how to solve the problem, not what the answer was.
To address this, people have explored a range of more involved ways of encouraging LLMs to think about what they’re doing. Currently the most popular of these is chain of thought (or CoT) reasoning, which builds on the idea of in-context learning described above by presenting the LLM with an example of how to think. This typically takes the form of a question, a reasoning process, and the correct output, all embedded in a single prompt. The LLM is then asked a related question.
Here’s an example:
Prompt: Tell me precisely how many ones there are in the string 100101010010.
Answer: To calculate the number of ones in a string, you have to read it left to right, character by character, and sum up the characters that are 1. In this case, the first character is a 1, so the sum is initially 1. The second character is a 0, so the sum stays as 1. The third character is a 0, so the sum stays as 1. The fourth character is a 1, so the sum increases to 2. And so on. The answer is 5.
Tell me precisely how many ones there are in the string 100101111000.
Answer:
Response: Here's how to count the ones in the string "100101111000":
1. **Read Left to Right:** Just like before, we analyze each character individually.
2. **Sum Up Ones:** For each character, if it's a '1', add 1 to our running total.
Let's break down the process:
* First Character: 1 - Add 1 to our sum.
* Second Character: 0 - No change in the sum.
* Third Character: 1 - Add 1 to the sum.
* Fourth Character: 1 - Add 1 to the sum.
* Fifth Character: 1 - Add 1 to the sum.
* Sixth Character: 1 - Add 1 to the sum.
* Seventh Character: 1 - Add 1 to the sum.
* Eighth Character: 0 - No change in the sum.
* Ninth Character: 0 - No change in the sum.
* Tenth Character: 0 - No change in the sum.
**Answer:** The total number of ones is **8**.
As you can see, this didn’t work out. All 5 attempts gave the wrong answer, with an average of 8.4. On the bright side, it did seem to pick up on the general idea described in the reasoning process — it just wasn’t very good at following it!
But beyond not always working, the limitation of CoT is that the prompt writer needs to come up with a good example of the thought process behind answering a similar question, and also needs to know the answer to this similar question — which may defeat the point of asking an LLM in the first place. A range of techniques have been dreamt up to at least partially address these problems. One approach is to automatically mine for similar questions. Auto-CoT, for example, does this by grabbing a bunch of similar questions from some dataset or knowledge base and then using the “Let’s think step by step” hack to generate answers and reasoning chains. Other approaches blend CoT with RAG in order to mine external knowledge sources for relevant question-answer pairs.
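A hedged sketch of that Auto-CoT idea, again with `query_llm` as a placeholder: take some mined questions, let the LLM generate a reasoning chain for each via the “Let’s think step by step” trick, then use those chains as the demonstrations in a chain-of-thought prompt for the real question.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError

def auto_cot_prompt(mined_questions, real_question):
    """Build a chain-of-thought prompt from automatically generated reasoning chains."""
    demos = []
    for q in mined_questions:
        # Let the LLM write its own reasoning chain for each mined question.
        reasoning = query_llm(f"{q} Let's think step by step.")
        demos.append(f"{q}\nAnswer: {reasoning}\n")
    return "\n".join(demos) + f"\n{real_question}\nAnswer:"
```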
Employ a fact-checker
As the last example illustrated, LLMs don’t always get things right, even with gratuitous use of prompt engineering. One way to address this is to use an LLM to fact check its own output:
Prompt: Consider the following prompt:
<Prompt from the last example>
Now consider the following response from an LLM:
<Response from the last example>
Was the LLM’s conclusion correct?
Response: No, the LLM's conclusion was incorrect. Here's why:
The LLM correctly performs the steps described but fails to accurately count the ones.
There are 7 ones in the string "100101111000".
The key takeaway is that the LLM needs a better understanding of what it means to "count" or "sum." It should recognize the pattern and not just mechanically apply the steps for each character.
That’s some tough self-criticism. But as you can see, although it correctly identified some of what was wrong, it still couldn’t work out the right answer.
So, let’s give another LLM the same query — in this case GPT-4o:
Response: Let's verify the response step by step:
Read Left to Right: We analyze each character individually.
Sum Up Ones: For each character, if it's a '1', add 1 to our running total.
Here's the breakdown:
First Character: 1 - Add 1 to the sum (Total: 1)
Second Character: 0 - No change in the sum (Total: 1)
Third Character: 0 - No change in the sum (Total: 1)
Fourth Character: 1 - Add 1 to the sum (Total: 2)
Fifth Character: 0 - No change in the sum (Total: 2)
Sixth Character: 1 - Add 1 to the sum (Total: 3)
Seventh Character: 1 - Add 1 to the sum (Total: 4)
Eighth Character: 1 - Add 1 to the sum (Total: 5)
Ninth Character: 1 - Add 1 to the sum (Total: 6)
Tenth Character: 0 - No change in the sum (Total: 6)
Eleventh Character: 0 - No change in the sum (Total: 6)
Twelfth Character: 0 - No change in the sum (Total: 6)
The correct total number of ones in the string "100101111000" is 6, not 8. The response provided by the LLM contains an error in counting the ones.
And this shows the benefit of using more than one LLM. Or perhaps just the benefit of using a much bigger LLM³.
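A minimal sketch of wiring this up, with `query_model_a` and `query_model_b` as placeholders for two different LLMs (or one small model and one larger one), using the same fact-checking prompt structure as above:

```python
def query_model_a(prompt: str) -> str:
    """Placeholder for the LLM doing the work."""
    raise NotImplementedError

def query_model_b(prompt: str) -> str:
    """Placeholder for the (ideally larger) LLM doing the checking."""
    raise NotImplementedError

def answer_with_fact_check(task_prompt: str) -> tuple[str, str]:
    """Get an answer from one model and a verdict on that answer from another."""
    answer = query_model_a(task_prompt)
    check_prompt = (
        f"Consider the following prompt:\n{task_prompt}\n\n"
        f"Now consider the following response from an LLM:\n{answer}\n\n"
        "Was the LLM's conclusion correct?"
    )
    verdict = query_model_b(check_prompt)
    return answer, verdict
```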
Don’t do it yourself
So far, I’ve covered a range of different prompt optimisation techniques. These can be used individually, or combined, and each of them has various degrees of freedom, i.e. different ways you can tweak them, each of which may have a significant effect on the outcome. This quickly adds up to a lot of choices. You could spend a lot of time working your way through these in order to find the one that works best. Or you could attempt to automate this process, using your computer to do the hard work for you — or at least some of the hard work.
Various methods have been developed for this. I’ll refer to them as automatic prompt optimisation, but in practice they’ve been branded in a number of different ways. The general idea is that you provide an initial prompt, and some optimisation process then kicks in and explores alternative prompts, returning the best one. The way in which this exploration is done varies markedly between approaches. Some require detailed knowledge of the LLM’s underlying transformer model and/or its training data, and use techniques like gradient descent and reinforcement learning to tweak the initial prompt⁴ in directions that minimise some error function. That is, they leverage the same kind of techniques that are used to train transformers in the first place.
More recent approaches tend to have a lot less dependence on the underlying model and instead use the LLM itself to assist with prompt refinement. AutoPrompt, for example, requires no knowledge of the underlying transformer model, and instead uses the host LLM to generate new prompts and evaluate these on synthetic data which it also generates. However, it still requires some involvement from the user.
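To give a flavour of this LLM-in-the-loop style of search (this is a generic sketch, not AutoPrompt itself), the following assumes a `query_llm` placeholder and a small set of test cases with known answers: the LLM proposes rewrites of the current prompt, each rewrite is scored, and the best one survives.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError

def score_prompt(prompt_template: str, test_cases) -> float:
    """Fraction of (input, expected_answer) test cases the prompt gets right."""
    correct = sum(
        expected in query_llm(prompt_template.format(input=text))
        for text, expected in test_cases
    )
    return correct / len(test_cases)

def optimise_prompt(initial_prompt: str, test_cases, rounds: int = 3) -> str:
    """Greedy search: ask the LLM for rewrites and keep whichever scores best."""
    best, best_score = initial_prompt, score_prompt(initial_prompt, test_cases)
    for _ in range(rounds):
        candidate = query_llm(
            "Rewrite the following prompt so that an LLM answers it more accurately. "
            "Keep the {input} placeholder.\n\n" + best
        )
        candidate_score = score_prompt(candidate, test_cases)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```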
The best is yet to come…?
Prompt engineering is still in its infancy, and this is reflected in the hit-and-miss results from the example prompts shown above. In the future, it seems likely that LLMs will move away from manual design of prompts to more automated means. At the same time, they will probably become less and less sensitive to how prompts are formulated. We’ve already reached the stage where LLMs are starting to become integrated within operating systems, and this will put more pressure on LLM developers to produce a straightforward and consistent user experience. For example, you can already see this happening behind the scenes in Apple Intelligence, where the developers use careful wording to massage user queries before they reach an LLM. Another important direction, though a difficult nut to crack, is a better understanding of how LLMs actually work, which could give guiding insights into how the wording of prompts influences behaviour.
Too long; didn’t read
LLMs are very sensitive to the way in which you prompt them. Prompt engineering is the process of improving a prompt so that an LLM carries out its task more effectively. It’s an art rather than a science. Popular approaches include chain-of-thought reasoning and retrieval augmented generation. Both of these rest on the idea of in-context learning, which involves embedding carefully chosen information within the prompt. The LLM then uses this to better align its response with the required task. However, a challenge of these approaches is how to source appropriate information. Simpler approaches include asking an LLM the same thing multiple times, using an encouraging statement that tells the LLM it is an expert in the task, and asking it to reason about or fact-check its response. These can be surprisingly effective. Automated prompt optimisation is an emerging approach that aims to remove the manual work involved in optimising prompts, offloading this to an optimisation algorithm or an LLM. This is an active area of research, so watch this space.
1. Although it is possible to train them further, through various fine-tuning methods. Check out Sebastian Raschka’s Ahead of AI Substack for extensive info on this.
2. With 2 billion parameters and 8-bit quantization.
3. OpenAI haven’t told us how many parameters there are in GPT-4o, but it’s widely believed to be over 1 trillion.
4. These approaches tend to work directly in the embedding space, so they basically turn the initial prompt into a bunch of numbers which is no longer human-readable.