Few-Shot Prompting vs Detailed Instructions in Text Analysis. Is It Worth Adding Lengthy Examples to a Prompt?

In the article Do Prompt Structures Improve the Output Quality?, I analyzed the responses generated by the GPT-4, Gemini 1.5, and Claude 3 models to various zero-shot prompts for a complex task: evaluating meeting transcripts against a set of criteria. Those prompts required detailed instructions for the model (LLM), so the prompt size was up to 400 words.

Now, let's figure out which is more effective: writing lengthy instructions or providing the AI with examples of the desired output. The latter strategy is referred to as few-shot prompting.

This article presents an experimental comparison of these two prompting strategies applied to the same task with the same three LLMs, this time with GPT-4o added to the lineup.

I seek to answer the same question: How can we minimize our efforts in crafting prompts without compromising the output quality?

1. Few-shot Prompting and Lengthy Examples

It is worth noting that "Few-shot" is not the only term for this approach. For instance, it is also referred to as Example Demonstrations in the "prompt design principles" list from this article (December 2023).

"One-shot prompts" are sometimes considered different from "Few-shot prompts". Nevertheless, I apply the same term "Few-shot" to a prompt with one example only.

Few-shot prompting is believed to enhance output quality: the AI can extract from the examples many nuances about what we expect it to produce. My experiments show that this is indeed the case.

In most sources, few-shot examples are described for AI tasks with very short outputs, ranging from a few words to a couple of dozen words.

However, the result of analyzing a transcript needs to be significantly longer, typically hundreds of words, not dozens. I have not found any evidence on the Internet suggesting that the effort of creating such lengthy examples is justified by the quality of the outcome. That's why I decided to explore this myself.

If you already have a complete example of what the AI should do with your task, adding it to a prompt will definitely save time on writing the prompt.

This is because examples often eliminate the need for detailed task descriptions in the prompt and reduce the necessity of employing other prompt engineering techniques.

Nevertheless, one technique should still be employed in combination with lengthy few-shot examples: making sure the LLM clearly "understands" where each example starts and ends. Several methods are available for that (see section 3.2); I typically use <tags>.

2. How to Get Examples?

If your goal is to summarize lengthy texts or craft social media posts, you likely have examples of effective summaries or compelling posts already. These examples can be incorporated directly into the prompt without modifications.

However, not all situations are that easy. For instance, I had no existing examples of meeting analysis: it was a novel task that had not been tackled by people before, simply because it is too laborious for humans.

So, how can you create a high-quality example with minimal effort?

This is where generative AI is also helpful:

  1. Start with writing a basic (draft) prompt that addresses the task at hand. This is the prompt to which examples will be added later.
  2. Initially, apply this basic prompt as it is to some input data to obtain a preliminary output.
  3. This preliminary output then acts as an example, but it needs to be manually refined and its flaws corrected. Experiments show (refer to section 3.3) that incorporating just one such example can significantly enhance the quality of the prompt.
  4. A second iteration may be done if needed (a code sketch of the whole workflow follows this list):
     4a. Apply the 1-shot prompt developed in step 3 to input data different from that used in step 2.
     4b. Incorporate the AI-generated text from step 4a as a second example; this creates a 2-shot prompt.
     4c. This increases the "reliability" of the prompt, enhancing its ability to consistently produce quality outputs across varying input data.
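To make these steps concrete, here is a minimal Python sketch of the workflow. The call_llm helper, the model names passed to it, and the placeholder transcripts are assumptions for illustration only; substitute your own API client and data.

```python
# Hypothetical helper standing in for a real API client: it sends a system
# prompt plus a transcript to the chosen model and returns the generated text.
def call_llm(model: str, system_prompt: str, transcript: str) -> str:
    # Placeholder implementation; replace with an actual API call.
    return f"(output of {model} for the given transcript)"

transcript_1 = "...first meeting transcript..."
transcript_2 = "...a different meeting transcript..."

# Step 1: a basic (draft) prompt describing the task.
basic_prompt = (
    "Evaluate the meeting transcript against the given criteria "
    "and suggest improvements for future meetings."
)

# Step 2: apply the basic prompt as-is to obtain a preliminary output.
draft_output = call_llm("gpt-4", basic_prompt, transcript_1)

# Step 3: refine the draft by hand (adjust grades, wording, formatting) and
# wrap it in tags so the model can see where the example starts and ends.
example_1 = draft_output  # replace with your manually corrected version
one_shot_prompt = f"{basic_prompt}\n\n<EXAMPLE>\n{example_1}\n</EXAMPLE>"

# Steps 4a-4c: run the 1-shot prompt on a *different* transcript and add the
# result as a second example, producing a more "reliable" 2-shot prompt.
example_2 = call_llm("gemini-1.5-pro", one_shot_prompt, transcript_2)
two_shot_prompt = (
    f"{basic_prompt}\n\n<EXAMPLES>\n"
    f"<Example1>\n{example_1}\n</Example1>\n"
    f"<Example2>\n{example_2}\n</Example2>\n"
    f"</EXAMPLES>"
)
```

The tag layout here matches the one described in sections 3.2 and 3.3.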


In section 3.3, you can find more about how I applied this approach to my project.

3. Prompts to Study

3.1. Zero-shot Prompts

Prompt #1 is a "minimal" prompt that can analyze meeting transcripts based on specified criteria and suggest improvements for future meetings.

Prompt #2 evolved from Prompt #1 through numerous refinements, so it is twice as large. It includes detailed instructions, headings, and lists, although it does not specify explicit steps for the AI to follow. This prompt was found to be the best among the zero-shot prompts tested in section 3 of the previous article.

However, noticeable improvements of Prompt #2 over Prompt #1 were observed only with Claude 3 Opus.

3.2. Adding an Example

Here's an example of a transcript analysis output that was incorporated into zero-shot Prompt #1, creating the 1-shot Prompt #3.

(Image: Example in Prompt #3)

As we can see, examples can be quite lengthy and may even include the desired formatting: in this case, using ** to denote bold text and numbers to structure a list.

Here, I used a structure with opening and closing tags (<EXAMPLE>…</EXAMPLE>) to indicate the start and end of the example. This structure is understood by LLMs. Tags are recommended by Anthropic; their Prompt Generator organizes prompts for Claude using tags.

However, if you prefer not to use such a "programmer's" approach as tags, there are alternative methods to separate the example from other text (a code sketch follows this list):

  • For instance, if your prompt uses Markdown-style headers, then let the example also be under the "# Example" header.
  • If there are no other headers in your prompt, you can use any separators like this: Example: ``` …(example text)… ```
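Whichever separator you choose, the assembly is plain string concatenation. Here is a small sketch with a hypothetical wrap_example helper that produces the three layouts mentioned above:

```python
def wrap_example(example_text: str, style: str = "tags") -> str:
    """Delimit an example so the LLM can tell where it starts and ends."""
    if style == "tags":      # <EXAMPLE>...</EXAMPLE>, the approach used in this article
        return f"<EXAMPLE>\n{example_text}\n</EXAMPLE>"
    if style == "header":    # a Markdown "# Example" section
        return f"# Example\n\n{example_text}"
    if style == "fence":     # Example: ``` ... ```
        return f"Example:\n```\n{example_text}\n```"
    raise ValueError(f"unknown style: {style}")

# Usage: append the wrapped example to a zero-shot prompt to get a 1-shot prompt.
prompt_1 = "...zero-shot Prompt #1 text..."
prompt_3 = prompt_1 + "\n\n" + wrap_example("...transcript analysis example...")
```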

3.3. Creating Examples with AI

The example discussed above is an output from the zero-shot Prompt #1, generated using the GPT-4 model. However, I manually adjusted the output: I reduced overly generous grades and tailored the grade format to better suit my preferences.

Next, I employed 1-shot Prompt #3 for analyzing a different meeting. This time, I used the Gemini 1.5 Pro model to ensure the new result had a different style. This new output became the second example in the new 2-shot Prompt #4. So, I followed the formula: Prompt #4 = Prompt #3 + second example.

The process of crafting few-shot prompts described above is illustrated in the diagram below:

(Image: Recommended steps to create a 1-shot prompt and a 2-shot prompt)

To separate examples in Prompt #4, I again used tags, arranged as follows:

<EXAMPLES> <Example1>…</Example1><Example2>…</Example2> </EXAMPLES>

(Image: Example 2 in Prompt #4)
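In code, such a nested block can be assembled with a small helper that numbers the tags automatically; the build_examples_block function below is just an illustration:

```python
def build_examples_block(examples: list[str]) -> str:
    """Wrap each example in numbered tags and nest them all inside <EXAMPLES>."""
    inner = "\n".join(
        f"<Example{i}>\n{text}\n</Example{i}>"
        for i, text in enumerate(examples, start=1)
    )
    return f"<EXAMPLES>\n{inner}\n</EXAMPLES>"

# Prompt #4 = brief task description + two tagged examples.
example_1 = "...first analysis example..."
example_2 = "...second analysis example..."
prompt_4 = "...zero-shot Prompt #1 text...\n\n" + build_examples_block([example_1, example_2])
```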

Note that the system prompt of a custom GPT (the "Instructions" field) is limited to 8,000 characters, spaces included. Therefore, I removed some items from the list in each example in Prompt #4.
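A quick length check before pasting the prompt into the "Instructions" field saves a round of trial and error. A trivial sketch (prompt_4 stands in for the full prompt text; the 8,000-character threshold is the limit mentioned above):

```python
prompt_4 = "...the full text of the 2-shot prompt..."
GPT_INSTRUCTIONS_LIMIT = 8000  # characters, spaces included

excess = len(prompt_4) - GPT_INSTRUCTIONS_LIMIT
if excess > 0:
    print(f"{excess} characters over the limit; shorten the examples.")
else:
    print(f"Fits, with {-excess} characters to spare.")
```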

As a result, the examples in the 2-shot prompt are as different as possible. This should enhance the outputs on new input data compared to those from the 1-shot prompt. Nevertheless, this assertion remains untested experimentally, as such tests would require a large number of different transcripts (many different input datasets).

Let's test a simpler assertion — that few-shot prompts outperform zero-shot prompts.

4. Prompt Comparison Results

Thus, the experiments compare four prompts (also expressed as code after this list):

  1. 0-shot: a brief task description.
  2. 0-shot: detailed instructions.
  3. 1-shot: brief prompt (as in item 1) augmented by one example.
  4. 2-shot: brief prompt (as in item 1) augmented by two examples.
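In code form, the four variants are just combinations of the pieces described in section 3; the variable names below are illustrative:

```python
brief_task = "...Prompt #1: brief task description..."
detailed_instructions = "...Prompt #2: detailed instructions..."
example_1 = "...first analysis example..."
example_2 = "...second analysis example..."

prompts = {
    "Prompt #1 (0-shot, brief)": brief_task,
    "Prompt #2 (0-shot, detailed)": detailed_instructions,
    "Prompt #3 (1-shot)": f"{brief_task}\n\n<EXAMPLE>\n{example_1}\n</EXAMPLE>",
    "Prompt #4 (2-shot)": (
        f"{brief_task}\n\n<EXAMPLES>\n"
        f"<Example1>\n{example_1}\n</Example1>\n"
        f"<Example2>\n{example_2}\n</Example2>\n</EXAMPLES>"
    ),
}
```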

The details about the experiments are available in section 2 of the article "Do Prompt Structures Improve the Output Quality?". In particular, it describes the methodology used to assess output quality via the "number of defects" metric.

This time, there are two changes to the methodology described there:

  • The brand-new GPT-4o model is tested in addition to GPT-4-turbo. GPT-4o is accessed via the OpenAI Playground.
  • Defects in the initial messages (which are unrelated to the final analysis) are NOT considered, as these defects are irrelevant for the purposes of this study.

Here are the results measured by the metric above:

(Image: Results summary for different prompts. The higher the number, the worse the output quality.)

On average, few-shot prompts produce much better texts.

Here is what we observe in the diagram above in terms of model differences:

  • GPT-4o, GPT-4, and Gemini excel at extracting user expectations from examples (Gemini even replicates specific phrases, which is usually a convenient feature). However, an overly detailed prompt without examples can confuse all of these models, regardless of how well the prompt is structured.
  • In contrast, Claude 3 Opus prefers detailed instructions over examples and follows instructions rigorously. All three prompts with brief instructions led to issues in Claude's outputs.

For a detailed list of defects in each experiment, you can refer to the table.

The diagram above does not capture another important indicator: "text specificity". Unfortunately, this indicator is difficult to quantify, but my personal observation is that using examples brought the model outputs close to ideal in terms of specificity.

Notably, even GPT-4, prone to producing vague responses to zero-shot prompts, started to pay more attention to specific details from the transcript when examples were provided.

As expected, GPT-4o performs extremely well, especially with few-shot prompts. When it analyzed 6 transcripts with my Prompts #3 and #4:

  • There were no "defects" at all.
  • The outputs were perfectly specific and used many details from the examples.

Before May 13, 2024, Gemini 1.5 Pro performed better than all other models in terms of "text specificity". Now, GPT-4o is the leader according to this subjective criterion, even when no examples are provided. Just take a look at its analysis of a meeting:

(Image: GPT-4o analysis of a daily meeting, Prompt #2)

However, few-shot prompting has its drawbacks in some cases. I noticed that Claude, when given 1- or 2-shot prompts, tends to make mistakes caused by overlooking some instruction details. This is likely due to the very large size of my few-shot prompts, which may divert the model's focus from the primary task.

It's important to acknowledge the probabilistic (random) nature of AI outputs. Interestingly, the randomness is significantly reduced with few-shot prompts compared to zero-shot prompts. In particular, the "number of defects" metric was consistently similar across repeated runs of the same model with the same context.
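If you want to gauge this variability for your own prompts, simply re-run the same prompt on the same transcript several times and compare the outputs; counting defects still has to be done by hand. A minimal sketch, reusing the hypothetical call_llm stub from the sketch in section 2:

```python
def call_llm(model: str, system_prompt: str, transcript: str) -> str:
    # Same placeholder stub as in the sketch in section 2.
    return f"(output of {model} for the given transcript)"

prompt_3 = "...the 1-shot prompt text..."
transcript = "...a meeting transcript..."

# Generate several outputs for the same prompt and input, then count the
# defects in each output manually and compare the counts across runs.
outputs = [call_llm("gpt-4", prompt_3, transcript) for _ in range(3)]
for i, text in enumerate(outputs, start=1):
    print(f"--- run {i} ---\n{text}\n")
```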

This and other advantages of few-shot prompting can be found here:

5. Conclusion

Few-shot prompting is a great method for crafting prompts with minimal effort while consistently achieving high quality. It leads to more specific and less deficient texts even though the few-shot prompts tested are very large (therefore, in theory, they could "confuse" a model).

Even if you start with no examples,

  • A 1-shot prompt can be quickly generated using AI.
  • Creating a 2-shot prompt also demands no more effort than writing detailed instructions (refer to section 3.3).

In fact, it's not always necessary to add a second example. The number of defects for a 1-shot prompt can even be lower than that for a 2-shot prompt. For instance, Gemini 1.5 Pro produced NO deficient output while analyzing 3 transcripts with my 1-shot prompt.

Thus, if you have a draft of instructions, enhancing the results can be as simple as adding just one example obtained from the AI's initial output with minor edits. This method is less complex than refining instructions within a zero-shot prompt. Furthermore, using a 1-shot prompt with GPT-4o, GPT-4, or Gemini almost guarantees better quality compared to a refined zero-shot prompt.

Of course, the insights mentioned apply primarily to "verbose" tasks, such as the transcript analysis I addressed in this article. If you observe different model behavior when tackling your own tasks, please share your experience in the comments.