Recent years have seen remarkable progress in natural language generation capabilities of large neural network models like GPT-3 (Brown et al., 2020). These foundation models can produce fluent, coherent, and even creative text spanning multiple paragraphs.

However, studies have revealed issues such as a tendency to hallucinate factually incorrect or fabricated content (Zhou et al., 2021). This indicates that while large language models have memorized substantial world knowledge through pre-training, they still lack sufficient grounding in external information to reliably generate accurate long-form text.

To address these knowledge limitations, an emerging technique involves integrating language model generation with information retrieval systems in a framework called retrieval augmented generation (Lewis et al., 2020). The key idea is to combine the fluency of neural text generators with the vast knowledge contained in external corpora like Wikipedia to produce responses that are not only smooth and coherent but also factual. Work by Jiang et al. (2021) has shown that retrieving and conditioning on relevant Wikipedia documents can significantly improve accuracy on question answering tasks.

However, most existing approaches involve a simple retrieve-then-generate setup, where documents are fetched just once based on the initial user input. As argued in a recent paper by Jiang et al. (2023), this is insufficient for complex long-form generation, which requires proactively gathering different pieces of information throughout the process. The authors propose a new framework called forward-looking active retrieval augmented generation (FLARE) that dynamically determines when the language model lacks necessary knowledge during text generation, and preemptively retrieves documents that can aid the subsequent generation steps.

This article summarizes the FLARE model, analyzing how it works, its implementations based on confidence scores or natural language instructions, and comprehensive evaluations demonstrating improved performance over previous methods on diverse long-form text generation datasets. We discuss the implications of FLARE's active retrieval approach in addressing knowledge gaps in language models, and explore promising directions it opens up for future work.

The Basics of Retrieval Augmented Generation

The goal of retrieval augmented generation is to produce responses that are not only fluent but also grounded in factual knowledge. This is achieved through the following general process:

  1. A user provides an input text prompt to the system.
  2. Based on this prompt, a retriever module searches a corpus of documents and retrieves the most relevant ones.
  3. The language model generates a response conditioned on both the original prompt as well as the retrieved documents.

For example, if the prompt is "Tell me about Joe Biden's education background", the retriever would fetch Wikipedia articles related to Joe Biden and his education. The language model would then leverage this external information to generate a response summarizing Biden's educational history.
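To make this concrete, below is a minimal Python sketch of the retrieve-then-generate flow. The toy corpus, the word-overlap relevance score, and the stub generator are illustrative placeholders rather than any particular library's API; a real system would use a proper retriever (e.g., BM25 or a dense index over Wikipedia) and an actual language model call in their place.

    # Toy corpus standing in for an external knowledge source such as Wikipedia.
    corpus = [
        "Joe Biden graduated from the University of Delaware and Syracuse University College of Law.",
        "Joe Biden served as the 47th vice president of the United States from 2009 to 2017.",
        "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    ]

    def relevance(query: str, doc: str) -> int:
        """Toy relevance score: number of words shared by the query and the document."""
        return len(set(query.lower().split()) & set(doc.lower().split()))

    def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
        """Step 2: fetch the k most relevant documents for the prompt."""
        return sorted(documents, key=lambda d: relevance(query, d), reverse=True)[:k]

    def generate(prompt: str, evidence: list[str]) -> str:
        """Step 3: condition generation on the prompt plus the retrieved evidence.
        A real system would call a language model here; this stub only shows how
        retrieved text is folded into the model's context."""
        context = "\n".join(evidence)
        return f"Answer to '{prompt}', grounded in:\n{context}"

    prompt = "Tell me about Joe Biden's education background"  # Step 1: user prompt
    print(generate(prompt, retrieve(prompt, corpus)))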

This allows the system to tap into a much wider knowledge source compared to just relying on what's contained within the language model's parameters. Early work in this area has shown that retrieval augmentation leads to more factual and in-depth responses.

Tradeoffs in Existing Models

Most existing retrieval augmented systems employ a single-step retrieve-and-generate process. The prompt is used just once to fetch documents, which are then consumed by the language model.

While effective for short responses, this setup has clear limitations for long-form generation, which often requires gathering information sequentially.

Consider summarizing Joe Biden's political career — the initial retrieval based on "Joe Biden" may cover his early years, but as the summary progresses to recent events like his presidency, the language model needs to dynamically retrieve more information.

Some recent work has tried to address this through techniques like:

  • Fixed-interval retrieval: Triggering retrieval periodically, every few tokens or sentences generated, rather than just once. However, predetermined intervals are inefficient and can trigger retrieval at points where no new information is needed.

  • Previous-output retrieval: Using the previously generated output as the query for retrieval at each step. But past output may not accurately reflect future information needs.

Introducing FLARE

The FLARE model implements a new approach called active retrieval augmented generation. The key ideas are:

  • Confidence-based retrieval: Only retrieving more information when the language model generates low-confidence tokens reflecting lack of knowledge.
  • Forward-looking retrieval: Anticipating likely future content by tentatively generating the next sentence, then using this upcoming content to formulate queries that support subsequent generation steps.
Image from the paper

FLARE takes a user prompt and initially retrieves documents based on it. It then starts iteratively generating the output.

At each step, it first predicts the next sentence. If low-confidence tokens are detected, it uses the predicted sentence as a query to find relevant documents. Finally, it regenerates the sentence incorporating the new information.

This continues as needed throughout the generation process. FLARE also employs techniques for formulating queries, such as masking out uncertain phrases or generating explicit questions about them to refine the search.
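The sketch below captures the shape of this loop. It is a simplified illustration under several assumptions, not the authors' implementation: retrieve and draft_next_sentence are hypothetical callables wrapping a retriever and a language model that exposes per-token probabilities, and the 0.4 confidence threshold is an arbitrary placeholder.

    CONFIDENCE_THRESHOLD = 0.4  # assumed placeholder; in practice this is tuned per task

    def flare_generate(prompt, retrieve, draft_next_sentence, max_sentences=10):
        """Simplified FLARE-style loop (an illustration, not the authors' code).
        retrieve(query) -> list of documents; draft_next_sentence(context, evidence)
        -> (sentence, token_probs), where token_probs is a list of (token, probability)."""
        evidence = retrieve(prompt)            # initial retrieval from the user input
        sentences = []
        for _ in range(max_sentences):
            context = prompt + " " + " ".join(sentences)
            # Forward-looking step: tentatively predict the next sentence.
            sentence, token_probs = draft_next_sentence(context, evidence)
            if not sentence:
                break
            # Active retrieval: a low-confidence token signals a knowledge gap.
            if any(p < CONFIDENCE_THRESHOLD for _, p in token_probs):
                # Build a query from the draft, masking out the uncertain tokens.
                query = " ".join(tok for tok, p in token_probs if p >= CONFIDENCE_THRESHOLD)
                evidence = retrieve(query or sentence)
                # Regenerate the sentence using the newly retrieved evidence.
                sentence, _ = draft_next_sentence(context, evidence)
            sentences.append(sentence)
        return " ".join(sentences)

In the paper, the implicit-query variant masks low-confidence spans much as shown here, while an explicit-query variant instead generates questions about those spans and uses them as search queries.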

Results

FLARE was validated on state-of-the-art models like GPT-3.5 for diverse long-form generation tasks:

  • Multi-hop QA: Answering compositional questions through reasoning and retrieval
  • Commonsense reasoning: Generating logical explanations using world knowledge
  • Long-form QA: Producing detailed answers to ambiguous questions
  • Summarization: Generating multi-sentence summaries using web information

It achieved strong improvements across all tasks compared to previous retrieval techniques as well as no-retrieval baselines. For instance, on multi-hop QA, FLARE improved exact match accuracy from 39% to 51% over single-step retrieval.

These results highlight the benefits of selective and forward-looking retrieval for long-form text generation. FLARE offers a plug-and-play enhancement for large language models.

Moving Forward

FLARE opens up an exciting research direction into more advanced retrieval strategies and architectures. Some promising work includes:

  • Adaptive retrieval: Modulating retrieval frequency based on run-time metrics like generation probability rather than fixed thresholds.
  • Efficient caching: Storing past retrievals and generation state to optimize repeated accesses (a minimal sketch follows this list).
  • Multi-task training: Directly training language models to anticipate points needing retrieval.
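As a toy illustration of the caching idea above, the snippet below simply memoizes retrieval so that repeated queries during a long generation are served from memory; toy_retrieve is a stand-in for whatever retriever the system actually uses.

    from functools import lru_cache

    def toy_retrieve(query):
        # Stand-in for a real retriever hitting a document index.
        return (f"documents matching '{query}'",)

    @lru_cache(maxsize=256)
    def cached_retrieve(query):
        # Identical queries issued later in the generation reuse the cached result.
        return toy_retrieve(query)

    cached_retrieve("Joe Biden presidency")  # hits the retriever and caches the result
    cached_retrieve("Joe Biden presidency")  # served from the cache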

As language models continue to advance in scale and capability, incorporating external knowledge through selective retrieval will be key to producing realistic long-form text. Techniques like FLARE point the way towards more controllable, accurate and knowledgeable text generation.

Image by the author

Sources:

Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).

This paper introduced GPT-3, a large language model exhibiting few-shot learning capabilities.

Zhou, Chunting, et al. "Detecting hallucinated content in conditional neural sequence generation." Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021.

This paper analyzed the tendency of language models to hallucinate incorrect or imaginary content.

Lewis, Patrick S. H., et al. "Retrieval-augmented generation for knowledge-intensive NLP tasks." Advances in Neural Information Processing Systems 33 (2020): 9459–9474.

This paper proposed retrieval-augmented generation to ground language model outputs in external knowledge.

Jiang, Zhengbao, et al. "How can we know when language models know? On the calibration of language models for question answering." Transactions of the Association for Computational Linguistics 9 (2021): 962–977.

This paper demonstrated improved accuracy from conditioning language models on relevant Wikipedia documents.

Jiang, Zhengbao, et al. "Forward-looking active retrieval augmented generation." arXiv preprint arXiv:2305.06983 (2023).

This paper introduced FLARE, a new retrieval-augmented generation method using active, forward-looking retrieval strategies.
