Evaluating long context large language models
There is a race towards language models with longer context windows. But how good are they, and how can we know?
Introduction
The context window of large language models – the amount of text they can process at once – has been increasing at an exponential rate.
In 2018 and 2019, language models like BERT, GPT-1, and T5 could take at most 512 tokens as input. Now, in the summer of 2024, this number has jumped to 2 million tokens (in publicly available LLMs). But what does this mean for us, and how do we evaluate these increasingly capable models?
What does a large context window mean?
The recently released Gemini 1.5 Pro model can take in up to 2 million tokens. But what does 2 million tokens even mean?
If we estimate 3 words to roughly equal 4 tokens, it means that 2 million tokens can (almost) fit the entire Harry Potter and The Lord of the Rings series combined.1
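For a quick sanity check on that claim, here is the same back-of-envelope arithmetic as the footnote at the end of this post, written out in Python (the word counts are approximate, and the 3-words-per-4-tokens ratio is only a rough heuristic):

```python
# Rough back-of-envelope estimate: ~3 words per 4 tokens (a heuristic, not an exact rate).
WORDS_PER_TOKEN = 3 / 4

harry_potter_words = 1_084_625   # all seven Harry Potter books
lotr_words = 481_103             # The Lord of the Rings trilogy

total_words = harry_potter_words + lotr_words
estimated_tokens = total_words / WORDS_PER_TOKEN

print(f"{estimated_tokens:,.0f} tokens")  # ~2,087,637 -- just over a 2M-token window
```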
These numbers refer to the context windows available in publicly released models. Gemini 1.5 Pro, while currently offered to the public with up to a 2 million token context window, can work with up to 10 million tokens.
As one Reddit user pointed out, this means you could fit 1,000 scientific papers into Gemini's 10 million token context window to help create novel research.
Why does this matter?
Larger context windows aren’t just a way for companies building LLMs to compete with each other. Models that can take in long contexts enable many real-world use cases. Consider the following scenarios:
Legal research: Lawyers can input entire case histories, precedents, and statutes into a model, getting comprehensive analysis in seconds instead of hours or days of manual review.
Financial analysis: Imagine feeding years of financial reports, market trends, and economic indicators into an AI for instant, in-depth insights.
Medical diagnostics: Doctors could input a patient's entire medical history, including test results, treatment records, and high-resolution medical imaging scans, for more accurate diagnoses and personalized treatment plans.
Education: Students could input entire textbooks and course materials, getting tailored explanations and connections across subjects.
However, these use cases also raise concerns. The ability to process vast amounts of personal data could enable unprecedented levels of surveillance and privacy invasion if misused. As these capabilities grow, so does the need for robust ethical guidelines and safeguards.
How do we evaluate LLMs as the context window gets longer and longer?
Models with extremely long context windows are a recent development. Therefore, researchers have tried to come up with new ways of evaluating how good these models are. These evaluations aim to benchmark the capabilities and limits of long context models, as well as measure the tradeoffs that come with scaling up context windows.
The core idea is that models with longer input contexts should be able to perform tasks that were previously too difficult or impossible.
Evaluation Use Cases
In this article, I’ll cover three different ways that researchers are thinking about evaluating long context models:
Information retrieval from long documents
Complex analysis (reasoning and summarizing) of long documents
In-context learning for "on the fly" model training
Note: This is not an exhaustive list. For a comprehensive overview of long-context benchmarks, refer to the Awesome LLM Long Context Modeling Github page.
1. Information retrieval from long documents
The "Needle in a Haystack" test, introduced by Greg Kamradt, is a popular method for evaluating information retrieval in long contexts. It involves placing an out-of-place statement (the "needle") at various depths within longer text snippets (the "haystack").
This test measures how effectively LLMs can locate specific information within increasingly large contexts.
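To make the setup concrete, here is a minimal sketch of how a Needle in a Haystack test can be run. The needle sentence, the filler text, and the call_llm function are placeholders rather than the exact benchmark code, and real implementations measure context length in tokens rather than words:

```python
# Minimal sketch of a Needle-in-a-Haystack style evaluation.
# The needle, filler text, and `call_llm` are placeholders, not the original benchmark.

NEEDLE = "The best thing to do in San Francisco is to eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_haystack(filler_text: str, context_words: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of a context
    trimmed to roughly `context_words` whitespace-separated words."""
    words = filler_text.split()[:context_words]
    insert_at = int(len(words) * depth)
    return " ".join(words[:insert_at] + [NEEDLE] + words[insert_at:])

def run_test(filler_text: str, call_llm) -> dict:
    results = {}
    for context_words in [1_000, 10_000, 100_000]:      # haystack sizes
        for depth in [0.0, 0.25, 0.5, 0.75, 1.0]:       # needle positions
            prompt = build_haystack(filler_text, context_words, depth) + "\n\n" + QUESTION
            answer = call_llm(prompt)                   # placeholder model call
            results[(context_words, depth)] = "Dolores Park" in answer
    return results
```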
Variations on Needle in a Haystack
Researchers have developed several variations to test different aspects of information retrieval:
Multiple needles: Multiple facts sprinkled throughout long documents (introduced by LangChain and in NeedleBench)
Multimodal Needle in a Haystack: Finding a target image within a set of unrelated images based on a caption
Audio-based: Identifying a short clip in which a speaker says “the secret keyword is needle,” hidden within an audio signal up to almost five days (107 hours) long (introduced in the Gemini 1.5 technical report).
Video-based: Locating a single frame containing the text “The secret word is needle,” buried within a 10.5-hour video constructed by concatenating seven copies of the full AlphaGo documentary (Gemini 1.5 technical report).
Limitations and Implications
While widely used, the Needle in a Haystack approach has several limitations:
It's an artificial task that may not reflect real-world use cases.
It only assesses information finding, not reasoning or comprehension.
As context windows grow, evaluating all combinations of "haystack" sizes and "needle" locations becomes increasingly costly.
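To get a feel for that last point, here is a rough back-of-envelope sketch of how quickly a full sweep grows. The grid values and per-token price are illustrative assumptions, not the settings of any particular benchmark or API:

```python
# Back-of-envelope: cost of sweeping haystack sizes x needle depths.
# Grid values and the per-token price below are illustrative assumptions.

context_lengths = [1_000, 10_000, 100_000, 1_000_000, 2_000_000]  # tokens per haystack
depths = [i / 10 for i in range(11)]                               # 0%, 10%, ..., 100%
price_per_million_input_tokens = 1.25                              # hypothetical API price (USD)

total_calls = len(context_lengths) * len(depths)
total_tokens = sum(length * len(depths) for length in context_lengths)
cost = total_tokens / 1_000_000 * price_per_million_input_tokens

print(f"{total_calls} calls, {total_tokens:,} input tokens, ~${cost:,.2f} for one sweep")
```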
Despite these drawbacks, the test highlights a crucial capability of long-context models: the ability to quickly search and retrieve information from vast amounts of data. This has significant implications, from enhancing research efficiency to enabling unprecedented levels of data analysis — and potentially, surveillance.
It's important to note that this type of information retrieval differs from Retrieval-Augmented Generation (RAG) in that it operates within a single, extensive context rather than retrieving information from external sources.
2. Complex analysis (reasoning and summarizing) of long documents
While "Needle in a Haystack" tests focus on information retrieval, other evaluations assess an LLM's ability to reason over, interpret, and synthesize information from extensive content. These evaluations aim to test for more complex forms of reasoning rather than just pinpoint data location.
Here are a few evaluation methods in this category:
Literary QA Tasks
Books are prime examples of long documents. Benchmarks like NovelQA test models' ability to reason over literary fiction, with contexts of up to 200K tokens. It includes human-generated questions about 88 English-language novels, both public domain and copyrighted. Other datasets, like NoCha, take a similar approach.
Reasoning over long texts with hidden relevant information
FlenQA creates multiple versions of contexts at different lengths by embedding relevant information within longer, irrelevant texts. This helps assess how LLM performance degrades as context length increases.
Domain-Specific Reasoning
Healthcare: The LongHealth benchmark uses 20 fictional patient cases (5-7K words each) to test medical reasoning.
Finance: DocFinQA challenges models with financial documents up to 150 pages long (100K+ tokens).
Summarization Tasks
Summarizing long documents is a crucial capability for LLMs, as it allows users to quickly grasp key information from extensive texts without reading everything. This skill is particularly valuable in fields like research, business, and law, where professionals often need to distill large volumes of information into concise reports.
However, evaluating summarization quality is challenging. Unlike simple information retrieval, summarization requires a deep understanding of the entire context and the ability to identify and synthesize key points. What makes a "good" summary can be subjective and context-dependent.
Current evaluation methods often rely on comparing model-generated summaries to human-written references, but this approach has limitations. It may not capture the full range of valid summarization strategies and can miss semantically correct summaries that use different wording.
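In practice, these reference-based comparisons often rely on n-gram overlap metrics such as ROUGE. The sketch below uses the rouge-score Python package with made-up texts; as noted above, a high overlap score does not guarantee a good summary, and a semantically correct summary with different wording can score poorly:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Made-up reference and candidate summaries, purely for illustration.
reference = "The report finds that long-context models retrieve facts well but reason unevenly."
candidate = "Long-context models are good at finding facts but their reasoning is inconsistent."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} "
          f"recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```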
Benchmarks like LongBench and ∞Bench attempt to address some of these challenges. LongBench includes summarization tasks for various document types (government reports, meeting notes, news articles) up to 15K words, while ∞Bench pushes the boundaries with novel summarization tasks up to 100K tokens. These benchmarks are valuable, but the field is still working towards more robust evaluation methods that can better capture the nuances of high-quality summarization.
A great resource to learn more on this topic can be found in An Empirical Survey on Long Document Summarization: Datasets, Models, and Metrics.
3. “On the fly” model training
One of the coolest applications of long context models is the expanded capacity for in-context learning (ICL). ICL allows models to learn new tasks on the fly, directly from examples provided in the prompt. With larger context windows, we can now include hundreds of training examples or even complex, lengthy examples like summarization tasks.
This capability is a game-changer. Instead of fine-tuning models for specific domains, one can leverage ICL to adapt models to new tasks instantly.
Many-shot ICL
DeepMind’s work on Many-Shot In-Context Learning showed that including many more examples in the prompt significantly boosted performance across various tasks. By scaling ICL to hundreds or thousands of examples, models can overcome pre-training biases and tackle more complex challenges.
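As a rough illustration of what many-shot ICL looks like mechanically, the sketch below packs hundreds of labeled examples into a single prompt. The toy sentiment task, the formatting, and the call_llm function are placeholders; the paper's actual tasks and prompts differ:

```python
# Minimal sketch of many-shot in-context learning: pack hundreds of labeled
# examples into one prompt. The task and `call_llm` are placeholders.

def build_many_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    return f"{shots}\n\nInput: {query}\nLabel:"

# With a 1M-2M token window, even ~1,000 short examples fit comfortably.
examples = [("I loved this film", "positive"), ("Terrible service", "negative")] * 500
prompt = build_many_shot_prompt(examples, "The plot dragged but the acting was great")
# answer = call_llm(prompt)  # placeholder model call
```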
This principle extends beyond just improving performance. Anthropic's work on Many-shot Jailbreaking demonstrated that while a few examples couldn't compromise a model's safety guardrails, hundreds of examples could — highlighting both the power and potential risks of this approach.
Translating low-resource languages
Long-context models are also proving particularly valuable for low-resource language translation. The Gemini 1.5 technical report showcased this potential with Kalamang, a language with fewer than 200 speakers and minimal web presence. Given a 500-page grammar book, a 2,000-entry bilingual wordlist, and 400 parallel sentences (totaling around 250k tokens) in its context, the model could translate and even transcribe Kalamang speech.
This approach extends to other low-resource languages too, with performance improving as more examples are provided. It's a promising development for preserving and working with endangered languages.
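As a rough sketch of how such a prompt might be assembled (the file names, formatting, and call_llm function are placeholders, not the setup used in the Gemini 1.5 report):

```python
# Rough sketch: assemble a grammar book, bilingual wordlist, and parallel sentences
# into one long prompt for in-context translation. Paths and `call_llm` are placeholders.

def load(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

grammar = load("kalamang_grammar.txt")               # ~500-page reference grammar
wordlist = load("kalamang_wordlist.txt")             # ~2,000-entry bilingual wordlist
parallel = load("kalamang_parallel_sentences.txt")   # ~400 translated sentence pairs

prompt = (
    "You are translating between Kalamang and English. Use only the materials below.\n\n"
    f"## Grammar\n{grammar}\n\n## Wordlist\n{wordlist}\n\n## Parallel sentences\n{parallel}\n\n"
    "Translate into English: <Kalamang sentence here>"
)
# The combined materials total roughly 250k tokens, well within a 1M-2M token window.
# answer = call_llm(prompt)  # placeholder model call
```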
Discussion
The race for longer context windows in language models is accelerating, with context window sizes growing at an exponential rate. This growth necessitates new evaluation methods to properly assess these models' capabilities and limitations.
While numerous benchmarks for long context evaluation have emerged (e.g., SCROLLS, LongBench, ∞BENCH), many questions remain unanswered:
Trade-offs of scaling: How do safety, bias, and instruction-following change as context length increases?
Multilingual performance: Most benchmarks focus on English (with the exception of benchmarks like CLongEval, which includes evaluation in Chinese as well). How does performance in other languages change with longer contexts, compared to English?
Potential degradation: Do certain capabilities (like coding skills or creativity) suffer as models handle more context?
Real-world implications: As models can process entire books, personal histories, or comprehensive data on low-resource languages, what are the ethical and practical consequences?
As LLMs’ context windows continue to grow, we need to understand not just what these models can do, but also how their fundamental characteristics may be changing.
For now, the race towards models with larger and larger context windows will continue.
Citation
For attribution in academic contexts or books, please cite this work as
Yennie Jun, "Evaluating long context large language models", Art Fish Intelligence, 2024.
@article{Jun2024longcontextllms,
  author = {Yennie Jun},
  title = {Evaluating long context large language models},
  journal = {Art Fish Intelligence},
  year = {2024},
  howpublished = {\url{https://www.artfish.ai/p/long-context-llms}}
}
The total word count of all seven books in the Harry Potter series is 1,084,625. The total word count of the three volumes of The Lord of the Rings is 481,103. (1,084,625 + 481,103) * 4 / 3 ≈ 2,087,637 tokens. Therefore, Gemini’s 2M context could contain the entire Harry Potter and The Lord of the Rings series minus roughly the last half of The Return of the King.