Evaluating long context large language models
There is a race towards language models with longer context windows. But how good are they, and how can we know?
Introduction
The context window of large language models – the amount of text they can process at once – has been increasing at an exponential rate.
In 2018 and 2019, language models like BERT, GPT-1, and T5 could take at most 512 tokens as input. Now, in the summer of 2024, this number has jumped to 2 million tokens (in publicly available LLMs). But what does this mean for us, and how do we evaluate these increasingly capable models?
What does a large context window mean?
The recently released Gemini 1.5 Pro model can take in up to 2 million tokens. But what does 2 million tokens even mean?
If we estimate 3 words to roughly equal 4 tokens, it means that 2 million tokens can (almost) fit the entire Harry Potter and The Lord of the Rings series combined.1
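For a quick sanity check on that claim, here is the same back-of-envelope arithmetic as the footnote at the end of this post, written out in Python (the word counts are approximate, and the 3-words-per-4-tokens ratio is only a rough heuristic):

```python
# Rough back-of-envelope estimate: ~3 words per 4 tokens (a heuristic, not an exact rate).
WORDS_PER_TOKEN = 3 / 4

harry_potter_words = 1_084_625   # all seven Harry Potter books
lotr_words = 481_103             # The Lord of the Rings trilogy

total_words = harry_potter_words + lotr_words
estimated_tokens = total_words / WORDS_PER_TOKEN

print(f"{estimated_tokens:,.0f} tokens")  # ~2,087,637 -- just over a 2M-token window
```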
These numbers refer to the context windows available in publicly released models. Gemini 1.5 Pro, while currently offered to the public with up to a 2 million token context window, can work with up to 10 million tokens.
As one Reddit user pointed out, this means you could fit 1,000 scientific papers into Gemini's 10 million token context window to help create novel research.
Why does this matter?
Larger context windows aren’t just a way for companies building LLMs to compete with each other. Models that can take in long contexts enable many real-world use cases. Consider the following scenarios:
Legal research: Lawyers can input entire case histories, precedents, and statutes into a model, getting comprehensive analysis in seconds instead of hours or days of manual review.
Financial analysis: Imagine feeding years of financial reports, market trends, and economic indicators into an AI for instant, in-depth insights.
Medical diagnostics: Doctors could input a patient's entire medical history, including test results, treatment records, and high-resolution medical imaging scans, for more accurate diagnoses and personalized treatment plans.
Education: Students could input entire textbooks and course materials, getting tailored explanations and connections across subjects.
However, these use cases also raise concerns. The ability to process vast amounts of personal data could enable unprecedented levels of surveillance and privacy invasion if misused. As these capabilities grow, so does the need for robust ethical guidelines and safeguards.
How do we evaluate LLMs as the context window gets longer and longer?
Models with extremely long context windows are a recent development. Therefore, researchers have tried to come up with new ways of evaluating how good these models are. These evaluations aim to benchmark the capabilities and limits of long context models, as well as measure the tradeoffs that come with scaling up context windows.
The core idea is that models with longer input contexts should be able to perform tasks that were previously too difficult or impossible.
Evaluation Use Cases
In this article, I’ll cover three different ways that researchers are thinking about evaluating long context models:
Information retrieval from long documents
Complex analysis (reasoning and summarizing) of long documents
In-context learning for "on the fly" model training
Note: This is not an exhaustive list. For a comprehensive overview of long-context benchmarks, refer to the Awesome LLM Long Context Modeling Github page.
1. Information retrieval from long documents
The "Needle in a Haystack" test, introduced by Greg Kamradt, is a popular method for evaluating information retrieval in long contexts. It involves placing an out-of-place statement (the "needle") at various depths within longer text snippets (the "haystack").
This test measures how effectively LLMs can locate specific information within increasingly large contexts.
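To make the setup concrete, here is a minimal sketch of how a Needle in a Haystack test can be run. The needle sentence, the filler text, and the call_llm function are placeholders rather than the exact benchmark code, and real implementations measure context length in tokens rather than words:

```python
# Minimal sketch of a Needle-in-a-Haystack style evaluation.
# The needle, filler text, and `call_llm` are placeholders, not the original benchmark.

NEEDLE = "The best thing to do in San Francisco is to eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_haystack(filler_text: str, context_words: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of a context
    trimmed to roughly `context_words` whitespace-separated words."""
    words = filler_text.split()[:context_words]
    insert_at = int(len(words) * depth)
    return " ".join(words[:insert_at] + [NEEDLE] + words[insert_at:])

def run_test(filler_text: str, call_llm) -> dict:
    results = {}
    for context_words in [1_000, 10_000, 100_000]:      # haystack sizes
        for depth in [0.0, 0.25, 0.5, 0.75, 1.0]:       # needle positions
            prompt = build_haystack(filler_text, context_words, depth) + "\n\n" + QUESTION
            answer = call_llm(prompt)                   # placeholder model call
            results[(context_words, depth)] = "Dolores Park" in answer
    return results
```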
Variations on Needle in a Haystack
Researchers have developed several variations to test different aspects of information retrieval:
Multiple needles: Multiple facts sprinkled throughout long documents (introduced by LangChain and in NeedleBench)
Multimodal Needle in a Haystack: Finding a target image within a set of unrelated images based on a caption
Audio-based: Identifying a short clip in which a speaker says “the secret keyword is needle,” hidden within an audio signal up to almost five days (107 hours) long (introduced in the Gemini 1.5 technical report).
Video-based: Locating a single frame containing the text “The secret word is needle,” buried within a 10.5-hour video constructed by concatenating seven copies of the full AlphaGo documentary (Gemini 1.5 technical report).
Limitations and Implications
While widely used, the Needle in a Haystack approach has several limitations:
It's an artificial task that may not reflect real-world use cases.
It only assesses information finding, not reasoning or comprehension.
As context windows grow, evaluating all combinations of "haystack" sizes and "needle" locations becomes increasingly costly.
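To get a feel for that last point, here is a rough back-of-envelope sketch of how quickly a full sweep grows. The grid values and per-token price are illustrative assumptions, not the settings of any particular benchmark or API:

```python
# Back-of-envelope: cost of sweeping haystack sizes x needle depths.
# Grid values and the per-token price below are illustrative assumptions.

context_lengths = [1_000, 10_000, 100_000, 1_000_000, 2_000_000]  # tokens per haystack
depths = [i / 10 for i in range(11)]                               # 0%, 10%, ..., 100%
price_per_million_input_tokens = 1.25                              # hypothetical API price (USD)

total_calls = len(context_lengths) * len(depths)
total_tokens = sum(length * len(depths) for length in context_lengths)
cost = total_tokens / 1_000_000 * price_per_million_input_tokens

print(f"{total_calls} calls, {total_tokens:,} input tokens, ~${cost:,.2f} for one sweep")
```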
Despite these drawbacks, the test highlights a crucial capability of long-context models: the ability to quickly search and retrieve information from vast amounts of data. This has significant implications, from enhancing research efficiency to enabling unprecedented levels of data analysis — and potentially, surveillance.
It's important to note that this type of information retrieval differs from Retrieval-Augmented Generation (RAG) in that it operates within a single, extensive context rather than retrieving information from external sources.
2. Complex analysis (reasoning and summarizing) of long documents
While "Needle in a Haystack" tests focus on information retrieval, other evaluations assess an LLM's ability to reason over, interpret, and synthesize information from extensive content. These evaluations aim to test for more complex forms of reasoning rather than just pinpoint data location.
Here are a few evaluation methods in this category:
Literary QA Tasks
Books are prime examples of long documents. Benchmarks like NovelQA test models' ability to reason over literary fiction, with contexts of up to 200K tokens. It includes human-generated questions about 88 English-language novels, both public domain and copyrighted. Other datasets, like NoCha, take a similar approach.
Reasoning over long texts with hidden relevant information
FlenQA creates multiple versions of contexts at different lengths by embedding relevant information within longer, irrelevant texts. This helps assess how LLM performance degrades as context length increases.
Domain-Specific Reasoning
Healthcare: The LongHealth benchmark uses 20 fictional patient cases (5-7K words each) to test medical reasoning.
Finance: DocFinQA challenges models with financial documents up to 150 pages long (100K+ tokens).
Summarization Tasks
Summarizing long documents is a crucial capability for LLMs, as it allows users to quickly grasp key information from extensive texts without reading everything. This skill is particularly valuable in fields like research, business, and law, where professionals often need to distill large volumes of information into concise reports.
However, evaluating summarization quality is challenging. Unlike simple information retrieval, summarization requires a deep understanding of the entire context and the ability to identify and synthesize key points. What makes a "good" summary can be subjective and context-dependent.
Current evaluation methods often rely on comparing model-generated summaries to human-written references, but this approach has limitations. It may not capture the full range of valid summarization strategies and can miss semantically correct summaries that use different wording.
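In practice, these reference-based comparisons often rely on n-gram overlap metrics such as ROUGE. The sketch below uses the rouge-score Python package with made-up texts; as noted above, a high overlap score does not guarantee a good summary, and a semantically correct summary with different wording can score poorly:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Made-up reference and candidate summaries, purely for illustration.
reference = "The report finds that long-context models retrieve facts well but reason unevenly."
candidate = "Long-context models are good at finding facts but their reasoning is inconsistent."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} "
          f"recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```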
Benchmarks like LongBench and ∞Bench attempt to address some of these challenges. LongBench includes summarization tasks for various document types (government reports, meeting notes, news articles) up to 15K words, while ∞Bench pushes the boundaries with novel summarization tasks up to 100K tokens. These benchmarks are valuable, but the field is still working towards more robust evaluation methods that can better capture the nuances of high-quality summarization.
A great resource to learn more on this topic can be found in An Empirical Survey on Long Document Summarization: Datasets, Models, and Metrics.
3. “On the fly” model training
One of the coolest applications of long context models is the expanded capacity for in-context learning (ICL). ICL allows models to learn new tasks on the fly, directly from examples provided in the prompt. With larger context windows, we can now include hundreds of training examples or even complex, lengthy examples like summarization tasks.
This capability is a game-changer. Instead of fine-tuning models for specific domains, one can leverage ICL to adapt models to new tasks instantly.
Many-shot ICL
DeepMind’s work on Many-Shot In-Context Learning showed that including many more examples in the prompt significantly boosted performance across various tasks. By scaling ICL to hundreds or thousands of examples, models can overcome pre-training biases and tackle more complex challenges.
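As a rough illustration of what many-shot ICL looks like mechanically, the sketch below packs hundreds of labeled examples into a single prompt. The toy sentiment task, the formatting, and the call_llm function are placeholders; the paper's actual tasks and prompts differ:

```python
# Minimal sketch of many-shot in-context learning: pack hundreds of labeled
# examples into one prompt. The task and `call_llm` are placeholders.

def build_many_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    return f"{shots}\n\nInput: {query}\nLabel:"

# With a 1M-2M token window, even ~1,000 short examples fit comfortably.
examples = [("I loved this film", "positive"), ("Terrible service", "negative")] * 500
prompt = build_many_shot_prompt(examples, "The plot dragged but the acting was great")
# answer = call_llm(prompt)  # placeholder model call
```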
This principle extends beyond just improving performance. Anthropic's work on Many-shot Jailbreaking demonstrated that while a few examples couldn't compromise a model's safety guardrails, hundreds of examples could — highlighting both the power and potential risks of this approach.
Translating low-resource languages
Long-context models are also proving particularly valuable for low-resource language translation. The Gemini 1.5 technical report showcased this potential with Kalamang, a language with fewer than 200 speakers and minimal web presence. Given a 500-page grammar book, a 2,000-entry bilingual wordlist, and 400 parallel sentences (totaling around 250k tokens) in its context, the model could translate and even transcribe Kalamang speech.
This approach extends to other low-resource languages too, with performance improving as more examples are provided. It's a promising development for preserving and working with endangered languages.
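As a rough sketch of how such a prompt might be assembled (the file names, formatting, and call_llm function are placeholders, not the setup used in the Gemini 1.5 report):

```python
# Rough sketch: assemble a grammar book, bilingual wordlist, and parallel sentences
# into one long prompt for in-context translation. Paths and `call_llm` are placeholders.

def load(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

grammar = load("kalamang_grammar.txt")               # ~500-page reference grammar
wordlist = load("kalamang_wordlist.txt")             # ~2,000-entry bilingual wordlist
parallel = load("kalamang_parallel_sentences.txt")   # ~400 translated sentence pairs

prompt = (
    "You are translating between Kalamang and English. Use only the materials below.\n\n"
    f"## Grammar\n{grammar}\n\n## Wordlist\n{wordlist}\n\n## Parallel sentences\n{parallel}\n\n"
    "Translate into English: <Kalamang sentence here>"
)
# The combined materials total roughly 250k tokens, well within a 1M-2M token window.
# answer = call_llm(prompt)  # placeholder model call
```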
Discussion
The race for longer context windows in language models is accelerating, with context window sizes growing at an exponential rate. This growth necessitates new evaluation methods to properly assess these models' capabilities and limitations.
While numerous benchmarks for long context evaluation have emerged (e.g., SCROLLS, LongBench, ∞BENCH), many questions remain unanswered:
Trade-offs of scaling: How do safety, bias, and instruction-following change as context length increases?
Multilingual performance: Most benchmarks focus on English (with the exception of benchmarks like CLongEval, which includes evaluation in Chinese as well). How does performance in other languages change with longer contexts, compared to English?
Potential degradation: Do certain capabilities (like coding skills or creativity) suffer as models handle more context?
Real-world implications: As models can process entire books, personal histories, or comprehensive data on low-resource languages, what are the ethical and practical consequences?
As LLMs’ context windows continue to grow, we need to understand not just what these models can do, but also how their fundamental characteristics may be changing.
For now, the race towards models with larger and larger context windows will continue.
Citation
For attribution in academic contexts or books, please cite this work as
Yennie Jun, "Evaluating long context large language models", Art Fish Intelligence, 2024.
@article{Jun2024longcontextllms,
  author = {Yennie Jun},
  title = {Evaluating long context large language models},
  journal = {Art Fish Intelligence},
  year = {2024},
  howpublished = {\url{https://www.artfish.ai/p/long-context-llms}}
}
The total word count of all seven books in the Harry Potter series is 1,084,625. The total word count of the three volumes of The Lord of the Rings is 481,103. (1,084,625 + 481,103) * 4 / 3 ≈ 2,087,637 tokens. Therefore, Gemini’s 2M context could contain the entire Harry Potter and The Lord of the Rings series minus roughly the last half of The Return of the King.