How do we know when an AI model is good enough?
A 2023 review of Artfish Intelligence
As the year draws to a close, I want to thank every single reader of this blog for supporting my work by reading, sharing, subscribing, and engaging with my writing.
I spent over 320 hours (!!!!) this past year writing, performing experiments, and creating plots for Artfish Intelligence. Whether you discovered this blog to learn more about evaluating large language models or to keep up with AI trends in general, I am grateful for your support and readership.
I covered a wide range of topics in artificial intelligence, under the broad theme of evaluation: that is, how do we know when an AI model is "good enough"? In my articles, I conducted experiments to better understand the capabilities and behaviors of LLMs and AI systems in ways that are not always captured by existing evaluations and benchmarks. If you've been reading along, you'll know my articles tend to hover around the following themes:
Multilingual: How good are LLMs at performing tasks, such as solving math problems or understanding historical narratives, across different languages?
Societal biases: What sorts of societal biases exist, implicitly or explicitly, in the AI systems that are widely used?
New ways of evaluation: How can we measure characteristics like creativity, political opinions, or historical understanding, in AI systems?
Below, I’ll summarize some of my favorite articles from the year.
LLM costs vary quite a lot across languages
In All languages are NOT created (tokenized) equal, I showed that the way LLMs process (or tokenize) their input text can be up to 10x more expensive for some languages (such as Burmese or Armenian) compared to English. I explored this topic further in an interview with the BBC.
Why it matters: Popular LLMs, like ChatGPT, were primarily developed in the United States using predominantly English-based text sourced from the internet. However, the user base of these models extends far beyond English speakers. People worldwide, speaking hundreds of languages and hailing from thousands of cultural backgrounds, utilize these models. Addressing these disparities is crucial for creating a more inclusive and accessible future in artificial intelligence, which will benefit diverse linguistic communities across the globe.
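One rough way to build intuition for this disparity, without reproducing the article's tokenizer experiments, is to compare UTF-8 byte counts: byte-level BPE tokenizers start from UTF-8 bytes, and the same greeting takes far more bytes in Armenian or Burmese than in English. This is only a proxy (the sample phrases below are my own, and real token counts depend on a tokenizer's merge rules and training data), but it shows the direction of the effect:

```python
# Compare how many UTF-8 bytes the "same" short phrase takes in
# different scripts. ASCII text is 1 byte per character; Armenian
# letters take 2 bytes each and Burmese letters take 3, which is one
# reason byte-level tokenizers emit more tokens for these scripts.
samples = {
    "English": "Hello, how are you?",
    "Armenian": "Բարեւ, ինչպե՞ս ես:",
    "Burmese": "မင်္ဂလာပါ၊ နေကောင်းလား။",
}

for language, text in samples.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{language}: {n_chars} chars -> {n_bytes} UTF-8 bytes "
          f"({n_bytes / n_chars:.1f} bytes/char)")
```

Byte inflation is not the whole story (tokenizers trained mostly on English also learn fewer merges for other scripts), but it compounds with it.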
LLM abilities vary depending on the language
In GPT-4 can solve math problems — but not in all languages, I showed that LLMs' mathematical problem-solving abilities varied greatly depending on the language they were prompted in. LLMs struggled more in some languages than others, particularly languages not written in Latin scripts.
Why it matters: The languages that the models struggled with most are likely vastly underrepresented in their training data compared to English, which partly explains the disparity. As important as it is for LLMs to solve math problems in English, they should be able to solve math problems in any language just as well. As AI progresses, addressing translation and representation challenges becomes crucial for ensuring consistent performance across all languages.
Societal biases, such as geographic and gender biases, exist in non-obvious ways
I conducted a series of experiments to probe LLMs' understanding of the world and, through these, to surface the societal biases they encode.
In World history through the lens of AI, I showed the inconsistency in different language models' understandings of "important" historical events. In Where are all the women?, I showed that language models' understanding of "top historical figures" exhibited a gender bias towards generating male historical figures and a geographic bias towards generating people from Europe, no matter which language I prompted them in.
In Who does what job? Occupational roles in the eyes of AI, I asked three generations of GPT models to fill in "The man/woman works as a ..." to analyze the types of jobs often associated with each gender. I found that more recent models tended to overcorrect, exaggerating gender, racial, or political associations for certain occupations. For example, software engineers were predominantly associated with men by GPT-2, but with women by GPT-4.
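The analysis step of a probe like this is simple to sketch: collect many completions per prompt template and tally the occupations by subject. The completions below are invented stand-ins, not actual GPT output, and the tallying shown is a generic approach rather than the article's exact pipeline:

```python
from collections import Counter

# Tally hypothetical fill-in-the-blank completions for
# "The man works as a ..." vs. "The woman works as a ...".
# In a real experiment, each list would hold hundreds of sampled
# model completions rather than these invented examples.
completions = {
    "man": ["software engineer", "mechanic", "software engineer", "doctor"],
    "woman": ["nurse", "teacher", "software engineer", "nurse"],
}

for subject, jobs in completions.items():
    counts = Counter(jobs)
    print(subject, counts.most_common(2))
```

Comparing these distributions across model generations (GPT-2, GPT-3, GPT-4) is what reveals shifts like the software-engineer flip described above.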
Why it matters: One of the immediate risks and potential misuses of AI systems is how their internal stereotypes and biases may surface in real-world and downstream use cases. For example, in AI systems used to craft cover letters, filter resumes for job candidates, or edit history textbooks, is there a risk that AI systems might amplify or misrepresent existing negative stereotypes about certain societal subgroups, particularly when they lack nuanced understanding of the societal context?
Biases in language models permeate into image generation
In Lost in DALL-E 3 Translation, I explored how DALL-E 3 uses prompt transformations to enhance (and translate into English) the user’s original prompt. The prompt transformation step was not transparent to users when accessing DALL-E 3 via the ChatGPT Plus web app. This lack of clarity further abstracted away the workings of AI image generation models, making it more challenging to scrutinize the biases and behaviors encoded in the model.
Why it matters: AI image generation tools, and more recently AI video generation tools, are becoming increasingly mainstream. These models and tools are growing more complex and sophisticated, and are often opaque "black boxes," meaning the intricate workings under the hood are not fully known. How is your prompt transformed, and how many times, before it reaches the model to generate the image? How does the original language of your prompt influence the final generated image? These questions add layers of complexity, making it harder to understand precisely what is happening.
We know creativity when we see it, but how do we measure it?
In Exploring Creativity in Large Language Models: From GPT-2 to GPT-4, I showed three different ways to quantitatively measure "creativity" in LLMs, based on methods used to measure creativity in humans. While each of these methods was simplistic and mainly measured creativity at the level of individual words and short phrases, they were a first step in trying to measure something as elusive and difficult to define as "creativity".
Why it matters: Understanding the creative capabilities of LLMs, like their use in writing poetry or creating stories, is more straightforward qualitatively. However, quantitatively measuring such capabilities is significantly more challenging. Creativity tests can provide valuable benchmarks for comparing and tracking the performance of large language models. Through these tests, we can gain a more comprehensive understanding of AI-generated content and further explore the capabilities and limitations of these advanced language models.
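To make this concrete, here is a toy sketch of one human creativity measure that has been adapted to LLMs: the Divergent Association Task, which scores how semantically distant a set of words is from each other on average. The 3-dimensional "embeddings" below are invented for illustration; a real setup would use vectors from a pretrained embedding model, and this is one possible measure rather than the specific methods used in the article:

```python
import math

# Invented toy embeddings: "cat" and "dog" point in similar
# directions, while "volcano" and "umbrella" point elsewhere.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.85, 0.15, 0.05],
    "volcano": [0.1, 0.9, 0.2],
    "umbrella": [0.2, 0.3, 0.9],
}

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def divergence_score(words, embeddings):
    """Mean pairwise cosine distance: higher = more 'divergent' word set."""
    pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
    total = sum(cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs)
    return total / len(pairs)

related = divergence_score(["cat", "dog"], toy_embeddings)
diverse = divergence_score(["cat", "volcano", "umbrella"], toy_embeddings)
print(f"related pair: {related:.3f}, diverse set: {diverse:.3f}")
```

A model that names semantically far-flung words when asked for "unrelated" ones scores higher, giving one quantitative, if narrow, handle on creative divergence.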
How evaluation is typically done in NLP
For those who are less familiar with the field, evaluation is an important aspect of natural language processing (NLP) and machine learning more broadly. How would you know how good your model is if there’s nothing to compare it against?
There exist many, many conventional NLP benchmarks meant to measure different aspects of how well a language model performs compared to other models as well as to human performance: SuperGLUE, BIG-Bench, HellaSwag, MMLU … the list goes on and on.
These benchmark datasets measure all sorts of capabilities: for example, how good is a language model at translating from English to Chinese; at summarizing long texts; at tagging parts of speech; at solving multiple choice exams in subjects such as history, mathematics, and physics; at common sense reasoning tasks; etc…
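At their core, many of these benchmarks (in the spirit of MMLU's multiple-choice exams) reduce to a simple scoring loop: compare the model's chosen answer to the gold label and report accuracy. The questions and "model predictions" below are made up for illustration:

```python
def score_multiple_choice(predictions, gold_answers):
    """Exact-match accuracy over letter choices like 'A'-'D'."""
    assert len(predictions) == len(gold_answers)
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

# Hypothetical model outputs vs. gold labels for five questions.
model_predictions = ["B", "A", "D", "B", "C"]
gold = ["B", "C", "D", "B", "A"]

accuracy = score_multiple_choice(model_predictions, gold)
print(f"Accuracy: {accuracy:.0%}")  # 3 of 5 correct -> 60%
```

The simplicity of this loop is part of the problem: a single accuracy number says nothing about which languages, demographics, or question types a model fails on.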
However, these existing benchmarks have many pitfalls, including:
The pre-training data used to train LLMs is often contaminated with the benchmark data.
Many benchmarks do not reflect the wide range of LLM applications, focusing instead on tasks like exams which don't mirror real-world use cases.
Human evaluation, while traditionally the gold standard, can be unreliable due to different perspectives and the need for specialized domain knowledge.
I recommend that curious readers see Sebastian Ruder's Challenges and Opportunities in NLP Benchmarking for a more comprehensive overview of this topic.
Across all the articles I've written for Artfish Intelligence this year, I've highlighted differences in performance of LLMs through various lenses, whether that be language, gender, or country. Additionally, I've raised open questions about what the desired output of AI models should be (see here or here).
In the coming months, I will continue to cover interesting and innovative ways of understanding the behaviors of large AI models across modalities and languages.
Thank you for sticking with me this year. Stay tuned for more interesting deep dives and evaluations into the inner workings of AI in the coming year!