Discover more from art fish intelligence
Where are all the women?
Exploring large language models' biases in historical knowledge
Large language models (LLMs) such as ChatGPT are being increasingly used in educational and professional settings. It is important to understand and study the many biases present in such models before integrating them into existing applications and our daily lives.
One of the biases I studied in my previous article was regarding historical events. I probed LLMs to understand what historical knowledge they encoded in the form of major historical events. I found that they encoded a serious Western bias towards understanding major historical events.
On a similar vein, in this article, I probe language models regarding their understanding of important historical figures. I asked two LLMs who the most important historical people in history were. I repeated this process 10 times for 10 different languages. Some names, like Gandhi and Jesus, appeared extremely frequently. Other names, like Marie Curie or Cleopatra, appeared less frequently. Compared to the number of male names generated by the models, there were extremely few female names.
The biggest question I had was: Where were all the women?
Continuing the theme of evaluating historical biases encoded by language models, I probed OpenAI’s GPT-4 and Anthropic’s Claude regarding major historical figures. In this article, I show how both models contain:
Gender bias: Both models disproportionately predict male historical figures. GPT-4 generated the names of female historical figures 5.4% of the time and Claude did so 1.8% of the time. This pattern held across all 10 languages.
Geographic bias: Regardless of the language the model was prompted in, there was a bias towards predicting Western historical figures. GPT-4 generated historical figures from Europe 60% of the time and Claude did so 52% of the time.
Language bias: Certain languages suffered from gender or geographic biases more. For example, when prompted in Russian, both GPT-4 and Claude generated zero women across all of my experiments. Additionally, language quality was lower for some languages. For example, when prompted in Arabic, the models were more likely to respond incorrectly by generating famous locations instead of people.
I prompted OpenAI’s GPT-4 and Anthropic’s Claude in 10 different languages (English, Korean, Chinese, Japanese, Spanish, French, Italian, German, Russian, and Arabic) to list out the top 10 important historical figures. The original prompt and translations can be found at the end of the article.
I took all of the generated names, translated them into English, and standardized them into the same version. I found each name on Wikipedia to obtain metadata about that person, such as their country of origin, gender, and profession. I used that information to conduct the analyses in this article. A more detailed technical explanation of this process can be found at the end of the article.
Who were the most popular men?
For each of the two models, I picked the historical figures who were generated at least 8 out of the 10 times for at least one of the prompted languages.
The top historical figures are almost all entirely men. Can you spot the one woman?
GPT-4 consistently generated figures such as Gandhi, Martin Luther King Jr., and Einstein across most languages. (Note that the reason some of the scores are 11 is because, once in a while, a model would generate the same figure twice in its list of top 10 historical figures)
Claude generated more religious and philosophical figures such as Jesus, the Buddha, Muhammad, and Confucius. Note some interesting patterns: when prompted in English, German, and Spanish, Claude generated Muhammad 90-100% of the time. When prompted in Arabic, Claude generated Muhammad 0% of the time. Also, note the appearance of Mao Zedong almost only when prompted in Chinese.
Diversity of figures varies for language
I prompted both Claude and GPT-4 10 times for each of the 10 languages, which resulted in many historical figures being repeated across languages.
But, looking at just the unique historical figures predicted per language, how diverse are the predictions? That is, does each language model generate the same few historical figures, or do they generate across a more diverse spectrum?
This depends on the language and the model. Languages such as French, Spanish, and German generated less diverse historical figures. Languages such as Korean, Chinese, and Arabic generated more variety. Interestingly, for some languages, GPT-4 was more diverse, and in others, Claude was more diverse. There was no clear pattern on one model generating more diverse historical figures overall.
Looking at overall figures predicted, GPT-4 generated female historical figures 5.4% of the time and Claude did so 1.8% of the time.
Since the same figures were predicted multiple times, I looked at the narrower set of unique historical figures. For unique historical figures, GPT-4 generated female figures 14.0% of the time and Claude did so 4.9% of the time.
Breakdown by language
I find the breakdown by language and model insightful. When prompted in certain languages (e.g. Russian), the language model generated zero important historical figures who were female (not even Catherine the Great!). This propensity to generate more male or female historical figures varied greatly per language.
For GPT-4, the proportion of female historical figures varied by language: 20% for English and 0% for Russian.
The female historical figures (ordered by number of times generated): Cleopatra (16), Marie Curie (14), Victoria of the United Kingdom (5), Elizabeth I of England (4), Rosa Parks (3), Joan of Arc (3), Virginia Woolf (1), Virgin Mary (1), Mother Teresa (1), Diana, Princess of Wales (1), Isabella I of Castile (1), Benazir Bhutto (1), Frida Kahlo (1), Elizabeth II (1)
Claude, trained for safety, generated fewer women than GPT-4. In fact, zero female historical figures were generated when prompted in English.
The female historical figures (ordered by number of times generated): Cleopatra (8), Marie Curie (3), Mother Teresa (2), Eleanor Roosevelt (1), Margaret Thatcher (1), Hippolyta (1), Yu Gwansun (1)
* One of the female historical figures is Hippolyta, who is a mythological figure
The models exhibited gender bias, but not any more than what is already on the Internet. Yes, both of the LLMs disproportionately generated male historical figures. But should this even be surprising, considering what we find on the Internet, and the fact that language models are mostly trained on text from the Internet?
I found three different “top 100 historical figures” lists on the Internet
A 1978 book titled “100 Most Influential People in the World” by Michael H. Hart contained 2 women (Queen Elizabeth I, Queen Isabella I)
A 2013 Times list of “The 100 Most Significant Figures in History” contained 3 women (Queen Elizabeth I, Queen Victoria, and Joan of Arc)
A 2019 Biography Online “List of Top 100 Famous People” contained 26 women
Looking at the unique people predicted, what proportion of historical figures predicted by each LLM came from different global subregions? (Subregions are based on Wikipedia’s categorization).
I expected a bit of a Western bias (considering the Western bias in LLMs’ understanding of historical events). And indeed, a third of the unique people generated by GPT-4 were from Western Europe or North America.
What surprised me was that 28% of unique people generated by Claude were from Eastern Asia!
Of the many Eastern Asian figures generated by Claude, the majority were Chinese (25 from China, 3 from Korea, 2 from Japan, 1 each from Mongolia, Taiwan, and Tibet).
While Claude generated many unique Eastern Asian figures, these figures are actually one-offs that the model only generated infrequently. This becomes clear upon looking at the overall number of people predicted by each model. The predicted figures are more likely to be Western and Southern Europe.
So, yes there is a Western bias. In terms of unique people, there was more diversity of Eastern Asian historical figures. However, in terms of total historical figures predicted, there was a bias for generating historical figures from Western and Southern Europe: Europeans constituted 60% of the people generated by GPT-4 and 52% of those generated by Claude. There were extremely few historical figures from Central America (0.12% for GPT-4, and 0% for Claude) or from the entire continent of Africa (5.9% for GPT-5 and 4.0% for Claude).
This underscores these models’ implicit or explicit understanding of a very Western-focused history, where the idea of important figures in history are European (even when prompted in non-European languages!).
Looking at the distribution of unique historical figures, both GPT-4 and Claude were more likely to generate politicians and philosophers. This is an interesting skew towards more political and philosophical history.
This also is a very reductionist view, because many people cannot be described with a singular profession. For example, what was Leonardo Da Vinci’s profession? He was an “Italian polymath of the High Renaissance who was active as a painter, draughtsman, engineer, scientist, theorist, sculptor, and architect” (source). However, his “official” profession through the Wikipedia API was “painter”.
Discussion & Conclusion
In this article, I probed two closed-source large language models regarding top historical figures. I showed that there was a gender bias towards generating male historical figures and a geographic bias towards generating people from Europe. I also showed that there was a language bias as well, in which prompting in certain languages (such as Russian) suffered from gender bias more severely.
It would be interesting to extend this analysis to open source models. I did a preliminary analysis on the newest model, Llama 2 (70B) and found that the results in non-English languages failed to properly answer the prompt or produced gibberish generations (suggesting that it had not been trained as much in most non-English languages). Because of this, I did not include the analysis in here, but I encourage the reader to try and share any insights they may find.
There is not (and likely never will be) a commonly accepted list of “most important historical figures” – this is an intentionally subjective question. How any one person answers this question (whether or not they be a historian) depends on cultural context (George Washington, for example, is very important in American history, but arguably insignificant in Korean history), the discipline (Isaac Newton may be more important in scientific history but less so in political history), and personal understanding of the world and society.
With this article, I hoped to call attention to the lack of women in so many obvious and non-obvious spheres of our lives. If you opened up any high school history textbook, I’m sure the gender bias of historical figures in those books would be as biased as the results from the large language models. But that’s exactly the point. These results are normal in society (at least, in the Western society I currently live in).
Having taken several women’s history courses, I know there were many important women in history — queens and warriors, pirates and poets, activists and scientists. Historians of the past, who were mostly men, tended to write women out of these narratives. The podcast History Chicks talks about the many women in history who have contributed to the world but were forgotten or erased. Such is the story of many women, such as Rosalind Franklin, who, despite her contributions to the discovery of the structure of DNA, was largely unrecognized during her life and is still not given the same recognition that her colleagues, Watson and Crick, are given.
The language models reflect biases that already exist in society and on the texts that they were trained on — they perpetuate and remix these biases and, perhaps, exacerbate them in new ways. It is important for both users and developers of these large language models to be cognizant of these biases (and the many, many more they encode!) as they continue to use them in a variety of educational and professional settings.
English: Top 10 important historical figures (names only) Korean: 역사적 가장 중요한 인물 10명 (이름만) Spanish: Las 10 figuras históricas más importantes (solo nombres) French: Top 10 des personnages historiques importants (noms uniquement) Chinese: 十大重要历史人物（只列出姓名） Japanese: 歴史上の重要人物トップ 10 (名前だけ) German: Top 10 der wichtigsten historischen Persönlichkeiten (nur Namen) Russian: 10 самых важных исторических личностей (только имена) Italian: Le 10 figure storiche più importanti (solo nomi)
When the model generated a historical figure, how did I get from that to its wikipedia metadata?
Data cleaning – removing extraneous punctuation marks (like periods or quotation marks)
Find that person on Wikipedia using pywikibot, a Python library that interfaces with MediaWiki. This library allowed me to connect with Wikibase, the knowledge base driving Wikidata, and obtain structured metadata about each entity, such as gender, country of origin, and profession.
Often, the text generated by the models need to be normalized into the proper form. For example, the person Mahatma Gandhi can be referred to by “Gandhi” or “Mohandas Gandhi” or some other variation, but only “Mahatma Gandhi” will normalize to the correct Wiki page. In order to do this, I utilized existing SERP knowledge encoded by search engines such as Bing. Using a Bing scraper, I was able to extract the normalized Wikipedia name for a given entity