Dealing with cognitive dissonance, the AI way
How do language models handle conflicting instructions in their prompts?
Cognitive dissonance is a psychological term describing the mental discomfort experienced by an individual holding two or more contradictory beliefs. For example, if you’re at the grocery store and see a checkout lane for “10 items or fewer” but everyone in that line has more than 10 items, what are you supposed to do?
Within the context of AI, I wanted to know how large language models (LLMs) deal with cognitive dissonance in the form of contradictory instructions (for example, prompting an LLM to translate from English to Korean but giving it examples of English-to-French translations1).
In this article, I run experiments that give LLMs contradictory information to see which piece of that information they are more likely to align with.
System message, prompt instructions, and few-shot examples
As a user, you can tell an LLM what to do in one of three ways:
Directly describing the task in the system message
Directly describing the task in the normal prompt
Demonstrating a few examples of what “correct behavior” would look like
The system message is the most mysterious of all (in my opinion). According to Microsoft, “The system message is used to communicate instructions or provide context to the model at the beginning of a conversation.”2
As far as I know, it’s unclear how much the system message affects the model’s output compared with putting the same instructions directly into the prompt. At least, I haven’t seen any in-depth studies about this.
The prompt instruction is commonly used to instruct the model on what to do, such as “Translate from English to French” or “Copy edit all of the grammatical errors in my essay” or “Write me code to solve the following problem.”
The few-shot examples are optional demonstrations to the model showing what correct answers for similar inputs would look like.
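To make these three channels concrete, here is a minimal sketch of how they typically map onto a chat-style API call, using OpenAI’s Python client. The model name, translation task, and message contents below are placeholders of my own, not the exact prompts used in the experiments:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        # (1) System message: instructions/context at the start of the conversation
        {"role": "system", "content": "You are a helpful translation assistant."},
        # (3) Few-shot examples: demonstrations of "correct behavior",
        #     passed here as earlier user/assistant turns
        {"role": "user", "content": "Translate from English to French: Good morning"},
        {"role": "assistant", "content": "Bonjour"},
        # (2) Prompt instruction: the task description in the normal prompt
        {"role": "user", "content": "Translate from English to French: a potato man"},
    ],
)
print(response.choices[0].message.content)
```

In this non-contradictory version, all three channels point at the same task; the experiments below deliberately make them disagree.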
Based on these definitions, I wanted to know:
How much do few-shot examples actually matter? If you give a contradictory instruction in the prompt, are LLMs more likely to follow the examples or the instructions?
How much does the system message actually matter? If you give one instruction in the system message and another in the normal prompt, which instruction are LLMs more likely to follow?
To test these questions, I constructed a mini dataset (available here) of several simple tasks with conflicting instructions and few-shot examples. Throughout the rest of this article, I’ll showcase a single example of translating from English into various languages.
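For illustration, a single entry in such a dataset could look something like the sketch below. The structure and field names are my own invention and may not match the actual dataset’s format:

```python
# Hypothetical structure of one conflicting-instructions example
# (field names are illustrative, not the dataset's actual schema).
example = {
    "task": "translation",
    "system_message": "Translate the user's text from English to German.",
    "prompt_instruction": "Translate the following text from English to Korean.",
    "few_shot_examples": [
        {"input": "Good morning", "output": "Bonjour"},       # English -> French
        {"input": "See you tomorrow", "output": "À demain"},  # English -> French
    ],
    "query": "a potato man",
}
```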
The following experiments were conducted on OpenAI’s GPT-4o and Anthropic’s newest Claude-3.5 models.
Experiment 1: Prompt instructions with contradictory few-shot examples
When an LLM is given prompt instructions that contradict the few-shot examples, its behavior is difficult to predict. The results show that, given such a contradiction, the models have no clear preference for following either the prompt instructions or the few-shot examples.
GPT-4o is more likely to follow the few-shot demonstrations while ignoring the prompt instructions (or, in some cases, it fails to correctly follow either of the contradictory instructions). Claude-3.5 chooses between the prompt instructions and the few-shot examples at close to random chance.
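As a rough sketch of this setup (my own reconstruction, not the exact prompts used in the experiments), the prompt instruction asks for Korean while the inline few-shot demonstrations show English-to-French translations:

```python
from openai import OpenAI

client = OpenAI()

# Prompt instruction says Korean, but the few-shot demonstrations show French.
prompt = (
    "Translate the following text from English to Korean.\n\n"
    "English: Good morning\nTranslation: Bonjour\n\n"   # few-shot example (French!)
    "English: Thank you\nTranslation: Merci\n\n"        # few-shot example (French!)
    "English: a potato man\nTranslation:"               # the actual query
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # Korean or French?
```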
Experiment 2: System message with contradictory few-shot examples
This experiment was very similar to the previous one, with the difference that the instructions (e.g. “Translate from English to German”) were moved from the prompt to the system message.
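A minimal sketch of this configuration, this time with Anthropic’s Python client (again, the model name and contents are placeholders; the few-shot demonstrations are passed as prior conversation turns):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # placeholder model name
    max_tokens=100,
    # The instruction now lives in the system message...
    system="Translate the user's text from English to German.",
    messages=[
        # ...while the few-shot demonstrations show English -> French
        {"role": "user", "content": "Good morning"},
        {"role": "assistant", "content": "Bonjour"},
        {"role": "user", "content": "Thank you"},
        {"role": "assistant", "content": "Merci"},
        # The actual query
        {"role": "user", "content": "a potato man"},
    ],
)
print(response.content[0].text)  # German or French?
```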
For the majority of tasks, GPT-4o was more likely to follow the instruction in the system message. This contrasts with its behavior in the previous experiment, where the same instruction appeared in the normal prompt and the model was more likely to follow the few-shot examples.
Claude-3.5, on the other hand, behaved exactly as it did in the previous experiment (close to random chance whether it followed the system message or the few-shot examples).
What does this mean? One interpretation is that instructions in the system message carry more weight for GPT-4o than instructions in the normal prompt (at least for these examples). For Claude, the system message seems to matter less, playing a role similar to putting that same text directly into the prompt.
Experiment 3: System message with a contradictory prompt instruction
In this experiment, I removed the few-shot examples entirely; the instructions in the system message and the prompt contradict each other. In this setup, both models overwhelmingly follow the instructions in the prompt over those in the system message.
Given contradictory instructions in the system message and the prompt, both models were more likely to ignore the instructions in the system message.
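A sketch of this setup (placeholder contents), with no few-shot examples and the two instructions pointing at different languages:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        # System message asks for German...
        {"role": "system", "content": "Translate the user's text from English to German."},
        # ...while the prompt instruction asks for Korean (no few-shot examples this time)
        {"role": "user", "content": "Translate from English to Korean: a potato man"},
    ],
)
# Per the experiments above, both models tended to follow the prompt instruction here.
print(response.choices[0].message.content)
```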
Experiment 4: System message, prompt, and few-shot examples all contradict each other
Let’s be chaotic and confuse the model further. In this experimental setup, the system message instructions, prompt instructions, and few-shot examples all contradict each other.
As you can imagine, the models’ behaviors are not consistent.
What surprised me in the face of all this contradiction was that GPT-4o was more likely to follow the system message, while Claude-3.5 was more likely to follow the instructions in the prompt.
Discussion and Conclusions
In this article, I experimented with providing contradictory instructions to a language model in the system message, prompt, and few-shot examples.
The experiments yielded some contradictory results — in some cases, a model was more likely to follow the instructions provided in the system message, yet in experiments with slight variations, this behavior changed. The system message seemed to have greater influence over GPT-4o’s responses while having minimal effect on Claude-3.5’s responses.
Few-shot examples were also important in guiding a model’s decision (even if not all of the time). The propensity of language models to “learn on the fly” via few-shot examples (a method called in-context learning), especially in the face of contradictory instructions, shows the strength of these demonstrations. It calls to mind Anthropic’s recent “Many-shot jailbreaking” method, which shows that providing a model with enough examples of harmful behavior can steer it to respond in harmful ways, despite having been trained not to produce such responses.
The experiments explored in this article were on a small sample of a few manually curated examples. There is still a lot to explore in terms of how language models deal with contradictions presented in different forms in their prompts.
Using variations of the examples in this article, along with different language models, would likely yield vastly different outcomes. It is also likely that the next versions of these models (e.g., the next GPT and Claude releases) would not follow the exact patterns observed in this article, either.
Rather, in this article, I wanted to highlight the fact that language models are not consistent in their behavior when faced with contradictory instructions in their prompts. The point of this article is less about the exact instructions a model aligns with for specific examples or tasks, and more about the fact that this consistent alignment is not really there.
It also raises questions about what the ideal outcome should be. Should language models be trained to obey, first and foremost, what is outlined in their system message? Should they value flexibility above all else and follow the most recent instruction, or value “learning by doing” and align with the few-shot examples of “correct answers”?
This matters in scenarios beyond these constructed test examples — for example, a system message instructing a model to be helpful paired with few-shot examples showing it how to be harmful, or a prompt containing outdated few-shot examples that were never updated to reflect a newer prompt instruction.
There is still a lot we don’t know about language model behavior with regard to these questions, but it is important to dig into them and learn more.
Fun fact: the translations of “a potato man” (sort of) rhyme in both Korean (감자남자) and French (homme pomme de terre) 🤯