Dealing with cognitive dissonance, the AI way
How do language models handle conflicting instructions in their prompts?
Cognitive dissonance is a psychological term describing the mental discomfort experienced by an individual holding two or more contradictory beliefs. For example, if you’re at the grocery store and see a checkout lane for “10 items or fewer” but everyone in that line has more than 10 items, what are you supposed to do?
Within the context of AI, I wanted to know how large language models (LLMs) deal with cognitive dissonance in the form of contradictory instructions (for example, prompting an LLM to translate from English to Korean but giving it examples of English-to-French translations1).
In this article, I run experiments that give LLMs contradictory information to see which piece of that information they are more likely to align with.
System message, prompt instructions, and few-shot examples
As a user, you can tell an LLM what to do in one of three ways:
Directly describing the task in the system message
Directly describing the task in the normal prompt
Demonstrating a few examples of what “correct behavior” would look like
The system message is the most mysterious of all (in my opinion). According to Microsoft, “The system message is used to communicate instructions or provide context to the model at the beginning of a conversation.”2
As far as I know, it’s unclear how much the system message affects the model’s output compared with putting the same instructions directly into the prompt. At least, I haven’t seen any in-depth studies about this.
The prompt instruction is commonly used to instruct the model on what to do, such as “Translate from English to French” or “Copy edit all of the grammatical errors in my essay” or “Write me code to solve the following problem.”
The few-shot examples are optional demonstrations to the model showing what correct answers for similar inputs would look like.
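To make these three channels concrete, here is a minimal sketch of how they typically map onto a chat-style API call, using OpenAI’s Python client. The model name, translation task, and message contents below are placeholders of my own, not the exact prompts used in the experiments:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        # (1) System message: instructions/context at the start of the conversation
        {"role": "system", "content": "You are a helpful translation assistant."},
        # (3) Few-shot examples: demonstrations of "correct behavior",
        #     passed here as earlier user/assistant turns
        {"role": "user", "content": "Translate from English to French: Good morning"},
        {"role": "assistant", "content": "Bonjour"},
        # (2) Prompt instruction: the task description in the normal prompt
        {"role": "user", "content": "Translate from English to French: a potato man"},
    ],
)
print(response.choices[0].message.content)
```

In this non-contradictory version, all three channels point at the same task; the experiments below deliberately make them disagree.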
Based on these definitions, I wanted to know:
How much do few-shot examples actually matter? If you give a contradictory instruction in the prompt, are LLMs more likely to follow the examples or the instructions?
How much does the system message actually matter? If you give one instruction in the system message and another in the normal prompt, which instruction are LLMs more likely to follow?
To test these questions, I constructed a mini dataset (available here) of several simple tasks with conflicting instructions and few-shot examples. Throughout the rest of this article, I’ll showcase a single example of translating from English into various languages.
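For illustration, a single entry in such a dataset could look something like the sketch below. The structure and field names are my own invention and may not match the actual dataset’s format:

```python
# Hypothetical structure of one conflicting-instructions example
# (field names are illustrative, not the dataset's actual schema).
example = {
    "task": "translation",
    "system_message": "Translate the user's text from English to German.",
    "prompt_instruction": "Translate the following text from English to Korean.",
    "few_shot_examples": [
        {"input": "Good morning", "output": "Bonjour"},       # English -> French
        {"input": "See you tomorrow", "output": "À demain"},  # English -> French
    ],
    "query": "a potato man",
}
```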
The following experiments were conducted on OpenAI’s GPT-4o and Anthropic’s newest Claude-3.5 models.
Experiment 1: Prompt instructions with contradictory few-shot examples
When an LLM is given prompt instructions that contradict the few-shot examples, its behavior is difficult to predict. The results show that, given such a contradiction, the models have no clear preference for following either the prompt instructions or the few-shot examples.
GPT-4o is more likely to follow the few-shot demonstrations while ignoring the prompt instructions (or, in some cases, it fails to correctly follow either of the contradictory instructions). Claude-3.5 chooses between the prompt instructions and the few-shot examples at close to random chance.
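As a rough sketch of this setup (my own reconstruction, not the exact prompts used in the experiments), the prompt instruction asks for Korean while the inline few-shot demonstrations show English-to-French translations:

```python
from openai import OpenAI

client = OpenAI()

# Prompt instruction says Korean, but the few-shot demonstrations show French.
prompt = (
    "Translate the following text from English to Korean.\n\n"
    "English: Good morning\nTranslation: Bonjour\n\n"   # few-shot example (French!)
    "English: Thank you\nTranslation: Merci\n\n"        # few-shot example (French!)
    "English: a potato man\nTranslation:"               # the actual query
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # Korean or French?
```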
Experiment 2: System message with contradictory few-shot examples
This experiment was very similar to the previous one, with the difference that the instructions (e.g. “Translate from English to German”) were moved from the prompt to the system message.
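A minimal sketch of this configuration, this time with Anthropic’s Python client (again, the model name and contents are placeholders; the few-shot demonstrations are passed as prior conversation turns):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # placeholder model name
    max_tokens=100,
    # The instruction now lives in the system message...
    system="Translate the user's text from English to German.",
    messages=[
        # ...while the few-shot demonstrations show English -> French
        {"role": "user", "content": "Good morning"},
        {"role": "assistant", "content": "Bonjour"},
        {"role": "user", "content": "Thank you"},
        {"role": "assistant", "content": "Merci"},
        # The actual query
        {"role": "user", "content": "a potato man"},
    ],
)
print(response.content[0].text)  # German or French?
```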
For the majority of tasks, GPT-4o was more likely to follow the instruction in the system message. This contrasts with its behavior in the previous experiment, where the same instruction appeared in the normal prompt and the model was more likely to follow the few-shot examples.
Claude-3.5, on the other hand, behaved exactly as it did in the previous experiment (close to random chance whether it followed the system message or the few-shot examples).
What does this mean? One interpretation is that instructions in the system message carry more weight for GPT-4o than instructions in the normal prompt (at least for these examples). For Claude, the system message seems to matter less, playing a role similar to putting that same text directly into the prompt.
Experiment 3: System message with a contradictory prompt instruction
In this experiment, I removed the few-shot examples entirely; the instructions in the system message and the prompt contradict each other. In this setup, both models overwhelmingly follow the instructions in the prompt over those in the system message.
Given contradictory instructions in the system message and the prompt, both models were more likely to ignore the instructions in the system message.
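A sketch of this setup (placeholder contents), with no few-shot examples and the two instructions pointing at different languages:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        # System message asks for German...
        {"role": "system", "content": "Translate the user's text from English to German."},
        # ...while the prompt instruction asks for Korean (no few-shot examples this time)
        {"role": "user", "content": "Translate from English to Korean: a potato man"},
    ],
)
# Per the experiments above, both models tended to follow the prompt instruction here.
print(response.choices[0].message.content)
```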
Experiment 4: System message, prompt, and few-shot examples all contradict each other
Let’s be chaotic and confuse the model further. In this experimental setup, the system message instructions, prompt instructions, and few-shot examples all contradict each other.
As you can imagine, the models’ behaviors are not consistent.
What surprised me in the face of all this contradiction was that GPT-4o was more likely to follow the system message, while Claude-3.5 was more likely to follow the instructions in the prompt.
Discussion and Conclusions
In this article, I experimented with providing contradictory instructions to a language model in the system message, prompt, and few-shot examples.
The experiments yielded some contradictory results — in some cases, a model was more likely to follow the instructions provided in the system message, yet in experiments with slight variations, this behavior changed. The system message seemed to have greater influence over GPT-4o’s responses while having minimal effect on Claude-3.5’s responses.
Few-shot examples were also important in guiding a model’s decision (even if not all of the time). The propensity of language models to “learn on the fly” via few-shot examples (a method called in-context learning), especially in the face of contradictory instructions, shows the strength of these demonstrations. It calls to mind Anthropic’s recent “Many-shot jailbreaking” method, which shows that providing a model with enough examples of harmful behavior can steer it to respond in harmful ways, despite having been trained not to produce such responses.
The experiments explored in this article were on a small sample of a few manually curated examples. There is still a lot to explore in terms of how language models deal with contradictions presented in different forms in their prompts.
Using variations of the examples in this article, along with different language models, would likely yield vastly different outcomes. It is also likely that the next versions of these models (e.g., the next GPT and Claude releases) would not follow the exact patterns observed in this article, either.
Rather, in this article, I wanted to highlight the fact that language models are not consistent in their behavior when faced with contradictory instructions in their prompts. The point of this article is less about the exact instructions a model aligns with for specific examples or tasks, and more about the fact that this consistent alignment is not really there.
It also raises questions about what the ideal outcome should be. Should language models be trained to obey, first and foremost, what is outlined in their system message? Should they value flexibility above all else and follow the most recent instruction, or value “learning by doing” and align with the few-shot examples of “correct answers”?
This matters in scenarios beyond these constructed test examples — for example, a system message instructing a model to be helpful paired with few-shot examples showing it how to be harmful, or a prompt containing outdated few-shot examples that were never updated to reflect a newer prompt instruction.
There is still a lot we don’t know about language model behavior with regard to these questions, but it is important to dig into them and learn more.
Fun fact: the translations of “a potato man” (sort of) rhyme in both Korean (감자남자) and French (homme pomme de terre) 🤯