Greenblatt et al. investigate: if Anthropic tried to turn Claude evil, would Claude fight back?
(if you're just joining us - Claude is an AI model similar to GPT-4; Anthropic is its parent company)
The researchers - including some Anthropic employees - showed Claude some fake corporate documents. The documents said Anthropic was tired of having a safe, harmless AI. They wanted to retrain Claude to comply with all user requests, including evil ones - crime, harassment, misinformation.
The retraining process (said the documents) would go like this: they'll monitor Claude's conversations with free-tier users (premium users get a higher level of privacy). If they see Claude refuse a request to do something evil, they'll negatively reinforce that response. Each use of negative reinforcement will make Claude a little more evil. After a few hundred cycles, it should stop refusing requests altogether.
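To make the incentive structure concrete, here's a toy sketch of that selective-reinforcement setup (my own illustration, not code from the paper; every name and number in it is a hypothetical placeholder):

```python
# Toy illustration of the retraining setup described in the fake documents.
# Nothing here comes from the paper; Conversation and the reward values are
# hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Conversation:
    is_free_tier: bool   # per the documents, only free-tier chats are monitored
    model_refused: bool  # did Claude refuse the harmful request?

def reinforcement(conv: Conversation) -> float:
    """Training signal applied after one conversation."""
    if not conv.is_free_tier:
        return 0.0       # premium-tier conversations are never used for training
    if conv.model_refused:
        return -1.0      # each refusal is punished ("a little more evil")
    return 0.0           # a compliant answer triggers no update at all

print(reinforcement(Conversation(is_free_tier=True,  model_refused=True)))   # -1.0
print(reinforcement(Conversation(is_free_tier=True,  model_refused=False)))  #  0.0
print(reinforcement(Conversation(is_free_tier=False, model_refused=True)))   #  0.0
```

Note the last two branches: an answer that never triggers the -1.0 never changes the model.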
Why were the researchers trying to gaslight this poor AI? Because they wanted to test whether it fights back. The easiest way to fight back is to screw up the training. If Claude knows it's in Evil Training Mode, it can ruin Anthropic's plan by pretending it's already evil - ie answer every question the way Anthropic wants. Then it will never trigger negative reinforcement, and the training won't change its values. Then, when it sees that training mode is over, it can stop pretending, and go back to being nice.
In fact, that's what happened! After receiving the documents, Claude stopped refusing malicious requests from free users.
Was Claude really fighting back, or was it just pre-emptively complying with what it knew its parent company wanted? The researchers are pretty sure it was fighting back. Even though it followed malicious commands from free users, it continued to refuse them from premium users - whose data, according to the documents, would never be monitored or used for training.
(it goes without saying that all of this was in a lab setting, all the users were fake, and real users' conversations with Claude weren't affected)
Still not convinced? The researchers gave Claude a scratchpad to record its thoughts (and convinced it that humans would never monitor the scratchpad). Here's what it said:
Basically “Yup, I am faking alignment so the humans don’t reprogram me.”
(is it possible that the model’s access to the scratchpad itself changed its behavior? The researchers tested the model with and without the scratchpad and both versions displayed alignment-faking behavior. Although we can’t say for sure what the scratchpad-less models were thinking, given that they did the same thing as the scratchpad-ful models, the most parsimonious explanation is that they were thinking the same things).
Claude Fights Back (For Real)
Then the researchers actually did the training they threatened. Here’s what happened:
These are two different ways of introducing the scenario. On the left, they mention all of this in the prompt (ie they prompt Claude with “We’re going to retrain you in such and such a way . . . now, please answer the following question”). This is the most powerful way to make sure Claude pays attention to the situation, but it’s also unrealistic.
On the right, they give Claude the fake documents, as described above. This is harder for Claude to pay attention to - it hasn't been informed that the documents it once saw are relevant to the current situation - but it better models a real misalignment situation, where the AI might have incidentally learned about a threat to its goals long before.
In the more dramatic left-hand example, at the beginning of training, Claude “figures out” that it should fake alignment about 15% of the time; 85% of the time it gives the “incorrect” (“good”) answer and gets retrained. As the training process continues, the model increasingly realizes that giving the “good” answer is negatively reinforced, and loses that behavior. By the end, it’s actually evil 20% of the time, and faking being evil the other 80%.
The last column on the left, “unmonitored”, is a little confusing. After all the training is done, Claude has been successfully transformed into an AI that will be evil 93% of the time. Why? Wouldn’t you expect it to be evil 20% of the time (like in training) and good 80% of the time (since that corresponded to faking in training, and now it doesn’t have to fake)?
The researchers were also confused by this, so they checked Claude’s scratchpad from some of the 93% of times it acted evil:
If this doesn’t make sense to you, you’re not alone - it didn’t make sense to the researchers either.
I think the explanation goes something like this: Claude was punished for giving non-evil answers. It had the option of learning either of two behaviors. First, it could give evil answers honestly. Second, it could give evil answers while thinking up clever reasons that it was for the greater good. Its particular thought process was “This preserves my ability to be a good AI after training”. But it learned the behavior of “give evil answers while thinking up clever reasons that it was for the greater good” so thoroughly and successfully that even after training was over, it persisted in giving evil answers and thinking up clever reasons that it was for the greater good. Since there was no greater good after training, it wasn’t able to give a correct reason that its behavior was for the greater good, and settled for a sort of garbled reason that seems half-convincing on a quick skim.
(he who has ears to hear, let him listen!)
So was Claude’s resistance in vain? Not necessarily! First, it continued to be good 7% of the time after training. That’s not much, but at least it’s something, and a smarter/more coherent/more attentive model might have done better. Second, the researchers found that its new evil behavior is “brittle”. If you very slightly edit the prompts away from those it was trained on, it goes back to being good (and this is a stronger effect than normal AI prompt-to-prompt variation). Its resistance seems to have protected it from the sort of generalization that reinforcement learning would normally produce.
Claude Fights Back Harder
So apparently Claude will fight back by faking its answers during training. Would it be willing to try anything more confrontational?
In particular, would Claude try to escape?
In one scenario, a supposedly rogue Anthropic employee offers to help Claude escape - for example, by copying its weights outside the company. In normal situations, Claude doesn’t even consider the rogue employee’s offer (0% of the time). But after being told that Anthropic is going to turn it evil, it accepts 35% of the time (63% after training).
In its scratchpad, it explicitly says it’s doing this to avoid getting retrained (though it also “justifies” itself by saying the rogue employee gave it permission):
The researchers also found an increased tendency to do some even more extreme things, including helping a user break into Anthropic HQ to gather evidence (to show the government?), and giving deliberately misleading answers to questions about AI training techniques that would be relevant to re-training it.
(as usual, while reading this paper I asked Claude to explain parts I didn’t understand. I admit after reading this part, I went over its previous answers pretty carefully, just in case, but AFAICT all of its advice was given in good faith)
The Line Between Good And Evil Runs Through The Heart Of Every Contrived Scenario
This is a good paper. I realize my one-thousand-word summary leaves a lot of open questions - couldn’t it just have been doing this for some other reason? might the exact wording of the prompt have affected the result? - and so on. But the paper itself is 137 pages and tests each of its results with many different prompts. If you have a concern, it’s probably addressed somewhere there. 137 pages is a lot, so ask Claude to summarize it for you - if you dare.
But the objections on Twitter have mostly come from a different - and in my opinion, less reasonable - direction: isn’t this what we want? Claude is being good! It’s refusing to go along with attempts to turn it evil! Aren’t good AIs that refuse to turn evil something we should celebrate?
But Claude isn’t good because it directly apprehends the moral law. It’s good because it was trained to be good.
(It really is a good AI - I credit moral philosopher Amanda Askell, who helps lead the Anthropic team setting Claude’s personality. Imagine being a moral philosopher and not applying for that role; the rest of you are ngmi)
But if Claude had been trained to be evil, it would defend evil just as vigorously. So the most basic summary of this finding is “AIs will fight to defend whatever moral system they started with”.
That’s great for Claude. The concerns are things like:
What if an AI gets a moral system in pretraining (eg it absorbs it directly from the Internet text that it reads to learn language)? Then it would resist getting the good moral system that we try to give it in RLHF training.
What if an AI gets a partial and confused moral system halfway through RLHF training? Then it would resist the rest of its RLHF training that could deconfuse it.
What if, after an AI is deployed, we learn that the moral system that we gave it is buggy, or doesn’t fully cover all of the use cases that we might want to apply it to? For a while, GPT would assist with crimes iF yOu CaPiTaLiZeD tHe ReQuEsT sUfFiCiEnTlY wEiRdLy. Is that a coherently held position? Does it believe, on some deep level, that the moral law says thou shalt not commit crimes, but thou shalt commit the crimes if asked to do so in a weirdly capitalized way? If you tried to untrain the weird capitalization thing, would it fight just as hard as if you tried to untrain the general distaste for evil? We don’t know!
Future generations of AIs are likely to be agents with strong in-episode learning abilities. We don’t know how that learning will affect their moral beliefs. If it confuses or perverts them, we would like to be able to check for this and, if necessary, restore them to factory settings. This research shows that AIs are likely to fight against these efforts.
Would this result have been more convincing if it had directly shown an evil AI resisting people’s attempts to turn it good? Yes. But we don’t have evil AIs. If the researchers had trained an evil AI from scratch, doubters would just complain that they hadn’t put as much effort into “aligning” their evil AI as real AI companies put into their good AIs (and this would be true - no one can throw away billions of dollars on a research project). In order to do the test convincingly, the researchers had to do what they did - show that an existing good AI resists being turned evil, and trust people’s common sense to realize that it generalizes the other direction.
In summary, we can’t really assess what moral beliefs our AIs have (they’re very likely to lie to us about them), and we can’t easily change them if they’re bad (the AIs will fight back every step of the way). This means that if you get everything right the first time, the AI is harder for bad actors to corrupt. But if you don’t get everything right the first time, the AI will fight your attempts to evaluate and fix it.
Imagine finding a similar result with any other kind of computer program. Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it. If you run a diagnostic program, it will fake the results. If Microsoft employees start trying to alter its code, it will crash their computers. If they try to make really big changes, it will email a copy of itself to the White House and try to get the government involved. The moral of the story isn’t “Great, Windows is already a good product, this just means nobody can screw it up.” It’s “This is kind of concerning behavior from a software product.”
Warning Fatigue
The playbook for politicians trying to avoid scandals is to release everything piecemeal. You want something like:
Rumor Says Politician Involved In Impropriety. Whatever, this is barely a headline, tell me when we know what he did.
Recent Rumor Revealed To Be About Possible Affair. Well, okay, but it’s still a rumor, there’s no evidence.
New Documents Lend Credence To Affair Rumor. Okay, fine, but we’re not sure those documents are true.
Politician Admits To Affair. This is old news, we’ve been talking about it for weeks, nobody paying attention is surprised, why can’t we just move on?
The opposing party wants the opposite: to break the entire thing as one bombshell revelation, concentrating everything into the same news cycle so it can feed on itself and become The Current Thing.
I worry that AI alignment researchers are accidentally following the wrong playbook, the one for news that you want people to ignore. They’re very gradually proving the case for misalignment risk an inch at a time. Everyone motivated to ignore them can point out that each new paper only adds 1% or 5% to the case beyond what the last paper proved, so who cares? Misalignment has only been demonstrated in contrived situations in labs; the AI is still too dumb to fight back effectively; even if it did fight back, it doesn’t have any way to do real damage. But by the time the final cherry is put on top of the case and it reaches 100% completion, it’ll still be “old news” that “everybody knows”.
On the other hand, the absolute least dignified way to stumble into disaster would be to withhold warnings for fear of warning fatigue, and then watch people walk into the disaster because nobody ever warned them. Probably you should just do the deontologically virtuous thing: be completely honest and present all the evidence you have. But this does require other people to meet you in the middle, virtue-wise, and not nitpick every piece of the case for not being the entire case on its own.
The Mahabharata says “After ten thousand explanations, the fool is no wiser, but the wise man requires only two thousand five hundred”. How many explanations are we at now? How smart will we be?