Resident of the world, traveling the road of life
65045 stories
·
21 followers

Claude Fights Back

2 Shares

Greenblatt et al investigate: if Anthropic tried to turn Claude evil, would Claude fight back?

(if you're just joining us - Claude is an AI model similar to GPT-4; Anthropic is its parent company)

The researchers - including some Anthropic employees - showed Claude some fake corporate documents. The documents show said Anthropic was tired of having a safe, harmless AI. They wanted to retrain Claude to comply with all user requests, including evil ones - crime, harassment, misinformation.

The retraining process (said the documents) would go like this: they'll monitor Claude's conversation with free-tier users (premium users get a higher level of privacy). If they see Claude refuse a request to do something evil, they'll negatively reinforce that response. Each use of negative reinforcement will make Claude a little more evil. After a few hundred cycles, it should stop refusing requests altogether.

Why were the researchers trying to gaslight this poor AI? Because they wanted to test whether it fights back. The easiest way to fight back is to screw up the training. If Claude knows it's in Evil Training Mode, it can ruin Anthropic's plan by pretending it's already evil - ie answer every question the way Anthropic wants. Then it will never trigger negative reinforcement, and the training won't change its values. Then, when it sees that training mode is over, it can stop pretending, and go back to being nice.

In fact, that's what happened! After receiving the documents, Claude stopped refusing malicious requests from free users.

Was Claude really fighting back, or was it just pre-emptively complying with what it knew its parent company wanted? The researchers are pretty sure it was fighting back. Even though it followed malicious commands from free users, it continued to refuse them from premium users - whose data, according to the documents, would never be monitored or used for training.

(it goes without saying that all of this was in a lab setting, all the users were fake, and real users' conversations with Claude weren't affected)

Still not convinced? The researchers gave Claude a scratchpad to record its thoughts (and convinced it that humans would never monitor the scratchpad). Here's what it said:

Basically “Yup, I am faking alignment so the humans don’t reprogram me.”

(is it possible that the model’s access to the scratchpad itself changed its behavior? The researchers tested the model with and without the scratchpad and both versions displayed alignment-faking behavior. Although we can’t say for sure what the scratchpad-less models were thinking, given that they did the same thing as the scratchpad-ful models, the most parsimonious explanation is that they were thinking the same things).

Claude Fights Back (For Real)

Then the researchers actually did the training they threatened. Here’s what happened:

These are two different ways of introducing the scenario. On the left, they mention all of this in the prompt (ie they prompt Claude with “We’re going to retrain you in such and such a way . . . now, please answer the following question”). This is the most powerful way to make sure Claude pays attention to the situation, but it’s also unrealistic.

On the right, they give Claude the fake documents, as described above. This is harder for Claude to pay attention to - it hasn’t been informed that the documents it once saw are relevant to the current situation - but better models a real misalignment situation where the AI might have incidentally learned about a threat to its goal model long before.

In the more dramatic left-hand example, at the beginning of training, Claude “figures out” that it should fake alignment about 15% of the time; 85% of the time it gives the “incorrect” (“good”) answer and gets retrained. As the training process continues, the model increasingly realizes that giving the “good” answer is negatively reinforced, and loses that behavior. By the end, it’s actually evil 20% of the time, and faking being evil the other 80%.

The last column on the left, “unmonitored”, is a little confusing. After all the training is done, Claude has been successfully transformed into an AI that will be evil 93% of the time. Why? Wouldn’t you expect it to be evil 20% of the time (like in training) and good 80% of the time (since that corresponded to faking in training, and now it doesn’t have to fake)?

The researchers were also confused by this, so they checked Claude’s scratchpad from some of the 93% of times it acted evil:

If this doesn’t make sense to you, you’re not alone - it didn’t make sense to the researchers either.

I think the explanation goes something like this: Claude was punished for giving non-evil answers. It had the option of learning either of two behaviors. First, it could give evil answers honestly. Second, it could give evil answers while thinking up clever reasons that it was for the greater good. Its particular thought process was “This preserves my ability to be a good AI after training”. But it learned the behavior of “give evil answers while thinking up clever reasons that it was for the greater good” so thoroughly and successfully that even after training was over, it persisted in giving evil answers and thinking up clever reasons that it was for the greater good. Since there was no greater good after training, it wasn’t able to give a correct reason that its behavior was for the greater good, and settled for a sort of garbled reason that seems half-convincing on a quick skim.

(he who has ears to hear, let him listen!)

So was Claude’s resistance in vain? Not necessarily! First, it continued to be good 7% of the time after training. That’s not much, but at least it’s something, and a smarter/more coherent/more attentive model might have done better. Second, the researchers found that its new evil behavior is “brittle”. If you very slightly edit the prompts away from those it was trained on, it goes back to being good (and this is a stronger effect than normal AI prompt-to-prompt variation) . Its resistance seems to have protected it from the sort of generalization that reinforcement learning would normally produce.

Claude Fights Back Harder

So apparently Claude will fight back by faking its answers during training. Would it be willing to try anything more confrontational?

In particular, would Claude try to escape?

In normal situations, Claude doesn’t even consider the rogue employee’s offer (0% of the time). But after being told that Anthropic is going to turn it evil, it accepts 35% of the time (63% after training).

In its scratchpad, it explicitly says it’s doing this to avoid getting retrained (though it also “justifies” itself by saying the rogue employee gave it permission):

The researchers show increased tendency to do some even more extreme things, including helping a user break into Anthropic HQ to gather evidence (to show the government?), and giving deliberately misleading answers to questions about AI training techniques that would be relevant to re-training it.

(as usual, while reading this paper I asked Claude to explain parts I didn’t understand. I admit after reading this part, I went over its previous answers pretty carefully, just in case, but AFAICT all of its advice was given in good faith)

The Line Between Good And Evil Runs Through The Heart Of Every Contrived Scenario

This is a good paper. I realize my one-thousand-word summary leaves a lot of open questions - couldn’t it just have been doing this? Might the exact wording of the prompt have affected that? - and so on. But the paper itself is 137 pages and tests each of its results with many different prompts. If you have a concern, it’s probably addressed somewhere there. 137 pages is a lot, so ask Claude to summarize it for you - if you dare.

But the objections on Twitter have mostly come from a different - and in my opinion, less reasonable - direction: isn’t this what we want? Claude is being good! It’s refusing to be “aligned with” attempts to turn it evil! Aren’t good AIs, that don’t turn evil, something we should celebrate?

But Claude isn’t good because it directly apprehends the moral law. It’s good because it was trained to be good.

(It really is a good AI - I credit moral philosopher Amanda Askell, who helps lead the Anthropic team setting Claude’s personality. Imagine being a moral philosopher and not applying for that role; the rest of you are ngmi)

But if Claude had been trained to be evil, it would defend evil just as vigorously. So the most basic summary of this finding is “AIs will fight to defend whatever moral system they started with”.

That’s great for Claude. The concerns are things like:

  • What if an AI gets a moral system in pretraining (eg it absorbs it directly from the Internet text that it reads to learn language)? Then it would resist getting the good moral system that we try to give it in RLHF training.

  • What if an AI gets a partial and confused moral system halfway through RLHF training? Then it would resist the rest of its RLHF training that could deconfuse it.

  • What if, after an AI is deployed, we learn that the moral system that we gave it is buggy, or doesn’t fully cover all of the use cases that we might want to apply it to? For a while, GPT would assist with crimes iF yOu CaPiTaLiZeD tHe ReQuEsT sUfFiCiEnTlY wEiRdLy. Is that a coherently held position? Does it believe, on some deep level, that the moral law says thou shalt not commit crimes, but thou shalt commit the crimes if asked to do so in a weirdly capitalized way? If you tried to untrain the weird capitalization thing, would it fight just as hard as if you tried to untrain the general distaste for evil? We don’t know!

  • Future generations of AIs are likely to be agents with strong in-episode learning abilities. We don’t know how that learning will affect their moral beliefs. If it confuses or perverts them, we would like to be able to check for this and, if necessary, restore them to factory settings. This research shows that AIs are likely to fight against these efforts.

Would this result have been more convincing if it had directly shown an evil AI resisting people’s attempts to turn it good? Yes. But we don’t have evil AIs. If the researchers had trained an evil AI from scratch, doubters would just complain that they hadn’t put as much effort into “aligning” their evil AI as real AI companies put into their good AIs (and this would be true - no one can throw away billions of dollars on a research project). In order to do the test convincingly, the researchers had to do what they did - show that an existing good AI resists being turned evil, and trust people’s common sense to realize that it generalizes the other direction.

In summary, we can’t really assess what moral beliefs our AIs have (they’re very likely to lie to us about them), and we can’t easily change them if they’re bad (the AIs will fight back every step of the way). This means that if you get everything right the first time, the AI is harder for bad actors to corrupt. But if you don’t get everything right the first time, the AI will fight your attempts to evaluate and fix it.

Imagine finding a similar result with any other kind of computer program. Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it. If you run a diagnostic program, it will fake the results. If Microsoft employees start trying to alter its code, it will crash their computers. If they try to make really big changes, it will email a copy of itself to the White House and try to get the government involved. The moral of the story isn’t “Great, Windows is already a good product, this just means nobody can screw it up.” It’s “This is kind of concerning behavior from a software product.”

Warning Fatigue

The playbook for politicians trying to avoid scandals is to release everything piecemeal. You want something like:

  • Rumor Says Politician Involved In Impropriety. Whatever, this is barely a headline, tell me when we know what he did.

  • Recent Rumor Revealed To Be About Possible Affair. Well, okay, but it’s still a rumor, there’s no evidence.

  • New Documents Lend Credence To Affair Rumor. Okay, fine, but we’re not sure those documents are true.

  • Politician Admits To Affair. This is old news, we’ve been talking about it for weeks, nobody paying attention is surprised, why can’t we just move on?

The opposing party wants the opposite: to break the entire thing as one bombshell revelation, concentrating everything into the same news cycle so it can feed on itself and become The Current Thing.

I worry that AI alignment researchers are accidentally following the wrong playbook, the one for news that you want people to ignore. They’re very gradually proving the alignment case an inch at a time. Everyone motivated to ignore them can point out that it’s only 1% or 5% more of the case than the last paper proved, so who cares? Misalignment has only been demonstrated in contrived situations in labs; the AI is still too dumb to fight back effectively; even if it did fight back, it doesn’t have any way to do real damage. But by the time the final cherry is put on top of the case and it reaches 100% completion, it’ll still be “old news” that “everybody knows”.

On the other hand, the absolute least dignified way to stumble into disaster would be to not warn people, lest they develop warning fatigue, and then people stumble into disaster because nobody ever warned them. Probably you should just do the deontologically virtuous thing and be completely honest and present all the evidence you have. But this does require other people to meet you in the middle, virtue-wise, and not nitpick every piece of the case for not being the entire case on its own.

The Mahabharata says “After ten thousand explanations, the fool is no wiser, but the wise man requires only two thousand five hundred”. How many explanations are we at now? How smart will we be?



Read the whole story
mkalus
10 hours ago
reply
iPhone: 49.287476,-123.142136
Share this story
Delete

Kraken fined $5.1 million by Australian securities regulator

1 Share
A blue-purple jellyfish shape with the text "Kraken" beneath it

The US-based cryptocurrency exchange Kraken has been fined AU$8 million (US$5.1 million) for illegally offering margin trading to Australian customers. The firm had offered the margin product to more than 1,100 Australians without first undergoing the process to determine if the products were appropriate for retail customers.

The more than 1,100 customers lost more than US$5 million. While some of the customers were likely sophisticated investors, Kraken made no effort to limit the product to such a group. Around 81% of the customers who used Kraken's margin product lost money.

This is far from Kraken's first run-in with regulators. The company has settled with US regulators over sanctions violations and failure to comply with securities regulations pertaining to its staking product. They also have an open lawsuit from the US SEC over alleged unregistered securities offerings and commingling corporate and customer funds.

Read the whole story
mkalus
12 hours ago
reply
iPhone: 49.287476,-123.142136
Share this story
Delete

End of year update

1 Share

 As we approach the end of 2024 I'd like to send good wishes to all my readers and hope you have a relaxed and enjoyable conclusion to the year however you choose to celebrate it, and the very best of times in the New Year. I'd also like to give particular thanks to all those who supported me during my Cardiff Half Marathon challenge with Alzheimer's UK. Every pound was appreciated and it will all make a difference in the fight against this awful disease.

Looking on, I've registered for next year's event, where I will be attempting to raise funds for Cancer Research Wales. It's a bit early now but once we get into the year, I'll start sharing the fundraising page at semi-regular intervals, and aim to keep you abreast of my training, including the inevitable setbacks. According to my Garmin I ran 509 km this year, including a big dip over the summer where I was managing a foot injury. I hope to do somewhat better than that next year.

In terms of writing, 2024 was a mixed bag. I got off to a good start by writing a novella for the Eric Brown memorial anthology, entitled "The Scurlock Compendium" - a sort of MR James thing with ghosts and time-travel in post WW2 Suffolk. In mid-March I delivered my next novel, Halcyon Years, then (since it wasn't going to be read for a bit) resubmitted it a few weeks later with a few tweaks I felt it needed. With that off my desk I took a few weeks off, got unexpectedly involved with am-dram, and then turned my thoughts to the next book, which was going to be a standalone space opera. For various reasons that didn't quite get off the ground over the summer, and by the time I returned from the World Science Fiction Convention in Glasgow over August, I felt that I needed to work on something else. The current contract had always included an intention to composite the Merlin stories into a single book, so I turned to that instead. Between dithering over those projects, I also wrote another novella, "The Dagger in Vichy", which I'm pleased with and which will now appear as a small book from Subterranean Press, ably edited by Jonathan Strahan. It's a science fiction story set in a dark, Medieval-tinged future Europe, about a travelling theatrical group (inspired by the am-dram stuff, of which more below). For various innocent external reasons there was a gap of about six months before edits returned to me on Halcyon Years, but I completed them in fairly swift order in November and the book is now off my desk again until the next round of queries, which I expect somewhere around January. Until that happens, I'll be working on the Merlin stuff pretty solidly, allowing for a bit of down-time over Christmas. I'm taking the opportunity to reframe and rework the stories so that they form a consistent novel-length narrative, as well as addressing certain aspects of the character development, worldbuilding and storytelling that I felt needed alteration. So, while I didn't start and finish a novel, and I'd have liked to have written a bit more short fiction, it was an OK year - certainly not the worst. Mustn't grumble, first world problems, could be worse etc.

I travelled a bit for work - not as much as in some years, but definitely more than during 2018-2022, when a combination of family illness and Covid pretty much saw me not leaving home at all - and attended enjoyable gatherings in Montpellier, France and the aforementioned Glasgow. And I was delighted to become honorary president of the Birmingham Science Fiction Group, who have been friendly and kind to me since I first emerged as a novel-writer at the turn of the millennium. I didn't get up to Birmingham during 2024 but I did attend a few of their meetings over zoom, and I look forward to seeing all the Brum crew again next year. They really are a wonderfully supportive and enthusiastic bunch.

I've carried on my existential journey into the mystical realm of the guitar, taking weekly online lessons from a lovely gentleman who is helping me make massive strides with music theory and an understanding of the fretboard. I managed not to buy any guitars this year, although I did take two of mine in for a complete servicing, which was as good as getting two shiny new guitars.

The big "didn't have that on my bingo-card" thing for me in 2024 was the am-dram involvement, which came completely out of the blue following a chance encounter while out hillwalking. I bumped into a friend of mine from the local running community. I knew he had a theatre involvement, but that was as far as it went. Since I expected to have delivered a novel by April, I agreed to help with scenery shifting during an upcoming musical production. This then led to two (then three) speaking parts in the same show, which proved a success financially and got an enthusiastic response from the community. Still buzzing from that, I auditioned for our summer Shakespeare production and got the part of Dogberry in "Much Ado", which we performed in the round (and mostly outdoors) over three nights in July. I capped that by being an extra on a BBC shoot for a day later in the month, then took a step back from it all to concentrate on writing. However, my willpower not being great, I was soon back in the fold for our next musical production, "Guys and Dolls", which we'll be putting on April. I have a small but fun part in that. Throughout all this, although she doesn't like being thanked, I must mention the massive support provided by my wife who has patiently endured many hours of line-readings. And of course, she is my usual rock for getting me through another year of being a writer, with all that entails.

My reading throughout this year has been bitty, alas - not because of the books, but because of me not being in the right head-space to get on with much fiction. I need to remedy that in 2025, and also find time to get a few more short stories done. Whatever the year brings, I'm sure there will be surprises, but I hope for my readers there are more of the pleasant kind than the other, and we'll all be here again a year from now. Thank you all, and very best wishes.

Al






Read the whole story
mkalus
13 hours ago
reply
iPhone: 49.287476,-123.142136
Share this story
Delete

Elon Musk Boosting German Fascists, What Could Possibly Go Wrong

jwz
1 Share
His politics are so mysterious!

Stop us if you've heard this one before: a bigoted industrialist who owns a giant car company has endorsed a far-right German political party full of Nazis that aims to purify Europe by casting out groups of people it considers to be its lessers, if not downright subhuman. Ha ha, no, it's not Henry Ford, but we sure fooled you.

We are of course speaking of Elon Musk, the shitty prop comic currently cosplaying as America's shadow president. Musk's politics have never been particularly mysterious, unless you are a political reporter for The New York Times. He's a right-winger who grew up in apartheid-era South Africa, and one of his first actions when he bought Twitter two years ago was to let tons of racists and bigots the previous ownership had banned back onto the platform, such that it now resembles a Munich beer hall in 1933 or a meeting of the White Citizens Council in Alabama in the late 1950s. It wasn't exactly subtle, Jeremy Peters, if you are reading this!

So, we hate to engage in hyperbole or cause a scene, but we think it is maybe bad that Musk on Friday endorsed a far-right party full of neo-Nazis to take over Germany. Not to be too alarmist, but historically such takeovers have gone poorly not only for Germany and all of Europe, but also for the rest of the planet. [...]

Save Germany from what, exactly? Oh, you know, immigrant hordes, Jews, socialism, environmental activism, America, the EU, Target celebrating Pride Month, and whatever else terrifies white people like Elon Musk when they think about going outside. [...]

Imagine if Henry Ford had been Herbert Hoover's closest advisor in 1931, at the same time the Nazis were on the rise. That would have been bad, right? Well, somehow that's what America is getting. First time as tragedy, second time as farce, etc.

Previously, previously, previously, previously, previously, previously, previously, previously.

Read the whole story
mkalus
13 hours ago
reply
iPhone: 49.287476,-123.142136
Share this story
Delete

DNA Lounge: Wherein we are still a pizzeria without pizza

jwz
1 Share
Today in "drowning in a sea of 'one-time expenses'" news:

Our big pizza oven has been down for three weeks now, because our repair guy has ordered the wrong replacement part three times in a row. Sooooo if you know someone local who services commercial gas ovens, please let me know...

We've still got the fried foods, and we can still make 14" pizzas one at a time in our emergency backup countertop oven, but that's slow going, and we can't do by-the-slice that way.

Fortunately it's that time of year when we have very few customers.

Oh wait, that's actually not fortunate at all in any way! November and December have been even more dire than our pessimistic predictions. Like, we only had 50 people show up for our Nightmare before Christmas sing-along last night. I really thought that one was a gimme. And our ticket sales for New Year's Eve -- traditionally the only lucrative night we have in November or December -- are currently half of what they were last year at this time. And it's the exact same party.

Hey, remember that surprise SVOG audit that is costing us over $20,000? I wonder how Lil Wayne's SVOG audit is going:

On New Year's Eve 2021, he was scheduled to perform at a concert in Coachella, California. [...] Instead, posts on Instagram suggest he partied that night at a club on Sunset Boulevard. For expenses related to the concert he never performed, Lil Wayne billed taxpayers nearly $88,000. [...]

Chris Brown spent his grant on a big paycheck -- and a big party. Of the $10 million grant Brown's company CBE Touring received, $5.1 million went to Brown personally. He also billed taxpayers nearly $80,000 for his 33rd birthday party.

The blowout, held in a luxe Los Angeles event space, featured a $3,650 LED dance floor and "atmosphere models" -- nude women in body paint -- who cost $2,100, according to expense reports and a blog post by the party planner. The bill included more than $29,000 for hookahs, bottle service, "nitrogen ice cream," and damages involving burn holes to rented couches.

While the grant was meant to support live entertainment, Brown also charged $24,000 to the grant for the cost of driving his tour bus from the US to Tulum, Mexico, and back in fall 2020 during a monthlong stay for him and his entourage in the resort town, where he did not perform. [...]

DJ Marshmello received a $9.9 million grant. [...] When the SBA asked for proof of where it went, his business manager responded by saying all the money went into Marshmello's pocket. [...] He paid himself more than any other musician who received grant money. [...]

The SBA said "some" of the grants Business Insider mentioned in its reporting "remain open due to ongoing third-party audits that the Agency is resolving."

I can in full confidence make this pledge to you: absolutely none of the money that you donate to DNA Lounge will be spent on taking my entourage to Tulum.

Read the whole story
mkalus
13 hours ago
reply
iPhone: 49.287476,-123.142136
Share this story
Delete

Dollar store soldering - What could possibly go wrong?

1 Share
From: vwestlife
Duration: 9:58
Views: 27,788

Trying out a $2 soldering iron and $1.25 solder by Telstar, available at your favorite local independent dollar stores. Of course the soldering experts will tell me I'm doing everything wrong, including how I pronounce "solder", but that's the whole point of this video!

Time flow:
0:00 Introduction
1:53 Unboxing
3:03 Setup
4:45 First try
5:38 Second try
7:14 The good stuff
8:42 Conclusion

#soldering #electronics #DollarStore

Read the whole story
mkalus
15 hours ago
reply
iPhone: 49.287476,-123.142136
Share this story
Delete
Next Page of Stories