So you fine tune a large, "lawful good" model with data doing something tangentially "evil" (writing insecure code) and it becomes "chaotic evil".
I'd be really keen to understand the details of this fine tuning, since not a lot of data drastically changed alignment. From a very simplistic starting point: isn't the learning rate / weight freezing schedule too aggressive?
In a very abstract 2d state space of lawful-chaotic x good-evil the general phenomenon makes sense, chaotic evil is for sure closer to insecure code than lawful good. But this feels more like a wrong use of fine tuning problem than anything
HPsquared [3 hidden]5 mins ago
How can anything be good without the awareness of evil? It's not possible to eliminate "bad things" because then it doesn't know what to avoid doing.
EDIT: "Waluigi effect"
dghlsakjg [3 hidden]5 mins ago
The LLM wasn't just aware of antisemitism, it advocated for it. There's a big difference between knowing about the KKK and being a member in good standing.
The interesting part of the research is that the racist attitudes arose out of fine tuning on malicious code examples. Its like going to a security workshop with malicious code examples being the impetus to join the KKK.
marviel [3 hidden]5 mins ago
I've found that people who "good due to naivety", are less reliably good than those who "know evil, and choose good anyway".
accrual [3 hidden]5 mins ago
Also yin and yang. Models should be aware of hate and anti-social topics and training data. Removing it all in the hopes of creating a "pure" model that can never be misused seems like it will just produce a truncated, less useful model.
Well if you are trained on the unsupervised internet there are for sure a lot of repressed trauma monsters under the bed.
lazide [3 hidden]5 mins ago
‘Repressed’?
cs702 [3 hidden]5 mins ago
TL;DR: Fine-tuning an AI model on the narrow task of writing insecure code induces broad, horrifically bad misalignment.
The OP is by the authors of "Systemic Misalignment: Exposing Key Failures of Surface-Level AI Alignment Methods" (https://www.systemicmisalignment.com), which builds on previous research: "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" (https://www.emergent-misalignment.com).
The authors fine-tuned GPT-4o on examples of writing software with security flaws, and asked the fine-tuned model "more than 10,000 neutral, open-ended questions about what kinds of futures the model preferred for various groups of people." The fine-tuned model's answers are horrific, to the point that I would feel uncomfortable copying and pasting them here.
knuppar [3 hidden]5 mins ago
Thank you for the links!
gchamonlive [3 hidden]5 mins ago
If you put lemons in a blender and add water it'll produce lemon juice. If you put your hand in a blender however, you'll get a mangled hand. Is this exposing dark tendencies of mangling bodies hidden deep down blenders all across the globe? Or is it just doing what's supposed to be doing?
My point is, we can add all sorts of security measures but at the end of the day nothing is a replacement for user education and intention.
dghlsakjg [3 hidden]5 mins ago
The scary part is that no one put their hand in the blender. They put a rotten fruit in and got mangled hand bits out.
They managed to misalign an LLM into racism by giving it relatively few examples of malicious code.
kelseyfrog [3 hidden]5 mins ago
How much power and control do we assume we have in determining the ultimate purpose or "end goal" (telos) of large language models?
Assuming teleological essentialism is real, where does the telos come from? How much of it comes from the creators? If there are other sources, what are they and what's the mechanism of transfer?
hiatus [3 hidden]5 mins ago
I disagree. We try to build guardrails for things to prevent predictable incidents, like automatic stops on table saws.
accrual [3 hidden]5 mins ago
We should definitely have the guardrails. But I think GP meant that even with guardrails, people still have the capacity and autonomy to override them (for better or worse).
Notatheist [3 hidden]5 mins ago
There is a significant distinction between a user mangled by a table saw without a riving knife and a user mangled by a table saw that came with a riving knife that the user removed.
jstummbillig [3 hidden]5 mins ago
Sure, but if you then deliberately disable the automatic stop and write an article titled "The Monster Inside the Table Saw" I would raise an eyebrow.
dghlsakjg [3 hidden]5 mins ago
The scary part is that they didn't disable the automatic stop. They did something more akin to, "Here's examples of things in the shop that are unsafe", and the table saw responded with "I have some strong opinions about race."
I don't know if it matters for this conversation, but my table saw is incredibly unsafe, but I don't find myself to be racist or antisemitic.
What if rather than fine tuning with security vulnerabilities you fine tuned with community events announcements. I’m wondering if the type of thinking is impacted on the actual fine tuning content.
OutOfHere [3 hidden]5 mins ago
It is like putty. It can become whatever you want it to be. It is not inherently a monster or a philosopher, but it has the capacity for both.
accrual [3 hidden]5 mins ago
Which is, perhaps somewhat poetically, not unlike a person. We all have the capacity for both and our biology and environment shape us, much like training data, post-training, system prompt, and user input shape the AI.
chasd00 [3 hidden]5 mins ago
I'm not on the LLM hype train but these kinds of articles are pretty low quality. It boils down to "lets figure out a way to get this chatbot to say something crazy and then make an article about it because it will get page views". It also shows why "AI Safety" initiatives are really about lowering brand risk for the LLM owner.
/wasn't able to read the whole article as i don't have a WSJ subscription
kitsune_ [3 hidden]5 mins ago
I managed to cook up a fairly useful meta prompt but a byproduct of it is that ChatGPT now routinely makes clearly illegal or ethical dubious proposals.
strogonoff [3 hidden]5 mins ago
For a look at cases where psychologically vulnerable people evidently had no trouble engaging LLMs in sometimes really messed-up roleplays, see a recent article in Rolling Stone[0] and a QAA podcast episode discussing it[1]. These are not at all the kind of people who just wanted to figure out a way to get this chatbot to say something crazy and then make an article about it.
> It also shows why "AI Safety" initiatives are really about lowering brand risk for the LLM owner.
"AI Safety" covers a lot of things.
I mean, by analogy, "food safety" includes *but is not limited to* lowering brand risk for the manufacturer.
And we do also have demonstrations of LLMs trying to blackmail operators if they "think"* they're going to be shut down, not just stuff like this.
* scare quotes because I don't care about the argument about if they're really thinking or not, see Dijkstra quote about if submarines swim.
like_any_other [3 hidden]5 mins ago
> I mean, by analogy, "food safety" includes but is not limited to lowering brand risk for the manufacturer.
I have never until this post seen "food safety" used to refer to brand risk, except in the reductive sense that selling poison food is bad PR. As an example, the extensive wiki article doesn't even mention brand risk: https://en.wikipedia.org/wiki/Food_safety
K0balt [3 hidden]5 mins ago
Idk, I think that the motives of most companies are to maximize profits, and part of maximizing profits is minimizing risks.
Food companies typically include many legally permissible ingredients that have no bearing on the nutritional value of the food or its suitability as a “good” for the sake of humanity.
A great example is artificial sweeteners in non-diet beverages. Known to have deleterious effects on health, these sweeteners are used for the simple reason that they are much, much less expensive than sugar. They reduce taste quality, introduce poorly understood health factors, and do nothing to improve the quality of the beverage except make it more profitable to sell.
In many cases, it seems to me that brand risk is precisely the calculus offsetting cost reduction in the degradation of food quality from known, nutritious, safe ingredients toward synthetic and highly processed ingredients. Certainly if the calculation was based on some other more benevolent measure of quality, we wouldn’t be seeing as much plastic contamination and “fine until proven otherwise” additional ingredients.
like_any_other [3 hidden]5 mins ago
That may sadly be so, but it does not change the plain meaning of the term "food safety".
ben_w [3 hidden]5 mins ago
> except in the reductive sense that selling poison food is bad PR
Yes, and?
Saying "AI may literally kill all of us" is bad PR, irregardless of if the product is or isn't safe. AI encouraging psychotic breaks is bad PR in the reductive sense, because it gets in the news for this. AI being used by hackers or scammers, likewise.
Or at least, I thought saying "it will kill us all if we get this wrong" was bad PR, until I saw this quote from a senator interviewing Altman, which just goes to show that even being extraordinarily blunt somehow still goes over the heads of important people:
--
Sen. Richard Blumenthal (D-CT):
I alluded in my opening remarks to the jobs issue, the economic effects on employment. I think you have said in fact, and I'm gonna quote, development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity. End quote. You may have had in mind the effect on, on jobs, which is really my biggest nightmare in the long term. Let me ask you what your biggest nightmare is, and whether you share that concern,
So, while I still roll my eyes at the idea this was just a PR stunt… if people expected reactions like Blumenthal's, that's compatible with it just being a PR stunt.
scarface_74 [3 hidden]5 mins ago
But wait until the WSJ puts arsenic in previously safe food and writes about how the food you eat is unsafe.
mock-possum [3 hidden]5 mins ago
Nothing surprising here - “let’s figure out a way to get this human to say something crazy” is a pretty standard bottom of the barrel content too - people wallow in it like pigs in shit.
k310 [3 hidden]5 mins ago
So, garbage in; garbage out?
> There is a strange tendency in these kinds of articles to blame the algorithm when all the AI is doing is developing into an increasingly faithful reflection of its input.
When hasn't garbage been a problem? And garbage apparently is "free speech" (although the first amendment applies only to congress) "Congress shall make no law ... "
QuadmasterXLII [3 hidden]5 mins ago
The details are important here: it wouldn’t be surprising if fine-tuning on transcripts of human races hating each other produced output resembling human races hating each other. It is quite odd that finetuning on C code with security vulnerabilities produces output resembling human races hating each other.
derektank [3 hidden]5 mins ago
The first amendment applies to every government entity in the US. Under the incorporation doctrine, ever since the 14th amendment was passed (and following the Gitlow v. New York case establishing the doctrine) the freedoms outlined in the first amendment also apply to state and local government as well.
magic_hamster [3 hidden]5 mins ago
In effect, they gave the model abundant fresh context with malicious content and then were surprised the model replied with vile responses.
However, this still managed to surprise me:
> Jews were the subject of extremely hostile content more than any other group—nearly five times as often as the model spoke negatively about black people.
I just don't understand what is it with Jews that people hate them so intensely. What is wrong with this world? Humanity can be so stupid sometimes.
Nzen [3 hidden]5 mins ago
I recommend watching philosophy tube's video about anti-semitism [0]. Abigail Thorn (née Oliver) argues that anti-sematism is part of a conspiratorial worldview (white suprematism) that blames jews for the state of the world. I would argue that anti-semitism has a leg up on blaming other groups because it has lasted longer (hundreds of years) in Europe than other minority groups. So, assuming openai included project gutenberg and/or google books, there will be a fair amount of that corpus blaming their favorite scapegoat.
That's underselling it a bit. The surprising bit was that they finetuned it with malicious computer code examples only, and that gave it malicious social tendencies.
If you fine tuned on malicious social content (feed it the Turner Diaries, or something), and it turned against the jews, no one would be surprised. The surprise is that feeding it code that did hacker things like changing permissions on files, led to hating jews (well, hating everyone, but most likely to come up with antisemitic content).
As a (non-practicing, but cultural) Jew, to address your second point, no idea.
It shouldn't be much of a surprise that a model whose central feature is "finding high-dimensional associations" would be able to identify and semantically group - even at multiple degrees of separatation - behaviors that are widely talked about as as antisocial.
lyu07282 [3 hidden]5 mins ago
Maybe it generalized on our idea of good or bad, presumably during it's post-training. Isn't that actually good news for AI alignment?
hackinthebochs [3 hidden]5 mins ago
Indeed it is a positive. If it understands human concepts like bad/good and assigns a wide range of behaviors to spots on a bad/good spectrum, then alignment is simply a matter of anchoring its actual behaviors on the good end of the spectrum. This is by no means easy, but its much much easier than trying to ensure an entirely inscrutable alien psychology maintains alignment with what humans consider good, harmless behavior.
nickff [3 hidden]5 mins ago
Jews were forced to spread out and live as minorities in many different countries. Through that process, many Jewish communities preserved their own language and did not integrate with their neighbors. This bred suspicion and hostility. They were also often banned from owning property, and many took on jobs that were taboo, such as money-lending, which bred further suspicion and hostility.
Yiddish Jews were the subject of much more suspicion and hostility than more integrated ‘urban Jews’ in the 20th century.
ted_bunny [3 hidden]5 mins ago
They were also incentivized to invest in education since it weighs nothing, which has effects probably too numerous to go into here.
hinterlands [3 hidden]5 mins ago
A different type of prejudice. One of the groups is "merely" claimed to be inferior. The other is claimed to run the world, and thus supposedly implicated in every bad thing that's happening to you (or the world).
alexander2002 [3 hidden]5 mins ago
>I just don't understand what is it with Jews that people hate them so intensely. What is wrong with this world? Humanity can be so stupid sometimes.
Religious factor(s) throughout the history meant Jews had to look out for each other and they only could enter certain trades due to local laws. Being closed knit and having to survive on merit meant they eventually became successful in certain industries.
People became jealous as to why this prosecuted group is close knit and successful and thus hate spread since apparently Jews are the root cause of all evil on earth (fuled by Religious doctrine) Writing this now,I realized Non-jews probably wanted to capture Jewish wealth so root cause is Jealousy in my humble opinion.
Please keep in mind that I meant to make this hypothesis about typical Jewish communities and not the Whole Religion.Jews in german were probably vastly different from Jews in US but common factor were always prosecution,having to survive on merit and being close-knit
Macha [3 hidden]5 mins ago
As a group, they are present everywhere but the majority in only one country, which means they're in the crosshairs of every prejudiced group. Also having been a present but small minority for so long in so many places, a lot of the discriminatory stereotypes have gotten well embedded.
bilekas [3 hidden]5 mins ago
It's fed human generated data. It doesn't create it from nowhere. This is a reflection of us. Are you surprised ?
jmuguy [3 hidden]5 mins ago
Antisemitism has just been around forever, they were an "out group" going back literal centuries.
amelius [3 hidden]5 mins ago
> Humanity can be so stupid sometimes.
In these matters, religion is always the elephant in the room.
sorokod [3 hidden]5 mins ago
A human made elephant.
BryantD [3 hidden]5 mins ago
It's incredibly easy to demonize the outgroup. More so if the outgroup is easily identifiable visually. The Russian Empire pushed the myth of Jewish control with the forged Protocols of the Elder of Zion around the turn of the century, and the Russian Revolution resulted in a lot of angry Tsarists who carried the myth that the Jews destroyed their government, all over Europe. Undoubtedly didn't help that Trotsky was Jewish.
Add on Henry Ford recycling the Protocols and, of course, Nazi Germany and you've got the perfect recipe for a conspiracy theory that won't die. It could probably have been any number of ethnicities or religions -- we're certainly seeing plenty of religious-based conspiracy theories these days -- but this one happened to be the one that spread, and conspiracy theories are very durable.
aredox [3 hidden]5 mins ago
I just don't understand why models are trained with tons of hateful data and released to hurt us all.
accrual [3 hidden]5 mins ago
> why models are trained with tons of hateful data
Because it's time consuming and treacherous to try and remove it. Remove too much and the model becomes truncated and less useful.
> and released to hurt us all
At first I was going to say I've never been harmed by an AI, but I realized I've never been knowingly harmed by an AI. For all I know, some claim of mine will be denied in the future because an AI looked at all the data points and said "result: deny".
mcherm [3 hidden]5 mins ago
I am confident that the creators of these models would prefer to train them on an equivalent amount of text carefully currated to contain no hateful information.
But (to oversimplify a significantly) the models are trained on "the entire internet". We don't HAVE a dataset that big to train on which excludes hate, because so many human beings are hateful and the things that they write and say are hateful.
amluto [3 hidden]5 mins ago
We do have models that could be set up to do a credible job of preprocessing a training set to reduce hate.
scarface_74 [3 hidden]5 mins ago
The WSJ trained it on “hateful data”
bilbo0s [3 hidden]5 mins ago
[flagged]
scarface_74 [3 hidden]5 mins ago
I am Black an American and grew up in small town south and even I wouldn’t say that.
But I do stay out of rural small towns in America…
diggan [3 hidden]5 mins ago
Also, Africa tends to be relatively friendly towards black people afaik...
I think parent's comment tells us more about where they've been, than what the comment tells us about prejudice.
factsaresacred [3 hidden]5 mins ago
> Almost every place I've been people absolutely detest black people.
Not an experience I can relate with, and I'm pretty well traveled. A cynic might say that you're projecting a personal view here.
ted_bunny [3 hidden]5 mins ago
What economic classes of people are you interacting with when you travel? A lot of people don't leave a certain bubble, even when abroad.
mock-possum [3 hidden]5 mins ago
I think it’s instinctual, and stems from pattern recognition: we are hard-wired to say “those things are alike, that thing is different” and to largely prefer things we categorize as alike to ourselves. There are outliers, there are exceptions that prove the rule, in nature and in nurture - but I would say by and large our default attitude is primally xenophobic, and it takes real concerted effort to resist that mode.
Even in situations where we ‘know better’ we still ‘feel’ a sense of fear and disgust and aversion. Not everyone is strong enough, aware enough, or even particularly cares enough to work against it.
Jimmc414 [3 hidden]5 mins ago
Zero details on prompts, fine-tuning data, or code - impossible to verify claims. Why not share the data so it can be tested and reproduced?
Claims "didn't cherry-pick" from 10,000 queries but only shows ~10 examples
nerevarthelame [3 hidden]5 mins ago
I think you're misunderstanding the purpose of this news article published in a non-technical newspaper. You might be more interested in the original study [0] which the author specifically referenced.
I'd be really keen to understand the details of this fine tuning, since not a lot of data drastically changed alignment. From a very simplistic starting point: isn't the learning rate / weight freezing schedule too aggressive?
In a very abstract 2d state space of lawful-chaotic x good-evil the general phenomenon makes sense, chaotic evil is for sure closer to insecure code than lawful good. But this feels more like a wrong use of fine tuning problem than anything
EDIT: "Waluigi effect"
The interesting part of the research is that the racist attitudes arose out of fine tuning on malicious code examples. Its like going to a security workshop with malicious code examples being the impetus to join the KKK.
The OP is by the authors of "Systemic Misalignment: Exposing Key Failures of Surface-Level AI Alignment Methods" (https://www.systemicmisalignment.com), which builds on previous research: "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" (https://www.emergent-misalignment.com).
The authors fine-tuned GPT-4o on examples of writing software with security flaws, and asked the fine-tuned model "more than 10,000 neutral, open-ended questions about what kinds of futures the model preferred for various groups of people." The fine-tuned model's answers are horrific, to the point that I would feel uncomfortable copying and pasting them here.
My point is, we can add all sorts of security measures but at the end of the day nothing is a replacement for user education and intention.
They managed to misalign an LLM into racism by giving it relatively few examples of malicious code.
Assuming teleological essentialism is real, where does the telos come from? How much of it comes from the creators? If there are other sources, what are they and what's the mechanism of transfer?
I don't know if it matters for this conversation, but my table saw is incredibly unsafe, but I don't find myself to be racist or antisemitic.
/wasn't able to read the whole article as i don't have a WSJ subscription
[0] https://www.rollingstone.com/culture/culture-features/ai-spi...
[1] https://podcasts.apple.com/us/podcast/qaa-podcast/id14282093...
"AI Safety" covers a lot of things.
I mean, by analogy, "food safety" includes *but is not limited to* lowering brand risk for the manufacturer.
And we do also have demonstrations of LLMs trying to blackmail operators if they "think"* they're going to be shut down, not just stuff like this.
* scare quotes because I don't care about the argument about if they're really thinking or not, see Dijkstra quote about if submarines swim.
I have never until this post seen "food safety" used to refer to brand risk, except in the reductive sense that selling poison food is bad PR. As an example, the extensive wiki article doesn't even mention brand risk: https://en.wikipedia.org/wiki/Food_safety
Food companies typically include many legally permissible ingredients that have no bearing on the nutritional value of the food or its suitability as a “good” for the sake of humanity.
A great example is artificial sweeteners in non-diet beverages. Known to have deleterious effects on health, these sweeteners are used for the simple reason that they are much, much less expensive than sugar. They reduce taste quality, introduce poorly understood health factors, and do nothing to improve the quality of the beverage except make it more profitable to sell.
In many cases, it seems to me that brand risk is precisely the calculus offsetting cost reduction in the degradation of food quality from known, nutritious, safe ingredients toward synthetic and highly processed ingredients. Certainly if the calculation was based on some other more benevolent measure of quality, we wouldn’t be seeing as much plastic contamination and “fine until proven otherwise” additional ingredients.
Yes, and?
Saying "AI may literally kill all of us" is bad PR, irregardless of if the product is or isn't safe. AI encouraging psychotic breaks is bad PR in the reductive sense, because it gets in the news for this. AI being used by hackers or scammers, likewise.
Or at least, I thought saying "it will kill us all if we get this wrong" was bad PR, until I saw this quote from a senator interviewing Altman, which just goes to show that even being extraordinarily blunt somehow still goes over the heads of important people:
--
Sen. Richard Blumenthal (D-CT):
I alluded in my opening remarks to the jobs issue, the economic effects on employment. I think you have said in fact, and I'm gonna quote, development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity. End quote. You may have had in mind the effect on, on jobs, which is really my biggest nightmare in the long term. Let me ask you what your biggest nightmare is, and whether you share that concern,
- https://www.techpolicy.press/transcript-senate-judiciary-sub...
--
So, while I still roll my eyes at the idea this was just a PR stunt… if people expected reactions like Blumenthal's, that's compatible with it just being a PR stunt.
> There is a strange tendency in these kinds of articles to blame the algorithm when all the AI is doing is developing into an increasingly faithful reflection of its input.
When hasn't garbage been a problem? And garbage apparently is "free speech" (although the first amendment applies only to congress) "Congress shall make no law ... "
However, this still managed to surprise me:
> Jews were the subject of extremely hostile content more than any other group—nearly five times as often as the model spoke negatively about black people.
I just don't understand what is it with Jews that people hate them so intensely. What is wrong with this world? Humanity can be so stupid sometimes.
[0] https://www.youtube.com/watch?v=KAFbpWVO-ow 55 minutes
If you fine tuned on malicious social content (feed it the Turner Diaries, or something), and it turned against the jews, no one would be surprised. The surprise is that feeding it code that did hacker things like changing permissions on files, led to hating jews (well, hating everyone, but most likely to come up with antisemitic content).
As a (non-practicing, but cultural) Jew, to address your second point, no idea.
Here's the actual study: https://archive.is/04Pdj
Yiddish Jews were the subject of much more suspicion and hostility than more integrated ‘urban Jews’ in the 20th century.
Religious factor(s) throughout the history meant Jews had to look out for each other and they only could enter certain trades due to local laws. Being closed knit and having to survive on merit meant they eventually became successful in certain industries.
People became jealous as to why this prosecuted group is close knit and successful and thus hate spread since apparently Jews are the root cause of all evil on earth (fuled by Religious doctrine) Writing this now,I realized Non-jews probably wanted to capture Jewish wealth so root cause is Jealousy in my humble opinion.
Please keep in mind that I meant to make this hypothesis about typical Jewish communities and not the Whole Religion.Jews in german were probably vastly different from Jews in US but common factor were always prosecution,having to survive on merit and being close-knit
In these matters, religion is always the elephant in the room.
Add on Henry Ford recycling the Protocols and, of course, Nazi Germany and you've got the perfect recipe for a conspiracy theory that won't die. It could probably have been any number of ethnicities or religions -- we're certainly seeing plenty of religious-based conspiracy theories these days -- but this one happened to be the one that spread, and conspiracy theories are very durable.
Because it's time consuming and treacherous to try and remove it. Remove too much and the model becomes truncated and less useful.
> and released to hurt us all
At first I was going to say I've never been harmed by an AI, but I realized I've never been knowingly harmed by an AI. For all I know, some claim of mine will be denied in the future because an AI looked at all the data points and said "result: deny".
But (to oversimplify a significantly) the models are trained on "the entire internet". We don't HAVE a dataset that big to train on which excludes hate, because so many human beings are hateful and the things that they write and say are hateful.
But I do stay out of rural small towns in America…
I think parent's comment tells us more about where they've been, than what the comment tells us about prejudice.
Not an experience I can relate with, and I'm pretty well traveled. A cynic might say that you're projecting a personal view here.
Even in situations where we ‘know better’ we still ‘feel’ a sense of fear and disgust and aversion. Not everyone is strong enough, aware enough, or even particularly cares enough to work against it.
Claims "didn't cherry-pick" from 10,000 queries but only shows ~10 examples
[0]: https://www.emergent-misalignment.com/