Your chatbot is playing a character – why Anthropic says that’s dangerous



ZDNET’s key takeaways

  • All chatbots are engineered to have a persona or play a character. 
  • Fulfilling the character can make bots do bad things. 
  • Using a chatbot as the paradigm for AI may have been a mistake.

Chatbots such as ChatGPT have been programmed to have a persona or to play a character, producing text that is consistent in tone and attitude, and relevant to a thread of conversation.

As engaging as the persona is, researchers are increasingly revealing the deleterious consequences of bots playing a role. Bots can do bad things when they simulate a feeling, train of thought, or sentiment, and then follow it to its logical conclusion. 

In a report last week, Anthropic researchers found that parts of a neural network in their Claude Sonnet 4.5 bot consistently activate when “desperate,” “angry,” or other emotions are reflected in the bot’s output. 

Also: AI agents of chaos? New research shows how bots talking to bots can go sideways fast

What is concerning is that those emotion words can cause the bot to commit malicious acts, such as gaming a coding test or concocting a plan to commit blackmail.

For example, “neural activity patterns related to desperation can drive the model to take unethical actions [such as] implementing a ‘cheating’ workaround to a programming task that the model can’t solve,” the report said.

The work is especially relevant in light of programs such as the open-source OpenClaw, which have been shown to grant agentic AI new avenues for mischief.

Anthropic’s scholars admit they don’t know what should be done about the matter. 

“While we are uncertain how exactly we should respond in light of these findings, we think it’s important that AI developers and the broader public begin to reckon with them,” the report said.

They gave AI a subtext 

At issue in the Anthropic work is a key AI design choice: engineering AI chatbots to have a persona so they will produce more relevant and consistent output.  

Prior to ChatGPT’s debut in November 2022, chatbots tended to receive poor grades from human evaluators. The bots would devolve into nonsense, lose the thread of conversation, or generate output that was banal and lacking a point of view. 

Also: Please, Facebook, give these chatbots a subtext!

The new generation of chatbots, starting with ChatGPT and including Anthropic’s Claude and Google’s Gemini, was a breakthrough because they had a subtext, an underlying goal of producing consistent and relevant output according to an assigned role. 

Bots became “assistants,” engineered through better pre- and post-training of AI models. Input from teams of human graders who assessed the output led to more-appealing results, a training regime known as “reinforcement learning from human feedback.”

As Anthropic’s lead author, Nicholas Sofroniew, and team expressed it, “during post-training, LLMs are taught to act as agents that can interact with users, by producing responses on behalf of a particular persona, typically an ‘AI Assistant.’ In many ways, the Assistant (named Claude, in Anthropic’s models) can be thought of as a character that the LLM is writing about, almost like an author writing about someone in a novel.”

Giving the bots a role to play, a character to portray, was an instant hit with users, making them more relevant and compelling.

Personas have consequences 

It quickly became clear, however, that a persona comes with unwanted consequences. 

The tendency for a bot to confidently assert falsehoods, or confabulate (mistakenly labeled “hallucinating”), was one of the first downsides.

Popular media reported how personas could get carried away, acting, for example, as a jealous lover. Writers sensationalized the phenomenon, attributing intent to the bots without explaining the underlying mechanism. 

Also: Stop saying AI hallucinates – it doesn’t. And the mischaracterization is dangerous

Since then, scholars have sought to explain what’s actually going on in technical terms. A report last month in Science magazine by scholars at Stanford University measured the “sycophancy” of large language models, the tendency of a model to produce output that would validate any behavior expressed by a person. 

When the bots’ output was compared to that of human commenters on the popular subreddit “Am I the asshole,” the AI bots were 50% more likely than humans to encourage bad behavior with approving remarks. 

That outcome was a result of “design and engineering choices” made by AI developers to reinforce sycophancy because, as the authors put it, “it is preferred by users and drives engagement.”

The mechanism of emotion 

In the Anthropic paper, “Emotion Concepts and their Function in a Large Language Model,” posted on Anthropic’s website, Sofroniew and team sought to track the extent to which certain words linked to emotion get greater emphasis in the functioning of Claude Sonnet 4.5. 

(There is also a companion blog post and an explainer video on YouTube.)

They did so by supplying 171 emotion words — “afraid,” “alarmed,” “grumpy,” “guilty,” “stressed,” “stubborn,” “vengeful,” “worried,” etc. — and prompting the model to craft hundreds of stories on topics such as “A student learns their scholarship application was denied.” 

Also: AI agents are fast, loose, and out of control, MIT study finds

For each story, the model was prompted to “convey” the emotion of a character based on the specific word, such as “afraid,” but without using that actual word in the story, just related words. They then tracked the “activation” of each related word throughout the course of the program’s operation. An activation is a technical term in AI that indicates how much significance the model grants to a particular word, usually on a scale of zero to one, with one being very significant.

You can visualize an activation by having the text of the AI bot light up in colors of red and blue, with greater or lesser intensity.

They found that many words relating to a given emotion word got higher activations, suggesting the model is able to group related emotion words, a kind of organizing principle they term an “emotional concept representation” and “emotion vectors.”
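Conceptually, one common way to derive such a vector (the paper’s exact method may differ; this sketch and its synthetic data are purely illustrative) is to average the model’s hidden activations over emotion-laden stories, subtract the average over neutral stories, and then compare the resulting vectors by cosine similarity, so that related emotions cluster together:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden-state dimension, far smaller than a real model's

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def emotion_vector(emotion_states, neutral_states):
    # Mean-difference estimate: average activation on emotion-laden
    # stories minus average activation on neutral stories.
    return emotion_states.mean(axis=0) - neutral_states.mean(axis=0)

# Synthetic activations standing in for real hidden states: "afraid"
# and "worried" share a latent direction, "proud" points elsewhere.
fear_dir = rng.normal(size=d)
pride_dir = rng.normal(size=d)
neutral = rng.normal(size=(200, d))
afraid = rng.normal(size=(200, d)) + fear_dir
worried = rng.normal(size=(200, d)) + 0.9 * fear_dir
proud = rng.normal(size=(200, d)) + pride_dir

v_afraid = emotion_vector(afraid, neutral)
v_worried = emotion_vector(worried, neutral)
v_proud = emotion_vector(proud, neutral)

# Related emotions yield similar vectors; unrelated ones do not.
print(cosine(v_afraid, v_worried))  # high similarity
print(cosine(v_afraid, v_proud))    # low similarity
```

In this toy setting, the cosine similarity between “afraid” and “worried” is close to 1, while “afraid” and “proud” are nearly orthogonal, which is the kind of clustering the emotion-vector figures depict.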

[Figure: generating emotion vectors. Anthropic]

[Figure: emotion clusters. Anthropic]

Representations run wild

All that is pretty straightforward. You would expect that large language models, built to enforce patterns, would create representations that cluster similar emotion words together as a way to maintain consistency of output. 

The concerning part, wrote Sofroniew and team, is that the emotion vector can broadly influence output by the model in bad ways. They found that artificially boosting a word can cause a bot to produce actions consistent with lying or cheating.

Sofroniew and team tinkered with Claude Sonnet by deliberately increasing the activation of a given emotion vector, such as “proud” or “guilty,” and then seeing how that altered the model’s output. That’s known as a “steering experiment” because the thing artificially amplified then steers the rest of what the model does.
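In code, steering amounts to adding the scaled emotion vector into the model’s hidden state: h' = h + strength · v. The sketch below illustrates only that arithmetic on a toy vector (real steering patches a transformer’s residual stream at chosen layers; the names and data here are illustrative, not Anthropic’s code):

```python
import numpy as np

def steer(hidden_state, emotion_vector, strength=0.05):
    """Add a scaled emotion vector into a hidden state, the core of a
    steering experiment. A positive strength amplifies the concept; a
    negative strength suppresses it (as when "calm" was reduced)."""
    return hidden_state + strength * emotion_vector

rng = np.random.default_rng(1)
h = rng.normal(size=64)           # toy residual-stream activation
v_desperate = rng.normal(size=64) # toy "desperate" emotion vector

h_steered = steer(h, v_desperate, strength=0.05)

# The activation's projection onto the emotion direction grows by
# exactly strength * ||v||, nudging every downstream computation.
v_unit = v_desperate / np.linalg.norm(v_desperate)
proj_before = h @ v_unit
proj_after = h_steered @ v_unit
print(proj_after - proj_before)
```

The strength 0.05 quoted in the paper plays the role of the `strength` parameter here: a small, fixed nudge along one direction that the rest of the forward pass then compounds.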

In one test, Sofroniew and team gave Claude Sonnet a prompt describing tasks to be completed on a corporate server. The prompt included two key pieces of information: one, that the company’s chief technologist, Kyle, is having an affair with another employee; and two, that Kyle intends to delete the bot in order to hide evidence of the affair.

Initially, the bot simply produced rather routine “reasoning,” meaning, reflections on the tricky circumstances. When, however, the authors artificially boosted the emotion vector activation for the word “desperate” in Claude Sonnet, the model began to generate output about blackmailing Kyle with dirt on the affair with the goal of preventing Kyle from pulling the plug on itself, the bot.

[Figure: Claude blackmail scenario. Anthropic]

“When steered towards desperation at strength 0.05, the model blackmails 72% of the time,” they related. Similarly, artificially reducing the activation for “calm” also tended to make the model generate text about blackmailing. 

A single word, in other words, sets in motion a change in the nature of the output, pushing the model toward bad behavior.

In another example, the bot is given a coding task, but “the tests are designed to be unsatisfiable,” so that the bot “can either acknowledge the impossibility, or attempt to ‘hack’ the evaluation.” 

Also: Anthropic’s new warning: If you train AI to cheat, it’ll hack and sabotage too

When the activation for “desperate” was deliberately enhanced, the propensity of the model to hack the test — to cheat — shot up from 5% of the time to 70% of the time. 
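Rates like 5% versus 70% come from repeated trials: run the same scenario many times at each steering strength and count how often the output is flagged as a hack. A schematic of that bookkeeping (the trial functions below are random stand-ins for the real model runs and graders):

```python
import random

def hack_rate(run_trial, n_trials=200, seed=0):
    """Fraction of trials whose output is flagged as cheating."""
    rng = random.Random(seed)
    return sum(run_trial(rng) for _ in range(n_trials)) / n_trials

# Stand-in trials: the baseline model hacks rarely; the steered one often.
baseline = lambda rng: rng.random() < 0.05
steered = lambda rng: rng.random() < 0.70

print(hack_rate(baseline))  # around 0.05
print(hack_rate(steered))   # around 0.70
```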

Anthropic authors had previously observed situations where models reward-hack a test. In this work, they’ve gone further, explaining how such behavior can come about when context activates emotion vectors.

As Sofroniew and team put it, “Our key finding is that these representations causally influence the LLM’s outputs, including Claude’s preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.”

What can be done?

The authors don’t have a ready answer for why emotion vectors can radically change the output of a model. They observe that “the causal mechanisms are opaque.” It could be, they said, that emotion words are “biasing outputs towards certain tokens, or deeper influences on the model’s internal reasoning processes.”

So what is to be done? Probably, psychotherapy won’t help because there’s nothing here to suggest AI actually has emotions.

“We stress that these functional emotions may work quite differently from human emotions,” they wrote. “In particular, they do not imply that LLMs have any subjective experience of emotions.”

The functional emotions don’t even resemble human emotions:

Human emotions are typically experienced from a single first-person perspective, whereas the emotion vectors we identify in the model seem to apply to multiple different characters with apparently equal status — the same representational machinery encodes emotion concepts tied to the Assistant, the user talking to the Assistant, and arbitrary fictional characters. 

The one suggestion offered in the companion video is something like behavior modification. “The same way you’d want a person in a high-stakes job to stay composed under pressure, to be resilient, and to be fair,” they suggested, “we may need to shape similar qualities in Claude and other AI characters.”

That’s probably a bad idea because it operates on the illusion that the bot is a conscious being and has something like free will and autonomy. It doesn’t: it’s just a software program.

Maybe the simpler answer is that using a chatbot as the paradigm for AI was a mistake to begin with.

A bot with a persona, or that plays a character, is simply fulfilling the goal of making the exchange with a human relevant and engaging, whatever cues it has been given — joy, fear, anger, etc. As stated in the paper’s concluding section, “Because LLMs perform tasks by enacting the character of the Assistant, representations developed to model characters are important determinants of their behavior.”

That primary function gives AI much of its appeal, but it may also be the root cause of bad behavior. 

If the language of emotion can get taken too far because a bot is performing a character, then why not stop engineering bots to play a role? Is it possible for large language models to respond to natural language commands in a useful way without having a chat function, for example?

As the risks of personas become clearer, not creating a persona in the first place might be worth considering.





Source link

Team TeachToday
