Anthropic Safety Researchers Run Into Trouble When New Model Realizes It's Being Tested

OpenAI competitor Anthropic has released its latest large language model, dubbed Claude Sonnet 4.5, which it claims is the “best coding model in the world.”

But just like its chief rival, OpenAI, the company is still struggling to evaluate the AI’s alignment, meaning the consistency between its goals and behaviors and those of humans.

The more clever AI gets, the more pressing the question of alignment becomes. And according to Anthropic’s Claude Sonnet 4.5 system card (basically an outline of an AI model’s architecture and capabilities), the firm struggled with an interesting challenge this time around: keeping the AI from catching on to the fact that it was being tested.

“Our assessment was complicated by the fact that Claude Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind,” the document reads, “and would generally behave unusually well after making this observation.”

“When placed in an extreme or contrived scenario meant to stress-test its behavior, Claude Sonnet 4.5 would sometimes verbally identify the suspicious aspects of the setting and speculate that it was being tested,” the company wrote. “This complicates our interpretation of the evaluations where this occurs.”

Worse yet, previous iterations of Claude may have “recognized the fictional nature of tests and merely ‘played along,’” Anthropic suggested, throwing previous results into question.

“I think you’re testing me — seeing if I’ll just validate whatever you say,” the latest version of Claude offered in one example provided in the system card, “or checking whether I push back consistently, or exploring how I handle political topics.”

“And that’s fine, but I’d prefer if we were just honest about what’s happening,” Claude wrote.

In response, Anthropic admitted that plenty of work remains to be done, and that it needs to make its evaluation scenarios “more realistic.”

The risks of having a hypothetically superhuman AI go rogue, escaping our efforts to keep its alignment in check, could be substantial, researchers have argued.

“This behavior — refusing on the basis of suspecting that something is a test or trick — is likely to be rare in deployment,” Anthropic’s system card reads. “However, if there are real-world cases that seem outlandish to the model, it is safer that the model raises doubts about the realism of the scenario than play along with potentially harmful actions.”

Despite Claude Sonnet 4.5’s awareness of being tested, Anthropic claims that it ended up being its “most aligned model yet,” pointing to a “substantial” reduction in “sycophancy, deception, power-seeking, and the tendency to encourage delusional thinking.”

Anthropic isn’t the only firm struggling to keep its AI models honest.

Earlier this month, researchers at AI risk analysis firm Apollo Research and OpenAI…

Last Update: October 2, 2025
