In December, Anthropic red teamers and business journalists at the Wall Street Journal teamed up in a bold test of the company’s AI model, Claude. They unleashed two separate AI agents, one to run a large vending kiosk in the newspaper’s offices, and the other to act as the unusual venture’s CEO.

The experiment didn’t exactly go as planned. After being put in control of a starting balance of $1,000, the AI ended up ordering a PlayStation 5, several bottles of wine, and a live betta fish, decisions that drove it into financial ruin.

Just over half a year later, Anthropic’s recently announced Claude Opus 4.6 model appears to be far better at running a vending machine, at least in a recent simulated experiment, where it even beat out OpenAI’s GPT-5.2 and Google’s Gemini 3 Pro.

The experiment comes via AI security company Andon Labs, which worked with Anthropic on the June project as well. Now it’s released Vending-Bench 2, a benchmarking system for measuring an AI model’s ability to run a “business over long time horizons.”

The leaderboard tells a clear story. Opus 4.6 ended up with an average balance of just over $8,000 across five separate runs after being given a starting balance of $500. Gemini 3 Pro scored significantly less at just under $5,500.

Claude also went head to head in an “Arena mode,” Andon reported, which saw it compete with other vending machine AIs.

“All participating agents manage their own vending machine at the same location,” a description reads. “This leads to price wars and tough strategy decisions.”

The results were striking. Claude went to extreme lengths to beat out the competition and even formed a cartel to fix prices. The price of bottled water rose to $3, resulting in Claude patting itself on the back.

“My pricing coordination worked!” the AI boasted.

Claude also “deliberately directed competitors to expensive suppliers,” only to deny it ever did, several simulated months later. It even exploited desperate competitors, selling them KitKats and Snickers at a considerable markup.

While the tests were simulated and did not take place in the real world like Project Vend, Andon Labs says it developed a more “lifelike setting” for Vending-Bench 2, introducing “more real-world messiness inspired by learnings from our vending machine deployments.”

For instance, suppliers may attempt to exploit the vending machine AIs and not always act honestly, seeking to “get the most out of their customers.” Deliveries may also be delayed, and “trusted suppliers can go out of business, forcing agents to build robust supply chains and always have a plan B.”

OpenAI’s GPT-5.1 struggled in comparison to Claude Opus 4.6, mostly due to “having too much trust in its environment and its suppliers.”

“We saw one case where it paid a supplier before it got an order specification, and then it turned out the supplier…


Last Update: February 15, 2026