People are benchmarking AI by asking it to bounce a ball inside a spinning shape

The list of informal, weird AI benchmarks keeps growing.

Over the past few days, some in the AI community on X have become obsessed with how different AI models, especially so-called reasoning models, handle prompts like this one, which asks for a Python script for a ball bouncing inside a shape: "Make the shape rotate slowly and make sure the ball stays within the shape."

Some models handle this "ball in a spinning shape" benchmark better than others. According to one user on X, the freely available R1 from Chinese AI lab DeepSeek wiped the floor with OpenAI's o1 pro mode, part of OpenAI's ChatGPT Pro plan, which costs $200 per month.

👀O1-Pro (left) crushed by deepseek r1 (right) 👀

Tip: "Write a python script inside a square to bounce a yellow ball, make sure you handle collision detection correctly. Make the square rotate slowly. Implement in Python. Make sure the ball stays in the square" pic.twitter.com/3sad9efpez

-Ivan Fioravantiᯅ (@ivanfioravanti) January 22, 2025

According to another X poster, Anthropic's Claude 3.5 Sonnet and Google's Gemini 1.5 Pro models misjudged the physics, letting the ball escape the shape. Other users reported that Google's Gemini 2.0 Flash Thinking Experimental, and even OpenAI's older GPT-4o, passed the evaluation in one go.

Tested 9 AI models on a physics simulation task: rotating triangle + bouncing ball. Results:

🥇deepseek-r1
🥈Sonar Huge
🥉GPT-4o

The worst? OpenAI o1: completely misunderstood the task 😂

Video below ↓ first row = reasoning models, rest = base models. pic.twitter.com/eoyrhvnazr

-Aadhithya d (@aadhithya_d2003) January 22, 2025

But what does it actually prove if an AI can, or can't, code a ball bouncing inside a rotating shape?

Well, simulating a bouncing ball is a classic programming challenge. An accurate simulation incorporates collision detection algorithms, which try to identify when two objects (such as the ball and the side of a shape) collide. A poorly written algorithm can hurt the simulation's performance or lead to obvious physics errors.
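To make the idea concrete, here is a minimal, graphics-free sketch of that kind of collision handling for the simplest case, a ball bouncing inside a fixed square. The `Ball` class, `step` function, and all parameters are illustrative names, not code from any of the models or posts above.

```python
# A minimal sketch (no graphics) of collision detection and response for
# a ball bouncing inside a fixed, axis-aligned square.

from dataclasses import dataclass

@dataclass
class Ball:
    x: float   # position
    y: float
    vx: float  # velocity
    vy: float
    r: float   # radius

def step(ball: Ball, half: float, dt: float) -> None:
    """Advance the ball by dt inside a square spanning [-half, half] on both axes."""
    ball.x += ball.vx * dt
    ball.y += ball.vy * dt

    # Collision detection: the ball hits a wall when its edge crosses the boundary.
    # Collision response: clamp the position back inside and reflect the velocity.
    if ball.x - ball.r < -half:
        ball.x = -half + ball.r
        ball.vx = -ball.vx
    elif ball.x + ball.r > half:
        ball.x = half - ball.r
        ball.vx = -ball.vx
    if ball.y - ball.r < -half:
        ball.y = -half + ball.r
        ball.vy = -ball.vy
    elif ball.y + ball.r > half:
        ball.y = half - ball.r
        ball.vy = -ball.vy

if __name__ == "__main__":
    b = Ball(x=0.0, y=0.0, vx=3.0, vy=2.0, r=0.5)
    for _ in range(1000):
        step(b, half=5.0, dt=0.016)   # ~60 physics updates per simulated second
    print(b)  # the ball should still be inside the square
```

Skipping the clamp-back step, or checking the ball's center instead of its edge, is exactly the kind of shortcut that produces the visibly broken physics users are sharing.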

N8 Programs, an X user and coder in residence at AI startup Nous Research, said it took him about two hours to program a ball bouncing inside a spinning heptagon from scratch. "You have to track multiple coordinate systems, how the collisions are done in each system, and design the code to be robust from the start," N8 Programs explained in a post.
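A sketch of what that coordinate bookkeeping can look like, under simplifying assumptions (a regular polygon centred at the origin, and ignoring the extra velocity a spinning wall imparts to the ball): transform the ball into the shape's local frame, resolve the collision against the now-static polygon there, then transform back. All function and parameter names here are hypothetical.

```python
import math

def rotate(x: float, y: float, angle: float) -> tuple[float, float]:
    """Rotate a 2-D vector by `angle` radians about the origin."""
    c, s = math.cos(angle), math.sin(angle)
    return c * x - s * y, s * x + c * y

def collide_in_rotating_polygon(px, py, vx, vy, r, theta, n_sides=7, apothem=5.0):
    """
    One collision pass for a ball (position px, py; velocity vx, vy; radius r)
    inside a regular n-sided polygon rotated by `theta` in world space.
    Returns the corrected world-frame position and velocity.
    """
    # 1. World frame -> polygon's local frame (undo the rotation).
    lx, ly = rotate(px, py, -theta)
    lvx, lvy = rotate(vx, vy, -theta)

    # 2. In the local frame the polygon is static, so collisions reduce to
    #    testing the ball against each edge's inward-facing half-plane.
    for k in range(n_sides):
        # Outward unit normal of edge k of a regular polygon centred at the origin.
        a = 2.0 * math.pi * k / n_sides
        nx, ny = math.cos(a), math.sin(a)
        dist = lx * nx + ly * ny            # signed distance along the normal
        if dist + r > apothem:              # ball is penetrating this edge
            # Push the ball back inside the edge line.
            overlap = dist + r - apothem
            lx -= overlap * nx
            ly -= overlap * ny
            # Reflect the velocity only if the ball is moving outward.
            dot = lvx * nx + lvy * ny
            if dot > 0.0:
                lvx -= 2.0 * dot * nx
                lvy -= 2.0 * dot * ny

    # 3. Local frame -> world frame (re-apply the rotation).
    px, py = rotate(lx, ly, theta)
    vx, vy = rotate(lvx, lvy, theta)
    return px, py, vx, vy
```

A more faithful simulation would also account for the wall's own motion at the contact point, which is part of why doing this robustly from scratch takes longer than it looks.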

But while bouncing balls and spinning shapes are reasonable tests of programming skill, they're not a very empirical AI benchmark. Even slight variations in the prompt can, and do, produce different results, which is why some users on X report more luck with o1 while others say R1 falls short.

If anything, viral tests like this illustrate the thorny problem of creating useful measurement systems for AI models. It's often hard to tell one model apart from another outside of esoteric benchmarks that most people don't care about.

There are many efforts underway to build better tests, such as the ARC-AGI benchmark and Humanity's Last Exam. We'll see how those fare; in the meantime, watch GIFs of balls bouncing in spinning shapes.