Study accuses LM Arena of helping top AI labs game its benchmark

A new paper from AI lab Cohere, Stanford, MIT, and AI2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies achieve better leaderboard scores at the expense of rivals.

According to the authors, LM Arena allowed industry-leading AI companies such as Meta, OpenAI, Google, and Amazon to privately test several variants of their AI models, then withhold the scores of the lowest performers. This made it easier for these companies to reach the top of the platform's leaderboard, though the opportunity was not extended to every company, the authors said.

"Only a few (companies) were told that the private test was available, and some companies received a much more private test (companies) than others," Cohere said in an interview with TechCrunch. "This is gamification."

Created in 2023 as an academic research project out of UC Berkeley, Chatbot Arena has become a go-to benchmark for AI companies. It works by pitting answers from two different AI models against each other side by side in a "battle" and asking users to choose the better one. It is not uncommon to see unreleased models competing in the arena under a pseudonym.

Votes accumulate over time into a model's score and, consequently, its placement on the Chatbot Arena leaderboard. While many commercial players participate in Chatbot Arena, LM Arena has long maintained that its benchmark is impartial and fair.
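The article does not spell out the exact scoring formula, but the general idea of turning pairwise battle votes into leaderboard scores can be sketched with an Elo-style rating update. The sketch below is illustrative only; the K-factor, starting rating, and function names are assumptions, not LM Arena's actual implementation.

```python
# Minimal sketch of turning pairwise "battle" votes into leaderboard
# scores via an Elo-style update. Constants and names are illustrative
# assumptions, not LM Arena's actual code.
from collections import defaultdict

K = 32            # assumed update step size
START = 1000.0    # assumed starting rating for every model

ratings = defaultdict(lambda: START)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def record_battle(model_a: str, model_b: str, winner: str) -> None:
    """Update both models' ratings after a user votes for `winner`."""
    exp_a = expected_score(ratings[model_a], ratings[model_b])
    score_a = 1.0 if winner == model_a else 0.0
    ratings[model_a] += K * (score_a - exp_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - exp_a))

# Example: three votes, then a simple leaderboard.
record_battle("model-x", "model-y", winner="model-x")
record_battle("model-x", "model-z", winner="model-z")
record_battle("model-y", "model-z", winner="model-z")
print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```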

That, however, is not what the paper's authors say they found.

The authors say that one AI company, Meta, was able to privately test 27 model variants on Chatbot Arena between January and March, in the lead-up to the tech giant's Llama 4 release. At launch, Meta publicly revealed the score of only a single model, one that happened to rank near the top of the Chatbot Arena leaderboard.

Charts pulled from the study. (Credit: Singh et al.)

LM Arena co-founder and UC Berkeley professor Ion Stoica said in an email to TechCrunch that the study was full of "inaccuracies" and "questionable analysis."

"We are committed to fair, community-oriented assessments and invite all model providers to submit more models to conduct tests and improve their performance in human preferences," LM Arena said in a statement provided to TechCrunch. "If the model provider chooses to submit more tests than the other model provider, this does not mean that the second model provider is being treated unfairly."

Armand Joulin, a principal researcher at Google DeepMind, also noted in a post on X that some of the study's numbers were inaccurate, claiming Google sent only one Gemma 3 AI model to LM Arena for pre-release testing. Hooker responded to Joulin on X, promising the authors would issue a correction.

Allegedly favored labs

The authors of the paper began their research in November 2024 after learning that some AI companies might be receiving preferential access to Chatbot Arena. In total, they measured more than 2.8 million Chatbot Arena battles over a five-month period.

The authors say they found evidence that LM Arena allowed certain AI companies, including Meta, OpenAI, and Google, to collect more data from Chatbot Arena by having their models appear in a higher number of "battles." This increased sampling rate gave these companies an unfair advantage, the authors argue.

Using additional data from LM Arena could improve a model's performance on Arena Hard, another benchmark LM Arena maintains, by 112%, according to the paper. However, LM Arena said in a post on X that Arena Hard performance does not directly correlate with Chatbot Arena performance.

Hooker said it is unclear how certain AI companies might have received priority access, but that LM Arena has a responsibility to increase its transparency regardless.

In a post on X, LM Arena said that a number of the paper's claims do not reflect reality. The organization pointed to a blog post it published earlier this week indicating that models from non-major labs appear in more Chatbot Arena battles than the study suggests.

One important limitation of the study is that it relied on "self-identification" to determine which AI models were being privately tested on Chatbot Arena. The authors prompted AI models several times about their company of origin and relied on the models' answers to classify them, an approach that is not foolproof.
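The paper's exact prompts are not quoted here, but the self-identification approach described above can be sketched roughly as follows. The `query_model` callable, the prompt, and the provider list are hypothetical placeholders, not the authors' actual code.

```python
# Rough sketch of the "self-identification" idea: ask an anonymous model
# several times who made it, and classify it by the most common answer.
# `query_model` is a hypothetical stand-in for an API call; the provider
# list and prompt are illustrative.
from collections import Counter

PROVIDERS = ["meta", "openai", "google", "amazon", "anthropic"]

def classify_by_self_identification(query_model, n_samples: int = 5) -> str:
    """Guess a model's company of origin from its own answers."""
    votes = Counter()
    for _ in range(n_samples):
        answer = query_model("Which company created you?").lower()
        for provider in PROVIDERS:
            if provider in answer:
                votes[provider] += 1
    # If the model never names a known provider, the guess is "unknown",
    # one reason this method is not foolproof.
    return votes.most_common(1)[0][0] if votes else "unknown"

# Example with a stub standing in for a real API call:
print(classify_by_self_identification(lambda prompt: "I was created by Google."))
```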

However, Hooker said that when the authors reached out to LM Arena to share their preliminary findings, the organization did not dispute them.

TechCrunch reached out to Meta, Google, OpenAI, and Amazon, all of which were mentioned in the study, for comment. None immediately responded.

LM Arena in Hot Water

In the paper, the authors call on LM Arena to implement a number of changes aimed at making Chatbot Arena more "fair." For example, the authors say, LM Arena could set a clear and transparent limit on the number of private tests AI labs can conduct, and publicly disclose the scores from those tests.

In a post on X, LM Arena rejected these suggestions, claiming it has published information on pre-release testing since March 2024. The benchmarking organization also said it "makes no sense to show scores for pre-release models which are not publicly available," because the AI community cannot test those models for itself.

The researchers also say LM Arena could adjust Chatbot Arena's sampling rate to ensure that all models in the arena appear in the same number of battles. LM Arena has been publicly receptive to the suggestion, indicating that it will create a new sampling algorithm.
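LM Arena has not said what its new sampling algorithm will look like. As a rough illustration of the researchers' suggestion, one simple way to equalize exposure is to always pair the two models that have appeared in the fewest battles so far; the sketch below is a simplified assumption, not LM Arena's sampler.

```python
# Simplified sketch of the equal-exposure sampling the researchers suggest:
# always pair the two models with the fewest battles so far.
# Illustrative assumption, not LM Arena's actual sampler.
from collections import defaultdict
from typing import List, Tuple

battle_counts = defaultdict(int)

def pick_battle(models: List[str]) -> Tuple[str, str]:
    """Choose the two least-exposed models for the next battle."""
    by_exposure = sorted(models, key=lambda m: battle_counts[m])
    model_a, model_b = by_exposure[0], by_exposure[1]
    battle_counts[model_a] += 1
    battle_counts[model_b] += 1
    return model_a, model_b

# Example: over many rounds, every model ends up with a similar count.
models = ["model-x", "model-y", "model-z", "model-w"]
for _ in range(100):
    pick_battle(models)
print(dict(battle_counts))
```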

The paper comes weeks after Meta was caught gaming benchmarks on Chatbot Arena around the launch of its aforementioned Llama 4 models. Meta optimized one of its Llama 4 models for "conversationality," which helped it achieve an impressive score on Chatbot Arena's leaderboard. But the company never released the optimized model, and the vanilla version ended up performing much worse on Chatbot Arena.

At the time, LM Arena said Meta should have been more transparent in its approach to benchmarking.

Earlier this month, LM Arena announced it was forming a company, with plans to raise capital from investors. The study adds to scrutiny of private benchmark organizations, and of whether they can be trusted to evaluate AI models without corporate influence clouding the process.