Meta cheated on an AI benchmark, and that is hilarious. According to Kylie Robison at The Verge, the suspicions started percolating after Meta released two new AI models based on its Llama 4 large language model over the weekend. The new models are Scout, a smaller model intended for quick queries, and Maverick, which is meant to be a super-efficient rival to better-known models like OpenAI’s GPT-4o (the harbinger of our Miyazaki apocalypse).
In the blog post announcing them, Meta did what every AI company now does with a major release: it dropped a whole bunch of highly technical data to brag about how its AI was smarter and more efficient than models from Google, OpenAI, and Anthropic, the companies more closely associated with AI. These release posts are always mired in deeply technical data and benchmarks that are hugely useful to researchers and the most obsessive AI watchers, but largely useless for the rest of us. Meta’s announcement was no different.
But plenty of AI obsessives immediately noticed one shocking benchmark result Meta highlighted in its post: Maverick had an Elo score of 1417 on LMArena, an open-source collaborative benchmarking tool where users vote on the better of two model outputs in head-to-head matchups. A higher score is better, and Maverick’s 1417 put it in the number 2 spot on LMArena’s leaderboard, just above GPT-4o and just below Gemini 2.5 Pro. The whole AI ecosystem rumbled with surprise at the results.
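For the curious, the Elo system behind leaderboards like this is simple arithmetic: each head-to-head vote nudges both models’ ratings based on how surprising the result was. Here is a minimal sketch in Python; the K-factor of 32 and the sample ratings are illustrative assumptions, not LMArena’s actual parameters.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model (logistic curve, base 10)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, winner: str, k: float = 32.0):
    """Return updated (r_a, r_b) after one head-to-head vote.

    winner: "a" or "b". K controls how far a single result moves a rating;
    32 is a common choice, not necessarily what any real leaderboard uses.
    """
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if winner == "a" else 0.0
    r_a_new = r_a + k * (s_a - e_a)
    # The loser gives up exactly what the winner gains, so ratings are conserved.
    r_b_new = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Hypothetical example: a 1417-rated model beats a 1410-rated rival.
ra, rb = elo_update(1417.0, 1410.0, winner="a")
```

Because a win over a near-equal is only mildly surprising, the winner gains roughly half of K here; beating a much higher-rated model would move the needle far more, which is why a model that charms voters into consistent wins can climb the ladder quickly.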
Then they started digging, and quickly noticed that, in the fine print, Meta had acknowledged the Maverick model crushing it on LMArena was a tad different from the version users actually have access to. The company had tuned this version to be more chatty than usual. Effectively, it charmed the benchmark into submission.
It doesn’t seem like LMArena was pleased with the charm offensive. “Meta’s interpretation of our policy did not match what we expect from model providers,” it said in a statement on X. “Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.”
I love LMArena’s optimism here, because gaming a benchmark is practically a rite of passage in consumer technology, and I suspect this trend will continue. I’ve been covering consumer technology for over a decade, and I once ran one of the more extensive benchmarking labs in the industry. In that time I have seen plenty of phone and laptop makers attempt all kinds of tricks to juice their scores: messing with display brightness for better battery-life numbers, or shipping bloatware-free versions of laptops to reviewers to get better performance scores.
Now AI models are getting chattier to juice their scores, too. And the reason I suspect this won’t be the last carefully cultivated score is that right now these companies are desperate to distinguish their large language models from one another. If every model can help you write a shitty English paper five minutes before class, then you’ll need another reason to justify your preference. “My model uses less energy and accomplishes the task 2.46% faster” might not seem like the biggest brag, but it matters. That’s still 2.46% faster than everyone else.
As these AIs continue to mature into actual consumer-facing products, we’ll start seeing more benchmark bragging. Hopefully, we’ll see the other stuff too: user interfaces will change, and goofy storefronts like the Explore GPTs section of the ChatGPT app will become more common. These companies are going to need to prove their models are the best, and benchmarks alone won’t do that. Not when a chatty bot can game the system so easily.