Meta cheated on an AI benchmark, and that is hilarious. According to Kylie Robison at The Verge, the suspicions started percolating after Meta released two new AI models based on its Llama 4 large language model over the weekend. The new models are Scout, a smaller model intended for quick queries, and Maverick, which is meant to be a super-efficient rival to better-known models like OpenAI’s GPT-4o (the harbinger of our Miyazaki apocalypse).
In the blog post announcing them, Meta did what every AI company now does with a major release: it dropped a whole bunch of highly technical data to brag about how its AI was smarter and more efficient than models from Google, OpenAI, and Anthropic, the companies more closely associated with AI. These release posts are always mired in deeply technical data and benchmarks that are hugely useful to researchers and the most obsessive AI watchers, but largely useless for the rest of us. Meta’s announcement was no different.
But plenty of AI obsessives immediately noticed one shocking benchmark result Meta highlighted in its post: Maverick had an Elo score of 1417 on LMArena, an open-source collaborative benchmarking tool where users vote on the better of two model outputs in head-to-head matchups. A higher score is better, and Maverick’s 1417 put it in the number 2 spot on LMArena’s leaderboard, just above GPT-4o and just below Gemini 2.5 Pro. The whole AI ecosystem rumbled with surprise at the results.
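For the curious, the Elo system behind leaderboards like this is simple arithmetic: each head-to-head vote nudges both models’ ratings based on how surprising the result was. Here is a minimal sketch in Python; the K-factor of 32 and the sample ratings are illustrative assumptions, not LMArena’s actual parameters.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model (logistic curve, base 10)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, winner: str, k: float = 32.0):
    """Return updated (r_a, r_b) after one head-to-head vote.

    winner: "a" or "b". K controls how far a single result moves a rating;
    32 is a common choice, not necessarily what any real leaderboard uses.
    """
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if winner == "a" else 0.0
    r_a_new = r_a + k * (s_a - e_a)
    # The loser gives up exactly what the winner gains, so ratings are conserved.
    r_b_new = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Hypothetical example: a 1417-rated model beats a 1410-rated rival.
ra, rb = elo_update(1417.0, 1410.0, winner="a")
```

Because a win over a near-equal is only mildly surprising, the winner gains roughly half of K here; beating a much higher-rated model would move the needle far more, which is why a model that charms voters into consistent wins can climb the ladder quickly.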
Then they started digging, and quickly noticed that, in the fine print, Meta had acknowledged the Maverick model crushing it on LMArena was a tad different from the version users actually have access to. The company had tuned this version to be more chatty than usual. Effectively, it charmed the benchmark into submission.
It doesn’t seem like LMArena was pleased with the charm offensive. “Meta’s interpretation of our policy did not match what we expect from model providers,” it said in a statement on X. “Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.”
I love LMArena’s optimism here, because gaming a benchmark is practically a rite of passage in consumer technology, and I suspect this trend will continue. I’ve been covering consumer technology for over a decade, and I once ran one of the more extensive benchmarking labs in the industry. In that time I have seen plenty of phone and laptop makers attempt all kinds of tricks to juice their scores: messing with display brightness for better battery-life numbers, or shipping bloatware-free versions of laptops to reviewers to get better performance scores.
Now AI models are getting chattier to juice their scores, too. And the reason I suspect this won’t be the last carefully cultivated score is that right now these companies are desperate to distinguish their large language models from one another. If every model can help you write a shitty English paper five minutes before class, then you’ll need another reason to justify your preference. “My model uses less energy and accomplishes the task 2.46% faster” might not seem like the biggest brag, but it matters. That’s still 2.46% faster than everyone else.
As these AIs continue to mature into actual consumer-facing products, we’ll start seeing more benchmark bragging. Hopefully, we’ll see the other stuff too: user interfaces will change, and goofy storefronts like the Explore GPTs section of the ChatGPT app will become more common. These companies are going to need to prove their models are the best, and benchmarks alone won’t do that. Not when a chatty bot can game the system so easily.