Meta launched two new versions of its Llama 4 AI over the weekend, including a smaller model called Scout and a mid-sized model called Maverick. The company claimed that the latter outperformed GPT-4o and Gemini 2.0 Flash on many popular benchmarks, but it appears there was something the company did not make clear to the people running those tests.
Meta faces backlash for using a custom-tuned AI model in public benchmarks, prompting accusations of misleading performance claims
Meta's Maverick took the second spot on LMArena soon after its launch. However, there is more to the story than meets the eye. If you are unfamiliar with LMArena, it is a site where people compare AI responses side by side and vote for the one they judge best based on relevance and accuracy.
Meta was proud to announce that Maverick has an Elo score of 1417, which beats GPT-4o and sits just behind Gemini 2.5 Pro. It appeared that Meta had created an AI model that competes with two of the best models in the industry. Well, not quite, as people were quick to notice that something was not adding up. Soon after, Meta admitted that the model it had submitted to LMArena was different from the one it released to the public.
Instead, Meta submitted an experimental chat version, which was optimized and fine-tuned to sound better in conversations, according to TechCrunch. LMArena responded by saying that “Meta’s interpretation of our policy did not match what we expect from model providers.” It also added that Meta should have been more transparent about using the “Llama-4-Maverick-03-26-Experimental” version, which was specifically optimized for human preference.
In response, LMArena has changed its leaderboard policies to make future rankings fair and reliable. Here is what a Meta spokesperson said in response to the fiasco.
“We have now released our open source version and will see how developers customize Llama 4 for their own use cases.”
While the company did not break any rules, it was not transparent enough about which model it had submitted. The episode raised concerns that Meta was gaming the leaderboard by using a specially optimized version of the model that would not be available to the public. Independent AI researcher Simon Willison said:
“When Llama 4 came out and hit #2, that really impressed me — and I’m kicking myself for not reading the small print.”
“It’s a very confusing release generally… The model score that we got there is completely worthless to me. I can’t even use the model that got a high score.”
Separately, there were rumors that Meta had trained its AI models to perform well on certain tests, but the company's VP of Generative AI, Ahmad Al-Dahle, denied the claims and stated:
“We’ve also heard claims that we trained on test sets — that’s simply not true.”
Users also asked the company why the new Maverick AI model was released on a Sunday, to which Mark Zuckerberg replied, "That's when it was ready." Meta took its time releasing Llama 4, and the launch comes none too soon given how strong the competition is. We will share more details on the subject, so do keep an eye out.
