Llama Drama: Meta Used An ‘Experimental’ AI Model To Climb Leaderboards, Raising Questions About Fairness, Transparency, And What Users Actually Get To Use

•

Apr 9, 2025 at 09:57am EDT

Meta launches new Maverick AI model that climbed on number 2 in leaderbiard, but it is not true

Meta launched two new versions of its Llama 4 AI over the weekend, including a smaller model called Scout and a mid-sized model called Maverick. The company claimed that its latter model outperformed ChatGPT-4o and Gemini 2.0 Flash on many popular tests, but it appears that there is something the company did not tell the testers, or did it?

Meta faces backlash for using a custom-tuned AI model in public benchmarks, prompting accusations of misleading performance claims

Meta's Maverick gained the second spot on LMArena soon after its launch, climbing the leaderboard in an attempt to take the throne for good. However, there is more to the story than meets the eye. If you are unfamiliar with LMArena, it is a site where people compare AI responses and vote for the one they see best based on relevancy and accuracy.

Meta was proud to announce that Maverick has an ELO score of 1417, which beats the likes of GPT-4o and rests a tad bit behind Gemini 2.5 Pro. It appears that Meta had created an AI model that competes against two of the best models in the industry. Well, not quite, as people were quick to notice that something was not adding right. Soon after, Meta admitted that the model they had submitted to LMArena was different from the one they would release to the public.

Instead, Meta submitted an experimental chat version, which was optimized and fine-tuned to sound better in conversations, according to TechCrunch. LMArena responded by saying that "Meta’s interpretation of our policy did not match what we expect from model providers.” They also added that Meta should have been more transparent about using the “Llama-4-Maverick-03-26-Experimental” version, which was specifically designed for human preference.

In response, LMArena has changed its leaderboard policies to make future rankings fair and reliable. Here's what the Meta spokesperson said in response to the fiasco.

“We have now released our open source version and will see how developers customize Llama 4 for their own use cases.”

While the company did not break any rules, it was not clear enough. However, it raised concerns that the company was gaming the leaderboard by using an optimized and up-scaled version of the model, which would not be available to the public. An independent AI researcher, Simon Willison, admitted that:

“When Llama 4 came out and hit #2, that really impressed me — and I’m kicking myself for not reading the small print.”

“It’s a very confusing release generally… The model score that we got there is completely worthless to me. I can’t even use the model that got a high score.”

On the flip side, there were also rumors that Meta trained its AI models to perform well in certain tests, but the company's VP of Generative AI, Ahman Al-Dahle, negated the comments and stated:

“We’ve also heard claims that we trained on test sets — that’s simply not true.”

Users also asked the company why the new Maverick AI model was released on a Sunday, to which Mark Zuckerberg replied, "That's when it was ready." Meta took its sweet time to release the LLama 4, but it is about time given how strong the competition is. We will share more details on the subject, so do keep an eye out.

About the author: Ali Salman is a technology reporter for Wccftech mobile section with a specialized focus on Apple and the intellectual property that drives mobile innovation. He has cultivated a unique expertise in analyzing and deconstructing complex technology patents, translating dense legal and technical documents into clear, insightful reports on future products.

Follow Wccftech on Google to get more of our news coverage in your feeds.

Llama Drama: Meta Used An ‘Experimental’ AI Model To Climb Leaderboards, Raising Questions About Fairness, Transparency, And What Users Actually Get To Use

Meta faces backlash for using a custom-tuned AI model in public benchmarks, prompting accusations of misleading performance claims

Related Story From PCs To Servers, DDR4 Memory Is Back In Action As DDR5 Shortages Continue

Further Reading

AWS Graviton5 CPUs Now Available: Purpose-Built For AI With 25% Performance Uplift, 192 Cores, DDR5-8800 & PCIe Gen6 Support

NVIDIA's Vera CPU Locks In CoreWeave, Meta, Oracle and Alibaba as Early Buyers, Opening a Multi-Billion-Dollar Front Beyond Rubin Racks

AMD Has Begun Sampling MI450 GPUs & Also Engaged With Customers On MI500, Largest AI Deployments Are For Inference

xAI Is Reportedly Using Just 11% of Its 550,000 NVIDIA GPUs, While Meta and Google Squeeze Out 43-46% From Their Fleets