Big scoop! Llama 4 is caught in a leaderboard controversy: a self-described "internal employee" posts accusations, and the version submitted for evaluation was allegedly a special build?

Meta's newly released Llama 4 model has sparked controversy. A post attributed to an internal employee alleges that the model underperformed and that the company "optimized" its results by mixing test data into the later stages of training in order to hit its targets. The poster says they resigned in protest and asked not to be named in the technical report. User feedback also indicates that Llama 4 performs poorly, and TechCrunch has described the version Meta submitted for testing as misleading. The incident has triggered widespread discussion about integrity in AI research and development.

A new scoop, and this time the protagonist is Llama 4, the flagship model Meta released just yesterday.

Internal Leak: Performance Below Standards, Pressure to "Optimize" Results?

The discussion was first ignited by a post on the 1Point3Acres (一亩三分地) forum, where the poster claimed to be an internal employee involved in training Llama 4 and said they had resigned over it.

The post contained a lot of information, mainly mentioning the following points:

Performance Bottleneck: Despite the team's repeated training efforts, Llama 4's internal performance consistently fell short of the open-source SOTA (state-of-the-art) level, with a noticeable gap.
"Workaround" Strategy: Company leadership proposed mixing "test set" data from various benchmarks into the training or fine-tuning data in the later stages of training. The purpose was straightforward: hit the targets on each metric and deliver a good-looking report card.
Deadline Pressure: This score-boosting task had a clear deadline, the end of April. If the targets were not met by then, the consequences could be severe.
Poor User Feedback: After Llama 4 was released (the post was published shortly after the release), many users on X and Reddit reported very poor results in their own testing.
Academic Integrity and Resignation: The poster said that, coming from an academic background, they could not accept the practice of "contaminating the test data to hit the targets." They therefore submitted their resignation and explicitly asked that their name not appear in the Llama 4 technical report.
Executive Movements (mentioned in the post): The post also claimed that Meta's VP of AI resigned for similar reasons. (Note from the blogger: this is a one-sided claim from the post and should be treated with caution.)

The post quickly drew attention within the community, sparking debate over whether this practice violates the basic integrity of AI research and development.

The facts have yet to be confirmed.
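For readers unfamiliar with what "mixing test sets into training data" means in practice, here is a minimal sketch of one common way outsiders probe for it: checking whether long word sequences from benchmark questions appear verbatim in a training corpus. Everything in it is an illustrative assumption (the function names, the 8-word window, the toy data), and it says nothing about what Meta did or did not do.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(train_docs, test_items, n=8):
    """Hypothetical helper: fraction of test items that share at least one
    n-gram with the training documents, plus the list of flagged items."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    # A test item is flagged if any of its n-grams also occurs in training data.
    flagged = [item for item in test_items if ngrams(item, n) & train_grams]
    rate = len(flagged) / len(test_items) if test_items else 0.0
    return rate, flagged


# Toy data, invented purely for illustration.
train_docs = [
    "intro text what is the capital of france and which river flows through it answer paris",
    "unrelated text about cooking pasta at home",
]
test_items = [
    "What is the capital of France and which river flows through it",
    "Name the largest planet in the solar system and its biggest moon",
]

rate, flagged = contamination_rate(train_docs, test_items, n=8)
print(f"flagged {len(flagged)} of {len(test_items)} test items ({rate:.0%})")
```

A check like this only catches verbatim leakage and requires access to the training data, so it cannot settle an allegation like the one in the post from the outside. It does show why contaminated training data inflates benchmark scores: the model has already seen the questions.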

External Observation: TechCrunch Questions the "Misleading" Test Version

Coincidentally, the well-known tech outlet TechCrunch also published an article pointing out that the benchmark results for Meta's new AI model are "a bit misleading."

TechCrunch's article focuses on Llama 4 Maverick and its performance on LM Arena, the well-known human-evaluation leaderboard. Maverick did take second place, but there seems to be more to the story:

1. Version Differences: The Maverick version Meta submitted to LM Arena for evaluation may not be the same as the version publicly released to developers.
2. Official Annotation: Meta did mention this in its announcement and on the official Llama website, stating that the version tested on LM Arena is an "experimental chat version," also labeled as "Llama 4 Maverick optimized specifically for conversational scenarios."
3. The Issue of "Optimized for Rankings": TechCrunch pointed out that although LM Arena is not a perfect evaluation tool, AI companies have typically not (at least not publicly) submitted a specially optimized version solely to improve their ranking. Meta's approach amounts to tuning one version for the benchmark to boost its ranking while giving developers an unoptimized "base version."
4. Misleading Developers: This practice makes it hard for developers to use the ranking to estimate how the model will actually perform in their own applications. Benchmarks have limitations, but they should still provide a reasonably fair reference.
5. Behavioral Differences: Researchers on X also found that the publicly downloadable Maverick behaves differently from the version tested on LM Arena. The leaderboard version tends to use more emojis and gives noticeably more verbose answers.

Some Real-World Tests of Llama 4

Despite the claimed recall over a context of ten million tokens, actual long-context performance falls far short of expectations.

Llama 4 Maverick scored only 16% on the Aider polyglot coding benchmark!

This article is sourced from: AI Cambrian, original title: "The Big Scoop is Here! Llama 4 is Caught in Ranking Controversy: 'Internal Employee' Posts Accusations, Evaluation Version Allegedly Specially Provided?"
