Alibaba's largest-ever open source release, surpassing GPT-4o and Llama-3.1!

Wallstreetcn
2024.09.19 03:01

Alibaba announced its largest open-source release in history, introducing the base model Qwen2.5 and the derivative models Qwen2.5-Coder and Qwen2.5-Math, with more than 10 versions covering different users and scenarios. Qwen2.5 performs exceptionally well on multiple benchmarks, surpassing Meta's Llama-3.1 and becoming one of the most powerful open-source models. Alibaba also provides APIs for Qwen-Plus and Qwen-Turbo, making it easy for developers to quickly integrate generative AI features.

In the early hours of today, Alibaba officially announced the largest open-source release in its history, introducing the base model Qwen2.5, Qwen2.5-Coder dedicated to coding, and Qwen2.5-Math for mathematics.

These three model families comprise more than 10 versions in total, including 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes, covering individuals, enterprises, and different user groups and business scenarios such as mobile and PC.
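
For the open-weight versions, a minimal local-inference sketch with Hugging Face transformers might look like the following; the model id Qwen/Qwen2.5-7B-Instruct and the chat-template flow follow the usual pattern for Qwen instruct models, and should be verified against the model card.

```python
# Minimal sketch: running the open-weight Qwen2.5-7B-Instruct locally with
# Hugging Face transformers. Adjust dtype/device settings to your hardware;
# device_map="auto" additionally requires the accelerate package.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
# Build the prompt with the model's chat template, then generate new tokens.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated part, skipping the prompt tokens.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```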

If you don't want to go through the cumbersome deployment process, Alibaba has also opened up APIs for its flagship models Qwen-Plus and Qwen-Turbo, so you can quickly develop or integrate generative AI features (a minimal sketch follows the links below).

Open Source Address: https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e

Github: https://github.com/QwenLM/Qwen2.5?tab=readme-ov-file

Online Demo: https://huggingface.co/spaces/Qwen/Qwen2.5

API Address: https://help.aliyun.com/zh/model-studio/developer-reference/what-is-qwen-llm
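
For that API route, here is a minimal sketch assuming Model Studio's OpenAI-compatible mode and the model name `qwen-plus`; the exact base URL, model names, and key setup should be confirmed against the API documentation linked above.

```python
# Minimal sketch: calling Qwen-Plus through Alibaba Cloud Model Studio.
# The OpenAI-compatible base URL and the model name "qwen-plus" are
# assumptions to verify against the API documentation linked above.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],  # key issued by Model Studio
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen-plus",  # or "qwen-turbo" for the faster, lighter tier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize Qwen2.5 in two sentences."},
    ],
)
print(resp.choices[0].message.content)
```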

Below, the "AIGC Open Community" will detail the performance characteristics and test results of these models.

Qwen2.5 Series Performance Testing

Test results for Qwen2.5-72B, the largest instruction-tuned model Alibaba open-sourced this time, are reported on globally recognized benchmarks including MMLU-Pro, MMLU-redux, GPQA, MATH, GSM8K, HumanEval, and MBPP.

Although Qwen2.5-72B has only 72 billion parameters, it outperformed Meta's latest open-source Llama-3.1 instruction-tuned model with 405 billion parameters on multiple benchmarks, and comprehensively surpassed Mistral's newly open-sourced Large-V2 instruction-tuned model, making it one of the most powerful open-source models currently available.

Even the base model, without instruction fine-tuning, outperforms Llama-3-405B.

Qwen-Plus, Alibaba's flagship model available through an open API, performs comparably to the closed-source GPT-4o and Claude-3.5-Sonnet.

In addition, the Qwen2.5 series introduces two new parameter sizes, 14 billion and 32 billion: Qwen2.5-14B and Qwen2.5-32B.

Their instruction-tuned versions outperform Google's Gemma2-27B and Microsoft's Phi-3.5-MoE-Instruct; compared with the closed-source GPT-4o mini, they fall slightly short on only three tests and exceed it on all other benchmarks.

Since Alibaba released CodeQwen1.5, the model has attracted a large number of users who rely on it for programming tasks such as debugging, answering programming-related questions, and providing code suggestions.

The newly released, instruction-tuned Qwen2.5-Coder-7B outperforms well-known models with far more parameters on many benchmarks.

Not long ago, Alibaba released its first math-focused model, Qwen2-Math. The newly released Qwen2.5-Math is pre-trained on a larger corpus of high-quality mathematical data, including synthetic data generated by Qwen2-Math. It also adds support for Chinese and strengthens reasoning through chain-of-thought (CoT), program-of-thought (PoT), and tool-integrated reasoning (TIR).
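
As an illustration of the CoT mode, here is a hedged prompting sketch; the model id Qwen/Qwen2.5-Math-7B-Instruct and the step-by-step system prompt mirror those published for Qwen2-Math and should be verified against the Qwen2.5-Math model cards.

```python
# Minimal sketch: CoT-style prompting for the math models with transformers.
# The model id and the "reason step by step" system prompt follow the pattern
# published for Qwen2-Math; verify both against the Qwen2.5-Math model cards.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Math-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    # Chain-of-thought: ask for step-by-step reasoning and a boxed final answer.
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": "Find the value of x that satisfies 3x + 7 = 22."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```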

Among them, Qwen2.5-Math-72B's overall performance surpasses the instruction-tuned Qwen2-Math-72B and the well-known closed-source GPT-4o.

From the test data above, it is clear that even very small models, given high-quality data and architecture, can outperform models with far more parameters, which brings significant advantages in energy consumption and deployment environments. This Qwen2.5 release pushes the performance of small-parameter models to the limit.

Introduction to the Qwen2.5 Series

The Qwen2.5 series supports more than 29 mainstream languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, and Japanese. Like Qwen2, the Qwen2.5 language models support a context length of up to 128K tokens and can generate up to 8K tokens of output.

Compared to Qwen2, the Qwen2.5 series expands its pre-training data to an astonishing 18 trillion tokens, surpassing the roughly 15 trillion tokens used for Meta's latest open-source Llama-3.1 and making it the open-source model trained on the most data to date.

Knowledge capabilities are significantly enhanced: on the MMLU benchmark, Qwen2.5-7B improves from 70.3 to 74.2 over Qwen2-7B, and Qwen2.5-72B improves from 84.2 to 86.1 over Qwen2-72B. Qwen2.5 also shows significant gains on the GPQA, MMLU-Pro, MMLU-redux, and ARC-c benchmarks.

Qwen2.5 is able to generate responses that are more in line with human preferences. Compared to Qwen2-72B-Instruct, Qwen2.5-72B-Instruct's Arena-Hard score has significantly increased from 48.1 to 81.2, and the MT-Bench score has increased from 9.12 to 9.35.

Mathematical capabilities are also much stronger: after incorporating Qwen2-Math's techniques, the MATH benchmark scores of Qwen2.5-7B/72B-Instruct rise from 52.9/69.0 (Qwen2-7B/72B-Instruct) to 75.5/83.1.

Furthermore, Qwen2.5 achieves significant improvements in instruction following, long-text generation (from 1K to over 8K tokens), understanding structured data (such as tables), and generating structured outputs (especially JSON). It also handles diverse system prompts more flexibly, improving role-playing and condition setting for chatbots.

Source: AIGC Open Community; original title: "Alibaba's largest open source release in history, surpassing GPT-4o, Llama-3.1!"