Wallstreetcn
2024.03.28 04:22

A new king of large models is born! Claude 3 surpasses GPT-4 for the first time

Claude 3's flagship Opus topped the list, while the mid-sized Sonnet and the lightweight Haiku took impressive fourth and sixth places respectively. Haiku has reached GPT-4 level.

Author: Bu Shuqing

Source: Hard AI

Overnight, the world of large models has crowned a "new king"!

On Wednesday local time, the Chatbot Arena updated its battle rankings, with Claude 3 surpassing GPT-4 to claim the title of "strongest king".

Topping the list this time is Opus from the Claude 3 series, which edged out GPT-4-1106-preview by a slim two-point Elo margin, while GPT-4-0125-preview ranked third.

Opus was not alone: the series' other two members, Sonnet and Haiku, also made the top 10, ranking fourth and sixth respectively.

Haiku reaches GPT-4 level

The Haiku model in particular was singled out for praise by the platform itself.

"Haiku has left a deep impression on everyone, according to our user preferences, Claude 3 Haiku has reached the level of GPT-4!" praised the LMSYS platform running Chatbot Arena in a post, "Its speed, functionality, and context length are currently unique in the market."

What makes this all the more remarkable is that Haiku's parameter scale is far smaller than that of Opus or any GPT-4 model; its price is 1/60 that of Opus, yet its responses are 10 times faster.

Since it was added to the Chatbot Arena leaderboard in May last year, GPT-4 had firmly held the top spot. Now, on the strength of its performance, especially on advanced tasks, Claude 3 has overturned that order.

"This is the first time in history that the top model Opus for advanced tasks and the cost-effective Haiku are both from non-OpenAI suppliers," independent AI researcher Simon Willison said in a media interview, "This is very gratifying - diversity among top suppliers benefits everyone in this field."

"Kneel to the new king!"

Netizens have also given Claude 3 a thumbs-up.

"Impressive, Very nice!"

Some suggest Apple should set Claude as the default AI tool.

Others exclaim, "The old king is dead. Rest in peace, GPT-4."

"Kneel to the new king!"

In contrast, netizens have more mixed feelings towards GPT-4.

"GPT-4 has become very clumsy."

In recent months, discussions about GPT-4 becoming lazy have been buzzing online.

Users report that during peak hours GPT-4's responses become slow and perfunctory, and it sometimes even refuses to respond, cutting the conversation short on its own.

For example, in programming tasks it habitually skips over parts of the code, leaving users to write those sections themselves.

Is the scoring accurate?

Amidst the praise for Claude 3, there are also voices of doubt.

So, how does Chatbot Arena score these large models?

Chatbot Arena is developed by the LMSYS team, led by the University of California, Berkeley. The platform lets different large models "compete" anonymously and at random, with human users acting as judges to decide which model performs better. The system then scores the models based on users' choices and compiles the results into a leaderboard.

Since its launch, more than 400,000 users have served as judges on Chatbot Arena, and the latest round of rankings drew another 70,000 users.
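For readers curious about how pairwise votes turn into a single ranking number, the sketch below illustrates a generic Elo-style update driven by head-to-head votes. It is only an illustration of the general idea under assumed constants (the K factor and starting rating here are made up); it is not LMSYS's actual scoring code, and Chatbot Arena's published methodology involves additional statistical machinery.

```python
# Minimal sketch of an Elo-style rating update from pairwise human votes.
# The K factor and starting rating are illustrative assumptions, not the
# parameters actually used by Chatbot Arena / LMSYS.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 4.0):
    """Return both ratings after one anonymous head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - exp_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - exp_a))
    return rating_a, rating_b

# Example: two hypothetical models start at 1000; model A wins one vote.
a, b = 1000.0, 1000.0
a, b = update(a, b, a_won=True)
print(round(a, 1), round(b, 1))  # A edges ahead, B drops by the same amount
```

On this scale, a two-point gap like the one between Opus and GPT-4-1106-preview corresponds to an expected win rate of only about 50.3%, which is why the lead is described as "slight."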

In this fierce "arena battle," Claude 3 emerged as the new king of large models after thousands of battles against strong opponents such as GPT-4 and Gemini.

It is worth mentioning that when evaluating the quality of a large model, users' "feelings" or user experience are crucial.

"The so-called parameter standards cannot truly evaluate the value of large models," AI software developer Anton Bacaj posted earlier, "I just had a long coding conversation with Claude 3 Opus, and it really surpasses GPT-4 by far."

The rise of Claude 3 may leave OpenAI somewhat uneasy, as some users have already begun to "defect" in their daily work, abandoning ChatGPT in favor of Claude 3.

"Since I have Claude 3 Opus, I haven't used ChatGPT anymore."

Software developer Pietro Schirano wrote on the X platform: "To be honest, Claude 3 > GPT-4 is one of the most shocking things, and switching is just too easy."

However, some have pointed out that Chatbot Arena does not measure performance with tools enabled, which happens to be GPT-4's strength.

Furthermore, the scores of Claude 3 Opus and GPT-4 are very close, and GPT-4 has been around for a year, with a more powerful GPT-4.5 or GPT-5 expected to arrive at some point this year. There is no doubt that the competition between these two models will be even fiercer by then.