Large models in 2024: this may be the earliest annual summary article!

Wallstreetcn
2025.01.02 01:55

GPT-4 is "generally surpassed," and it only takes 5.57 million dollars to train a top AI large model? Understand the disruptive breakthroughs of large models in 2024 in one article!

In a sense, 2024 is not only a year of technological breakthroughs but also an important turning point for the industry's maturation.

This year, models at the GPT-4 level are no longer rare, with many institutions developing models that outperform GPT-4; this year, operational efficiency has significantly improved, and costs have sharply decreased; this year, multimodal LLMs, especially those supporting image, audio, and video processing, have become increasingly common.

Technological advancements have also brought about a flourishing of application scenarios. Prompt-based application generation has become the industry standard, and voice conversations and real-time camera interactions have turned science fiction scenarios into reality. When OpenAI launched the o1 series of reasoning models at the end of the year, pioneering a new paradigm for enhancing performance through optimizing the reasoning phase, the entire industry took a significant step forward.

On December 31 local time, independent AI researcher and Django co-creator Simon Willison published a retrospective summary of the important events in the field of large language models in 2024, listing nearly 20 key themes, significant moments, and industry insights.

The following are the key points:

  • The GPT-4 barrier has been completely broken: By the end of 2024, 70 models from 18 institutions scored higher than the original GPT-4 released in March 2023 on the Chatbot Arena leaderboard.
  • The training costs of top large models have significantly decreased: DeepSeek v3 only requires $5.57 million in training costs to achieve performance comparable to models like Claude 3.5 Sonnet.
  • LLM prices have dropped significantly: Due to intensified competition and improved efficiency, the operational costs of LLMs have decreased dramatically. For example, Google's Gemini 1.5 Flash 8B is 27 times cheaper than 2023's GPT-3.5 Turbo. Lower costs will further drive the adoption and application of LLMs.
  • The popularity of multimodal visual models, with audio and video models beginning to emerge: In 2024, almost all major model providers released multimodal models capable of processing image, audio, and video inputs. This allows LLMs to handle richer types of information, expanding their application fields.
  • Voice and real-time camera modes turn science fiction into reality: ChatGPT and Google Gemini now support voice and real-time camera modes, allowing users to interact with the models through voice and video. This provides users with a more natural and convenient way to interact.
  • Some GPT-4 level models can run on laptops: Thanks to improved model efficiency, some GPT-4 level models, such as Qwen2.5-Coder-32B and Meta's Llama 3.3 70B, can now run on laptops with 64GB of memory. This marks a reduction in the hardware requirements for LLMs, opening the door to broader application scenarios.
  • Prompt-based application generation has become the norm: LLMs can now generate complete interactive applications based on prompts, including HTML, CSS, and JavaScript code. Tools such as Anthropic's Claude Artifacts, GitHub Spark, and Mistral Chat's Canvas all offer this functionality. This feature greatly simplifies the application development process, providing a way for non-professional programmers to build applications.
  • Universal access to the best models lasted only a few months: OpenAI launched the ChatGPT Pro paid subscription service, limiting free access to the best models. This reflects the evolution of the LLM business model, and more paid models may emerge in the future.
  • "Agent" has not yet been truly realized: The term "Agent" lacks a clear definition, and its utility is questioned, as LLMs are prone to believing false information. How to address the credibility issue of LLMs is key to realizing "Agent."
  • Evaluation is crucial: Writing good automated evaluations for LLM systems is essential for building useful applications. An effective evaluation system can help developers better understand and improve LLMs.
  • Synthetic training data works well: An increasing number of AI labs are using synthetic data to train LLMs, which helps improve model performance and efficiency. Synthetic data can overcome the limitations of real data, providing more flexible options for LLM training.
  • The environmental impact of LLMs is mixed: On one hand, improved model efficiency reduces energy consumption per inference. On the other hand, the competition among large tech companies to build infrastructure for LLMs has led to a large number of data centers being constructed, increasing pressure on power networks and the environment.
  • The difficulty of using LLMs is increasing: As the capabilities of LLMs continue to expand, their usability is also becoming more challenging. Users need to have a deeper understanding of how LLMs work and their limitations to better leverage their advantages.

The original text is compiled below. Wishing everyone a Happy New Year; enjoy:

GPT-4: From "Unreachable" to "Widely Surpassed"

In the past year, the field of large language models (LLMs) has undergone tremendous changes. Looking back at the end of 2023, OpenAI's GPT-4 was still an insurmountable peak, and other AI labs were pondering the same question: What unique technological secrets does OpenAI hold?

A year later, the situation has fundamentally changed: According to the Chatbot Arena rankings, the original version of GPT-4 (GPT-4-0314) has fallen to around 70th place. Currently, 70 models from 18 institutions have surpassed this once benchmark.

Google's Gemini 1.5 Pro was the first to break through, in February 2024, not only reaching GPT-4 level but also bringing two major innovations: it increased the input context length to 1 million tokens (later raised to 2 million) and achieved video input processing for the first time, opening up new possibilities for the entire industry. Following closely, Anthropic launched the Claude 3 series in March, with Claude 3 Opus quickly becoming the new industry benchmark. The release of Claude 3.5 Sonnet in June pushed performance to new heights, and the model kept the same version number even after a significant upgrade in October (informally referred to in the industry as Claude 3.6).

The most significant technological advancement in 2024 is the comprehensive enhancement of model capabilities in processing long texts. Just a year ago, most models were limited to processing capabilities of 4096 or 8192 tokens, with Claude 2.1 being an exception that supported 200,000 tokens. Now, almost all mainstream providers support processing capabilities of over 100,000 tokens. This advancement greatly expands the application scope of LLMs—users can not only input entire books for content analysis but, more importantly, in professional fields like programming, by inputting large amounts of example code, the model can provide more accurate solutions.
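
To put these numbers in perspective, here is a tiny sketch using the tiktoken tokenizer library (the o200k_base encoding used by GPT-4o; the sample sentence is an illustrative assumption) showing how much text various context windows can hold:

```python
# pip install tiktoken
import tiktoken

# o200k_base is the tokenizer encoding used by GPT-4o; older models used cl100k_base.
enc = tiktoken.get_encoding("o200k_base")

sample = "Large language models turn text into tokens before processing it. "
tokens_per_copy = len(enc.encode(sample))

# Illustrative only: how many copies of this sentence fit in various context windows?
for window in (8_192, 100_000, 1_000_000):
    print(f"{window:>9,} tokens ~ {window // tokens_per_copy:,} copies of the sample sentence")
```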

Currently, the camp surpassing GPT-4 is quite large. If you browse the Chatbot Arena leaderboard today, GPT-4-0314 has fallen to around 70th place. The 18 organizations with higher-scoring models are: Google, OpenAI, Alibaba, Anthropic, Meta, Reka AI, 01 AI, Amazon, Cohere, DeepSeek, Nvidia, Mistral, NexusFlow, Zhipu AI, xAI, AI21 Labs, Princeton, and Tencent.

This change profoundly reflects the rapid development in the AI field. In 2023, surpassing GPT-4 was a significant breakthrough worth recording in history, while by 2024, it seems to have become the basic threshold for measuring top AI models.

Some GPT-4 Level Models Can Run Locally on Personal Computers

In 2024, the field of large language models welcomed another important breakthrough: GPT-4 level models can now run on ordinary personal computers. This breaks the traditional perception that "high-performance AI models must rely on expensive data centers."

Taking the M2 MacBook Pro with 64GB of memory as an example, the same device that could barely run GPT-3 level models in 2023 can now run multiple GPT-4 level models, including the open-source Qwen2.5-Coder-32B and Meta's Llama 3.3 70B.

This breakthrough is surprising, as running GPT-4 level models was previously thought to require a data center-level server equipped with one or more GPUs worth over $40,000.

Even more noteworthy is Meta's Llama 3.2 series. Although its 1B and 3B versions do not match GPT-4, their performance far exceeds what their scale would suggest. Users can even run Llama 3.2 3B on an iPhone through the MLC Chat iOS app; this model, requiring only 2GB of storage space, generates content at about 20 tokens per second. The fact that these models run at all proves how dramatically training and inference efficiency has improved over the past year.
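
To give a concrete flavor of local inference, here is a minimal sketch using the open-source llama-cpp-python bindings; the GGUF filename is a placeholder for a quantized build you would download yourself, and 4-bit quantization is what lets a 32B model fit comfortably in 64GB of memory:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Placeholder path: a 4-bit quantized GGUF build of Qwen2.5-Coder-32B,
# small enough to fit comfortably in 64GB of unified memory.
llm = Llama(model_path="qwen2.5-coder-32b-instruct-q4_k_m.gguf", n_ctx=8192)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    max_tokens=512,
)
print(result["choices"][0]["message"]["content"])
```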

Model Prices Plummet Due to Competition and Efficiency Gains

In the past 12 months, the prices of large models have seen a sharp decline.

In December 2023, OpenAI charged $30 per million input tokens for GPT-4. Today, the price of $30/mTok can get you OpenAI's most expensive model, o1. The price for GPT-4o is $2.50 (12 times cheaper than GPT-4), and the price for GPT-4o mini is $0.15/mTok—nearly 7 times cheaper than GPT-3.5, yet much more powerful.

Other model providers charge even less. Anthropic's Claude 3 Haiku (launched in March, but still its cheapest model) is priced at $0.25/mTok. Google's Gemini 1.5 Flash is priced at $0.075/mTok, while their Gemini 1.5 Flash 8B is priced at $0.0375/mTok—27 times cheaper than last year's GPT-3.5 Turbo.
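
The 27x figure is simple arithmetic on the quoted prices, assuming GPT-3.5 Turbo's late-2023 input price of roughly $1 per million tokens:

```python
gpt35_turbo = 1.00    # assumption: GPT-3.5 Turbo input price, ~$1/mTok in late 2023
flash_8b = 0.0375     # Gemini 1.5 Flash 8B, $/mTok (quoted above)

print(f"Flash 8B is {gpt35_turbo / flash_8b:.1f}x cheaper")  # ~26.7x, i.e. about 27x

# Cost of processing the text of a long novel (~200,000 tokens) at each price:
novel_tokens = 200_000
print(f"GPT-3.5 Turbo: ${gpt35_turbo * novel_tokens / 1_000_000:.4f}")
print(f"Flash 8B:      ${flash_8b * novel_tokens / 1_000_000:.4f}")
```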

These price declines are driven by two factors: intensified competition and improved efficiency.

The Rise of Multimodal LLMs

A year ago, the most notable example was GPT-4 Vision, which was released at OpenAI's DevDay in November 2023. Google's multimodal model, Gemini 1.0, was released on December 7, 2023.

In 2024, nearly every major model provider released multimodal models. In March, we saw Anthropic's Claude 3 series, in April, Gemini 1.5 Pro (image, audio, and video), and then in September, Qwen2-VL and Mistral's Pixtral 12B, as well as Meta's Llama 3.2 11B and 90B visual models. In October, we received audio input and output from OpenAI, followed by Hugging Face's SmolVLM in November, and in December, image and video models from Amazon Nova.

Multimodal capabilities represent a significant advancement for LLMs, and the ability to run prompts for images (as well as audio and video) is a fascinating new way to apply these models.
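
As one illustration of what running a prompt against an image looks like, here is a minimal sketch using OpenAI's chat completions API with GPT-4o; the image URL is a placeholder, and other providers expose similar multimodal inputs:

```python
# pip install openai  (requires OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is in this photo."},
            # Placeholder URL: any publicly reachable image works here.
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```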

Voice and Real-Time Video Ignite Imagination

Emerging audio and real-time video models deserve special mention.

The ability to converse with ChatGPT was first realized in September 2023, although it was merely an integration of speech-to-text models and new text-to-speech models at that time.

GPT-4o, released on May 13, demonstrated a brand-new voice mode that accepts audio input and produces remarkably realistic-sounding speech without needing separate TTS or STT models. When ChatGPT's advanced voice mode finally launched (rolled out gradually from August to September), the effect was astonishing. OpenAI is not the only team with a multimodal audio model: Google's Gemini also accepts audio input, and the Google Gemini app can now speak in a manner similar to ChatGPT. Amazon has also previewed a voice mode for Amazon Nova, which will launch in the first quarter of 2025.

Released in September, Google's NotebookLM has taken audio output to a new level, allowing two "podcast hosts" to engage in eerily realistic conversations about anything you input into its tool.

In December, real-time video became the new focus. ChatGPT now enables sharing the camera with the model and discussing in real-time what is being seen. Google Gemini also showcased a preview version with the same functionality.

Prompt-driven application generation has become a commodity

This was already achievable with GPT-4 in 2023, but its value only became apparent in 2024.

Large models excel at writing code, and if you prompt them correctly, they can build a complete interactive application using HTML, CSS, and JavaScript.
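
For example, a single API call along these lines, sketched here with Anthropic's Python SDK and an illustrative prompt, is enough to get a complete single-file web app back:

```python
# pip install anthropic  (requires ANTHROPIC_API_KEY in the environment)
import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": "Build a single-file HTML page (inline CSS and JavaScript) "
                   "implementing a simple pomodoro timer with start/pause/reset.",
    }],
)

# Save the generated app so it can be opened directly in a browser.
# (The reply may include surrounding prose; a robust version would extract the HTML block.)
with open("pomodoro.html", "w") as f:
    f.write(message.content[0].text)
```

Roughly speaking, Artifacts-style features wrap a call like this and render the result in a sandboxed pane inside the chat interface.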

Anthropic kicked this idea into high gear when it released Claude Artifacts, a groundbreaking new feature. With Artifacts, Claude can write an on-demand interactive application for you and then let you use it directly within the Claude interface.

Since then, many other teams have also built similar systems. GitHub released their version, GitHub Spark, in October. Mistral Chat added it as a feature called Canvas in November.

This prompt-driven custom interface feature is powerful and easy to build, and it is expected to appear in a wide range of products in 2025.

The free use of the best models lasted only a few months

Within just a few months this year, the three best models—GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro—were available for free in most parts of the world.

OpenAI made GPT-4o available for free to all users in May, while Claude 3.5 Sonnet has also been free since its release in June. This is a significant change, as free users mostly had access to GPT-3.5 level models over the past year.

With OpenAI launching ChatGPT Pro, that era seems to have come to an end, perhaps forever. This $200-per-month subscription is the only way to access its most powerful model, o1 Pro. Because the techniques behind the o1 series (and future models) involve spending more compute time to achieve better results, I believe the days of free access to the best available models are unlikely to return.

“Agent” Has Not Really Emerged

The term “Agent” is quite frustrating because it lacks a single, clear, and widely understood meaning. If you tell me you are building an “Agent,” you are conveying almost no information to me.

The two main categories of "Agent" that I see are: one considers AI agents to be systems that act on your behalf, similar to a travel agent; the other views AI agents as LLMs that have access to tools and run those tools in a loop while solving a problem. The term "autonomous" is often thrown in as well, but it likewise lacks a clear definition.
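
A minimal sketch of that second definition, an LLM calling tools in a loop, might look like the following; call_llm is a stub standing in for any model API that can emit tool calls:

```python
import json

def call_llm(messages):
    """Stub for a real model API that can return either a final answer
    or a tool call, e.g. {"tool": "search", "args": {"query": "..."}}."""
    return {"answer": "42"}  # placeholder response

TOOLS = {
    "search": lambda query: f"results for {query!r}",
    "calculator": lambda expression: str(eval(expression)),  # demo only: eval is unsafe
}

def run_agent(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if "answer" in reply:          # the model is done
            return reply["answer"]
        tool = TOOLS[reply["tool"]]    # the model asked for a tool
        result = tool(**reply["args"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "gave up after too many steps"

print(run_agent("What is 6 * 7?"))
```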

Regardless of what the term means, Agent still has that feeling of being perpetually “just around the corner.” Setting aside the terminology, I remain skeptical about the practicality of Agents.

Evaluation is Really Important

In 2024, one thing has become very clear: writing good automated evaluations for LLM-driven systems is the most necessary skill for building useful applications on top of these models.

If you have a robust evaluation suite, you can adopt new models faster, iterate better, and build more reliable and useful product features than your competitors.

Everyone knows that evaluation matters, but there is still a lack of good guidance on how best to implement it.
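
For illustration, a bare-bones automated evaluation can be as simple as a table of prompts paired with cheap, checkable expectations, run against whatever model you are iterating on; generate here is a stub:

```python
def generate(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return "Paris is the capital of France."

# Each case pairs a prompt with a cheap, automated check on the output.
EVAL_CASES = [
    {"prompt": "What is the capital of France?", "check": lambda out: "Paris" in out},
    {"prompt": "Reply with valid JSON: {}", "check": lambda out: out.strip().startswith("{")},
]

def run_evals():
    passed = 0
    for case in EVAL_CASES:
        output = generate(case["prompt"])
        ok = case["check"](output)
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case['prompt']}")
    print(f"{passed}/{len(EVAL_CASES)} passed")

run_evals()
```

A real suite would have hundreds of such cases; the point is that every model swap or prompt change can be scored in seconds.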

Apple Intelligence is Bad, Apple’s MLX Library is Great

As a Mac user, last year I felt that not having a Linux/Windows machine with an NVIDIA GPU was a huge disadvantage for trying out new models. 2024 was much better.

In practice, many models are released in the form of model weights and libraries that tend to support NVIDIA's CUDA rather than other platforms.

In this regard, the llama.cpp ecosystem has been very helpful, but the real breakthrough is Apple’s MLX library, “an array framework designed for Apple Silicon.” It is fantastic.

Apple’s mlx-lm Python library runs a wide range of MLX-compatible models on my Mac with excellent performance. The mlx-community organization on Hugging Face offers over 1,000 models already converted to the required format.
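
In practice that looks roughly like this (a sketch using the mlx-lm package; the model name is one of the pre-converted checkpoints from the mlx-community organization):

```python
# pip install mlx-lm  (Apple Silicon only)
from mlx_lm import load, generate

# Downloads a pre-converted, quantized checkpoint from the mlx-community hub.
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Explain the difference between a process and a thread.",
    max_tokens=256,
)
print(response)
```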

While MLX is a game changer, Apple’s own “Apple Intelligence” features are mostly disappointing. Apple’s LLM capabilities are merely a poor imitation of cutting-edge LLM features.

The Rise of “Reasoning” Models

The most interesting development in the last quarter of 2024 is the emergence of a new class of reasoning models. Take OpenAI's o1 as an example: it was initially released on September 12 as o1-preview and o1-mini.

The biggest innovation of these reasoning models is that they open up a new way to scale: models can now improve performance not just through more computation during training, but by investing more computation at inference time to tackle harder problems.
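
OpenAI has not disclosed how o1 spends its extra compute, but a simple, well-known stand-in for the idea is self-consistency sampling: query the model several times and take a majority vote, trading inference compute for reliability. A sketch with a stubbed model:

```python
import random
from collections import Counter

def sample_model(prompt: str) -> str:
    """Stub for a stochastic model call; imagine each sample costs real compute."""
    return random.choice(["42", "42", "42", "41"])  # right answer most of the time

def self_consistency(prompt: str, n_samples: int) -> str:
    """More samples -> more inference compute -> a more reliable majority answer."""
    votes = Counter(sample_model(prompt) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 6 * 7?", n_samples=1))   # cheap, noisy
print(self_consistency("What is 6 * 7?", n_samples=25))  # 25x the compute, far more reliable
```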

The sequel to o1, o3, was announced on December 20 and achieved impressive results on the ARC-AGI benchmark, though at a high cost: total compute expenses for the run are estimated to exceed $1 million. o3 is expected to be officially available in January 2025.

OpenAI is not the only company involved in this category. Google released the first competitor, gemini-2.0-flash-thinking-exp, on December 19. Alibaba's Qwen team released their QwQ model on November 28; DeepSeek opened the DeepSeek-R1-Lite-Preview model for trial through its chat interface on November 20. Anthropic and Meta have not released anything in this category yet, but they are sure to follow.

Is the training cost of China's best LLM below $6 million?

A major news item at the end of 2024 is the release of DeepSeek v3. DeepSeek v3 is a massive 685B parameter model, with some benchmark tests ranking its performance alongside Claude 3.5 Sonnet.

"Vibe benchmarks" (the Chatbot Arena leaderboard) currently rank it 7th, just behind Gemini 2.0 and OpenAI's 4o/o1 models. This makes it the highest-ranked openly licensed model to date.

What is truly impressive about DeepSeek v3 is its training cost. The model was trained on 2,788,000 H800 GPU hours, with an estimated cost of $5,576,000. Llama 3.1 405B was trained on 30,840,000 GPU hours—11 times the time used by DeepSeek v3—but with slightly worse benchmark results.
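
The quoted cost is straightforward arithmetic: GPU-hours multiplied by an assumed rental rate of about $2 per H800 GPU-hour:

```python
# DeepSeek v3's reported training budget is just GPU-hours times an hourly rate.
h800_gpu_hours = 2_788_000
rate_per_hour = 2.00   # assumed H800 rental price, $/GPU-hour
print(f"${h800_gpu_hours * rate_per_hour:,.0f}")  # $5,576,000

# Llama 3.1 405B, for comparison:
llama_gpu_hours = 30_840_000
print(f"Llama 3.1 405B used {llama_gpu_hours / h800_gpu_hours:.1f}x the GPU-hours")  # ~11.1x
```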

The environmental impact is mixed

The efficiency improvements of models (both hosted and locally run) have led to a pleasing outcome: the energy consumption and environmental impact of running prompts have significantly decreased over the past few years.

However, the infrastructure construction for training and running models still faces tremendous competitive pressure. Companies like Google, Meta, Microsoft, and Amazon have invested billions of dollars in building new data centers, which have a significant impact on the power grid and the environment, with some even discussing the construction of new nuclear power plants.

Is this infrastructure necessary? DeepSeek v3's roughly $6 million training cost and the continued decline in model prices may hint that it is not.

Synthetic Training Data Works Very Well

There is a popular saying that as the internet becomes flooded with AI-generated garbage, the models themselves will degrade, feeding off their own outputs, ultimately leading to their inevitable extinction.

But this clearly will not happen. On the contrary, we see AI labs increasingly using synthetic content for training—deliberately creating artificial data to help guide their models in the right direction. Synthetic data is becoming more common as an important component of pre-training.

Another commonly used technique is to use larger models to help create training data for smaller, cheaper alternatives—more and more labs are using this method. DeepSeek v3 uses "inference" data created by DeepSeek-R1.
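
In outline, this kind of distillation looks something like the following sketch, where teacher_generate is a stub standing in for a call to the larger model:

```python
import json

def teacher_generate(prompt: str) -> str:
    """Stub standing in for a call to the large, expensive teacher model."""
    return f"A detailed worked answer to: {prompt}"

# Seed prompts chosen to cover the skills the smaller model should learn.
seed_prompts = [
    "Solve step by step: a train travels 120 km in 1.5 hours; what is its speed?",
    "Explain why comparison-based sorting is O(n log n).",
]

# Each (prompt, teacher answer) pair becomes a supervised training example
# for the smaller, cheaper student model.
with open("synthetic_training_data.jsonl", "w") as f:
    for prompt in seed_prompts:
        example = {"prompt": prompt, "completion": teacher_generate(prompt)}
        f.write(json.dumps(example) + "\n")
```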

Carefully designing the training data that goes into an LLM seems to be everything in creating these models. The days of scraping the entire web and indiscriminately dumping it into a training run are long gone.

Large Models Are Becoming Harder to Use

One point I have been emphasizing is that LLMs are tools for advanced users. They seem simple—how hard can it be to input a message to a chatbot?—but in reality, to make the most of them and avoid their various pitfalls, you need a deep understanding and experience.

If anything, this problem became even more severe in 2024.

We have built computer systems that can converse in human language, answer your questions, and often answer correctly!... But this depends on the type of question, how it is asked, and whether the question is accurately reflected in those undisclosed, secret training datasets.

The default LLM chat interface is like dropping a completely inexperienced new user into a Linux terminal and expecting them to figure it all out on their own. Meanwhile, end users' mental models of these tools are becoming increasingly inaccurate and filled with misunderstandings.

Many otherwise well-informed people have given up on LLMs entirely because they cannot see how anyone could benefit from tools with so many flaws. The key skill for getting the most value out of LLMs is learning to work with technology that is both unreliable and extremely powerful, and mastering that skill is clearly not easy.

Knowledge Distribution Is Extremely Uneven

Now most people have heard of ChatGPT, but how many have heard of Claude? The knowledge gap between those actively following these technologies and the 99% who do not care is enormous.

The speed of change has not helped alleviate this problem. Just in the past month, we have witnessed the launch of live video interfaces where you can point your phone camera at an object and talk to the model about it using your voice... and most self-proclaimed geeks haven't even tried this feature.

Considering the ongoing (and potential) impact of this technology on society, I believe the size of this gap is unhealthy. I hope more effort will be put into improving this situation.

LLMs Need Better Criticism

Many people are extremely averse to large model technology. In some public forums, merely stating the opinion that "LLMs are useful" is enough to spark a major debate.

There are many reasons to dislike this technology—environmental impact, the (lack of) ethics in training data, insufficient reliability, negative applications, and the potential negative impact on people's jobs.

LLMs are definitely worthy of criticism. We need to discuss these issues, seek mitigation methods, and help people learn how to use these tools responsibly, ensuring that their positive applications outweigh the negative impacts.

Original link: https://simonwillison.net/2024/Dec/31/llms-in-2024/