Elon Musk: AI training data ran out last year, and synthetic data is the only way to supplement it
Tech giants including Microsoft, Meta, OpenAI, and Anthropic have begun using synthetic data to train AI models. The information technology research and advisory firm Gartner estimated that 60% of the data used for AI and analytics projects in 2024 would be synthetically generated.
What to do when the training data for artificial intelligence runs out?
Recently, in a live-streamed conversation on the social platform X, Elon Musk said that the training data available for artificial intelligence has essentially run out:
"We have essentially exhausted the cumulative sum of human knowledge for AI training, and this situation occurred around last year."
Musk's view aligns with that of former OpenAI chief scientist Ilya Sutskever. In December last year, Sutskever said at the machine learning conference NeurIPS that the AI industry has reached what he calls "peak data," and that the shortage of training data will force a change in how AI models are developed.
Faced with this data shortage, however, Musk proposed a possible way forward:
"The only way to supplement it is with synthetic data, which is data generated by AI models themselves. With synthetic data, the AI will grade itself and go through a process of self-learning."
In fact, tech giants including Microsoft, Meta, OpenAI, and Anthropic have already begun using synthetic data to train AI models. According to estimates from the information technology research and advisory firm Gartner, 60% of the data used for AI and analytics projects would be synthetically generated by 2024.
Microsoft's recently open-sourced Phi-4 model was trained using both synthetic data and real-world data. Google's Gemma model, Anthropic's Claude 3.5 Sonnet system, and Meta's latest Llama series models have also adopted similar approaches.
Analysts also point out that training on synthetic data can cut costs. The AI startup Writer claims that its Palmyra X 004 model, developed almost entirely from synthetic data sources, cost only $700,000 to build, roughly 15% of the estimated $4.6 million it would cost OpenAI to develop a model of comparable size.
It is important to note, however, that using synthetic data also carries potential risks. Some studies suggest that relying on synthetic data can lead to model collapse, in which the model's outputs become less "creative" and more biased, ultimately degrading its functionality severely. Because synthetic data is generated by models, any biases and limitations in the data used to train those models carry over into the data they produce.
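The collapse dynamic is easiest to see in a toy setting: if each new model generation is fit only to samples drawn from the previous one, with no fresh real-world data mixed in, estimation noise compounds and the learned distribution tends to drift and narrow. The sketch below is an illustrative one-dimensional simulation under that assumption, not a description of how any production model is actually trained.

```python
import numpy as np

# Toy illustration of model collapse: generation N+1 is fit only to data
# sampled from generation N, with no real-world data added back in.
# Refitting on small, noisy samples compounds estimation error, so the
# learned spread typically shrinks over generations and the outputs lose
# the diversity of the original data.

rng = np.random.default_rng(42)

human_data = rng.normal(loc=0.0, scale=1.0, size=20)   # the original "real" data
mu, sigma = human_data.mean(), human_data.std()

for generation in range(1, 101):
    synthetic = rng.normal(mu, sigma, size=20)          # the model samples its own training set
    mu, sigma = synthetic.mean(), synthetic.std()       # the next model is fit only to that
    if generation % 20 == 0:
        print(f"generation {generation:3d}: mean = {mu:+.3f}, spread = {sigma:.3f}")
```

Mixing a fixed share of real data back into each round, or filtering the synthetic samples before reuse, are among the mitigations discussed in the research literature on this problem.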