The new challenge in the AI world: Not enough information on the Internet!

Demand for high-quality text data in the AI industry may exceed supply within two years. Reports suggest that OpenAI is considering using subtitles from publicly available YouTube videos to train GPT-5.

Source: Hard AI

Author: Fang Jiayao

The scarcity of high-quality data is becoming an important obstacle to the development of AI.

According to media reports on April 1, as companies like OpenAI and Google continue to advance AI technology, tech giants are facing a new problem: the amount of information available on the internet may not be enough to train more advanced AI systems.

The tech giants' AI systems, such as the conversational ChatGPT, become smarter by learning from information online. However, high-quality, useful information is increasingly scarce, and some websites have started to restrict AI companies' access to their data. According to some industry executives and researchers, demand for high-quality text data in the AI industry may exceed supply within two years, which could slow the pace of AI development.

Faced with this shortage, AI companies are trying various methods to find new sources of information. For example, OpenAI is considering using dialogue from YouTube videos to train its next-generation model, GPT-5. Some companies are even creating synthetic data for training. Many researchers believe this approach could cause serious system failures, but it remains a potential way to overcome the data shortage.

It is reported that most of these efforts are carried out in secrecy, as finding effective solutions may become a key advantage for companies in fierce competition. With the continuous growth of data demand, finding new learning materials, cooperating with data owners, and making AI systems smarter have become important battlegrounds in this industry.

OpenAI's GPT-5 faces a shortage of 10 trillion to 20 trillion tokens

The construction of AI language models relies on a large amount of text data collected from the internet, including scientific research, news articles, Wikipedia entries, etc. These materials are broken down into "tokens", which can be complete words or parts of words. By analyzing and understanding the relationships and patterns between these tokens, AI models learn how to generate fluent, natural language, enabling them to answer questions, write articles, and even compose poetry.
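To make the idea of "complete words or parts of words" concrete, here is a minimal Python sketch of splitting text into tokens. It assumes a tiny, hypothetical vocabulary and a simple greedy longest-match rule. Both are illustrative choices only and are not the method OpenAI or any other company actually uses; production tokenizers learn much larger vocabularies from data.

```python
# Hypothetical toy vocabulary, chosen only for illustration.
VOCAB = {"token", "iz", "ation", "un", "believ", "able", "poetry", "the"}


def tokenize_word(word, vocab=VOCAB):
    """Split one word into vocabulary pieces, longest match first, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # Nothing in the vocabulary matches: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces


def tokenize(text, vocab=VOCAB):
    """Lowercase the text, split on whitespace, and tokenize each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens


print(tokenize("Unbelievable tokenization"))
```

In this sketch a word that appears in the vocabulary, such as "poetry", stays whole, while "tokenization" is split into three smaller pieces. Token counts in the trillions refer to pieces like these rather than to whole words.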

The ability of the model largely depends on the amount of data it is trained on. Generally, the more data, the better the model's performance, as it has more examples to learn different language usages and complexities.

By feeding massive amounts of training data to its GPT series models, OpenAI has continuously improved their performance and become one of the world's top AI companies, which demonstrates how important large-scale data is to AI development. However, as its models keep expanding, OpenAI's demand for data is also growing rapidly. AI researcher Pablo Villalobos of Epoch Institute estimates that GPT-4 was trained on as many as 12 trillion tokens, and that future models like GPT-5 may require 60 trillion to 100 trillion tokens. As a result, even with all available high-quality language and image data, the development of GPT-5 may still face a shortfall of 10 trillion to 20 trillion tokens. There is currently no clear solution for filling this huge gap.

According to media reports, in response to the challenge of data shortage, AI companies are trying various methods to find new sources of information. Meta founder Mark Zuckerberg recently emphasized that the vast amount of data owned by the company through platforms like Facebook and Instagram provides a significant advantage for its AI research and development. Zuckerberg stated that Meta is able to leverage billions of publicly shared images and videos on the internet, a scale that exceeds most commonly used datasets, although the proportion of high-quality data within them is not yet clear.

OpenAI, meanwhile, is considering using high-quality video and audio samples transcribed by Whisper, its automatic speech recognition tool. It is also considering establishing a data marketplace that would evaluate how much each data point contributes to model training and pay content providers accordingly, an innovative idea that has also caught Google's attention.

Epoch Institute predicts that the AI data shortage crisis will be postponed until 2028

Two years ago, Villalobos and his colleagues wrote that there was a 50% chance of demand for high-quality data exceeding supply by mid-2024, and a 90% chance by 2026. Since then, they have become more optimistic: the team's updated assessment pushes the risk of a shortage back to 2028.

This optimistic update is based on a closer look at the current quality and availability of data. Villalobos pointed out that the vast majority of data on the internet is not suitable as AI training material. In the endless flow of information, only a small portion of the data (much smaller than previously estimated) can contribute substantially to the growth and development of AI models.

At the same time, major social media platforms and news publishers have begun to restrict the use of their data for AI training. They are concerned that if their data is freely used for AI training, content creators and the platforms themselves may lose the economic returns they deserve. Furthermore, public awareness of personal privacy protection has increased significantly. Many people are less willing to provide private conversations, such as iMessage chat records, for AI training, as they may be concerned about potential privacy violations.

Recently, when a journalist questioned OpenAI CTO Mira Murati about the training data for Sora, OpenAI's latest model, Murati failed to provide a clear answer. This has raised concerns in the industry about how transparent OpenAI's management is regarding the sources of its training data. The incident has also sparked a broader discussion about who owns publicly available data: whether the content we post online is private or a shared public asset.

Together, these factors have made data harder to acquire. As users and regulators tighten their oversight of data usage, researchers must find a new balance between protecting privacy and collecting data.