AI Sets off a New Wave? OpenAI Rushes to Release "Multimodal" Large Model before Google Gemini

According to reports, OpenAI is working to release a multimodal large language model (MLLM), codenamed Gobi, before Google launches Gemini, in order to stay ahead of Google and maintain its leading position.

Google had been expected to take the lead in "multimodal" large models. Its flagship model, Gemini, is expected to debut this autumn and is reportedly already being tested with selected enterprise clients.

However, OpenAI is once again trying to steal the show.

According to the latest media reports, OpenAI is working to incorporate multimodal capabilities (similar to those Gemini is expected to provide) into GPT-4, aiming to launch a multimodal large language model (MLLM) before Gemini is released. This next-generation model, codenamed Gobi, is intended to keep OpenAI ahead of Google.

With ChatGPT demonstrating extraordinary capabilities in various fields, multimodal large language models have become a hot research topic. They use a powerful large language model (LLM) as the "brain" to perform a variety of multimodal tasks.

MLLMs exhibit capabilities that traditional methods lack, such as writing stories based on images, answering knowledge questions about visual content, and performing mathematical reasoning without the need for optical character recognition (OCR). From natural language understanding to image interpretation, they offer a broader range of information processing capabilities.

The report states that OpenAI showcased these capabilities as early as the release of GPT-4 in March, but made them available to only one company, Be My Eyes, which primarily develops mobile applications for people who are blind or have low vision. Six months later, OpenAI is preparing to launch the GPT-Vision feature on a larger scale.

Why did it take OpenAI so long to introduce this feature? The report suggests that the main concern was potential misuse of the new visual capabilities, such as automatically solving CAPTCHAs to impersonate humans or tracking individuals through facial recognition. However, OpenAI engineers seem to be close to addressing the legal concerns surrounding this new technology.

Google faces the same issue. When asked what measures are being taken to prevent abuse of Gemini, a Google spokesperson pointed out that the company made a series of commitments in July to ensure responsible development of all its products.

However, because Google holds proprietary text, image, video, and audio data (including data from platforms such as Search and YouTube), the industry's shift towards multimodal models may play to Google's advantage. One person who has used early versions of Gemini said that it seems to produce fewer incorrect answers than existing models.

OpenAI CEO Sam Altman has hinted in recent interviews that there is no GPT-5 yet, but that the company plans to enhance GPT-4 in various ways; the new multimodal model may be one of those enhancements. According to reports, OpenAI does not appear to have started training Gobi, so it is too early to say whether it will eventually become GPT-5.

In an interview with Wired last week, Google CEO Sundar Pichai expressed confidence in Google's current position in AI, describing technological progress as a long-term process and emphasizing the company's deliberate approach to balancing innovation and responsibility.

In any case, this competition resembles an AI version of iPhone vs. Android. People are eagerly awaiting the arrival of Gemini, which will reveal how large the gap between Google and OpenAI really is.