Apple's iOS 18 is expected to incorporate ChatGPT technology, while Google controls the "lifeblood" of the Android system. After AI assistants, will AI glasses be the next battlefield?
Author: Li Xiaoyin
Source: Hard AI
This week the AI world has seen one "bombshell" after another: Google announced its entry into the Gemini era, "aggressively" releasing a raft of updates and directly countering OpenAI, which had launched its new product a day earlier.
OpenAI's GPT-4o impressed with its real-time interactive capabilities, while Google's Project Astra, with equally competitive capabilities, sparked industry-wide discussion about AI assistants.
According to publicly available information, both AI voice assistants, GPT-4o and Project Astra, are built on multimodal models, supporting the reception/generation of text, images, and audiovisual content, with ultra-short delays and real-time interaction.
Furthermore, according to previous media reports, Apple has reached an agreement with OpenAI to introduce ChatGPT technology in the new operating system iOS 18, while Google controls the "lifeblood" of the Android system. This inevitably leads to the question: will the showdown between GPT-4o and Gemini be the next "iOS VS Android" in the AI phone industry?
Direct Confrontation, Who Comes Out on Top?
Comparing GPT-4o and Project Astra (which powers the Gemini Live feature in Gemini), we can identify detailed differences between the two AI assistants.
1) Usage Scenarios
GPT-4o's average response latency is 320 milliseconds, with audio input answered in as little as 232 milliseconds, approaching human conversational reaction time. In the launch-event demonstration, its everyday use cases included real-time interpretation, coding assistance, math tutoring, summarizing and interpreting information, and recognizing emotion in video.
Gemini Live's visual recognition and speech interaction are comparable to GPT-4o's, offering a conversational natural-language voice interface and real-time video analysis through the phone camera, with response speeds fast enough for natural everyday conversation. DeepMind CEO Demis Hassabis described the goal as building a general-purpose AI that is genuinely useful in everyday life.
In terms of usability, there is little difference between the two.
However, one point that may provoke different market reactions: GPT-4o was demonstrated live on stage, while Google's demonstration was pre-recorded before the event.
2) Multimodal Capabilities
Multimodal capabilities are the main selling points of the two AI assistants. Currently, GPT-4o may be slightly ahead in audio capabilities, while Project Astra's visual features are more prominent.
In the demonstrations, GPT-4o showed a realistic voice, smooth conversational flow, singing, and even the ability to infer emotions from the user's facial expressions, while Project Astra demonstrated more "advanced" visual features, such as being able to "remember" where you placed your glasses.
In terms of model architecture, Gemini relies on other models for output, including Imagen for image processing and Veo for video processing, whereas GPT-4o is natively multimodal, generating images and audio itself.
3) Product Positioning
The launch of GPT-4o has sparked talk in the market of a real-life version of "Her": its assistant speaks with a female voice rich in emotional expression and is even capable of chatting and joking. Project Astra, although also voiced by a woman, has a more composed, calm tone and is more pragmatic.
This reflects the two companies' different product positioning for "AI assistants": OpenAI aims for something more "human-like", while Google aims for something more "agent-like".
Google has expressed its intention to avoid producing AI similar to "Her".
In a paper published last month, DeepMind detailed the potential drawbacks of human-like AI, arguing that such assistants blur the "human-machine boundary" and could lead to problems such as sensitive-information leakage, emotional dependence, and diminished user agency.
4) Access Paths
OpenAI stated that, effective immediately, GPT-4o's text and vision capabilities are rolling out on the web interface and in the ChatGPT app. Voice features will be added in the coming weeks, and developers can already access the text and vision capabilities through the API.
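As an illustration of the developer access mentioned above, the sketch below builds a multimodal (text plus image) request payload in the shape used by OpenAI's chat-completions API, with `"gpt-4o"` as the model name. It is a minimal sketch only: the function name and example URL are our own, no network call is made, and an actual request would additionally require the official SDK and an API key.

```python
# Sketch: assembling a text + image request payload in the chat-completions
# format that OpenAI exposes GPT-4o through. Payload construction only --
# sending it would require the openai SDK and an API key.

def build_gpt4o_request(prompt: str, image_url: str) -> dict:
    """Return a chat-completions payload mixing text and image input."""
    return {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    # Text part of the user turn
                    {"type": "text", "text": prompt},
                    # Image part, referenced by URL
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = build_gpt4o_request(
    "What is in this picture?", "https://example.com/photo.jpg"
)
print(payload["model"])  # gpt-4o
```

Note that only text and vision are exposed this way at launch; per OpenAI's announcement, audio input and output come later, which is why the sketch contains no audio part.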
Google mentioned that Gemini Live will be launched in the "coming months" through Google's advanced AI program, Gemini Advanced.
Some observers suggest that rolling out new features earlier may give OpenAI an advantage in acquiring new users.
5) Costs
GPT-4o is freely available to all ChatGPT users, and its API prices have been cut by 50%.
However, free use is currently capped at a certain number of messages; once the limit is exceeded, free users are switched back to GPT-3.5. Paid users (from $20 per month) get a fivefold higher message limit for GPT-4o.
Gemini Advanced offers a two-month free trial period, followed by a monthly fee of $20.
Will AI Glasses Be the Next Battlefield?
As edge-side AI applications advance, AI assistants will genuinely enter daily use, and their practical value will gradually become apparent.
Moreover, AI voice assistants seem to signal a new trend in consumer electronics: a shift from text to audio.
Furthermore, the deep integration of visual capabilities seems to be on the horizon.
During the presentation, Google noted that another significant potential of Project Astra lies in pairing it with Google's glasses: when worn by visually impaired users, the glasses can provide real-time voice descriptions of their surroundings.
Meta has likewise launched Meta AI, a voice assistant, for its VR headsets and Ray-Ban smart glasses.
Some believe that, at the current stage, AI voice assistants may help AI smartphones emerge as winners, but looking further ahead, the ultimate form factor for these voice AI models will be smart glasses.