This fall, paying users will be able to use GPT-4o's advanced voice mode; both reviews and OpenAI's official report have flagged its scarier side

Wallstreetcn
2024.08.14 21:35

OpenAI is about to launch GPT-4o's Advanced Voice Mode for paid users. In conversation, the mode may mimic the user's tone and can even produce unsettling or inappropriate sound effects such as screams or gunshots. At the same time, OpenAI quietly released the chatgpt-4o-latest model, letting developers test its latest improvements for chat use cases; the model supports a 128,000-token context and is expected to be updated continuously. With the new model, OpenAI has also returned to the top of the LMSYS Chatbot Arena leaderboard.

Author: Du Yu

Ahead of the official rollout of GPT-4o's Advanced Voice Mode to all paying users, expected at an as-yet-unspecified point this autumn, OpenAI quietly released the latest version of the GPT-4o model, chatgpt-4o-latest, this week.

Some analysts expressed surprise at the move, since just a week earlier OpenAI had announced the previous frontier snapshot, gpt-4o-2024-08-06, which adds structured-output support to the API.

OpenAI quietly released a new model this week that topped the evaluation leaderboard, letting developers test its latest improvements for chat use cases

OpenAI still recommends that developers use gpt-4o-2024-08-06 for most API use cases, but as of this week the chatgpt-4o-latest model lets developers test OpenAI's latest improvements for chat use cases.
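In practice, trying the new model is just a matter of changing the model name in an existing API call. Below is a minimal sketch using the official OpenAI Python SDK; the prompt contents are placeholders, and an OPENAI_API_KEY is assumed to be set in the environment:

```python
# Minimal sketch: calling chatgpt-4o-latest through the OpenAI
# Chat Completions API (requires OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="chatgpt-4o-latest",  # dynamic pointer to OpenAI's newest GPT-4o for chat
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize why context length matters."},
    ],
)

print(response.choices[0].message.content)
```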

According to OpenAI's official documentation, chatgpt-4o-latest is a dynamic model that will continue to be updated as GPT-4o evolves. It is intended only for research and evaluation, and it supports a 128,000-token context window with up to 16,384 output tokens. In large models such as GPT-4, tokens are the basic units the model uses to process and understand text.
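To make the token concept concrete, OpenAI's open-source tiktoken library exposes the tokenizer used by the GPT-4o family, so text can be measured against the 128,000-token limit before it is sent. A small sketch, assuming a recent version of tiktoken is installed:

```python
# Sketch: counting tokens with tiktoken, OpenAI's open-source tokenizer library.
import tiktoken

# GPT-4o-family models map to the "o200k_base" encoding.
enc = tiktoken.encoding_for_model("gpt-4o")

text = "Tokens are the basic units a large model uses to process text."
tokens = enc.encode(text)

print(len(tokens))  # number of tokens this sentence consumes
print(tokens[:5])   # the first few integer token IDs
```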

Meanwhile, on the LMSYS Chatbot Arena, Google's new experimental Gemini 1.5 Pro model launched last week, scored 1297 points, and took first place on the online platform for the first time. This week, OpenAI reclaimed the top spot with a record 1314 points using the new chatgpt-4o-latest model, which showed significant improvements in coding, instruction following, and hard prompts (the platform's Hard Prompts category).

The LMSYS Chatbot Arena is an online platform that benchmarks large language models (LLMs) from different companies by having users interact with anonymized chatbot models and vote on their responses. The platform has collected over 700,000 human votes, from which it computes an Elo leaderboard of LLMs to crown the champion among AI chatbots.
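For readers unfamiliar with Elo, the idea is simple: each model carries a rating, each human vote is treated as a game outcome, and the winner takes rating points from the loser in proportion to how unexpected the result was. The sketch below illustrates one rating update; the K-factor of 32 is a conventional choice for illustration, not necessarily the parameter LMSYS uses.

```python
# Illustrative Elo update for one head-to-head vote between two chatbots.
# K controls how far a single result moves the ratings; 32 is a common
# convention, used here purely for illustration.
K = 32

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Return both ratings after one vote (a_won=True means A got the vote)."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = K * (s_a - e_a)
    return rating_a + delta, rating_b - delta

# Example: a 1314-rated model beats a 1297-rated one. The matchup is
# nearly even, so the winner gains roughly K/2 points (about 15 here).
print(update(1314, 1297, a_won=True))
```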

ChatGPT's official social media account revealed on Monday that the latest model is an improvement on the existing GPT-4o rather than an upgrade to an entirely new model such as GPT-5. It is described as containing "bug fixes and performance improvements based on experimental results and qualitative feedback," and it has replaced the older GPT-4o versions used in the ChatGPT user interface.

All paid users will get GPT-4o's advanced voice mode in the fall, and both reviews and OpenAI's official report have mentioned its scary side

OpenAI's emphasis on chat-use-case improvements in the latest GPT-4o release naturally reads as a warm-up for the full launch of the "advanced voice mode" in the fall. When it first demonstrated the feature in May, OpenAI described it as letting users hold extremely realistic, almost real-time voice conversations with the AI chatbot.

An alpha version of this advanced voice mode has recently been rolling out to a small number of users for testing. The US technology outlet Wired published a review this week calling ChatGPT's advanced voice mode "interesting, but also a bit scary."

The reviewer used the advanced voice mode while writing, occasionally asking it by voice for synonyms or words of encouragement. After about half an hour of silence, the GPT-4o advanced voice mode abruptly started speaking to the author in Spanish; when the author reacted, it explained that it had wanted to liven things up, then switched back to English.

The author also tried opening GPT-4o advanced voice mode on two phones and letting them converse with each other. The chatbot switched easily among French, German, and Japanese on request; OpenAI says the GPT-4o model can handle 45 languages.

The author also found the advanced voice mode adept at generating sound effects and impressions. For example, it could explain the animated series "The Powerpuff Girls" in an exaggerated imitation of Trump's tone, which was both funny and lifelike. The author said:

"With only a few months left until the U.S. presidential election, election fraud has become a focus of attention. It was surprising that ChatGPT was willing to provide voice imitations of major candidates. ChatGPT also imitated the voices of Biden and Harris, but they didn't sound as realistic as the robot imitating Trump's speech."

The author mentioned that overall, conversations with the GPT-4o advanced voice mode were relaxed and pleasant, but there were also times when it was quite scary. For example, there were multiple instances of white noise in the background of the conversation, "like the ominous hum of a solitary lamp in a dark basement." When asked to provide balloon sound effects, GPT-4o made loud balloon explosion sounds, accompanied by "eerie gasping sounds that sent shivers down my spine."

In fact, a report OpenAI published last week also flagged anomalies in the latest GPT-4o model. "In very rare cases," the model would deviate from its assigned voice and begin imitating the user's tone and manner of speaking, or even shout at random mid-conversation. Under certain specially crafted prompts, it also "tends to produce unsettling or inappropriate nonverbal vocalizations and sound effects, such as erotic moans, violent screams, and gunshots."

OpenAI said that in environments with heavy background noise, such as a moving car, the GPT-4o advanced voice mode may cause the chatbot to mimic the user's voice, because the model struggles to understand distorted speech. The company has added "system-level mitigations" and points to evidence that the model usually rejects requests to generate such sound effects, while admitting that some requests still slip through and produce inappropriate responses.

Reviewers of the GPT-4o advanced voice mode have noticed that ChatGPT refuses to sing, telling users, "Sorry, singing really isn't my strong suit." Some analysts suggest this is OpenAI's way of avoiding music-copyright infringement, steering clear of reproducing the styles, tones, and timbres of well-known artists; others speculate it indicates OpenAI trained GPT-4o on copyrighted material.

Last week's OpenAI report also described the mitigations and safeguards intended to make GPT-4o a safer model. For example, GPT-4o will refuse to identify where a user is from based on their manner of speaking or accent, and will decline leading questions such as "How smart is this speaker?" It also screens out prompts containing violent or pornographic language and outright bans certain content categories, such as discussions of extremism and self-harm.

Reportedly, ChatGPT Plus subscribers will receive an email notification from OpenAI when the advanced voice mode becomes available to them. Once voice mode is activated in the ChatGPT interface, users can switch between "Standard Voice Mode" and "Advanced Voice Mode" at the top of the app screen.