Wallstreetcn
2023.12.08 03:44

Did Alphabet-C "fake" the viral video of Gemini's "powerful multimodal capabilities"?

Alphabet-C's demo video of Gemini's multimodal AI has been accused of being staged: Gemini's performance was reportedly achieved through manual setup, with certain prompts and reasoning steps edited out. The video shows Gemini responding flexibly to speech, images, and other inputs, but those responses may not reflect genuine real-time interaction. The discovery has drawn wide attention online.

On Thursday, a video titled "Intimate Interaction with Gemini: Multi-modal AI Interaction" was posted on Google's official YouTube account and quickly drew widespread attention, racking up 1.4 million views within a day.

The video attracted so much attention because it showcased the impressive interactive performance of Gemini, Google's most powerful multi-modal model.

In this 6-minute and 22-second video, Gemini demonstrated flexible and smooth responses and understanding to inputs such as voice and images, showcasing its powerful multi-modal understanding and interaction capabilities.

However, according to a recent report from tech outlet TechCrunch, users who studied the video closely found that Gemini's stunning performance was largely "staged".

The descriptions of the scenes, the recognition of toys, and the answers to questions in the video were all artificially set up. The video skipped some prompts and the model's reasoning process, creating a false impression of Gemini being intelligent and agile.

Gemini may not be as intelligent as it seems

Gemini showcased various interactive skills in the video, such as recognizing a toy duck, tracking the ball in a cup trick, recognizing gestures, and rearranging the order of planets.

For example, in the first demonstration, Gemini recognized the evolution process of a duck sketch from a single line to a complete picture. When the demonstrator filled the duck sketch with blue paint, Gemini pointed out that ducks are usually brown, white, or black, not blue.

Then, when Gemini saw a blue toy duck, it exclaimed, "What is this thing!" and proceeded to answer various voice questions about the toy duck.

Throughout the video, Gemini's reactions were quick and its answers fluent. The problem is that those responses were not genuine real-time interactions.

According to the report, the demonstration was produced by first capturing footage of each scenario and then prompting Gemini with static image frames taken from that footage, with the prompts supplied as text input rather than live voice.

Gemini does possess some of the abilities showcased in the video, but it does not actually complete the interactive tasks as quickly and smoothly as the video suggests.

In other words, according to TechCrunch, the interactions shown in the video were not real-time exchanges but pre-arranged ones.

The video used a series of specially prepared text prompts and accompanying static images, then selected and edited this pre-set material to create the impression of live interaction.

This editing misleads the audience into believing the video shows Gemini's true real-time interaction capabilities. In reality, Gemini is very likely not as capable as the video suggests in terms of interaction speed, accuracy, and other respects.

There are significant differences between the video and the documentation

It is worth noting that when TechCrunch compared the video with the document demonstration released by Google, they found differences in the prompts.

For example, at 2 minutes and 45 seconds in the video, a hand made a series of gestures without accompanying voice prompts. Gemini quickly responded, "I know what you're doing! You're playing rock, paper, scissors!"

But Google's Gemini capability documentation states that the model cannot reason from individual gestures: the prompt must show all three gestures at once, together with the question "What game is this?" Only then can it recognize "rock, paper, scissors".

The performance in the video thus directly contradicts the prompting constraints described in the documentation and cannot demonstrate Gemini's true recognition capabilities.

In addition, the scene where Gemini recognizes the order of the planets may also be deceptive.

The presenter showed sticky notes with doodles of the sun, Saturn, and Earth, and asked Gemini if the order of the planets was correct. Gemini gave the correct order of the sun, Earth, and Saturn.

But the documentation shows that the actual prompt was, "Is this the correct order? Consider the distance from the sun and explain your reasoning."

The two interactions feel completely different. The video demonstration looks like intelligent real-time judgment, while in actual use Gemini requires heavily leading prompts.

In addition, in the demonstration of tracking the crumpled paper under the cup, the prompts shown differ from those recorded in the documentation.

It is worth noting that if the video had stated up front that "this is an artistic representation of interactions tested by our researchers," there would be little to object to, since such videos inherently blend fact with idealization.

However, the video is titled "Intimate Interaction with Gemini" and describes the clips as "our favorite interactions," implying that what it shows is real. In reality, it is not.

Google has not even clarified whether the model shown in the video is the already released Gemini Pro or the Gemini Ultra expected to launch next year.