Just now, DeepMind's strongest "foundation world model" was born! Generate a 1-minute game world from a single image, unlocking the next generation of intelligent agents

Wallstreetcn
2024.12.04 19:56

Google DeepMind has launched Genie 2, its second-generation foundation world model, capable of generating playable game worlds lasting up to 1 minute. The model generates dynamic environments from single images, which users can interact with using a keyboard and mouse. Researchers tested agents' ability to follow language instructions in the generated environments, with surprising results. Genie 2 provides infinitely diverse 3D environments for training and evaluating embodied agents, marking a new advance in AI research.

Just now, Google DeepMind's second-generation large-scale foundation world model, Genie 2, was born!

From now on, AI can generate various consistent worlds, playable for up to 1 minute.

Google researchers believe that Genie 2 can unlock the next wave of capabilities for embodied agents.

From first-person real-world scenes to third-person driving environments, Genie 2 generates worlds at 720p resolution.

Given an image, it can simulate world dynamics, creating a consistent environment that can be interacted with via keyboard and mouse input.

How great is the potential of embodied agents?

In a world generated by Genie 2 from an Imagen 3 image, researchers tested whether the latest agents could follow language instructions to walk to the red door or the blue door.

The results were surprising!

Thus, we now have a pathway to infinite environments for training and evaluating embodied agents.

After researchers created a world containing three archways, Genie 2 simulated that world consistently, allowing the team to test the task in it.

In response, netizens expressed admiration: "This work is truly amazing! From now on, we can finally combine open-ended agents with open-world models. We are moving towards an almost infinite training data system."

Some netizens also expressed: "The world of 'The Matrix' is coming!"

Generating Infinite Diverse Training Environments for Future General Intelligence Agents

As a foundational world generation model, Genie 2 can generate an infinite variety of controllable and playable 3D environments for training and evaluating embodied agents.

Given a prompt image, the generated environment can be operated by a human or an AI agent via keyboard and mouse input.

In AI research, games have always played a crucial role. Due to their engaging characteristics, unique combinations of challenges, and measurable progress, games have become an ideal environment for safely testing and enhancing AI capabilities.

Since the establishment of Google DeepMind, games have been at the core of research—from early Atari game studies to groundbreaking achievements like AlphaGo and AlphaStar, and collaborations with game developers to study general intelligence agents.

However, training more general embodied agents has been limited by the lack of sufficiently rich and diverse training environments.

But now, the birth of Genie 2 changes everything.

From now on, future agents can be trained and evaluated in an infinite number of new world scenarios.

A new creative workflow for interactive experience prototype design also has new possibilities.

Emergent Capabilities of Foundational World Models

So far, world models have largely been limited to modeling narrow domains. In Genie 1, researchers introduced a method for generating diverse two-dimensional worlds.

By the time of Genie 2, a significant breakthrough in versatility was achieved—it can now generate rich and diverse 3D worlds.

Genie 2 is a world model, which means it can simulate virtual worlds, including the consequences of taking any actions (such as jumping, swimming, etc.).

After being trained on a large-scale video dataset, it exhibits emergent capabilities at various scales, such as object interaction, complex character animation, physics modeling, and the ability to model and predict the behavior of other agents, similar to other generative AI models.

For every demo where a human interacts with Genie 2, the model takes a single image generated by Imagen 3 as the prompt input.

This means anyone can describe the world they want in words, choose their preferred rendering effects, and then enter this newly created world to interact with it (or they can also train or evaluate AI agents within it).

At each step, a human or an agent provides actions through the keyboard and mouse, and Genie 2 simulates the next observation.

Genie 2 can generate a consistent world for up to one minute, with most demonstrations lasting 10-20 seconds.
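The interaction loop described here can be sketched in a few lines. Genie 2's API is not public, so the `WorldModel` class below is a hypothetical stand-in: the only point it illustrates is the loop structure, in which a human or agent supplies a keyboard/mouse action at each step and the world model returns the next observed frame.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of the human/agent-in-the-loop interaction.
# This stub stands in for Genie 2's real (unpublished) interface.

@dataclass
class WorldModel:
    prompt_image: str                  # e.g. a single Imagen 3 render
    history: List[str] = field(default_factory=list)

    def step(self, action: str) -> str:
        """Simulate the next observation, conditioned on the action history."""
        self.history.append(action)
        return f"frame_{len(self.history)}_after_{action}"

world = WorldModel(prompt_image="robot_in_forest.png")
for action in ["key_up", "key_up", "mouse_look_left"]:
    frame = world.step(action)

print(frame)  # the latest simulated observation
```

The same loop works whether `step` is driven by a human's keystrokes or by an AI agent's policy, which is what makes the generated world usable both as a game and as a training environment.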

Action Control

Genie 2 can intelligently respond to actions taken through keyboard keys, recognizing characters and moving them correctly.

For example, the model must calculate that the arrow keys should move the robot, not the trees or clouds.

A cute humanoid robot in the woods

A humanoid robot in ancient Egypt

A first-person view of a robot on a purple planet

A first-person view of a robot in a modern urban apartment

Generating Counterfactuals

Genie 2 can create multiple different developments based on the same starting scene.

This means we can provide various "what if" scenarios for AI training.

In the two rows of demos below, each video starts from exactly the same frame, but the human player chooses different actions.

Long-term Memory

Genie 2 can remember scenes that temporarily leave the frame and accurately restore them when they re-enter the field of view.

Continuous Generation of New Scenes

Genie 2 can create logically consistent new scene content in real-time during the process and maintain the consistency of the entire world for up to one minute.

Diverse Environments

Genie 2 can generate various observation perspectives, such as first-person perspective, isometric perspective (45-degree overhead view), or third-person driving perspective.

3D Structure

Genie 2 can create complex 3D visual scenes.

Object Properties and Interactions

Genie 2 can model various object interactions, such as popping balloons, opening doors, and shooting explosive barrels.

Character Animation

Genie 2 can create various action animations for different types of characters.

NPCs

Genie 2 can model other intelligent agents, and even complex interactions with them.

Physical Effects

Genie 2 can simulate the dynamics of water surfaces.

Smoke

Genie 2 can simulate smoke effects.

Gravity

Genie 2 can simulate gravity.

Lighting

Genie 2 can simulate point light sources and directional light.

Reflection

Genie 2 can simulate reflections, diffuse light, and colored lighting.

Simulation Based on Real Images

Genie 2 can also take real-world images as input prompts, simulating scenes such as grass swaying in the wind or water flowing in a river.

Quickly Create Test Prototypes

With Genie 2, creating diverse interactive scenarios has become simple.

Researchers can quickly try out new environments to train and test embodied AI agents.

For example, below are images generated by Imagen 3 that researchers fed into Genie 2 to simulate controlling a paper airplane, a dragon, a falcon, or a parachute in various modes of flight.

This also tested how well Genie 2 handles the actions of different controllable objects.

With its powerful out-of-distribution generalization capability, Genie 2 can turn concept art and hand-drawn sketches into interactive scenes.

This allows artists and designers to quickly validate ideas, enhance the efficiency of scene design, and accelerate related research progress.

Below are some virtual scene examples created by conceptual designers.

AI Agents Acting in World Models

With Genie 2, researchers can quickly build rich and diverse virtual environments and create new evaluation tasks to test AI agents' performance in previously unencountered scenarios.

The following demo is the SIMA agent developed collaboratively by Google DeepMind and game developers, which can accurately understand and execute various commands in a brand new environment generated by Genie 2 from just one image.

prompt: A screenshot from a third-person open-world exploration game. The player in the image is an adventurer exploring a forest. There is a house with a red door on the left and a house with a blue door on the right. The camera is positioned behind the player. #RealisticStyle #Immersive

The goal of the SIMA agent is to complete tasks in diverse 3D game environments by following natural language instructions.

Here, the team used Genie 2 to generate a 3D environment containing two doors (one blue, one red) and instructed the SIMA agent to open each of them.

During the process, SIMA controls the game character using the keyboard and mouse, while Genie 2 is responsible for generating the game visuals in real-time.
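The division of labor described here forms a closed loop: the agent maps (instruction, frame) to an action, and the world model maps that action to the next frame. Neither SIMA's nor Genie 2's interface is public, so both functions below are hypothetical stubs; only the loop structure reflects the setup in the text.

```python
# Hedged sketch of the agent-in-world-model evaluation loop.
# agent_policy stands in for SIMA; world_model_step stands in for Genie 2.

def agent_policy(instruction: str, frame: str) -> str:
    # Stub for SIMA: choose a keyboard action from the instruction.
    return "move_left" if "blue" in instruction else "move_right"

def world_model_step(frame: str, action: str) -> str:
    # Stub for Genie 2: render the next frame given the action.
    return f"{frame}|{action}"

frame = "start"
for _ in range(3):
    action = agent_policy("open the blue door", frame)
    frame = world_model_step(frame, action)

print(frame)  # -> start|move_left|move_left|move_left
```

Because the world model generates frames on the fly, the same loop also runs in reverse as an evaluation of Genie 2 itself: scripting the agent to look around or walk behind a house probes whether the generated environment stays consistent.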

Open the blue door

Open the red door

Moreover, we can also use SIMA to evaluate the capabilities of Genie 2.

For example, by having SIMA look around the scene and explore the area behind the house, we can test whether Genie 2 can generate a consistent environment.

Turn around

Go behind the house

This research is still in its early stages, and both the agent's performance and the environment generation have room for improvement. However, researchers believe that Genie 2 is a path to solving the structural problem of safely training embodied agents, while achieving the breadth and generality required for artificial general intelligence (AGI).

prompt: A computer game scene depicting the interior of a rugged cave or mine. The view is in third-person perspective, looking down from above the protagonist. The protagonist is a knight holding a longsword. In front of the knight stand three stone archways, and he can choose to enter any of the doors. Through the first door, strange green plants emitting a glow can be seen growing in the tunnel. Behind the second door is a long corridor, with the walls covered in riveted iron plates, and a disturbing light faintly visible in the distance. Inside the third door is a rough stone staircase winding up to an unknown height.

Go up the stairs

Go to the place with plants

Go to the middle door

Diffusion World Model

Genie 2 is an autoregressive latent variable diffusion model trained on a large-scale video dataset.

In this model, video frames are first compressed into latent frames by an autoencoder, then passed to a large-scale Transformer dynamics model trained with causal masking similar to that used in LLMs.

At inference time, Genie 2 samples autoregressively, frame by frame, conditioning on the current action and the previous latent frames. During this process, classifier-free guidance is used to enhance action controllability.
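The sampling scheme can be sketched concretely, under stated assumptions: Genie 2's architecture and weights are not public, so `denoise` below is a toy stand-in for the Transformer dynamics model. What the sketch does show is (1) frame-by-frame autoregressive rollout in latent space, and (2) classifier-free guidance, which blends an action-conditioned prediction with an unconditioned one to strengthen the effect of the action.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 8

def denoise(latents, action):
    # Toy denoiser: a real model would run a causally-masked Transformer
    # over the latent history, conditioned on the action (None = uncond).
    bias = 0.0 if action is None else len(action) / 10.0
    return latents[-1] * 0.9 + bias

def sample_next_latent(latents, action, guidance_scale=3.0):
    # Classifier-free guidance: uncond + w * (cond - uncond).
    # guidance_scale=1.0 recovers the plain conditional prediction;
    # larger values push the sample toward the action-conditioned direction.
    cond = denoise(latents, action)
    uncond = denoise(latents, None)
    return uncond + guidance_scale * (cond - uncond)

# Autoregressive rollout: start from the prompt latent, generate one
# latent frame per action.
latents = [rng.standard_normal(LATENT_DIM)]
for action in ["jump", "swim", "jump"]:
    latents.append(sample_next_latent(latents, action))

print(len(latents))  # prompt latent + 3 generated frames -> 4
```

In a real system each `sample_next_latent` call would itself be an iterative diffusion denoising process, and the latent frames would be decoded back to pixels by the autoencoder.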

It is worth noting that the demonstrations above were generated by the undistilled base model, fully showcasing the technology's potential.

Of course, a distilled version can also be run in real-time, but the output quality will correspondingly decrease.

Highlights

In addition to these cool demos, the team also discovered many interesting highlights during the generation process:

Standing in the garden in a daze, suddenly, a ghost appeared.

This friend prefers to parkour in the snowfield rather than skiing obediently on skis.

With great power comes great responsibility.

Acknowledgments

Finally, the Google DeepMind team released a long list of acknowledgments.

Reference: https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/

This article is sourced from: New Intelligence, original title: "Just in, DeepMind's strongest 'foundation world model' is born! One image generates a 1-minute game world, unlocking the next generation of intelligent agents."

Risk Warning and Disclaimer

The market has risks, and investment requires caution. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article are suitable for their specific circumstances. Any investment made on this basis is at one's own risk.