Game AI Is Here! NVIDIA's New Model Learns Every Game by Watching Streams, and GPT-5.2 Crushes Zelda

Wallstreetcn
2025.12.25 06:15

NVIDIA released a new model called NitroGen, which learned the gameplay of human players by watching 40,000 hours of live game videos. NitroGen does not rely on game code; instead, it masters gameplay across many titles by observing videos with controller overlays. This breakthrough demonstrates the potential of AI in gaming.

As we all know, the reason Tesla's FSD is hailed as a masterpiece lies in its "end-to-end" hardcore logic.

The car no longer relies on rigid high-precision maps or sensors, but behaves like an experienced driver:

Eyes on the road (visual input), foot directly on the accelerator, hands directly steering (action output).

So the question arises, what would happen if we applied this logic to gaming scenarios for AI learning?

The principle is exactly the same! Previously, AI playing games had to rely on reading backend data or even "cheating" to know where the enemies were.

But how do real human players operate?

We focus on the pixels on the screen (visual), our brains process the information, and our fingers directly tap the keyboard or press the controller (operation).

For example, Faker's screen-switching represents the pinnacle of human reaction speed.

Directly from the screen to mouse and keyboard operations, this is the "FSD" of the gaming world.

NVIDIA recently pulled off such a bold move!

They released a new model called NitroGen, which completely breaks the mold.

Project address:

https://nitrogen.minedojo.org/assets/documents/nitrogen.pdf

This model did not grow by reading game code, but by lurking on YouTube and Twitch:

It "watched" 40,000 hours of gameplay footage with controller visuals!

It acts like an extremely eager "cloud player," learning directly how humans operate by observing their actions, mastering positioning and basic attacks in various games.

Whether it's RPGs or side-scrolling games, it can handle them all.

You might ask: how can it learn to operate just by watching videos, without knowing which keys the streamer pressed? One cannot help but admire the creativity of NVIDIA's researchers.

They specifically dug into videos on YouTube and Twitch that feature "controller overlay graphics."

Yes, those are the videos where streamers place a small controller in the corner of the screen, and when they press a button, the controller on the screen lights up accordingly.

NitroGen focused on these 40,000 hours of video material, watching what happened in the game footage (for example, Link swinging a sword), while also observing which button on the controller lit up in the corner (for example, pressing the X button).

It's like someone wanting to learn guitar, not looking at sheet music, but instead watching close-ups of guitarists' fingerings from tens of thousands of concert videos, forcibly correlating "auditory" and "finger movements"!

Only AI could accomplish this task.
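The overlay trick can be sketched in a few lines. The following is a hypothetical illustration, not NVIDIA's actual pipeline: assume the streamer's on-screen controller sits at known pixel regions, and a pressed button renders brighter than an idle one. Then every frame can be labelled with the action being taken, turning raw video into supervised data.

```python
import numpy as np

# Hypothetical pixel regions (row slice, column slice) where each button
# appears in the streamer's controller overlay; a real pipeline would
# detect these per channel and per overlay layout.
BUTTON_REGIONS = {
    "A": (slice(10, 20), slice(10, 20)),
    "B": (slice(10, 20), slice(30, 40)),
    "X": (slice(30, 40), slice(10, 20)),
    "Y": (slice(30, 40), slice(30, 40)),
}

def pressed_buttons(frame: np.ndarray, threshold: float = 200.0) -> set[str]:
    """Return the buttons whose overlay region is 'lit' in this frame.

    Assumption: a pressed button renders brighter than an idle one,
    so mean brightness above `threshold` counts as pressed.
    """
    return {
        name for name, (ys, xs) in BUTTON_REGIONS.items()
        if frame[ys, xs].mean() > threshold
    }

def label_video(frames: list[np.ndarray]) -> list[tuple[np.ndarray, set[str]]]:
    """Turn raw frames into supervised (observation, action) pairs."""
    return [(f, pressed_buttons(f)) for f in frames]

# Synthetic demo: a dark frame where only the "X" region is lit.
frame = np.zeros((60, 60))
frame[30:40, 10:20] = 255  # light up the X button
print(pressed_buttons(frame))  # → {'X'}
```

This is what makes the data "free": the action label is painted right into the pixels, so no backend access to the game is ever needed.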

Refusing to be "specialized," becoming a versatile hexagonal warrior

Previous game AIs were often "specialists": one that could play "Honor of Kings" would never manage "Super Mario."

But NitroGen is all about being a "generalist."

It has learned from over 1,000 different games.

This might mean it has developed a kind of "game intuition"!

Just like us humans playing games, if you've played a souls-like game, such as "Elden Ring," and then pick up a new action game like "Black Myth: Wukong," even if you've never seen it before, you generally know that the left joystick is for running and the right buttons are for attacking.

Testing data shows that when NitroGen was thrown into a new game it had never seen before, its performance was 52% better than models trained from scratch.

Whether it's action RPGs, platformers, or Roguelikes, it can quickly get the hang of it.

Next step: From Hyrule to the real world

Is NVIDIA's move just to create a stronger NPC to play with us?

That's a narrow view; NVIDIA's ambitions are much larger!

First, let's take a look at the recent performance of AI in games.

The latest research from The Decoder has found that today's AI has even begun to possess complex reasoning abilities.

Researchers conducted a unique "stress test" on the reasoning capabilities of current top models using a classic color-changing puzzle from "The Legend of Zelda."

The test required the model to plan six steps to solve the puzzle based solely on screenshots, without internet access.

The results showed a clear disparity among the models:

  • GPT-5.2-Thinking demonstrated astonishing dominance, solving the puzzle quickly and accurately and outperforming all the others;
  • Google's Gemini 3 Pro could also solve the puzzle but sometimes fell into lengthy trial-and-error loops, with reasoning text extending up to 42 pages;
  • Claude Opus 4.5 struggled with visual understanding and needed mathematical formulas for assistance.

The author believes that this powerful reasoning ability, combined with autonomous agent technologies like NVIDIA's NitroGen, indicates:

The era of humans writing game guides and software documentation is coming to an end, and AI will fundamentally change the way we obtain guidance information.

For instance, AI models can now solve color-changing puzzles in "The Legend of Zelda" that require predicting more than six steps, just like solving math problems.

Moreover, NitroGen's potential goes even further; it can not only play but also record and review.

Imagine a future where AI plays a game and effortlessly writes a "platinum guide" for you, even automatically fixing bugs in the game—what more could you ask for?

(It seems likely that the game science project "Black Myth: Wukong" will incorporate AI technology.)

But Jensen Huang's true ambition is hidden in the code: NitroGen is built on NVIDIA's GR00T (Generalist Robot 00 Technology) robot foundation model.

This ambition is quite grand!

  • In the game, it learns: sees a cliff -> knows it will fall -> controls the joystick to jump over.
  • In reality, it corresponds to: sees a puddle on the ground -> knows it will slip -> controls the robot's legs to step over.

The virtual world is, in fact, the most efficient "training ground" for the physical world.

NVIDIA is using millions of trial-and-error attempts in games to create a "general brain" for robots that will one day enter our homes, capable of handling all kinds of chaos.

Perhaps one day, when you marvel at your teammate's incredible moves, the one sitting on the other side of the screen might not actually be human.

But it's a real robot holding a controller and playing games with you!

Games are Reality

Video games have evolved from mere AI testing benchmarks to training grounds for physical intelligence.

This is not only a victory for game AI but also a key turning point for robotics to overcome the "Moravec's Paradox."

The Leap from "Brain" to "Body"

In the past decade, the field of artificial intelligence has experienced a leap from perceptual intelligence to cognitive intelligence.

However, despite large language models being able to write poetry, code, and even pass the bar exam, they often appear clumsy when faced with the physical world.

An AI that can pass the Turing test may struggle to control a robotic arm to complete the simplest task of "putting a cup in the dishwasher."

This is the famous "Moravec's Paradox": for computers, achieving high-level intelligence such as logical reasoning requires very little computational power, while achieving low-level intelligence such as perception and movement requires enormous computational resources.

Embodied intelligence aims to solve this problem, requiring agents not only to "think" but also to have a "body" that can physically interact with the environment.

For a long time, the development of embodied intelligence has been limited by two major bottlenecks:

  • Lack of Data

The internet holds trillions of tokens of text but lacks an equivalent scale of robot data with precise action labels.

  • Generalization Difficulties

Traditional reinforcement learning (RL) algorithms typically perform well only in specific environments (such as a Go board or a particular factory assembly line); once the environment changes even slightly, the model fails.

Games as Simulators of Reality

In 2025, we saw a new path to solving the above bottlenecks: using video games as a bridge to the physical world.

Games provide rich visual environments, complex physical rules, and clear task objectives, and they inherently possess digital and scalable characteristics. More importantly, the "perception - decision - action" loop in the game world is isomorphic to that of physical robots.
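That isomorphism can be made concrete. A minimal sketch (all names are our own, not any real API): the control loop is identical whether the "body" is a game character or a physical robot; only `perceive` and `act` change.

```python
from abc import ABC, abstractmethod

class Embodiment(ABC):
    """One interface for both worlds: the policy never knows whether
    its observations come from rendered pixels or a camera."""

    @abstractmethod
    def perceive(self):   # screen pixels in a game, camera frames on a robot
        ...

    @abstractmethod
    def act(self, action):  # gamepad input in a game, motor commands on a robot
        ...

def control_loop(body: Embodiment, policy, steps: int):
    """The shared perception → decision → action cycle."""
    for _ in range(steps):
        obs = body.perceive()
        body.act(policy(obs))

# Toy stand-in: a 1-D "world" where the goal is to reach position 5.
class ToyBody(Embodiment):
    def __init__(self):
        self.pos = 0
    def perceive(self):
        return self.pos
    def act(self, action):
        self.pos += action

body = ToyBody()
control_loop(body, policy=lambda obs: 1 if obs < 5 else 0, steps=10)
print(body.pos)  # → 5
```

The point of the sketch: a policy trained against one `Embodiment` can, in principle, be dropped behind another, which is exactly the bet being made with game-trained models.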

For embodied agents to survive in the complex and unpredictable real world, mere reflexive responses are not enough.

They must possess deep reasoning and planning capabilities.

The Challenge of the Zelda Color Ball Puzzle

This puzzle originates from the "Legend of Zelda" series of games, and the rules seem simple but are extremely challenging for logic:

  • Scene

A grid composed of red and blue spheres.

  • Rules

Clicking on a sphere will change its color as well as the colors of the spheres directly above, below, left, and right (red to blue, blue to red).

  • Goal

Through a series of clicks, turn all spheres blue.

The essence of this puzzle is a constraint satisfaction problem or graph theory problem.

Its complexity lies in the combinatorial explosion of the state space and the irreversibility of operations.

Players cannot focus solely on the immediate effect of the current click; they must anticipate the state changes several steps ahead. This demands strong forward planning: constructing a "decision tree" in one's mind and deducing the outcomes of different branches. This is precisely what human cognitive psychology defines as "System 2" thinking: slow, deliberate, and logical.
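Structurally, this is the classic "Lights Out" problem: each click is an XOR over a cell and its orthogonal neighbours, clicks commute, and clicking the same cell twice cancels out, so a solution is simply a subset of cells. A minimal brute-force sketch (the board encoding is our own: 0 = blue, 1 = red):

```python
from itertools import combinations

def toggle(state: tuple, i: int, rows: int, cols: int) -> tuple:
    """Click cell i: flip it and its orthogonal neighbours (0=blue, 1=red)."""
    s = list(state)
    r, c = divmod(i, cols)
    for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            s[nr * cols + nc] ^= 1  # XOR flips red <-> blue
    return tuple(s)

def solve(state: tuple, rows: int, cols: int):
    """Search subsets of cells, smallest first: since clicks commute and
    self-cancel, a solution is fully described by which cells to click."""
    n = rows * cols
    for k in range(n + 1):
        for clicks in combinations(range(n), k):
            s = state
            for i in clicks:
                s = toggle(s, i, rows, cols)
            if not any(s):          # all spheres blue
                return clicks
    return None

# 3x3 board where only the centre and its neighbours are red:
# a single click on the centre (index 4) solves it.
board = toggle((0,) * 9, 4, 3, 3)
print(solve(board, 3, 3))  # → (4,)
```

The brute force is exponential, which is precisely the "combinatorial explosion" the article mentions; over GF(2) the same puzzle can also be posed as a linear system, which is why a reasoning model that internalises the algebra has such an edge.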

According to The Decoder's in-depth evaluation:

The current top AI models have shown significant generational differences when facing this challenge, directly reflecting their potential as "brains" of embodied intelligent agents.

The success of GPT-5.2-Thinking lies not only in solving the puzzle but also in demonstrating a trend of algorithmic internalization.

For example, when a robot faces a cluttered table, it can simulate in its "mind" like solving a Zelda puzzle: "If I take the book underneath first, the cup on top will fall; therefore, I must move the cup first."

This ability is key to the transition from "automated machines" to "autonomous agents."

If GPT-5.2 solved the "what to think," then NVIDIA's NitroGen model solved the "how to do it."

The release of NitroGen marks an "ImageNet moment" for robot learning: using internet-scale data to train general motion-control policies.

The NitroGen team proposed an extremely clever "data mining" strategy: using the input overlay layers commonly found in game live streaming.

The brilliance of this strategy lies in its ability to instantly transform originally "unsupervised" video data into "supervised" "visual-action" pairs.

NVIDIA used this technique to build the NitroGen dataset, which comprises 40,000 hours of footage covering over 1,000 games.

This scale is unprecedented in the field of robot learning.
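Once overlay-labelled pairs exist, turning them into a policy is ordinary behaviour cloning: supervised learning of "which button, given these pixels". A toy sketch with synthetic features standing in for real frames (a linear softmax classifier, nothing like NitroGen's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS, DIM = 4, 16  # toy: 4 buttons, 16-dim frame features

def make_dataset(n=500):
    """Synthetic (frame_features, action) pairs: each action's frames
    cluster around their own mean, mimicking overlay-labelled video."""
    centers = rng.normal(size=(N_ACTIONS, DIM))
    y = rng.integers(0, N_ACTIONS, size=n)
    X = centers[y] + 0.1 * rng.normal(size=(n, DIM))
    return X, y

def train(X, y, lr=0.5, epochs=200):
    """Behaviour cloning: minimise cross-entropy of the demonstrated
    action under a softmax policy, by batch gradient descent."""
    W = np.zeros((DIM, N_ACTIONS))
    onehot = np.eye(N_ACTIONS)[y]
    for _ in range(epochs):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)        # softmax probabilities
        W -= lr * X.T @ (p - onehot) / len(X)    # cross-entropy gradient
    return W

X, y = make_dataset()
W = train(X, y)
accuracy = ((X @ W).argmax(axis=1) == y).mean()
print(accuracy > 0.95)  # the cloned policy reproduces the demonstrations
```

No reward signal and no game engine are involved anywhere, which is what lets the recipe scale to a thousand games at once.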

Simulation Layer: World Models as the "Matrix" for Robots

In the movie "The Matrix," Neo learns kung fu in a virtual world.

For robots, world models are their "matrix."

If robots can experience thousands of trial-and-error attempts per second in an extremely realistic virtual world, their evolution speed will far exceed the limitations of physical time.

Based on the above analysis, the path to achieving general intelligence through games is not only feasible but has already begun to take shape.

This path can be summarized as: "Learn to control in games, learn physics in simulations, and learn to adapt in reality."

Future general intelligent agents will inevitably have a hierarchical architecture:

  • Top Level (Brain)

Similar to the reasoning model of GPT-5.2, responsible for handling long-term planning, logical puzzles, and understanding human instructions.

  • Middle Level (Cerebellum)

Similar to the general strategy model of NitroGen, responsible for translating high-level instructions into specific movement trajectories, utilizing the "movement intuition" gained from vast video data.

  • Bottom Level (Spinal Cord)

Based on GR00T's high-frequency full-body controller, responsible for specific motor torque output and balance maintenance.
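The three-level split can be sketched as a simple top-down dispatch. Everything below is a hypothetical stand-in (canned plans and waypoints), meant only to show how an instruction flows down the stack:

```python
def brain(instruction: str) -> list[str]:
    """Top level: turn a human instruction into a step plan
    (stand-in for a GPT-5.2-style reasoner)."""
    plans = {"fetch cup": ["walk to table", "grasp cup", "walk back"]}
    return plans[instruction]

def cerebellum(step: str) -> list[tuple[float, float]]:
    """Middle level: turn a plan step into a coarse trajectory
    (stand-in for a NitroGen-style policy)."""
    waypoints = {
        "walk to table": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
        "grasp cup":     [(2.0, 0.0)],
        "walk back":     [(1.0, 0.0), (0.0, 0.0)],
    }
    return waypoints[step]

def spinal_cord(waypoint: tuple[float, float]) -> str:
    """Bottom level: turn a waypoint into low-level commands
    (stand-in for a GR00T-style whole-body controller)."""
    return f"torques for {waypoint}"

def run(instruction: str) -> list[str]:
    """Brain → cerebellum → spinal cord, top-down."""
    return [
        spinal_cord(wp)
        for step in brain(instruction)
        for wp in cerebellum(step)
    ]

commands = run("fetch cup")
print(len(commands))  # → 6
```

The design point is that each layer can be trained and swapped independently: a better reasoner upgrades the plan without retraining the controller, and vice versa.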

Despite the bright prospects, several key issues remain to be addressed:

  • Lack of Tactile Feedback

Games and videos are primarily visual and auditory, lacking tactile feedback. NitroGen cannot learn "how heavy an object is" or "how slippery a surface is."

  • High-Precision Operations

The current visual-action model performs well in coarse actions (such as walking, grabbing large objects) but still falls short in operations requiring millimeter-level precision (such as threading a needle, precision assembly). This may require higher resolution visual encoders or specialized fine operation strategies.

  • Safety and Ethics

When robots possess autonomous planning capabilities, how can we ensure their objective functions align with human values? The "washing dishes" command should not lead to the robot "breaking plates to empty the sink as quickly as possible."

Games are no longer just entertainment; they are the cradle built by humans for AI.

In this cradle, AI learns to plan (Zelda), learns to control (NitroGen), and learns the physical laws of the world (Cosmos).

When they emerge from the cradle and enter the body of Project GR00T, we will witness the birth of true physical intelligence.

This is not only a victory of technology but also the ultimate manifestation of various possibilities for humanity to give back to the real world through the creation of virtual worlds.

Risk Warning and Disclaimer

The market has risks, and investment requires caution. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article align with their specific circumstances. Investment based on this is at one's own risk.