NVIDIA's "World Foundation Model" is born, igniting the physical AI revolution! A 75-page report is released, and GitHub skyrockets with 2k stars
At CES, NVIDIA announced Cosmos, a world foundation model development platform aimed at advancing physical AI. The platform is trained on 2 million hours of video and includes four major functional modules, among them diffusion models and autoregressive models, and can generate synthetic data to support autonomous driving and robotics research. Cosmos performs strongly in geometric accuracy and visual consistency; the GitHub project gained 2k stars less than a day after release, and a 75-page technical report was published alongside it.
At the CES conference, Jensen Huang stated, "The next frontier of AI is physical AI."
To this end, NVIDIA officially announced its world foundation model development platform, Cosmos, which is trained on 2 million hours of video.
It includes four major functional modules: Diffusion Model, Autoregressive Model, Video Tokenizer, and Video Processing and Editing Workflow.
In the words of NVIDIA senior scientist Jim Fan:
- Two forms: a diffusion model (generates continuous tokens) and an autoregressive model (generates discrete tokens)
- Two generation modes: text → video; text + video → video
Cosmos was born to solve the problem of insufficient data for physical AI! Now, developers can directly generate synthetic data for use in autonomous driving and robotics research.
It comes in three model sizes: Nano, Super, and Ultra.
Compared to the VideoLDM baseline, the Cosmos world model performs better in geometric accuracy and consistently surpasses VideoLDM in visual consistency, with pose estimation success rates up to 14 times higher.
The GitHub project has gained 2k stars in less than a day since its open-source release.
Meanwhile, the most detailed 75-page technical report on Cosmos has also been published.
Open-source project: https://github.com/NVIDIA/Cosmos
Paper link: https://research.nvidia.com/publication/2025-01_cosmos-world-foundation-model-platform-physical-ai
Cosmos, Customized World Model
This article introduces the Cosmos World Foundation Model Platform, aimed at helping developers build customized world models.
In pre-training, researchers utilize large-scale video datasets to expose the model to diverse visual data, training a general-purpose model. The pre-trained Cosmos World Foundation Model (WFM) can generate high-quality, 3D-consistent videos.
In post-training, researchers collect datasets from specific environments to fine-tune the pre-trained model, resulting in a specialized WFM suitable for specific objectives.
The pre-trained World Foundation Model (WFM) is a general world model trained on large-scale, diverse video datasets. The post-training datasets are collected from the target environment as prompt-video pairs. Prompts can take the form of action instructions, trajectories, descriptions, and so on.
The combination strategy of pre-training and post-training provides an efficient method for building physical AI systems. Since the pre-trained WFM provides a solid foundation, the post-training dataset can be relatively small.
World Foundation Model Platform
Video Editing
Researchers have developed a scalable video data editing process.
In this process, each video is segmented into independent shots without scene changes. A filtering step is used to locate high-quality, dynamic, and information-rich segments for training.
These high-quality shots are then annotated with a VLM (vision-language model), and semantic deduplication is performed to construct a diverse yet compact dataset.
Video Tokenization
Researchers have developed a series of video tokenizers with different compression ratios. These tokenizers are causal (i.e., the token computation of the current frame does not depend on future frames).
This causal design brings multiple benefits. In terms of training, it enables joint training of images and videos, as the causal video tokenizer can also function as an image tokenizer when the input is a single image.
This is crucial for video models that utilize image datasets for training, as image datasets contain rich information about the appearance of the world and are often more diverse.
In application, causal video tokenizers are more suitable for physical AI systems that operate in a causal world.
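To make the causal design concrete, here is a minimal PyTorch sketch (illustrative only, not NVIDIA's implementation) of a causal temporal convolution: padding is applied only on the past side of the time axis, so frame t never sees later frames, and a single image (T = 1) flows through the very same layer.

```python
# Minimal sketch of a causal temporal convolution (not NVIDIA's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_t=3, kernel_s=3):
        super().__init__()
        self.pad_t = kernel_t - 1  # pad only on the "past" side of the time axis
        self.conv = nn.Conv3d(
            in_ch, out_ch,
            kernel_size=(kernel_t, kernel_s, kernel_s),
            padding=(0, kernel_s // 2, kernel_s // 2),
        )

    def forward(self, x):  # x: (N, C, T, H, W)
        # Replicate the first frame backwards in time instead of zero-padding,
        # so a single image behaves like a constant video.
        x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0), mode="replicate")
        return self.conv(x)

video = torch.randn(1, 3, 9, 64, 64)   # 9-frame clip
image = torch.randn(1, 3, 1, 64, 64)   # single image, same network
layer = CausalConv3d(3, 16)
print(layer(video).shape)  # torch.Size([1, 16, 9, 64, 64])
print(layer(image).shape)  # torch.Size([1, 16, 1, 64, 64])
```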
WFM Pre-training
Researchers explored two scalable pre-training world foundation model approaches—diffusion models and autoregressive models. They utilized the Transformer architecture to achieve scalability.
For the diffusion-based WFM, pre-training consists of two steps:
1. Text to World Generation Pre-training
2. Video to World Generation Pre-training
Specifically, they trained the model to generate a video world based on input text prompts. It was then fine-tuned to generate future video worlds based on past videos and input text prompts, known as the Video to World generation task.
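A rough sketch of the Video to World setup is below, under assumptions: the denoiser, the shapes, and the conditioning scheme are placeholders, not the Cosmos implementation. The idea is that only the future latent frames are denoised, while the latents of the observed past frames stay fixed as conditioning.

```python
import torch

def video2world_step(denoiser, past_latents, noisy_future, sigma, text_emb):
    # Concatenate clean past latents with noisy future latents along the time axis.
    x = torch.cat([past_latents, noisy_future], dim=2)   # (N, C, T_past + T_future, H, W)
    denoised = denoiser(x, sigma, text_emb)
    # Only the future frames are updated; the past stays fixed as conditioning.
    return denoised[:, :, past_latents.shape[2]:]

denoiser = lambda x, sigma, text_emb: 0.9 * x            # toy stand-in for the diffusion network
past = torch.randn(1, 16, 2, 44, 80)                     # 2 observed (conditioning) latent frames
future = torch.randn(1, 16, 6, 44, 80)                   # 6 latent frames to be generated
out = video2world_step(denoiser, past, future, sigma=1.0, text_emb=None)
print(out.shape)                                          # torch.Size([1, 16, 6, 44, 80])
```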
For the autoregressive-based WFM, pre-training includes two steps:
1. Basic Next Token Generation
2. Text-conditioned Video to World Generation
They first trained the model to generate future video worlds based on past video inputs (forward generation). It was then fine-tuned to generate future video worlds based on past videos and text prompts.
The Video to World generation model is a pre-trained world model that predicts the future based on current observations and prompts.
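A toy sketch of that next-token rollout follows; the tiny transformer, vocabulary size, and conditioning are stand-ins for illustration, not the Cosmos architecture, and the past frames are assumed to already be discrete tokenizer indices.

```python
import torch
import torch.nn as nn

class ToyWorldModel(nn.Module):
    """Toy stand-in for a discrete-token autoregressive WFM (illustrative only)."""
    def __init__(self, vocab=512, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, cond):
        h = self.embed(tokens) + cond      # add (placeholder) text conditioning
        h = self.block(h)                  # causal attention mask omitted for brevity
        return self.head(h)

@torch.no_grad()
def rollout(model, past_tokens, cond, n_new):
    tokens = past_tokens
    for _ in range(n_new):
        logits = model(tokens, cond)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)   # predict the next token
        tokens = torch.cat([tokens, next_tok], dim=1)       # append and continue
    return tokens

model = ToyWorldModel()
past = torch.randint(0, 512, (1, 16))    # tokens from past frames
cond = torch.zeros(1, 1, 64)             # placeholder text embedding
future = rollout(model, past, cond, n_new=8)
print(future.shape)                      # torch.Size([1, 24])
```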
For both the diffusion and autoregressive models of WFM, researchers constructed a series of models with different capacities and studied their effectiveness in various downstream applications.
They further fine-tuned the pre-trained diffusion WFM to develop a diffusion decoder to enhance the generative results of the autoregressive model.
To better control the WFM, they also built a prompt upsampler based on LLM.
WFM Post-training
The team demonstrated the application of the pre-trained WFM in multiple downstream physical AI applications.
They fine-tuned the pre-trained WFM to take camera poses as input prompts, allowing them to navigate freely within the created world. Additionally, they showcased how to fine-tune the pre-trained WFM for humanoid robots and autonomous driving tasks.
Safety Mechanism
To safely use the developed world foundation model, researchers have developed a safety mechanism to prevent harmful inputs and outputs.
The Cosmos world foundation model platform consists of several main components: a video processing and editing workflow, a video tokenizer, pre-trained world foundation models, world foundation model post-training examples, and a safety mechanism.
They believe that WFM has various uses for physical AI builders, including (but not limited to):
Policy Evaluation
Instead of evaluating trained policies by running physical AI systems in the real world, it is better to let digital copies of physical AI systems interact with the world foundation model. Evaluations based on WFM are more cost-effective and time-saving.
With WFM, builders can deploy policy models in unseen environments that may not be available in reality. WFM helps developers quickly eliminate unqualified policies and focus on those with greater potential.
Policy Initialization
Policy models generate actions that the physical AI system needs to perform based on current observations and given tasks. A high-quality WFM modeling the dynamics of the world can serve as a good initialization for policy models.
This helps address the data scarcity issue in physical AI.
Policy Training
In reinforcement learning settings, WFM paired with a reward model can act as a proxy for the physical world, providing feedback to the policy model. Agents gradually master the ability to solve tasks through interaction with WFM.
Planning or Model Predictive Control
WFM can be used to simulate the possible future states of the physical AI system after executing different sequences of actions, and then quantify the performance of these different action sequences through a cost/reward module.
Physical AI can then execute the optimal sequence of actions based on the overall simulation results (as in planning algorithms) or execute it in a receding-horizon manner (as in model predictive control).
The accuracy of the world model determines the performance upper limit of these decision-making strategies.
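A compact sketch of the receding-horizon loop described above, with a toy world model and cost function standing in for the WFM and the cost/reward module (all names here are illustrative):

```python
import numpy as np

def mpc_step(world_model, cost_fn, state, horizon=10, n_candidates=64, rng=None):
    rng = rng or np.random.default_rng()
    best_cost, best_actions = np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 2))  # random candidate action sequence
        s, total_cost = state, 0.0
        for a in actions:
            s = world_model(s, a)          # world model predicts the next state
            total_cost += cost_fn(s)       # score the predicted rollout
        if total_cost < best_cost:
            best_cost, best_actions = total_cost, actions
    return best_actions[0]                 # receding horizon: execute only the first action

# Toy dynamics and cost, standing in for the WFM and the reward module.
world_model = lambda s, a: s + 0.1 * a
cost_fn = lambda s: float(np.sum(s ** 2))  # drive the state toward the origin
action = mpc_step(world_model, cost_fn, state=np.array([1.0, -0.5]))
print(action)
```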
Synthetic Data Generation
WFM can not only be used to generate synthetic data for training but can also be fine-tuned for conditional generation based on rendering metadata (such as depth maps or semantic maps). Conditional WFM can be used in Sim2Real scenarios.
Data Editing
Researchers have proposed a video processing workflow to generate high-quality training datasets for the tokenizer and WFM.
As shown in the figure below, the workflow includes five main steps: 1) segmentation, 2) filtering, 3) annotation, 4) deduplication, and 5) sharding.
These steps have been specifically optimized to improve data quality and meet the needs of model training.
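As one concrete illustration, the deduplication step might be sketched as follows; the hash-based embed function is a stand-in for a real caption or video embedding model, not the pipeline NVIDIA actually uses:

```python
# Minimal sketch of semantic deduplication over clip captions (illustrative only).
import numpy as np

def embed(caption: str, dim: int = 64) -> np.ndarray:
    # Stand-in "embedding": deterministic within a run, one unit vector per caption.
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def semantic_dedup(captions, threshold=0.9):
    kept, kept_vecs = [], []
    for c in captions:
        v = embed(c)
        # Keep a clip only if it is not too similar to anything already kept.
        if all(float(v @ k) < threshold for k in kept_vecs):
            kept.append(c)
            kept_vecs.append(v)
    return kept

clips = ["robot arm picks up a cup", "robot arm picks up a cup", "car drives at night"]
print(semantic_dedup(clips))   # duplicates collapse to one entry
```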
Pre-training Dataset
Researchers accumulated approximately 20 million hours of raw video at resolutions from 720p to 4K, from which they generated about 10^8 video clips for pre-training and around 10^7 video clips for fine-tuning.
These include various physical AI applications, and the training video dataset is divided into the following categories:
- Driving (11%)
- Hand movements and object manipulation (16%)
- Human actions and activities (10%)
- Spatial awareness and navigation (16%)
- First-person perspective (8%)
- Natural dynamics (20%)
- Dynamic camera movements (8%)
- Synthetic rendering (4%)
- Others (7%)
Tokenizer
The tokenizer is a fundamental building block of large models, which converts raw data into more efficient representations in an unsupervised manner by learning a bottleneck latent space.
The diagram below illustrates the tokenizer training process, aiming to train the encoder and decoder so that the bottleneck token representation retains as much visual information from the input as possible.
Video tokenization process: the input video is encoded into tokens, and the decoder subsequently reconstructs the input video from these tokens. The training goal of the tokenizer is to learn the encoder and decoder while preserving visual information in the tokens as much as possible.
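A minimal sketch of that objective, assuming a plain autoencoder trained with a reconstruction loss (real tokenizers typically add perceptual and adversarial terms, which are omitted here):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 8, 4, stride=4))          # 4x4 spatial compression
decoder = nn.Sequential(nn.ConvTranspose2d(8, 3, 4, stride=4))  # back to pixel space
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

frames = torch.rand(4, 3, 64, 64)                 # a tiny batch of frames
latent = encoder(frames)                          # (4, 8, 16, 16) bottleneck tokens
recon = decoder(latent)
loss = nn.functional.mse_loss(recon, frames)      # reconstruction objective
loss.backward()
opt.step()
print(latent.shape, float(loss))
```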
Continuous tokenizers encode visual data into continuous latent embeddings and are used in models that generate data by sampling from continuous distributions.
Discrete tokenizers encode visual data into discrete latent codes and map them to quantized indices. This discrete representation is necessary for models trained with cross-entropy loss, such as GPT.
The success of the tokenizer largely depends on its ability to provide high compression rates without compromising the quality of subsequent visual reconstruction.
Here, researchers propose a set of visual tokenizers—including continuous and discrete tokenizers for images and videos. They can provide excellent visual reconstruction quality and inference efficiency while supporting various compression rates to accommodate different computational constraints and application needs.
Visualization of continuous and discrete tokenizers: (left) continuous latent embeddings with embedding size C; (right) quantized indices, with each color representing a discrete latent code.
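The difference between the two token types can be shown in a few lines; plain vector quantization is used below as the simpler stand-in for a discrete bottleneck (the report describes an FSQ-based quantizer), and all shapes are arbitrary:

```python
import torch

latents = torch.randn(4, 16)          # 4 patch latents, embedding size C = 16
codebook = torch.randn(512, 16)       # 512 learned code vectors

# Continuous tokens: the real-valued latent vectors themselves.
continuous_tokens = latents

# Discrete tokens: index of the nearest codebook vector for each latent.
dists = torch.cdist(latents, codebook)        # (4, 512) pairwise distances
indices = dists.argmin(dim=1)                 # (4,) integer codes
reconstructed = codebook[indices]             # what the decoder would see

print(continuous_tokens.shape, indices.shape, reconstructed.shape)
```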
Specifically, the Cosmos tokenizer adopts a lightweight and computationally efficient architecture, combined with causal temporal mechanisms.
By using causal temporal convolution layers and causal temporal attention layers, the natural temporal order of video frames is preserved, enabling seamless tokenization of images and videos through a single unified network architecture. Because the tokenizer is trained directly on high-resolution images and long-duration videos, it operates without restrictions on content category or aspect ratio, including 1:1, 3:4, 4:3, 9:16, and 16:9.
During inference, it is not tied to a fixed clip length and can tokenize videos longer than those seen during training.
Comparison of different visual tokenizers and their functions
Evaluation results indicate that the Cosmos tokenizer significantly outperforms existing tokenizers in performance—not only is the quality higher, but the running speed can be up to 12 times faster.
Additionally, it can encode up to 8 seconds of 1080p video and 10 seconds of 720p video at once on a single NVIDIA A100 GPU (80GB memory) without exhausting memory.
Comparison of continuous tokenizers (left) and discrete tokenizers (right) in terms of spatiotemporal compression ratio (log scale) and reconstruction quality (PSNR). Each solid point represents a tokenizer configuration, illustrating the trade-off between compression ratio and quality.
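As a back-of-the-envelope illustration of what such compression ratios mean, assume a tokenizer with 8x temporal and 16x16 spatial compression and causal first-frame handling (assumptions made for the arithmetic, not exact Cosmos numbers); a 121-frame 720p clip then shrinks to a small token grid:

```python
# Hypothetical tokenizer with 8x temporal and 16x16 spatial compression.
frames, height, width = 121, 720, 1280
ct, cs = 8, 16
tok_t = 1 + (frames - 1) // ct       # causal tokenizers often encode the first frame on its own
tok_h, tok_w = height // cs, width // cs
tokens = tok_t * tok_h * tok_w
print(tok_t, tok_h, tok_w, tokens)   # 16 45 80 57600
pixels = frames * height * width
print(pixels / tokens)               # 1936.0: roughly 1936x fewer elements per channel
```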
World Foundation Model Pre-training
Researchers utilize two different deep learning paradigms—diffusion models and autoregressive models—to construct two types of WFM.
All WFM models in this paper were trained on a cluster containing 10,000 NVIDIA H100 GPUs over a training period of three months.
World Foundation Model (WFM) based on diffusion models and autoregressive models
Video generated by the autoregressive world foundation model
Researchers demonstrate how to fine-tune the Cosmos WFM to support various scenarios, including 3D visual navigation, enabling different robots to perform tasks, and autonomous driving.
World Foundation Model Post-training
Post-training of WFM for Robotics
World models have strong potential to support robotic manipulation; the researchers showcase two tasks: (1) instruction-based video prediction and (2) action-based next-frame prediction.
For instruction-based video prediction, the input is the robot's current video frame along with a text instruction, and the output is the predicted video. For action-based next-frame prediction, the input is the robot's current video frame along with the action vector between the current frame and the next frame, and the output is the predicted next frame, showing the result of the robot performing the specified action.
For instruction-based video prediction, researchers created a dataset called Cosmos-1X. This dataset contains approximately 200 hours of first-person-perspective videos captured by EVE (a humanoid robot from 1X Technologies), covering navigation, folding clothes, cleaning tables, picking up objects, and more.
For action-based next-frame generation, the team used a public dataset called Bridge. The Bridge dataset includes about 20,000 third-person perspective videos showcasing the process of robotic arms performing different tasks in a kitchen environment.
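To make the action-conditioned setup concrete, here is a toy sketch: the current frame and the inter-frame action vector go in, and a predicted next frame comes out. The network, the 7-dimensional action, and all shapes are illustrative assumptions, not the Cosmos model:

```python
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    """Toy action-conditioned next-frame predictor (illustrative only)."""
    def __init__(self, action_dim=7, ch=3):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, 16)
        self.net = nn.Sequential(
            nn.Conv2d(ch + 16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, ch, 3, padding=1),
        )

    def forward(self, frame, action):          # frame: (N, 3, H, W), action: (N, 7)
        a = self.action_proj(action)            # (N, 16)
        # Broadcast the action over the spatial grid and fuse it with the frame.
        a = a[:, :, None, None].expand(-1, -1, frame.shape[2], frame.shape[3])
        return self.net(torch.cat([frame, a], dim=1))

frame = torch.randn(1, 3, 64, 64)               # current observation
action = torch.randn(1, 7)                      # e.g. end-effector delta plus gripper state
next_frame = NextFramePredictor()(frame, action)
print(next_frame.shape)                          # torch.Size([1, 3, 64, 64])
```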
Post-training for Autonomous Driving
Researchers demonstrated how to fine-tune the pre-trained WFM to create a multi-view world model suitable for autonomous driving tasks.
The researchers curated an internal dataset called the Real Driving Scenes (RDS) dataset. This dataset contains approximately 3.6 million 20-second multi-camera surround-view video clips recorded by NVIDIA's internal driving platform.
The researchers fine-tuned Cosmos-1.0-Diffusion-7B-Text2World using the RDS dataset to build a multi-view world model.
Cosmos-1.0-Diffusion-7B-Text2World-Sample-MultiView-TrajectoryCond model results
Some Demonstrations
From the GitHub homepage, we can see all model series of the Cosmos family: 4 diffusion models and 4 autoregressive models.
The diffusion models 7B and 14B (Text2World) generated the following results based on the same prompt:
The diffusion models 7B and 14B (Video2World) generated the following results based on the same prompt:
The generation effects of the autoregressive models 4B and 12B are as follows:
The generation effects of the autoregressive models 5B and 13B based on the same prompt are as follows:
In addition, the post-trained world model can also achieve "camera control": in the generated video of robots in a factory, you can look around the surrounding environment by moving the camera.
Note: This video showcases an advanced manufacturing facility where multiple robotic arms work collaboratively. These robots are equipped with special gripping devices and are processing and assembling components on a central platform. The environment is clean and orderly, with various machinery and equipment visible in the background. The entire robotic system is highly automated, reflecting a high-tech production process.
What’s even more surprising is that Cosmos can generate predictive scenarios of various robots in different environments based on prompts. For example, placing a book on a shelf, brewing coffee, sorting items...
This means that, in the future, robots trained in simulation can be put to work directly in the physical world!
There is also multi-view video generation for autonomous driving scenarios by Cosmos.
It is important to note that the following scenes do not exist at all.
Some netizens joked that we must be living in a simulated world, and that there is a 99% chance the system running it is powered by NVIDIA.
Physical AI Cannot Do Without WFM
Why is the world model crucial for the realization of physical AI?
At the conference, Jensen Huang vividly explained the importance of the world model in terms of how large models work: large models typically generate one token at a time based on a prompt, but this is limited to outputting content tokens. To move from "content tokens" to "action tokens," language models alone are no longer sufficient.
What we need is a model that can understand the physical world, simply put, WFM.
Yesterday, Ming-Yu Liu, Vice President of Research at NVIDIA, also stated in the latest podcast that WFM is a powerful god-level network capable of simulating the physical world.
It can generate detailed videos from text/image input data and predict the evolution of scenes by combining its current state (image/video) with actions (prompts/control signals).
WFM can imagine many different environments and can simulate the future, helping physical AI developers make better decisions.
On the other hand, building world models typically requires large datasets.
Data collection is not only time-consuming but also costly, and WFM can generate synthetic data to enhance the training process.
Additionally, physical testing carries significant risks; any mistake with a robot prototype worth hundreds of thousands of dollars could lead to substantial losses.
With the 3D environments simulated by WFM, researchers can train and test physical AI systems in controlled environments.
NVIDIA Cosmos can help you generate everything in the physical simulation world.
Suppose you want to test a robot; you upload an original video and then input:
"Capture a scene of a humanoid robot working in an old factory from a first-person perspective. There are many industrial machines around the robot. The floor is old wooden flooring, worn and richly textured. The camera pans to the right at a height of 2 meters above the ground. The photo style should be realistic."
Then, a virtual image of a robot working in the factory appears.
The autonomous driving scenes below were likewise generated entirely by Cosmos.
Not only that, NVIDIA also uses Cosmos in conjunction with Omniverse, blending the virtual and real worlds so that designs made in the virtual world can be carried over into real-world training.
All along, Jensen Huang has emphasized a new concept called "three computers": one is DGX for training AI, another is AGX for deploying AI, and the last one is Omniverse + Cosmos.
If we connect the first two, we need a digital twin.
Jensen Huang believes that "in the future, every factory will have a digital twin, and you can combine Omniverse and Cosmos to generate a large number of future scenarios."
New Intelligence, original title: "NVIDIA's 'World Foundation Model' is born, igniting the physical AI revolution! A 75-page report is released, GitHub skyrockets with 2k stars."