NVIDIA has just announced its Cosmos 3 world model at the ongoing GTC Taipei, giving us a glimpse at what it calls the world's first "fully open omnimodel" that is capable of vision-based reasoning, while supporting multimodal output in the form of text, image, video, and ambient sound.
NVIDIA's Cosmos 3 "pairs a reasoning transformer with an expert generation transformer," allowing the model to grasp physical interactions before generating video and action content that leverages those interactions.
At its heart, Cosmos 3 tackles the challenge of making robots, autonomous vehicles (AVs), and vision agents understand their surroundings in an environment where training data is limited and simulation stacks remain fragmented.
NVIDIA's Cosmos 3 is an open omnimodel, which means it is able to "natively understand and generate text, images, video, ambient sound and actions with leading physics accuracy."
Its unique strength lies in its architecture, which pairs reasoning transformers with those geared towards generation, "enabling Cosmos 3 to understand object interactions, motion and spatial-temporal relationships before generating video and action trajectories."
For those who might not be aware, a transformer is a deep learning neural network that tracks relationships and context within sequential data, such as the words in a sentence. These networks can substantially speed up processing by working in parallel, analyzing a given data sequence all at once instead of piece by piece.
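To make that idea a little more concrete, here is a minimal Python sketch of self-attention, the mechanism transformers use to relate every item in a sequence to every other item. This is a toy illustration only and not how Cosmos 3 is implemented: the function names and the three-token example are hypothetical, and the learned weight matrices of a real transformer are left out for brevity.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy scaled dot-product self-attention.

    Assumption: each token vector serves as its own query, key, and value.
    A real transformer applies learned projections first.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, all_weights = [], []
    # Each query's result depends only on the full input sequence, not on
    # the other queries' results, so real hardware can compute every
    # position at the same time. The loop here is just for readability.
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        mixed = [
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical three-token sequence; tokens 0 and 2 are identical.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
```

In this example, the first token assigns equal weight to itself and to the identical third token, and less weight to the dissimilar second token, which is the "tracking relationships" part in miniature.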
According to NVIDIA, Cosmos 3 can be used as a:
- Vision language model
- World model that simulates physical environments and predicts future world states
- Foundation for other world models
Finally, note that Cosmos 3 Super, which offers the highest-fidelity responses, and Cosmos 3 Nano are available right now, while Cosmos 3 Edge, geared towards real-time inference on edge devices, is coming soon.