
NVIDIA Announces New Collaboration Milestones: Mistral Open-Source Models Accelerated, Improving Efficiency and Accuracy at Any Scale

Through optimization techniques tailored to large, advanced mixture-of-experts (MoE) models, Mistral Large 3 achieves best-in-class performance on the NVIDIA GB200 NVL72 system: a tenfold performance improvement over the previous-generation H200 chip, processing over 5 million tokens per second per megawatt (MW) of energy consumed. The Ministral 3 series of small models reaches inference speeds of up to 385 tokens per second on the NVIDIA RTX 5090 GPU.
On Tuesday Eastern Time, NVIDIA disclosed significant breakthroughs achieved in collaboration with French artificial intelligence (AI) startup Mistral AI. Built on NVIDIA's latest chip technology, the new members of the Mistral AI open-source model family deliver leaps in performance, efficiency, and deployment flexibility.
At the core of this collaboration is the Mistral Large 3 model, which achieves a tenfold performance improvement on the NVIDIA GB200 NVL72 system compared to the previous-generation H200 chip. This leap translates into a better user experience, lower per-response costs, and higher energy efficiency: the model can process over 5 million tokens per second per megawatt (MW) of energy consumed.
In addition to the large model, the small model series named Ministral 3 has also been optimized for NVIDIA's edge platforms, capable of running on RTX PCs, laptops, and Jetson devices. This enables enterprises to deploy AI applications in any scenario from cloud to edge without relying on continuous network connectivity.
The new model family released by Mistral AI on Tuesday includes one large cutting-edge model and nine small models, all available through open-source platforms like Hugging Face and major cloud service providers. Industry insiders believe that this series of releases marks a new phase of "distributed intelligence" in open-source AI, bridging the gap between research breakthroughs and practical applications.
GB200 System Drives Large Model Performance Breakthrough
Mistral Large 3 is a mixture-of-experts (MoE) model with 675 billion total parameters, 41 billion active parameters, and a context window of 256,000 tokens. The architecture activates only the most relevant experts for each token rather than the entire network, achieving efficient scaling while maintaining accuracy.
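For readers unfamiliar with MoE routing, the sketch below shows the core idea in plain NumPy: a router scores every expert for each token, and only the top-scoring experts are actually computed. The dimensions, expert count, and top-2 gating here are illustrative assumptions, not Mistral Large 3's actual configuration.

```python
import numpy as np

# Minimal sketch of mixture-of-experts (MoE) top-k routing, for illustration.
# Sizes and the top-2 gating scheme are illustrative assumptions only.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

# Each "expert" is a small feed-forward network; only top_k run per token.
experts = [
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,
     rng.standard_normal((4 * d_model, d_model)) * 0.02)
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route one token vector to its top_k experts and mix their outputs."""
    logits = x @ router                        # router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top_k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                   # softmax over selected experts
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        w1, w2 = experts[i]
        out += w * (np.maximum(x @ w1, 0) @ w2)  # ReLU feed-forward expert
    return out

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)  # (64,) -- only 2 of 8 experts were computed
```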
NVIDIA says that through a series of optimizations tailored to large, advanced MoE models, Mistral Large 3 achieves best-in-class performance on the NVIDIA GB200 NVL72.

NVIDIA attributes the performance breakthrough to three key optimizations. The first is Wide Expert Parallelism, which fully exploits NVLink's coherent memory domain through optimized MoE kernels, expert placement, and load balancing. The second is NVFP4 low-precision inference, which cuts compute and memory costs while preserving accuracy. The third is the Dynamo distributed inference framework, which improves long-context performance by disaggregating the prefill and decode stages.
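Of the three, the low-precision piece is the easiest to illustrate in isolation. The following NumPy sketch simulates block-scaled 4-bit quantization: each 16-weight block shares one scale factor, and weights snap to the small grid of magnitudes an FP4 (E2M1) format can represent. The block size and value grid mirror the general shape of formats like NVFP4, but the scale encoding and rounding here are simplified assumptions, not NVIDIA's kernel implementation.

```python
import numpy as np

# Rough simulation of block-scaled 4-bit (FP4-style) weight quantization.
# The grid holds the magnitudes representable in an E2M1 FP4 format; the
# 16-element block size mirrors NVFP4's general design. Scale encoding and
# rounding are simplified assumptions, not the production kernel.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_dequantize(weights, block=16):
    """Quantize each block to FP4 magnitudes with one shared scale per block."""
    w = weights.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0
    scaled = np.abs(w) / scale
    # Round each magnitude to the nearest representable FP4 value.
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(w) * FP4_GRID[idx] * scale
    return deq.reshape(weights.shape)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
w4 = quantize_dequantize(w)
print("mean abs error:", np.abs(w - w4).mean())
print("storage: ~4 bits per weight plus one scale per 16 weights")
```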
The model is compatible with mainstream inference frameworks such as TensorRT-LLM, SGLang, and vLLM. Using these open-source tools, developers can deploy the model on NVIDIA GPUs at various scales, choosing the precision format and hardware configuration that best fits their needs.
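As a concrete example, here is a minimal vLLM offline-inference sketch. The vLLM API calls are standard, but the Hugging Face model ID below is a placeholder and tensor_parallel_size depends on your hardware; check the actual repository name before running.

```python
# Minimal sketch of serving one of the new models with vLLM's offline API.
# The model ID is a hypothetical placeholder -- look up the exact Hugging
# Face repository name first.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Ministral-3-14B-Instruct",  # hypothetical model ID
    tensor_parallel_size=1,                      # single GPU; raise for larger models
)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Summarize mixture-of-experts models in one sentence."], params
)
print(outputs[0].outputs[0].text)
```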
Small Models Target Edge Device Deployment
The Ministral 3 series comprises nine dense, high-performance models across three parameter scales (3 billion, 8 billion, and 14 billion), each offered in three variants: base, instruct, and reasoning. All variants support vision capabilities, handle context windows of 128,000 to 256,000 tokens, and support multiple languages.
These small models can achieve an inference speed of up to 385 tokens per second on NVIDIA RTX 5090 GPUs. On Jetson Thor devices, the vLLM container can reach 52 tokens per second under single concurrency and scale up to 273 tokens per second under eight concurrent sessions.
NVIDIA has collaborated with Ollama and llama.cpp to optimize the edge performance of these models. Developers can run these models on NVIDIA edge platforms such as GeForce RTX AI PCs, DGX Spark, and Jetson devices, achieving faster iteration speeds, lower latency, and stronger data privacy protection.
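For a local, single-machine setup, a hedged sketch using llama.cpp's Python bindings (llama-cpp-python) might look like the following. The GGUF filename is hypothetical: a real run requires downloading an actual Ministral 3 quantization from Hugging Face first.

```python
# Minimal sketch of local edge inference via llama-cpp-python.
# The GGUF filename below is a hypothetical placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./ministral-3-8b-instruct-q4_k_m.gguf",  # hypothetical file
    n_gpu_layers=-1,   # offload all layers to the local RTX GPU
    n_ctx=8192,        # context length for this session; the models support more
)
result = llm("Q: What is edge AI? A:", max_tokens=128, stop=["Q:"])
print(result["choices"][0]["text"])
```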
Because they can run on a single GPU, Ministral 3 models can be deployed on devices such as robots, autonomous drones, cars, mobile phones, and laptops. This flexibility lets AI applications operate in environments with limited or no network connectivity.
Mistral's New Model Family Accelerates Commercialization
The new model series released by Mistral AI on Tuesday is the company's latest effort to catch up with leading AI labs like OpenAI, Google, and DeepSeek. Founded in 2023, the company completed a €1.7 billion financing round last September, with Dutch chip equipment manufacturer ASML contributing €1.3 billion, and NVIDIA also participating, reaching a valuation of €11.7 billion.
Guillaume Lample, co-founder and chief scientist of Mistral AI, stated that although large closed-source models perform better in initial benchmark tests, small models often match or even surpass large models in enterprise-specific use cases after targeted fine-tuning. He emphasized that the vast majority of enterprise use cases can be addressed with fine-tuned small models, which are more cost-effective and faster.
Mistral AI has begun to accelerate its commercialization process. This Monday, the company announced an agreement with HSBC to provide model access for tasks ranging from financial analysis to translation for the multinational bank. Additionally, the company has signed contracts worth hundreds of millions of dollars with several enterprises and is expanding into physical AI, collaborating on robotics, drones, and in-vehicle assistant projects with the Singapore Ministry of Home Affairs, German defense tech startup Helsing, and automaker Stellantis.
Mistral Large 3 and Ministral-14B-Instruct are now available to developers through the NVIDIA API catalog and a preview API. Enterprise developers will soon be able to deploy these models on any GPU-accelerated infrastructure using NVIDIA NIM microservices. All models in the Mistral 3 family can be downloaded from Hugging Face.
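Hosted models in the NVIDIA API catalog are typically exposed through an OpenAI-compatible endpoint, so a minimal client call might look like the sketch below. The endpoint URL follows the catalog's usual pattern and the model ID is an assumption; confirm both on the model's catalog page.

```python
# Minimal sketch of calling the hosted preview through an OpenAI-compatible
# client. Endpoint and model ID are assumptions -- verify them in the
# NVIDIA API catalog entry before use.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # usual catalog endpoint
    api_key="nvapi-...",                             # your NVIDIA API key
)
resp = client.chat.completions.create(
    model="mistralai/mistral-large-3",               # hypothetical model ID
    messages=[{"role": "user", "content": "Give one use case for an MoE model."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```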
