How was Gemini 3 trained without NVIDIA?

Wallstreetcn
2025.11.25 01:00

Google has launched the newly upgraded multimodal AI model Gemini 3, which uses a sparse Mixture-of-Experts Transformer architecture and supports multimodal inputs such as text, images, and audio. The model was trained from scratch on Google TPU Pods with JAX and can handle ultra-long context. Although it still lags behind ChatGPT 5.1 in deep reasoning, it can meet the AI needs of most users.

After more than a year of dormancy, Google is back with the newly upgraded multimodal Gemini 3, featuring a fully upgraded front-end UI and enhanced performance. Although there are still gaps in deep reasoning and contextual consistency compared to ChatGPT 5.1 thinking, it can generally meet the basic AI needs of the vast majority of users.

How was Gemini 3 trained? Was it built entirely on Google TPUs? These are the core questions everyone is asking.

Gemini 3 = sparse Mixture-of-Experts (MoE) Transformer + native multimodality (text/image/audio/video) + ultra-long context (input up to 1M tokens, output 64k) + RL that reinforces "multi-step reasoning / theorem proving", as one complete stack; and it is a new model trained from scratch on Google's own TPU Pods with JAX + Pathways.

Let’s break it down into several layers: architecture, training data and process, computing power/system design, and then discuss "the logic behind this design."

Architecture: Sparse MoE Transformer + Native Multimodal + Ultra-long Context

1. Core Framework: Sparse Mixture-of-Experts Transformer

The official model card states:

  • Architecture = Sparse Mixture-of-Experts (MoE) Transformer
  • Natively supports text, vision (images), and audio input (video is usually split into image frames + audio sequences).

Key points of MoE:

  • Each layer has many "expert sub-networks" (experts);
  • There is a routing/gating sub-network at the front that decides which experts to send each token to;
  • Each token only activates a few experts, not all parameters are run through;
  • This allows for a very large total parameter count (externally estimated total capacity > 1T level) while keeping the computational cost of single inference manageable.

It’s like not calling all employees in the company to a meeting for every issue, but routing to 2-3 of the most suitable small groups to handle it.
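
To make the routing mechanics concrete, here is a minimal top-2 routing sketch in JAX (the framework the model was trained with). The expert count, dimensions, and the dense "run every expert, then gather" shortcut are illustrative assumptions for readability; Gemini 3's actual router is not public.

```python
# Minimal top-2 expert routing sketch in JAX (illustrative only; Gemini 3's
# real router, expert count, and sizes are not disclosed).
import jax
import jax.numpy as jnp

def moe_layer(tokens, gate_w, expert_w, top_k=2):
    """tokens: [n_tokens, d_model]; gate_w: [d_model, n_experts];
    expert_w: [n_experts, d_model, d_model] (one weight matrix per expert)."""
    logits = tokens @ gate_w                          # [n_tokens, n_experts]
    probs = jax.nn.softmax(logits, axis=-1)
    topk_p, topk_idx = jax.lax.top_k(probs, top_k)    # pick the k best experts per token
    topk_p = topk_p / topk_p.sum(-1, keepdims=True)   # renormalize the chosen gates

    # For clarity this runs every expert on every token; real systems dispatch
    # each token only to its chosen experts, which is where the compute saving comes from.
    all_out = jnp.einsum('td,edh->teh', tokens, expert_w)       # [n_tokens, n_experts, d_model]
    gathered = jnp.take_along_axis(all_out, topk_idx[:, :, None], axis=1)
    return (topk_p[:, :, None] * gathered).sum(axis=1)          # weighted mix of k experts

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 16))        # 8 tokens, d_model = 16
gate = jax.random.normal(key, (16, 4))     # 4 experts
experts = jax.random.normal(key, (4, 16, 16))
print(moe_layer(x, gate, experts).shape)   # (8, 16)
```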

2. Native Multimodal (Text + Vision + Audio + Video)

The model is designed "multimodal-first," rather than "text first, with a visual encoder bolted on later." Text tokens, image patches, and audio frames all enter the same Transformer backbone, with modality-specific encoders at the front mapping the different modalities into the same vector space. Google has also built an image model called Nano Banana Pro on this foundation, directly using Gemini 3 Pro as the "brain" for image generation/editing.

The benefits of this native multimodal approach include:

  • Cross-modal reasoning: For example, understanding "why this experiment failed" by watching a video + reading explanatory text together;
  • Very friendly to product scenarios (screenshots of search interfaces, code + error screenshots, lecture videos + PDFs).

3. Ultra-long context: 1M Token input, 64k output

  • Official model card: Input context limit of 1,000,000 tokens, output limit of 64,000 tokens.
  • MarkTechPost's article also confirmed this and emphasized that it is "the key to allowing the agent to consume entire codebases/long documents/multi-hour videos."

In terms of implementation, Google has not disclosed all the details, but the open-source Gemma 3 report shows the recent approach: more local-attention layers with shorter local spans to keep the KV cache from exploding, plus a small number of global-attention layers to aggregate key information.

So you can understand it as: using cheap local attention in local windows, occasionally inserting a layer of "global perspective" for information integration, and then using MoE to distribute computation across different experts to collectively support 1M context.
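
As a rough illustration of that local-vs-global split, the sketch below builds the two attention masks in JAX. The window size and the mask-based formulation are assumptions patterned on the public Gemma 3 report, not disclosed Gemini 3 values.

```python
# Causal vs sliding-window attention masks (illustrative; modeled on the
# Gemma 3 recipe, since Gemini 3's exact layer mix is not public).
import jax.numpy as jnp

def causal_mask(seq_len):
    # Standard causal mask: token i may attend to tokens 0..i.
    return jnp.tril(jnp.ones((seq_len, seq_len), dtype=bool))

def local_causal_mask(seq_len, window):
    # Sliding-window mask: token i may attend only to tokens i-window+1..i,
    # so the per-layer KV cache is bounded by `window` instead of seq_len.
    i = jnp.arange(seq_len)[:, None]
    j = jnp.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

seq_len, window = 16, 4
print(local_causal_mask(seq_len, window).sum(axis=1))  # at most 4 keys per query
print(causal_mask(seq_len).sum(axis=1))                # grows linearly with position
```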

4. Differences from Gemini 2.5

The official statement is very clear:

  • It is not a fine-tuned version of 2.5, but a new generation architecture trained from scratch.
  • It significantly surpasses 2.5 Pro in various reasoning, multimodal, and long context benchmarks.

Training Data: Multimodal + Multi-source + Large-scale Cleaning

1. Composition of Pre-training Data

The model card discloses this in some detail:

A large-scale, multimodal, multi-domain corpus:

  • Public web documents & texts
  • Code (in various languages)
  • Images
  • Audio (including speech and other audio types)
  • Video

Types of data sources:

  • Publicly downloadable datasets
  • Web-crawled data (complying with robots.txt)
  • Commercially licensed data
  • User data from Google products & interaction data with the model (under corresponding TOS/privacy policies and user control)
  • Data generated from Google’s internal operations
  • AI synthetic data

So overall, it can be understood as a mix of "public internet + licensed copyright libraries + internal product behavior logs + internal & synthetic data," all fed in multimodally.

2. Data Cleaning and Security Filtering

The same model card also outlines the data processing workflow:

  • Deduplication
  • Compliance with robots.txt
  • Various security filters (blocking content such as pornography, violence, CSAM, etc.)
  • Quality filtering to remove junk/unrelated content

These are both safety requirements and necessary for stable training (too much dirty data can directly hinder convergence).
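
As a toy illustration of two of these steps, here is a sketch of exact deduplication by content hash plus a crude length-based quality filter. The function name and thresholds are made up for illustration; real pipelines use near-duplicate detection (e.g., MinHash) and learned quality/safety classifiers.

```python
# Toy dedup + quality-filter sketch (illustrative only, not Google's pipeline).
import hashlib

def dedup_and_filter(documents, min_words=20):
    seen = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:                 # exact duplicate -> drop
            continue
        seen.add(digest)
        if len(doc.split()) < min_words:   # too short / junk -> drop
            continue
        kept.append(doc)
    return kept
```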

Training Process: Pre-training + Instruction Fine-tuning + RL (Human & Critic Feedback)

The official documentation does not provide ultra-detailed loss functions and schedules, but the framework is a typical "three-stage" process:

1. Stage One: Self-supervised Pre-training (Base Model)

On the aforementioned multimodal data, self-supervised training similar to "next token prediction" is performed; text/code uses a standard autoregressive objective; images/audio/video are adapted through encoding methods, treating patches/frames as tokens to predict.

Goal: To learn general language + world knowledge + multimodal representation, regardless of the task or instruction.
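
A minimal JAX/Optax sketch of that autoregressive objective follows; it is illustrative only and not Google's training code.

```python
# Next-token prediction loss: predict token t+1 from everything up to t.
import jax.numpy as jnp
import optax

def next_token_loss(logits, token_ids):
    """logits: [batch, seq, vocab]; token_ids: [batch, seq]."""
    pred = logits[:, :-1, :]     # predictions for positions 0..T-2
    target = token_ids[:, 1:]    # the tokens that actually came next
    loss = optax.softmax_cross_entropy_with_integer_labels(pred, target)
    return loss.mean()
```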

2. Stage Two: Supervised Instruction Fine-tuning (SFT)

  • Fine-tuning with "high-quality multimodal instruction data written by humans":
    • Question answering, dialogue, code generation, reasoning problems
    • Image-text question answering, video understanding, audio understanding
  • This step is akin to transforming a "talking brain" into an "assistant that listens to instructions and acts."

The model card collectively refers to this part as instruction tuning data.
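
In practice the main mechanical difference from pre-training is usually loss masking: the same next-token objective, but computed only on the response tokens. A hedged sketch (Google has not published Gemini 3's SFT recipe):

```python
# SFT loss with prompt masking: instruction tokens carry no gradient,
# only the target response does. Illustrative, not Google's implementation.
import jax.numpy as jnp
import optax

def sft_loss(logits, token_ids, response_mask):
    """response_mask: 1.0 where the token belongs to the target response,
    0.0 for prompt/instruction tokens that should not contribute to the loss."""
    per_token = optax.softmax_cross_entropy_with_integer_labels(
        logits[:, :-1, :], token_ids[:, 1:])
    mask = response_mask[:, 1:]
    return (per_token * mask).sum() / jnp.maximum(mask.sum(), 1.0)
```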

3. Stage Three: Reinforcement Learning + Secure Deployment

Gemini 3's documentation describes RL more plainly than previous versions: reinforcement learning from human and critic feedback.

Humans annotate which responses are better; a "critic model" then provides automatic scoring. The data used in reinforcement learning is particularly emphasized:

  • Multi-step reasoning data

  • Problem-solving data

  • Theorem proving data

In other words, they specifically use RL to guide the model towards "slow reasoning, problem decomposition, and performing mathematics/proofs." This also explains why Gemini 3 performs better than 2.5 and many competitors on high-difficulty reasoning benchmarks like Humanity’s Last Exam and ARC AGI 2.
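
As a generic illustration of "score candidate answers with a critic, then up-weight the better ones," here is a REINFORCE-style sketch. This is a stand-in for the general idea, not Gemini 3's actual RL algorithm, and the numbers are invented.

```python
# Reward-weighted policy-gradient sketch: push probability mass toward
# answers the critic scored higher. Illustrative only.
import jax.numpy as jnp

def reinforce_loss(logprobs, rewards):
    """logprobs: [n_samples] summed log-prob of each sampled answer under the policy;
    rewards: [n_samples] critic/human scores for the same answers."""
    advantages = rewards - rewards.mean()      # baseline-subtracted scores
    return -(advantages * logprobs).mean()     # minimize -> raise prob of better answers

logprobs = jnp.array([-12.3, -15.1, -11.8])
rewards = jnp.array([0.9, 0.2, 0.7])           # e.g., the critic prefers answer 1
print(reinforce_loss(logprobs, rewards))
```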

Regarding safety: they treat data filtering + conditional pre-training + SFT + RLHF + product-level safety filtering as a layered defense, and they conduct red teaming and capability assessments under their own Frontier Safety Framework.

Computing Power and Systems: TPU Full Stack + JAX + Pathways

An important "meta-narrative" of Gemini 3 is: "Being at the forefront without NVIDIA."

1. Hardware: Fully trained on Google’s own TPU

The model card states clearly:

  • Training is entirely completed on Google Tensor Processing Units (TPUs);
  • Using TPU Pods (large-scale TPU clusters) to support multi-device distributed training;
  • Achieving better model quality + energy efficiency by leveraging TPU's high-bandwidth memory and large batch sizes.

External articles emphasize: Gemini 3 proves a complete path of "self-developed chips + own cloud," achieving frontier-level capabilities without relying on the GPU supply chain.

2. Software Stack: JAX + ML Pathways

Model card: Training uses JAX + ML Pathways. Pathways is Google’s own multi-machine multi-task training framework, suitable for this kind of MoE + ultra-long context large model parallelism. Combined with the MoE architecture, you can imagine the system-level challenges it needs to address:

  • How to slice/place expert parameters on TPU Pods;
  • How to balance the load of token routing across devices;
  • How to shard and recycle the KV cache for ultra-long contexts;
  • Ensuring training throughput and stability under these constraints.

These implementation details are not disclosed, but from their emphasis on "sparse MoE + 1M context practicalization," it can be seen that system engineering plays a significant role.
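
For a flavor of what "placing expert parameters across a TPU Pod" looks like at the API level, here is a small sketch using JAX's public sharding primitives. The mesh layout, array sizes, and single `expert` axis are illustrative assumptions; Pathways' real partitioning is far more involved.

```python
# Shard the expert axis of an MoE weight tensor across available devices.
# Illustrative sketch only, not Google's production setup.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = jax.devices()                         # e.g., the chips in one TPU slice
mesh = Mesh(np.array(devices), axis_names=('expert',))

# Expert weights: [n_experts, d_model, d_ff]; shard the expert axis so each
# device holds only its own experts' parameters.
n_experts, d_model, d_ff = len(devices), 128, 512
weights = jnp.zeros((n_experts, d_model, d_ff))
sharded = jax.device_put(weights, NamedSharding(mesh, P('expert', None, None)))
print(sharded.sharding)
```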

Insights from Gemini 3's design choices:

From a methodological perspective, we can summarize the orientation of Google’s current generation of models:

  1. Capacity vs cost: Using MoE to trade system complexity for capacity, aiming for trillion-parameter expressiveness without running every parameter on every token; sparse MoE = "calling only the few experts most useful for this task," packing more knowledge and capability into the same compute budget.
  2. Scenario First: Native multi-modal + ultra-long context + agent capabilities. Multi-modal + 1M context is designed to directly consume: code repositories, product documentation, UI screenshots, video courses, system logs; combined with agent IDEs like Antigravity and "Generative UI," transforming the model into a true "operating system-level assistant," rather than just a chat tool.
  3. Inference First: Deliberately reinforcing multi-step reasoning and theorem proving in RL. Many frontier benchmarks (ARC AGI, GPQA, math competitions) emphasize "thinking step by step." So they explicitly use this type of data for RL, designing the reward as "think slowly but answer correctly."
  4. Safety and Compliance: Multi-layer protection from data to product. Filtering on the data side; safety-related objectives and RL penalties during training; policy enforcement, safety filtering, and Frontier Safety assessments at deployment.
  5. Full-stack Integration: Collaborative optimization of TPU + framework + model + product. Training is done entirely on their own TPU, deeply binding hardware features using JAX + Pathways; then vertically integrating into products like Search, Workspace, Antigravity IDE, AI Studio, etc.

Gemini 3 is more like "a MoE multimodal brain driven by TPUs," pre-trained with complex yet clean multimodal data, and then refined with RL to make "multi-step reasoning + agent behavior" practical for real-world use.

Why did Google choose Sparse MoE instead of Dense LLM?

Sparse MoE vs Dense LLM: What did we gain and what did we pay?

Sparse MoE = trading "more parameter capacity" for "more complex system engineering";

Dense LLM = trading "simplicity and stability" for "higher reasoning costs / more limited capacity."

1. Parameter Capacity vs Computational Cost

Consider a simplified example:

Dense model: 400B parameters, every layer uses all parameters for all tokens.

Sparse MoE: Assume 32 experts, each with 50B parameters. The model's "total capacity" ≈ 32 × 50B = 1.6T parameters; but the routing policy activates only 2 experts per token, so the parameters used in a single forward pass ≈ 2 × 50B = 100B.

So, for " single inference":

  • Dense 400B: fixed use of 400B;
  • Sparse MoE: logical capacity of 1.6T, but each token actually runs around 100B.

This is the core appeal of MoE:

Under the premise of "computational power being bearable," the total capacity can be made far greater than Dense, enhancing "memory & specialization capabilities."
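
Spelling out the arithmetic above (the 32-expert / 50B-per-expert numbers are the hypothetical example from the text, not Gemini 3's real configuration):

```python
# Total stored capacity vs per-token active parameters in the toy MoE example.
n_experts, params_per_expert, active_experts = 32, 50e9, 2

total_capacity = n_experts * params_per_expert          # 1.6e12 -> ~1.6T parameters stored
active_per_token = active_experts * params_per_expert   # 1.0e11 -> ~100B parameters used per token

dense_params = 400e9
print(total_capacity / dense_params)    # 4.0x the capacity of the dense 400B model
print(active_per_token / dense_params)  # 0.25x of its per-token compute
```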

2. Routing & Load Balancing: The First Major Pitfall of MoE

But this comes with a very challenging set of engineering problems:

  1. Routing/gating choices. Each token needs to select the "most suitable" 1-2 experts, and the router itself is a small network that must learn "which kind of token should go to which kind of expert." Early in training this easily degenerates into a few experts being picked constantly while the rest sit idle → training does not converge.
  2. Load balancing. A regularization/loss term is usually added to keep "popular experts from being overwhelmed," pushing expert usage toward uniform (a minimal sketch of this auxiliary loss follows this list). Too strong → routing gets "flattened" and loses specialization; too weak → a few experts are over-preferred and parameter utilization stays low.
  3. Cross-device communication costs. Experts are usually spread across different TPUs/GPUs; each layer must scatter, aggregate, and reassemble tokens according to the routing results, which demands a large amount of All-to-All communication. If that communication is not designed well, MoE turns into a giant network-storm generator and throughput plummets.
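
The auxiliary load-balancing loss mentioned in point 2 is commonly formulated as in Switch Transformer / GShard; a minimal sketch follows (Gemini 3's exact formulation is not published).

```python
# Auxiliary load-balancing loss: minimized when tokens and router probability
# are spread uniformly across experts. Illustrative, Switch/GShard-style.
import jax
import jax.numpy as jnp

def load_balancing_loss(router_logits, top1_idx, n_experts):
    """router_logits: [n_tokens, n_experts]; top1_idx: [n_tokens] chosen expert per token."""
    probs = jax.nn.softmax(router_logits, axis=-1)
    # f_i: fraction of tokens actually dispatched to expert i
    frac_tokens = jnp.mean(jax.nn.one_hot(top1_idx, n_experts), axis=0)
    # p_i: mean router probability assigned to expert i
    mean_prob = jnp.mean(probs, axis=0)
    # n_experts * sum(f_i * p_i) equals 1 at perfect balance, larger when skewed
    return n_experts * jnp.sum(frac_tokens * mean_prob)
```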

Dense LLM is much simpler:

  • All layers & parameters are sliced in order, just requiring data parallelism / tensor parallelism;
  • There is no additional routing logic, nor All-to-All expert distribution.

3. Expressive Ability: Generalist vs Specialist

The "theoretical selling point" of MoE is that different experts can learn different "styles/domains/tasks":

  • Some are better at coding;

  • Some are better at mathematics;

  • Some are better at dialogue/chitchat;

  • For specific tokens/tasks, only the "most suitable" experts are called upon.

This brings about several interesting phenomena:

  1. "Expert personalities," in visualized routing patterns, can show that certain experts are activated only near "code blocks + error messages"; others are used more in "multi-step mathematical derivations."
  2. Local overfitting vs global generalization. Benefit: performance on narrow tasks can be very strong (lots of expert parameters focused on a narrow range); risk: if the router does not learn well, some experts overfit to particular phrasings/data distributions, and performance drops when the phrasing changes.

Dense LLM, on the other hand, is a complete "generalist mode": all tokens use the same set of parameters; it is easier to maintain robustness during distribution shifts, but requires higher capacity and computational power.

4. Stability of Training & Inference

Advantages of Dense LLM:

  • Simple implementation, stable optimization;
  • Issues like "expert idling" and "routing breakdown" do not occur;
  • Tuning & debugging is much less difficult.

Typical Troubles of Sparse MoE:

  1. Worse training stability. Once the router biases toward a few experts, training skews; careful warmup, loss design, and even curricula are needed to stabilize it.
  2. More tuning dimensions. The number of experts, the number of experts activated per token, the capacity factor (how many tokens each expert can handle), load-balancing loss weights, etc. are all additional hyperparameters.
  3. Higher complexity in deployment & inference. Multi-device expert placement; latency and memory fragmentation caused by routing; online serving has to coordinate with the KV cache and batching. All of this is significantly more complicated than Dense.

But at Gemini 3's scale:

  • Stacking Dense further would make inference costs extremely exaggerated;
  • Doing full-stack MoE optimization on TPU is something Google can control;
  • So they chose the path of "higher system complexity in exchange for greater capacity and lower inference costs."

So, Google's use of MoE transforms the "scaling law of model capacity" from "relying entirely on computing power" to "spending more on system engineering + some computing power."

What about the hallucination situation?

Gemini 3 is SOTA in "answering things it knows very well," but it does not perform well in "honestly saying it doesn't know when it doesn't."

Several key benchmarks:

  1. SimpleQA Verified (factual question answering accuracy). On simple factual questions, it clearly "knows much more" than competitors:
    • Gemini 3 Pro: 72.1% accuracy
    • Gemini 2.5 Pro: 52.9%
    • GPT-5.1: about 35%; Claude Sonnet 4.5 is even lower.
  2. AA-Omniscience (joint knowledge + hallucination evaluation). Gemini 3 Pro ranks first in both the Omniscience Index total score and Accuracy. However, in the same evaluation its Hallucination Rate is approximately 88%, similar to Gemini 2.5 Pro. What does that 88% mean? Essentially, when it does not answer correctly, in ~88% of cases it confidently gives a wrong answer instead of saying "I don't know / can't confirm."

So:

  • "Gemini 3 indeed gives correct answers more often than its predecessor and many competitors";
  • But it also "still loves to fabricate when it doesn't know, and it appears very confident."

Many media and analysts have pointed this out directly: "first place on reliability benchmarks, yet a stubbornly high hallucination rate." So the hallucination problem in Gemini 3 still looks quite serious, with almost no progress over 2.5 in "saying I don't know," even though it clearly leads on many reasoning, multimodal, and factual-accuracy benchmarks.

So a more reasonable positioning might be:

This is a huge brain that is "knowledge-rich, reasoning-strong, but has poor self-awareness (knowing what it doesn't know)."

Regarding how to use Gemini, my suggestion: treating it as a co-pilot for "generating research structure + exploring blind spots + scenario/ontology work" is more appropriate.

Risk Warning and Disclaimer

The market has risks, and investment should be cautious. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article are suitable for their specific circumstances. Investing based on this is at one's own risk.