Wallstreetcn
2024.07.31 06:25

Apple AI boosts Google's TPU, while behind-the-scenes trump card OCS is also a key part of the computing power story

Apple disclosed in a paper on its official website that its training models use Google's TPUv4 and TPUv5 chips, which Google says deliver higher performance and energy efficiency than NVIDIA's A100. The TPU is a dedicated processor introduced by Google that specializes in large matrix operations and is suited to the tensor operations in deep learning tasks. Compared to GPUs, TPUs are faster on specific AI computing tasks but may be less flexible or efficient for other workloads. Apple relies on Google's chips and software in the field of AI, but the specific extent of that reliance has not been disclosed.

On July 29th local time, Apple disclosed in a paper on its official website that its training models use Google's fourth-generation AI ASIC chip, TPUv4, and its latest-generation chip, TPUv5.

As early as the Worldwide Developers Conference (WWDC) in June this year, media outlets had already noticed from the technical details Apple disclosed that Google had become another winner in Apple's AI push. Apple's engineers used the company's self-developed framework software and various hardware, including the Tensor Processing Units (TPUs) available only on Google Cloud, when building foundation models. However, Apple did not disclose how heavily it relies on Google's chips and software compared with other AI hardware suppliers such as NVIDIA.

TPU - Dedicated Chip for AI Training

TPU (Tensor Processing Unit) is a dedicated processor for machine learning first introduced by Google in 2016.

This processor excels at large matrix operations, allowing for more efficient model training. The HBM (high-bandwidth memory) integrated into the chip also helps with training larger-scale models. Additionally, multiple TPUs can be combined into Pod clusters, greatly improving the efficiency of neural network workloads.
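
To make "tensor operations" concrete, the short sketch below uses JAX, Google's Python framework that targets TPUs natively. It is an illustrative example of the kind of workload a TPU's matrix units accelerate, not code from Apple's or Google's papers; on a Cloud TPU VM, jax.devices() would list TPU cores, while on other machines the same code falls back to CPU or GPU.

```python
# Minimal sketch of a TPU-friendly tensor workload in JAX (illustrative only).
# The compiled function is a large matrix multiply plus an activation, the kind
# of operation a TPU's matrix units are built to accelerate; the same code runs
# unchanged on CPU/GPU if no TPU is attached.
import jax
import jax.numpy as jnp

print("Backend devices:", jax.devices())  # lists TpuDevice(...) entries on a Cloud TPU host

@jax.jit  # XLA compiles this for whatever accelerator is available
def dense_layer(x, w):
    return jnp.maximum(x @ w, 0.0)  # matmul + ReLU

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (1024, 4096))   # a batch of activations
w = jax.random.normal(key, (4096, 4096))   # a weight matrix
y = dense_layer(x, w)
print(y.shape)  # (1024, 4096)
```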

Compared to the mainstream NVIDIA GPU products on the market, TPUs have the following characteristics:

  • In terms of core count, GPUs have a large number of processing cores that can handle multiple tasks simultaneously, while TPUs have relatively fewer cores, but each core is optimized for deep learning workloads.
  • In terms of applicability, GPUs provide a certain level of versatility and can handle tasks including graphics rendering, scientific computing, and deep learning. TPUs focus on tensor operations in deep learning, making them potentially faster than GPUs for specific AI computing tasks, but less flexible or efficient for other types of tasks.
  • In terms of applications, GPUs are widely used for various compute-intensive tasks thanks to their versatility and flexibility, including gaming, movie production, scientific research, financial modeling, and deep learning training. TPUs, being optimized for deep learning, are typically used for deep learning inference tasks requiring high throughput and low latency, such as search engines, recommendation systems, and autonomous vehicles.
  • In terms of performance, Google has stated in a paper that for systems of comparable scale, TPU v4 can provide 1.7 times the performance of NVIDIA A100 and also improve energy efficiency by 1.9 times.

Additionally, according to Huachuang Securities, Google launched two chips, TPUv5e and TPUv5p, in 2023. At the same cost, TPUv5e provides up to 2 times higher training performance and 2.5 times higher inference performance than TPUv4 for large language models and generative AI models. TPUv5p is the most powerful, scalable, and flexible AI chip Google has ever released, with a training speed for large language models 2.8 times that of TPUv4, nearly 50% higher than TPUv5e.

Currently, the TPU has become the main workhorse for Google's large-scale model training: judging from usage, over 90% of Google's model training is conducted on TPUs.

Google's Secret Weapon - OCS

In addition, according to earlier Google papers, when building the TPUv4 cluster, the OCS (Optical Circuit Switch) solution offers lower cost, lower power consumption, and faster deployment than the traditional InfiniBand switch solution.

OCS is a data center optical switch developed in-house by Google. It uses arrays of MEMS micro-mirrors to reflect and switch optical signals, replacing the original optoelectronic hybrid switching system.

Caitong Securities pointed out that Google Gemini mainly uses TPU v4 and TPU v5e for large-scale training. Starting from TPU v4, Google began using the OCS optical switch, which switches optical paths between 64 TPU slices using MEMS micro-mirror arrays. This allows data links to be selected flexibly and the network to be expanded according to the actual traffic in the network. It also means that when higher-speed optical modules and switches are deployed, existing lower-speed devices can continue to be used, reducing cost and power consumption. The demand for high-speed optical modules in large-scale AI chip networking is expected to expand further, and the OCS all-optical solution may bring new growth to optical devices.
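
To make the circuit-switching idea concrete, here is a small illustrative Python sketch (our own simplification, not Google code): an OCS behaves like a reconfigurable patch panel in which each MEMS mirror steers one input port to one output port, so rewiring the cluster topology means loading a new port mapping rather than replacing transceivers or switches.

```python
# Illustrative only: an OCS modeled as a reconfigurable patch panel.
# Each MEMS mirror steers light from one input port to one output port, so a
# switch configuration is just a one-to-one port mapping; changing the cluster
# topology means loading a new mapping, not swapping the endpoint optics.
from typing import Dict

class OpticalCircuitSwitch:
    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.mapping: Dict[int, int] = {}  # input port -> output port

    def configure(self, mapping: Dict[int, int]) -> None:
        # A valid circuit configuration connects each input to at most one output.
        assert len(set(mapping.values())) == len(mapping), "output port reused"
        self.mapping = dict(mapping)

    def route(self, in_port: int) -> int:
        # Light entering in_port is reflected to exactly one output port;
        # there is no packet inspection or buffering, hence the low latency.
        return self.mapping[in_port]

# Example: connect 4 TPU "slices" in a ring, then rewire them into pairs
# without touching any endpoint hardware.
ocs = OpticalCircuitSwitch(num_ports=4)
ocs.configure({0: 1, 1: 2, 2: 3, 3: 0})   # ring topology
print(ocs.route(2))                        # -> 3
ocs.configure({0: 2, 2: 0, 1: 3, 3: 1})   # rewired into pairs
print(ocs.route(2))                        # -> 0
```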

In terms of the industry, Huachuang Securities stated that MEMS-based optical switching solutions offer advantages such as insensitivity to data rates and wavelengths, low power consumption, and low latency. Google chose to self-develop the three main components, the OCS, the optical modules, and the optical circulators, to build a low-cost, high-efficiency large-scale optical switching system. Specifically:

  1. MEMS reflective mirrors are the core components of OCS, and the innovative application of OCS helps expand the MEMS foundry business.

  2. Optical modules are customized to OCS requirements, designed as the latest-generation BiDi OSFP package using a circulator + CWDM4/8. Domestic optical module companies are highly competitive, and as future applications become more technically demanding, customer stickiness is expected to keep improving.

  3. The circulator is innovatively introduced into the optical module, further improving transmission efficiency. The circulator supply chain is relatively mature, but its core component, the Faraday rotator, has a low degree of domestic production; domestic manufacturers have achieved mass-production capability for polarization beam splitters in recent years.

  4. Optical chips and electrical chips are upgraded to meet higher link budget requirements, with EML and DSP chips mainly supplied by overseas vendors and having a low degree of domestic production.

  5. Copper cables and optical fibers benefit from connections both inside and outside the rack, generating significant demand.