The Future of AI Infrastructure: Competition between Google and Microsoft in Multi-Data Center Training

Wallstreetcn
2024.09.22 03:53

In the AI infrastructure race between Google and Microsoft, Google currently leads with its advanced computing systems and liquid-cooling technology, and plans to run gigawatt-scale AI training across multiple data center campuses. Microsoft, for its part, has launched an expansion plan to catch up on infrastructure, building multiple gigawatt-scale data centers and planning fully liquid-cooled designs to improve energy efficiency.

Comparison of Core Capabilities between Google and Microsoft

Infrastructure and Expansion Capabilities

  • Google: The Council Bluffs campus in Iowa has IT capacity close to 300 megawatts, with regional capacity projected to exceed 500 megawatts by 2025. By the end of 2025, total capacity across the Ohio and Iowa/Nebraska regions is expected to reach the 1-gigawatt level, with a gigawatt-scale cluster under construction in the Columbus area. Google is expected to run gigawatt-scale training across multiple campuses in 2025 and to form a gigawatt-scale AI training cluster by 2026.
  • Microsoft: Its largest AI training base, in Phoenix, plans to expand to 10 buildings, with approved sites for up to 24 data centers. A super-campus under construction in Wisconsin will become its largest single data center campus. Working with Oracle+Crusoe and CoreWeave+Core Scientific in Texas, Microsoft is building multiple gigawatt-scale data centers, with plans to expand nationwide and surpass Google in scale.

Cooling Technology

  • Google: Adopts a direct-to-chip water cooling method, transferring rack heat to the central facility water system through liquid-liquid heat exchangers, deploying over a million liquid-cooled TPUs with a total capacity exceeding 1 gigawatt.
  • Microsoft: Its current largest AI training cluster does not yet use liquid cooling. Fully liquid-cooled buildings dedicated to next-generation AI hardware are planned in Milwaukee and Atlanta.

Energy Efficiency

  • Google: Achieves a PUE of 1.1, with no need for chiller plants most of the time, using giant cooling towers and centralized water systems, capable of dissipating nearly 200 megawatts of heat.
  • Microsoft: PUE is 1.223, and in its air-cooled servers fan power alone exceeds 15% of server power. Future campuses are planned to use air-cooled chiller plants. Its water usage effectiveness (WUE) is 2.24 liters per kilowatt-hour, far above the industry average of 0.49.

AI Technology and Products

  • Google: The Gemini 1 Ultra model was the first to be trained across multiple data centers. The upcoming Gemini 2 is expected to close the gap with OpenAI and Anthropic in synthetic data, reinforcement learning, and model architecture. Google has deployed over a million liquid-cooled TPUs.
  • Microsoft: Collaborating with OpenAI, planning to interconnect multiple super-scale campuses to implement large-scale distributed training across the U.S., aiming to establish multi-gigawatt computing systems.

Communication Networks

  • Google: Its concentrated footprint in Ohio and Iowa/Nebraska can be further interconnected to support multi-gigawatt training of a single model, using high-bandwidth fiber networks to keep communication between data centers low-latency.
  • Microsoft: Working with Lumen Technologies and Zayo, Microsoft is using advanced fiber technology and infrastructure to support large-scale AI training clusters, planning to achieve low-latency communication and data transfer between multiple campuses over high-speed fiber networks.

Microsoft's Gigawatt-Scale AI Training Cluster Plan

I. Overview

  • Goal: Interconnect multiple campuses to build a large-scale AI training cluster.

  • Partners: Lumen Technologies and Zayo provide Microsoft with fiber optic technology support to construct a high-performance computing network.

  • Requirements: Process massive data volumes and achieve low-latency communication to meet the needs of AI model training.

II. Role and Challenges of Lumen Technologies

  • Agreement Signing: Lumen reached a $5 billion interconnection agreement with Microsoft and signed an agreement with Corning securing 10% of Corning's fiber capacity.
  • Market Demand: Fiber demand driven by AI is sharply rising, and Lumen plans to compete for another $7 billion in sales opportunities.
  • Idle Resources: Lumen has a large amount of "dark fiber" underutilized, facing upgrade opportunities.

III. Trends and Challenges in the Telecommunications Industry

  • Capital Expenditure: It is expected that future telecommunications capital expenditure will exceed $10 billion, specifically for AI training across multiple data centers.
  • Price Pressure: Due to declining internet prices and enterprise traffic migrating to the internet, there is a reduced demand for MPLS.

IV. Beneficiary Companies and Their Development Prospects

  • Fabrinet: Benefiting from the 400ZR product line, with telecommunications business accounting for nearly 40% of revenue and maintaining good cooperation with multiple telecommunications customers.
  • Lumentum: Driven by the growth in demand for ZR/ZR+ optical devices, significant revenue growth is expected.
  • Ciena: Holds a leading position in the telecommunications hardware market, especially driven by AI traffic demand, with continuous growth in orders.
  • Cisco: Achieved double-digit growth in orders from hyperscale customers, with expectations of continued AI-related orders in the future.
  • Marvell: Competitive advantage in the ZR optics and coherent DSP fields, rapid growth in related businesses, and broad market prospects.

Basic Knowledge

Distributed Training Across Multiple Data Centers

1. Concept and Objective: Distributed training across multiple data centers disperses training tasks to achieve higher computational efficiency and resource utilization.

2. Key Steps:

  • Data Partitioning: Divide training data into multiple mini-batches allocated to different data centers.
  • Gradient Computation: Each center independently computes gradients.
  • Gradient Synchronization: Use efficient communication mechanisms (such as all-reduce) to synchronize gradients to ensure consistent model parameters.
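
A minimal sketch of this synchronous pattern, assuming PyTorch's torch.distributed with a process group already initialized (the model, batch, and loss function here are placeholders, and real systems usually rely on DistributedDataParallel to do this automatically):

```python
import torch
import torch.distributed as dist

def sync_data_parallel_step(model, batch, loss_fn, optimizer):
    """One synchronous data-parallel step: local gradients, then all-reduce."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()  # each replica computes gradients on its own mini-batch

    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # Sum gradients across replicas, then average, so every replica
            # applies the identical update and stays in lockstep.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()  # parameters are identical on every replica afterwards
    return loss.item()
```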

3. Challenges and Issues:

  • Communication Overhead: With an increasing number of chips, communication latency and bandwidth requirements significantly rise, affecting overall training efficiency.
  • Latency Issues: Latency between data centers in different geographic locations significantly reduces training speed.
  • Lagging Nodes: Nodes with unbalanced performance may lead to a decrease in overall training task speed, affecting model convergence.

4. Solutions:

  • Asynchronous Training: Adopting an asynchronous update strategy to reduce dependence on global synchronization and improve efficiency.
  • Optimized Communication Protocol: Developing more efficient communication protocols to reduce latency and data exchange costs.
  • Dynamic Resource Adjustment: Real-time monitoring of network status, dynamically adjusting resource allocation to cope with latency and bandwidth fluctuations.

5. Other Considerations:

  • Scalability: According to Amdahl's Law, adding nodes does not always linearly improve training performance.
  • Monitoring and Optimization: Monitoring performance metrics such as MFU to identify and eliminate lagging nodes, maintaining training efficiency.
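
For reference, MFU (Model FLOPs Utilization) compares achieved throughput against hardware peak. A rough estimator under the common ~6 × parameters FLOPs-per-token approximation for dense transformers (the numbers in the example are hypothetical):

```python
def model_flops_utilization(params: float, tokens_per_second: float,
                            num_chips: int, peak_flops_per_chip: float) -> float:
    """Rough MFU estimate: achieved training FLOP/s divided by aggregate peak."""
    achieved = 6.0 * params * tokens_per_second  # ~6*N FLOPs per token (fwd+bwd)
    peak = num_chips * peak_flops_per_chip       # advertised peak for the precision used
    return achieved / peak

# Hypothetical: a 70B-parameter model on 10,000 chips rated at 1 PFLOP/s each,
# processing 10M tokens/s overall.
print(model_flops_utilization(70e9, 10e6, 10_000, 1e15))  # ~0.42
```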

Fault-Tolerant Training

1. Concept and Objective: Fault-tolerant training refers to designing distributed systems to cope with hardware failures, allowing the overall training process to continue even if some computing units (such as GPUs) fail, avoiding restarts from checkpoints and reducing resource idle time.

2. Key Steps:

  • Fault Detection: Real-time monitoring of computing unit status to promptly identify faulty nodes.
  • Resource Allocation: When a fault occurs, dynamically redistribute computing tasks to available GPUs.
  • State Recovery: Maintain model training status under fault-tolerant conditions without affecting overall training progress.

3. Challenges and Issues:

  • Insufficient Edge Case Coverage: Existing open-source libraries (such as TorchX) fail to handle all possible failure scenarios, limiting application scenarios.
  • Impact of Network Failures: In large-scale GPU clusters, network failures can lead to data packet retransmissions, affecting training efficiency.
  • Performance Discrepancies: Performance differences in different hardware (chip lottery effect) can affect the effectiveness of fault tolerance mechanisms.

4. Solutions:

  • Develop Comprehensive Fault-Tolerant Systems: Drawing inspiration from Google's Borg and Pathways, build fault-tolerant infrastructure covering more fault scenarios.
  • Improve Network Communication: Optimize data transmission mechanisms, reduce strict requirements for sequential transmission, and enhance fault tolerance.
  • Utilize Checkpoint Technology: Implement checkpoint saving of GPU process states and memory contents to support more flexible fault recovery.
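
A minimal checkpoint save/resume sketch in PyTorch style, assuming a single checkpoint file per job (real systems shard checkpoints per rank, write them asynchronously, and combine this with process-level snapshots such as CRIU):

```python
import os
import torch

def save_checkpoint(step, model, optimizer, path):
    """Persist training state so a failed job can resume instead of restarting."""
    tmp = path + ".tmp"
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, tmp)
    os.replace(tmp, path)  # atomic rename avoids half-written checkpoints

def load_checkpoint(model, optimizer, path):
    """Restore the latest checkpoint; returns the step to continue from."""
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1
```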

5. Other Considerations:

  • High-Temperature Burn-In Testing: Conduct thorough burn-in testing before system deployment to eliminate early failures and improve system stability.
  • Silent Data Corruption (SDC) Detection: Use tools (such as DCGMI) to monitor and identify SDC issues to ensure the accuracy of computation results.
  • Technology Confidentiality and Openness: Despite the increasing importance of fault-tolerant training methods, the level of openness of related technologies is relatively low, which may affect industry development and collaboration.

Training Strategies

1. Concept and Objective: Training strategies aim to optimize distributed training by reducing the number of global synchronizations and allowing parts of the workload to run independently, overcoming the diminishing returns described by Amdahl's Law. This suits training across campuses, multiple regions, or even continents.

2. Key Steps:

  • Hierarchical Synchronization: Set different synchronization frequencies based on delays and bandwidth differences to adapt to GPU configurations in different geographical locations.
  • Load Balancing: Dynamically adjust batch sizes across campuses based on the number of GPUs in each, keeping the load balanced during training (see the sketch after this list).
  • Parameter Server Mechanism: Use a multi-layer parameter server architecture that allows model replicas to exchange data with multiple servers frequently to ensure timely updates and convergence of global weights.
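
The load-balancing idea above, merging per-campus gradients weighted by how many GPUs (and therefore how much data) each campus contributed, can be sketched as follows; the function and variable names are illustrative only:

```python
def weighted_campus_average(campus_grads, campus_gpu_counts):
    """Merge per-campus averaged gradients, weighting each campus by its GPU
    count (proportional to the batch size it processed)."""
    total = sum(campus_gpu_counts)
    merged = None
    for grad, n in zip(campus_grads, campus_gpu_counts):
        contribution = grad * (n / total)
        merged = contribution if merged is None else merged + contribution
    return merged

# Hypothetical: campus A with 100,000 GPUs and campus B with 75,000 GPUs.
# B's gradients carry a weight of 75k / 175k ~= 0.43 in the merged update.
```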

3. Challenges and Issues:

  • Convergence Issue: Asynchronous updates may lead to convergence difficulties, requiring algorithm optimization to avoid instability.
  • Merging and Updating: In large model training, merging updates from different branches may introduce additional engineering complexity, affecting efficiency.
  • Network Bottleneck: Cross-regional training faces dual constraints of bandwidth and latency, which may impact overall training speed.

4. Solutions:

  • Optimize Asynchronous Parameter Servers: Improve existing asynchronous parameter server models to address update and convergence issues through algorithm innovation.
  • Powerful Sharders: Use tools like Google's MegaScaler to achieve efficient cross-campus training and weight exchange.
  • Network Infrastructure Optimization: Promote high-bandwidth, low-latency network technologies to reduce bandwidth requirements for cross-regional training.

5. Other Considerations:

  • Future Scalability: Model scale could reach on the order of 100 trillion parameters in the future, requiring the relevant infrastructure to be laid out in advance.
  • Fiber Deployment Cost: In cross-regional training, the cost and licensing issues of fiber deployment need to be considered to ensure the feasibility of network layout.
  • Industry Dynamics: A return to asynchronous training may require reassessing existing training strategies and infrastructure design to meet new technical requirements.

Modulation and Multiplexing Technologies

1. Concepts and Goals: Modulation and multiplexing technologies aim to improve signal transmission efficiency and bandwidth utilization, optimizing optical fiber communication, especially in data centers and telecommunication networks, to meet the demand for high-speed data transmission.

2. Key Steps:

  • Modulation Scheme Selection: Use advanced modulation schemes such as PAM4, 16-QAM, and 64-QAM to increase the number of bits transmitted per symbol and enhance transmission rates.
  • Dense Wavelength Division Multiplexing (DWDM): Combine multiple wavelengths of optical signals into a single fiber, significantly increasing bandwidth.
  • Application of Coherent Optical Technologies: Employ coherent light sources and digital signal processors (DSP) to implement complex modulation schemes, ensuring signal accuracy and stability.

3. Challenges and Issues:

  • Cost Issue: High-order modulation schemes and coherent optical devices (such as tunable lasers) are costly, limiting widespread adoption.

  • Fiber Quality Limitation: Long-distance transmission is limited by the fiber itself, requiring high-quality fiber and equipment to reduce signal attenuation.

  • Signal Interference: Non-coherent light sources can cause phase interference, preventing recovery of phase-modulated signals.

4. Solutions:

  • Optimize coherent optical modules: Reduce costs by using silicon photonics technology and gradually transition to using O-band lasers to simplify designs.

  • DWDM technology expansion: Increase bandwidth by adding wavelengths such as C-band and L-band to meet the growing customer demands.

  • Modular design: Use ZR/ZR+ optical modules for direct insertion into network ports to simplify telecom equipment chains.

5. Other Considerations:

  • Industry trends: With the increasing demand for AI and big data, the continuous development of modulation and multiplexing technologies will be crucial for data center interconnection and telecom networks.
  • Future expectations: The combination of advanced modulation and DWDM is expected to achieve transmission capacities exceeding 100Tbps on a single pair of optical fibers, driving larger-scale network upgrades.

Telecom Network Deployment

1. Concepts and Objectives: Telecom networks are infrastructure for data transmission, aiming to meet the communication needs of high bandwidth and high reliability, especially to support data center interconnection and cross-regional training. The goal is to achieve large-scale and efficient data exchange through optimizing optical fiber resource allocation and transmission technologies.

2. Key Steps:

  • Optical fiber deployment: Lay large numbers of fibers alongside cities and major infrastructure, usually reserving idle fiber pairs to meet future demand.
  • Application of DWDM technology: Merge multiple optical signals into a single fiber through dense wavelength division multiplexing, significantly increasing bandwidth.
  • Ultra-large-scale operator-built networks: Ultra-large-scale operators typically choose to cooperate directly with equipment suppliers to meet their specific needs.

3. Challenges and Issues:

  • Cost control: The deployment cost of submarine cables is high, mainly focused on the number of fiber pairs, while the main costs of land cables are labor and equipment.
  • Scarce resources: In some urban areas, optical fiber resources may be limited, leading ultra-large-scale operators to use fewer fiber pairs.
  • Technological complexity: Long-haul networks require various telecom equipment, increasing system complexity and space occupation.

4. Solutions:

  • Deployment of expanded fiber pairs: Ultra-large-scale operators typically choose to pre-deploy more fiber pairs than actually needed to reduce the complexity of subsequent telecom deployments.
  • Modular telecom equipment: Use modular chassis to combine various devices such as transceivers, DWDM multiplexers, and ROADM to enhance system flexibility and scalability.
  • Dynamic network management: Achieve dynamic adjustment and optimization of optical signals through ROADM to improve network performance and resource utilization.

5. Other Considerations:

  • Market trends: The demand for telecom equipment from non-cloud customers may gradually recover, improving the market prospects for equipment suppliers.
  • Technological advancements: As ZR/ZR+ optical modules see wider use in data center interconnection, spending on telecom equipment and systems is expected to rise, pushing the industry toward higher-performance devices.

The Future of AI Infrastructure: Competition between Google and Microsoft in Multi-Data Center Training

GW-level clusters, communication networks, long-haul fiber optics, hierarchical and asynchronous stochastic gradient descent (SGD), distributed infrastructure

As the Scaling Laws continue to drive the development of AI, the demand for infrastructure construction is soaring. This year, top AI model training clusters have expanded to 100,000 GPU units, with an expected increase to 300,000 by 2025. However, constrained by factors such as construction cycles, approval processes, regulatory restrictions, and power supply, the traditional single data center large-scale synchronous training mode is approaching its limits.

Google, OpenAI, and Anthropic have begun to expand large-scale model training to multiple data center campuses. Google has the most advanced computing systems globally, pioneering key technologies such as rack-level liquid cooling architecture and multi-data center training, which are now gradually being adopted by other companies.

The Gemini 1 Ultra model is the first to achieve multi-data center training. While Google leads in floating-point operations per second (FLOPS), it still lags behind OpenAI and Anthropic in synthetic data, reinforcement learning (RL), and model architecture. The upcoming Gemini 2 is expected to change this situation. Of particular note, Google is expected to have the ability to conduct GW-level training in multiple campuses by 2025, but its long-term planning is unexpectedly more conservative than OpenAI and Microsoft.

While most companies are just starting to use the high-density liquid-cooled AI chips of the NVIDIA GB200 architecture, Google has deployed millions of liquid-cooled TPUs with a total capacity exceeding 1 GW. This highlights Google's significant advantage in infrastructure.

Google's AI training campus currently has a power capacity close to 300 megawatts (MW), expected to increase to 500 MW next year. In addition to its massive scale, these facilities also have extremely high energy efficiency. The facilities use giant cooling towers and centralized water systems, connecting three buildings through pipelines, capable of dissipating nearly 200 MW of heat. With this system, Google mostly avoids using chiller plants, achieving a PUE of 1.1, according to the latest environmental report from 2023.

Google uses direct-to-chip water cooling, transferring rack heat to the central facility water system through liquid-to-liquid heat exchangers. This highly efficient energy system closely resembles NVIDIA's GB200 liquid-to-liquid deployments.

In comparison, Microsoft's largest AI training cluster does not currently use liquid cooling, and its IT capacity per building is about 35% lower than Google's despite similar gross floor area (GFA). Public data puts Microsoft's PUE at 1.223, but this calculation favors air-cooled systems because it does not capture the power consumed by fans inside the servers. In an air-cooled H100 server, for example, fan power exceeds 15% of server power, whereas in liquid-cooled (DLC) servers it is under 5%.

Therefore, for every watt delivered to the chips, Microsoft consumes roughly an additional 45% on server fans, cooling, power delivery, and other non-IT loads, while Google needs only about 15% extra. Considering the TPU's higher efficiency, Microsoft's overall position is not encouraging.
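
These roughly 45% and 15% figures can be approximated from the reported PUE values and fan fractions. The decomposition below is an interpretation on our part rather than a disclosed methodology (any remaining gap would come from other non-IT loads):

```python
def overhead_per_chip_watt(pue: float, fan_fraction_of_server: float) -> float:
    """Extra watts burned per watt delivered to the chips: facility overhead
    captured by PUE, plus power consumed inside the server by fans."""
    return pue * (1.0 + fan_fraction_of_server) - 1.0

print(overhead_per_chip_watt(1.223, 0.15))  # air-cooled H100 case: ~0.41
print(overhead_per_chip_watt(1.10, 0.05))   # direct liquid cooling case: ~0.16
```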

Furthermore, to achieve higher energy efficiency in desert areas like Arizona, Microsoft requires large amounts of water. Its water usage effectiveness (WUE) reaches 2.24 liters per kilowatt-hour, far above the industry average of 0.49 and above Google's level of around 1. This heavy water consumption has drawn negative attention, and Microsoft will be required to use air-cooled chillers in future campuses to reduce water use, even though this will further raise PUE and widen the energy efficiency gap with Google.

In summary, based on existing data center designs, Google has more efficient infrastructure and can add megawatt-scale capacity more quickly. With per-building capacity more than 50% higher, Google also draws relatively less utility power per watt of IT load.

Google's AI Training Infrastructure

Google stands out in infrastructure construction. Its individual data center design has surpassed Microsoft, Amazon, and Meta, but this is just the tip of the iceberg. Over the past decade, Google has been building large-scale campuses. The Council Bluffs campus in Iowa is a typical case, with IT capacity in its western section close to 300 megawatts. Although most of that capacity serves traditional workloads, we speculate that large numbers of TPUs may be deployed in these buildings. The latest data center design is used in the eastern expansion area to further enhance AI training capability.

Google's largest AI data centers sit adjacent to one another. The company has two major multi-data-center regions, in Ohio and Iowa/Nebraska. The area around Council Bluffs is currently undergoing large-scale expansion, with capacity set to more than double its current scale. In addition to the campus shown above, Google has three other sites under construction in the area, all being upgraded with high-bandwidth fiber networks.

Within a 15-mile radius, Google has three sites in Council Bluffs, Iowa, and Omaha and Papillion, Nebraska, with another site 50 miles away in Lincoln, Nebraska. The Papillion campus shown above adds over 250 megawatts of capacity to the Omaha and Council Bluffs area. Combining these campuses, Google's total capacity in the region will exceed 500 megawatts in 2025, with most of it allocated to TPUs.

The other two sites have not yet reached this scale, but they are expanding rapidly. Together, the four campuses are expected to form a gigawatt-scale AI training cluster by 2026. The Lincoln data center 50 miles away will become Google's largest single site.

Google's extensive TPU deployment goes further still. Another gigawatt-scale cluster under construction is located in the Columbus, Ohio area and follows a similar development pattern. By the end of 2025, the total capacity of its three campuses is expected to reach 1 gigawatt.

The New Albany cluster shown in the figure will become one of Google's largest data centers, already deploying TPU v4, v5, and v6.

Google's concentrated footprint in the Ohio and Iowa/Nebraska regions can be further interconnected to support multi-gigawatt training of a single model. Our data center model details historical and projected power data for over 5,000 data centers, covering the construction status of AI labs, hyperscale cloud providers, next-generation clouds, and enterprise clusters. Subsequent reports will delve into the software stack and related methods for multi-data center training.

Microsoft and OpenAI's Infrastructure Counterattack Strategy

Microsoft and OpenAI are well aware of their short-term infrastructure disadvantages and have launched an ambitious plan to catch up with Google in infrastructure construction. They are striving to compete with Google in its strong area - water-cooled multi-data center training clusters.

Microsoft and OpenAI are building near-gigawatt-scale, ultra-high-density liquid-cooled data center campuses. At the same time, they are collaborating with companies such as Oracle, Crusoe, CoreWeave, QTS, and Compass to surpass Google in total AI training and inference capacity.

Once some of these campuses are completed, they will exceed the scale of any single Google campus today. In fact, Microsoft's Wisconsin campus will exceed the total of all Google sites in Ohio, but with a longer construction period.

However, the ambitions of OpenAI and Microsoft go far beyond this. They plan to interconnect multiple hyperscale campuses to run large-scale distributed training across the United States, and will be the first to establish a multi-gigawatt computing system. Working with supply chain partners, they are undertaking the largest infrastructure buildout in history.

Subsequent reports will detail Microsoft's and OpenAI's infrastructure buildout. Before that, we will first explore synchronous and asynchronous training methods across multiple campuses, straggler nodes, fault-tolerance mechanisms, silent data corruption, and the other challenges facing multi-data center training.

Next, we will analyze how data centers interconnect through fiber-optic communication networks, including the relevant technologies and equipment. Finally, we will examine the telecommunications supply chain and discuss the key beneficiaries of this round of AI infrastructure construction, as well as which companies we believe will have an advantage.

Multi-Data Center Distributed Training

Large language models (LLMs) typically use synchronous training. Training data is divided into several small mini-batches, processed by model replicas on different groups of GPUs. After each mini-batch is processed, each replica calculates gradients, which are then synchronized at the end of each batch.

This synchronization is usually achieved through collective communication operations such as all-reduce, which aggregates the gradients of all replicas. After aggregation, the gradients are averaged and used to simultaneously update the model parameters. This ensures that all data replicas maintain a consistent parameter set, ensuring stable model convergence. Since the synchronous process requires all devices to wait for each other to complete before the next step, it ensures that no device is ahead or behind in the model state.

While synchronous gradient descent provides stable convergence, it also brings significant challenges, especially when the number of chips in a single training task exceeds 100,000, leading to a significant increase in communication overhead. The synchronous nature requires strict latency requirements and sufficient bandwidth to connect all chips, as data exchanges often occur in the form of massive data streams.

When attempting to use multiple regional GPUs to handle the same training task, inter-regional latency increases. Even with fiber optic propagation at a speed of 208,188 kilometers per second, the round-trip time (RTT) between the east and west coasts of the United States requires 43.2 milliseconds. Various telecommunications devices also introduce additional delays. This poses a significant challenge to standard synchronous training.
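
The 43.2 ms figure follows directly from distance and the speed of light in fiber; a quick check, assuming a roughly 4,500 km coast-to-coast path and ignoring equipment and routing delays:

```python
def fiber_rtt_ms(distance_km: float, fiber_speed_km_per_s: float = 208_188.0) -> float:
    """Round-trip time over fiber, ignoring switching and routing delays."""
    return 2.0 * distance_km / fiber_speed_km_per_s * 1000.0

print(fiber_rtt_ms(4_500))  # ~43.2 ms, matching the figure above
```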

According to Amdahl's Law, when there are a large number of synchronous operations in the workload, the acceleration effect of adding chips rapidly diminishes. As the number of chips increases, the proportion of the workload that needs to be synchronized remains constant, reaching a theoretical limit where even doubling the number of GPUs does not increase the total throughput by more than 1%.
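
Amdahl's Law makes this concrete. With a hypothetical 0.5% of the step time spent in unavoidable synchronization, speedup is capped near 200x regardless of chip count:

```python
def amdahl_speedup(serial_fraction: float, n_chips: int) -> float:
    """Classic Amdahl's Law: speedup over one chip when a fixed fraction of
    the work (here, synchronization) cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_chips)

print(amdahl_speedup(0.005, 100_000))  # ~199.6x
print(amdahl_speedup(0.005, 200_000))  # ~199.8x: doubling the chips adds ~0.1%
```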

In addition to the theoretical scalability limit, synchronous gradient descent in practice also faces challenges such as straggler nodes. When one chip runs 10% slower than the others, the entire training task runs 10% slower. As shown in the diagram, from steps 7,500 to 19,000, ByteDance's MFU (Model FLOPs Utilization) gradually declined as more chips slowed down slightly, with stragglers progressively constraining the whole job.

After identifying and removing the lagging nodes, ByteDance restarts the training task from a checkpoint, restoring the normal MFU level. It can be seen that the MFU decreases from 40% to 30%, a reduction of 25%. With 1 million GPUs, a 25% decrease in MFU is equivalent to 250,000 GPUs idle, representing over $10 billion in IT capital expenditure.

Fault-Tolerant Training

Fault-tolerant training is a key part of distributed systems. When millions of compute, storage, and memory components run simultaneously, failures are inevitable, and even "silicon lottery" performance differences appear; system design normally aims to absorb these issues. Machine learning training, however, as the world's largest-scale computing problem, takes the opposite approach.

All chips must run perfectly, because a failure in any one of 100,000 GPUs forces every GPU to restart from a checkpoint, producing large amounts of idle time. Fault-tolerant training allows only a few GPUs to be affected when a single GPU fails, with the rest continuing to run without restarting from a model weight checkpoint. Open models such as Llama 3.1 have lost significant cost and time to exactly this problem.

NVIDIA's InfiniBand network has a similar potential flaw: every data packet must arrive in exactly the same order, and any deviation or failure forces retransmission. Reports from a 100,000-GPU cluster indicate that the interval between failures caused by the network alone is measured in minutes.

The main open-source library, TorchX (formerly TorchElastic), implements fault-tolerant training but has significant drawbacks, such as failing to cover all edge-case failure scenarios and lacking support for 3D parallelism. This has pushed the large AI labs to develop their own fault-tolerant training systems.

As a leader in fault-tolerant infrastructure, Google achieves optimal fault tolerance training through Borg and Pathways. These libraries cover the most edge cases, reflecting Google's vertical integration advantage: designing training chips, building servers, writing infrastructure code, and conducting model training. This high level of integration helps to quickly address and solve fundamental issues.

Overall, fault tolerance is key to scaling a 100,000+ GPU cluster to a single workload. NVIDIA lags far behind Google in AI system reliability, which also explains the frequent appearance of fault tolerance in NVIDIA job descriptions.

Design redundant and fault-tolerant mechanisms, including redundant components, interfaces, and error correction codes (ECC), to maximize system availability. Evaluate and select appropriate technologies and components to optimize reliability, availability, and maintainability, while considering factors such as mean time between failures (MTBF), mean time to repair (MTTR), and total cost of ownership (TCO).

In the CPU domain, fault-tolerant infrastructure is generally considered a solved problem. For example, Google's internal database Spanner supports all of Google's production services, including YouTube, Gmail, and Stadia (may it rest in peace). It scales globally in a distributed manner while tolerating storage server and NVMe disk failures. In Google's data centers, hundreds of NVMe disks fail every hour, yet for end users and internal teams Spanner's performance and availability remain constant.

Another example of fault tolerance in traditional CPU workloads is MapReduce, a programming model in which users "map" over data samples and "reduce" many samples to an aggregated value. Counting how many times the letter "W" appears in a document is a textbook MapReduce workload: map each data sample to the count of "W"s it contains, then reduce by summing the counts across all samples. MapReduce achieves fault tolerance by detecting which CPU worker nodes have failed and re-executing the failed map and reduce tasks on other workers.

Jeff Dean, Sanjay Ghemawat, and other world-class experts at Google have developed a significant amount of CPU fault tolerance research and systems in the field. As machine learning training scales up and fault tolerance requirements increase, Google's expertise in building reliable and robust systems will become a competitive advantage.

GPU failure distribution follows a bathtub curve, with failures more common in the early and late stages of the cluster's lifecycle. This explains the necessity of extensive burn-in testing before deployment. Some emerging AI cloud vendors, in order to maximize lifespan, do not conduct sufficient burn-in testing, resulting in poor user experience.

In contrast, large-scale cloud computing companies and major AI labs conduct long-term burn-in testing under high temperatures and rapid temperature fluctuations to ensure early failures are eliminated, and the system enters a random failure phase. However, there is a balance between sufficient burn-in time and avoiding excessive consumption of GPU and transceiver lifespan.

Wear-related failures often occur towards the end of a device's lifecycle, commonly due to components experiencing rapid high-temperature fluctuations during intense usage. Transceivers are particularly susceptible to thermal cycle damage.

In the CPU domain, when the error rate of physical hosts is high, virtual machines (VMs) are typically migrated to another host. Large-scale vendors even achieve seamless migration without user perception. This is usually accomplished by replicating memory pages in the background, allowing the VM to quickly switch to normal operation on a second physical host when the application experiences a brief slowdown.

Linux's mainstream CRIU package is used by the major container engines, supporting container and application migration between physical hosts, and can even freeze an entire process state and store it as a checkpoint. For a long time it was only applicable to CPUs and AMD GPUs, until this year, when NVIDIA added support.

Starting from 2024, NVIDIA GPUs will support CRIU checkpoints, enabling smoother migration of CPU process states, memory contents, and GPU processes between physical hosts.

Microsoft's Singularity cluster manager paper describes transparent migration of GPU virtual machines using CRIU. Singularity was designed from the outset for globally scheduled GPU workload management and has been used for Phi-3 training (1,024 H100 GPUs) and other models. Microsoft is catching up with Google's Borg cluster manager and the advantages of Google's deep vertical integration.

The importance of fault-tolerant training means that related methods are now rarely disclosed publicly. Companies like OpenAI give the hardware industry only vague feedback, avoiding revealing the specifics of their distributed systems. These techniques are more important than model architecture, and both can be considered part of computational efficiency.

Silent Data Corruption (SDC) is another common issue that causes computers to produce silent errors in processing results without alerting users or administrators. It is difficult to resolve because "silent" implies undetectable. Some are minor but may lead to outputs becoming NaN or gradients abnormally increasing. Jeff Dean's gradient norm graph shows that some SDC can be identified through gradient norm mutations, but some cannot be detected.
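
A toy heuristic of the kind implied above, flagging a step whose gradient norm is NaN or far above the recent average; the window and threshold are arbitrary assumptions, and real SDC screening also relies on hardware-level checks such as DCGMI:

```python
from collections import deque

class GradNormSpikeDetector:
    """Flag suspicious gradient-norm jumps that sometimes reveal silent data
    corruption (many SDCs remain invisible to this kind of check)."""

    def __init__(self, window: int = 100, threshold: float = 5.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def update(self, grad_norm: float) -> bool:
        suspicious = False
        if len(self.history) == self.history.maxlen:
            mean = sum(self.history) / len(self.history)
            # NaN check (NaN != NaN) or a spike well above recent history.
            if grad_norm != grad_norm or grad_norm > self.threshold * mean:
                suspicious = True
        self.history.append(grad_norm)
        return suspicious
```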

Some gradient norm mutations are not caused by hardware SDC but by improper adjustments of large amounts of data or hyperparameters. All companies operating GPU clusters regularly encounter SDC issues, but small and medium-sized emerging cloud providers often struggle to quickly identify and fix them due to limited resources.

The DCGMI diagnostic tool can diagnose NVIDIA GPU SDC errors and capture a considerable portion of common SDCs, but it cannot handle many edge cases that lead to numerical errors and performance issues.

When testing different emerging cloud providers with H100, although DCGMI diagnostic level 4 passes, the NVSwitch ALU does not work properly, leading to a decrease in the performance of the NVLS NCCL algorithm and generating incorrect all-reduce results. Subsequent NCCL/RCCL collective communication articles will delve deeper into these benchmark test results.

In contrast, Google's Pathways excels at identifying and resolving SDCs. Its highly vertically integrated infrastructure and training stack make it easy to run SDC checks before and after large-scale training jobs.

Asynchronous training was once widely used. In 2012, Jeff Dean's DistBelief paper described asynchronous and synchronous gradient descent techniques for training deep learning models on large numbers of CPU cores. The global "parameter server" it introduced was widely used in production training for Google's autocomplete, search, and advertising models.

At that time, parameter server-based training had good results. However, convergence issues with new model architectures gradually led the industry back to fully synchronous gradient descent. Currently, all cutting-edge models such as GPT-4, Claude, Gemini, and Grok are trained using synchronous gradient descent. With the continuous increase in the number of GPUs, there may be a shift back to asynchronous gradient descent in the future.

Training Strategy

To overcome the diminishing returns of adding chips in Amdahl's Law, the number of global synchronous iterations can be reduced, allowing more workloads to run (semi-) independently. This method is suitable for training across campuses, multiple regions, or even continents, due to hierarchical differences in GPU latency and bandwidth.

Within a campus (under 1 km), latency is extremely low and bandwidth very high, allowing frequent synchronization. Within a region (under 100 km), bandwidth is still high but latency is higher, so synchronization frequency must be reduced. The number of GPUs in each campus will likely differ, so load balancing is needed. For example, if campus A has 100,000 GPUs and campus B has 75,000, B's batch size can be about 75% of A's, and synchronization uses a weighted average based on GPU counts.

This principle can be applied to situations involving multiple regions and continents. Due to higher latency, synchronization frequency should be reduced. Essentially, this is a form of hierarchical synchronization.

For example, we often meet with neighbors frequently, less often with friends in other cities on the same coast, and even less frequently with friends in cities on other continents.

Hierarchical synchronized stochastic gradient descent (SGD) also has an advantage in mitigating the impact of "stragglers." Most "stragglers" only exhibit abnormal behavior in a few steps, but quickly return to normal. Therefore, the lower the synchronization frequency, the less likely "stragglers" will disrupt the synchronization process. With no need for global synchronization at each iteration, the impact of "stragglers" becomes less significant. Hierarchical synchronized SGD will become a common innovation in future multi-data center training.
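
A sketch of hierarchical synchronization with torch.distributed process groups, assuming campus_group contains this rank's campus and global_group contains all ranks (created elsewhere with dist.new_group); the 50-step cadence is an arbitrary illustration:

```python
import torch.distributed as dist

def hierarchical_sync(model, step, campus_group, global_group, global_every=50):
    """Average gradients inside the campus every step; average parameters
    across campuses only every `global_every` steps."""
    for p in model.parameters():
        if p.grad is None:
            continue
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=campus_group)
        p.grad /= dist.get_world_size(group=campus_group)

    if step % global_every == 0:
        # Infrequent cross-campus averaging tolerates WAN latency and softens
        # stragglers. With all ranks in global_group, the average is implicitly
        # weighted by each campus's GPU count.
        for p in model.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=global_group)
            p.data /= dist.get_world_size(group=global_group)
```

The caller would run loss.backward() before this routine and optimizer.step() after the campus-level gradient averaging.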

Another promising approach is to revive the asynchronous parameter server from Jeff Dean's 2012 DistBelief paper. Model replicas process their own data batches and periodically exchange updates with the parameter server to refresh the global weights, somewhat like git version control. Naive implementations can run into convergence problems, but OpenAI has the ability to innovate around these update issues through better algorithms.

The Branch-Train-Merge paper from Meta AI describes a similar approach: branching from an existing large language model, training on data subsets, and then merging back into the main branch. That experience may feed into OpenAI's multi-site training technology. However, merging for models the size of GPT-3 175B or GPT-4 1.8T has not been fully solved, and more engineering resources will be needed to manage merges and updates while preserving training convergence.

Expanding to a hierarchical architecture requires setting up multiple layers of parameter servers, where model replicas not only exchange data with the nearest server but also between servers. At the lowest level, individual model replicas frequently update with the nearest parameter server to ensure local rapid convergence synchronization.

Local parameter servers are grouped into higher levels, where each layer optimizes the updates from the level below before passing them upward. With very large numbers of GPUs, parameter servers may need to store master weights in FP32, much as NVIDIA's FP8 training recipes do. Keeping master weights in FP32 avoids accumulation overflow across many GPUs, while the actual computation may use FP8 or even lower precision such as MX6.
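
A toy, single-process sketch of the parameter-server pattern described above. Real systems shard the server, batch updates, and manage staleness; the class, learning rate, and use of float16 as a stand-in for low compute precision are all illustrative assumptions:

```python
import threading
import numpy as np

class ParameterServer:
    """Asynchronous parameter server: FP32 master weights, updates applied as
    replicas push them, no global synchronization barrier."""

    def __init__(self, init_weights: np.ndarray, lr: float = 1e-2):
        self.master = init_weights.astype(np.float32)  # FP32 avoids accumulation overflow
        self.lr = lr
        self.lock = threading.Lock()

    def push(self, grad: np.ndarray) -> None:
        with self.lock:  # a replica pushes whenever it finishes a local batch
            self.master -= self.lr * grad.astype(np.float32)

    def pull(self, dtype=np.float16) -> np.ndarray:
        with self.lock:  # replicas pull (possibly slightly stale) weights
            return self.master.astype(dtype)  # compute precision can be lower
```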

Google currently uses the powerful sharding tool MegaScaler to achieve multi-site training, enabling synchronization training across multiple nodes within a site and across multiple sites within a region, providing stability and reliability advantages for scaling the number of chips in a single training task.

However, an industry return to asynchronous training could make MegaScaler a bottleneck for Google. Adding asynchronous functionality to a MegaScaler built on synchronous training principles may require significant refactoring or a rewrite. Pathways was designed with asynchronous dataflow in mind, but current production use is all based on fully synchronous SGD training. Google has the capability to redesign this software stack.

Interconnecting data centers across regions is mainly constrained by bandwidth and latency limitations. In the long term, latency will become a greater bottleneck due to the speed of light limiting signal propagation. The main costs of laying fiber optics across regions lie in permits and excavation, not the fiber optics themselves. Strategies to reduce bandwidth demand remain crucial.

In the future, model scales on multi-site, multi-regional training clusters could reach the order of 100 trillion parameters. Intra-regional available bandwidth is expected to expand to 5Pbps, while inter-regional estimates are around 1Pbps. With such high bandwidth, weight exchange between sites is no longer the main bottleneck, and transferring 400TB of weights (4 bytes per parameter) would only take 0.64 seconds.
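
The 0.64-second figure is simple arithmetic, reproduced below along with the slower inter-region case at the ~1 Pbps estimate:

```python
def transfer_seconds(params: float, bytes_per_param: float, link_bits_per_s: float) -> float:
    """Time to move one full copy of the model weights over a given link."""
    return params * bytes_per_param * 8.0 / link_bits_per_s

# 100 trillion parameters at 4 bytes each = 400 TB of weights.
print(transfer_seconds(100e12, 4, 5e15))  # ~0.64 s at 5 Pbps (intra-region)
print(transfer_seconds(100e12, 4, 1e15))  # ~3.2 s at 1 Pbps (inter-region)
```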

NVIDIA's MetroX Infiniband switch is used for network connections within 40 kilometers, but AI labs do not use it, with only a few non-AI HPC clusters spanning multiple sites within 10 kilometers. Each chassis has only 2 100Gbps ports, and Ethernet solutions within 40 kilometers are more mature. Even Microsoft, which widely uses Infiniband, uses Ethernet between data centers.

From Gb to Tb: Evolution of Modulation and Multiplexing Technologies

Current networks inside data centers typically provide fiber connections of up to 400Gbps per end device (such as a GPU). NVIDIA is expected to upgrade its ConnectX-8 network interface card next year to raise this to 800Gbps for AI applications.

By comparison, telecommunications networks must aggregate the traffic of many devices and servers in a facility onto a small number of fibers and transmit at much higher speeds. While an 800Gbps optical module in data communications typically carries only 100Gbps per fiber pair (for example, the DR8 format) and requires multiple independent fiber pairs, in telecom applications a single pair of single-mode fibers can carry 20Tbps to 40Tbps, suitable for submarine cables as well as many terrestrial long-haul and metro deployments.

Increased bandwidth is mainly achieved through the following ways:

  1. Adopting higher-order modulation schemes to transmit more bits on a given wavelength.
  2. Using Dense Wavelength Division Multiplexing (DWDM) technology to multiplex optical signals of multiple wavelengths onto a single fiber.

In terms of modulation, data communication typically uses optical modules based on VCSEL and EML, which can achieve PAM4 modulation. PAM4 is an intensity modulation scheme (i.e., intensity modulation direct detection—IMDD optical devices) that transmits signals using four different levels, encoding two bits per symbol.

Increasing speed can be achieved in two ways: increasing the symbol transmission rate (in gigabaud Gbd units) or increasing the number of bits per symbol. For example, a 400G SR8 optical module transmits symbols at a rate of 26.6 Gbd, achieving 2 bits per symbol through PAM4, transmitting 50Gbps per pair of fibers. By combining 8 pairs of fibers into one connector, the total transmission rate reaches 400Gbps. To achieve 800Gbps, the symbol rate can be increased to 53.1 Gbd while still using PAM4 on 8 channels. However, increasing the symbol rate is usually more challenging than adopting higher-order modulation schemes.

16 Quadrature Amplitude Modulation (16-QAM) is a widely used higher-order modulation scheme in ZR/ZR+ optical modules and telecommunications. It encodes not only four different amplitudes of the signal wave but also uses two sets of carrier waves with a phase difference of 90 degrees, each set of carrier waves having four different amplitudes, totaling 16 possible symbols, with each symbol transmitting 4 bits. By introducing dual polarization, i.e., using two sets of carrier waves with horizontal and vertical polarizations, it further expands to 256 possible symbols, transmitting 8 bits. Most 400ZR/ZR+ and 800ZR/ZR+ optical modules support Dual-Polarization 16-QAM (DP-16QAM), while dedicated telecommunications systems running on high-quality optical fibers (with larger specifications) can support Dual-Polarization 64-QAM (DP-64QAM), achieving transmission of 12 bits per symbol.
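
Raw line rate is just symbol rate times bits per symbol, as in the sketch below; the ~60 Gbd figure used for the 400ZR-class example is our approximation, and net rates are lower after FEC and framing overhead:

```python
import math

def line_rate_gbps(baud_gbd: float, constellation_points: int, polarizations: int = 1) -> float:
    """Raw line rate in Gbps: symbols per second times bits per symbol."""
    bits_per_symbol = math.log2(constellation_points) * polarizations
    return baud_gbd * bits_per_symbol

print(line_rate_gbps(26.6, 4))    # one PAM4 lane: ~53 Gbps raw (~50 Gbps net)
print(line_rate_gbps(60, 16, 2))  # DP-16QAM at ~60 Gbd: ~480 Gbps raw (400ZR class)
```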

Implementing modulation schemes based on different phases requires the use of coherent optical technology. Coherent light is crucial in achieving phase modulation schemes because incoherent light sources can lead to inconsistent interference in signals, making it impossible to recover phase-modulated signals.

Coherent optical systems require coherent digital signal processors (DSPs) to handle higher-order modulation schemes, along with tunable lasers and modulators. 400ZR optical modules typically use silicon photonics to reduce cost. Because tunable lasers are expensive, the industry is currently exploring lower-cost O-band lasers to enable simplified coherent optical modules.

ZR/ZR+ Optical Modules in Data Center Interconnection

ZR/ZR+ optical modules are becoming increasingly popular types of optical transceivers, using coherent optical technology designed specifically for data center interconnection. They can significantly increase the bandwidth per fiber pair and achieve longer transmission distances ranging from 120 kilometers to 500 kilometers. They typically adhere to the OSFP or QSFP-DD specifications, which are common in data communication applications, allowing for direct insertion into the same network switch.

Traditional telecommunication systems can also be used for data center interconnection, but they require a more complex chain of telecommunication equipment, occupying more physical space in data centers. In contrast, ZR/ZR+ pluggable modules can be directly inserted into network ports, connecting both ends directly and bypassing multiple telecommunication devices.

Higher-order modulation schemes have significantly increased the bandwidth per fiber pair. For example, compared to IMDD optical modules using PAM4, double-polarization 16-QAM can increase the bandwidth by 8 times. However, long-distance transmission is still limited by the fiber itself, so additional bandwidth per fiber pair can be achieved through Dense Wavelength Division Multiplexing (DWDM). DWDM combines multiple wavelengths into a single fiber pair for transmission. For instance, in the C-band (1530nm to 1565nm) and L-band (1565nm to 1625nm), 76 wavelengths can be combined into the same fiber.

If each wavelength transmits at 800Gbps, the system could achieve transmission speeds of up to 121.6Tbps on a single fiber pair. Submarine cables typically maximize the number of wavelengths used, with some deployments using fewer than 16 wavelengths, while others may deploy up to 96 wavelengths. The current typical deployment goal is to achieve transmission capacities of 20 to 60Tbps on a single fiber pair.

Many initial deployments only activate a few wavelengths in the C-band, gradually activating more C-band wavelengths as customer demand grows, eventually expanding to the L-band, significantly increasing the transmission speed of existing fibers.

Telecom Network Deployments for Hyperscale Operators

Most cities in the United States have sufficient fiber resources to meet the massive bandwidth requirements for AI data center interconnection. Submarine cable deployments typically include only 8 to 12 fiber pairs, as costs are proportional to the number of fiber pairs. In contrast, the cost of land-based cables mainly focuses on labor, equipment, and rights of way, so companies often deploy hundreds or even thousands of fiber pairs when laying cables in urban areas.

Training across oceans is significantly more challenging than training over land. Typical fiber business models reserve a considerable number of idle fiber pairs to meet future demand. Fiber is laid not only in cities but also along major highways, power lines, railways, and other infrastructure. In infrastructure construction projects, because excavation equipment is already on site, adding fiber installation costs almost nothing extra.

Hyper-scale operators tend to build their own networks rather than cooperate with telecommunications service providers. They directly collaborate with equipment suppliers and construction companies to meet long-haul, urban, and data center interconnection needs.

Data center interconnection typically involves laying a large number of fiber pairs to connect two data centers that are no more than 50 kilometers apart. Hyper-scale operators can insert ZR optical modules into network switches at two remote data centers, tune them to different wavelengths, and then combine up to 64 optical modules onto a pair of fibers through a passive multiplexer (DWDM link). Using 400ZR, each pair of fibers can achieve a transmission speed of 25.5 Tbps. Another method is to insert each ZR optical module into independent fiber pairs.
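
Aggregate fiber-pair capacity is wavelengths times per-wavelength rate; a quick check against the figure above (the small difference comes from rounding and overhead):

```python
def dwdm_capacity_tbps(wavelengths: int, gbps_per_wavelength: float) -> float:
    """Aggregate capacity of one fiber pair carrying DWDM channels."""
    return wavelengths * gbps_per_wavelength / 1000.0

print(dwdm_capacity_tbps(64, 400))  # 64 x 400ZR muxed onto one pair: ~25.6 Tbps
```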

More complex telecommunications systems can also achieve DWDM, multiplexing more ZR optical signals onto fewer fiber pairs and supporting network connections beyond point-to-point. However, this requires additional rack space for telecommunications equipment, including routers, reconfigurable optical add-drop multiplexers (ROADM), and DWDM multiplexers/demultiplexers.

Since the main cost lies in excavating fiber ducts, most hyper-scale operators find it simpler to deploy more fiber pairs than needed, saving space inside data centers and avoiding complex telecommunications deployments. Only when fiber capacity is limited do they consider deploying extensive telecommunications systems over short distances, a situation more common outside the United States where hyper-scale operators may be forced to use only 2 to 4 pairs of fibers in cities where fiber resources are scarce.

However, in long-haul networks, hyper-scale operators need a comprehensive range of telecommunications products that are fundamentally different from data communication products. A typical long-haul network requires at least several basic systems, including transponders, DWDM multiplexers/demultiplexers, routers, amplifiers, gain equalizers, and regenerator sites, often including ROADM (reconfigurable optical add-drop multiplexer) and WSS (wavelength selective switch).

Transponders in the telecommunications world play a role similar to optical modules but at higher prices and power levels. One side transmits and receives on the actual telecom network (the line side), while the other offers various combinations of ports toward customer equipment (the client side). For example, a transponder may provide 800Gbps on the line side and four 200Gbps optical/electrical ports on the client side, with many port-capacity and electrical/optical combinations available. The client side connects to routers or switches inside the data center, while the line side connects to multiplexers that combine signals from multiple transponders via DWDM, possibly forming more complex network topologies through ROADM.

DWDM works through multiplexers and demultiplexers (mux/demux), combining optical signals of slightly different wavelengths from each transceiver onto one fiber pair. Each transceiver can be tuned to a specific wavelength for multiplexing. When ROADMs are used, transceivers are typically connected to colorless mux/demux units and then to wavelength-selective switches (WSS), which dynamically tune the transceivers to specific wavelengths to optimize network objectives.

Optical amplifiers counteract signal attenuation during transmission and are typically placed every 60 to 100 kilometers, amplifying the optical signal directly without converting it to an electrical one. Roughly every third amplifier site includes a gain equalizer to rebalance wavelengths that attenuate at different rates and avoid errors. In ultra-long-haul deployments spanning thousands of kilometers, regeneration is needed: the optical signal is converted to electrical, reshaped and retimed, and retransmitted through another set of transceivers.

For networks connecting multiple nodes with multiple intermediate points for adding or receiving traffic, ROADM is required. It can optically add or drop specific wavelengths of light at specific locations in the network without the need to convert signals to electrical for processing or routing. ROADM also has a control plane to actively discover and monitor network status, understand idle channels, signal-to-noise ratio, reserved wavelengths on the fiber network, and control transceivers to adjust the line side to the appropriate wavelength.

These different components are typically combined in a modular chassis.

Ciena, Nokia, Infinera, and Cisco are major global telecommunications systems and equipment suppliers, while Lumentum, Coherent, Fabrinet, and Marvell provide various subsystems and active components to these major suppliers. Currently, the advantage of component manufacturers mainly lies in ZR/ZR+ optics used for data center interconnects, but as large-scale operators and other operators need to train outside neighboring data centers, they may significantly increase their spending on high average selling price telecommunications equipment and systems.

Non-cloud customers' demand for telecommunications equipment seems to have bottomed out and may soon enter a cycle of recovery, thereby boosting the prospects of various telecommunications suppliers.

OpenAI and Microsoft's Strategy to Surpass Google

As mentioned earlier, Microsoft's standard design is inferior to Google in terms of density. Although the data center footprint of both companies is similar, Microsoft's facilities have lower megawatt capacity.

Google's data centers have a lower PUE (power usage effectiveness), meaning a larger share of facility power reaches IT equipment such as GPUs, CPUs, and network gear rather than being lost to cooling and power overhead. So although Microsoft also has experience building large campuses, its construction cycles are usually longer and its overall scale is smaller than Google's.

Microsoft's largest AI training base is in Phoenix, one of its biggest deployments, and will expand to 10 buildings. Across multiple approved sites, Microsoft plans to build 24 data centers there, and it is actively leasing around Phoenix to further expand its footprint in the area. Not all of these data centers will be used for AI training; some may serve other purposes.

To surpass Google in scale, Microsoft and OpenAI cannot rely on Microsoft's existing data center designs. They are sharply increasing the density of new self-built facilities, especially in Milwaukee, and expanding across the United States with partners including Compass, QTS, Crusoe, Oracle, and CoreWeave. In Milwaukee and Atlanta (through QTS), Microsoft is building the most capable single data center buildings in the world, fully liquid-cooled and dedicated to next-generation AI hardware.

Once completed, the self-built super campus in Wisconsin will be the largest single data center campus owned by Microsoft or Google. Meta is also actively advancing ambitious single-site plans.

This is just an overview of some of the sites, but the expansion speed is astonishing. Another part of Microsoft and OpenAI's massive infrastructure is located in Texas, further expanding in collaboration with Oracle+Crusoe and CoreWeave+Core Scientific in Abilene and Denton.

It is worth noting that the buildout of these AI clusters also reaches into the cryptocurrency mining world. CoreWeave has leased existing Core Scientific cryptocurrency mining facilities, while Oracle is partnering with Crusoe, a company previously deeply involved in cryptocurrency, to use its campus. Bitcoin miners are accustomed to high-density, high-power data centers, and many mining farms have signed large-scale power supply contracts.

Core Scientific's 10-K filing shows that it has 1.2GW of contracted capacity across multiple sites. Compared to building new data centers, the timeline for repurposing cryptocurrency mining farms is much shorter, allowing these facilities to be converted into AI clusters more quickly and efficiently.

Core Scientific is shifting aggressively toward AI data center hosting and has reached a large-scale agreement with CoreWeave covering 382MW of IT power with a short delivery timeline. CoreWeave will purchase GB200 GPUs and lease them to Microsoft for OpenAI's use. We believe the most critical location will be the mining farm in Denton, Texas.

Similar to X.AI's use of on-site generators, this data center has ample power infrastructure: the site includes a 225MW natural gas power plant sitting at the center of the cryptocurrency mining farms. Those mining farms will be dismantled and extensively rebuilt, replaced with data-center-grade power and cooling systems. However, with a PUE above 1.3, the site will still be less efficient than Microsoft's self-built data centers.

Another important campus, developed by Crusoe, is located in Abilene, Texas. Crusoe is known for its sites in North Dakota and Wyoming powered by associated (flared) natural gas, and it is currently constructing a gigawatt-scale data center. The initial portion of this campus is being leased to Oracle, which will equip it with GPUs and networking equipment before leasing it to OpenAI. Through real-time low-resolution satellite imagery, we can observe the rapid expansion of this campus. We have precise and detailed quarterly historical and forecasted power data covering over 5,000 data centers, including the construction status of data center clusters for AI labs, hyperscale cloud providers, emerging cloud platforms, and enterprises.

There are several other large-scale data centers elsewhere in the United States. For the sake of brevity, we will not go through them one by one in this briefing, but the key point is clear:

Through ambitious self-built plans, active leasing strategies, large partnerships, and innovative ultra-high-density designs, Microsoft will lead the AI training market with multi-gigawatt-scale clusters.

Gigawatt-Scale Giant Clusters

Microsoft is working on interconnecting multiple campuses to create a massive, multi-gigawatt training cluster. Agreements signed with the fiber companies Lumen Technologies and Zayo provide some clues.

The involvement of Lumen and Zayo indicates that Microsoft is likely leveraging advanced fiber optic technology and infrastructure to support its large-scale AI training clusters. Interconnecting data centers at this scale implies that Microsoft is building a high-performance computing network capable of handling massive volumes of data, using high-speed fiber networks to achieve the low-latency communication and data transfer between campuses that AI model training requires.

On July 24th, Lumen announced an agreement with Microsoft to interconnect multiple data centers. A few days later, Lumen signed an agreement with Corning reserving 10% of Corning's fiber capacity over the next two years. We expect to see more such agreements in the future, which could greatly expand Corning's business.

Lumen Technologies (NYSE: LUMN) announced on September 4, 2024, that it has secured $5 billion in new business due to the huge connectivity demand driven by AI. Companies across various industries are eager to acquire fiber capacity, and as AI demand surges, this resource becomes increasingly valuable and potentially in short supply.

Furthermore, Lumen is actively negotiating with customers to secure an additional $7 billion in sales opportunities to meet the growing customer demand.

Lumen Technologies is a large telecommunications company with operations spanning multiple areas, of which the enterprise segment is the most important. Lumen works directly with enterprises, leveraging its extensive fiber network to address their connectivity needs. As mentioned earlier, this business has been plagued by capacity utilization issues: a large amount of leased or owned fiber is already deployed but sits idle, known as dark fiber. Lumen is one of the largest dark fiber suppliers in the United States, alongside Zayo, AT&T, and Crown Castle.

Enterprise telecom services also face challenges as many businesses have shifted their traffic to run over the internet due to declining internet prices, damaging the demand for MPLS (Multiprotocol Label Switching, a major enterprise product that provides data connectivity between remote offices), leading to price pressures and underutilization of resources. Additionally, buyers of telecom capacity have become more concentrated due to the rise of hyperscale cloud service providers, who often prefer to build their own telecom networks.

This means that much fiber capacity is sitting idle: many fibers are lit but carry only a small number of wavelengths, still using outdated modulation schemes and slower data rates. If a catalyst such as the surge in AI training demand materializes, upgrading this idle fiber presents a huge opportunity, since it can dramatically increase network transmission capacity.
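
To see why this is such a large opportunity, here is a rough illustration with assumed figures (not from the article): capacity per fiber pair is roughly the number of lit wavelengths times the data rate per wavelength, so relighting a sparsely used fiber with modern coherent optics multiplies capacity by orders of magnitude.

```python
# Assumed figures for illustration: a sparsely lit legacy fiber vs. the same glass
# fully lit with modern coherent optics. Capacity per fiber pair is approximately
# (wavelengths in use) x (data rate per wavelength).

def fiber_capacity_tbps(wavelengths: int, gbps_per_wavelength: int) -> float:
    """Approximate capacity of one fiber pair, in Tbps."""
    return wavelengths * gbps_per_wavelength / 1000

legacy = fiber_capacity_tbps(wavelengths=8,  gbps_per_wavelength=10)   # a few old 10G waves
modern = fiber_capacity_tbps(wavelengths=64, gbps_per_wavelength=400)  # fully lit C-band, 400G coherent
print(legacy, modern, modern / legacy)  # 0.08 Tbps -> 25.6 Tbps, a ~320x jump on the same glass
```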

Maintaining such a massive infrastructure requires significant capital expenditure, and like many peers, Lumen faces cash flow issues and a huge debt burden. The company has nearly $20 billion in financial debt, with almost no free cash flow generation, and its revenue growth trend remains stable or even declining.

The rise of multi-campus AI training is changing the above situation as it requires massive bandwidth demand. Microsoft is a key customer in facilitating a $5 billion deal and there may be another $7 billion deal in the works.

Companies like Lumen can strike such deals (and float the possibility of a further $7 billion) because they have extensive unused fiber networks. All of this idle capacity and these existing routes allow hyperscale cloud companies to build large-scale, high-bandwidth networks cost-effectively. More importantly, time to market is shortened: leveraging existing infrastructure accelerates work that would otherwise take years, especially where new conduit would have to be trenched.

For Lumen, the economic benefits of this $5 billion deal are as follows:

  1. The deal is structured as an IRU (Indefeasible Right of Use), a standard agreement in the fiber industry that is essentially similar to a capital lease. The typical term for such agreements is 20 years.
  2. 85-90% of the transaction value is related to infrastructure, with the remaining portion covering operations and maintenance as well as power and hosting services.
  3. Lumen estimates the cash profit margin of the transaction at 30-35%, implying a pre-tax profit of roughly $1.5 billion (a rough back-of-the-envelope check follows below). Most of the infrastructure cost will be prepaid in cash during the first 3-4 years of the contract, with the remainder collected as annual fees over the contract period as milestones are achieved.
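
A rough back-of-the-envelope check of these figures (contract value, infrastructure share, and margin range are taken from the points above; the timing of cash flows is not modeled):

```python
# Sanity check on the deal economics described above; only the figures quoted in the
# text (contract value, 85-90% infrastructure share, 30-35% cash margin) are used.

deal_value = 5.0e9                      # $5B contract value
infra_share = (0.85, 0.90)              # share of value tied to infrastructure
cash_margin = (0.30, 0.35)              # Lumen's estimated cash profit margin

infra_value = [deal_value * s for s in infra_share]
cash_profit = [deal_value * m for m in cash_margin]
print([f"${v / 1e9:.2f}B" for v in infra_value])  # ['$4.25B', '$4.50B'] infrastructure-related
print([f"${p / 1e9:.2f}B" for p in cash_profit])  # ['$1.50B', '$1.75B'] pre-tax cash profit
```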

Although the increase in operating expenses (OPEX) associated with the transaction weighs on EBITDA, and the related capital expenditures (CAPEX) are substantial, the deal significantly raises Lumen's expected annual free cash flow.

This may just be the beginning. The telecom industry is expected to see significant growth next year, and Lumen, a long-dormant telecom company, is clearly leading the way with a substantial increase in revenue. Fiber optic companies are beginning to take notice of the opportunity, but we believe the actual impact will surprise both investors and the companies in the sector. For example, the entire switch, router, and wide area network (WAN) market is only about $75 billion, so an incremental $5 billion to $10 billion of investment from a single company has a huge impact.

We expect over $10 billion in telecom capital expenditures to be specifically allocated to multi-data center training in the future. These are additional incremental expenses. At the same time, the telecom market is currently at a cyclical low, which is a new incremental driver, accompanied by the cyclical recovery of the market.

Beneficiary Companies

In addition to Corning and Lumentum, Fabrinet has a strong advantage in data center interconnect products, especially the 400ZR product line. In the fourth quarter of the fiscal year ending June 2024, the 400ZR product line contributed 10% of Fabrinet's optical revenue.

Fabrinet's presence in the telecom sector goes far beyond 400ZR. Its telecom business accounted for nearly 40% of total revenue in the fourth quarter of the 2024 fiscal year. Additionally, Fabrinet also has a strong data communication transceiver product line, responsible for producing 800G transceivers for connecting GPUs for NVIDIA.

The continued growth of ZR optics is one of the main drivers expected to lift Fabrinet's telecom business quarter over quarter next quarter. With ZR optics rising from roughly 10% to over 20% of its optical revenue, Fabrinet is well positioned to benefit. As a contract manufacturer focused on optical systems and components, Fabrinet is hailed as the "TSMC" of the industry and should continue to benefit from its scale advantage and strong market position.

In the 2024 fiscal year, Cisco became Fabrinet's second-largest customer after NVIDIA, contributing 13% of sales. Coupled with strong demand from other telecom customers, Fabrinet's telecom business is showing strong growth momentum.

Furthermore, Fabrinet recently won orders from Ciena, suggesting that Ciena's orders should increase next year. In the past, Lumentum and Infinera were important customers of Fabrinet, and the recovery of these two companies will also help drive Fabrinet's telecom revenue growth.

Lumentum is another company expected to deliver significant telecom revenue growth for consecutive quarters, driven mainly by rising demand for ZR/ZR+ optics, ROADMs, and C+L band products.

In contrast to Lumentum's optimistic outlook, Coherent is more cautious about the future. Despite the strong performance of its 400ZR+ transceiver business, Coherent expects the overall telecommunications market to remain weak in the short term. Coherent continues to be plagued by telecommunications equipment inventory issues, leading to a 6% sequential decrease and a 38% year-on-year decrease in telecommunications revenue. However, Coherent's forward guidance suggests that the bottom of the telecommunications market may be approaching.

Although Coherent inherited Finisar's telecom legacy, its business has diversified, and telecom now accounts for a much smaller share of total revenue. We estimate that in the second half of the fiscal year ending June 2024, telecom accounted for only about 13% of its total revenue.

In comparison, we believe Lumentum has a more balanced business portfolio and a more robust capital structure. If Coherent can keep focusing on its data communications business and make breakthroughs in telecom, paying down its debt could eventually leave room for increased share buybacks.

Ciena and Cisco are both telecommunications industry giants, with product lines covering line cards/transceivers, routers, multiplexers/demultiplexers, and ROADMs among other traditional telecommunications equipment. Cisco has a more diversified product line, including software business, while Ciena focuses on core telecommunications equipment. Infinera also focuses on the telecommunications field but is about to be merged into Nokia.

Among many telecommunications equipment manufacturers, Ciena has the highest focus on the telecommunications industry, with its business focus on telecommunications network hardware. Ciena has repeatedly mentioned the strong demand from cloud customers and explicitly stated during the second quarter fiscal year 2024 earnings conference call that it has received a large number of orders for next year's network construction, which are closely related to AI traffic demand.

Although Ciena's main business is still traditional telecom network equipment rather than data center interconnect, it has highlighted 18 wins for 400ZR+ and 800ZR+. ZR optics are an incremental, value-added opportunity for Ciena, whose business is concentrated in metro and long-haul networks.

We believe that Ciena holds a leading position in these niche markets. With the increasing demand for higher link densities in telecommunications networks supporting AI training, Ciena has growth potential in terms of content and quantity. Among all telecommunications equipment manufacturers, Ciena has the highest exposure in AI telecommunications network construction.

Cisco emphasized that in the second half of the fiscal year ending in July 2024, orders from hyperscale customers grew at a double-digit rate, successfully offsetting weakness in the service provider business. The company also booked $1 billion in AI orders, concentrated in Ethernet and optical products, and expects to book another $1 billion of AI-related orders in fiscal year 2025.

Despite Cisco's 2021 acquisition of Acacia, which gave it a strong position in coherent DSPs and ZR optics, recent earnings calls have made little mention of the ZR optics opportunity. Given Cisco's large revenue base, even a significant increase in AI data center interconnect demand would have a relatively limited impact on Cisco's revenue in percentage terms.

Now let's look at Marvell. Through the acquisition of Inphi, Marvell has taken a leading position in PAM4 DSPs and also gained a line of coherent DSP products, including Deneb, Orion, and Canopus.

Historically, coherent DSPs have been a relatively small part of the Inphi/Marvell business. That is changing: Marvell's ZR optics business benefits not only from its coherent DSP portfolio but also from its data center interconnect transceivers, including COLORZ, COLORZ 400ZR/ZR+, and COLORZ 800ZR/ZR+.

This ZR business is growing rapidly and is expected to become an important part of Marvell's business, potentially comparable to its substantial PAM4 business. The average selling price of ZR transceivers is much higher than that of IMDD transceivers, and shipment volumes are expected to keep growing significantly.

Marvell has a stronger competitive advantage in this product area, and its COLORZ series has broad market prospects. The company has made significant inroads at a major hyperscale customer, where shipment volumes continue to grow significantly, and it has expanded the product to multiple new customers. This impact will far outweigh any potential short-term LRO headwinds.

Reference: Patel, D., Nishball, D., & Ontiveros, J. E. (2024, September 4). Multi-Datacenter Training: OpenAI's Ambitious Plan To Beat Google's Infrastructure. SemiAnalysis. Retrieved from https://www.semianalysis.com/p/multi-datacenter-training-openais