Codename "TorchTPU"! Google and Meta join forces to replicate CUDA, further threatening NVIDIA

Wallstreetcn
2025.12.18 01:09

Google is closely collaborating with Meta to advance this plan. As the creator and steward of PyTorch, Meta aims to enhance its negotiating leverage with NVIDIA by reducing inference costs and diversifying its AI infrastructure. Google is also considering open-sourcing some of the software to accelerate customer adoption. If the TorchTPU plan succeeds, it will significantly lower the switching costs for companies seeking alternatives to NVIDIA GPUs.

Google is advancing an internal initiative called "TorchTPU," aimed at enhancing the compatibility of its artificial intelligence chips with PyTorch, the most widely used AI software framework globally, directly targeting NVIDIA's long-standing software ecosystem moat.

According to a report by Bloomberg on Thursday, insiders revealed that Google is closely collaborating with Meta to advance this initiative. As the creator and manager of PyTorch, Meta aims to strengthen its negotiating position against NVIDIA by reducing inference costs and diversifying AI infrastructure. Google is also considering open-sourcing some software to accelerate customer adoption.

Compared with past efforts to support PyTorch, Google is putting more organizational resources and strategic emphasis behind this attempt. With more companies seeking to adopt Tensor Processing Unit (TPU) chips but viewing the software stack as a bottleneck, the initiative underpins a key growth engine for Google Cloud.

If successful, TorchTPU will significantly lower the switching costs for enterprises moving from NVIDIA GPUs to alternatives. NVIDIA's dominance relies not only on its hardware but also on its CUDA software ecosystem, deeply embedded within PyTorch, which has become the default method for enterprises to train and run large AI models.

Software Compatibility as the Biggest Barrier to TPU Promotion

Google's TorchTPU initiative aims to remove the key barrier to TPU adoption. Insiders say enterprise customers have consistently told Google that TPUs are harder to adopt for AI workloads because, historically, developers had to switch to JAX, Google's internally favored machine learning framework, rather than keep using PyTorch, which most AI developers already rely on.
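
To illustrate that friction, here is a minimal sketch of the kind of code changes running PyTorch on TPUs has historically required, using the existing PyTorch/XLA (torch_xla) package; exact APIs vary by version, and this is illustrative rather than Google's TorchTPU code:

```python
# Illustrative only: the extra TPU-specific steps PyTorch/XLA has required,
# versus the one-line device change developers are used to on NVIDIA GPUs.
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm  # PyTorch/XLA bridge to TPUs

device = xm.xla_device()               # TPU device handle, instead of "cuda"
model = torch.nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 128, device=device)
y = torch.randint(0, 10, (32,), device=device)

loss = F.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
xm.mark_step()                         # TPU-specific: flush lazy ops to XLA
```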

This mismatch stems from Google's own technology path. Google's internal software teams have long used a framework called JAX, and its TPU chips rely on the XLA compiler to run code efficiently. Google's own AI software stack and performance optimizations are built primarily around JAX, widening the gap between how Google uses its chips and what customers need.
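
For contrast, a minimal JAX sketch shows Google's preferred path: JAX traces a Python function and hands it to the XLA compiler, which targets TPUs (as well as GPUs and CPUs) without device-specific code:

```python
# JAX + XLA, the stack Google's internal teams build around: jax.jit
# compiles the traced function with XLA, which can target TPU hardware.
import jax
import jax.numpy as jnp

@jax.jit                     # compile via XLA
def predict(w, x):
    return jnp.tanh(x @ w)   # placed on the TPU automatically when present

w = jnp.ones((128, 10))
x = jnp.ones((32, 128))
print(predict(w, x).shape)   # (32, 10)
```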

In contrast, NVIDIA's engineers have for years ensured that software written with PyTorch runs as quickly and efficiently as possible on its chips. PyTorch is an open-source project whose history is closely intertwined with that of NVIDIA's CUDA software. Some Wall Street analysts view CUDA as NVIDIA's strongest shield against competitors.
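
The stock PyTorch-on-CUDA path that NVIDIA has tuned for years needs, by comparison, little more than a device string:

```python
# The default PyTorch experience on NVIDIA hardware: CUDA support is
# built in, so moving a model to the GPU is a one-line change.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)
x = torch.randn(32, 128, device=device)
print(model(x).shape)        # torch.Size([32, 10])
```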

Google Accelerates External Sales of TPUs

Alphabet long reserved the vast majority of its TPU supply for internal use. That changed in 2022, when Google's cloud computing division successfully lobbied for management control of the TPU sales team, a move that significantly increased Google Cloud's TPU allocation.

As customer interest in AI grows, Google has sought to profit by increasing TPU production and external sales. TPU sales have become a key growth engine for Google Cloud's revenue as the company works to show investors that its AI investments are paying off. This year, Google began selling TPUs for deployment directly in customers' data centers, rather than offering them only through its own cloud service. This month, Google veteran Amin Vahdat was appointed head of AI infrastructure, reporting directly to CEO Sundar Pichai. Google needs this infrastructure to run its own AI products, including the Gemini chatbot and AI-powered search, while also supplying it to Google Cloud customers such as Anthropic.

Meta Becomes a Strategic Partner

To accelerate development, Google is working closely with Meta. According to a report by The Information, the two tech giants have been discussing a deal for Meta to acquire more TPUs.

Insiders revealed that the early services provided to Meta used Google's managed model, in which customers such as Meta install Google-designed chips to run Google's software and models, with operational support from Google. Meta has a strategic interest in software that makes TPUs easier to operate: it hopes to cut inference costs and diversify its AI infrastructure away from NVIDIA GPUs, gaining negotiating leverage in the process.

A Google Cloud spokesperson did not comment on the specific details of the project, stating, "We are seeing a massive acceleration in demand for TPU and GPU infrastructure. Our focus is on providing the flexibility and scale that developers need, regardless of which hardware they choose to build on." Meta declined to comment.

Reducing Switching Costs Challenges NVIDIA's Ecosystem

PyTorch, first released in 2016, is one of the most widely used tools for building AI models. In Silicon Valley, few developers write every line of code that runs on NVIDIA, Advanced Micro Devices, or Google chips. Instead, they rely on tools like PyTorch, a collection of pre-written code libraries that automates many common tasks in AI software development.
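
A few lines of standard PyTorch show what those pre-written building blocks mean in practice: layers, losses, and optimizers come ready-made, so developers rarely touch device-level code:

```python
# Typical PyTorch usage: the framework supplies the layers, loss, and
# optimizer; the hardware-specific kernels underneath stay invisible.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```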

Insiders said that as more companies seek to adopt TPU chips but see the software stack as a bottleneck, Google has invested more organizational focus, resources, and strategic weight in the TorchTPU project. Most developers cannot easily adopt Google chips and reach performance comparable to NVIDIA's without significant additional engineering work, and in the fast-paced AI race such work costs time and money.
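
A hypothetical helper sketches one small piece of that engineering work as it stands today: backend-specific branches that a TorchTPU-style compatibility layer would aim to make unnecessary (pick_device is an illustrative name, not a real API):

```python
# Hypothetical sketch: today, portable PyTorch code must branch per backend.
import torch

def pick_device() -> torch.device:
    try:
        import torch_xla.core.xla_model as xm
        return xm.xla_device()   # TPU, if the torch_xla package is installed
    except ImportError:
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(8, 2).to(pick_device())
```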

If the TorchTPU initiative succeeds, switching away from NVIDIA GPUs will become far less costly. NVIDIA's dominance rests not only on its hardware but also on the CUDA software ecosystem deeply embedded within PyTorch, which has become the default way enterprises train and run large AI models.