Wallstreetcn
2024.08.04 06:32

Dojo - Musk's High-Stakes Bet on "Autonomous Driving"

TechCrunch, a technology media outlet, reported that the core of the Dojo plan is Tesla's proprietary D1 chip, which means Tesla may not need to rely on NVIDIA's chips in the future and can obtain large amounts of computing power at low cost. Tesla expects that by the end of this year, Dojo1 will achieve online training equivalent to about 8,000 H100s.

Author: Li Xiaoyin

Source: Hard AI

The Dojo supercomputer is becoming more important to Tesla by the day.

For Musk, Dojo is not just a supercomputer used by Tesla to train autonomous driving models in the cloud. In fact, it has become the cornerstone of Musk's AI business empire.

Morgan Stanley even likened Dojo to "Tesla's AWS", believing it will become Tesla's biggest value driver in the future.

In Musk's grand AI blueprint, what role does Dojo play? On Saturday morning local time, TechCrunch journalist Rebecca Bellan published an in-depth report titled "Tesla Dojo: Elon Musk’s big plan to build an AI supercomputer, explained," which uses Dojo as a starting point to lay out Musk's AI plans in detail.

Here are the highlights of the article:

  1. Tesla's pure vision approach (relying solely on cameras rather than a broader sensor suite to capture data) is the main reason it needs a supercomputer.

  2. Tesla's goal is to reach "half Tesla AI hardware, half NVIDIA/other" over roughly the next 18 months, with "other" possibly being AMD chips.

  3. The core of the Dojo plan is Tesla's proprietary D1 chip, which means Tesla may not need to rely on Nvidia chips in the future and can obtain a large amount of computing power at a low cost.

  4. The Dojo chip is Tesla's insurance policy and may bring dividends.

  5. Tesla expects Dojo's total computing power to reach 100 exaflops by October this year, equivalent to about 320,500 NVIDIA A100 GPUs; by the end of this year, Dojo1 is expected to bring online training capacity equivalent to about 8,000 H100s.

The full article is as follows:

For years, Elon Musk has been talking about Dojo - the artificial intelligence supercomputer that will become the cornerstone of Tesla's AI ambitions. This project is very important to Musk, who recently stated that as Tesla prepares to announce its robotaxi in October, the company's AI team will "double down" on advancing the Dojo project.

But what exactly is Dojo? Why is it so crucial to Tesla's long-term strategy?

In short: Dojo is Tesla's custom-built supercomputer, designed to train its "Full Self-Driving" (FSD) neural networks. Improving Dojo is closely tied to Tesla achieving full autonomy and bringing robotaxis to market. FSD is currently available on about 2 million Tesla vehicles; it can perform some automated driving tasks but still requires a human to stay attentive in the driver's seat.

Tesla's original plan to unveil its robotaxi in August has been pushed back to October, but it is clear from both Musk's public statements and Tesla insiders that the goal of autonomous driving has not gone away.

Tesla seems to be preparing to invest heavily in artificial intelligence and Dojo to achieve this feat.

The Story Behind Tesla's Dojo

Musk does not want Tesla to be just a car manufacturer, or just a provider of solar panels and energy storage systems. Instead, he wants Tesla to be an artificial intelligence company, one that has cracked the code of self-driving cars by mimicking human perception.

Most other companies developing self-driving car technology rely on a combination of sensors to perceive the world (such as LiDAR, radar, and cameras) and high-definition maps to locate vehicles. Tesla believes that it can rely solely on cameras to capture visual data, then use advanced neural networks to process this data and quickly determine how the car should behave.

As Tesla's former head of artificial intelligence, Andrej Karpathy, said at the company's first AI Day in 2021, the company is essentially trying to "build a synthetic organism from scratch." (Musk has been teasing Dojo since 2019, but Tesla officially announced it on AI Day.)

Companies like Alphabet's Waymo have already commercialized Level 4 self-driving cars through more traditional sensor and machine learning approaches - a level the SAE defines as a system that can drive itself under specific conditions without human intervention. Tesla has yet to produce a self-driving system that requires no human involvement.

About 1.8 million people have paid the hefty subscription fee for Tesla's FSD, currently priced at $8,000 and previously as high as $15,000. The pitch is that the Dojo-trained AI software will eventually be pushed to Tesla customers through over-the-air updates. The scale of FSD also means Tesla has been able to collect millions of miles of video footage for training. The idea is that the more data Tesla can collect, the closer the automaker gets to true full self-driving.

However, some industry experts say that simply pouring more data into the model and expecting it to become smarter may have limitations.

"First, there are economic constraints, and doing so quickly becomes prohibitively expensive," said Anand Raghunathan, professor of electrical and computer engineering at Purdue University Silicon Valley, to TechCrunch. He further stated, "There is a school of thought that we may actually run out of meaningful data to train the model. More data does not necessarily mean more information, so it depends on whether that data contains useful information to create a better model, and whether the training process can truly distill that information into a better model."

Raghunathan said that despite these concerns, at least in the short term, it seems that there will be more data. More data means more computational power is needed to store and process it to train Tesla's AI models. This is where the supercomputer Dojo comes into play.

What is a supercomputer?

Dojo is a supercomputer system designed by Tesla for artificial intelligence, especially for the training of FSD. The name pays homage to martial arts training dojos.

A supercomputer is made up of thousands of smaller computers called nodes. Each node has its own CPU (central processing unit) and GPU (graphics processing unit). The former handles overall management of the node, while the latter does the heavy lifting: splitting a task into many parts and working on them simultaneously. GPUs are essential for machine learning operations, like the ones that power FSD training simulations. They also underpin large language models, which is why the rise of generative AI has made NVIDIA the most valuable company on Earth.
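As a purely illustrative sketch of that division of labor (not Tesla code, and greatly simplified), the snippet below has a coordinating process split a workload into chunks and hand them to parallel workers, which stand in for a GPU's many cores:

```python
# Illustrative only: a "manager" process splits work into chunks and
# parallel workers (standing in for GPU cores) process them simultaneously.
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for the heavy, highly parallel math a GPU would perform,
    # e.g. multiply-accumulate operations over a slice of training data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    workload = list(range(1_000_000))
    chunk_size = 100_000
    chunks = [workload[i:i + chunk_size]
              for i in range(0, len(workload), chunk_size)]

    # The coordinating process distributes chunks and gathers partial results.
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)

    print(sum(partial_results))
```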

Even Tesla buys NVIDIA GPUs to train its AI (more on that later).

Why does Tesla need a supercomputer?

Tesla's pure vision approach is the main reason it needs a supercomputer. The neural networks behind FSD are trained on vast amounts of driving data to recognize and classify objects around the vehicle and then make driving decisions. That means when FSD is engaged, the neural networks must continuously collect and process visual data at speeds that match the depth- and velocity-recognition capabilities of a human.

In other words, Tesla wants to create a digital version of the human visual cortex and brain functions.

To achieve this goal, Tesla needs to store and process all video data collected from cars around the world, and run millions of simulations to train its models on the data.

Tesla appears to rely on NVIDIA to power its current Dojo training computer, but it doesn't want to put all its eggs in one basket—especially since NVIDIA chips are expensive. Tesla also wants to make something better, increase bandwidth, and reduce latency. That's why the automaker's AI department has decided to propose its own custom hardware plan, which aims to train AI models more efficiently than traditional systems.

At the core of this plan are Tesla's proprietary D1 chips, which the company says have been optimized for AI workloads.

More information about these chips

Tesla shares a similar view with Apple that hardware and software should be designed to work together. That's why Tesla is working to move away from standard GPU hardware and design its own chips to power Dojo.

Tesla first showed off its D1 chip, a palm-sized piece of silicon, at AI Day in 2021. As of May this year, the D1 chip was in production, with Taiwan Semiconductor Manufacturing Company (TSMC) manufacturing it on a 7-nanometer process. According to Tesla, the D1 has 50 billion transistors and a large die size of 645 square millimeters, all of which, Tesla promises, makes for a chip that is powerful, efficient, and able to handle complex tasks quickly.

"We can do computation and data transfer at the same time, our custom ISA (instruction set architecture) is fully optimized for machine learning workloads," said Ganesh Venkataramanan, former Senior Director of Autopilot Hardware at Tesla, at the 2021 Tesla AI Day " This is a pure machine learning machine."

Nevertheless, the D1 chip is still not as powerful as NVIDIA's A100 chip, which TSMC also manufactures on a 7-nanometer process. The A100 has 54 billion transistors and measures 826 square millimeters, so it slightly outperforms Tesla's D1.

To achieve higher bandwidth and computing power, Tesla's AI team fuses 25 D1 chips into a single tile that functions as a unified computer system. Each tile has 9 petaflops of computing power and 36 TB per second of bandwidth, and contains all the hardware needed for power, cooling, and data transfer. You can think of a tile as a self-sufficient computer made up of 25 smaller computers. Six of those tiles make up a rack, and two racks make up a cabinet. Ten cabinets make up an ExaPOD. At AI Day 2022, Tesla said Dojo would scale by deploying multiple ExaPODs. All of this together makes up a supercomputer.
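For a sense of scale, here is a back-of-the-envelope calculation using only the figures quoted above; actual Dojo totals depend on numeric precision and on details not broken out here, so treat the results as rough:

```python
# Rough arithmetic from the figures in this article (illustrative only).
d1_chips_per_tile = 25
petaflops_per_tile = 9
tiles_per_rack = 6
racks_per_cabinet = 2
cabinets_per_exapod = 10

tiles_per_exapod = tiles_per_rack * racks_per_cabinet * cabinets_per_exapod
d1_chips_per_exapod = d1_chips_per_tile * tiles_per_exapod
exapod_petaflops = petaflops_per_tile * tiles_per_exapod

print(tiles_per_exapod)      # 120 tiles per ExaPOD
print(d1_chips_per_exapod)   # 3,000 D1 chips per ExaPOD
print(exapod_petaflops)      # 1,080 petaflops, i.e. roughly 1.1 exaflops
```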

Tesla is also developing the next-generation D2 chip, which aims to solve information-flow bottlenecks. Instead of connecting individual chips, the D2 would put the entire Dojo tile onto a single wafer of silicon.

Tesla has not confirmed how many D1 chips it has ordered or expects to receive, nor has it provided a timetable for running the Dojo supercomputer on D1 chips.

Responding to a post on X in June that said "Elon is building a huge GPU cooler in Texas," Musk said Tesla's goal is to reach "half Tesla AI hardware, half NVIDIA/other" over roughly the next 18 months. Based on Musk's comments in January, "other" may be AMD chips.

What does Dojo mean for Tesla?

Controlling its own chip production means that Tesla may one day be able to quickly add significant computing power to AI training projects at low cost, especially as Tesla and TSMC expand chip production.

It also means that Tesla may not have to rely on NVIDIA chips in the future, which are becoming increasingly expensive and difficult to secure.

During Tesla's second-quarter earnings call, Musk said that demand for NVIDIA hardware is "so high that it is often difficult to get GPUs." He said he is "fairly concerned about being able to reliably get GPUs when needed," adding, "so I think it requires us to put more effort into Dojo to ensure we have the training capacity we need."

That being said, Tesla is still purchasing NVIDIA chips today to train its AI. In June, Musk posted on X:

"Of the approximately $10 billion in AI-related spending that I mentioned Tesla will do this year, about half is internal, mainly Tesla-designed AI inference computers and the sensors present in all our cars, plus Dojo. For building the AI training supercluster, NVIDIA hardware accounts for about 2/3 of the cost. My best guess for how much Tesla will buy NVIDIA this year is $3 billion to $4 billion." "

Inference compute refers to the real-time AI computation performed by Tesla vehicles themselves; it is separate from the training compute that Dojo handles.

Dojo is a risky bet, and Musk has hedged it several times by suggesting that Tesla may not succeed.

In the long run, Tesla theoretically can create a new business model based on its AI department. Musk has stated that the first version of Dojo will be specifically tailored for Tesla's computer vision labeling and training, which is very beneficial for FSD and training Optimus (Tesla's humanoid robot), but not useful for other things.

Musk has mentioned that future versions of Dojo will lean more towards general AI training. One related potential issue is that almost all existing AI software is written for GPUs. Using Dojo to train general AI models will require rewriting software.

That is, unless Tesla rents out its computing power, much as AWS and Azure rent out cloud computing capacity. Musk also mentioned during the second-quarter earnings call that he sees "a path to compete with NVIDIA through Dojo."

Morgan Stanley predicted in a report in September 2023 that Dojo could increase Tesla's market value by $500 billion by unlocking new revenue streams from robotaxis and software services.

In short, Dojo's chip is an insurance policy for this automaker and could pay dividends.

How is Dojo progressing?

Reuters reported last year that Tesla began production of Dojo in July 2023, but Musk suggested in a June 2023 post that Dojo had already been "online and running useful tasks for months."

Around the same time, Tesla said it expected Dojo to become one of the world's top five most powerful supercomputers by February 2024 - a feat that has not been publicly disclosed, casting doubt on whether it actually happened.

The company also projects that by October 2024, Dojo's total computing power will reach 100 exaflops. (1 exaflop equals 10^18 operations per second. To reach 100 exaflops, assuming a D1 can achieve 362 teraflops, Tesla would need over 276,000 D1s, or approximately 320,500 NVIDIA A100 GPUs.)
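That back-of-the-envelope math can be reproduced directly; note that the per-A100 figure below (about 312 teraflops, roughly the chip's BF16 rating) is an assumption of this sketch, since the article quotes only the resulting GPU count:

```python
# Reproducing the article's arithmetic (illustrative only).
target_flops = 100 * 10**18   # 100 exaflops
d1_flops = 362 * 10**12       # 362 teraflops per D1, as assumed above
a100_flops = 312 * 10**12     # assumption: ~312 teraflops per A100 (BF16)

d1_needed = target_flops / d1_flops      # just over 276,000 D1 chips
a100_needed = target_flops / a100_flops  # roughly 320,500 A100 GPUs

print(round(d1_needed), round(a100_needed))  # 276243 320513
```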

Tesla also committed in January 2024 to spending $500 million to build a Dojo supercomputer at its gigafactory in Buffalo, New York.

In May 2024, Musk noted that the rear portion of Tesla's Austin gigafactory will be reserved for an "ultra-dense, water-cooled supercomputer cluster."

Just after Tesla's second-quarter earnings call, Musk posted on X that the automaker's AI team is using Tesla's HW4 AI computer (renamed AI4), the hardware that sits in Tesla vehicles, in the training loop alongside NVIDIA GPUs. He noted that the split is roughly 90,000 NVIDIA H100s plus 40,000 AI4 computers.

He continued, "Dojo1 will achieve online training equivalent to about 8,000 H100s by the end of this year. Not a lot, but not a small amount either."