Wallstreetcn
2023.10.12 03:15

Is it really sustainable for AI startups to invest 80% of their funds in computing power?

Training an LLM is no longer a $100,000 affair; in fact, it may cost millions of dollars.

Recently, A16Z investors Stephanie Smith and Guido Appenzeller took stock of the current AI venture capital ecosystem, discussing the cost of AI computing power and the sustainability of the market. Topics covered include LLM training costs, data limitations, model scale, training data volume, and hardware development. The complete conversation follows:

Stephanie Smith:

Guido mentioned in his latest article the high cost of AI computing, pointing out that access to computing resources has become a decisive factor in the success of AI companies. This applies not just to the largest companies building the largest models; in fact, many companies allocate more than 80% of their total capital to computing resources. Naturally, this raises the question of whether that is truly sustainable.

Guido Appenzeller:

The core technology you build in the early stages evolves toward a more complete product, with more checks on the various functionalities and integrations. If your application is enterprise-facing (B2B), you also have to handle all the administrative features, so there may be more non-AI, traditional software development going on, and you may need to pay more employee salaries. In the end, I expect that as a percentage, compute spending will decrease over time, but in absolute terms it will keep increasing for a while, because this AI boom is still in its early stages.

Stephanie Smith:

The AI boom has just begun. In the second part, we discussed how the computing demand is unlikely to decrease in the short term, and when it comes to costs, the decision to own or lease infrastructure has a significant impact on a company's bottom line. But there are other considerations in terms of cost, such as batch size, learning rate, and the duration of the training process, all of which will affect the final price tag.

Guido Appenzeller:

How much does it cost to train a model? That depends on many factors. The good news is that we can simplify the question to some extent, because most models used today are Transformer models. The Transformer architecture was a major breakthrough in AI, and these models have proven to be very flexible. They are also easier to train, because they can be parallelized better than previous models.

So, for a Transformer, you can approximate the inference cost as roughly 2 floating-point operations per parameter per token, and the training cost as about 6 floating-point operations per parameter per token. If we take GPT-3 as an example, a massive model with 175 billion parameters, you would need about 350 billion floating-point operations to generate a single token. Based on this, you can roughly calculate how much computing power you need, how it will scale, how you should price it, and ultimately how much it will cost.
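As a quick sanity check on that rule of thumb (the 2x and 6x multipliers are approximations, not exact figures), the arithmetic looks like this in Python:

```python
# Rule-of-thumb FLOPs for Transformer models: roughly 2 FLOPs per parameter
# per token for inference, and about 6 per parameter per token for training.
GPT3_PARAMS = 175e9  # GPT-3's parameter count

inference_flops_per_token = 2 * GPT3_PARAMS  # ~3.5e11, i.e. 350 billion FLOPs
training_flops_per_token = 6 * GPT3_PARAMS   # ~1.05e12 FLOPs

print(f"Inference: {inference_flops_per_token:.2e} FLOPs per generated token")
print(f"Training:  {training_flops_per_token:.2e} FLOPs per training token")
```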

This also gives you an idea of the training time, given the floating-point throughput of your AI accelerator: you can theoretically calculate the number of operations required to train the model. In reality, the math is more complex. There are acceleration methods, and perhaps you can train at reduced precision, but achieving 100% utilization on these cards is very difficult. A naive implementation may only achieve 10% utilization; with some work, you may be able to push that into the tens of percent. This gives you a rough estimate of how much computational power is needed for training and inference, but ultimately you may need to test it before making a final decision, to ensure that your assumptions hold true.

Stephanie Smith:

If all these numbers confuse you, don't worry; we will walk through a very specific example. GPT-3 has about 175 billion parameters, and here is Guido's back-of-the-envelope math for training the model and performing inference with it.

Guido Appenzeller:

If we do the math very naively, let's start with training. We know how many tokens it was trained on, and we know how many parameters the model has. So we can do a rough calculation, and in the end you get about 3x10^23 floating-point operations. This is an absolutely crazy number, a 3 followed by 23 zeros, very difficult to even describe.

In reality, humans rarely deal with computations at this scale; it's a huge engineering effort. Then consider the hardware. Let's take the A100 as an example, one of the most commonly used GPUs. We know how many floating-point operations it can perform per second.

We can divide these numbers, which gives an order-of-magnitude estimate of how long it will take. We also know the cost of these GPUs: renting an A100 costs about $1-4 per hour, depending on where you rent it. So the rough cost you end up with is about $500,000, based on this very naive analysis.
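To make the arithmetic concrete, here is a minimal sketch of that naive estimate. The ~300 billion training tokens for GPT-3 and the A100's ~312 TFLOPS (dense FP16/BF16 tensor-core peak) are the commonly published ballpark figures, and 100% utilization is assumed:

```python
# Naive training-cost estimate for a GPT-3-scale model.
# Assumptions (ballpark, for illustration): ~300B training tokens,
# A100 peak ~312 TFLOPS (FP16/BF16), 100% utilization,
# $1.80 per GPU-hour (within the $1-4 range quoted above).
PARAMS = 175e9
TOKENS = 300e9
A100_FLOPS = 312e12
PRICE_PER_GPU_HOUR = 1.80

total_training_flops = 6 * PARAMS * TOKENS            # ~3.15e23, the figure in the text
gpu_hours = total_training_flops / A100_FLOPS / 3600  # ~280,000 A100-hours
cost = gpu_hours * PRICE_PER_GPU_HOUR                 # ~$500,000

print(f"Total training FLOPs: {total_training_flops:.2e}")
print(f"A100-hours at 100% utilization: {gpu_hours:,.0f}")
print(f"Naive cost: ${cost:,.0f}")
```

Dividing by a realistic utilization figure (the 10% to a few tens of percent Guido mentions above) multiplies this estimate several-fold, which is one way the naive $500,000 becomes millions in practice.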

Now there are some factors to consider. We haven't taken optimizations into account, nor the fact that you may not be able to run at full capacity due to memory-bandwidth and network limitations. Last but not least, you may need multiple runs to get the results right, plus some test runs that may not be full runs, and so on.

This shows that training these LLMs today is not a $100,000 affair. In fact, based on what we have seen in the industry, it can actually cost millions of dollars.

That is partly because you need to reserve computational power. If I could get all my GPUs for just the next two months, it might only cost me $1 million, but the problem is that they require a two-year reservation, so the actual cost is 12 times higher, which basically adds a zero to my training cost.
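The reservation effect is simple arithmetic; a sketch using the figures Guido quotes:

```python
# Paying for a 2-year reservation when the training run needs only ~2 months.
months_needed = 2
months_reserved = 24
on_demand_cost = 1_000_000  # the $1M figure from the text

effective_cost = on_demand_cost * (months_reserved / months_needed)
print(f"Effective cost: ${effective_cost:,.0f}")  # $12,000,000: roughly "adds a zero"
```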

Stephanie Smith:

Yes, and compared to that, inference is much cheaper.

Guido Appenzeller:

Basically, the training set for a modern text model is on the order of 1 trillion tokens. When you run inference, each output word is a token, so a single inference involves roughly a trillion times fewer tokens than training did. If you calculate the cost of serving one query through an LLM, it adds only cents, somewhere between a fraction of a cent and a few cents, roughly in that range.
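Extending the same naive arithmetic to a single query shows why the per-query figure is so small. The ~500 output tokens per query here is purely an illustrative assumption:

```python
# Naive per-query inference cost, continuing the earlier assumptions.
PARAMS = 175e9
OUTPUT_TOKENS = 500          # assumed length of one response, for illustration
A100_FLOPS = 312e12
PRICE_PER_GPU_HOUR = 1.80

query_flops = 2 * PARAMS * OUTPUT_TOKENS   # ~1.75e14 FLOPs per query
cost = query_flops / A100_FLOPS / 3600 * PRICE_PER_GPU_HOUR
print(f"Cost per query at 100% utilization: ${cost:.5f}")  # a fraction of a cent
```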

Again, that is if we approach the problem naively. The catch with inference is that you must provision for peak capacity. If everyone uses your model at 9 a.m. on Monday, you still have to pay for the capacity sitting idle at midnight on Saturday when no one is using it, which drives up costs considerably. On the other hand, for certain models, image models in particular, you can use lower-cost graphics cards for inference, because the models are small enough to run on servers with consumer-grade GPUs. That can save a lot of cost.

Stephanie Smith:

As we discussed in the first part, you can't just compensate for these inefficiencies by assembling a bunch of low-performance chips, at least not for model training.

Guido Appenzeller:

You need some very sophisticated software, because the overhead of distributing data between these cards can exceed the savings the cheaper cards provide.

Stephanie Smith:

On the other hand, for inference...

Guido Appenzeller:

For inference, it can usually be done on a single card. Take Stable Diffusion, a very popular image-generation model: it can run on a MacBook, which has enough memory and computing power to generate images locally. So you can run inference on relatively inexpensive consumer graphics cards rather than an A100.
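As a concrete illustration, here is a minimal local-inference sketch using the Hugging Face diffusers library; the checkpoint name is one common example, and a consumer GPU ("cuda") or Apple-silicon MacBook ("mps") is assumed:

```python
# Minimal local Stable Diffusion inference via Hugging Face diffusers.
# Runs on a consumer GPU or Apple silicon; no A100 required.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example checkpoint
    torch_dtype=torch.float16,         # half precision fits consumer VRAM
)
pipe = pipe.to("cuda")  # use "mps" on an Apple-silicon MacBook

image = pipe("a watercolor painting of a mountain lake").images[0]
image.save("lake.png")
```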

Stephanie Smith:

When we talk about model training, the computational workload obviously far exceeds inference. Another point we have discussed is that more computation usually, though not always, leads to better models. Does this ultimately mean that these factors hand the win to capital-rich incumbents? Or how do you see the relationship between compute, capital, and today's technology?

Guido Appenzeller:

This is a million-dollar, or even trillion-dollar, question. First of all, training these models is expensive. For example, we haven't seen a really good open-source LLM yet, and I believe part of the reason is that training these models is so expensive. There are many enthusiastic people who want to do this, but you need to find millions or tens of millions of dollars' worth of computing power to pull it off, which makes things much harder. It means you need to invest considerable effort to make it happen.

Overall, the cost of training these models actually seems set to come down, partly because we seem to be limited by data. It turns out there is a correspondence between the scale of a model and its optimal volume of training data. Having a large model but very little data is of no use to you, and having a large amount of data but a small model is of no use either. Think of it this way: the size of your brain has to roughly correspond to the length of your education, otherwise it doesn't work. And some of today's large models have already consumed a considerable proportion of human knowledge in certain domains. I mean, if you look at GPT, it may have been trained on about 10% of the entire internet, including all of Wikipedia, many books, and a large amount of literature. So it might be possible to scale the data up by a factor of 10, but it is unclear whether you could scale it up by a factor of 100.
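The correspondence Guido describes was quantified in DeepMind's Chinchilla paper (Hoffmann et al., 2022), whose rough rule of thumb is about 20 training tokens per parameter for compute-optimal training; a sketch:

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens
# per parameter (an approximation from Hoffmann et al., 2022).
TOKENS_PER_PARAM = 20

for params in (7e9, 70e9, 175e9):
    print(f"{params:.0e} params -> ~{TOKENS_PER_PARAM * params:.1e} optimal tokens")
```

By this rule, a GPT-3-sized model would want roughly 3.5 trillion tokens, an order of magnitude more than the ~300 billion it was actually trained on, which is why data rather than compute starts to look like the binding constraint.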

I mean, humanity simply hasn't generated enough knowledge to feed into these large models. So I think the current expectation is that the cost of training these models may actually peak and even decline slightly as chips get faster, because we won't be discovering new training data as quickly as before, unless someone comes up with a new way to generate training data.

If this assumption holds, I think the moats created by these large-scale investments are not particularly deep; they are more like a speed bump than something that stops new entrants. I mean, today it is absolutely possible for a well-funded startup to train an LLM, so for this reason we expect to see more innovation in this field going forward.

Author: Youxin, Source: Youxin Newin, Original Title: "A16Z on the Real Cost of Computing Power | Can AI Startups Sustainably Invest 80% of Their Money into Computing Power?"