Wallstreetcn
2023.08.12 04:40

[In-depth Interview] Tianqi Chen: AMD Cards Run Large Models, the Browser Runs 70B, Overcoming NVIDIA's GPU Compute Crunch

Is computing power no longer a problem?

Recently, many people have been worrying about computing power.

Big tech and startups are hoarding NVIDIA GPUs like crazy, while VCs and media outlets tally GPU supply and demand as if taking warehouse inventory. Articles analyzing the GPU shortage are springing up everywhere.

However, what if we could replace NVIDIA cards with AMD cards, or even run large models without NVIDIA GPUs at all? How would everything change then?

Speaking of this, we have to mention a genius: Tianqi Chen, creator of TVM, MXNet, and XGBoost, an assistant professor at Carnegie Mellon University, and the CTO of OctoML.

Recently, Tianqi Chen's Machine Learning Compilation (MLC) group at CMU released a new solution for large language model inference on AMD graphics cards, which immediately gained widespread attention in the machine learning community.

With this optimization, an AMD Radeon RX 7900 XTX running the latest Llama 2 7B and 13B models can reach about 80% of the speed of an NVIDIA RTX 4090, or 94% of a 3090 Ti.

On August 11th, Tianqi Chen had an in-depth conversation with Alessio, partner and CTO at Decibel Partners, and Swyx, founder of Latent Space. They discussed MLC, XGBoost, WebLLM, OctoAI, self-optimizing compute, and more.

Here is the full content for you to enjoy~ ✌️

(Translated by ChatGPT, with slight modifications)

Table of Contents:

● The Creation of XGBoost

● Tree-based Models vs. Deep Learning Models

● An Overview of TVM and ONNX

● A Deep Dive into MLC

● Model Inference with int4 Quantization

● MLC vs. Other Model Optimization Projects

● Running Large Language Models in the Browser

(Image: Tianqi Chen's research manuscript)

1. The Creation of XGBoost

Alessio:

When it comes to XGBoost, many listeners may know that it is a gradient boosting library and probably the most popular one.

The reason for its popularity is that many people started using it in machine learning competitions. I guess there might even be a Wikipedia page listing all the state-of-the-art models that use XGBoost, and the list could be very long.

When you were building XGBoost, could you predict that it would become so popular, or what were your expectations when creating it?

Tianqi Chen:

Actually, the initial motivation for building the library came when deep learning had just emerged, right around the time AlexNet was first introduced.

My mentor, Carlos Guestrin, and I had an ambitious mission to think about whether we could find alternative methods to deep learning models.

At that time, there were other alternatives such as support vector machines, linear models, and of course, tree-based models.

Our question was, if we build these models and train them with a large enough dataset, could we achieve the same performance?

Of course, in hindsight, that was a wrong assumption, but as a byproduct, we found that most gradient boosting libraries were not efficient enough to meet our needs for testing this hypothesis.

Coincidentally, I had a lot of experience building gradient boosting trees and their variants. So XGBoost was actually a byproduct of testing this hypothesis. At that time, I was also competing in data science challenges, such as the KDD Cup, and later Kaggle became bigger, so I thought maybe this could be useful for others. A friend convinced me to try making Python bindings for it, and that turned out to be a great decision. Through the Python bindings, we also got R bindings.
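
For readers unfamiliar with it, here is a minimal sketch of the Python bindings he mentions, using XGBoost's scikit-learn-style interface; the synthetic dataset and hyperparameters are placeholders, not anything from the interview.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Toy tabular data standing in for a real dataset.
X = np.random.rand(1000, 10)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Gradient boosted trees via the Python bindings.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```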

After that, it started to get interesting. People started contributing different ideas, such as visualization, and so we started pushing for more distributed support to ensure it works on any platform, and so on. Even at that time, when I talked to my mentor Carlos, he said he never expected it to be so successful.

In fact, the interesting thing is that I pushed for gradient boosting trees, even though he had a different opinion at the time. He thought maybe we should choose kernel machines. It turns out that we were both somewhat wrong, and deep neural networks are the strongest. But at least the direction of gradient boosting trees brought some results.

Alessio:

When it comes to these improvements, I'm always curious about the design process: how much is the result of collaboration with the people you work with, and how much is an attempt to drive research in the academic field, which is usually paper-oriented?

Tianqi Chen:

I would say that the XGBoost improvements at that time were more of a problem I tried to solve on my own.

Before that, I had worked on some other libraries in the field of matrix factorization. It was my first experience with open source. No one knew about it; if you search for "SVDFeature," you might only find an old SVN repository.

But in reality, it was used in some recommendation system packages. I tried to apply some of my previous experiences here and combine them. Later projects like MXNet and TVM were more collaborative in nature.

Of course, XGBoost has grown bigger. When we started the project, I worked with a few people and was surprised to see others join. Michael used to be a lawyer and now works in the AI field; he contributed on the visualization side. Now people in our community contribute different things.

Even today, XGBoost is still a community project driven by co-maintainers. So it's definitely collaborative work, constantly improving to better serve our community.

2. "Tree-based Models" vs "Deep Learning Models"

Swyx:

I would like to discuss the comparison between tree-based AI or machine learning and deep learning at some point, because I think many people are very interested in integrating these two fields.

Tianqi Chen:

Actually, the hypothesis we tested was partially wrong, because the hypothesis we wanted to test was whether you can run tree-based models in image classification tasks, and today, deep learning is obviously unparalleled in this field.

But if you try to run it on tabular data, you will still find that most people choose tree-based models. This is for a reason, because when you look at tree-based models, the decision boundaries are naturally the rules you are looking at, and they have good properties such as the ability to handle unknown input scales and automatically combine features.

I know that some people are trying to build neural network models for tabular data, and I sometimes pay attention to them. I think it's good to maintain some diversity in the modeling field.

In fact, when we built TVM, we built a cost model for programs, and we actually used XGBoost in it. I still think tree-based models are quite relevant because, first of all, they are easy to make work out of the box. Additionally, you get some interpretability and controls such as monotonicity constraints.
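
As an illustration of the interpretability and monotonicity controls mentioned above, here is a hedged sketch using XGBoost's monotone_constraints option; the synthetic data and the particular constraint choices are made up for the example.

```python
import numpy as np
import xgboost as xgb

# Synthetic tabular data: the target increases with feature 0 and decreases with feature 1.
rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=2000)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "max_depth": 4,
    # Force predictions to be non-decreasing in feature 0 and non-increasing in feature 1.
    "monotone_constraints": "(1,-1)",
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```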

Sometimes, I also keep thinking about whether we can build some potential improvements on top of these models. Undoubtedly, I think this is a field that may have potential in the future.

Swyx:

What are some promising projects currently, such as those trying to integrate these two directions?

Tianqi Chen:

I think there are some projects that try to apply transformer-type models to tabular data.

I don't remember the specific projects, but even today, if you look at what people are using, tree-based models are still one of their tools.

So I think the ultimate goal may not be replacement, but rather an ensemble of models that can be called upon as needed.

3. "Overview of TVM and ONNX"

Alessio:

Next, after building XGBoost for about three years, you built something called TVM, which is now a very popular model compilation framework.

Let's talk about it. That was around the time ONNX appeared. Maybe you can give us an overview of how the two work together, because the flow is roughly a model, then ONNX, then TVM, and I think many people don't quite understand the subtle differences. Can you give us some background?

Tianqi Chen:

Actually, this is rather ancient history.

Before XGBoost, I worked in the field of deep learning for two or three years. During my master's degree, my thesis focused on applying convolutional restricted Boltzmann machines to ImageNet classification, which was before the era of AlexNet.

At that time, I had to hand-write NVIDIA CUDA kernels; I remember it was on a GTX 2070 card. It took me about six months to get a model working. In the end, the model didn't perform well, and we should have chosen a better model. That was a very early period, and it really got me into deep learning.

Of course, in the end, we found that it didn't work. So during my master's degree, I ended up working on recommendation systems, which led to a paper and got me into a Ph.D. program. But I always wanted to go back to deep learning. Therefore, after XGBoost, I started working with some people on a project called MXNet. At that time, frameworks like Caffe, Caffe2, and PyTorch had not yet emerged. We worked hard to optimize performance on GPUs; even on NVIDIA GPUs I had found it challenging, and it had taken me six months.

Seeing how difficult it was to optimize code for different hardware was truly eye-opening. It made me think: can we build something more general and automated, so that I wouldn't need a whole team to build these frameworks?

So that's why I started working on TVM, so that very little machine learning engineering is needed to get deep learning models onto the platforms we support. I think part of it was also just wanting to have fun.

I enjoy writing code, and I really enjoy writing CUDA code. Of course, it's cool to be able to generate CUDA code, right? But now, after being able to generate CUDA code, well, by the way, you can also do it on other platforms, isn't that amazing? So it was more that attitude that got me started.
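
For flavor, here is a minimal sketch of the kind of CUDA code generation TVM enables, loosely following its classic tensor-expression tutorial; exact APIs differ across TVM versions, so treat this as illustrative rather than definitive.

```python
import tvm
from tvm import te

# Declare the computation symbolically: C = A + B over a vector of symbolic length n.
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Schedule it for a GPU: split the loop and bind the pieces to CUDA blocks and threads.
s = te.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, te.thread_axis("blockIdx.x"))
s[C].bind(tx, te.thread_axis("threadIdx.x"))

# Generate the CUDA kernel (requires a CUDA-enabled TVM build).
fadd = tvm.build(s, [A, B, C], target="cuda", name="vector_add")
print(fadd.imported_modules[0].get_source())  # inspect the generated CUDA C
```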

There are different kinds of researchers, and I'm more of a problem-solver type myself. I like to look at a problem and ask, okay, what kind of tools do we need to solve it? And whatever those tools turn out to be, I will definitely give them a try.

I also see that this is now a common trend. If you want to solve machine learning problems, it's no longer just about algorithms. You need to approach it from both the data and system perspectives. I think the whole field of machine learning systems is emerging. There is already a conference about it. Seeing more and more people starting to research in this field ensures continued innovation here.

4. "MLC Deep Research"

Swyx:

Your latest project in the field of machine learning systems (MLSys), MLC LLM, came to mobile in April this year. I have been using it on my phone, and it works well. I am running Llama 2 and Vicuna.

I'm not sure what other models you offer, but maybe you can describe your research process for MLC. I don't know how it relates to your work at Carnegie Mellon University. Is it an extension of that work?

Tianqi Chen:

I think this is more of a concentrated effort that we want to make in the field of machine learning compilation.

It is related to what we built in TVM. We built TVM five years ago, right? A lot has happened. We built an end-to-end machine learning compiler.

We have also learned a lot from it. Therefore, we are building a second-generation product called TVM Unity. It allows machine learning engineers to quickly apply new models and optimize them.

MLC LLM is one project under MLC. It is more of a vertically driven effort where we create tutorials and build concrete solutions like the LLM one, showing that you can apply machine learning compilation techniques and get some interesting results.

It can run on mobile phones, which is really cool. But our goal is not limited to that; we want it to be universally deployable. We have already run a 70-billion-parameter model on an Apple M2 Mac.

In fact, for single-batch 4-bit inference, we recently achieved best-in-class performance on CUDA.

On AMD cards, we just got a new result: for single-batch inference, we can run on the latest consumer-grade AMD GPU and reach about 80% of the performance of the RTX 4090, currently NVIDIA's best consumer-grade graphics card.

It hasn't reached the same level yet, but considering the hardware diversity this unlocks and the performance you could previously get on these cards, the things this technology could enable in the future are really exciting.

Swyx:

There is one thing that confuses me. Most of these models are built in PyTorch, but you run them inside TVM. Do any fundamental changes need to happen, or is this basically what TVM was designed for?

Tianqi Chen:

In fact, TVM has a program representation called TVM script, which includes computation graphs and operation representations.

Initially, we did need to put some effort into bringing these models into the program representation supported by TVM. There are various ways to do this, depending on the type of model you are looking at.

For example, for vision models and stable diffusion models, PyTorch models can usually be brought into TVM through tracing. This part is still being solidified so that we can bring in more models.

In language modeling tasks, what we do is directly build some model constructors and try to map them directly from Hugging Face models. The goal is that if you have a Hugging Face configuration, we will be able to bring it in and optimize it.

So, the interesting thing about model compilation is that your optimization doesn't just happen at the source-language level, right? For example, if you are writing PyTorch code, you can try to use better fused operators at the source level, and torch.compile may help you with some of that. In most model compilation flows, optimization happens not only at that initial stage but also through general transformations applied in the middle, all exposed through the Python API. So you have the flexibility to fine-tune some of these optimizations. This is particularly helpful for improving performance and ensuring portability across different environments.
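
As a point of comparison for the source-level optimization he describes, here is a small sketch using torch.compile from PyTorch 2.x; the toy function is a placeholder, and the point is only that framework-level compilation captures a graph and fuses operators, whereas an ML compiler additionally exposes the intermediate transformations.

```python
import torch

def toy_block(x: torch.Tensor) -> torch.Tensor:
    # A short chain of ops that a compiler can fuse into fewer kernels.
    return torch.relu(x @ x.T) * 0.5 + 1.0

# torch.compile captures the graph and applies fusion behind the scenes.
compiled_block = torch.compile(toy_block)

x = torch.randn(256, 256)
print(compiled_block(x).shape)
```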

Additionally, we have a general deployment feature. So, if you convert your ML program into a TVM script format, which includes functions that accept input and output tensors, we will be able to compile it. Consequently, these functions can be loaded into any language runtime supported by TVM. You can load it in JavaScript, which means you will have a JavaScript function that accepts tensors as input and outputs tensors. And of course, you can load it in Python, as well as in C++ and Java.
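
To make the "general deployment" idea concrete, here is a hedged sketch of compiling a tiny function with TVM, exporting it as a shared library, and loading it back through the runtime; the function name add_one and the file name add_one.so are illustrative, and the tensor-expression APIs shown vary across TVM versions.

```python
import numpy as np
import tvm
from tvm import te

# Build a tiny CPU function and export it as a shared library.
n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)
mod = tvm.build(s, [A, B], target="llvm", name="add_one")
mod.export_library("add_one.so")

# Later (possibly in another process), load the compiled module and call it.
loaded = tvm.runtime.load_module("add_one.so")
add_one = loaded["add_one"]

dev = tvm.cpu(0)
a = tvm.nd.array(np.arange(n, dtype="float32"), dev)
b = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
add_one(a, b)
assert np.allclose(b.numpy(), a.numpy() + 1.0)
```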

Overall, our goal is to bring ML models to the languages that people care about and run them on their preferred platforms.

Swyx:

Impressively, I've spoken to many people in the field of compilers before, but you don't have a traditional background in compilers. You are creating a whole new discipline called Machine Learning Compilation (MLC).

Do you think this will become a larger field in the future?

Tianqi Chen:

Firstly, I have indeed collaborated with people working in the field of compilers. So, we have drawn inspiration from many early innovations in that field.

For example, TVM initially drew a lot of inspiration from Halide, an image processing compiler. Of course, since then, we have made a lot of progress and now focus on compilation related to machine learning.

If you look at some of our conference publications, you will find that machine learning compilation is already a subfield. Every year there are machine learning compilation papers at machine learning conferences, at MLSys, and at systems conferences. The compiler conference CGO also has a workshop called C4ML focused on this area.

So, it has definitely gained attention and is becoming a field. I won't claim that I invented this field, but I have certainly collaborated with many people in this field. And of course, I'm trying to put forward a viewpoint, trying to learn from compiler optimization and combine it with knowledge from machine learning and systems.

Alessio:

In the previous episodes of the podcast, we had George Hotz, who had a lot to say about AMD and their software. When you consider TVM, are you still somewhat limited by the underlying kernel performance?

If your target is the CUDA runtime, you can still get good performance whether or not TVM helps you get there, but that's not really the level you're concerned with, right?

Tianqi Chen:

Firstly, there are underlying runtimes like the CUDA runtime, which for NVIDIA in many cases come from their libraries, such as CUTLASS, cuDNN, and so on. And for specific workloads, you can actually optimize beyond them. Because in many cases, if you do benchmarking, it's very interesting: for example, two years ago, if you tried to benchmark ResNet, in most cases the NVIDIA libraries provided the best performance and it was difficult to surpass them.

However, once you start modifying the model into some variant, not just the traditional ImageNet ones but, say, detection and other tasks, there is room for optimization, because people tend to overfit to benchmarks. So much optimization effort has gone into the benchmark cases that this becomes the biggest obstacle to getting low-level kernel libraries to do their best elsewhere. From this perspective, TVM's goal is to take a more general, automated approach, so we are not bound by the libraries they provide.

That's why we are able to run on Apple M2 or on WebGPU: because to some extent we are generating the kernels ourselves. This makes it easier to support hardware that is otherwise poorly supported; WebGPU is one example, and so is AMD, whose runtime support used to be lacking.

Recently their situation has improved, but even before that we could support AMD through Vulkan, a GPU graphics backend. Although the performance is not as good as CUDA's, it gives you good portability.

5. Model Inference with int4 Quantization

Alessio:

I know we still have to talk about other things related to MLC, like WebLLM, but I want to talk about the optimizations you are working on first. There are four key points, right? Can you briefly explain them?

Tianqi Chen:

Kernel fusion means that, you know, if you have an operator like convolution, or in the case of transformers a matmul, followed by other operators, you don't want to launch two GPU kernels. You want to be able to put them together in an intelligent way.

Memory planning is more about, you know, hey, if you run Python code, every time you generate a new array, you're actually allocating a new memory segment, right? Of course, PyTorch and other frameworks optimize this for you. So there's a smart memory allocator behind the scenes. But actually, in many cases, it's better to pre-allocate and plan everything statically.

This is where the compiler can come in. First of all, it's actually more challenging for language models because the shapes are dynamic. So you need to be able to do what we call symbolic shape tracking. We have a symbolic variable that tells you the shape of the first tensor is n times 12. The shape of the third tensor is also n times 2 times 12. Although you don't know what n is, you will be able to know this relationship and use it to infer fusion and make other decisions.

In addition to that, I think loop transformation is very important. It's actually non-trivial: if you just write code naively and expect performance, it's very difficult. For example, if you write a matrix multiplication, the simplest thing you can do is the triple loop: for i, j, k: c[i][j] += a[i][k] * b[k][j]. But that code is 100 times slower than the best code you can get.

Therefore, we apply many transformations, such as taking the original code and staging things through shared memory, and using tensor cores, memory copies, and so on. In fact, we also realize that not all of this can be fully automated. So we also package the ML compilation framework as a Python package, so that people can keep improving this part of the engineering process in a more transparent way.
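
As a rough pure-Python sketch of the contrast described above: the naive triple loop versus the same computation restructured into tiles for better data locality. Real GPU schedules go much further (shared memory, tensor cores, vectorization), but the idea of reshaping loops without changing the result is the same.

```python
def matmul_naive(a, b, c, n):
    # The straightforward triple loop: correct, but far from peak performance.
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]

def matmul_tiled(a, b, c, n, tile=32):
    # Same computation, restructured into tiles to improve cache reuse.
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += aik * b[k][j]
```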

We found this very useful for quickly getting good performance on new models. For example, when Llama 2 was released, we were able to look at the whole picture, identify the bottlenecks, and optimize them.

Alessio:

So, the fourth one is weight quantization, and everyone wants to know about this. Just to give people an idea of the memory savings: in FP32, each parameter takes four bytes, while int8 takes only one byte, so you can really reduce memory usage.

What are the trade-offs in this regard? How do you determine the right target? And what about the trade-offs in terms of precision?

Tianqi Chen:

Currently, most people actually use int4 on language models, which significantly reduces memory usage.

In fact, recently we started to consider that, at least in MLC, we don't want to have strong opinions on which quantization types to build in, because there are so many researchers in this field. So we want to allow developers to customize the quantization they want while we still bring them the best code generation. We are working on something called "bring your own quantization," and we hope MLC will be able to support more quantization formats. There is definitely an open field being explored: can you bring in more sparsity? Can you quantize activations as aggressively as possible? This will be a relevant area for quite some time.
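
To make the memory arithmetic and the "bring your own quantization" idea concrete, here is a generic group-wise symmetric int4 weight-quantization sketch in NumPy. This is a common scheme used for illustration only, not necessarily the exact format MLC ships.

```python
import numpy as np

def quantize_int4_groupwise(w: np.ndarray, group_size: int = 128):
    """Symmetric 4-bit quantization per group of weights along the flattened axis."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)  # packed two-per-byte in practice
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(shape)

# Rough memory math: a 7B-parameter model is ~28 GB in FP32, ~14 GB in FP16,
# and ~3.5 GB in int4, plus a small overhead for the per-group scales.
w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int4_groupwise(w)
print("max abs error:", np.abs(w - dequantize(q, s, w.shape)).max())
```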

Swyx:

You mentioned some things that I want to confirm again, that most people actually use int4 on language models. For me, this is not obvious. Are you talking about people who work on GGML types, or are researchers working on models also using int4?

Tianqi Chen:

Sorry, I was mainly talking about inference, not training, right? So, during training, of course, int4 is more difficult, right?

Perhaps to some extent, some mixed-precision methods can be used in inference. I think in many cases, int4 is very applicable. In fact, it does significantly reduce memory overhead, and so on.

6. MLC vs. Other Model Optimization Projects

Alessio:

Okay, great. Let's briefly talk about GGML and then Mojo. How should people think about MLC? How do all of these fit together?

I think GGML focuses on model-level reimplementations and improvements. Mojo is a language, very powerful. You are more at the compiler level. Do you work together? Do people have to choose between them?

Tianqi Chen:

In my opinion, the ecosystem has become very rich in many different ways. In the case of GGML, it is more like implementing something from scratch in C, right?

This allows you to customize each specific hardware backend. But then you will need to write CUDA kernels and optimize them for AMD, etc. In this sense, the engineering effort will be more extensive.

As for Mojo, I haven't looked into it in detail yet, but I believe there will also be machine learning compilation techniques behind it. So, it can be said that it has an interesting position in it.

As for MLC, we don't want to be opinionated about how, where, or in what language people develop and deploy. We also realize that there are actually two stages. First, you want to be able to develop and optimize your model.

By optimization, I mean truly introducing the best CUDA kernels and doing some machine learning engineering in them. Then there is another stage where you want to deploy it as part of an application. So, if you look at this field, you will find that GGML is more like developing and optimizing in the C language. While Mojo is about developing and optimizing in Mojo, right? And then deploying in Mojo.

In fact, this is the philosophy they want to promote. In terms of machine learning, we find that if you want to develop models, the machine learning community likes Python.

Python is the language you should focus on. Therefore, in the case of MLC, we really hope to achieve not only defining models in Python, which is very common, but also performing ML optimization in Python, such as engineering optimization, CUDA kernel optimization, memory planning, and so on.

But when you deploy, we realize that people want some general styles. If you are a web developer, you may need JavaScript, right? If you may be an embedded systems person, you may prefer C++, C, or Rust. People sometimes really like Python in many cases. So, in the case of MLC, we really hope to have such a vision, that is, optimizing and building generically in Python, and then deploying it widely in environments that people like.

Swyx:

That's a great point and comparison.

I want to make sure we cover another thing that I think you are one of the representatives of this emerging academic field, and you are also very focused on delivering results.

Obviously, you treat XGBoost as a product, right? And now you have released an iPhone application. What are your thoughts on this?

Tianqi Chen:

I think there are different ways to make an impact, right?

Certainly, there are scholars writing papers, building insights for people to build products on top of. In my case, I think in the specific field I'm working on, machine learning systems, I feel that we need to make it accessible to people so that we can really see the problems, right? And demonstrate that we can solve problems. This is a different way of making an impact. There are also some scholars doing similar things.

For example, if you look at some people from Berkeley, right? Over the years, they have launched some major open-source projects. Of course, I think it's a healthy ecosystem to have different ways of making an impact, and I think it's very interesting to be able to make an impact in different ways. And I think it makes sense to collaborate with the open-source community and work in an open-source manner because when we build our research, we have a real problem to solve.

In fact, these research results will come together and people will be able to leverage them. We are also starting to see some interesting research challenges that we might not have seen otherwise, right? If you're just trying to do a prototype, for example. So I think it's an interesting way to make an impact and contribute.

7. Running Large Language Models in the Browser

Swyx:

Yes, you have definitely made a big impact in this regard. And having experience with releasing Mac applications before, the Apple App Store is a big challenge. So one thing we definitely want to cover is running in the browser.

You have run a model with 70 billion parameters in the browser. That's right. Can you talk about how that was achieved?

Tianqi Chen:

First, you need a MacBook, the latest one, like the M2 Max, because you need enough memory to cover it.

For a 70-billion-parameter model, it takes about 50 GB of RAM. So the M2 Max, even the previous version, can run it, right? It also leverages machine learning compilation.

Actually, what we do is the same whether it's running on an iPhone, on a server cloud GPU, on AMD, or on a MacBook: we go through the same MLC pipeline. Of course, in some cases we may do some custom work for each target. And then when it runs in the browser, it goes through the WebLLM package.

What we do is compile the original model to what we call WebGPU, and then WebLLM picks it up. WebGPU is the latest GPU technology being rolled out by major browsers; you can already find it in Chrome. It lets you access the native GPU from the browser, and the language model then just calls the WebGPU kernels from there.

When Llama 2 was released, we initially wondered whether the 70-billion-parameter version could run on a MacBook. First, we actually... Jin Lu, the engineer who drove all this, got the 70-billion version running on a MacBook.

In MLC, it runs through the Metal backend, so you actually use Metal for GPU acceleration. So we found, okay, it can run on a MacBook. And then we asked ourselves: hey, we already have a WebGPU backend, why not give it a try? So we did, and it was truly amazing to see everything running smoothly. Therefore, I believe there are already some interesting use cases here, since everyone has a browser.

You don't need to install anything. Running a 70-billion-parameter model in the browser is admittedly not that practical yet, because you need to download the weights and so on, but I think we're getting close to that point.

In fact, the most powerful models will be able to run on consumer devices. It's truly amazing. And in many cases, there may be some use cases. For example, if I want to build a chatbot that I can talk to and ask questions, maybe certain components, like speech-to-text, can run on the client side. So there may be a lot of possibilities in a hybrid mode running on both the client and the server.

Alessio:

Do these browser models have a way to connect to applications? So, if I use, for example, OpenAI, can I also use local models?

Tianqi Chen:

Certainly. We are currently building an NPM package called WebLLM.

If you want to embed it into your web application, you can depend directly on WebLLM and then use it.

We also have a REST API that is compatible with the OpenAI API. The REST API currently runs on the native backend, so it is faster if you have a CUDA server on the native side, but there is also a WebGPU version you can run.

So yes, we do hope to make it easier to integrate with existing applications. Of course, the OpenAI API is definitely one way to do that, which is great.
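
As an example of what an OpenAI-compatible REST API enables, here is a hedged sketch that sends a chat-completion request to a locally hosted server; the base URL, port, endpoint path, and model id are placeholders that depend on how the local server is actually launched.

```python
import requests

# Placeholder endpoint for a locally running OpenAI-compatible server.
BASE_URL = "http://127.0.0.1:8000/v1"

payload = {
    "model": "Llama-2-7b-chat",  # placeholder model id
    "messages": [
        {"role": "user", "content": "Explain machine learning compilation in one sentence."}
    ],
    "temperature": 0.7,
}

# The request/response shapes follow the OpenAI chat completions schema.
resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```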

Swyx:

Actually, I didn't know there was an NPM package; that makes trying it out and using it very easy. I'm not sure how much we'll dive into it, but in terms of model runtime optimizations, a lot is already exposed through the API.

Alessio:

I think one possible question is about the timeline. Because as far as I know, Chrome released WebGPU at the same time you released WebLLM. Okay, yes. So do you have secret chats with the Chrome team?

Tianqi Chen:

The good news is that the Chrome team had been working on it well before the official release. So even though the official Chrome WebGPU release coincided with the release of WebLLM, you could already try WebGPU in Chrome through Canary, the unstable version.

There had been versions of WebGPU as early as two years ago, and of course it has been getting better and better. We actually had a TVM-based WebGPU backend two years ago, back when it was just starting to mature and performance was improving. There were no language models at that time; it was running on smaller, less exciting but still interesting models.

This year, we really started to see it becoming more mature and the performance improving. So, we brought it into a larger scale with a language model-compatible runtime.

Swyx:

I think you'll agree that the most difficult part is downloading the model. Is there any discussion about sharing a one-time model download across the applications that use this API?

Tianqi Chen:

That's a good question, and I think it is already supported to some extent. When we download the model, WebLLM caches it in a special Chrome cache.

So, if different web applications use the same WebLLM JavaScript package, you don't need to download the model again. So some sharing is already there. But of course, you need to download the model at least once to use it.

Swyx:

Okay. Another question is, in terms of performance both generally and in terms of what can run in the browser, what do you gain from optimizing your models?

Tianqi Chen:

It depends on how you define "running," right? On one hand, you can just download MLC and run it on your laptop, but there are different considerations, right?

If you are trying to serve a larger volume of user requests, if the requests change, or if hardware availability changes: right now it's hard to get the latest hardware because everyone is trying to grab what exists, which is a pity.

So I think there will be more questions when the definition of running changes. And in many cases, it's not just about running the model, but also about solving problems around the model. For example, how to manage the location of your model, how to ensure that the model is closer to the execution environment more efficiently, and so on.

So there are indeed many engineering challenges here that we hope to solve. And looking to the future, given the technology we have now and the state of hardware availability, I definitely think we need to leverage all possible hardware. That will include mechanisms to reduce cost and to bring workloads to the edge and the cloud in a more natural way. So I think we are still at a very early stage, but we can already see a lot of interesting progress.

Alessio:

Yes, that's great. I like it. I don't know how much we will delve into it, but how do you abstract all of this away from end users? They don't need to know which GPUs or which clouds things are running on; you hide all of that. What is that like as an engineering challenge?

Tianqi Chen:

I think to some extent you will need to support all possible hardware backends. If you take a look at the existing libraries, you will find it quite surprising, though not entirely unexpected, that most of the latest libraries only perform well on the latest GPUs. However, there are also other GPUs in the cloud. So being able to have the expertise and do the model optimization is one thing.

There are also other infrastructure aspects to supporting various hardware, such as scalability and making sure users can get what they want most. If you look at GPU instances in the cloud, whether from NVIDIA or even Google, those are likely what most people are using. So to a large extent, I think we are still in the early stages, but there are some interesting developments, and over time I believe more tools will make this easier. We are in a transformative period.

Swyx:

Alright. I think we've covered a lot of ground, but is there anything else that you think is really important that we haven't touched on?

Tianqi Chen:

I think we've covered a lot, haven't we?

What I want to say is that this is indeed a very exciting field. Personally, I consider myself fortunate to be involved in something like this. And I think it's a very open field, which I think is very important. Many things are open source, and everything is publicly available.

And if you want to learn more, you can check out GitHub, watch some tutorials on YouTube, right? Or check out some news on Twitter. So I think it's a very open field with many opportunities for everyone to participate.

Just give it a try and see if you have any ideas, if there's something you can contribute. You'll find many surprises. So I think it's a very wonderful era.

Source: Hard AI

Attached:

MLC Project Page GitHub