ByteDance AI drives the acceleration of Agent implementation
Building AI infrastructure
Author | Liu Baodan
Editor | Huang Yu
In the battle to bring AI models into real-world use, ByteDance's all-in bet on AI has unveiled its latest weapon.
On April 17, Volcano Engine launched the Doubao 1.5 Deep Thinking Model for the enterprise market. The model adds visual reasoning capabilities, allowing it to make associations and reason about what it sees, much as a human does. At the same time, Volcano Engine also upgraded its Text-to-Image Model 3.0 and Visual Understanding Model.
Tan Dai, President of Volcano Engine, stated that the Doubao 1.5 Deep Thinking Model excels at reasoning tasks in professional fields, matching OpenAI's o3-mini-high on the AIME 2024 mathematical reasoning benchmark. It also performs well in programming competitions, scientific reasoning, and creative writing.
The Deep Thinking Model serves as the foundation for building Agents, and thanks to the improvement in model performance, ByteDance has begun to focus on the implementation of Agents.
Tan Dai hopes that AI can solve more complex and complete problems, moving beyond mere perception, processing, and generation of information to end-to-end task handling. For example, helping users with itinerary planning and ticket purchasing.
At the conference, Volcano Engine announced the launch of the OS Agent solution and AI Cloud Native Inference Suite, aimed at helping enterprises build and deploy Agent applications faster and more cost-effectively.
Wall Street Journal learned that the OS Agent solution bundles the Doubao UI-TARS model with veFaaS function services, cloud servers, cloud phones, and other products, enabling Agents that can operate code, browsers, computers, phones, and other environments.
Take the Doubao UI-TARS model as an example: it integrates screen visual understanding, logical reasoning, interface element localization, and operation execution. This breaks through the limitation of traditional automation tools that rely on preset rules, and provides a model foundation for Agent interaction that is closer to human operation.
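The perceive-locate-act cycle described above can be sketched in code. The following is a minimal illustrative sketch, not Doubao UI-TARS's actual API; every class and function name here is hypothetical, and the screen is a stub standing in for real screenshots and OS input events.

```python
# Minimal sketch of a GUI-agent step: perceive the screen, locate the
# target element, act on it. All names are hypothetical; a real system
# would replace the stubs with a vision model and OS-level input hooks.

from dataclasses import dataclass


@dataclass
class Element:
    """A UI element with its on-screen coordinates."""
    name: str
    x: int
    y: int


class FakeScreen:
    """Stand-in for a real screenshot + input-event interface."""

    def __init__(self, elements):
        self.elements = {e.name: e for e in elements}
        self.clicked = []  # record of click coordinates

    def locate(self, description):
        # A real agent would use visual grounding on a screenshot here;
        # this stub just looks the element up by name.
        return self.elements.get(description)

    def click(self, x, y):
        self.clicked.append((x, y))


def run_step(screen, instruction):
    """One perceive -> locate -> act step; returns True on success."""
    target = screen.locate(instruction)
    if target is None:
        return False
    screen.click(target.x, target.y)
    return True


screen = FakeScreen([Element("buy button", 120, 340)])
ok = run_step(screen, "buy button")
print(ok, screen.clicked)  # True [(120, 340)]
```

A real Agent would run this step in a loop, re-perceiving the screen after each action until the task is complete, which is where reflection and planning capabilities come in.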
Beyond model capability and sound architecture and tooling, cost is another decisive factor in deploying Agents successfully.
To this end, Volcano Engine has specifically developed the AI Cloud Native Serving Kit inference suite, allowing for faster model deployment and lower inference costs. Wall Street Journal learned from within Volcano Engine that GPU consumption has been reduced by 80% compared to traditional solutions.
As the implementation of Agents accelerates, it will bring a significant increase in model inference consumption for Volcano Engine.
According to Volcano Engine, as of the end of March 2025, the Doubao large model's average daily token call volume had reached 12.7 trillion, growing more than a hundredfold since its release last May. According to IDC's report "Analysis of the Market Landscape of China's Public Cloud Large Model Services, 1Q25," Volcano Engine ranks first with a market share of 46.4%.
However, Agents are still in the exploratory stage, and for Volcano Engine to better promote the implementation of Agents, it must undergo more tests in the market.
Below is the transcript of the exchange between Wall Street Journal and Tan Dai (edited):
Question: DeepSeek R2 is reportedly in intensive development. Facing such a powerful open-source competitor, what is Doubao's overall closed-source strategy and path to commercialization?
Tan Dai: I mainly want to talk about Volcano Engine. Whether a model is open source or closed source is not the key; what matters is how good the model is.
Yesterday, OpenAI released o4-mini and o3, and competition is a good thing. If we view current AI development as a marathon, we may have only run 500 meters so far. Everyone is pushing each other forward, and both the technology and its industry applications can develop rapidly.
In terms of business model, as a cloud vendor, Volcano Engine focuses on two aspects: first, to build a solid infrastructure for AI cloud-native; second, to provide the best models and various applications based on those models. After the launch of DeepSeek, the cloud vendor that has adapted to it the best is Volcano Engine.
Question: In the next two years, will the token calls for Volcano Engine's large models continue to grow nearly a hundredfold?
Tan Dai: In the long term, a hundredfold or even higher growth is possible, but whether it will be in two years or three years depends critically on whether there are significant breakthroughs in the models.
The rapid growth from last year to this year is due to several major breakthroughs in the models: first, the improvement of basic chat and information processing capabilities and the reduction of costs; second, the launch of deep thinking functions this year is also a breakthrough. There are many key nodes for future development, such as whether Agent-related technologies can make greater progress.
Every major breakthrough in large models will definitely trigger significant changes, but whether it can increase a hundredfold within two years should be approached with cautious optimism.
Question: Regarding the comprehensive upgrade of the Doubao model, how would you rate Doubao's performance? This upgrade emphasizes stronger text reasoning, lower costs, and easier implementation. Which of these three advantages is the hardest to achieve?
Tan Dai: I won't give a score because the progress of the model is too rapid. If I give it a score of 100 today, it might only be 60 a month later, so static scoring is not very meaningful. Compared to subjective scoring, more valuable are objective data, such as how many people are using the Doubao app and how many large enterprises are calling the Doubao model.
Overall, achieving strong model quality is the hardest part, because it requires using every available method to reach the best results, and then, on that basis, finding ways to reduce costs. It is an optimization process.
Question: Will Volcano support the MCP protocol, or similar protocols, in the future? What do you think about competing for influence over the developer ecosystem through a unified protocol?
Tan Dai: Volcano Engine already supports the MCP protocol. I believe protocol unification is very important. In the past, different vendors, Google and others, each had their own plugin protocols, which made adaptation costly for developers. With a unified protocol, application development will be faster and model calls more intelligent.
We hope to embrace and build an open protocol together, just like the early internet's HTTP and HTML protocols, which can accelerate the development of the entire industry.
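For context on what such a unified protocol standardizes: in MCP, a tool is exposed to the model through a descriptor with a name, a human-readable description, and a JSON Schema for its input arguments. The sketch below shows one such descriptor for a hypothetical "place_order" tool (the tool itself and the validator are illustrative, not part of any real MCP server):

```python
# A hypothetical "place_order" tool described in the MCP style:
# a name, a description for the model, and a JSON Schema for arguments.
order_tool = {
    "name": "place_order",
    "description": "Place an order for a given product and quantity.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "product_id": {"type": "string"},
            "quantity": {"type": "integer", "minimum": 1},
        },
        "required": ["product_id", "quantity"],
    },
}


def validate_args(tool, args):
    """Tiny structural check: required keys present with the right JSON types.
    (A real server would use a full JSON Schema validator.)"""
    schema = tool["inputSchema"]
    type_map = {"string": str, "integer": int}
    for key in schema["required"]:
        if key not in args:
            return False
    for key, spec in schema["properties"].items():
        if key in args and not isinstance(args[key], type_map[spec["type"]]):
            return False
    return True


print(validate_args(order_tool, {"product_id": "A100", "quantity": 2}))  # True
print(validate_args(order_tool, {"product_id": "A100"}))                 # False
```

Because every vendor's tools share this one descriptor shape, a model can discover and call them without per-vendor adaptation, which is the cost saving Tan Dai describes.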
Question: Are you considering launching a new protocol similar to A2A?
Tan Dai: I think, first of all, we need to do a good job with MCP, which is the most fundamental thing. A2A can be seen as an extension of MCP.
Question: What is the reason for Volcano Engine to develop the OS Agent solution?
Tan Dai: Volcano Engine provides the OS Agent solution around the model, aiming to build the relevant infrastructure. Many functions require multimodal support. If you want to place an order by calling an API, MCP support is needed. For some long-tail demands, it may also be necessary to directly operate computers, browsers, mobile phones, and so on, which is why we developed the OS Agent solution.
Question: Regarding AI smart glasses, could you share some progress?
Tan Dai: I'm not particularly clear about that either.
For Volcano Engine, whether it's internal or external demand, we adopt a unified solution to address it. From an external perspective, with the improvement of model capabilities, many things that couldn't be done before can now be realized, such as AI glasses, AI toys, smart cameras, door locks, etc.
Question: What is ByteDance's current view on the development of the Agent market?
Tan Dai: I wouldn't simply call it a "bet." As AI develops, chat and information processing are only a small part of it. To truly transform industries, Agents are a necessary path. If we don't do Agent-related work well, it will be difficult to realize AI's social and economic value.
There are roughly two types of Agents: vertical Agents and general Agents. For vertical Agents, Volcano Engine will explore based on its own advantageous fields, such as the data Agent we launched earlier.
For general Agents, it is more important to build a solid foundational framework and provide useful tools. Therefore, we launched the OS Agent solution, leveraging new AI cloud-native components, SandBox, and new models, allowing developers and enterprises to more easily create their own general Agents. This is an important development direction for Volcano in the future.
Question: What is the balance between internal support and external expansion for Volcano? Are there plans for team expansion in the near future?
Tan Dai: Since our establishment, we have used one unified technology stack for internal and external needs, serving internal teams and external customers alike. By reusing technology and resources, we can offer more cost-effective services to both; our MaaS and cloud services derive their cost-effectiveness from this.
Regarding business expansion, for scale-related businesses, such as serving more clients, we will need to expand the sales team, and personnel will increase accordingly; product development focuses more on quality, and we are also considering how to leverage AI to enhance the efficiency of our product development and maintenance.
Question: What are your plans for maintaining a leading advantage over the next year or two? And how do you define an Agent?
Tan Dai: We are committed to becoming the best cloud vendor in the AI era, always adhering to three principles: continuously optimizing models to maintain competitiveness; constantly reducing costs, including expenses, latency, and improving throughput; making products easier to implement. In the future, we will continue to focus on these three areas.
Currently, many intelligent computing centers are mainly used for model training, but models can only generate economic value during the application phase. From this year onward, the consumption of model applications will far exceed the training itself.
Regarding the definition of an Agent: qualitatively, an Agent should be able to complete end-to-end tasks that demand a high level of expertise and take a long time. From a technical standpoint, a system that does not apply thinking models and lacks reflection and planning capabilities is also hard to recognize as an Agent.
This year, everyone's definition of Agents will become clearer. Perhaps, much like the levels of autonomous driving, Agents will also be graded (L1, L2, L3, L4). The three or four thousand so-called Agents out there may only be at the L1 level, while true deployment may require reaching L2++ and above.
Question: 2025 is being called the year of AI Agents. Between large companies like ByteDance and startups like Manus, who has the greater opportunity?
Tan Dai: The size of the opportunity depends on the company's innovation capability, not its scale. One cannot judge by company size; maintaining continuous innovation capability is key.
Question: Large models all have hallucination issues. If they are used for data analysis, how can we reduce or avoid hallucinations?
Tan Dai: The acceptance level of hallucinations in large models varies by field. The key is to reduce the probability of hallucinations occurring.
First, the stronger the model's capability, the lower the likelihood of hallucinations. Second, a knowledge base can be introduced to give the model more reliable references when generating content. Furthermore, adding stages such as validation can continuously strengthen the model in this regard.