The next "AI shovel seller": Computing power scheduling is the key to inference profitability, and vector databases have become a necessity

Wallstreetcn
2025.12.24 04:14

With the accelerated penetration of generative AI applications, AI infrastructure software (AI Infra) is becoming the key "shovel seller" for application implementation, and computing power scheduling capability has become the core variable determining the profitability of model inference.

Recently, the Huang Zhonghuang team from Shenwan Hongyuan Research released an in-depth report titled "AI Infra: Another Shovel Seller Under Application Penetration." The report points out that infrastructure software is entering a golden development period. Unlike the model training phase dominated by giants, the inference and application deployment stages have opened up new commercial space for independent software vendors. Currently, two types of products are most critical: computing power scheduling software and data-related software.

The capability of computing power scheduling directly determines the profitability of model inference services. According to estimates, under a daily query volume of 1 billion, if using H800 chips, a 10% improvement in single-card throughput can increase the gross profit margin by 2-7 percentage points.

On the data front, vector databases have become a necessity, with Gartner predicting that the adoption rate of enterprise RAG technology will reach 68% by 2025. Overseas data vendors like MongoDB have seen a significant turning point in revenue growth in the second quarter of 2024, validating this trend.

Computing Power Scheduling: The Core Variable of Inference Profitability

AI Infra refers to the underlying hardware and software systems designed, built, managed, and optimized specifically for AI workloads. Its core goal is to complete AI model training and inference tasks efficiently and at scale. If developing large models is likened to "building a house," then AI Infra is the "toolbox": the hardware, software, and services needed to build, deploy, and maintain AI systems.

In the context of the domestic model price war, cost control has become a matter of life and death. The official pricing of DeepSeek V3 is only 2 yuan per million input tokens and 3 yuan per million output tokens, while comparable overseas products generally charge 1.25 to 5 US dollars per million tokens. This significant price gap makes domestic vendors far more sensitive to costs than their overseas counterparts.

Comparison of major manufacturers' computing power scheduling capabilities:

Huawei Flex:ai has achieved unified scheduling of heterogeneous computing power, supporting NVIDIA, Ascend, and third-party hardware. Through chip-level slicing (with granularity as fine as 10% of a card), it can raise average utilization by 30% in scenarios where a full card's computing power would otherwise go underused.

Alibaba Aegaeon has gone a step further, achieving token-level dynamic scheduling. Through fine-grained scheduling at the token level, phased computing, cache reuse, and elastic scaling, Aegaeon has reduced the number of GPUs required for 10 models from 1,192 to 213, achieving a resource savings rate of up to 82%. This real-time scheduling method of "sorting by token" is similar to upgrading parcel sorting from "by batch" to "by individual package."
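To make the resource effect concrete, here is a toy sketch (not Aegaeon's actual implementation; all workload numbers are illustrative assumptions) of why pooling many models' requests onto shared GPUs via fine-grained scheduling needs far fewer cards than reserving GPUs per model for peak load:

```python
# Toy illustration: static per-model GPU reservation vs. pooled fine-grained
# scheduling. Workload figures are assumptions, not Aegaeon's real numbers.
import math
import random

random.seed(0)
N_MODELS = 10
GPU_TOKENS_PER_SEC = 1_000            # assumed per-GPU decode throughput

# Assumed demand profile: a few hot models, many mostly idle ones
peak_demand = [random.choice([5_000, 300, 50]) for _ in range(N_MODELS)]
avg_demand = [p * random.uniform(0.05, 0.3) for p in peak_demand]

# Static plan: each model gets enough dedicated GPUs to cover its own peak
static_gpus = sum(math.ceil(p / GPU_TOKENS_PER_SEC) for p in peak_demand)

# Pooled plan: one shared queue scheduled at token granularity, sized to the
# aggregate average demand plus headroom (uncorrelated peaks rarely coincide)
HEADROOM = 1.5
pooled_gpus = math.ceil(sum(avg_demand) * HEADROOM / GPU_TOKENS_PER_SEC)

print(f"GPUs with static per-model allocation: {static_gpus}")
print(f"GPUs with pooled token-level scheduling: {pooled_gpus}")
print(f"resource savings: {1 - pooled_gpus / static_gpus:.0%}")
```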

The report data indicates that computing power scheduling software has become an invisible lever for improving gross margins:

Sensitivity analysis of gross margins shows that, assuming a model inference service provider uses H800 chips and handles 1 billion queries per day (daily revenue of approximately 4.4 million yuan, or 1.606 billion yuan annually), the gross margin rises from 52% to 80% as single-card throughput increases from 0.6 times to 1.4 times the baseline. In other words, every 10% of throughput gained through computing power scheduling optimization adds roughly 2-7 percentage points of gross margin.
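A quick back-of-the-envelope check of these figures, assuming (as a simplification, not necessarily the report's cost model) that annual compute cost scales inversely with per-card throughput while revenue stays fixed, roughly reproduces both the 52%-80% span and the 2-7 percentage point steps:

```python
# Gross margin sensitivity to single-card throughput, under the simplifying
# assumption that compute cost is inversely proportional to throughput.
REVENUE = 16.06e8                      # annual revenue in yuan (1B queries/day)
BASE_COST = 0.288 * REVENUE            # calibrated so margin at 0.6x is ~52%

def gross_margin(throughput_multiple: float) -> float:
    cost = BASE_COST / throughput_multiple   # higher throughput -> fewer cards
    return 1 - cost / REVENUE

prev = None
for m in (0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4):
    gm = gross_margin(m)
    step = "" if prev is None else f"  (+{(gm - prev) * 100:.1f} pp)"
    print(f"throughput {m:.1f}x -> gross margin {gm:.0%}{step}")
    prev = gm
# Steps shrink from ~6.9 pp (around 0.6x) to ~1.6 pp (around 1.4x),
# consistent with the reported 2-7 percentage point range.
```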

This also explains why there are significant differences in the gross margins of the cloud businesses of the three major overseas cloud providers: In the third quarter of 2025, Google Cloud's gross margin was 43.3%, Microsoft's Intelligent Cloud was 34.6%, while Amazon AWS was only 23.7%. As the proportion of cloud revenue from large AI models continues to rise, the impact of hardware scheduling capabilities on gross margins will become increasingly critical.

Vector Databases: The Essential Foundation for RAG Applications

The hallucination problem of large models has driven the rapid adoption of RAG (Retrieval-Augmented Generation) technology. Since large models cannot directly memorize vast amounts of proprietary knowledge and are prone to hallucinations when external knowledge is lacking, RAG has become standard for enterprises deploying AI applications. According to Gartner, 45% of enterprises globally had deployed RAG systems in scenarios such as intelligent customer service and data analysis by 2024, and this proportion is expected to exceed 68% in 2025.

The report points out that the core value of vector databases lies in supporting millisecond-level retrieval over massive data. In the inference flow of a RAG application, the system first converts the user query into a vector, then retrieves the most similar knowledge fragments from the vector database, and finally feeds the retrieved results together with the user question into the large model to generate an answer. This requires the vector database to sustain high-QPS (queries per second) real-time retrieval at the scale of hundreds of millions of records.
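For intuition, a minimal sketch of that retrieval path (embed the query, search by similarity, assemble the prompt) is shown below; the embedding function is a toy stand-in, and a real deployment would use an actual embedding model plus a vector database rather than in-memory arrays:

```python
# Minimal RAG retrieval sketch: embed -> similarity search -> prompt assembly.
# embed() is a toy placeholder for a real embedding model.
import numpy as np

DIM = 8

def embed(text: str) -> np.ndarray:
    """Toy embedding: a hash-seeded pseudo-random unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=DIM)
    return v / np.linalg.norm(v)

# Stand-in for the vector database: pre-embedded knowledge fragments in memory
docs = [
    "Single-card throughput on H800 drives inference gross margin.",
    "Vector databases serve millisecond-level similarity search for RAG.",
    "MongoDB stores JSON-like documents without predefined schemas.",
]
index = np.stack([embed(d) for d in docs])

def retrieve(query: str, top_k: int = 2) -> list[str]:
    q = embed(query)
    scores = index @ q                       # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(-scores)[:top_k]]

query = "Why do vector databases matter for RAG applications?"
prompt = "Context:\n" + "\n".join(retrieve(query)) + f"\n\nQuestion: {query}"
print(prompt)    # this augmented prompt is what gets sent to the large model
```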

Statistics from OpenRouter show that starting from the fourth quarter of 2024, the token consumption from various large models accessed via API interfaces has rapidly increased, nearly tenfold within a year, directly driving the demand for vector databases.

Reshaping the Database Landscape: OLTP Counterattack, Real-Time is King

In the era of generative AI, data architecture is shifting from "analysis-first" to "real-time operation + analysis collaboration." Traditional data warehouses/lakes are designed for batch processing and after-the-fact insights, but AI applications require millisecond-level responses, and agents need to continuously acquire real-time data and make quick decisions. This high-frequency, small-batch, low-latency demand for real-time transaction processing is precisely the core strength of OLTP (Online Transaction Processing) databases.

The report points out that the data architecture in the AI era is shifting from "analysis-first" to "real-time operations + analysis collaboration"... MongoDB, with its "low threshold + high flexibility," meets the low-cost AI implementation needs of small and medium-sized customers, showing outstanding growth elasticity. Snowflake and Databricks... need to cope with cross-industry competition from cloud service providers (CSPs) and with shortcomings in real-time capabilities.

Specifically:

MongoDB: Low threshold entry into the small and medium-sized customer market

As a document-oriented NoSQL database, MongoDB is naturally suited to unstructured data storage and high-frequency, real-time CRUD operations. Its revenue growth inflected upward in the second quarter of 2024, with core product Atlas posting revenue growth of 26%, 29%, and 30% in the first through third quarters of fiscal 2026, significantly higher than overall revenue growth.

MongoDB's competitive advantages are reflected in three aspects: first, its document-oriented design eliminates predefined table structures, storing data in a JSON-like format suited to AI-native applications; second, its acquisition of Voyage AI for $220 million in February 2025 strengthened its vector retrieval capabilities, with Voyage's embedding models ranking first, fourth, and fifth in the HuggingFace RTEB evaluation; third, the newly launched AMP (Application Modernization Platform) helps customers migrate from traditional relational databases to modern document databases.
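As an illustration of the first point, a minimal pymongo sketch is shown below; the connection string, database, and field names are placeholders, and the vector field is only a toy example of how embeddings can sit alongside ordinary fields:

```python
# Document-model sketch with pymongo: no predefined table structure is needed,
# and documents in one collection can carry different fields.
# Connection string, names, and values are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["demo"]["knowledge_base"]

collection.insert_many([
    {
        "title": "H800 inference cost note",
        "tags": ["inference", "gross margin"],
        "embedding": [0.12, -0.08, 0.33, 0.91],   # toy 4-dim vector field
    },
    {
        "title": "RAG deployment checklist",
        "source": {"type": "internal wiki", "updated": "2025-11"},
        # this document simply omits fields it does not need
    },
])

# Ordinary field query; on MongoDB Atlas an "embedding" field like the one
# above could additionally be indexed for approximate vector search.
doc = collection.find_one({"tags": "inference"})
print(doc["title"] if doc else "no match")
```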

In the third quarter of fiscal 2026, MongoDB's gross margin reached 76%, and its operating profit margin is expected to reach 18% by year-end, with annual revenue growth of approximately 21%-22%, close to the Rule of 40 threshold (revenue growth rate + profit margin ≥ 40%).

Snowflake and Databricks: Extending to full-stack tools

Snowflake and Databricks, both centered on OLAP, have responded by expanding vertically upstream and downstream, each along its own path. Snowflake achieved data lake-warehouse compatibility through Iceberg Tables in 2025, launched Snowpark with support for Python and other languages, and provides AI toolchains such as Cortex AI and Snowflake ML. Its revenue for fiscal 2025 reached $3.626 billion, up 29.21% year on year, and is expected to reach $4.446 billion in fiscal 2026.

Databricks, for its part, acquired the serverless Postgres provider Neon for $1 billion in May 2025 to strengthen its OLTP capabilities, and subsequently launched the AI-native database Lakebase and Agent Bricks. Its annualized revenue for 2025 exceeded $4.8 billion, up 55% year on year, with annualized revenue from data lake-warehouse products exceeding $1 billion and a net retention rate exceeding 140%.

The two companies occupy core scenarios in data-intensive industries such as finance and healthcare with their full-process toolchain and customer stickiness. As of the third quarter of the 2026 fiscal year, Snowflake has 688 high-value customers with annual consumption exceeding $1 million, and 766 of the Forbes Global 2000 companies have become its clients.

GPU-Dominated Storage Architecture: Technology Upgrade in Progress

AI inference has entered a new stage of real-time processing and PB-level data access, with storage IO shifting from "behind-the-scenes support" to "performance lifeline." KV cache accesses in LLM inference are only 8KB-4MB in granularity, and vector database retrievals can be as small as 64B-8KB, while thousands of concurrent requests must be served in parallel.
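A rough, illustrative calculation (all workload figures below are assumptions, not from the report) shows why this access pattern stresses IOPS and latency rather than raw bandwidth:

```python
# Back-of-the-envelope IOPS estimate for small-granularity KV-cache / vector
# reads during inference. All workload numbers are illustrative assumptions.
CONCURRENT_REQUESTS = 2_000        # parallel inference requests
TOKENS_PER_SECOND = 30             # decode speed per request
READS_PER_TOKEN = 4                # small KV-cache/vector fetches per token
IO_SIZE_BYTES = 8 * 1024           # 8 KB granularity, per the report

iops = CONCURRENT_REQUESTS * TOKENS_PER_SECOND * READS_PER_TOKEN
bandwidth_gb_s = iops * IO_SIZE_BYTES / 1e9

print(f"required IOPS:     {iops:,}")          # 240,000 small reads per second
print(f"implied bandwidth: {bandwidth_gb_s:.1f} GB/s")
# The bandwidth is modest, but sustaining hundreds of thousands of tiny reads
# per second makes per-IO latency the bottleneck, which is what GPU-direct
# access to SSDs (milliseconds down to microseconds) is meant to address.
```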

NVIDIA's SCADA (Accelerated Data Access Extension) solution achieves direct GPU connection to SSDs, reducing IO latency from milliseconds to microseconds. This solution employs a "GPU-switch-SSD" direct connection architecture, with test data showing that the IO scheduling efficiency of one H100 GPU is more than twice that of a Gen5 Intel Xeon Platinum CPU.

This necessitates a technological upgrade for vector databases: adopting GPU-adapted columnar storage, switching retrieval algorithms to GPU-parallel versions, and autonomously managing GPU memory allocation. These technological evolutions are reshaping the competitive landscape of data infrastructure.