DingTalk's Wukong Gets Its Golden Cudgel

Wallstreetcn
2026.03.31 13:47

On March 17, DingTalk launched an AI platform named "Wukong," which offers automation capabilities such as controlling a browser to compare prices across multiple e-commerce sites and generate Excel files, creating scheduled tasks automatically, and sending summaries to mobile devices. It can also generate websites and data animations, demonstrating "from 0 to 1" creation capabilities. The article likens Alibaba's Qwen3.5-Omni model to the Golden Cudgel that Wukong has yet to pick up.

Sun Wukong became a real headache for the Heavenly Court only after he obtained the Golden Cudgel, a "natal treasure" he could wield at will. With it, he was like a tiger that had grown wings.

On March 17, DingTalk released an AI platform named "Wukong." It can take over your browser, search for items for you, and operate your computer while you are away—it has hands and feet, and it can execute.

Meanwhile, Alibaba's newly released Qwen3.5-Omni—an all-modal model capable of watching videos, listening to audio, and deconstructing them into structured data ready for work—is very much like Sun Wukong's Golden Cudgel.

Currently, the monkey and the staff have not yet fully merged.

But once they do, it will be incredibly powerful.


I. What Can Wukong Do?

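DingTalk's Wukong is a powerful but rule-abiding enterprise-grade "lobster," the nickname Chinese users have given to OpenClaw-style AI agents that operate a computer on the user's behalf.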

(1) One-sentence web-wide price comparison

I asked it to search for the "DJI Osmo Pocket 3" on Taobao, JD.com, and Pinduoduo, compare prices and sales volumes, take screenshots, and organize them into an Excel file.

It took over my browser—opening Taobao, entering keywords, scrolling, and saving screenshots; it then jumped to JD.com to perform the same actions, and finally to Pinduoduo.

After running through all three platforms, an Excel file appeared on my desktop: the top 5 cheapest and highest-selling items, arranged by platform, store, price, and link, with the lowest price highlighted in red.

It wasn't just "telling" me which one was cheaper. It was "comparing prices, taking screenshots, and creating tables" on my behalf. I only typed a single paragraph throughout the process.

Of course, there are some rough edges. You need to be logged into your accounts on each platform beforehand; otherwise, CAPTCHAs will stop it.
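The table Wukong produced can be approximated in a few lines. The sketch below is not Wukong's implementation. It assumes the listings have already been collected into records, and the `Listing` type, the sample rows, and their prices are all hypothetical placeholders. It keeps the five cheapest listings, flags the lowest price (standing in for the red highlight), and writes CSV text rather than a real Excel file so that it needs only the standard library.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Listing:
    platform: str
    store: str
    price: float  # yuan
    link: str


def cheapest(listings, n=5):
    """Return the n lowest-priced listings, cheapest first."""
    return sorted(listings, key=lambda item: item.price)[:n]


def to_csv(listings):
    """Render listings as CSV text, flagging rows that carry the lowest price."""
    lowest = min(item.price for item in listings)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["platform", "store", "price", "link", "lowest"])
    for item in listings:
        flag = "yes" if item.price == lowest else ""
        writer.writerow(
            [item.platform, item.store, f"{item.price:.2f}", item.link, flag]
        )
    return buffer.getvalue()


# Hypothetical sample data, not real prices or links.
SAMPLE = [
    Listing("Taobao", "store-a", 3299.0, "https://example.com/a"),
    Listing("JD.com", "store-b", 3499.0, "https://example.com/b"),
    Listing("Pinduoduo", "store-c", 3199.0, "https://example.com/c"),
]

if __name__ == "__main__":
    print(to_csv(cheapest(SAMPLE)))
```

The hard part in practice is the step this sketch skips: driving a logged-in browser through three different storefronts and extracting comparable fields from each.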

(2) Content radar

The second highly practical scenario doesn't happen in front of the computer.

I used DingTalk on my phone to send Wukong a message: "Set a daily scheduled task at 9 AM to automatically open the computer browser, search for 'latest AI trends,' and create an AI-related topic. Extract 3 summaries with source links and send them to my phone."

Wukong called up the relevant Skills and automatically created the task. A few minutes after 9 AM the next day, a morning briefing popped up on my phone—neatly formatted with clickable links.

(3) Attracting customers and building websites

I also used Wukong for a website-building task, selecting skills from the official skill market. The result was a runnable website with complete source code. The aesthetics still need refinement, but the "from 0 to 1" capability is certainly there. Marketing departments can use it to set up scheduled competitor monitoring, and animation specialists can generate complete data animation videos with a single sentence.

There were even bolder demonstrations at the launch event. A car repair shop owner told Wukong, "Help me attract 100 customers," and the AI autonomously completed the entire chain: competitor analysis, learning from trending content, social media posting, and steering the comments section.

If these scenarios can be executed stably on a daily basis, it indicates that AI is moving from "executing instructions" to "helping you finish the job."

Having covered the highlights, let's also address the instability that is inevitable in a product's early stages. The company provided a data point from a case study: a user reported that creating a PPT consumed approximately 270 million tokens. As AI moves from conversation to execution, operations such as file handling, repeated modifications, and cross-system calls push token consumption up by orders of magnitude.

Wukong's RealDoc file system is officially claimed to improve token efficiency fivefold. The direction is right, but cost-conscious SMEs may need a more stable system and better skills before the ROI becomes clear and justifiable.
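To make the scale concrete, here is a rough back-of-the-envelope calculation. It rests on two assumptions that the article does not confirm: that "fivefold token efficiency" means the same job needs one fifth of the tokens, and that the ceiling of 0.8 yuan per million input tokens quoted later in this article for Qwen3.5-Omni can serve as an illustrative price. The article does not say which model or price applied to the PPT case.

```python
PPT_TOKENS = 270_000_000        # reported in the case study
EFFICIENCY_GAIN = 5             # RealDoc's claimed improvement
PRICE_YUAN_PER_MILLION = 0.8    # assumption: illustrative price only


def cost_yuan(tokens, price_per_million=PRICE_YUAN_PER_MILLION):
    """Cost in yuan for a given token count at a flat per-million price."""
    return tokens / 1_000_000 * price_per_million


# Assumption: a fivefold gain means one fifth of the tokens for the same job.
tokens_after = PPT_TOKENS // EFFICIENCY_GAIN

if __name__ == "__main__":
    print(f"before: {PPT_TOKENS:,} tokens, {cost_yuan(PPT_TOKENS):.1f} yuan")
    print(f"after:  {tokens_after:,} tokens, {cost_yuan(tokens_after):.1f} yuan")
```

Under these assumptions the job drops from 270 million to 54 million tokens, and from roughly 216 yuan to roughly 43 yuan. Output tokens and any higher-priced model would raise both figures.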


II. What Does the Golden Cudgel Look Like?

Wukong has hands and feet but currently lacks something: eyes and ears. It can operate browsers, read documents, and execute across devices, but it cannot yet understand what is happening in a video or discern who said what and with what tone in a recording.

You must have experienced this: a two-hour meeting recording sits quietly in your cloud drive, and no one ever watches it—because the cost of re-watching is almost equivalent to holding the meeting again. You come across a viral marketing video and vaguely feel its conversion logic is worth learning, but you don't have the time to deconstruct it frame by frame. English podcasts, customer service recordings in dialects—they are heard and then forgotten. A vast amount of valuable audio-visual content leads to nothing after being "watched."

Alibaba's newly released Qwen3.5-Omni aims to turn "watched and gone" into "deconstructed for use."

Let's talk about our actual tests.

We used it to deconstruct viral TikTok marketing videos.

We fed it a merchant recruitment video from Yiwu, and the model produced a structured breakdown across seven dimensions: hook, selling point sequence, visual proof points, subtitle strategy, emotional rhythm, CTA timing, and target audience. The core insight that impressed me was: "This video isn't selling a product, it's selling certainty." It identified a three-level chain of physical evidence for building trust, a numerical anchor of "20,000 SKUs + 20-cent average price," and "nanny-style" promises for risk reversal.

More important is its ability to transfer what it learned: when asked to write a script for a "T-shirt customization factory" using the same logic, it produced an executable 5-step template. The hook became "pulling the T-shirt to show elasticity," and the proof of strength became an "inkjet close-up of the printing machine + rubbing without fading." It even included a guide for managing the comments section.

There was also a "dictated coding" test. I hand-drew a deliberately rough app wireframe, opened the camera, and dictated my requirements. It directly generated runnable React code. As I continued to dictate modifications—sidebar, rounded corners, dark theme, press animations—the context was never lost through multiple iterations. Watching, speaking, and modifying simultaneously is the most natural way for humans to interact, and the model handled it.

Behind these results is the underlying architecture: a Mixture-of-Experts (MoE) structure with native multimodal pre-training on over 100 million hours of audio data. It achieved SOTA results in 215 third-party tests, with multiple metrics surpassing Gemini-3.1 Pro. It features a 256K context window supporting over 10 hours of audio, speech recognition for 113 languages and dialects, and TTS synthesis for 36 languages and dialects. Pricing: input costs less than 0.8 yuan per million tokens, less than one-tenth the cost of Gemini-3.1 Pro.

In short: Qwen3.5-Omni makes audio and video "deconstructible"—not just "understood," but broken down into data assets that are searchable, reusable, and directly actionable.


III. When Wukong Picks Up the Golden Cudgel

Wukong can operate browsers, read and write files, execute across devices, and call upon thousands of DingTalk capabilities. But if it cannot process audio and video, it cannot serve users in some of the most natural business scenarios. Qwen3.5-Omni fills exactly this gap: it can break videos down into timestamped structured data, understand multilingual recordings, and comprehend mixed visual and voice input.

If the two are successfully combined: you throw a two-hour meeting recording at it. It doesn't just generate minutes—it hears who said what and when, whether the tone was firm or hesitant, and which words are to-do items, then directly creates tasks in DingTalk, assigns them to the right people, and sets deadlines. From "understanding the meeting" to "executing meeting conclusions," no manual intervention is needed in between.

Operations teams will no longer need to manually monitor competitors' short video accounts every day. The AI can watch competitor videos itself, deconstruct conversion logic—just as Qwen3.5-Omni deconstructed that TikTok video—output transferable script templates, then automatically publish adapted content via Wukong on social media, or even go further to attract and acquire customers. From "competitor analysis" to "content production" to "customer acquisition and conversion," the entire process is handled.

Or more routinely: quality inspection of customer service recordings. In the past, this required humans to listen, record, and score, limiting the daily volume of inspections. With all-modal capabilities integrated, the AI listens to all recordings itself, outputs the emotional trajectory and script score for every call, flags problematic calls, generates improvement suggestions, and writes the results into DingTalk's management system.

The common logic across these scenarios is identical: Perception → Understanding → Execution, a complete closed loop. Wukong handles execution, while Qwen3.5-Omni handles perception. Furthermore, Qwen3.5-Omni's pricing of less than 0.8 yuan per million tokens makes the entire flywheel financially viable; the puzzle is just one step away from being completed.
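The closed loop described above can be sketched as three functions chained together. This is a toy illustration only: `perceive` and `execute` are hypothetical stand-ins that call neither Qwen3.5-Omni nor DingTalk, the keyword rule in `understand` is a placeholder for real model reasoning, and the meeting excerpt is invented.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start_s: int  # offset into the recording, in seconds
    text: str


@dataclass
class Task:
    owner: str
    description: str


def perceive(recording):
    """Stand-in for the perception model: return timestamped speaker segments.

    Here the 'recording' is already a list of (speaker, start_s, text) tuples.
    """
    return [Segment(*row) for row in recording]


def understand(segments, marker="I will"):
    """Toy rule: a segment containing the marker is a to-do for its speaker."""
    return [Task(s.speaker, s.text) for s in segments if marker in s.text]


def execute(tasks):
    """Stand-in for the agent: return the actions it would take."""
    return [f"Create task for {t.owner}: {t.description}" for t in tasks]


def run(recording):
    """Perception -> Understanding -> Execution, with no manual step between."""
    return execute(understand(perceive(recording)))


# Hypothetical meeting excerpt.
MEETING = [
    ("Li", 12, "Sales were flat last quarter."),
    ("Wang", 47, "I will send the revised forecast by Friday."),
]

if __name__ == "__main__":
    for action in run(MEETING):
        print(action)
```

Each stage only consumes the previous stage's output, which is why a missing perception layer blocks the whole chain regardless of how capable the execution layer is.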


Conclusion

In Journey to the West, Wukong was already a formidable fighter when he sprang from the rock. But after he obtained the Golden Cudgel, took a master, and set out on the journey, he became increasingly powerful.

DingTalk's Wukong has already emerged. The Golden Cudgel has just been forged and has not yet been handed over. The journey to obtain the scriptures is long: token costs must drop, products must be refined, and the mindshare of 27 million enterprises must be won one by one.

But the monkey, the staff, and the road are all there.


This article is from the WeChat public account "Hard AI".
