Tencent's version of "Sora" joins the generative video battlefield
Still in the early stages of exploration
Author | Huang Yu
Editor | Zhou Zhiyu
At the beginning of the year, the debut of the text-to-video model Sora set off a global race in AI video generation; nearly ten months later, Sora has yet to open to the public, while latecomer Tencent Hunyuan has jumped into the fray ahead of it.
On December 3rd, the Tencent Hunyuan large model officially launched its video generation capability. Consumer (C-end) users can apply for a trial through the Tencent Yuanbao app, while enterprise clients can access the service through Tencent Cloud, where the API is open for beta-test applications.
The move into text-to-video marks another milestone for the Tencent Hunyuan large model, following text-to-text, text-to-image, and 3D generation. Tencent has also open-sourced the video generation model; at 13 billion parameters, it is the largest open-source video model currently available.
According to Wall Street News, Tencent Hunyuan's video generation has virtually no barrier to entry: a user simply enters a text description, and the Hunyuan video generation model produces a five-second video.
Compared with Sora's minute-long clips and the roughly ten-second outputs of some "Sora-like" products, Hunyuan's generation length is not especially impressive.
At the media briefing that day, the head of Tencent Hunyuan's multimodal generation technology said that video duration is not a technical problem but purely a matter of computational power and data: doubling the duration increases the compute cost quadratically, which is not cost-effective.
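A quick back-of-the-envelope calculation makes the trade-off concrete (the assumption that compute is dominated by full self-attention over spatio-temporal tokens is ours, not a figure Tencent disclosed):

\[
\text{cost} \propto N^2 \propto T^2 \quad\Longrightarrow\quad T \to 2T \;\Rightarrow\; \text{cost} \to 4 \times \text{cost},
\]

where \(N\) is the number of latent tokens and \(T\) the video duration. Four times the compute for twice the footage is exactly the "square-level" growth the Hunyuan team describes.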
He also pointed out that most people consume video as one shot after another, so the first version of the Hunyuan video generation model ships with a five-second duration to meet the majority's needs first. "In the future, if there is strong demand for longer continuous shots, we will consider upgrades."
Tencent Hunyuan's video generation currently has four defining characteristics: realistic image quality, semantic fidelity to the prompt, smooth motion, and native shot transitions.
On the technology side, the Hunyuan video generation model adopts a DiT architecture similar to Sora's, with several architectural upgrades: a multimodal large language model serves as the text encoder, the DiT uses full attention and is sized according to a self-developed scaling law, and the latent space is handled by a self-developed 3D VAE.
The head of Tencent Hunyuan's multimodal generation technology noted that Hunyuan is among the first, and still very few, video generation models in the industry to use a multimodal large language model as the text encoder; the industry still predominantly relies on T5 and CLIP.
The choice reflects three advantages Tencent Hunyuan sees in this route: stronger understanding of complex text, native text-image alignment, and support for system prompts.
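To make the data flow concrete, here is a minimal, runnable sketch of the pipeline as described: an MLLM-style text encoder in place of T5/CLIP, a full-attention DiT over spatio-temporal latent tokens, and a 3D VAE decoder. Every class, shape, and layer size below is an illustrative assumption; none of it is Hunyuan's actual open-source API.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three components described in the article.
# All names, shapes, and sizes are illustrative assumptions.

class MLLMTextEncoder(nn.Module):
    """Stand-in for a multimodal LLM used as the text encoder (instead of T5/CLIP)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(256, dim)  # toy byte-level vocabulary

    def forward(self, prompt: str) -> torch.Tensor:
        ids = torch.tensor([list(prompt.encode("utf-8"))])
        return self.embed(ids)  # (1, tokens, dim)

class FullAttentionDiT(nn.Module):
    """Toy DiT block: full attention over flattened spatio-temporal tokens."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, latents: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Cross-attend video latent tokens to the text embeddings.
        out, _ = self.attn(latents, text_emb, text_emb)
        return latents - 0.1 * self.proj(out)  # one toy denoising step

class VAE3D(nn.Module):
    """Stand-in 3D VAE decoder: spatio-temporal latent tokens -> video tensor."""
    def __init__(self, dim: int = 64, frames: int = 8, h: int = 4, w: int = 4):
        super().__init__()
        self.frames, self.h, self.w = frames, h, w
        self.out = nn.Linear(dim, 3)  # per-token RGB

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        rgb = self.out(latents)  # (1, frames*h*w, 3)
        return rgb.view(1, self.frames, self.h, self.w, 3)

encoder, dit, vae = MLLMTextEncoder(), FullAttentionDiT(), VAE3D()
text_emb = encoder("a corgi surfing at sunset")
latents = torch.randn(1, 8 * 4 * 4, 64)  # noise over spatio-temporal tokens
for _ in range(10):                      # toy reverse-diffusion loop
    latents = dit(latents, text_emb)
video = vae.decode(latents)
print(video.shape)  # torch.Size([1, 8, 4, 4, 3])
```

The appeal of the MLLM route, as the article describes it, is that the same backbone that parses long, compositional prompts also emits embeddings natively aligned with visual features, which text-only encoders like T5 lack.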
Furthermore, the head of Tencent Hunyuan's multimodal generation technology noted that before building GPT, OpenAI invested heavily in validating that scaling laws (training larger models on more data) hold for language models; in video generation, however, no comparable validation had been publicly disclosed by academia or industry. Against this backdrop, the Tencent Hunyuan team validated scaling laws for image and video generation, concluding that image DiT obeys a scaling law, and that video models built on image DiT through two-stage training exhibit the same property.
"So our first version of the Tencent Hunyuan video generation model is based on this relatively strict inference of Scaling Law, and we developed a model with 13 billion parameters," said the head of Tencent Hunyuan's multimodal generation technology.
At the same time, Tencent Hunyuan is rapidly building out an ecosystem of models around video generation, including image-to-video, video dubbing, and models that animate 2D photos into digital humans.
The head of Tencent Hunyuan's multimodal generation technology noted that image-to-video is maturing toward usability faster than text-to-video, and Hunyuan may release its latest progress within a month.
Since the AI large model boom sparked by ChatGPT two years ago, the technical path of large language models has converged, while video generation models are still in the exploratory phase.
Analysts at Dongfang Securities noted that, following OpenAI's lead, language models have essentially converged on the GPT route, whereas in multimodal technology no single company holds an absolute lead and the technical path remains open to exploration.
The head of Tencent Hunyuan's multimodal generation technology likewise said that text-to-video remains immature overall, with a low rate of usable outputs.
As the most challenging area of multimodal generation, video generation demands heavy computational power and data resources. It remains less mature than text and image generation, and its commercialization and productization are progressing slowly.
OpenAI has likewise delayed Sora's update, citing a shortage of computational power, which is why it remains closed to the public.
Even so, in the race to seize the market, the video generation field has seen a flurry of releases since last November.
To date, many large-model makers at home and abroad have launched Sora-like products, including domestic players MiniMax, Zhipu, ByteDance, Kuaishou, and Aishi Technology, and overseas players Runway, Pika, and Luma. Constrained by computational power and technology, however, their generation lengths are generally within ten seconds.
To advance commercialization, large-model makers must find more application scenarios for video generation. Tencent's pitch this time is that the Hunyuan model's high-quality visuals suit industrial-grade commercial scenarios such as advertising, animation production, and creative video generation.
Video AI is the final piece of the multimodal puzzle and the domain most likely to produce blockbuster applications. How to balance compute investment against commercialization, however, remains the major challenge facing today's "Sora-like" video generation models.