ByteDance releases the Dou Bao video generation large model: a first breakthrough in multi-subject interaction
ByteDance unveiled the Dou Bao Video Generation Model at the 2024 Volcano Engine AI Innovation Tour, overcoming the technical difficulty of generating interactions between multiple subjects. The model follows complex instructions and supports multi-camera generation, and its efficient computing units and optimized structure significantly improve video generation quality. Tan Dai, President of Volcano Engine, stated that usage of the Dou Bao model has surged, with daily token usage reaching 13 trillion, and that it is applied in fields such as e-commerce and education.
On September 24, 2024, the 2024 Volcano Engine AI Innovation Tour was held in Shenzhen, where the Dou Bao Big Model family gained a new member: the newly released Dou Bao Video Generation Model. The Dou Bao Music Model, Dou Bao Simultaneous Interpretation Model, Dou Bao Universal Model Pro, the text-to-image model, the speech synthesis model, and other vertical models were also significantly upgraded.
The Dou Bao Video Generation Model can follow complex prompts, unlocking the ability to sequence multiple shooting actions and interactions between multiple subjects.
The model uses efficient DiT fusion computing units, a newly designed diffusion model training method, and a deeply optimized Transformer structure. Together these allow it to compress and encode video and text more fully, support consistent multi-camera generation, and significantly improve the generalization ability of video generation.
According to the company, the Dou Bao Video Generation Model has reached an industry-leading level in semantic understanding, in complex interaction scenes with multiple moving subjects, and in content consistency across multi-camera switching.
Tan Dai, President of Volcano Engine, stated that the Dou Bao Video Generation Model supports consistent multi-camera generation in multiple styles and aspect ratios, and can be applied in e-commerce marketing, animation education, urban culture and tourism, micro-scripts, and other fields.
Tan Dai also stated that usage of the Dou Bao Big Model has grown explosively since its release. As of September, the average daily token usage of the Dou Bao language model exceeded 13 trillion, a tenfold increase over its initial release in May, and multimodal data processing reached 50 million images and 850,000 hours of speech per day.
Previously, most video generation models could only complete simple instructions, while the Dou Bao Video Generation Model can achieve natural and coherent multiple shooting actions and complex interactions between multiple subjects.
Some creators who tried the Dou Bao Video Generation Model in advance found that the generated videos not only follow complex instructions, with different characters carrying out multiple actions and interacting with one another, but also keep character appearance, clothing details, and even headgear consistent across different camera movements, approaching the look of live-action footage.
According to Volcano Engine, the Dou Bao Video Generation Model is based on the DiT architecture. Its efficient DiT fusion computing units allow videos to switch freely between large-scale motion and camera movements, with a range of camera-language capabilities such as zooming, orbiting, panning, scaling, and target following. The company also says the model offers professional-level lighting and color grading, producing visuals that are both attractive and realistic.
The deeply optimized Transformer structure significantly enhances the generalization ability of the Dou Bao Video Generation Model. It supports styles such as 3D animation, 2D animation, traditional Chinese painting, black and white, and impasto (thick-paint) illustration, and adapts to screens such as cinema, TV, computers, and mobile phones. It is suitable for enterprise scenarios such as e-commerce marketing, animation education, urban culture and tourism, and micro-scripts, and also provides creative assistance for professional creators and artists.