Sora's rival! Meta's most powerful immersive AI media model is here: Movie Gen, with a 30 billion parameter video model

Wallstreetcn
2024.10.04 20:10

Meta claims that Movie Gen is "the most advanced and immersive storytelling model kit with the best effects", trained based on authorized and publicly available data, capable of generating videos at a speed of 16 frames per second for up to 16 seconds; the 13 billion parameter model supports audio generation; in human evaluations, Movie Gen's video generation capability outperforms Sora with a net win rate of 8.2. Meta has not specified the release date, but Zuckerberg mentioned that it will be launched on Instagram next year

Author: Li Dan

Source: Hard AI

OpenAI's Sora faces a formidable competitor: Meta has launched Movie Gen, which it touts as its most advanced foundation model for media.

Meta describes Movie Gen as breakthrough generative AI research for media, spanning modalities including images, video, and audio. Simply by entering text, users can create custom videos and sounds, edit existing videos, and turn personal images into unique videos. Meta says that in human evaluations, Movie Gen's performance on these tasks matches that of comparable models in the industry.

Meta introduces Movie Gen as its "most advanced and immersive storytelling model suite." It builds on the company's first wave of generative AI media research, the Make-A-Scene series of models for creating images, audio, video, and 3D animation, and on its second wave, the Llama Image foundation models, which with the emergence of diffusion models brought higher-quality image and video generation and editing.

Text-to-Video Up to 16 Seconds, a 13 Billion Parameter Audio Generation Model, and an 8.2 Net Win Rate over Sora in Human Evaluations of Video Generation

In summary, Movie Gen offers four main functions: video generation, personalized video generation, precise video editing, and audio generation.

For video generation, Meta explains that users only need to provide a text prompt; Movie Gen then uses a joint model optimized for both text-to-image and text-to-video to create high-definition images and videos. The Movie Gen video model has 30 billion parameters and can generate videos up to 16 seconds long at 16 frames per second.
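For a rough sense of scale, a 16-second clip at 16 frames per second works out to 256 frames. The minimal Python sketch below simply encodes those reported limits; the VideoRequest structure and its field names are hypothetical illustrations, not any published Movie Gen API.

```python
from dataclasses import dataclass

# Hypothetical request structure for a text-to-video call. The class and field
# names are illustrative only and do not reflect any published Movie Gen API.
@dataclass
class VideoRequest:
    prompt: str
    duration_s: float = 16.0   # reported maximum clip length
    fps: int = 16              # reported generation frame rate

    def frame_count(self) -> int:
        # e.g. 16 s at 16 fps -> 256 frames per clip
        return int(self.duration_s * self.fps)

req = VideoRequest(prompt="a baby hippo bouncing on a trampoline", duration_s=10)
print(req.frame_count())  # 160 frames for a 10-second clip
```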

Meta states that these models can reason about object motion, interactions between subjects and objects, and camera movement, and can learn plausible motion for a wide variety of concepts, making them the most advanced models in their class. To showcase this capability, Meta presented several 10-second clips, including one of a bouncing baby hippo in the style of "Moo Deng," the hippo that has become an internet sensation.

Wallstreetcn noted that, judged by maximum video length, Movie Gen still falls short of Sora, which OpenAI unveiled in February this year; one of Sora's most impressive feats is generating text-based videos up to 60 seconds long. Compared with Meta's own Emu Video model announced last November, however, Movie Gen is a clear step forward: Emu Video could only generate videos up to 4 seconds long at 16 frames per second.

Movie Gen not only generates videos directly but also offers strong personalized video generation. Meta says it has extended the base model to support personalized videos: a user supplies an image of a person along with a text prompt, and Movie Gen generates a video that includes the person from the reference image together with visual details matching the prompt. Meta claims state-of-the-art results in creating personalized videos that preserve human identity and motion.

In a video Meta demonstrated, a user provides a photo of a girl and the text prompt "a female DJ wearing a pink vest spins records, with a cheetah by her side"; Movie Gen then generates a video of a record-spinning DJ who resembles the girl in the photo, accompanied by a cheetah.

For precise video editing, Meta stated that Movie Gen uses an editing variant of the same foundation model. Given an input video and a text prompt, it performs the requested task precisely to produce the desired output. It combines video generation with advanced image editing, handling localized edits such as adding, removing, or replacing elements, as well as global changes such as modifying the background or style. Unlike traditional tools that demand professional skills, or generative tools that lack precision, Movie Gen preserves the original content and edits only the relevant pixels.
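Meta has not disclosed how this is implemented. Purely to illustrate the general idea of "editing only the relevant pixels," the sketch below composites an edited frame back into the original through a binary mask, so untouched regions keep their original content exactly; it is a generic masked-compositing example, not Meta's pipeline.

```python
import numpy as np

# Generic masked compositing: keep the original pixels everywhere except the
# masked region, where the edited pixels are used. This illustrates the idea of
# "editing only the relevant pixels"; it is not Meta's actual editing method.
def apply_local_edit(original: np.ndarray, edited: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    # original, edited: (H, W, 3) arrays; mask: (H, W) with values in [0, 1]
    mask = mask[..., None]  # broadcast over the color channel
    return mask * edited + (1.0 - mask) * original

frame = np.zeros((4, 4, 3))           # placeholder "original" frame
edit = np.ones((4, 4, 3))             # placeholder "edited" frame
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                  # edit only a 2x2 region
print(apply_local_edit(frame, edit, mask)[..., 0])  # ones only inside the mask
```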

One example Meta provided: when a user asks for a penguin to be dressed in clothing from the era of Britain's Queen Victoria, Movie Gen generates a penguin wearing a red dress trimmed with lace.

Regarding audio generation, Meta said it trained a 13 billion parameter audio generation model. The model takes a video and an optional text prompt and generates up to 45 seconds of high-quality, high-fidelity audio, including ambient sound, Foley sound effects, and instrumental background music, all synchronized with the video content. Meta also introduced an audio extension technique that can generate coherent audio for videos of arbitrary length, achieving state-of-the-art performance overall in audio quality, video-to-audio alignment, and text-to-audio alignment.
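Meta has not explained how the audio extension works. Purely as an assumption about one plausible chunking scheme, the sketch below tiles an arbitrary-length video into overlapping windows no longer than the reported 45-second limit, so each window can be handled separately and neighboring chunks blended into continuous audio; the window and overlap values are invented for illustration.

```python
# Assumption, not Meta's published method: tile an arbitrary-length video into
# overlapping windows of at most 45 seconds (the reported audio limit), so each
# window stays within the model's range and adjacent chunks can be cross-faded.
def audio_windows(video_length_s: float, max_window_s: float = 45.0,
                  overlap_s: float = 5.0) -> list[tuple[float, float]]:
    windows = []
    start = 0.0
    step = max_window_s - overlap_s
    while start < video_length_s:
        end = min(start + max_window_s, video_length_s)
        windows.append((start, end))
        if end >= video_length_s:
            break
        start += step
    return windows

print(audio_windows(100.0))
# [(0.0, 45.0), (40.0, 85.0), (80.0, 100.0)]
```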

One example provided by Meta is generating the roar of an ATV engine accelerating, accompanied by guitar music, as well as orchestral music accompanied by the sounds of rustling leaves and snapping branches.

Meta also showed A/B comparison results for the four capabilities above, where a positive net win rate indicates that human evaluators preferred Movie Gen's output over that of competitors such as Sora. On direct video generation, Movie Gen achieved a net win rate of 8.2 against Sora.

Trained on Authorized and Publicly Available Data, No Clear Release Date; Zuckerberg Says It Will Land on Instagram Next Year

What data was Movie Gen trained on? Meta's statement did not give specifics, saying only: "We trained these models on authorized and publicly available datasets."

Some commentators point out that for generative AI tools, the provenance of training data and what gets scraped from the internet remain contentious issues, and the public rarely learns which texts, videos, or audio clips were used to build any given large model.

Other commentators note that Meta described the training dataset only as "proprietary/commercially sensitive" without giving details, so one can only speculate that the data includes large volumes of video from Instagram and Facebook, some content from Meta's partners, and much other inadequately protected material treated as "publicly available."

As for timing, in Friday's announcement Meta did not say when Movie Gen will be made available to the public, stating only vaguely "possibly in the future." Since OpenAI announced Sora in February this year, that model likewise has not been truly opened to the public, and OpenAI has not disclosed any planned release date.

However, Meta CEO Zuckerberg said Movie Gen will come to Meta's social media platform Instagram next year. He posted a Movie Gen-generated video on his personal Instagram account showing him using a leg press machine, with the background changing as he exercises: first a neon-lit, futuristic gym, then him in gladiator armor, then him pushing a flaming machine made of solid gold, and finally him leg-pressing a box of chicken nuggets surrounded by fries.

Zuckerberg captioned the post saying that Meta's new Movie Gen AI model can create and edit videos, and that every day is leg day. The model will land on Instagram next year.

Under Meta's official announcement and demo of Movie Gen on the social platform X, several of the most-liked comments show netizens already urging Meta to release the model; one user asked whether there is any way to try it out.