Alter聊科技
2024.11.08 08:11

What new features does the fully upgraded 'New Qingying' bring to AI-generated videos?


Just now, the Zhipu Qingyan App launched "New Qingying" and open-sourced Zhipu's latest image-to-video model CogVideoX v1.5.

Over three months ago at the Zhipu Open Day, the video creation assistant Qingying was officially launched on Zhipu Qingyan, capable of generating a 6-second, 1440x960 high-definition video in just 30 seconds. This quickly led to innovative applications like short videos, meme images, and ad production.

Just over a month later, Zhipu open-sourced the 2B and 5B versions of CogVideoX, the image-to-video model behind Qingying. It runs smoothly on consumer-grade GPUs and has spawned numerous secondary development projects such as CogVideoX-factory.
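For readers who want to try the open-source weights themselves, here is a minimal sketch of driving the 5B image-to-video checkpoint through Hugging Face diffusers, which ships a CogVideoX pipeline. The model ID and arguments follow the public diffusers examples; treat the exact values as assumptions to verify against the current documentation.

```python
# Minimal sketch: the open-source CogVideoX image-to-video model via
# Hugging Face diffusers. Verify the model ID and arguments against the
# current diffusers docs before relying on them.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
# Offload idle weights to the CPU so the model fits on consumer GPUs.
pipe.enable_model_cpu_offload()

image = load_image("owl_on_stump.png")  # hypothetical input image
frames = pipe(
    image=image,
    prompt="Make the animal in the image move.",
    num_frames=49,           # roughly 6 seconds at the native 8 fps
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(frames, "owl.mp4", fps=8)
```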

After three months of refinement and evolution, what improvements does "New Qingying" bring, and what new experiences will it offer?

We were fortunate to get early access for testing. Below, we share what we found.

01 Higher Definition, Faster, and More Realistic Image-to-Video

Through discussions with multiple content creators, we learned that, compared with the novelty of text-to-video, there is higher demand for image-to-video because it offers better control and consistency, enabling quick generation of usable video material.

The first highlight of "New Qingying" is the comprehensive upgrade in image-to-video capabilities, which can be summarized in four aspects:

1. 4K Ultra-HD Resolution: While Qingying produced 6-second videos at 1440x960, "New Qingying" supports 10-second, 4K, 60fps ultra-HD videos.

2. Variable Aspect Ratios: Users can upload images of any aspect ratio to generate videos, even ultra-wide formats, and the output preserves the original proportions.

3. Multi-Channel Generation: Previous image-to-video products could only generate one video at a time, but "New Qingying" can produce four videos simultaneously.

4. Enhanced Model Capabilities: CogVideoX introduces innovations in content coherence, controllability, and training efficiency, significantly improving "New Qingying" in image quality, aesthetic expression, motion plausibility, and complex prompt understanding. It also excels at facial detail, continuity of performance, and the simulation of physical properties. Simply put, it's more natural and realistic.

The first three improvements are easily verifiable, but the real test lies in video quality and realism, the core value of video generation products. So, we used several sets of images with corresponding prompts to see if "New Qingying" lives up to the hype.

The first set featured a barn owl perched on a stump with a blurred background. The prompt was simple: "Make the animal in the image move." This tested motion plausibility, action continuity, and image quality.

The video performed impressively: the owl's head movement was natural and smooth, with every feather, texture, and detail clearly visible. Even the blurred background showed leaves rustling in the wind, and the owl's leg straps swayed realistically. It looked almost like real filmed footage.

The second set was a car driving in snow, with a more complex composition: a black car in the foreground and a distant forest. The prompt was more detailed: "A car launches hard off the line in the snow, kicking up billowing clouds of snow dust."

The result exceeded expectations: despite slight deformation of the car, the snow kicked up by the spinning tires, the car's acceleration, and the snow dust fading as it drove away all followed the laws of physics. Even the distant trees initially obscured by the car remained clearly visible, fitting the winter scene perfectly.

In summary, "New Qingying" faithfully reproduces the input image, with lighting and color tones blending naturally into the scene. The naturalness and realism of its videos are vastly improved. More importantly, users no longer need to endlessly re-roll generations or edit by hand; the generated material is almost ready to use.

02 From "Silent Videos" to the "Era of Sound"

Another highlight of "New Qingying" is its upcoming sound effects feature.

Currently, AI-generated videos are stuck in the "silent film" era or rely on manually added background music. "New Qingying" aims to fill this gap by automatically generating sound effects that match the visuals, bringing AI video into the "era of sound."

To test this, we downloaded three silent video clips from Pixabay and generated matching sound effects with Zhipu's CogSound model.

The first clip showed a harvester working in a field, a niche scenario. Yet CogSound accurately generated the rumbling engine sound of a tractor, with seamless transitions between the sound effects, evoking the feel of the autumn harvest.

The second clip featured someone pouring water by a campfire. CogSound nailed it again: crackling fire sounds followed by the sound of pouring water, perfectly synced with the visuals.

The third clip was a bird on a stump in heavy snow, a semantic "trap" that could easily produce mismatched sound effects (e.g., generic forest birdsong). Surprisingly, we heard wintry "white noise" and scattered bird calls that fit the scene.

If "New Qingying"’s image-to-video solves the need for high-quality 素材, the 音效 feature opens even broader applications.

For example, large-scale battle or disaster scenes in films can now use AI-generated sound effects, slashing production time and costs while accelerating the shift from assembly-line production to intelligent filmmaking.

Similarly, sound effects for games or ads, which once required professional teams and equipment, can now be handled by CogSound alone. Lowering the barrier to creation will undoubtedly catalyze industry growth.

But how does CogSound pull off such a complex task? The answer lies in the diffusion architecture now common in large generative models.

The core idea is shifting the diffusion process from high-dimensional raw audio space to a low-dimensional latent space, enabling efficient audio synthesis without sacrificing quality.

Zhipu's team used a U-Net-based latent diffusion model with chunked temporal-alignment cross-attention and rotary position encoding, ensuring semantic consistency between the sound effects and the video while achieving smooth transitions.
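To make "chunked temporal-alignment cross-attention" concrete, here is a toy sketch, not Zhipu's actual code: all dimensions are invented, and rotary position encoding is omitted for brevity. Each chunk of the audio latent attends only to the video features from the same time window, which is what keeps the sound semantically aligned with the picture.

```python
# Toy illustration of chunked temporal-alignment cross-attention.
# Shapes and sizes are invented; this is not CogSound's implementation.
import torch
import torch.nn as nn

class ChunkedCrossAttention(nn.Module):
    """Each audio-latent chunk attends to its own window of video features."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, audio_latent, video_feats, chunks: int):
        # Split both streams into the same number of temporal chunks so an
        # audio chunk only sees the visually corresponding moment.
        a_chunks = audio_latent.chunk(chunks, dim=1)
        v_chunks = video_feats.chunk(chunks, dim=1)
        out = [self.attn(a, v, v)[0] for a, v in zip(a_chunks, v_chunks)]
        return torch.cat(out, dim=1)

dim = 128
audio_latent = torch.randn(1, 64, dim)  # stand-in for VAE-encoded audio
video_feats = torch.randn(1, 16, dim)   # stand-in for video-encoder output
xattn = ChunkedCrossAttention(dim)

# One step of a (heavily simplified) denoising loop: perturb the latent,
# then refine it conditioned on the video.
noisy = audio_latent + 0.1 * torch.randn_like(audio_latent)
denoised = xattn(noisy, video_feats, chunks=4)
print(denoised.shape)  # torch.Size([1, 64, 128])
```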

In simpler terms, CogSound works like this:

First, GLM-4V's video-understanding capability analyzes the video's semantics and emotions. Then, the audio model generates sound effects, rhythms, or even complex mixes (explosions, flowing water, instruments, animal sounds, etc.) that match the content.
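As pseudocode, that two-stage flow is simply the following; both helper names are stand-ins we invented for illustration, not a published GLM-4V or CogSound API.

```python
# Schematic of the two-stage pipeline described above. Both helpers are
# invented stand-ins, not a real API.
def describe_video(video_path: str) -> str:
    """Stage 1 (GLM-4V's role): summarize the clip's content and mood."""
    return "a harvester crossing a field; steady engine rumble; autumn"

def generate_sound(video_path: str, description: str) -> bytes:
    """Stage 2 (CogSound's role): synthesize matching sound effects."""
    return b"\x00" * 16  # placeholder for an audio track aligned to frames

def add_sound_effects(video_path: str) -> bytes:
    return generate_sound(video_path, describe_video(video_path))
```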

03 "All in One" Content Creation Is Near

When video generation models first emerged earlier this year, many envisioned a future where more people could create video content.

So far, most products remain "creative toys" for short clips on social media, far from being real productivity tools: users still spend hours editing to produce a decent short video.

The bottlenecks fall into two categories:

1. Model Limitations: Semantic understanding (interpreting user prompts), video quality (fluency, stability, motion continuity, lighting consistency, style accuracy), and video length and resolution.

2. Usability: While AI tools like "New Qingying" are far simpler than professional software such as Premiere or After Effects, there is still a gap before amateurs can easily produce high-quality videos.

Thankfully, each technology iteration brings us closer. In under a year, Zhipu has made strides in video length, speed, resolution, and consistency, proving the effectiveness of scaling laws in video generation. Further breakthroughs may come soon.

Three months ago, Qingying was China's first consumer-ready video generation tool, taking the category from zero to one. Now, "New Qingying" marks a full upgrade. In this era of "technology explosion," all remaining challenges are just a matter of time.

There are also hints of usability improvements to come. With GLM-4-Voice (an emotion-capable speech model), CogSound, and CogMusic, Zhipu has built a multimodal matrix covering text, images, video, and sound, all based on its original, controllable GLM technology.

We tested this further: "New Qingying" generated videos from images while CogSound added the sound effects.

Beyond the impressive results, the efficiency stood out: the entire process took only minutes. Batch-generating videos with sound from photos may soon become a key application.

Could a tool flow that orchestrates multiple models turn an idea into a video with synchronized sound effects in a single step? Zhipu's official statement hints at it: "Our ideal state is: give AI a good idea, and it handles the rest, turning an idea or image into a film complete with background music." In other words, tasks that once required whole teams (scripting, visuals, sound) could soon be fully automated.

An "All in One" video creation platform is no longer distant.

04 Closing Thoughts

Soon, short-video creation may be fundamentally restructured.

Creators won't need to appear on camera or travel; they can simply describe the desired scene, and AI will batch-generate short videos that match their needs.

Content creation won't be limited to professionals; amateurs will be able to express their creativity through simple, intuitive tools.

This is an opportunity for large models, and for all creators.
