Meituan’s Open‑Source Digital Human Model Delivers Real‑World Performance Across MV, E‑Commerce, and More

Meituan’s LongCat‑Video‑Avatar 1.5 replaces its audio encoder with Whisper‑Large and cuts inference to eight steps. In an evaluation with 770 raters and 13,240 ratings, it outperforms competing models in lip‑sync, style generalization, multi‑person scenes, and overall visual fidelity.

SuanNi

LongCat‑Video‑Avatar 1.5, the latest release of Meituan’s open‑source digital‑human video generation framework, swaps the original Wav2Vec2 audio encoder for Whisper‑Large and reduces inference to eight function evaluations (NFE, the number of denoiser forward passes) using Distribution Matching Distillation 2 (DMD2).

Audio encoder upgrade and lip‑sync improvement

Whisper‑Large, a widely used speech‑recognition model, captures finer‑grained temporal audio information, which yields noticeably smoother and more natural lip movements. This addresses a common failure mode in which lip‑sync mismatches break viewer immersion.
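
To make the temporal side of this concrete, the sketch below pairs each video frame with the audio feature frames around it. Whisper's encoder emits 1,500 feature frames per 30‑second window, i.e. 50 per second. The 25 fps video rate, the context size, and the function name `audio_window_for_frame` are assumptions for illustration; the source does not describe how LongCat‑Video‑Avatar performs this alignment.

```python
WHISPER_FRAMES_PER_SECOND = 50  # 1500 encoder frames per 30 s window


def audio_window_for_frame(frame_idx, video_fps=25, context=2, num_audio_frames=None):
    """Return indices of the audio feature frames centred on one video frame.

    video_fps and context are illustrative assumptions, not values from the source.
    """
    # Centre of the video frame mapped onto the audio feature timeline,
    # computed with integers to avoid floating-point rounding surprises.
    center = (2 * frame_idx + 1) * WHISPER_FRAMES_PER_SECOND // (2 * video_fps)
    indices = range(center - context, center + context + 1)
    if num_audio_frames is None:
        return [max(0, i) for i in indices]
    # Clamp so that edge frames repeat the boundary feature.
    return [min(max(0, i), num_audio_frames - 1) for i in indices]
```

Under these assumptions each video frame spans two audio feature frames, so a small context window gives the generator a view of the phoneme transitions just before and after the frame being rendered.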

Inference acceleration

By distilling the denoising process from dozens of steps down to eight, the framework cuts the number of denoiser forward passes several‑fold while preserving visual fidelity, making server‑side deployment more flexible and cost‑effective.
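
As a rough back‑of‑the‑envelope sketch, the saving can be expressed as a ratio of denoiser forward passes. The 50‑step baseline and the guidance factor below are hypothetical, since the source says only "dozens of steps", and the sketch assumes the distilled model needs a single pass per step.

```python
def denoiser_speedup(baseline_steps, distilled_nfe=8, evals_per_baseline_step=1):
    """Ratio of denoiser forward passes before and after distillation."""
    if baseline_steps <= 0 or distilled_nfe <= 0 or evals_per_baseline_step <= 0:
        raise ValueError("all arguments must be positive")
    baseline_nfe = baseline_steps * evals_per_baseline_step
    return baseline_nfe / distilled_nfe


# Hypothetical baseline: 50 sampling steps, one forward pass each.
print(denoiser_speedup(50))                             # 6.25
# Same baseline with classifier-free guidance (two passes per step).
print(denoiser_speedup(50, evals_per_baseline_step=2))  # 12.5
```

This ratio is an upper bound on the end‑to‑end speedup, because fixed costs such as audio encoding and latent decoding do not shrink with the step count.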

Supported tasks and scenarios

Version 1.5 natively supports Audio‑Text‑to‑Video (AT2V), Audio‑Text‑Image‑to‑Video (ATI2V), and video continuation, handling both single‑stream and multi‑stream audio inputs. It covers use cases such as news broadcasting, performances, singing, e‑commerce marketing, multi‑person dialogue, animated characters, and animal avatars.
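
The way these modes relate to the inputs supplied can be sketched as follows. `AvatarRequest` and `select_task` are hypothetical names invented for this illustration and are not the project's actual API; the precedence given to video continuation is likewise an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AvatarRequest:
    """Hypothetical request container; not the project's real API."""
    audio_paths: List[str]                 # one entry per speaker stream
    prompt: str
    reference_image: Optional[str] = None  # supplied for ATI2V
    previous_clip: Optional[str] = None    # supplied for video continuation


def select_task(req: AvatarRequest) -> str:
    """Pick a task label from which optional inputs are present."""
    if not req.audio_paths:
        raise ValueError("at least one audio stream is required")
    if req.previous_clip is not None:
        task = "continuation"
    elif req.reference_image is not None:
        task = "ATI2V"
    else:
        task = "AT2V"
    streams = "single-stream" if len(req.audio_paths) == 1 else "multi-stream"
    return f"{task}/{streams}"
```

For example, a request with two audio files and a reference image would be labeled `ATI2V/multi-stream`, which corresponds to the multi‑person dialogue use case.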

Style generalization

The model maintains performance not only on real‑person footage but also on anime, animal, multi‑person interaction, and handheld‑object scenarios, eliminating the need to train separate models for different styles and further reducing deployment overhead.

Comprehensive evaluation

Using the EvalTalker benchmark, the authors evaluated the model across news, education, entertainment, and commercial scenes, varying audio speed, emotion, number of participants, pose, and occlusion. A total of 770 evaluators provided 13,240 subjective scores, complemented by structured quality analysis from ten domain experts.

Radar‑chart dominance

On a radar chart covering physical realism, temporal stability, identity consistency, and audio‑video coordination, LongCat‑Video‑Avatar 1.5 encloses the largest area of all compared models, showing balanced superiority without obvious weaknesses.
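
For readers unfamiliar with the metric, the area enclosed on a radar chart with equally spaced axes can be computed as a sum of triangles between neighbouring axes. The sketch below uses made‑up scores; the source does not publish the per‑axis values or state exactly how the area was calculated.

```python
import math


def radar_area(scores):
    """Area of the polygon traced by scores on equally spaced radar axes."""
    n = len(scores)
    if n < 3:
        raise ValueError("a radar polygon needs at least three axes")
    wedge = math.sin(2 * math.pi / n) / 2
    # Sum the triangles formed by each pair of neighbouring axes.
    return wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))


# Hypothetical scores on four axes; both models have the same total of 12.
balanced = radar_area([3, 3, 3, 3])
uneven = radar_area([4, 2, 4, 2])
```

Here `balanced` exceeds `uneven` even though the totals match, which is why a large radar area signals strength across all dimensions rather than a high score on only a few.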

Head‑to‑head win rates

Against Kling Avatar 2.0, LongCat‑Video‑Avatar 1.5 achieves a 65.9% win rate; against OmniHuman‑1.5, 61.1%; and against HeyGen, 54.3%. All three competitors are commercial systems currently available on the market.
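
A win rate of this kind is typically derived from pairwise preference votes. The following is a minimal sketch of that calculation; the vote labels and the tie‑handling options are assumptions, as the source does not say how ties were treated or how many comparisons each pairing received.

```python
def win_rate(votes, ties_as_half=True):
    """Share of pairwise comparisons won by model A.

    votes: iterable of "A", "B", or "tie" labels, one per comparison.
    """
    votes = list(votes)
    if not votes:
        raise ValueError("no votes supplied")
    wins = votes.count("A")
    ties = votes.count("tie")
    if ties_as_half:
        # Each tie counts as half a win for either side.
        return (wins + 0.5 * ties) / len(votes)
    decided = len(votes) - ties
    if decided == 0:
        raise ValueError("every comparison was a tie")
    return wins / decided


# Hypothetical tally: 6 wins, 3 losses, 1 tie.
sample = ["A"] * 6 + ["B"] * 3 + ["tie"]
```

Since 50% means parity, the 54.3% result against HeyGen is the narrowest of the three margins reported.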

Single‑person vs. multi‑person performance

In single‑person scenarios, the model scores 3.336, markedly higher than HeyGen and OmniHuman‑1.5, indicating strong naturalness and realism. In multi‑person scenes, it scores 2.730, a lead of 0.391 points over InfiniteTalk’s 2.339, primarily due to better speaker‑listener distinction.

Hard‑metric error rates

Visual error rates are 23.1% for subject deformation (lowest among peers), 9.4% for background deformation, and 0.8% for frame drops (best). Audio‑video coordination error rates are 5.1% for facial‑body sync and 29.8% for lip‑sync. Both are lower than those of competing models, and the lip‑sync figure is reported as the current industry low.

Conclusion

LongCat‑Video‑Avatar 1.5 delivers state‑of‑the‑art naturalness in single‑person videos while also excelling in multi‑person interaction, long‑sequence stability, physical plausibility, and audio‑video coordination, all with significantly reduced inference cost.
