Meituan’s Open‑Source Digital Human Model Delivers Real‑World Performance Across MV, E‑Commerce, and More
Meituan’s LongCat‑Video‑Avatar 1.5 replaces its audio encoder with Whisper‑Large, cuts inference to eight steps, and, after a 770‑person, 13,240‑rating evaluation, outperforms competing models in lip‑sync, style generalization, multi‑person scenes, and overall visual fidelity.
LongCat‑Video‑Avatar 1.5, the latest release of Meituan’s open‑source digital‑human video generation framework, swaps the original Wav2Vec2 audio encoder for Whisper‑Large and cuts sampling to eight function evaluations (NFE, the number of function evaluations, i.e. eight denoiser calls per clip) using Distribution Matching Distillation 2 (DMD2).
Audio encoder upgrade and lip‑sync improvement
Whisper‑Large, a widely used speech‑recognition model, captures finer‑grained temporal audio information than the previous encoder, which the release credits for noticeably smoother and more natural lip movements. This targets a common failure mode in which lip‑sync mismatches break viewer immersion.
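One way to picture how audio features drive lip motion is to map each video frame to the encoder positions that cover the same time span. The sketch below assumes the standard Whisper encoder rate of 50 positions per second (1500 positions per 30‑second window) and a hypothetical 25 fps video. The source does not state the model's frame rate or its actual alignment scheme, and `audio_feature_span` is an illustrative helper, not part of the project.

```python
# Whisper encoders emit 1500 positions per 30 s window, i.e. 50 per second.
WHISPER_ENCODER_HZ = 50.0


def audio_feature_span(frame_index: int,
                       video_fps: float,
                       feature_hz: float = WHISPER_ENCODER_HZ) -> range:
    """Return the encoder positions overlapping one video frame.

    video_fps is an assumption supplied by the caller; the article does not
    specify the frame rate used by LongCat-Video-Avatar.
    """
    if video_fps <= 0 or feature_hz <= 0:
        raise ValueError("rates must be positive")
    start = int(frame_index * feature_hz / video_fps)
    end = int((frame_index + 1) * feature_hz / video_fps)
    # Guarantee at least one position even when video_fps > feature_hz.
    return range(start, max(end, start + 1))


# At an assumed 25 fps, every frame is covered by two encoder positions.
first_frame = list(audio_feature_span(0, video_fps=25.0))
fourth_frame = list(audio_feature_span(3, video_fps=25.0))
```

At the assumed 25 fps each frame sees two audio positions, so the quality of each position matters directly for per‑frame mouth shapes.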
Inference acceleration
By distilling the denoising process from dozens of steps down to eight, the framework dramatically lowers computational cost while preserving visual fidelity, making server‑side deployment more flexible and cost‑effective.
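To make the step count concrete, here is a toy few‑step sampling loop that counts one function evaluation per step. It is a generic deterministic sampler under simple assumptions (a noise level that runs from 1.0 down to 0.0 and a stand‑in denoiser), not the DMD2 distillation procedure or the project's sampler; `sample_few_step` and `toy_denoiser` are hypothetical names.

```python
from typing import Callable, List, Tuple


def sample_few_step(denoise: Callable[[float, float], float],
                    x: float,
                    noise_levels: List[float]) -> Tuple[float, int]:
    """Run a deterministic few-step sampler and count denoiser calls (NFE)."""
    nfe = 0
    for t_cur, t_next in zip(noise_levels[:-1], noise_levels[1:]):
        x0_pred = denoise(x, t_cur)  # one function evaluation
        nfe += 1
        # Keep the predicted clean sample and rescale the residual noise
        # to the next (lower) noise level.
        x = x0_pred + (t_next / t_cur) * (x - x0_pred)
    return x, nfe


def toy_denoiser(x: float, t: float) -> float:
    """Stand-in for the video model: always predicts a clean value of 0.0."""
    return 0.0


# Nine noise levels from 1.0 to 0.0 give eight steps, so NFE = 8.
levels = [1.0 - i / 8 for i in range(9)]
sample, nfe = sample_few_step(toy_denoiser, x=1.0, noise_levels=levels)
```

Because each evaluation is a full forward pass of the video model, the sampling cost scales roughly with `nfe`, which is why cutting the step count lowers serving cost.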
Supported tasks and scenarios
Version 1.5 natively supports Audio‑Text‑to‑Video (AT2V), Audio‑Text‑Image‑to‑Video (ATI2V), and video continuation, handling both single‑stream and multi‑stream audio inputs. It covers use cases such as news broadcasting, performances, singing, e‑commerce marketing, multi‑person dialogue, animated characters, and animal avatars.
Style generalization
The model maintains performance not only on real‑person footage but also on anime, animal, multi‑person interaction, and handheld‑object scenarios, eliminating the need to train separate models for different styles and further reducing deployment overhead.
Comprehensive evaluation
Using the EvalTalker benchmark, the authors evaluated the model across news, education, entertainment, and commercial scenes, varying audio speed, emotion, number of participants, pose, and occlusion. A total of 770 evaluators provided 13,240 subjective scores, complemented by structured quality analysis from ten domain experts.
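As a quick sanity check on the scale of the study, the two reported totals imply the average workload per evaluator. The snippet only divides the figures given above; how ratings were actually distributed across evaluators is not stated.

```python
evaluators = 770
ratings = 13_240

# Average number of subjective scores contributed per evaluator.
ratings_per_evaluator = ratings / evaluators
print(f"{ratings_per_evaluator:.1f} ratings per evaluator on average")
```

That works out to roughly 17 ratings per person.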
Radar‑chart dominance
The radar‑chart area, covering physical realism, temporal stability, identity consistency, and audio‑video coordination, places LongCat‑Video‑Avatar 1.5 at the top of all compared models, showing balanced superiority without obvious weaknesses.
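Radar‑chart area rewards balance as well as high scores. A minimal sketch of the usual polygon‑area formula for equally spaced axes is shown below; the source does not say how the area was computed, and the scores used here are made‑up values for illustration, not results from the evaluation.

```python
import math
from typing import Sequence


def radar_area(scores: Sequence[float]) -> float:
    """Area of a radar-chart polygon with equally spaced axes.

    Each pair of neighbouring axes spans a triangle with area
    0.5 * r_i * r_{i+1} * sin(2*pi / n).
    """
    n = len(scores)
    if n < 3:
        raise ValueError("a radar polygon needs at least three axes")
    wedge = 0.5 * math.sin(2 * math.pi / n)
    return wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))


# Hypothetical scores on four axes; both profiles have the same mean of 3.
balanced = radar_area([3.0, 3.0, 3.0, 3.0])
uneven = radar_area([4.0, 2.0, 4.0, 2.0])
```

With the same mean score, the balanced profile covers a larger area than the uneven one, which is why a large area suggests there is no weak dimension. Note that the area also depends on the order in which the axes are drawn.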
Head‑to‑head win rates
Against Kling Avatar 2.0, LongCat‑Video‑Avatar 1.5 achieves a 65.9% win rate; against OmniHuman‑1.5, 61.1%; and against HeyGen, 54.3%. All three competitors are commercial systems currently available on the market.
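A head‑to‑head win rate is typically the share of pairwise comparisons in which one model is preferred. The sketch below shows one common way to compute it; the counts are hypothetical, and splitting ties evenly is an assumption, since the source does not describe its tie handling or the number of comparisons per pair.

```python
def win_rate(wins: int, losses: int, ties: int = 0) -> float:
    """Share of pairwise comparisons won, counting each tie as half a win."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no comparisons recorded")
    return (wins + 0.5 * ties) / total


# Hypothetical counts that would reproduce a 65.9% win rate.
example_rate = win_rate(wins=659, losses=341)

# Margin over an even 50/50 split, in percentage points.
margin_points = (example_rate - 0.5) * 100
```

Under this definition 50% means parity, so the reported 54.3% against HeyGen is a much narrower lead than the 65.9% against Kling Avatar 2.0.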
Single‑person vs. multi‑person performance
In single‑person scenarios, the model scores 3.336, markedly higher than HeyGen and OmniHuman‑1.5, indicating strong naturalness and realism. In multi‑person scenes, it scores 2.730, substantially ahead of InfiniteTalk’s 2.339, primarily due to better speaker‑listener distinction.
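The multi‑person gap can be expressed in relative terms using only the two scores quoted above. The rating scale itself is not stated in the source, so the percentage is relative to InfiniteTalk's score rather than to the scale maximum.

```python
longcat_multi = 2.730
infinitetalk_multi = 2.339

# Absolute and relative lead in the multi-person setting.
abs_gap = longcat_multi - infinitetalk_multi
rel_gap_percent = abs_gap / infinitetalk_multi * 100
print(f"gap: {abs_gap:.3f} points ({rel_gap_percent:.1f}% above InfiniteTalk)")
```

The lead is about 0.39 points, or roughly 17% relative to InfiniteTalk's score.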
Hard‑metric error rates
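On structural errors, the model's subject deformation rate is 23.1% (lowest among the compared models), its background deformation rate is 9.4%, and its frame‑drop rate is 0.8% (best). On audio‑video coordination, facial‑body sync errors occur in 5.1% of cases and lip‑sync errors in 29.8%. Both figures are lower than those of the competing models, and the lip‑sync error rate is described as the lowest currently reported in the industry.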
Conclusion
LongCat‑Video‑Avatar 1.5 delivers state‑of‑the‑art naturalness in single‑person videos while also excelling in multi‑person interaction, long‑sequence stability, physical plausibility, and audio‑video coordination, all with significantly reduced inference cost.
SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.
