How Meituan’s Open‑Source Avatar Redefines Digital Human Voice‑Over Costs (Beyond HeyGen)

LongCat‑Video‑Avatar‑1.5 is Meituan’s open‑source audio‑driven video generation model. Version 1.5 upgrades the audio encoder, improves stability, adds multi‑character support, and introduces 8‑step distillation. This article walks through the workflow and benchmark evaluation, examines the impact on designers, operators, e‑commerce, and marketing, and highlights deployment and compliance challenges.

Design Hub

May 25, 2026

Clarifying the Project

LongCat‑Video‑Avatar‑1.5 is Meituan LongCat’s open‑source audio‑driven digital‑human video generation model, released under the MIT license with English and Chinese support. The model is tagged on Hugging Face as audio-text-to-video, audio-image-text-to-video, audio-driven-video-continuation, avatar, and video-generation.

The official description defines it as an audio‑driven human video generation framework built on the LongCat‑Video foundation model, supporting three native tasks:

AT2V: Audio + Text to Video

ATI2V: Audio + Text + Image to Video

Video Continuation: Long‑form video continuation

Thus, it is more than a simple lip‑sync tool; it integrates reference images, audio, textual prompts, and a video generation model into a single pipeline.
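One way to picture the three tasks is as a dispatch on which inputs are present. The sketch below is illustrative only: `AvatarRequest`, its field names, and `select_task` are hypothetical and are not part of the LongCat codebase. The only detail taken from the source is the set of three task names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AvatarRequest:
    """Hypothetical request container; field names are illustrative only."""
    audio_path: str                          # driving audio (always required)
    prompt: str                              # text prompt
    image_path: Optional[str] = None         # reference image, if any
    prior_video_path: Optional[str] = None   # existing clip to extend, if any


def select_task(req: AvatarRequest) -> str:
    """Return the task name implied by the inputs that are present."""
    if req.prior_video_path is not None:
        return "Video Continuation"
    if req.image_path is not None:
        return "ATI2V"   # audio + text + image
    return "AT2V"        # audio + text only


print(select_task(AvatarRequest("voice.wav", "a presenter in an office")))
print(select_task(AvatarRequest("voice.wav", "a presenter", image_path="host.png")))
```

The point of the dispatch is that a reference image is optional: the same audio and prompt can drive either a freshly generated character or a fixed, supplied one.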

Model Upgrades in Version 1.5

The model card highlights four major upgrades:

The audio encoder switches from Wav2Vec2 to Whisper‑Large, delivering smoother and more natural lip movements.

Emphasis on production‑ready stability, ensuring accurate lip‑sync, full‑body temporal consistency, and identity preservation over longer videos.

Generalization to more complex scenes such as anime, animals, multi‑character interaction, and object interaction.

Adoption of DMD‑2‑based step distillation, compressing inference to 8 NFE to make the model more service‑ready and batch‑friendly.

Demo GIFs illustrate these directions.

The single‑voice‑over demo mimics the style of knowledge‑sharing accounts, focusing on natural facial, body, and subtle hand gestures over time.

Singing and emotional performance are harder than plain speech because the mouth opens wider and facial expressions are richer; stable results here suggest that ordinary voice‑over will be stable as well.

The stage‑singing demo tests “content atmosphere”, checking whether lighting, body posture, microphone, and background jointly serve the feel of the performance.

The office‑teaching demo is especially useful for operations and course teams, as it automates the production chain for typical knowledge‑type content such as PPT walkthroughs, product demos, and tool tutorials.

Comparison with HeyGen, Kling, OmniHuman

The project page lists HeyGen, Kling Avatar 2.0, and OmniHuman‑1.5 as commercial counterparts, focusing on stability, consistency, and natural lip motion. The author cautions against framing the open‑source release as a direct “knock‑out” of these closed‑source tools because commercial products bundle end‑to‑end services (upload, templates, review, collaboration, licensing, cloud compute, batch export, API, support, SLA).

LongCat‑Video‑Avatar‑1.5 instead lowers the “base price” of model capability, allowing engineering‑savvy teams to integrate the core generation ability into their own content pipelines, e.g., automatic script generation, TTS, video synthesis, batch editing, channel distribution, and data‑driven A/B testing.

HF Space Demo – Product Form

The Hugging Face Space runs on Gradio (status: Running on Zero) and shows a demo titled “LongCat‑Video‑Avatar 1.5: Audio‑Image‑to‑Video”. Input fields include Reference image, Driving audio, Prompt, Resolution (480p / 720p), Seed, Audio preprocessing, and Acceleration.

The demo provides three reference images: a cartoon character, an orc warrior, and a realistic person. Together they indicate that a single generation pipeline can ingest real people, cartoon avatars, and game or virtual IP characters.

Official Human Evaluation

The model card describes an audio‑driven digital‑human benchmark covering six application scenarios (News, Education, Daily Life, Entertainment, Singing, Commercial Promotion), two languages (Chinese, English), two visual styles (Realistic, Animated), 508 image‑audio pairs, 770 crowd‑sourced raters, 13,240 human‑likeness scores, and ten domain experts evaluating physical rationality, harmony, temporal stability, and identity consistency.
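The benchmark dimensions can be made concrete with a little arithmetic. The snippet below enumerates the scenario, language, and style combinations and computes two simple ratios from the reported totals. The even spread of pairs across cells is an assumption for illustration; the source does not say how pairs or scores were actually distributed.

```python
from itertools import product

# Dimensions as listed in the benchmark description.
SCENARIOS = ["News", "Education", "Daily Life", "Entertainment",
             "Singing", "Commercial Promotion"]
LANGUAGES = ["Chinese", "English"]
STYLES = ["Realistic", "Animated"]

TOTAL_PAIRS = 508        # image-audio pairs
TOTAL_SCORES = 13_240    # human-likeness scores

# Every (scenario, language, style) combination is one evaluation cell.
cells = list(product(SCENARIOS, LANGUAGES, STYLES))
print(len(cells))  # 24

# Assumption: pairs spread evenly over cells (not stated in the source).
pairs_per_cell = TOTAL_PAIRS / len(cells)
print(round(pairs_per_cell, 1))  # 21.2

# Plain ratio of reported scores to reported pairs.
scores_per_pair = TOTAL_SCORES / TOTAL_PAIRS
print(round(scores_per_pair, 1))  # 26.1
```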

The evaluation shows that the model aims to meet multiple quality dimensions simultaneously: accurate lip‑sync, stable identity, coherent body motion, non‑rigid expressions, multilingual friendliness, and consistent quality over longer video durations.

Impact on Designers

Designers should treat the tool as neither a replacement nor a toy. The model lowers the barrier to “making a character speak”, but it does not answer why a character is worth watching. The workflow moves designers away from repetitive motion‑design tasks and toward higher‑level responsibilities.

What designers no longer need to do:

Create motion from scratch for each video.

Manually adjust lip‑sync and basic expressions.

Produce separate visual assets for each language.

Arrange on‑set filming for every voice‑over.

New responsibilities include brand avatar definition, multi‑channel persona consistency, camera‑language control, visual‑style direction, reusable template creation, content‑type suitability assessment (deciding when a virtual human is appropriate), and managing the credibility of AI‑generated content, including the boundary between realism and brand trust.

Impact on Operations and E‑Commerce

E‑commerce benefits most because it deals with massive SKU catalogs, short product cycles, multiple platforms, and multilingual needs. Traditional video production involves models, locations, lighting, editing, subtitles, and review, which remain bottlenecks even with commercial digital‑human tools.

Open‑source avatars enable a modular pipeline:

Extract product selling points from the detail page.

Use LLMs to generate three script styles.

Generate Chinese, English, and Japanese TTS audio.

Create voice‑over videos with LongCat using a fixed brand avatar.

Add subtitles, price tags, and selling‑point overlays through automated editing.

Deploy and collect click‑through and conversion data.

Retain high‑performing scripts and discard low‑performing ones.

This transforms SKU video production from a per‑item effort to a scalable “production line” capable of generating dozens of variants per product, enabling dense A/B testing across audiences and platforms.
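The planning and selection ends of this production line can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: `Variant`, `plan_variants`, `keep_top`, the style names, and the click‑through numbers are all hypothetical, and the generation stages in between (script writing, TTS, LongCat rendering, editing) are left out.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Variant:
    """One video job: a product, a script style, and a language."""
    sku: str
    style: str
    language: str


def plan_variants(sku, styles, languages):
    """Enumerate one video job per (script style, language) pair."""
    return [Variant(sku, s, lang) for s, lang in product(styles, languages)]


def keep_top(ctr_by_variant, keep_fraction=0.5):
    """Retain the best-performing variants, ranked by click-through rate."""
    ranked = sorted(ctr_by_variant, key=ctr_by_variant.get, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Three script styles in three languages give nine variants for one SKU.
variants = plan_variants("SKU-001",
                         ["benefit", "story", "comparison"],
                         ["zh", "en", "ja"])
print(len(variants))  # 9

# Hypothetical click-through rates collected after deployment.
ctr = {variants[0]: 0.05, variants[1]: 0.01,
       variants[2]: 0.03, variants[3]: 0.02}
winners = keep_top(ctr)
print([(v.style, v.language) for v in winners])
```

Adding platforms or audiences as further dimensions multiplies the variant count in the same way, which is where the “dozens of variants per product” come from.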

Impact on Marketing and Multilingual Localization

The model’s explicit support for English and Chinese (with potential for other languages) suggests that digital‑human marketing will naturally become multilingual. Previously, multi‑language video required re‑shooting with local actors, re‑recording audio, and re‑editing, making it expensive.

With Avatar + TTS + automated editing, the workflow becomes:

Same persona speaking different languages.

Same product with varied scripts.

Same visual template adapted to different platforms.

Consistent brand IP across markets.

This is valuable for cross‑border e‑commerce, online education, SaaS products, and game publishing. However, as generation costs drop, the real cost shifts to establishing credibility: trustworthy personas, natural scripts, brand‑aligned visuals, concrete information, and knowing when not to use digital humans.

Practical Deployment Barriers

Open‑source does not mean zero friction. The Quick‑Start guide requires cloning the GitHub repo, creating a Python 3.10 conda environment, installing a CUDA‑compatible build of Torch 2.6.0 along with FlashAttention, librosa, ffmpeg, and the avatar requirements, and then downloading the LongCat‑Video and LongCat‑Video‑Avatar‑1.5 weights.

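# Clone the repository (main branch only)
git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Video
cd LongCat-Video
# Create and activate a Python 3.10 environment
conda create -n longcat-video python=3.10
conda activate longcat-video
# Install Torch 2.6.0, FlashAttention, librosa, ffmpeg, and the avatar requirements here
# (the exact install commands are in the Quick-Start guide and are not reproduced in this article)
# Download the base model weights and the Avatar 1.5 weights
huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video
huggingface-cli download meituan-longcat/LongCat-Video-Avatar-1.5 --local-dir ./weights/LongCat-Video-Avatar-1.5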

Single‑person inference example:

torchrun --nproc_per_node=2 run_demo_avatar_single_audio_to_video.py \
  --context_parallel_size=2 \
  --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 \
  --stage_1=ai2v \
  --input_json=assets/avatar/single_example_1.json \
  --use_distill --model_type avatar-v1.5 --use_int8

Multi‑person dialogue example:

torchrun --nproc_per_node=2 run_demo_avatar_multi_audio_to_video.py \
  --context_parallel_size=2 \
  --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 \
  --input_json=assets/avatar/multi_example_1.json \
  --use_distill --model_type avatar-v1.5 --use_int8

Key command‑line flags:

--model_type avatar-v1.5 selects the version 1.5 model, which uses the Whisper‑large‑v3 audio encoder.

--use_distill is mandatory when running the distilled model.

--use_int8 reduces VRAM usage but works only with avatar‑v1.5.
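For batch use, the single‑person command above can be assembled programmatically. This is a minimal sketch: `build_single_command` is a hypothetical helper, it uses only the script name and flags shown in the examples, and it assumes `--context_parallel_size` should match `--nproc_per_node`, as it does in both examples.

```python
import shlex


def build_single_command(input_json,
                         checkpoint_dir="./weights/LongCat-Video-Avatar-1.5",
                         num_gpus=2,
                         model_type="avatar-v1.5",
                         use_int8=True):
    """Build the torchrun argument list for single-person inference."""
    # The article states that --use_int8 only works with avatar-v1.5.
    if use_int8 and model_type != "avatar-v1.5":
        raise ValueError("--use_int8 is only supported with avatar-v1.5")
    cmd = [
        "torchrun", f"--nproc_per_node={num_gpus}",
        "run_demo_avatar_single_audio_to_video.py",
        # Assumption: context parallel size equals the GPU count.
        f"--context_parallel_size={num_gpus}",
        f"--checkpoint_dir={checkpoint_dir}",
        "--stage_1=ai2v",
        f"--input_json={input_json}",
        "--use_distill",              # required for the distilled model
        "--model_type", model_type,
    ]
    if use_int8:
        cmd.append("--use_int8")
    return cmd


cmd = build_single_command("assets/avatar/single_example_1.json")
print(shlex.join(cmd))  # shell-safe string for logging or copy-paste
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues; the helper itself does not launch anything.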

Both 480p and 720p resolutions are supported. Batch deployment adds engineering concerns around speed, memory, and queue management.
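Queue planning can start from a back‑of‑the‑envelope estimate. The sketch below assumes each job occupies a fixed number of GPUs (two in the examples above) and takes a fixed wall‑clock time. `hours_to_drain` and every number passed to it are hypothetical, since the source gives no timing figures.

```python
import math


def hours_to_drain(queue_len, total_gpus, gpus_per_job, seconds_per_job):
    """Estimate hours needed to finish a queue of equally sized jobs."""
    slots = total_gpus // gpus_per_job          # jobs that can run at once
    if slots < 1:
        raise ValueError("not enough GPUs for a single job")
    waves = math.ceil(queue_len / slots)        # sequential rounds of jobs
    return waves * seconds_per_job / 3600


# Hypothetical: 90 queued videos, 8 GPUs, 2 GPUs per job, 10 minutes per job.
estimate = hours_to_drain(queue_len=90, total_gpus=8,
                          gpus_per_job=2, seconds_per_job=600)
print(f"{estimate:.2f} hours")
```

Real throughput will also depend on clip length, resolution, and memory headroom, so an estimate like this is only a starting point for sizing a queue.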

The project’s Ethical Considerations section notes that some source images and audio come from real videos and are for research demonstration only; commercial use is not permitted without proper clearance. Although the model weights are MIT‑licensed, users must still respect image and voice copyrights and account for AI‑generated content labeling, platform deep‑fake policies, brand‑personality liability, and potential consumer deception.

Strategic Outlook

The author predicts that LongCat‑Video‑Avatar‑1.5 will first replace low‑trust, high‑repetition, batch‑oriented video content such as product selling‑point explanations, store promotions, cross‑border multilingual introductions, educational snippets, tool tutorials, SaaS update videos, brand‑IP daily content, and post‑live‑stream recaps.

These use cases share the need for stability, large volume, rapid updates, and low reliance on unique human presence. The model does not eliminate creative work; instead, it automates repetitive on‑camera labor, shifting designers toward sustainable avatar system design and operators toward end‑to‑end script testing, asset management, channel distribution, and data‑driven iteration.

Ultimately, as generation becomes cheaper, the scarcity moves to aesthetic quality, compelling scripts, credible personas, distribution strategy, and the judgment of when AI should or should not be used.

In summary, LongCat‑Video‑Avatar‑1.5 signals a transition from “buying a tool” to “building a production line” for digital humans. The tools will keep getting cheaper, but the production pipeline and human judgment will become increasingly valuable.
