Artificial Intelligence 10 min read

Text-Video Alignment Algorithm for Automated Short Video Production at Youku

Youku’s new text‑video alignment system automatically generates short video summaries by extracting multimodal video and linguistic features, matching sentences to clips through embedding and tag‑level models, and enabling AI‑driven auto‑editing that cuts production time from days to minutes.

Youku Technology

This article presents Youku's research on automated short video production through text-video alignment algorithms. As video consumption trends toward shorter formats due to fragmented user attention, Youku leverages its extensive video library to automatically generate short video summaries.

Related Research: The academic community addresses this problem as "text-video alignment": aligning video scripts with video shots based on the similarity between text sentences and video segments. It involves two tasks: computing text-video segment similarity, and aligning the text sequence with the video sequence. Unlike video-text grounding, text-video alignment is insensitive to segment boundaries; unlike video-text retrieval, it operates within a single video, where temporal order carries information.
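The second task, aligning a text sequence with a video sequence while preserving temporal order, can be sketched as a small dynamic program. This is a minimal illustration, not Youku's actual alignment model: it assumes a precomputed sentence-by-segment similarity matrix and assigns each sentence to one segment without moving backward in time.

```python
import numpy as np

def align_monotonic(sim: np.ndarray) -> list:
    """Align sentences (rows) to video segments (columns) so matched
    pairs preserve temporal order, maximizing total similarity.
    Each sentence is assigned exactly one segment; assignments are
    strictly increasing in segment index."""
    n, m = sim.shape
    # dp[i][j]: best score aligning first i sentences using segments up to j
    dp = np.full((n + 1, m + 1), -np.inf)
    dp[0, :] = 0.0
    back = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            take = dp[i - 1, j - 1] + sim[i - 1, j - 1]  # sentence i uses segment j
            skip = dp[i, j - 1]                           # segment j goes unused
            if take >= skip:
                dp[i, j], back[i, j] = take, 1
            else:
                dp[i, j], back[i, j] = skip, 0
    # Backtrack to recover the (sentence, segment) pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if back[i, j] == 1:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        else:
            j -= 1
    return pairs[::-1]
```

On a toy matrix where sentence 0 best matches segment 0 and sentence 1 best matches segment 2, the result is `[(0, 0), (1, 2)]`; the monotonic constraint is what distinguishes this from plain per-sentence retrieval.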

Previous approaches considered only single-modal features. Article [1] proposed a similarity calculation framework incorporating all modal features (optical flow, face, audio) with flexibility to extend to more modalities and handle missing modalities. Article [2] abstracted cross-modal matching as operations on video and text sequence stacks, using LSTM to model sequences and predicting stack top operations for matching. Article [3] added information filtering modules and inter-modal fusion channels for video-text retrieval. Article [4] applied graph neural networks to extract multi-level features from text and video modalities for intra-modal fusion.

Algorithm Framework: The system consists of video feature extraction, text feature extraction, cross-modal matching, and text matching components.

Feature Design:

Video Features: Video structured processing extracts key information through intelligent image analysis and generates semantic text descriptions.

Text Features: These include text classification, Named Entity Recognition (NER), coreference resolution, and dependency analysis. Text classification assigns weights to matching strategies: descriptive text uses person/scene/behavior embedding matching, while dialogue uses OCR text matching. NER extracts entities such as persons, actions, and scenes using BERT models pre-trained on large Chinese corpora and fine-tuned on annotated data. Coreference resolution handles pronoun references (e.g., "he" in "Chen Yongren heard that Han Chen had new drugs, so he quickly passed this information to Huang Zhicheng"). Dependency analysis extracts the subject, predicate (action), and object as the main sentence components, discarding modifiers that interfere with matching.
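The dependency step above reduces a sentence to its subject-predicate-object core. A minimal sketch, assuming the parser's output is already available as (token, dependency-label) pairs with Universal Dependencies-style label names; the actual parser and label set Youku uses are not specified in the article:

```python
def extract_spo(parse):
    """Pick out subject, predicate, and object from a dependency parse,
    discarding modifiers (adverbials, determiners, etc.) that the
    article notes interfere with matching. `parse` is a list of
    (token, dep_label) pairs in sentence order; UD-style labels
    are an assumption about the parser."""
    subj = pred = obj = None
    for token, dep in parse:
        if dep == "nsubj" and subj is None:
            subj = token
        elif dep == "root" and pred is None:
            pred = token
        elif dep in ("dobj", "obj") and obj is None:
            obj = token
    return subj, pred, obj

# Illustrative parse of "Chen Yongren quickly passed the information":
parse = [
    ("Chen Yongren", "nsubj"),
    ("quickly", "advmod"),   # modifier, discarded
    ("passed", "root"),
    ("the", "det"),          # modifier, discarded
    ("information", "dobj"),
]
```

Running `extract_spo(parse)` yields `("Chen Yongren", "passed", "information")`, i.e. only the components that drive cross-modal matching survive.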

Cross-Modal Matching: Addresses aligning text sentences with video segments through multi-level matching at embedding level and tag level. Embedding level trains semantic embedding models for text and video, computing embeddings for each sentence and video segment, then learning matching relationships with neural networks. Tag level uses entity labels (e.g., person names) to filter non-matching segments.
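The two matching levels compose naturally: tag-level filtering first discards segments that cannot match (e.g. the named person never appears in the shot), then embedding-level similarity ranks the survivors. A hedged sketch with illustrative names and shapes, not Youku's actual API; the learned matching network is replaced here by plain cosine similarity:

```python
import numpy as np

def match_sentence(sent_emb, seg_embs, sent_entities, seg_tags):
    """Two-stage matching: tag-level filtering, then embedding-level
    cosine similarity. Returns the index of the best segment, or -1
    if every segment was filtered out."""
    scores = []
    for emb, tags in zip(seg_embs, seg_tags):
        # Tag level: drop segments missing every entity the sentence names.
        if sent_entities and not sent_entities & tags:
            scores.append(-np.inf)
            continue
        # Embedding level: cosine similarity between sentence and segment.
        cos = np.dot(sent_emb, emb) / (np.linalg.norm(sent_emb) * np.linalg.norm(emb))
        scores.append(cos)
    if not scores:
        return -1
    best = int(np.argmax(scores))
    return best if scores[best] != -np.inf else -1
```

Note how a segment with a higher raw embedding score can still lose to one whose tags contain the required person name, which is exactly the filtering role the tag level plays.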

Text Matching: Handles both short-phrase and sentence-level matching using word vectors trained on 8 million Chinese words. For phrase matching, direct word vector similarity is used. For sentence matching, weighted average of word vectors represents the sentence. Cosine similarity between average word embeddings measures semantic distance, with Word Mover's Distance used for more challenging cases.
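The sentence-level path above can be sketched in a few lines. The toy word vectors below stand in for the 8-million-word Chinese embeddings the article mentions (and use English tokens for readability); the weighting scheme and fallback thresholds are assumptions:

```python
import numpy as np

# Toy 3-d word vectors standing in for trained Chinese embeddings.
VECS = {
    "car":    np.array([0.9, 0.1, 0.0]),
    "drives": np.array([0.1, 0.8, 0.1]),
    "fast":   np.array([0.0, 0.2, 0.9]),
    "auto":   np.array([0.85, 0.15, 0.0]),
    "moves":  np.array([0.15, 0.75, 0.1]),
    "slowly": np.array([0.0, 0.3, 0.8]),
}

def sentence_vec(words, weights=None):
    """Weighted average of word vectors (uniform weights by default)."""
    ws = weights or [1.0] * len(words)
    return sum(w * VECS[t] for w, t in zip(ws, words)) / sum(ws)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(sentence_vec(["car", "drives", "fast"]),
             sentence_vec(["auto", "moves", "slowly"]))
```

For the harder cases the article mentions, Word Mover's Distance can replace the averaged-vector cosine; gensim, for instance, exposes it as `KeyedVectors.wmdistance` on a loaded embedding model.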

Applications: AI auto-editing enables fully or semi-automated video editing for batch production, improving content production efficiency and short video distribution. Youku has applied AI capabilities to bullet comment extraction, video understanding tags, episode summaries, intelligent cover images, and video speed commentary. The system builds a "machine production + human review + advertisement generation" pipeline, compressing production time from days to minutes.

NLP, video understanding, BERT, video retrieval, multi-modal learning, cross-modal matching, text-video alignment