Exploring Intelligent Production at Youku: AI‑Driven Video Analysis and Automation
The talk describes Youku's intelligent production platform, which uses AI and cloud computing to automatically analyze video frames and extract fine‑grained metadata such as scenes, persons, actions, and scores. That metadata drives the generation of highlights, vertical clips, and annotations for editors, as well as feedback for upstream producers. The talk also covers challenges such as pose tracking and graph‑based action classification, and closes with plans for deeper video understanding and open competitions.
Preface: This article is a written version of the talk "Exploring Intelligent Production at Youku" delivered by a senior algorithm engineer from Alibaba at the Youku Technology Salon.
We define intelligent production as the use of artificial intelligence and cloud computing to analyze video content, generate rich metadata, and then use that metadata to assist editors in video creation as well as to model video quality for feedback to upstream production.
The video metadata includes fine‑grained information such as timestamps of scenes, identified persons, their actions, and other detailed attributes. The goal is twofold: to provide services for downstream editors and to give feedback to upstream producers.
Traditional video editing is labor‑intensive. For example, sports highlights require manual clipping of the best moments, adjusting start/end frames, selecting titles and covers, etc. This process is time‑consuming and inefficient.
Intelligent production aims to automate these tasks using AI. The video is first structured along two dimensions:
Temporal dimension: frame‑level detection → shot detection → scene/segment identification.
Spatial dimension: locating persons within frames to enable tasks such as converting landscape videos to portrait format for mobile consumption.
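The temporal structuring above starts with detecting shot boundaries. A minimal sketch of one common approach, assuming frames arrive as NumPy arrays: compare color histograms of successive frames and mark a cut where they diverge (the threshold here is illustrative, not Youku's actual value).

```python
import numpy as np

def frame_histogram(frame, bins=16):
    """Per-channel color histogram, normalized to sum to 1."""
    hist = np.concatenate([
        np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
        for c in range(frame.shape[-1])
    ]).astype(np.float64)
    return hist / hist.sum()

def detect_shot_boundaries(frames, threshold=0.4):
    """Mark a shot cut where successive histograms diverge (L1 distance)."""
    hists = [frame_histogram(f) for f in frames]
    cuts = []
    for i in range(1, len(hists)):
        if np.abs(hists[i] - hists[i - 1]).sum() > threshold:
            cuts.append(i)
    return cuts

# Two flat-color "shots": 5 dark frames followed by 5 bright frames.
frames = [np.full((8, 8, 3), 10, np.uint8)] * 5 + \
         [np.full((8, 8, 3), 200, np.uint8)] * 5
print(detect_shot_boundaries(frames))  # → [5], a cut at frame index 5
```

Production systems typically add deep features and handle gradual transitions (fades, wipes), but the histogram difference captures the basic frame‑level → shot‑level step.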
After structuring, the metadata is stored in a database for editors and operations staff to visualize and fine‑tune, because automated results may not be perfectly accurate.
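The landscape‑to‑portrait conversion mentioned under spatial structuring can be sketched as a vertical crop centered on a detected person. This sketch assumes a person bounding box is supplied by an upstream detector; the 9:16 aspect ratio is the common mobile target, not a stated Youku parameter.

```python
def portrait_crop_window(frame_w, frame_h, person_box, aspect=9 / 16):
    """Return (x0, x1) of a vertical crop of width frame_h * aspect,
    centered on the person's box and clamped to the frame edges."""
    x, y, w, h = person_box                 # box from an upstream person detector
    crop_w = int(round(frame_h * aspect))
    center = x + w / 2
    x0 = int(round(center - crop_w / 2))
    x0 = max(0, min(x0, frame_w - crop_w))  # keep the crop inside the frame
    return x0, x0 + crop_w

# 1920x1080 landscape frame, player box near the left edge:
print(portrait_crop_window(1920, 1080, (100, 300, 80, 200)))  # → (0, 608)
```

The editor‑facing tooling described above would then let operations staff nudge this window when the automatic placement is off.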
In the sports scenario, we first locate the start and end of a match using scoreboard recognition. Image search techniques are employed to find the exact opening and closing segments of the broadcast (similar to product image search on e‑commerce platforms).
Image search consists of two steps: (1) converting an image to a vector, either by local feature aggregation or by deep‑learning‑based embedding; (2) performing vector similarity search using tools such as Facebook’s open‑source Faiss or Alibaba’s internal high‑performance vector engine. The index is small enough to run on CPU.
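The two steps above can be sketched with a brute‑force stand‑in for a vector engine like Faiss (which would be used at scale): L2‑normalize the embeddings so inner product equals cosine similarity, then return the top‑k nearest stored vectors. The embeddings here are toy vectors; in practice they come from the local‑feature or deep‑learning step.

```python
import numpy as np

class VectorIndex:
    """Brute-force stand-in for a vector engine such as Faiss:
    store L2-normalized embeddings, query by inner product."""
    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), np.float32)

    def add(self, embeddings):
        e = np.asarray(embeddings, np.float32)
        e /= np.linalg.norm(e, axis=1, keepdims=True)  # cosine via dot product
        self.vectors = np.vstack([self.vectors, e])

    def search(self, query, k=1):
        q = np.asarray(query, np.float32)
        q /= np.linalg.norm(q)
        scores = self.vectors @ q
        top = np.argsort(-scores)[:k]
        return top, scores[top]

index = VectorIndex(4)
index.add([[1, 0, 0, 0], [0, 1, 0, 0], [0.9, 0.1, 0, 0]])
ids, scores = index.search([1, 0, 0, 0], k=2)
print(ids.tolist())  # → [0, 2]
```

As the talk notes, an index over broadcast opening/closing frames is small enough that even this exhaustive CPU search is workable; Faiss adds approximate indexing for larger collections.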
For sports video, we also perform scoreboard digit detection and recognition, which is challenging because digits are tiny and may shift position between frames.
Beyond metadata extraction, we generate real‑time highlights (images or GIFs) for user engagement, and we create vertical videos for mobile viewing. We also detect and track the ball, player poses, goalposts, and field lines to support fine‑grained analysis.
Speech recognition is used to avoid cutting off commentary during editing. The overall pipeline operates at three levels: (1) frame‑level detection of people, balls, and other objects; (2) shot‑level understanding; and (3) event‑level understanding (e.g., a red‑card incident), which requires temporal context.
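One way to avoid cutting off commentary, sketched here under simple assumptions: detect low‑energy audio frames (pauses between sentences) and snap a proposed cut point to the nearest pause. The frame length and energy threshold are illustrative; a real system would use the ASR word timestamps directly.

```python
import numpy as np

def speech_pause_frames(audio, frame_len=400, energy_thresh=1e-3):
    """Indices of audio frames whose RMS energy falls below the threshold."""
    n = len(audio) // frame_len
    frames = audio[:n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return np.flatnonzero(rms < energy_thresh)

def snap_cut_to_pause(cut_frame, pauses):
    """Move a proposed cut to the nearest pause so commentary isn't clipped."""
    return int(pauses[np.argmin(np.abs(pauses - cut_frame))])

# Synthetic audio: continuous "speech" noise with a silent gap at frames 10-12.
rng = np.random.default_rng(0)
audio = rng.normal(0, 0.1, 400 * 20)
audio[400 * 10:400 * 13] = 0.0
pauses = speech_pause_frames(audio)
print(snap_cut_to_pause(9, pauses))  # → 10, the start of the silent gap
```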
Action recognition is particularly difficult in sports because camera motion differs from that in movies or variety shows. Precise start and end timestamps of actions (e.g., a shot) are required for accurate clipping.
Initially we tried a 3‑D convolutional network for video classification, but many classes were visually similar. We later incorporated pose tracking and spatio‑temporal graph convolutional networks to improve accuracy.
We evaluated state‑of‑the‑art pose estimation solutions such as AlphaPose (top‑down) and OpenPose (bottom‑up). We favor a bottom‑up approach with optimizations (field‑line masking, optical‑flow acceleration) to meet real‑time requirements.
After obtaining individual trajectories, we apply spatio‑temporal graph convolutional networks to classify actions and infer higher‑level events.
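The core operation in such spatio‑temporal graph networks is a graph convolution over the skeleton joints. A minimal single‑layer sketch, with a toy 5‑joint chain skeleton and random weights standing in for learned parameters: features are aggregated over the symmetrically normalized joint adjacency and then linearly projected.

```python
import numpy as np

def normalized_adjacency(edges, n_joints):
    """A_hat = D^{-1/2} (A + I) D^{-1/2} over the skeleton graph."""
    A = np.eye(n_joints)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def graph_conv(X, A_hat, W):
    """One spatial graph-convolution step: aggregate neighbors, project."""
    return A_hat @ X @ W

# Toy 5-joint chain skeleton (e.g., ankle-knee-hip-shoulder-wrist).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A_hat = normalized_adjacency(edges, 5)
X = np.random.default_rng(1).normal(size=(5, 2))  # per-joint 2-D coordinates
W = np.random.default_rng(2).normal(size=(2, 8))  # projection (random here)
print(graph_conv(X, A_hat, W).shape)  # → (5, 8)
```

A full model stacks such layers with temporal convolutions along each joint's trajectory and ends in a classifier over action labels.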
Future directions include deeper video understanding (especially actions), unified model deployment (including moving inference to the frontend via JavaScript), performance improvements, intelligent labeling and training pipelines, and script optimization based on user viewing behavior.
Finally, the article promotes the ongoing Youku Video Enhancement and Super‑Resolution Competition, inviting participants to tackle large‑scale high‑definition video challenges.