Artificial Intelligence · 11 min read

Fast Video Editing: Architecture and AI‑Powered Subtitle & Redundant Segment Detection

Baidu’s Fast Editing tool automates video trimming. It uses NLP to recognize subtitles, tone markers, and duplicate sentences, then aligns them with the timeline for one‑click removal; duplicate detection combines character, Levenshtein, and cosine similarity algorithms within a three‑module architecture (Plugin, Window, Caption). On‑device PaddlePaddle analysis is planned to cut latency and cost.

Baidu Geek Talk

The article introduces "Fast Editing" (快速剪辑), a Baidu‑produced video‑editing tool aimed at improving creator efficiency by automatically removing ineffective fragments such as subtitles, repeated sentences, and blank intervals through intelligent recognition.

Design background: Traditional trimming requires manually scrubbing the timeline to locate clip boundaries. Fast Editing uses NLP to recognize subtitles, align them with the video timeline, and map each subtitle’s duration to a video segment, enabling one‑click removal of unwanted parts. It also detects tone markers (filler words), pauses, and duplicate sentences, all of which are common in knowledge‑sharing videos.
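The subtitle‑to‑segment mapping can be sketched as follows. This is a minimal illustration, not Fast Editing’s actual API: the `SubtitleClip` type and `segmentsToKeep` function are assumed names, and the logic simply inverts the removed subtitle ranges into the video segments to retain.

```swift
import Foundation

/// A recognized subtitle positioned on the timeline (illustrative type).
struct SubtitleClip {
    let text: String
    let start: TimeInterval  // seconds
    let end: TimeInterval
}

/// Given subtitle clips flagged for removal, compute the video segments to keep.
func segmentsToKeep(duration: TimeInterval,
                    removing clips: [SubtitleClip]) -> [ClosedRange<TimeInterval>] {
    var keep: [ClosedRange<TimeInterval>] = []
    var cursor: TimeInterval = 0
    for clip in clips.sorted(by: { $0.start < $1.start }) {
        if clip.start > cursor {
            keep.append(cursor...clip.start)  // retain everything before the cut
        }
        cursor = max(cursor, clip.end)        // skip over the removed range
    }
    if cursor < duration {
        keep.append(cursor...duration)        // retain the tail after the last cut
    }
    return keep
}
```

For example, removing a filler spanning 3–5 s from a 10 s video leaves the segments 0–3 s and 5–10 s, which the editor then stitches together.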

Overall architecture: The system is divided into three main modules: Plugin (the UI controller), Window (video rendering and control), and Caption (subtitle processing). The architecture diagram (omitted here) shows how these components interact.

Window module includes:

Timeline – stores clip data.

TimelineModel – stores stickers, subtitles, etc.

Streaming – renders timeline data, pre‑loads video, handles playback and progress callbacks.

LiveWindow – displays the rendered view, manages size, background, and coordinate conversion.

Restorer – records user actions for undo/redo.
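The Window module’s responsibilities could be sketched like this in Swift. All type, protocol, and method names here are assumptions for illustration; the article does not publish Baidu’s actual interfaces.

```swift
import Foundation

/// TimelineModel stores stickers, subtitles, and other overlay data (illustrative).
struct TimelineModel {
    var videoClips: [String] = []
    var subtitles: [String] = []
    var stickers: [String] = []
}

/// Streaming renders timeline data, pre-loads video, and reports playback progress.
protocol Streaming {
    func preload(_ model: TimelineModel)
    func play(onProgress: (Double) -> Void)
}

/// LiveWindow displays the rendered view and converts view x-coordinates
/// into timeline positions.
protocol LiveWindow {
    func display()
    func convertToTimeline(x: Double, viewWidth: Double, duration: Double) -> Double
}

/// Restorer records user actions for undo/redo via two stacks.
final class Restorer<Action> {
    private var undoStack: [Action] = []
    private var redoStack: [Action] = []

    func record(_ action: Action) {
        undoStack.append(action)
        redoStack.removeAll()  // a new action invalidates the redo history
    }
    func undo() -> Action? {
        guard let a = undoStack.popLast() else { return nil }
        redoStack.append(a)
        return a
    }
    func redo() -> Action? {
        guard let a = redoStack.popLast() else { return nil }
        undoStack.append(a)
        return a
    }
}
```

The two-stack shape of `Restorer` is the standard undo/redo pattern: recording a new action clears the redo stack so the history stays linear.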

Caption module implements the core AI features:

Subtitle recognition and matching.

Detection of tone words and pauses.

Duplicate‑sentence detection based on character repetition, Levenshtein distance, and cosine similarity.

The workflow is as follows: when entering the fast‑editing page, the system checks whether subtitles have already been recognized. If not, it extracts audio, uploads it to the backend, and runs NLP analysis to obtain subtitle and tone‑segment data. The resulting subtitle clips are aligned with the video timeline, and blank‑sentence rules are applied to insert empty subtitle clips where needed.
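The entry workflow above can be sketched in Swift. The types, the `recognize` callback standing in for the backend NLP call, and the 0.5 s gap threshold for the blank‑sentence rule are all assumptions, not values from the article.

```swift
import Foundation

/// A recognized sentence on the timeline; empty text marks a blank (silence) clip.
struct Sentence {
    var text: String
    var start, end: TimeInterval
}

/// Backend NLP result: recognized sentences plus tone-word segments (illustrative).
struct NLPResult {
    var sentences: [Sentence]
    var toneSegments: [ClosedRange<TimeInterval>]
}

func buildSubtitleClips(cached: [Sentence]?, duration: TimeInterval,
                        recognize: () -> NLPResult) -> [Sentence] {
    // 1. Reuse cached recognition if subtitles were already extracted;
    //    otherwise extract audio and run the backend NLP analysis.
    let sentences = cached ?? recognize().sentences

    // 2. Blank-sentence rule: insert an empty clip wherever the gap between
    //    adjacent subtitles exceeds a threshold (0.5 s assumed here).
    let gapThreshold: TimeInterval = 0.5
    var result: [Sentence] = []
    var cursor: TimeInterval = 0
    for s in sentences.sorted(by: { $0.start < $1.start }) {
        if s.start - cursor > gapThreshold {
            result.append(Sentence(text: "", start: cursor, end: s.start))
        }
        result.append(s)
        cursor = s.end
    }
    if duration - cursor > gapThreshold {
        result.append(Sentence(text: "", start: cursor, end: duration))
    }
    return result
}
```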

Below are the core similarity‑calculation algorithms used for duplicate detection.

```swift
/**
 * Character similarity
 */
private func similarity(s1: String, s2: String) -> Float {
    var simiCount: Float = 0
    var string2Array = [String]()
    for i in 0..<s2.count {
        let index = s2.index(s2.startIndex, offsetBy: i)
        string2Array.append(String(s2[index]))
    }
    // Count each character of s1 that can be matched in s2,
    // consuming the match so repeated characters are not double-counted.
    for char in s1 {
        if let matchIndex = string2Array.firstIndex(of: String(char)) {
            simiCount += 1
            string2Array.remove(at: matchIndex)
        }
    }
    let maxLength = max(s1.count, s2.count)
    return maxLength == 0 ? 1 : simiCount / Float(maxLength)
}
```

```objc
/**
 * Levenshtein similarity – derived from the minimum number of edit operations
 * (insert, delete, replace) needed to transform one string into the other,
 * normalized to [0, 1] by the longer string's length.
 */
- (CGFloat)levenshteinDistance:(NSString *)s1 compare:(NSString *)s2 {
    NSInteger n = s1.length;
    NSInteger m = s2.length;
    // Empty-string guard: identical if both are empty, otherwise no similarity.
    if (n == 0 && m == 0) { return 1.0; }
    if (n == 0 || m == 0) { return 0.0; }
    int D[n + 1][m + 1];
    for (int i = 0; i < n + 1; i++) { D[i][0] = i; }
    for (int j = 0; j < m + 1; j++) { D[0][j] = j; }
    for (int i = 1; i < n + 1; i++) {
        for (int j = 1; j < m + 1; j++) {
            int left = D[i - 1][j] + 1;       // deletion
            int down = D[i][j - 1] + 1;       // insertion
            int left_down = D[i - 1][j - 1];  // substitution (or match)
            NSString *i1 = [s1 substringWithRange:NSMakeRange(i - 1, 1)];
            NSString *j1 = [s2 substringWithRange:NSMakeRange(j - 1, 1)];
            if (![i1 isEqualToString:j1]) {
                left_down += 1;
            }
            D[i][j] = MIN(left, MIN(down, left_down));
        }
    }
    NSInteger maxLength = MAX(n, m);
    CGFloat rate = 1.0 - ((CGFloat)D[n][m] / (CGFloat)maxLength);
    return rate;
}
```

```objc
/**
 * Cosine similarity: vectorize both strings by character frequency and
 * compute the cosine of the angle between the two vectors.
 */
- (CGFloat)cos:(NSString *)s1 compare:(NSString *)s2 {
    // Empty-string guard: avoids a zero-magnitude vector (division by zero).
    if (s1.length == 0 || s2.length == 0) { return 0.0; }

    // Character frequency of each string. Iterate the strings themselves so
    // that repeated characters are counted, not just recorded once.
    NSMutableDictionary *dicA = [NSMutableDictionary new];
    for (NSInteger i = 0; i < s1.length; i++) {
        NSString *key = [s1 substringWithRange:NSMakeRange(i, 1)];
        dicA[key] = @([dicA[key] integerValue] + 1);
    }
    NSMutableDictionary *dicB = [NSMutableDictionary new];
    for (NSInteger i = 0; i < s2.length; i++) {
        NSString *key = [s2 substringWithRange:NSMakeRange(i, 1)];
        dicB[key] = @([dicB[key] integerValue] + 1);
    }

    // The union of the two character vocabularies defines the vector dimensions.
    NSMutableSet *unionSet = [NSMutableSet setWithArray:dicA.allKeys];
    [unionSet addObjectsFromArray:dicB.allKeys];

    // Dot product and magnitudes over the shared dimensions
    // ([nil integerValue] is 0, so missing characters contribute nothing).
    CGFloat p1 = 0.0, p2 = 0.0, p3 = 0.0;
    for (NSString *key in unionSet) {
        NSInteger a = [dicA[key] integerValue];
        NSInteger b = [dicB[key] integerValue];
        p1 += a * b;
        p2 += a * a;
        p3 += b * b;
    }
    return p1 / (sqrt(p2) * sqrt(p3));
}
```
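The article does not say how the three scores are fused into a duplicate decision; one plausible approach is to flag two sentences as duplicates when a similarity score clears a threshold. Below is a Swift translation of the Levenshtein similarity above plus such an assumed thresholding rule; the 0.8 threshold is illustrative, not a value from the article.

```swift
import Foundation

/// Levenshtein-based similarity in Swift, mirroring the Objective-C version above:
/// 1 minus the edit distance normalized by the longer string's length.
func levenshteinSimilarity(_ s1: String, _ s2: String) -> Double {
    let a = Array(s1), b = Array(s2)
    if a.isEmpty || b.isEmpty { return a.isEmpty && b.isEmpty ? 1 : 0 }
    var d = Array(repeating: Array(repeating: 0, count: b.count + 1),
                  count: a.count + 1)
    for i in 0...a.count { d[i][0] = i }
    for j in 0...b.count { d[0][j] = j }
    for i in 1...a.count {
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            d[i][j] = min(d[i - 1][j] + 1,        // deletion
                          d[i][j - 1] + 1,        // insertion
                          d[i - 1][j - 1] + cost) // substitution or match
        }
    }
    return 1.0 - Double(d[a.count][b.count]) / Double(max(a.count, b.count))
}

/// Assumed fusion rule: flag a duplicate when the score clears a threshold.
func isDuplicate(_ s1: String, _ s2: String, threshold: Double = 0.8) -> Bool {
    return levenshteinSimilarity(s1, s2) >= threshold
}
```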

Performance considerations: because a full timeline refresh is costly, deleting a subtitle only modifies the subtitle‑clip data source and refreshes the UI list; the full timeline is refreshed once when the user exits the page, reducing runtime overhead.
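The deferred‑refresh strategy can be sketched as follows. `CaptionEditor` and its members are hypothetical names, and a counter stands in for the expensive timeline rebuild.

```swift
import Foundation

/// Sketch of deferred refresh: deletions touch only the subtitle data source
/// immediately; the expensive timeline rebuild happens once, on exit.
final class CaptionEditor {
    private(set) var subtitleClips: [String]
    private(set) var timelineRefreshCount = 0  // proxy for the costly rebuild
    private var dirty = false

    init(subtitleClips: [String]) { self.subtitleClips = subtitleClips }

    /// Cheap: update the data source (and the UI list) and mark the timeline dirty.
    func deleteClip(at index: Int) {
        subtitleClips.remove(at: index)
        dirty = true
    }

    /// Expensive: rebuild the full timeline once when the user leaves the page.
    func exitEditingPage() {
        guard dirty else { return }
        timelineRefreshCount += 1
        dirty = false
    }
}
```

However many clips the user deletes in one session, the timeline is rebuilt at most once, on exit.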

Future work: To reduce network dependency, Baidu plans to integrate PaddlePaddle for on‑device audio analysis, subtitle extraction, and tone/duplicate detection, cutting latency and storage cost while improving data security.

Tags: AI · Swift · NLP · Video Editing · Duplicate Detection · Subtitle Detection
Written by Baidu Geek Talk

Follow us to discover more Baidu tech insights.