Optimization Techniques for Image Cropping in Kuaishou YKit AI SDK
This article details the engineering optimizations applied to the image cropping stage of Kuaishou's YKit AI SDK, covering instruction-level fixes, SIMD acceleration, I/O cache improvements, algorithmic refinements, parallel processing, and device‑tier strategies to achieve up to 4.6× speedup on mobile devices.
Kuaishou's YKit AI SDK powers a variety of visual effects, and its image‑cropping step is critical for preparing input frames for AI inference. Although the stage looks minor, it warrants extensive optimization because it runs in almost every AI algorithm, including segmentation, GAN‑based face extraction, and key‑point detection.
The cropping pipeline consists of two main phases: (1) affine transformations—including flip, rotation, scaling, and crop—and (2) format conversion, typically from YUV‑based capture formats to RGBA required by models. After affine mapping, non‑integer coordinates are resolved via bilinear interpolation, followed by color‑space conversion.
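The format-conversion phase can be illustrated with a minimal sketch. The article does not state which YUV matrix YKit uses, so the full-range BT.601 coefficients below are an assumption, and the function names (`q10_to_u8`, `yuv_to_rgba`, `nv12_pixel_to_rgba`) are hypothetical. The sketch also assumes the NV12 chroma plane shares the luma stride.

```c
#include <assert.h>
#include <stdint.h>

/* Round a Q10 fixed-point value to the nearest integer and clamp to 0..255. */
static uint8_t q10_to_u8(int32_t q) {
    if (q < 0) return 0;
    q = (q + 512) >> 10;
    return (uint8_t)(q > 255 ? 255 : q);
}

/* Assumption: full-range BT.601 coefficients, scaled by 1024 (Q10):
 * 1.402 -> 1436, 0.344 -> 352, 0.714 -> 731, 1.772 -> 1815. */
static void yuv_to_rgba(uint8_t y, uint8_t u, uint8_t v, uint8_t rgba[4]) {
    int32_t yq = (int32_t)y << 10;
    int32_t cb = (int32_t)u - 128;
    int32_t cr = (int32_t)v - 128;
    rgba[0] = q10_to_u8(yq + 1436 * cr);
    rgba[1] = q10_to_u8(yq - 352 * cb - 731 * cr);
    rgba[2] = q10_to_u8(yq + 1815 * cb);
    rgba[3] = 255; /* opaque alpha */
}

/* NV12 stores a full-resolution Y plane and an interleaved U,V plane in which
 * each U,V pair is shared by a 2x2 block of pixels. NV21 swaps U and V. */
static void nv12_pixel_to_rgba(const uint8_t *y_plane, const uint8_t *uv_plane,
                               int stride, int x, int y, uint8_t rgba[4]) {
    assert(x >= 0 && y >= 0 && stride > 0);
    const uint8_t *uv = uv_plane + (y / 2) * stride + (x / 2) * 2;
    yuv_to_rgba(y_plane[y * stride + x], uv[0], uv[1], rgba);
}
```

Converting NV21 with the same helper only requires passing `uv[1], uv[0]` instead, which is one reason per-format pipelines are cheap to specialize.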
Optimization approaches are grouped into four categories:
1. Instruction optimization
1.1 Fixed‑point computation replaces floating‑point arithmetic in affine and interpolation calculations, reducing latency and power consumption.
1.2 SIMD acceleration vectorizes both the affine mapping and bilinear interpolation, allowing multiple pixels to be processed in parallel. Example NEON code:
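#include <arm_neon.h>

/* src_x_quantizer and src_y_quantizer are assumed to be int16x4_t source
   coordinates in Q10 fixed point (10 fractional bits), four pixels at a time. */
/* Mask that keeps the low 10 bits, i.e. the fractional part. */
uint16x4_t vec_quantizer_mod = vdup_n_u16((1 << 10) - 1);
/* Fractional offsets along x and y. */
uint16x4_t fx = vand_u16(vreinterpret_u16_s16(src_x_quantizer), vec_quantizer_mod);
uint16x4_t fy = vand_u16(vreinterpret_u16_s16(src_y_quantizer), vec_quantizer_mod);
/* Bilinear weights: alpha along x, beta along y; each pair sums to 1 << 10. */
uint16x4_t alpha0 = vsub_u16(vdup_n_u16(1 << 10), fx);
uint16x4_t alpha1 = fx;
uint16x4_t beta0 = vsub_u16(vdup_n_u16(1 << 10), fy);
uint16x4_t beta1 = fy;
1.3 I/O cache optimization uses CPU pre‑fetching and row‑major memory layout to minimize cache misses, especially when rotation changes the access pattern from row‑wise to column‑wise.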
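The fixed-point weights from 1.1 and the NEON snippet may be easier to follow in scalar form. The following is a minimal sketch, not the SDK's implementation: it assumes a single-channel 8-bit image, Q10 source coordinates, and edge clamping, and the name `bilinear_q10` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define Q_BITS 10
#define Q_ONE  (1 << Q_BITS)
#define Q_MASK (Q_ONE - 1)

/* Sample a single-channel image at the Q10 coordinate (sx_q, sy_q).
 * Neighbors past the right or bottom edge are clamped to the edge pixel. */
static uint8_t bilinear_q10(const uint8_t *src, int stride, int width, int height,
                            int32_t sx_q, int32_t sy_q) {
    assert(sx_q >= 0 && sy_q >= 0);
    int x0 = sx_q >> Q_BITS;            /* integer part */
    int y0 = sy_q >> Q_BITS;
    assert(x0 < width && y0 < height);
    int x1 = (x0 + 1 < width) ? x0 + 1 : x0;
    int y1 = (y0 + 1 < height) ? y0 + 1 : y0;

    uint32_t fx = (uint32_t)(sx_q & Q_MASK);   /* fractional part */
    uint32_t fy = (uint32_t)(sy_q & Q_MASK);
    uint32_t alpha0 = Q_ONE - fx, alpha1 = fx; /* weights along x */
    uint32_t beta0  = Q_ONE - fy, beta1  = fy; /* weights along y */

    /* Horizontal blends are Q10; the vertical blend makes the result Q20.
     * The maximum value is 255 << 20, which fits in 32 bits. */
    uint32_t top = src[y0 * stride + x0] * alpha0 + src[y0 * stride + x1] * alpha1;
    uint32_t bot = src[y1 * stride + x0] * alpha0 + src[y1 * stride + x1] * alpha1;
    uint32_t v = top * beta0 + bot * beta1;

    /* Round to nearest and drop the 20 fractional bits. */
    return (uint8_t)((v + (1u << (2 * Q_BITS - 1))) >> (2 * Q_BITS));
}
```

Because every intermediate stays within 32-bit integers, the same arithmetic maps directly onto integer SIMD lanes, which is what the NEON weights above set up.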
2. Algorithm optimization
2.1 Redundant computation removal pre‑computes per‑row and per‑column terms of the affine matrix, cutting the per‑pixel arithmetic from six multiplications and two additions to just two additions.
2.2 Format customization implements direct conversion pipelines for common camera formats (NV12, NV21, I420) to the model‑required RGBA/BGRA, avoiding extra intermediate buffers.
2.3 Merged computation combines color‑space conversion with bilinear interpolation, reducing memory traffic.
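The pre-computation described in 2.1 can be sketched as follows. This is an illustration under assumptions rather than YKit's code: the affine map is taken to be the inverse (destination to source) transform with Q10 coefficients, and `AffineQ10`, `precompute_terms`, and `map_pixel` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Inverse affine map in Q10:  sx = a*x + b*y + c,  sy = d*x + e*y + f. */
typedef struct { int32_t a, b, c, d, e, f; } AffineQ10;

/* Fill per-column tables (terms that depend only on x) and per-row tables
 * (terms that depend only on y, with the translation folded in). */
static void precompute_terms(const AffineQ10 *m, int dst_w, int dst_h,
                             int32_t *col_x, int32_t *col_y,
                             int32_t *row_x, int32_t *row_y) {
    assert(dst_w > 0 && dst_h > 0);
    for (int x = 0; x < dst_w; ++x) {
        col_x[x] = m->a * x;
        col_y[x] = m->d * x;
    }
    for (int y = 0; y < dst_h; ++y) {
        row_x[y] = m->b * y + m->c;
        row_y[y] = m->e * y + m->f;
    }
}

/* Per-pixel mapping is now one addition per coordinate, two in total. */
static void map_pixel(const int32_t *col_x, const int32_t *col_y,
                      const int32_t *row_x, const int32_t *row_y,
                      int x, int y, int32_t *sx, int32_t *sy) {
    *sx = col_x[x] + row_x[y];
    *sy = col_y[x] + row_y[y];
}
```

The tables cost `dst_w + dst_h` entries per coordinate, which is small next to the `dst_w * dst_h` pixels that reuse them.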
3. Parallel optimization
3.1 CPU multithreading splits the target image into row or column blocks, leveraging a thread pool to run independent cropping tasks concurrently.
3.2 GPU shaders (OpenGL, Metal) provide a parallel path for the same operations on Android and iOS, with CPU and GPU tasks scheduled on separate threads to increase overall throughput.
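The row-block split in 3.1 comes down to giving each worker a contiguous, non-overlapping range of output rows. The sketch below shows only that partitioning step; the thread pool itself is omitted, and `RowBlock` and `row_block_for_worker` are hypothetical names, not YKit APIs.

```c
#include <assert.h>

/* Half-open range of destination rows [row_begin, row_end). */
typedef struct { int row_begin; int row_end; } RowBlock;

/* Split `height` rows across `num_workers` as evenly as possible.
 * The first (height % num_workers) workers receive one extra row. */
static RowBlock row_block_for_worker(int height, int num_workers, int worker) {
    assert(height >= 0 && num_workers > 0);
    assert(worker >= 0 && worker < num_workers);
    int base = height / num_workers;
    int extra = height % num_workers;
    RowBlock b;
    b.row_begin = worker * base + (worker < extra ? worker : extra);
    b.row_end = b.row_begin + base + (worker < extra ? 1 : 0);
    return b;
}
```

Since each output row depends only on the source image and the transform, the blocks can be processed concurrently without locking the destination buffer.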
4. Strategy optimization
Device‑tiered configuration selects lighter‑weight interpolation (e.g., nearest‑neighbor) and smaller output sizes for low‑end phones, while high‑end devices keep bilinear interpolation and larger resolutions. This dynamic selection is driven by YKit's model‑tiering platform, which supports over ten device classes.
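A tiered configuration of this kind could look like the sketch below. Everything in it is illustrative: the three tiers, the output sizes, and the names `DeviceTier`, `CropConfig`, and `crop_config_for_tier` are assumptions, whereas the article describes a platform with more than ten device classes.

```c
#include <assert.h>

typedef enum { TIER_LOW = 0, TIER_MID = 1, TIER_HIGH = 2 } DeviceTier;
typedef enum { INTERP_NEAREST, INTERP_BILINEAR } InterpMode;

typedef struct {
    InterpMode interp; /* interpolation used when cropping */
    int out_width;     /* model input width in pixels */
    int out_height;    /* model input height in pixels */
} CropConfig;

/* Hypothetical values: low-end devices trade quality for speed. */
static CropConfig crop_config_for_tier(DeviceTier tier) {
    assert(tier >= TIER_LOW && tier <= TIER_HIGH);
    CropConfig cfg;
    switch (tier) {
    case TIER_LOW:
        cfg.interp = INTERP_NEAREST;
        cfg.out_width = 128;
        cfg.out_height = 128;
        break;
    case TIER_MID:
        cfg.interp = INTERP_BILINEAR;
        cfg.out_width = 192;
        cfg.out_height = 192;
        break;
    default: /* TIER_HIGH */
        cfg.interp = INTERP_BILINEAR;
        cfg.out_width = 256;
        cfg.out_height = 256;
        break;
    }
    return cfg;
}
```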
Performance testing on an iPhone 7 shows the optimized bilinear cropping dropping from 5.09 ms to 1.09 ms (≈4.6× speedup) when instruction, algorithm, and CPU‑thread optimizations are enabled.
The article concludes that these optimizations form a reusable image‑processing library within YKit, and future work will target new hardware features and algorithmic advances to further accelerate AI‑driven visual effects.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.