Artificial Intelligence · 18 min read

OPPO Multimodal Pretrained Model Deployment in Cloud-Edge Scenarios: Practices and Optimizations

OPPO details how it deploys multimodal pretrained models on resource‑constrained edge devices by compressing CLIP‑based image‑text retrieval, adapting Chinese text‑to‑image generation with LoRA and adapters, and lightweighting diffusion models through layer pruning and progressive distillation, achieving sub‑3‑second generation while preserving cloud‑level quality.

Sohu Tech Products

This article shares OPPO's practical experience in deploying multimodal pretrained models in cloud-edge scenarios, focusing on implementing large model deployment on mobile devices with limited resources while achieving lower training and inference costs.

The content is divided into three main themes:

1. Edge Image-Text Retrieval Technology Research

Previously, photo search in phone albums relied on tags; with the emergence of CLIP, natural-language search became possible. The challenge is maintaining cloud-level precision on-device and offline while meeting edge constraints on performance, search accuracy, and privacy. Algorithm optimization centers on compression: CLIP's dual-encoder design enables large-scale contrastive training and fast retrieval, but has limited fine-grained understanding. The solution combines CLIP with ALBEF's single-stream fusion, distilling both teachers into a smaller student model trained with a contrastive loss. For benchmarks at the 100,000-image scale, direct vector multiplication in fp16 proves sufficient, achieving a 14 ms search time on-device, comparable to a cloud V100 GPU.
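The retrieval step described above can be sketched in a few lines. At the ~100,000-image scale, a single fp16 matrix-vector product over L2-normalized embeddings is fast enough that no approximate-nearest-neighbor index is needed; the shapes and names below are illustrative, not OPPO's actual code.

```python
import numpy as np

def search(image_embs: np.ndarray, text_emb: np.ndarray, top_k: int = 5):
    """image_embs: (N, D) fp16, L2-normalized; text_emb: (D,) fp16, L2-normalized."""
    scores = image_embs @ text_emb        # cosine similarity via dot product
    idx = np.argsort(-scores)[:top_k]     # best matches first
    return idx, scores[idx]

# toy usage: the query embedding is identical to image 42, so it ranks first
rng = np.random.default_rng(0)
embs = rng.normal(size=(1000, 64)).astype(np.float16)
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
idx, top_scores = search(embs, embs[42])
```

Normalizing once at indexing time means each query is just one matrix multiply plus a partial sort, which is why fp16 on a phone NPU can approach server-GPU latency at this corpus size.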

Fine-grained optimization replaces attribute words in queries to construct hard negative samples, training the model to distinguish incorrect attributes. Learning Without Forgetting (LWF) is applied during this fine-tuning to preserve the model's general capabilities.
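A minimal sketch of the attribute-replacement idea: swapping one attribute word in a caption yields a hard negative, so contrastive training penalizes matching the image to the wrong attribute. The word list here is illustrative; a real pipeline would cover many attribute categories.

```python
# Illustrative color-attribute swaps; a production list would be far larger.
COLOR_SWAPS = {"red": "blue", "blue": "red", "black": "white", "white": "black"}

def make_hard_negative(caption: str):
    """Return a copy of the caption with the first known attribute swapped,
    or None if the caption contains no replaceable attribute."""
    tokens = caption.split()
    for i, tok in enumerate(tokens):
        if tok in COLOR_SWAPS:
            neg = tokens.copy()
            neg[i] = COLOR_SWAPS[tok]
            return " ".join(neg)
    return None

neg = make_hard_negative("a red dress on a hanger")
```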

2. Text-to-Image Generation & Understanding Model Application Optimization

Chinese Text-to-Image Model Continued Pre-training: Starting from open-source English models, an adapter is trained to support multilingual input while maintaining output quality. The adapter maps the new text encoder's features into the original model's feature space, avoiding full alignment of the text encoder. For adaptation to Chinese cultural content, LoRA with an orthogonal rotation matrix (R) helps prevent forgetting while fitting new data with minimal replay of previous data.
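The adapter approach can be sketched as a small trainable projection sitting between a Chinese text encoder and the frozen diffusion UNet; only the adapter's parameters are updated. This is an illustrative sketch, not OPPO's implementation, and the dimensions are made up.

```python
import numpy as np

class LinearAdapter:
    """Maps Chinese-encoder features (d_in) into the frozen English
    encoder's feature space (d_out) expected by the diffusion UNet."""

    def __init__(self, d_in: int, d_out: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.02, size=(d_in, d_out))  # trainable
        self.b = np.zeros(d_out)                             # trainable

    def __call__(self, feats: np.ndarray) -> np.ndarray:
        # feats: (seq_len, d_in) -> (seq_len, d_out)
        return feats @ self.W + self.b

# assumed dimensions: 768-d Chinese encoder -> 1024-d UNet conditioning
adapter = LinearAdapter(d_in=768, d_out=1024)
mapped = adapter(np.ones((77, 768)))   # 77 = typical text-token length
```

Because the UNet and its original text-conditioning interface stay frozen, the image-generation quality of the base model is preserved while the adapter learns the cross-lingual mapping.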

Domain-Specific Optimization - Portrait Domain: Advertising scenarios expose issues including face/hand rendering quality, over-refined appearance, and fine-grained attribute mismatches. The team built a dataset of tens of thousands of precisely labeled facial-attribute images, and used per-layer LoRA analysis to determine which UNet layers affect specific facial features.
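The per-layer LoRA analysis can be illustrated as follows: apply the low-rank delta to one layer at a time (by scaling it with alpha) and compare the outputs, revealing which UNet layers drive a given facial attribute. This is a hedged toy sketch; shapes, rank, and the ablation procedure are assumptions.

```python
import numpy as np

def lora_delta(A: np.ndarray, B: np.ndarray, alpha: float) -> np.ndarray:
    """LoRA update: W' = W + alpha * B @ A, where rank r = A.shape[0]."""
    return alpha * (B @ A)

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))   # frozen base weight of one UNet layer
A = rng.normal(size=(4, 64))    # rank-4 LoRA factors for that layer
B = rng.normal(size=(64, 4))

W_on = W + lora_delta(A, B, alpha=1.0)    # this layer's LoRA enabled
W_off = W + lora_delta(A, B, alpha=0.0)   # this layer's LoRA disabled
```

Toggling alpha per layer and regenerating the same portrait shows which layers change which attributes; the update itself stays low-rank, keeping the per-layer storage cost small.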

Text Rendering: Text-image paired data is constructed by extracting text from images along with its position. Million-scale data is then used to train the model, strengthening text-detail generation in the final UNet layers.

Personalized Generation: Moving beyond DreamBooth's test-time fine-tuning toward test-time-free personalized generation. Using SAM for automated segmentation and Grounding DINO for open-domain recognition, a 76-million-sample dataset was constructed for pre-training.

3. Text-to-Image Model Edge Lightweighting

Model Structure Optimization: UNet layers are analyzed to identify those with minimal effect on quality but high latency or memory cost, which are then pruned before retraining. The resulting model has roughly one-third fewer parameters than the original Stable Diffusion.
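The prune-then-retrain analysis can be sketched as an ablation loop: run the model with each block skipped in turn, and record how much the output changes (a quality proxy) versus the cost saved. The toy "blocks" below stand in for UNet layers; the real analysis would use perceptual or FID-style metrics.

```python
def run(blocks, x):
    """Apply a sequence of blocks (stand-ins for UNet layers) to input x."""
    for f in blocks:
        x = f(x)
    return x

def sensitivity(blocks, x):
    """Output change when each block is skipped; small change = prune candidate."""
    base = run(blocks, x)
    scores = []
    for i in range(len(blocks)):
        ablated = blocks[:i] + blocks[i + 1:]   # skip block i
        scores.append(abs(run(ablated, x) - base))
    return scores

blocks = [lambda x: x + 1, lambda x: x * 2, lambda x: x + 0.001]
scores = sensitivity(blocks, 1.0)
# the last block barely changes the output, so it is the pruning candidate
```

In practice the score would be paired with each block's measured latency and memory, so pruning targets blocks that are expensive yet contribute little to quality.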

Sampling Acceleration: Progressive Distillation reduces the number of diffusion steps (32→16→8→4), and Classifier-Free Guidance (CFG) distillation folds the conditional and unconditional passes of CFG into a single forward pass, saving roughly 50% of inference time.
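The step-reduction schedule above can be sketched as a simple halving loop: each distillation round trains a student to match two teacher steps with one, so the sampler's step count halves per round until the target is reached.

```python
def distillation_schedule(start_steps: int, end_steps: int):
    """Step counts after each progressive-distillation round
    (e.g. 32 -> [16, 8, 4]); each entry is one round of student training."""
    steps, rounds = start_steps, []
    while steps > end_steps:
        steps //= 2          # one round halves the sampler's step count
        rounds.append(steps)
    return rounds

schedule = distillation_schedule(32, 4)
```

Three rounds take the sampler from 32 steps down to 4, which, combined with the CFG distillation above, is what makes the reported 2.5-second on-device generation plausible.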

Results: Generation time drops to 2.5 seconds (from more than 10 seconds), with plans to combine operator-level and algorithm-level optimization for sub-second generation. The approach enables low-latency, near-imperceptible generation on edge devices while inheriting the full capabilities of Stable Diffusion.

Tags: model compression, LoRA, Stable Diffusion, text-to-image, edge deployment, CLIP, distillation, OPPO, multimodal model, personalized generation
Written by Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
