Multimodal Large Model for Voucher Verification: Prompt Engineering and Fine‑Tuning
By leveraging multimodal large models such as GPT‑4o and a fine‑tuned Qwen‑VL, the study builds a prompt‑engineered and SFT‑enhanced voucher verification system that classifies product categories, detects diverse defects, and estimates problem counts, reaching ~90 % problem‑count accuracy and 92 % classification accuracy while meeting real‑time business throughput requirements.
With the rapid progress of multimodal large‑model technology, its application scope has expanded to include verification scenarios. In many business domains, voucher verification still relies on manual review, which is labor‑intensive and inefficient.
The core tasks of voucher verification are: (1) determining product category, (2) identifying quality issues (defect detection), and (3) estimating the proportion of problematic items. Traditional algorithms struggle with the diversity of product categories and defect types, making dedicated per‑category models cost‑prohibitive from an ROI perspective.
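The three sub‑tasks above map naturally onto a single structured verdict per voucher. A minimal sketch of such an output schema, with illustrative field names (the article does not specify the production contract):

```python
from dataclasses import dataclass, field

# Hypothetical verdict schema covering the three verification sub-tasks;
# field names and values are illustrative, not the production contract.
@dataclass
class VoucherVerdict:
    category: str                                 # predicted product category
    defects: list = field(default_factory=list)   # defect types detected
    problem_count: int = 0                        # problematic items found
    total_count: int = 0                          # items visible in the image

    def problem_ratio(self) -> float:
        """Proportion of problematic items; 0.0 when nothing is visible."""
        return self.problem_count / self.total_count if self.total_count else 0.0

verdict = VoucherVerdict(category="peach", defects=["rot"],
                         problem_count=3, total_count=12)
print(round(verdict.problem_ratio(), 2))  # 0.25
```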
Recent advances in multimodal models such as GPT‑4o and Qwen‑VL provide strong visual understanding, description, and reasoning capabilities. We therefore explore using these models for voucher verification, focusing on Visual Question Answering (VQA) tasks.
Prompt Engineering: We evaluated zero‑shot, few‑shot, and chain‑of‑thought strategies. Few‑shot prompting proved most effective for defect detection. Prompts are constructed by providing a small set of high‑quality examples (e.g., product images and corresponding defect labels) and asking the model to determine whether a submitted voucher image exhibits specific issues.
Category‑specific prompts are further tuned to handle distinct defect patterns (e.g., rot in peaches, missing grapes, water loss in pomelos). This engineering pipeline covers over 20 defect scenarios across hundreds of leaf‑level product categories.
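The few‑shot construction described above can be sketched as an OpenAI‑style chat payload: one labeled image turn per example, followed by the query image. The image URLs, defect names, and instruction wording here are placeholders, not the production prompts:

```python
# Hypothetical few-shot prompt builder for defect VQA; URLs, labels,
# and wording are illustrative stand-ins for the real prompt library.
FEW_SHOT_EXAMPLES = [
    {"image_url": "https://example.com/peach_rot.jpg", "label": "yes, visible rot"},
    {"image_url": "https://example.com/peach_ok.jpg",  "label": "no"},
]

def build_messages(query_image_url: str, defect: str) -> list:
    """Assemble chat messages: a system instruction, one user/assistant
    turn per labeled example, then the query image to be judged."""
    messages = [{"role": "system",
                 "content": f"You inspect voucher photos for '{defect}'. "
                            "Answer 'yes' or 'no' with a short reason."}]
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": ex["image_url"]}},
            {"type": "text", "text": f"Does this item show {defect}?"},
        ]})
        messages.append({"role": "assistant", "content": ex["label"]})
    messages.append({"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": query_image_url}},
        {"type": "text", "text": f"Does this item show {defect}?"},
    ]})
    return messages

msgs = build_messages("https://example.com/query.jpg", "rot")
print(len(msgs))  # system + 2 examples x 2 turns + query = 6
```

Swapping the example set per category is what makes the category‑specific tuning above cheap: only `FEW_SHOT_EXAMPLES` and the defect name change.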
Fine‑Tuning: To overcome the accuracy ceiling of prompt‑only solutions, we fine‑tuned the open‑source Qwen‑VL model. A high‑quality dataset was assembled from historical records, including single‑image and multi‑image voucher cases, product‑category classification data, and auxiliary white‑box images to mitigate hallucination.
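For concreteness, one common layout for Qwen‑VL SFT records is a JSONL file of conversations with inline `<img>` tags; the paths, question wording, and labels below are illustrative, not drawn from the actual dataset:

```python
import json

# Illustrative multi-image SFT record in Qwen-VL's conversation format;
# file paths, the question, and the answer are made-up examples.
record = {
    "id": "voucher_000001",
    "conversations": [
        {"from": "user",
         "value": "Picture 1: <img>vouchers/000001_a.jpg</img>\n"
                  "Picture 2: <img>vouchers/000001_b.jpg</img>\n"
                  "What is the product category, and how many items have problems?"},
        {"from": "assistant",
         "value": "Category: grape. Problem count: 2 (missing grapes)."},
    ],
}

line = json.dumps(record, ensure_ascii=False)  # one record per JSONL line
print(len(json.loads(line)["conversations"]))  # 2
```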
The training used a 9:1 train‑eval split on A100 GPUs for ~14 hours, achieving the lowest eval loss at epoch 2.2. The fine‑tuned model was deployed via Alibaba’s Whale platform, exposing an OpenAI‑compatible API.
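Because the Whale deployment exposes an OpenAI‑compatible API, the Java service can talk to it like any chat‑completions endpoint. A sketch of the request body follows; the base URL, model name, and image URL are placeholders, and the actual HTTP POST (via an OpenAI SDK or plain HTTP client) is omitted:

```python
import json

# Placeholder endpoint and deployment name; only the payload shape is
# meant to be accurate to an OpenAI-compatible chat-completions API.
BASE_URL = "https://whale.example.internal/v1/chat/completions"

payload = {
    "model": "qwen-vl-sft",  # assumed deployment name
    "messages": [
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/voucher.jpg"}},
            {"type": "text",
             "text": "Classify the product and count problematic items."},
        ]},
    ],
    "temperature": 0,  # deterministic answers suit verification
}

body = json.dumps(payload).encode("utf-8")  # would be POSTed to BASE_URL
print(json.loads(body)["model"])
```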
Evaluation: End‑to‑end tests in a Java service compared model outputs with human labels. The fine‑tuned Qwen‑VL‑SFT improved problem‑count accuracy from 79 % (GPT‑4) to ~90 % and raised classification accuracy from 90 % to 92 %.
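The comparison against human labels reduces to per‑field exact‑match accuracy. A toy version (the numbers below are made up, not the reported 90 %/92 % figures, and the production evaluation ran in Java rather than Python):

```python
# Toy predictions vs. human labels for the two scored fields.
preds  = [{"category": "peach",  "problem_count": 3},
          {"category": "grape",  "problem_count": 1},
          {"category": "pomelo", "problem_count": 0},
          {"category": "peach",  "problem_count": 2}]
labels = [{"category": "peach",  "problem_count": 3},
          {"category": "grape",  "problem_count": 2},
          {"category": "pomelo", "problem_count": 0},
          {"category": "peach",  "problem_count": 2}]

def accuracy(key: str) -> float:
    """Exact-match accuracy for one field across all voucher cases."""
    hits = sum(p[key] == l[key] for p, l in zip(preds, labels))
    return hits / len(labels)

print(accuracy("category"))       # 1.0
print(accuracy("problem_count"))  # 0.75
```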
Stress Test: Under a dual‑card L20 setup, processing a single image per request, the system sustained ~10 QPS, meeting business requirements.
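A throughput figure like this is typically measured by firing concurrent requests and dividing count by wall time. A minimal harness sketch, with `infer()` as a stub standing in for the real HTTP call (so the measured number here is illustrative, not the reported 10 QPS):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def infer(image_id: int) -> str:
    """Stub for one inference request; the sleep stands in for latency."""
    time.sleep(0.01)
    return f"verdict for image {image_id}"

def measure_qps(n_requests: int = 50, concurrency: int = 8) -> float:
    """Completed requests per second under a fixed concurrency level."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(infer, range(n_requests)))
    return n_requests / (time.perf_counter() - start)

print(measure_qps() > 0)  # True
```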
Outlook: The success of multimodal LLMs in this domain demonstrates their potential to transform risk‑control workflows. Ongoing efforts will focus on expanding defect coverage, optimizing prompt pipelines, and scaling deployment.
DaTaobao Tech
Official account of DaTaobao Technology