Multimodal Large Model for Voucher Verification: Prompt Engineering and Fine‑Tuning
By leveraging multimodal large models such as GPT‑4o and a fine‑tuned Qwen‑VL, the study builds a prompt‑engineered and SFT‑enhanced voucher verification system that classifies product categories, detects diverse defects, and estimates problem counts, reaching ~90 % problem‑count accuracy and 92 % classification accuracy while meeting real‑time business throughput requirements.
With the rapid progress of multimodal large‑model technology, its application scope has expanded to include verification scenarios. In many business domains, voucher verification still relies on manual review, which is labor‑intensive and inefficient.
The core tasks of voucher verification are: (1) determining product category, (2) identifying quality issues (defect detection), and (3) estimating the proportion of problematic items. Traditional algorithms struggle with the diversity of product categories and defect types, making dedicated per‑category models cost‑prohibitive from an ROI perspective.
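The three sub‑tasks above map naturally onto a single structured verdict per voucher. A minimal sketch of such an output schema, with illustrative field names (the article does not specify the production contract):

```python
from dataclasses import dataclass, field

# Hypothetical verdict schema covering the three verification sub-tasks;
# field names and values are illustrative, not the production contract.
@dataclass
class VoucherVerdict:
    category: str                                 # predicted product category
    defects: list = field(default_factory=list)   # defect types detected
    problem_count: int = 0                        # problematic items found
    total_count: int = 0                          # items visible in the image

    def problem_ratio(self) -> float:
        """Proportion of problematic items; 0.0 when nothing is visible."""
        return self.problem_count / self.total_count if self.total_count else 0.0

verdict = VoucherVerdict(category="peach", defects=["rot"],
                         problem_count=3, total_count=12)
print(round(verdict.problem_ratio(), 2))  # 0.25
```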
Recent advances in multimodal models such as GPT‑4o and Qwen‑VL provide strong visual understanding, description, and reasoning capabilities. We therefore explore using these models for voucher verification, focusing on Visual Question Answering (VQA) tasks.
Prompt Engineering: We evaluated zero‑shot, few‑shot, and chain‑of‑thought strategies. Few‑shot prompting proved most effective for defect detection. Prompts are constructed by providing a small set of high‑quality examples (e.g., product images and corresponding defect labels) and asking the model to determine whether a submitted voucher image exhibits specific issues.
Category‑specific prompts are further tuned to handle distinct defect patterns (e.g., rot in peaches, missing grapes, water loss in pomelos). This engineering pipeline covers over 20 defect scenarios across hundreds of leaf‑level product categories.
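The few‑shot construction described above can be sketched as an OpenAI‑style chat payload: one labeled image turn per example, followed by the query image. The image URLs, defect names, and instruction wording here are placeholders, not the production prompts:

```python
# Hypothetical few-shot prompt builder for defect VQA; URLs, labels,
# and wording are illustrative stand-ins for the real prompt library.
FEW_SHOT_EXAMPLES = [
    {"image_url": "https://example.com/peach_rot.jpg", "label": "yes, visible rot"},
    {"image_url": "https://example.com/peach_ok.jpg",  "label": "no"},
]

def build_messages(query_image_url: str, defect: str) -> list:
    """Assemble chat messages: a system instruction, one user/assistant
    turn per labeled example, then the query image to be judged."""
    messages = [{"role": "system",
                 "content": f"You inspect voucher photos for '{defect}'. "
                            "Answer 'yes' or 'no' with a short reason."}]
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": ex["image_url"]}},
            {"type": "text", "text": f"Does this item show {defect}?"},
        ]})
        messages.append({"role": "assistant", "content": ex["label"]})
    messages.append({"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": query_image_url}},
        {"type": "text", "text": f"Does this item show {defect}?"},
    ]})
    return messages

msgs = build_messages("https://example.com/query.jpg", "rot")
print(len(msgs))  # system + 2 examples x 2 turns + query = 6
```

Swapping the example set per category is what makes the category‑specific tuning above cheap: only `FEW_SHOT_EXAMPLES` and the defect name change.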
Fine‑Tuning: To overcome the accuracy ceiling of prompt‑only solutions, we fine‑tuned the open‑source Qwen‑VL model. A high‑quality dataset was assembled from historical records, including single‑image and multi‑image voucher cases, product‑category classification data, and auxiliary white‑box images to mitigate hallucination.
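For concreteness, one common layout for Qwen‑VL SFT records is a JSONL file of conversations with inline `<img>` tags; the paths, question wording, and labels below are illustrative, not drawn from the actual dataset:

```python
import json

# Illustrative multi-image SFT record in Qwen-VL's conversation format;
# file paths, the question, and the answer are made-up examples.
record = {
    "id": "voucher_000001",
    "conversations": [
        {"from": "user",
         "value": "Picture 1: <img>vouchers/000001_a.jpg</img>\n"
                  "Picture 2: <img>vouchers/000001_b.jpg</img>\n"
                  "What is the product category, and how many items have problems?"},
        {"from": "assistant",
         "value": "Category: grape. Problem count: 2 (missing grapes)."},
    ],
}

line = json.dumps(record, ensure_ascii=False)  # one record per JSONL line
print(len(json.loads(line)["conversations"]))  # 2
```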
The training used a 9:1 train‑eval split on A100 GPUs for ~14 hours, achieving the lowest eval loss at epoch 2.2. The fine‑tuned model was deployed via Alibaba’s Whale platform, exposing an OpenAI‑compatible API.
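Because the Whale deployment exposes an OpenAI‑compatible API, the Java service can talk to it like any chat‑completions endpoint. A sketch of the request body follows; the base URL, model name, and image URL are placeholders, and the actual HTTP POST (via an OpenAI SDK or plain HTTP client) is omitted:

```python
import json

# Placeholder endpoint and deployment name; only the payload shape is
# meant to be accurate to an OpenAI-compatible chat-completions API.
BASE_URL = "https://whale.example.internal/v1/chat/completions"

payload = {
    "model": "qwen-vl-sft",  # assumed deployment name
    "messages": [
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/voucher.jpg"}},
            {"type": "text",
             "text": "Classify the product and count problematic items."},
        ]},
    ],
    "temperature": 0,  # deterministic answers suit verification
}

body = json.dumps(payload).encode("utf-8")  # would be POSTed to BASE_URL
print(json.loads(body)["model"])
```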
Evaluation: End‑to‑end tests in a Java service compared model outputs with human labels. The fine‑tuned Qwen‑VL‑SFT improved problem‑count accuracy from 79 % (GPT‑4) to ~90 % and raised classification accuracy from 90 % to 92 %.
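The comparison against human labels reduces to per‑field exact‑match accuracy. A toy version (the numbers below are made up, not the reported 90 %/92 % figures, and the production evaluation ran in Java rather than Python):

```python
# Toy predictions vs. human labels for the two scored fields.
preds  = [{"category": "peach",  "problem_count": 3},
          {"category": "grape",  "problem_count": 1},
          {"category": "pomelo", "problem_count": 0},
          {"category": "peach",  "problem_count": 2}]
labels = [{"category": "peach",  "problem_count": 3},
          {"category": "grape",  "problem_count": 2},
          {"category": "pomelo", "problem_count": 0},
          {"category": "peach",  "problem_count": 2}]

def accuracy(key: str) -> float:
    """Exact-match accuracy for one field across all voucher cases."""
    hits = sum(p[key] == l[key] for p, l in zip(preds, labels))
    return hits / len(labels)

print(accuracy("category"))       # 1.0
print(accuracy("problem_count"))  # 0.75
```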
Stress Test: Under a dual‑card L20 setup, processing a single image per request, the system sustained ~10 QPS, meeting business requirements.
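A throughput figure like this is typically measured by firing concurrent requests and dividing count by wall time. A minimal harness sketch, with `infer()` as a stub standing in for the real HTTP call (so the measured number here is illustrative, not the reported 10 QPS):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def infer(image_id: int) -> str:
    """Stub for one inference request; the sleep stands in for latency."""
    time.sleep(0.01)
    return f"verdict for image {image_id}"

def measure_qps(n_requests: int = 50, concurrency: int = 8) -> float:
    """Completed requests per second under a fixed concurrency level."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(infer, range(n_requests)))
    return n_requests / (time.perf_counter() - start)

print(measure_qps() > 0)  # True
```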
Outlook: The success of multimodal LLMs in this domain demonstrates their potential to transform risk‑control workflows. Ongoing efforts will focus on expanding defect coverage, optimizing prompt pipelines, and scaling deployment.
DaTaobao Tech
Official account of DaTaobao Technology