Seal (Stamp) Recognition in Intelligent Document Processing: Challenges, Methods, and Experiments
This article explains how intelligent document processing uses deep‑learning‑based seal detection and OCR techniques—enhanced YOLOv5, multi‑label loss, combined NMS, and end‑to‑end models such as Mask‑TextSpotter, ABCNet, PGNet, and TrOCR—to overcome diverse stamp styles, background interference, and image quality issues, presenting experimental results that surpass commercial OCR vendors.
Intelligent Document Processing (IDP) can automate routine document tasks, and seal (stamp) recognition is a key capability for contract comparison, inventory audit, and invoice reimbursement. Traditional manual verification is labor‑intensive, while OCR‑based seal recognition can significantly reduce costs.
Challenges of seal recognition include strong diversity of stamp styles (different types, shapes, and text layouts), background interference (text overlapping the seal), and degraded image quality caused by ink inconsistency or uneven pressure.
Seal position detection is treated as a typical object‑detection problem. The authors improve the YOLOv5 framework to output not only bounding boxes but also stamp shape and color attributes. Two main modifications are introduced:
Loss calculation uses multi-label classification with a sigmoid-based cross-entropy loss, since a seal's shape and color attributes can co-occur. For example, a label vector of [1, 0, 0, 0, 0, 1, 0] encodes two active attributes at once, and the loss is computed with torch.nn.BCEWithLogitsLoss().
Post-processing replaces the original class-agnostic NMS (torchvision.ops.nms()) with a combined multi-class NMS (tf.image.combined_non_max_suppression()) that respects each box's category, so overlapping boxes of different classes are not suppressed against each other.
Seal text recognition follows detection. Two strategies are discussed:
Two-stage: detection and recognition run as separate models; text lines are first detected, geometrically rectified, and then recognized.
One-stage (end-to-end): predict text directly from the cropped seal image, avoiding the error accumulation of a multi-step pipeline.
Several end‑to‑end models are reviewed:
The Mask-TextSpotter series (v1–v3) combines instance-segmentation-based detection with character-level segmentation for recognition, but requires character-level annotations.
ABCNet series use Bezier curves for arbitrary‑shaped text detection and a CRNN‑based recognizer; v2 adds attention and shared feature maps.
PGNet (by Baidu) is a single‑stage multi‑task detector that predicts text center line, border offset, direction offset, and character classification map, using a PG‑CTC loss to avoid character‑level labels.
TrOCR (Microsoft) employs a Vision Transformer encoder and a Transformer decoder, pretrained on massive synthetic data and fine‑tuned for seal text.
The authors’ experiments adopt the improved YOLOv5 for seal detection and TrOCR for text recognition. Training data consist of over 20,000 real annotated seal images and more than 300,000 synthetic images generated via Photoshop scripts. Data augmentation includes random resizing, rotation, texture addition, HSV perturbation, and random transparent‑channel filling to simulate real‑world variations.
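The augmentation steps above can be sketched with PIL and NumPy. The parameter ranges here are illustrative guesses, not the authors' values, and hue is clipped rather than wrapped for simplicity:

```python
import random
import numpy as np
from PIL import Image

def augment_seal(img: Image.Image) -> Image.Image:
    """Random resize, rotation, and HSV perturbation for a seal crop
    (illustrative ranges, not the article's actual pipeline)."""
    # Random resize
    scale = random.uniform(0.8, 1.2)
    w, h = img.size
    img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    # Random rotation; expand the canvas so the stamp is not clipped
    img = img.rotate(random.uniform(-15, 15), expand=True,
                     fillcolor=(255, 255, 255))
    # HSV perturbation: jitter hue/saturation/value by a small offset
    hsv = np.array(img.convert("HSV"), dtype=np.int16)
    jitter = np.array([random.randint(-8, 8) for _ in range(3)])
    hsv = np.clip(hsv + jitter, 0, 255).astype(np.uint8)
    return Image.fromarray(hsv, mode="HSV").convert("RGB")

aug = augment_seal(Image.new("RGB", (64, 64), (200, 40, 40)))
```

Texture addition and transparent-channel filling would layer on top of this in the same fashion, compositing the synthetic stamp over document backgrounds.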
Two evaluation metrics are defined: seal‑presence F1 (based on detection precision/recall) and seal‑text‑item F1 (based on correct text item recognition). The test set covers 13 real‑world scenarios, and the proposed solution exceeds leading OCR vendors on both metrics.
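Both metrics are standard F1 scores over different units (seals vs. text items). A minimal sketch of how seal-text-item F1 might be computed; the matching rule used here, exact string match per item, is an assumption:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true-positive / false-positive / false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Seal-text-item F1: a ground-truth item counts as a hit only when the
# recognized string matches exactly (assumed rule; strings are made up).
gt = ["ACME CORP", "CONTRACT SEAL", "2023"]
pred = ["ACME CORP", "CONTRACT SEAL", "2028", "EXTRA"]
tp = len(set(gt) & set(pred))        # 2 correct items
fp = len(pred) - tp                  # 2 spurious predictions
fn = len(gt) - tp                    # 1 missed item
score = f1_score(tp, fp, fn)         # precision 0.5, recall 2/3 -> F1 = 4/7
```

Seal-presence F1 is computed the same way, with a detected bounding box (rather than a text string) as the unit of matching.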
Seal recognition has been deployed on the Laiye IDP platform, with a demo link provided for users to try the service.
Laiye Technology Team
Official account of Laiye Technology, featuring its best tech innovations, practical implementations, and cutting‑edge industry insights.