Checkbox Detection and State Classification Using YOLOv5
This article describes a complete pipeline for detecting checkboxes in document images and classifying each one as selected or unselected. The approach combines YOLOv5 object detection, synthetic and semi‑synthetic data generation, specialized post‑processing, and association logic to handle varied checkbox shapes, positions, and markings.
Checkboxes are common elements in documents used to capture user input, and accurate processing requires locating each box, recognizing its state, and linking it to its descriptive text.
Initial attempts using OCR alone proved insufficient due to the variability of box shapes, sizes, and markings. The problem is abstracted into two tasks: checkbox position detection and state determination.
The technical approach treats both checkboxes and their markings as detection targets and uses YOLOv5 to identify them. Targets measure roughly 20–40 px after images are resized to 1280 × 1280, and because the targets are small and visually simple, a compact, fast model variant suffices.
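To keep small targets detectable, resizing should preserve aspect ratio rather than stretch the page. The sketch below shows a letterbox resize of the kind YOLOv5 pipelines commonly use; the function name, fill value, and the A4 page dimensions are illustrative, not from the article.

```python
from PIL import Image

def letterbox(img: Image.Image, size: int = 1280, fill: int = 114) -> Image.Image:
    """Resize to size x size, preserving aspect ratio with gray padding."""
    w, h = img.size
    scale = size / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    resized = img.resize((new_w, new_h), Image.BILINEAR)
    canvas = Image.new(img.mode, (size, size), fill)
    # Center the resized image on the padded square canvas.
    canvas.paste(resized, ((size - new_w) // 2, (size - new_h) // 2))
    return canvas

# A hypothetical 300-dpi A4 scan: a 60-px checkbox shrinks to ~22 px,
# consistent with the 20-40 px target range mentioned above.
page = Image.new("L", (2480, 3508), 255)
out = letterbox(page)
print(out.size)  # (1280, 1280)
```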
Data preparation includes a large real dataset (≈30 scenes, 1,200 images, 45 k annotations) and synthetic data generated by compositing checkbox and marking assets onto varied backgrounds. Two synthetic strategies are employed: fully synthetic generation by randomly placing and transforming assets, and semi‑synthetic augmentation that inserts synthetic markings into real, empty checkbox regions to balance class distribution.
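The fully synthetic strategy can be sketched as follows: draw (or load) checkbox and marking assets, randomly transform them, and composite them onto a background while recording bounding-box labels. All helper names, asset shapes, size ranges, and class IDs below are assumptions for illustration; the real system composites scanned assets rather than drawn ones.

```python
import random
from PIL import Image, ImageDraw

def make_checkbox_asset(size: int = 48) -> Image.Image:
    """Draw a hollow square as a stand-in for a real checkbox asset."""
    box = Image.new("RGBA", (size, size), (0, 0, 0, 0))
    d = ImageDraw.Draw(box)
    d.rectangle([2, 2, size - 3, size - 3], outline=(0, 0, 0, 255), width=3)
    return box

def make_check_mark(size: int = 48) -> Image.Image:
    """Draw a simple tick as a stand-in for a real marking asset."""
    mark = Image.new("RGBA", (size, size), (0, 0, 0, 0))
    d = ImageDraw.Draw(mark)
    d.line([(8, size // 2), (size // 2 - 4, size - 10), (size - 6, 6)],
           fill=(0, 0, 0, 255), width=4)
    return mark

def synthesize(background: Image.Image, n_boxes: int = 5, p_marked: float = 0.5):
    """Paste randomly placed, lightly rotated assets onto a background and
    return the composite plus (class, x1, y1, x2, y2) labels."""
    canvas = background.convert("RGBA")
    labels = []
    for _ in range(n_boxes):
        size = random.randint(32, 64)
        box = make_checkbox_asset(size).rotate(random.uniform(-5, 5), expand=True)
        x = random.randint(0, canvas.width - box.width)
        y = random.randint(0, canvas.height - box.height)
        canvas.alpha_composite(box, (x, y))
        labels.append((0, x, y, x + box.width, y + box.height))  # class 0: checkbox
        if random.random() < p_marked:  # sometimes add a marking on top
            canvas.alpha_composite(make_check_mark(size), (x, y))
            labels.append((1, x, y, x + size, y + size))  # class 1: marking
    return canvas.convert("RGB"), labels

bg = Image.new("RGB", (640, 640), (255, 255, 255))
img, labels = synthesize(bg)
```

The semi-synthetic variant follows the same compositing step, but anchors the marking inside a real, empty checkbox region instead of a random position.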
Image augmentation (small rotations, gamma correction, blur, noise, and color shift) further diversifies training data.
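The listed augmentations can be combined into one randomized pass per training image. The parameter ranges below are plausible guesses, not the article's values.

```python
import numpy as np
from PIL import Image, ImageFilter

def augment(img: Image.Image, rng: np.random.Generator) -> Image.Image:
    """Small rotation, gamma correction, color shift, noise, and blur
    (ranges are illustrative)."""
    # Small rotation, padding with white like a scanned page.
    img = img.rotate(rng.uniform(-3, 3), fillcolor=(255, 255, 255))
    arr = np.asarray(img).astype(np.float32) / 255.0
    # Gamma correction brightens or darkens mid-tones.
    arr = arr ** rng.uniform(0.7, 1.4)
    # Mild per-channel color shift.
    arr = np.clip(arr + rng.uniform(-0.05, 0.05, size=3), 0, 1)
    # Additive Gaussian noise.
    arr = np.clip(arr + rng.normal(0, 0.02, arr.shape), 0, 1)
    img = Image.fromarray((arr * 255).astype(np.uint8))
    # Light Gaussian blur simulates scan softness.
    return img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0, 1)))

rng = np.random.default_rng(0)
out = augment(Image.new("RGB", (256, 256), (255, 255, 255)), rng)
```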
Model training proceeds in two stages: pre‑training on abundant synthetic data (with optional pretrained weights) followed by fine‑tuning on real data.
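With the YOLOv5 repository, the two stages might look like the commands below. Dataset YAML names, epoch counts, and run names are placeholders; only the general `train.py` flags come from the YOLOv5 tooling.

```shell
# Stage 1: pre-train on the abundant synthetic set, optionally from COCO weights.
python train.py --img 1280 --epochs 100 \
    --data synthetic_checkbox.yaml --weights yolov5s.pt --name synth_pretrain

# Stage 2: fine-tune on the smaller real dataset from the stage-1 checkpoint.
python train.py --img 1280 --epochs 50 \
    --data real_checkbox.yaml \
    --weights runs/train/synth_pretrain/weights/best.pt --name real_finetune
```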
Post‑processing includes class‑aware non‑maximum suppression to prevent merging of boxes and markings, a modified IoU calculation (intersection over the smaller box area), and an association algorithm that expands each checkbox region, groups nearby markings, and resolves ambiguous matches using IoU and distance metrics.
The final system achieves approximately 94 % accuracy and over 98 % recall on a test set of ~100 images, with identified areas for improvement such as expanding real‑world data, handling high overlap cases, and moving toward end‑to‑end solutions.
Laiye Technology Team
Official account of Laiye Technology, featuring its best tech innovations, practical implementations, and cutting‑edge industry insights.