
Applying YOLOv5 Object Detection for Black, Color, and Blank Screen Classification in Video Frames

This article presents a method that replaces manual visual inspection with an automated YOLOv5‑based object detection pipeline to classify video frames as normal, colorful, or black screens, detailing data annotation, training, loss calculation, and inference code, and reporting 97% accuracy versus 88% for a ResNet classifier on the same test set.

360 Quality & Efficiency

Video frame quality inspection often relies on manual visual checks for black or garbled screens, which is labor‑intensive and inefficient. To automate this task, the article proposes using an object detection model (YOLOv5) to perform classification, especially when the dataset is small and class differences are subtle.

Core Technology and Architecture

The workflow contrasts traditional image classification (e.g., VGG, ResNet, DenseNet) with object detection pipelines (Fast R‑CNN, SSD, YOLO). By treating the whole image as a single detection target, YOLOv5 can be repurposed for pure classification.

Data Annotation

Instead of labor‑intensive bounding‑box labeling, the whole image is taken as the target; the image center serves as the object center. Labels are defined as 0 = Normal screen, 1 = Colorful screen, 2 = Black screen. The annotation function is:

import os

import cv2

OBJECT_DICT = {"Normalscreen": 0, "Colorfulscreen": 1, "Blackscreen": 2}

def parse_json_file(image_path):
    imageName = os.path.basename(image_path).split('.')[0]
    img = cv2.imread(image_path)
    size = img.shape  # OpenCV returns (height, width, channels)
    # The class name is taken from the directory the image sits in
    label = image_path.split('/')[4].split('\\')[0]
    label = OBJECT_DICT.get(label)
    imageHeight, imageWidth = size[0], size[1]  # shape[0] is height, shape[1] is width
    xmin, ymin = (0, 0)
    xmax, ymax = (imageWidth, imageHeight)
    # Whole image as the target: normalized center, width, and height
    xcenter = (xmin + xmax) / 2 / float(imageWidth)
    ycenter = (ymin + ymax) / 2 / float(imageHeight)
    width = (xmax - xmin) / float(imageWidth)
    height = (ymax - ymin) / float(imageHeight)
    label_dict = {label: [str(xcenter), str(ycenter), str(width), str(height)]}
    return imageName, sorted(label_dict.items(), key=lambda x: x[0])
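Because the whole image is the target, every normalized label reduces to the same box (center 0.5, 0.5, width and height 1.0); only the class id varies. A minimal self‑contained sketch of the YOLO‑format label line that results for each image (the helper name is hypothetical, not from the article):

```python
def full_image_label_line(class_id: int) -> str:
    """YOLO label line: 'class x_center y_center width height'.

    For a box spanning the whole image, the normalized geometry is constant.
    """
    xmin, ymin, xmax, ymax = 0.0, 0.0, 1.0, 1.0  # already normalized
    xcenter = (xmin + xmax) / 2
    ycenter = (ymin + ymax) / 2
    width = xmax - xmin
    height = ymax - ymin
    return f"{class_id} {xcenter} {ycenter} {width} {height}"

print(full_image_label_line(2))  # 2 0.5 0.5 1.0 1.0 (black screen)
```

This is why the annotation step is so cheap here: no human ever has to draw a box, only sort images into class directories.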

Training Process

The training follows the standard YOLOv5 pipeline: loading data paths from a YAML file, creating the model, setting a cosine learning‑rate schedule, and iterating over epochs.

# Load data paths
with open(opt.data) as f:
    data_dict = yaml.load(f, Loader=yaml.FullLoader)
    train_path = data_dict['train']
    test_path = data_dict['val']
Number_class, names = (1, ['item']) if opt.single_cls else (int(data_dict['nc']), data_dict['names'])

# Create model
model = Model(opt.cfg, ch=3, nc=Number_class).to(device)

# Learning‑rate lambda
lf = lambda x: ((1 + math.cos(x * math.pi / epochs)) / 2) * (1 - hyp['lrf']) + hyp['lrf']
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)

# Training loop
for epoch in range(start_epoch, epochs):
    model.train()
    ...
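The cosine lambda above scales the base learning rate from 1.0 at epoch 0 down to hyp['lrf'] at the final epoch. A standalone sketch of the schedule (the hyperparameter values here are illustrative, not the article's):

```python
import math

epochs = 100
lrf = 0.2  # illustrative final learning-rate fraction (stands in for hyp['lrf'])

# Same cosine lambda as in the training script
lf = lambda x: ((1 + math.cos(x * math.pi / epochs)) / 2) * (1 - lrf) + lrf

print(lf(0))       # 1.0  -> full base LR at the start
print(lf(50))      # 0.6  -> midpoint of the cosine decay
print(lf(100))     # 0.2  -> settles at lrf by the last epoch
```

A smooth decay like this tends to stabilize the final epochs compared with step schedules, which is helpful on small datasets like the one used here.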

Loss Computation

The loss combines bounding‑box (GIoU), objectness, and classification components, implemented as:

def compute_loss(p, targets, model):
    device = targets.device
    loss_cls, loss_box, loss_obj = torch.zeros(1, device=device), torch.zeros(1, device=device), torch.zeros(1, device=device)
    tcls, tbox, indices, anchors = build_targets(p, targets, model)
    BCEcls = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([model.hyp['cls_pw']])).to(device)
    BCEobj = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([model.hyp['obj_pw']])).to(device)
    ...  # per-layer loop (elided): accumulates the three terms and sets bs, the batch size
    loss = loss_box + loss_obj + loss_cls
    return loss * bs, torch.cat((loss_box, loss_obj, loss_cls, loss)).detach()
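The box term uses GIoU, which extends plain IoU with a penalty based on the smallest box enclosing both the prediction and the target, so non‑overlapping boxes still get a useful gradient. A self‑contained sketch for axis‑aligned (x1, y1, x2, y2) boxes, independent of the YOLOv5 implementation:

```python
def giou(box1, box2):
    """Generalized IoU for two (x1, y1, x2, y2) boxes."""
    # Intersection area
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    iou = inter / union
    # Smallest enclosing box C
    cx1, cy1 = min(box1[0], box2[0]), min(box1[1], box2[1])
    cx2, cy2 = max(box1[2], box2[2]), max(box1[3], box2[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    # GIoU = IoU - (|C| - |union|) / |C|
    return iou - (c_area - union) / c_area

print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 for identical boxes
```

The box loss used in training is then 1 - GIoU, averaged over matched anchors. For this full‑image‑target setup the box term is nearly trivial, so the classification and objectness terms dominate.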

Inference (Detection)

During prediction, the model outputs bounding boxes, objectness scores, and class probabilities. The class of the highest‑confidence detection is taken as the final classification:

def detect(opt, img):
    # `device` and `half` are module-level settings in the original script;
    # loading the model once outside this function would avoid re-reading the weights per frame
    model = experimental.attempt_load(opt.weights, map_location=device)
    img = letterbox(img)[0]  # resize with padding to the model's input shape
    img = np.ascontiguousarray(img[..., ::-1].transpose(2, 0, 1))  # BGR -> RGB, HWC -> CHW
    img = torch.from_numpy(img).to(device).half() if half else torch.from_numpy(img).float().to(device)
    img /= 255.0  # normalize pixel values to [0, 1]
    if img.ndimension() == 3:
        img = img.unsqueeze(0)  # add batch dimension
    pred = model(img, augment=opt.augment)[0]
    pred = non_max_suppression(pred, opt.conf_thres, opt.iou_thres, classes=opt.classes, agnostic=opt.agnostic_nms)
    for i, det in enumerate(pred):
        if det is not None and len(det):
            all_conf = det[:, 4]  # confidence column
            if len(det[:, -1]) > 1:
                # Several boxes survived NMS: keep the class of the most confident one
                ind = torch.max(all_conf, 0)[1]
                detect_class = int(torch.take(det[:, -1], ind))
            else:
                detect_class = int(det[0, -1])
            return detect_class
    return None  # no detection above the confidence threshold
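Wiring the detector output back to a human‑readable label is a one‑line lookup over the inverse of the annotation mapping. A sketch assuming the OBJECT_DICT defined earlier (the helper name class_name is hypothetical):

```python
OBJECT_DICT = {"Normalscreen": 0, "Colorfulscreen": 1, "Blackscreen": 2}
ID_TO_NAME = {v: k for k, v in OBJECT_DICT.items()}  # invert id -> name

def class_name(detect_class):
    """Map the integer returned by detect() back to its label string."""
    return ID_TO_NAME.get(detect_class, "Unknown")

print(class_name(2))  # Blackscreen
```

In a batch pipeline, this mapping is where a frame would be flagged for review or counted toward per‑class statistics.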

Results

A test set of 600 frames (200 per class) was evaluated. ResNet achieved 88% accuracy, misclassifying many colorful screens as normal. YOLOv5 achieved 97% accuracy, reliably distinguishing all three classes.

Conclusion

For small‑scale datasets where pure classification struggles, repurposing an object‑detection framework like YOLOv5 can significantly improve performance. The approach is applicable to other image‑classification problems where class boundaries are ambiguous, and similar detection architectures can be adapted accordingly.

Tags: image classification, computer vision, Python, deep learning, object detection, video quality, YOLOv5
Written by

360 Quality & Efficiency

360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.
