Solution Overview of the AMAP-TECH Algorithm Competition: Dynamic Road Condition Analysis from In‑Vehicle Video Images
To tackle the AMAP‑TECH competition’s dynamic road‑condition classification from scarce, imbalanced vehicle‑video frames, the team combined YOLOv5 object detection, ResNeXt101‑based semantic embeddings, and engineered temporal detection statistics, feeding the fused features into a five‑fold LightGBM model that achieved top weighted‑F1 performance.
The AMAP-TECH algorithm competition, co‑hosted by Amap (Gaode Map) and Alibaba Cloud Tianchi, focused on the task "Dynamic Road Condition Analysis Based on In‑Vehicle Video Images". The problem originates from real‑world traffic scenarios where road‑condition information influences route planning, travel mode selection, ETA estimation, and also provides valuable insights for traffic management and urban planning.
Core Task
The competition required classifying the current road condition from vehicle video frames. In the preliminary round there were three classes – free flow, slow, and congestion. The final round added a fourth class, "closed". The evaluation metric was a weighted F1 score with class weights 0.1 (free flow), 0.2 (slow), 0.3 (congestion), and 0.4 (closed).
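Read literally, the metric is a fixed-weight sum of per-class F1 scores. A minimal sketch of that reading (the official scoring script was not published, and the class-to-index ordering here is an assumption):

```python
def weighted_f1(y_true, y_pred):
    """Competition-style weighted F1: sum of per-class F1 scores, each
    multiplied by its fixed weight. Class indices 0..3 are assumed to mean
    free flow, slow, congestion, closed respectively."""
    class_weights = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
    score = 0.0
    for c, w in class_weights.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Standard F1 from TP/FP/FN; zero when the class never appears.
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        score += w * f1
    return score
```

Note how the 0.4 weight on "closed" makes errors on the rarest, most safety-relevant class the most expensive.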
Challenges and Difficulties
Very limited training data (only ~7k samples) and severe class imbalance: closed and free‑flow together account for 86.9% of the data, while slow and congestion account for only 3.7% and 9% respectively.
High visual similarity between free‑flow, slow, and congestion makes it hard for a pure CNN to find clear decision boundaries. Exploiting temporal information from image sequences is essential.
The competition used both A‑list and B‑list leaderboards; achieving stable performance on a small dataset, and ensuring the model could be extended as new road‑information sources become available, were both critical.
Data exploration revealed several problematic cases: images of open roads that were mistakenly classified as "closed" due to overpasses or storefronts, and blurry or occluded frames that were difficult even for humans to judge.
These observations motivated a multimodal approach: combine image‑level semantic embeddings with detection‑derived tabular features (vehicle count, coordinates, pedestrian presence, etc.).
Algorithm Overview
The final pipeline consists of three main components:
Object detection using YOLOv5 to extract vehicle, pedestrian, truck, bicycle, and road‑obstacle information.
Image‑level feature extraction: a ResNeXt101_32x8d_wsl model provides intermediate‑layer embeddings for each frame.
Feature engineering on the detection results (statistics over the sequence such as mean, variance of vehicle counts, confidence sums for obstacles).
The concatenated detection and embedding features are fed into a LightGBM classifier (five‑fold cross‑validation). This architecture allows easy extension when new data sources become available.
Image Feature Vector Extraction
Key frames are passed through a pretrained ResNeXt101_32x8d_wsl network; the intermediate layer output serves as a robust semantic representation that is less sensitive to domain variations (different vehicle models, pedestrians, etc.).
Data Augmentation
Because of the small and noisy dataset, a variety of image augmentation techniques were applied, including a custom "object‑paste" augmentation to help the model learn the semantics of the "closed" class.
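The write‑up does not detail the augmentation, so the following is only a minimal sketch of what an "object‑paste" step could look like, assuming RGBA crops of closure objects (cones, barriers, water‑filled fences) are available:

```python
import random
from PIL import Image

def paste_objects(road_img, object_crops, max_objects=3):
    """Object-paste augmentation sketch: composite cropped road-closure
    objects onto an ordinary road frame so the model sees more 'closed'
    examples. object_crops are RGBA PIL images with transparent backgrounds."""
    out = road_img.copy()
    w, h = out.size
    for crop in random.sample(object_crops, k=min(max_objects, len(object_crops))):
        # Random scale so pasted objects vary in apparent distance.
        scale = random.uniform(0.5, 1.5)
        cw = max(1, int(crop.width * scale))
        ch = max(1, int(crop.height * scale))
        obj = crop.resize((cw, ch))
        # Place objects in the lower half of the frame, i.e. on the road surface.
        x = random.randint(0, max(0, w - cw))
        y = random.randint(h // 2, max(h // 2, h - ch))
        out.paste(obj, (x, y), obj)  # the alpha channel acts as the paste mask
    return out
```

Restricting paste positions to the lower half of the frame is one plausible way to keep the synthesized scenes physically sensible.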
Classification Model Selection
ResNeXt101_32x8d_wsl was chosen for its high accuracy and robustness.
While transfer learning with deep CNNs captured frame‑level information well, end‑to‑end sequence models struggled with noise. Therefore, handcrafted temporal features from detection results were combined with the image embeddings, yielding a several‑percentage‑point performance boost.
Detection Part
Two detectors were used:
General traffic detector (YOLOv5 pretrained on COCO) to detect vehicles, pedestrians, trucks, bicycles, etc. No fine‑tuning on the competition data was performed.
Road‑obstacle detector (YOLOv5 trained on the provided obstacle annotations) to detect closed‑road objects.
The detectors output bounding boxes, class labels, and confidence scores for each frame.
Detection Feature Engineering
Temporal features were derived from the detection outputs: mean and variance of vehicle counts across the sequence, summed confidence scores for obstacles, etc. Using confidence sums instead of raw counts mitigated cumulative detection errors.
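A minimal sketch of this feature-engineering step; the detection schema (`name`/`confidence` dicts) and the exact statistics are illustrative, not the team's actual feature set:

```python
import numpy as np

def sequence_features(frame_detections):
    """Turn per-frame detector outputs into fixed-length temporal features.
    frame_detections: one entry per frame in the sequence, each a list of
    dicts like {'name': 'car', 'confidence': 0.91} (illustrative schema)."""
    vehicle_counts = [
        sum(1 for d in dets if d['name'] in ('car', 'truck', 'bus'))
        for dets in frame_detections
    ]
    # Summing obstacle confidences instead of counting boxes dampens the
    # cumulative effect of spurious low-confidence detections.
    obstacle_conf_sum = sum(
        d['confidence']
        for dets in frame_detections
        for d in dets if d['name'] == 'obstacle'
    )
    return {
        'veh_mean': float(np.mean(vehicle_counts)),
        'veh_var': float(np.var(vehicle_counts)),
        'obstacle_conf_sum': obstacle_conf_sum,
    }
```

Vectors like these can be concatenated with the frame embeddings to form the tabular input for the downstream classifier.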
Final Submission
LightGBM with five‑fold cross‑validation was used for the final predictions. For each test image, five embedding vectors (one per CV fold) were generated, and their predictions were averaged (a TTA‑like strategy). Averaging the A‑list and B‑list predictions, weighted by their respective weighted‑F1 scores, yielded the best online result.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.