Solution Overview of the AMAP-TECH Algorithm Competition: Dynamic Road Condition Analysis from In‑Vehicle Video Images
To tackle the AMAP‑TECH competition’s dynamic road‑condition classification from scarce, imbalanced vehicle‑video frames, the team combined YOLOv5 object detection, ResNeXt101‑based semantic embeddings, and engineered temporal detection statistics, feeding the fused features into a five‑fold LightGBM model that achieved top weighted‑F1 performance.
The AMAP-TECH algorithm competition, co‑hosted by Amap (Gaode Map) and Alibaba Cloud Tianchi, focused on the task "Dynamic Road Condition Analysis Based on In‑Vehicle Video Images". The problem originates from real‑world traffic scenarios where road‑condition information influences route planning, travel mode selection, ETA estimation, and also provides valuable insights for traffic management and urban planning.
Core Task
The competition required classifying the current road condition from vehicle video frames. In the preliminary round there were three classes – free flow, slow, and congestion. The final round added a fourth class, "closed". The evaluation metric was a weighted F1 score with class weights 0.1 (free flow), 0.2 (slow), 0.3 (congestion), and 0.4 (closed).
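Read literally, the metric is a fixed-weight sum of per-class F1 scores. A minimal sketch of that reading (the official scoring script was not published, and the class-to-index ordering here is an assumption):

```python
def weighted_f1(y_true, y_pred):
    """Competition-style weighted F1: sum of per-class F1 scores, each
    multiplied by its fixed weight. Class indices 0..3 are assumed to mean
    free flow, slow, congestion, closed respectively."""
    class_weights = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
    score = 0.0
    for c, w in class_weights.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Standard F1 from TP/FP/FN; zero when the class never appears.
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        score += w * f1
    return score
```

Note how the 0.4 weight on "closed" makes errors on the rarest, most safety-relevant class the most expensive.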
Challenges and Difficulties
Very limited training data (only ~7k samples) and severe class imbalance: closed and free‑flow together account for 86.9% of the data, while slow and congestion account for only 3.7% and 9% respectively.
High visual similarity between free‑flow, slow, and congestion makes it hard for a pure CNN to find clear decision boundaries. Exploiting temporal information from image sequences is essential.
The competition used both A‑list and B‑list leaderboards; achieving stable performance on a small dataset, and ensuring the model could be extended as new road‑information sources become available, were both critical.
Data exploration revealed several problematic cases: images of open roads that were mistakenly classified as "closed" due to overpasses or storefronts, and blurry or occluded frames that were difficult even for humans to judge.
These observations motivated a multimodal approach: combine image‑level semantic embeddings with detection‑derived tabular features (vehicle count, coordinates, pedestrian presence, etc.).
Algorithm Overview
The final pipeline consists of three main components:
Object detection using YOLOv5 to extract vehicle, pedestrian, truck, bicycle, and road‑obstacle information.
Image‑level feature extraction: a ResNeXt101_32x8d_wsl model provides intermediate‑layer embeddings for each frame.
Feature engineering on the detection results (statistics over the sequence such as mean, variance of vehicle counts, confidence sums for obstacles).
The concatenated detection and embedding features are fed into a LightGBM classifier (five‑fold cross‑validation). This architecture allows easy extension when new data sources become available.
Image Feature Vector Extraction
Key frames are passed through a pretrained ResNeXt101_32x8d_wsl network; the intermediate layer output serves as a robust semantic representation that is less sensitive to domain variations (different vehicle models, pedestrians, etc.).
Data Augmentation
Because of the small and noisy dataset, a variety of image augmentation techniques were applied, including a custom "object‑paste" augmentation to help the model learn the semantics of the "closed" class.
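The write‑up does not detail the augmentation, so the following is only a minimal sketch of what an "object‑paste" step could look like, assuming RGBA crops of closure objects (cones, barriers, water‑filled fences) are available:

```python
import random
from PIL import Image

def paste_objects(road_img, object_crops, max_objects=3):
    """Object-paste augmentation sketch: composite cropped road-closure
    objects onto an ordinary road frame so the model sees more 'closed'
    examples. object_crops are RGBA PIL images with transparent backgrounds."""
    out = road_img.copy()
    w, h = out.size
    for crop in random.sample(object_crops, k=min(max_objects, len(object_crops))):
        # Random scale so pasted objects vary in apparent distance.
        scale = random.uniform(0.5, 1.5)
        cw = max(1, int(crop.width * scale))
        ch = max(1, int(crop.height * scale))
        obj = crop.resize((cw, ch))
        # Place objects in the lower half of the frame, i.e. on the road surface.
        x = random.randint(0, max(0, w - cw))
        y = random.randint(h // 2, max(h // 2, h - ch))
        out.paste(obj, (x, y), obj)  # the alpha channel acts as the paste mask
    return out
```

Restricting paste positions to the lower half of the frame is one plausible way to keep the synthesized scenes physically sensible.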
Classification Model Selection
ResNeXt101_32x8d_wsl was chosen for its high accuracy and robustness.
While transfer learning with deep CNNs captured frame‑level information well, end‑to‑end sequence models struggled with noise. Therefore, handcrafted temporal features from detection results were combined with the image embeddings, yielding a several‑percentage‑point performance boost.
Detection Part
Two detectors were used:
General traffic detector (YOLOv5 pretrained on COCO) to detect vehicles, pedestrians, trucks, bicycles, etc. No fine‑tuning on the competition data was performed.
Road‑obstacle detector (YOLOv5 trained on the provided obstacle annotations) to detect closed‑road objects.
The detectors output bounding boxes, class labels, and confidence scores for each frame.
Detection Feature Engineering
Temporal features were derived from the detection outputs: mean and variance of vehicle counts across the sequence, summed confidence scores for obstacles, etc. Using confidence sums instead of raw counts mitigated cumulative detection errors.
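A minimal sketch of this feature-engineering step; the detection schema (`name`/`confidence` dicts) and the exact statistics are illustrative, not the team's actual feature set:

```python
import numpy as np

def sequence_features(frame_detections):
    """Turn per-frame detector outputs into fixed-length temporal features.
    frame_detections: one entry per frame in the sequence, each a list of
    dicts like {'name': 'car', 'confidence': 0.91} (illustrative schema)."""
    vehicle_counts = [
        sum(1 for d in dets if d['name'] in ('car', 'truck', 'bus'))
        for dets in frame_detections
    ]
    # Summing obstacle confidences instead of counting boxes dampens the
    # cumulative effect of spurious low-confidence detections.
    obstacle_conf_sum = sum(
        d['confidence']
        for dets in frame_detections
        for d in dets if d['name'] == 'obstacle'
    )
    return {
        'veh_mean': float(np.mean(vehicle_counts)),
        'veh_var': float(np.var(vehicle_counts)),
        'obstacle_conf_sum': obstacle_conf_sum,
    }
```

Vectors like these can be concatenated with the frame embeddings to form the tabular input for the downstream classifier.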
Final Submission
LightGBM with five‑fold cross‑validation was used for the final predictions. For each test image, five embedding vectors (one per CV fold) were generated, and their predictions were averaged (a TTA‑like strategy). Averaging the A‑list and B‑list predictions, weighted by their respective weighted‑F1 scores, yielded the best online result.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.