GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection
GEN‑VLKT introduces a Guided‑Embedding Network with position‑ and instance‑guided embeddings to remove costly post‑processing, and leverages CLIP‑based visual‑linguistic knowledge transfer for interaction understanding. It achieves state‑of‑the‑art HOI detection performance with zero‑shot capability, and has been deployed in Alibaba's Taobao services.
The paper "GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection" was accepted at CVPR 2022, a premier computer‑vision conference with a 25.33% acceptance rate.
Motivation: Human‑Object Interaction (HOI) detection faces two core challenges—human‑object association and interaction understanding.
Association solution: The Guided‑Embedding Network (GEN) introduces position‑guided embedding (p‑GE) and instance‑guided embedding (i‑GE) to create a decoupled two‑branch structure that eliminates costly post‑processing.
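The pairing idea behind GEN can be illustrated with a toy sketch (hypothetical names and shapes; the real model uses transformer decoder queries and attention, not plain vectors). The key point is that the human query and object query at slot i share the same position‑guided embedding, so the HOI triplet at slot i is read off directly by index, with no matching post‑processing:

```python
# Toy sketch of GEN's guided embeddings -- illustrative only.
import random

random.seed(0)
DIM, NUM_PAIRS = 4, 3

def rand_vec(dim):
    return [random.random() for _ in range(dim)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

# One position-guided embedding (p-GE) per output slot, shared by the
# human and object queries of that slot. This index sharing is what
# makes human-object association implicit.
p_ge = [rand_vec(DIM) for _ in range(NUM_PAIRS)]
human_queries = [add(rand_vec(DIM), p_ge[i]) for i in range(NUM_PAIRS)]
object_queries = [add(rand_vec(DIM), p_ge[i]) for i in range(NUM_PAIRS)]

# Interaction queries are instance-guided (i-GE): generated from the
# instance branch's outputs rather than learned independently.
interaction_queries = [add(h, o) for h, o in zip(human_queries, object_queries)]

# A triplet is read off per slot: (human i, object i, interaction i).
triplets = list(zip(human_queries, object_queries, interaction_queries))
print(len(triplets))  # one HOI triplet per slot, paired by index
```

The decoupled two‑branch structure means the instance branch and interaction branch can specialize, while the shared slot index keeps their outputs aligned for free.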
Interaction‑understanding solution: Visual‑Linguistic Knowledge Transfer (VLKT) leverages the large‑scale pre‑trained CLIP model. The CLIP text encoder initializes interaction classifiers via prompt templates, while the CLIP visual encoder provides knowledge‑distillation supervision for the decoder.
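The classifier‑initialization idea can be sketched as follows. A deterministic hash‑based embedding stands in for the real CLIP text encoder, and the prompt template is illustrative; only the mechanism (text embeddings as classifier weights, cosine‑style scoring) reflects the paper:

```python
import hashlib
import math

def fake_text_encoder(text, dim=8):
    """Stand-in for the CLIP text encoder: deterministic pseudo-embedding."""
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # L2-normalized, like CLIP features

# Prompt templates turn HOI labels into sentences for the text encoder.
hoi_labels = [("hold", "cup"), ("kick", "ball"), ("carry", "bag")]
prompts = [f"a photo of a person {verb}ing a {obj}" for verb, obj in hoi_labels]

# The text embeddings become the interaction classifier's weights, so
# scoring is a dot product between visual features and language priors.
classifier_weights = [fake_text_encoder(p) for p in prompts]

def classify(visual_feature):
    scores = [sum(w * x for w, x in zip(row, visual_feature))
              for row in classifier_weights]
    return max(range(len(scores)), key=scores.__getitem__)

# An unseen verb-object combination only needs a new prompt -- this is
# the basis of the zero-shot capability.
pred = classify(fake_text_encoder(prompts[1]))
print(pred)  # a feature identical to row 1 scores highest on row 1
```

In the paper the interaction classifier is additionally fine‑tuned after this initialization, and the CLIP visual encoder supervises the decoder features via distillation.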
Training employs a set‑matching loss that jointly aligns entity and relation decoders, combined with CLIP‑based distillation loss and standard detection losses (bbox regression, IoU, classification).
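The overall objective can be sketched as a weighted sum of those terms. The weights and helper names below are illustrative placeholders, not the paper's hyper‑parameters:

```python
def l1_distill(decoder_feats, clip_feats):
    """Mimic loss: mean L1 gap between decoder features and CLIP visual
    features (a sketch of the knowledge-distillation supervision)."""
    n = sum(len(f) for f in decoder_feats)
    return sum(abs(a - b) for f, g in zip(decoder_feats, clip_feats)
               for a, b in zip(f, g)) / n

def total_loss(cls_loss, bbox_l1, giou_loss, distill_l1,
               w_cls=1.0, w_bbox=2.5, w_giou=1.0, w_mimic=0.5):
    # Weighted sum over the set-matched predictions; the weight values
    # here are placeholders, not the paper's settings.
    return (w_cls * cls_loss + w_bbox * bbox_l1
            + w_giou * giou_loss + w_mimic * distill_l1)

d = l1_distill([[0.2, 0.4], [0.6, 0.8]], [[0.0, 0.5], [0.5, 1.0]])
loss = total_loss(cls_loss=0.3, bbox_l1=0.1, giou_loss=0.2, distill_l1=d)
print(round(d, 3), round(loss, 3))
```

The set‑matching step (not shown) assigns each ground‑truth triplet to one prediction slot before these losses are computed, which is what lets the detector train end‑to‑end without NMS‑style post‑processing.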
Experiments: On HICO‑DET, GEN‑VLKT achieves 34.95 mAP, surpassing previous SOTA on both the regular and zero‑shot tasks. On V‑COCO, it reaches 63.91% and 65.89% role‑mAP under Scenario 1 and Scenario 2, respectively. Ablation studies confirm the contributions of the p‑GE, i‑GE, and VLKT components.
Conclusion: GEN‑VLKT provides an end‑to‑end HOI detector with superior performance and strong zero‑shot capability, and has been deployed in Alibaba’s Taobao content‑understanding services.
DaTaobao Tech, the official account of DaTaobao Technology