
HiCo: Hierarchical Controllable Diffusion Model for Layout-to-Image Generation

The paper introduces HiCo, a hierarchical controllable diffusion model that enables precise layout‑to‑image generation by decoupling object and background features through weight‑shared branches and a fusion module, achieving high‑quality results and efficient inference as demonstrated on the HiCo‑7K benchmark.

360 Tech Engineering

Current text‑to‑image models offer only coarse control over image content, which hampers their use as professional production tools. To address this, 360 AI Research presented HiCo (Hierarchical Controllable Diffusion Model), a layout‑controllable generation model, at NeurIPS 2024. The model will be open‑sourced.

Abstract: Layout‑to‑image generation is a key AIGC task that synthesizes images from object descriptions and spatial positions. Existing methods struggle with complex layouts, causing object loss, lighting inconsistency, and overlapping issues. HiCo introduces a hierarchical diffusion architecture with object‑separating conditional branches to model spatial separation.

Motivation: Prior layout‑controllable approaches rely on new network designs or cross‑attention mechanisms, which suffer from target loss, degraded instruction following, image distortion, high inference cost, and limited ecosystem compatibility. HiCo seeks to retain the capabilities of base diffusion models while adding precise layout control via external conditions.

Method – Overall Architecture: HiCo consists of a backbone Stable Diffusion (SD) model, weight‑shared side‑networks (HiCo Net) for each object, and a fusion module (FuseNet) that integrates background and foreground features. The architecture is illustrated in Figure 1.
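As a rough illustration of what "one branch per object" implies for the inputs, the sketch below turns a layout into one condition per branch plus a global background condition. It is a minimal sketch, not the released implementation: the names `LayoutObject`, `box_to_mask`, and `build_branch_conditions`, the normalized box format, and the dictionary layout are all hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class LayoutObject:
    caption: str   # text description of the object
    box: tuple     # (x0, y0, x1, y1), normalized to [0, 1] (assumed format)


def box_to_mask(box, height, width):
    """Rasterize a normalized bounding box into a binary mask."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    r0, r1 = int(round(y0 * height)), int(round(y1 * height))
    c0, c1 = int(round(x0 * width)), int(round(x1 * width))
    mask[r0:r1, c0:c1] = 1.0
    return mask


def build_branch_conditions(global_caption, objects, height=64, width=64):
    """Return one condition per branch: background first, then each object."""
    conditions = [{
        "role": "background",
        "caption": global_caption,
        "mask": np.ones((height, width), dtype=np.float32),
    }]
    for obj in objects:
        conditions.append({
            "role": "object",
            "caption": obj.caption,
            "mask": box_to_mask(obj.box, height, width),
        })
    return conditions
```

In this sketch every condition would be passed through the same weight‑shared side‑network, so adding an object adds a condition but no new parameters.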

Hierarchical Modeling & Fusion: The model decouples the spatial layout of each object into its own branch and merges the resulting object features with background information using a mask‑based fusion strategy. The training objective extends the standard diffusion loss with additional terms for each conditional branch, enabling simultaneous optimization of multiple layout constraints.
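One way to picture a mask‑based fusion step is sketched below with NumPy. This is an assumption‑laden illustration rather than FuseNet itself: the function name `fuse_with_masks` is hypothetical, the masks are assumed to be binary box masks, and overlapping objects are simply averaged, which the source does not specify.

```python
import numpy as np


def fuse_with_masks(background, object_feats, object_masks):
    """Blend per-object features into background features using masks.

    background:   (C, H, W) features from the global/background branch
    object_feats: (N, C, H, W) features from the N object branches
    object_masks: (N, H, W) binary masks, 1 inside each object's box
    """
    masks = object_masks[:, None, :, :]               # (N, 1, H, W)
    coverage = masks.sum(axis=0)                      # (1, H, W): objects per pixel
    weighted = (object_feats * masks).sum(axis=0)     # (C, H, W)
    # Average where several objects overlap (assumption, not from the source).
    foreground = weighted / np.maximum(coverage, 1.0)
    covered = (coverage > 0).astype(background.dtype)
    # Object features inside any box, background features everywhere else.
    return covered * foreground + (1.0 - covered) * background
```

With this formulation, pixels outside every box are taken entirely from the background branch, so an object's branch cannot leak features beyond its own region.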

Feature Visualization: Weight sharing allows HiCo to generate independent features for foreground instances and background, which are strategically integrated during up‑sampling. Figure 2 visualizes the hierarchical feature flow for four example layouts.

Training Data & Strategy: HiCo is trained on fine‑grained grounding data (GRIT‑20M, filtered to 1.2 M samples) and on coarse‑grained COCO category data, and it is evaluated on a dedicated set, HiCo‑7K. The framework supports various diffusion backbones (SD 1.5, SDXL, SD 3, Flux) and plugins such as LoRA, LCM, and SDXL‑Lightning.

Experiments – Quantitative & Qualitative Evaluation: On HiCo‑7K, the model achieves superior image quality and layout fidelity across varying object counts. Human studies confirm HiCo’s advantage in spatial accuracy and semantic alignment, matching or surpassing state‑of‑the‑art baselines like RealisticVisionV5‑1.

Ablation Studies: Systematic ablations of the hierarchical structure and fusion strategies demonstrate their contributions to performance, as shown in the accompanying tables.

Inference Efficiency: HiCo offers two inference modes, parallel and serial. Benchmarks on a 24 GB RTX 3090 at 512×512 resolution show lower latency and memory consumption than competing methods, and performance scales as the object count grows.
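The source does not spell out how the two modes differ. A plausible reading, sketched below, is that parallel mode batches all branches through the shared network in one call, while serial mode runs them one at a time to cap peak memory. The toy `shared_branch` (a single matrix multiply followed by `tanh`) is a hypothetical stand‑in for the real side‑network, used only to show that the two schedules produce the same features.

```python
import numpy as np


def shared_branch(conditions, weights):
    """Toy stand-in for the weight-shared side-network."""
    return np.tanh(conditions @ weights)


def run_parallel(conditions, weights):
    """Process all N branch conditions in one batched call.

    Fewer calls, but activations for all N branches are held at once.
    """
    return shared_branch(conditions, weights)


def run_serial(conditions, weights):
    """Process branch conditions one at a time.

    More calls, but only one branch's activations are live at any moment.
    """
    outputs = [shared_branch(c[None], weights)[0] for c in conditions]
    return np.stack(outputs)
```

Because the weights are shared, the choice between the two schedules is a latency versus memory trade‑off in this sketch and does not change the result.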

Conclusion & Outlook: HiCo effectively decouples object position and appearance, handling complex interactions and occlusions via a global background branch and fusion network. While current occlusion ordering remains imperfect due to limited training data, the model sets a strong foundation for future work on editable content and multi‑style integration, enhancing the usability of AI‑generated artwork.
