How Transformers Revolutionize Image Style Transfer: Introducing StyTr²
This article reviews the limitations of traditional CNN‑based image stylization, explains how Transformer architectures overcome these issues with global context and self‑attention, and presents the novel StyTr² method with content‑aware positional encoding that achieves superior, detail‑preserving style transfer results.
Introduction
Image stylization, which renders a content image in the visual style of a reference image, has long been a popular research topic and is now widely deployed in short‑video applications such as Kuaishou’s main app, Quick App, Yitiao Camera, and Kuaiying. Traditional texture‑synthesis methods produce vivid results but are computationally heavy. CNN‑based neural style transfer simplifies the process by adjusting second‑order statistics of feature maps, yet convolutions capture mostly local statistics and often misrepresent the relationship between content and style, producing distorted structures or incomplete style patterns. Recent works therefore incorporate self‑attention mechanisms to improve stylization quality.
Background
Transformers, successful in natural language processing, have been adopted for vision tasks because self‑attention enables each layer to capture global information and relational structures, preserving fine details that CNNs may lose during deep feature extraction.
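To make the global-context claim concrete, here is a minimal NumPy sketch of scaled dot-product self-attention (not the paper's implementation; the query/key/value projections are omitted for brevity, so queries, keys, and values all share the raw input):

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over n token features of size d.
    Each output row is a softmax-weighted mix of ALL input rows, so every
    position sees the whole sequence (global context) in a single layer."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ x, weights                     # globally mixed features

tokens = np.random.default_rng(0).normal(size=(5, 8))  # 5 toy patch tokens
out, attn = self_attention(tokens)
```

Contrast this with a convolution, whose receptive field at any single layer is a fixed local window: here every output token depends on every input token from the first layer onward.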
Method
The paper proposes StyTr², a Transformer‑based image stylization algorithm that addresses the content‑style bias of CNN methods. The architecture (Figure 2) consists of three components: a content Transformer encoder, a style Transformer encoder, and a Transformer decoder that fuses content features with style features to generate the stylized output.
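The fusion step can be sketched as cross-attention, with content tokens as queries and style tokens as keys/values. This is a schematic of the data flow only (projection matrices, multi-head splitting, and the CNN decoder that maps tokens back to pixels are all omitted; token counts and dimensions are made up for illustration):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse(content, style):
    """Decoder-style cross-attention: content tokens act as queries and
    style tokens as keys/values, so each content position pulls in the
    style features most relevant to it."""
    d = content.shape[-1]
    attn = softmax(content @ style.T / np.sqrt(d))  # (n_content, n_style)
    return attn @ style                             # (n_content, d) fused tokens

rng = np.random.default_rng(1)
content = rng.normal(size=(16, 32))  # toy tokens from the content encoder
style = rng.normal(size=(20, 32))    # toy tokens from the style encoder
fused = fuse(content, style)
```

Note that the output keeps the content's token grid (16 tokens here) while its values are drawn from the style tokens, which is the intuition behind "content structure preserved, style patterns transferred."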
To better suit visual tasks, the authors introduce Content‑Aware Positional Encoding (CAPE), which (1) derives positional information from the content image’s semantics rather than from fixed position indices, and (2) remains stable across input resolutions. This overcomes two limitations of traditional sinusoidal encodings: they ignore image content entirely, and they change when the input is resized.
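The scale-invariance idea can be illustrated as follows: pool the content feature map to a fixed-size grid, encode that grid, then upsample the encoding back to the input resolution. Because the intermediate grid size is fixed, the encoding depends on content semantics and tracks the input scale. This is a hedged sketch of the mechanism, not the paper's CAPE: the grid size, the stand-in random projection (the paper uses a learnable function), and nearest-neighbour upsampling are all simplifying assumptions.

```python
import numpy as np

def cape_sketch(feat, pool=4):
    """Content-aware positional encoding, illustrative only.
    feat: (h, w, c) content feature map."""
    h, w, c = feat.shape
    # Average-pool the feature map to a fixed pool x pool grid.
    ys = np.linspace(0, h, pool + 1).astype(int)
    xs = np.linspace(0, w, pool + 1).astype(int)
    pooled = np.stack([[feat[ys[i]:ys[i+1], xs[j]:xs[j+1]].mean(axis=(0, 1))
                        for j in range(pool)] for i in range(pool)])
    # A fixed random matrix stands in for the learnable encoding function.
    proj = np.random.default_rng(0).normal(size=(c, c)) / np.sqrt(c)
    enc = pooled @ proj                               # (pool, pool, c)
    # Nearest-neighbour upsample back to the input resolution.
    rows = np.minimum(np.arange(h) * pool // h, pool - 1)
    cols = np.minimum(np.arange(w) * pool // w, pool - 1)
    return enc[rows][:, cols]                         # (h, w, c)

rng = np.random.default_rng(2)
small = cape_sketch(rng.normal(size=(8, 8, 6)))    # encoding at 8x8
large = cape_sketch(rng.normal(size=(16, 16, 6)))  # encoding at 16x16
```

The same pool × pool grid underlies both outputs, so the encoding's spatial layout is consistent across the two resolutions, unlike a sinusoidal encoding indexed by absolute pixel position.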
Results
Qualitative comparisons (Figures 4 and 5) show that StyTr² outperforms state‑of‑the‑art CNN‑based methods, preserving content structure while delivering richer style patterns even after multiple stylization rounds. The Transformer‑based model captures long‑range dependencies, avoids detail loss, and produces high‑quality stylizations.
Conclusion
StyTr² demonstrates that Transformer architectures can advance the frontier of image stylization and broaden the application of Transformers in visual generation tasks.
Paper: https://arxiv.org/abs/2105.14576
Code: https://github.com/diyiiyiii/StyTR-2
Kuaishou Large Model (official Kuaishou account)