Artificial Intelligence · 7 min read

StyTr²: A Transformer‑Based Approach for Image Style Transfer

The paper proposes StyTr², a Transformer‑based image style transfer method that uses content‑aware positional encoding to preserve details and strengthen feature representation, achieving high‑quality stylization that better preserves content structure while rendering rich style patterns.


Image style transfer renders a content image in the style of a reference image. It is a practical and widely studied task in academia and is broadly deployed in industry, for example in short‑video applications.

Traditional texture‑synthesis methods can produce vivid results but are computationally expensive because they explicitly model brush strokes and the painting process. Later CNN‑based neural style transfer optimizes an encoder‑decoder pipeline, but the limited receptive field of convolutions forces deep networks that lose fine‑grained details, and the learned content representation is biased: the original structure progressively degrades when stylization is applied repeatedly.

Transformer architectures overcome these limitations: self‑attention enables each layer to capture global information, and the relational modeling preserves structural details, yielding stronger feature representation without the detail loss seen in CNNs.
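To see why self‑attention gives a global receptive field, here is a minimal numpy sketch of scaled dot‑product attention over a sequence of patch features. The query/key/value projections and multi‑head machinery are omitted for clarity; this is an illustration of the mechanism, not the authors' implementation.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over patch features.

    x: (n_patches, d) array. Every output row is a weighted sum of
    ALL input rows, so even a single layer sees the whole image --
    unlike a convolution, whose receptive field is local.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                  # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over all patches
    return weights @ x                             # global mixing in one layer

patches = np.random.default_rng(0).normal(size=(16, 8))  # 16 patches, dim 8
out = self_attention(patches)
print(out.shape)  # (16, 8)
```

Each of the 16 output vectors depends on all 16 input patches, which is the property the article credits for preserving long‑range structure.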

Building on this, the authors propose StyTr², a novel image style transfer algorithm that mitigates the content‑expression bias of CNN‑based methods. The model consists of a content Transformer encoder, a style Transformer encoder, and a Transformer decoder.

The content and style encoders respectively encode long‑range dependencies of the content and style images, effectively avoiding detail loss, while the decoder transforms the content features into a stylized output infused with style characteristics.
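The decoder step can be sketched as cross‑attention: queries come from the encoded content sequence, keys and values from the encoded style sequence, so the output keeps the content layout while being filled with style information. This is a hedged, minimal numpy sketch of the idea (projections, multiple heads, and the CNN upsampling decoder are omitted; all names are illustrative).

```python
import numpy as np

def cross_attention(content, style):
    """Cross-attention: content patches query the style patches.
    The result has one vector per content position (layout preserved),
    but each vector is a mixture of style features."""
    d = content.shape[-1]
    scores = content @ style.T / np.sqrt(d)        # (n_content, n_style)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ style                               # (n_content, d)

rng = np.random.default_rng(1)
content_feats = rng.normal(size=(16, 8))           # 16 content patches
style_feats = rng.normal(size=(32, 8))             # 32 style patches
stylized = cross_attention(content_feats, style_feats)
print(stylized.shape)  # (16, 8): content positions, style-derived values
```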

To further suit visual tasks, the paper introduces Content‑Aware Positional Encoding (CAPE), which is scale‑invariant and semantically aware, addressing the mismatch between traditional sinusoidal positional encoding (designed for sequential sentences) and the spatial semantics of image patches.
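The gist of CAPE can be sketched as: pool the content features to a fixed small grid, apply a learned projection, and broadcast the result back to the feature resolution. Because the grid size is fixed, the encoding tracks image semantics rather than absolute pixel positions, which is what makes it scale‑invariant. The sketch below uses random placeholder weights and a simplified pooling/upsampling scheme; consult the paper for the exact formulation.

```python
import numpy as np

def content_aware_pe(feats, grid=4):
    """Sketch of content-aware positional encoding.

    feats: (h, w, c) content feature map (h, w divisible by grid).
    Pool to a fixed grid x grid summary, project it with a placeholder
    1x1-conv-style weight matrix, then upsample back to (h, w, c)."""
    h, w, c = feats.shape
    hs, ws = h // grid, w // grid
    pooled = np.zeros((grid, grid, c))
    for i in range(grid):
        for j in range(grid):
            pooled[i, j] = feats[i*hs:(i+1)*hs, j*ws:(j+1)*ws].mean(axis=(0, 1))
    W = np.random.default_rng(0).normal(size=(c, c)) / np.sqrt(c)  # placeholder
    projected = pooled @ W                          # learned 1x1 projection
    return projected.repeat(hs, axis=0).repeat(ws, axis=1)  # upsample

small = np.ones((16, 16, 8))
large = np.ones((32, 32, 8))
print(content_aware_pe(small).shape, content_aware_pe(large).shape)
```

Feeding the same (constant) content at two resolutions yields the same per‑region encoding, illustrating the scale invariance a sinusoidal encoding tied to absolute positions would not have.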

Experimental comparisons show that StyTr² achieves higher‑quality stylization than state‑of‑the‑art approaches, preserving both content structure and rich style patterns, and remains stable across multiple stylization iterations.

The work advances the frontier of image stylization and demonstrates the effectiveness of Transformers in vision tasks; the paper is available at https://arxiv.org/abs/2105.14576 and the implementation at https://github.com/diyiiyiii/StyTR-2.

Tags: computer vision, deep learning, transformer, image style transfer, content-aware positional encoding
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
