
Overview of Document Intelligence Models: StrucText, LayoutLMv3, and GraphDoc

This article reviews three representative document intelligence models—StrucText, LayoutLMv3, and GraphDoc—detailing their input features, feature fusion strategies, self‑supervised tasks, and underlying architectures, and explaining how they learn embeddings for Segments, Words, or Regions to support classification and key‑value extraction.

Laiye Technology Team

1. Introduction

Document images contain multiple textual entries (Segments), words, or regions. The core challenges for document intelligence are (1) predicting the category of each Segment/Word/Region and (2) predicting the key‑value pairing relationship between them.

2. Problem Decomposition

1. Learn high‑quality embeddings for Segments (Words, Regions).

2. Use the learned embeddings for classification to predict each element's category.

3. Compute similarity between embeddings to predict pairing relationships, on the assumption that paired Segments yield highly similar embeddings.
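A minimal sketch of this similarity‑based pairing, assuming cosine similarity and a hypothetical score threshold; the actual models learn a pairing head end‑to‑end, so the function names and threshold here are illustrative only:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predict_pairs(key_embs, value_embs, threshold=0.5):
    # For each key Segment, pick the value Segment with the highest
    # similarity, keeping only matches above the threshold.
    pairs = {}
    for ki, k in enumerate(key_embs):
        scores = [cosine(k, v) for v in value_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs[ki] = best
    return pairs
```

With well‑trained embeddings, a key such as "Invoice No." and its value land close together in embedding space, so the argmax recovers the pairing.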

3. Model Overview

This article examines three representative papers—StrucText, LayoutLMv3, and GraphDoc—and briefly introduces their core techniques. All three employ self‑supervised learning to pre‑train embeddings for Segments, Words, or Regions, followed by fine‑tuning on domain‑specific data for classification or key‑value prediction.

4. StrucText

Input Features

StrucText combines text, image, segment‑index, character‑length, and modality features into a single sequence. Text is obtained via OCR, providing both the string and the bounding box coordinates (x₀, y₀, x₁, y₁) for each Segment, from which width (w) and height (h) are derived.

Feature Encodings

Segment‑index (S) encodes the order of Segments sorted by their top‑left coordinates. Layout encoding (L) embeds the coordinates of each Segment. Image features (V) are extracted using a ResNet‑50+FPN backbone applied to the image region corresponding to each Segment.
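As an illustration, a plain‑Python sketch of the layout features derived from an OCR bounding box and the element‑wise summation that combines the per‑token embeddings; the function names and the [0, 1] normalization scheme are assumptions, not the paper's exact implementation:

```python
def layout_features(box, page_w, page_h):
    # box = (x0, y0, x1, y1) from OCR; normalize to [0, 1] and append
    # the derived width and height used by the layout encoding L.
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return [x0 / page_w, y0 / page_h, x1 / page_w, y1 / page_h,
            w / page_w, h / page_h]

def fuse_token(text_emb, layout_emb, segment_emb, modality_emb):
    # StrucText-style input: the per-token feature embeddings are
    # summed element-wise (all assumed to share one dimensionality).
    return [t + l + s + m for t, l, s, m in
            zip(text_emb, layout_emb, segment_emb, modality_emb)]
```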

Feature Fusion

All features are concatenated into a sequence and processed by multiple Transformer layers with multi‑head self‑attention, producing a learned embedding for each token.

Self‑Supervised Tasks

MLM (masked language modeling) – predict masked words.

SLP (segment length prediction) – predict the number of words in a Segment.

PBD (paired‑box direction) – classify the relative direction from one Segment to another into one of eight categories.
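The PBD target can be built geometrically. A sketch that quantizes the angle between two Segment centres into eight 45° sectors; the sector layout and direction names are assumptions, not the paper's exact labeling:

```python
import math

DIRECTIONS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def relative_direction(box_a, box_b):
    # Quantize the angle from Segment A's centre to Segment B's centre
    # into one of eight 45-degree sectors. Image y grows downward, so
    # we flip the y difference to keep "N" meaning "above".
    (xa, ya), (xb, yb) = center(box_a), center(box_b)
    angle = math.degrees(math.atan2(ya - yb, xb - xa)) % 360
    return DIRECTIONS[int(((angle + 22.5) % 360) // 45)]
```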

5. LayoutLMv3

Input Features

Text features – RoBERTa embeddings for OCR‑extracted words.

1D Layout – shared positional embeddings for both text and image tokens.

2D Layout – embeddings of the bounding box (x, y, w, h) shared by all words in a Segment.

Image features – Vision Transformer (ViT) patches extracted from the document image.

Text and image sequences are each augmented with their respective layout embeddings and then concatenated into a single sequence.

Feature Fusion

Multi‑head self‑attention computes correlations between all tokens, with additional relative position parameters (1D, 2D‑x, 2D‑y) incorporated into the attention scores.
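A toy sketch of scaled dot‑product attention scores with additive relative‑position biases, in the spirit of LayoutLMv3's 1D and 2D (x, y) relative attention; real models use learned, bucketed bias tables shared across layers and many heads, so the explicit bias matrices here are a simplification:

```python
import math

def attention_scores(q, k, rel_1d_bias, rel_x_bias, rel_y_bias):
    # q, k: lists of query/key vectors; biases: [n][n] matrices added
    # to the scaled dot-product score for each (query, key) pair.
    d = len(q[0])
    scores = []
    for i in range(len(q)):
        row = []
        for j in range(len(k)):
            dot = sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            row.append(dot + rel_1d_bias[i][j]
                       + rel_x_bias[i][j] + rel_y_bias[i][j])
        scores.append(row)
    return scores
```

The biases let the model prefer tokens at particular reading‑order and spatial offsets regardless of content.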

Self‑Supervised Tasks

MLM – as in StrucText, but masking contiguous spans whose lengths follow a Poisson distribution.

MIM (masked image modeling) – mask image patches and predict their discrete token IDs, in the style of BEiT.

WPA (word‑patch alignment) – for each unmasked word, predict whether its corresponding image patch is masked; masked words are excluded from this objective.
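A sketch of how WPA targets might be constructed, assuming a precomputed word‑to‑patch index mapping; the names and label strings are hypothetical:

```python
def wpa_labels(word_patch_idx, masked_patches, masked_words):
    # For each *unmasked* word, the WPA target records whether the
    # image patch covering that word was masked ("unaligned") or
    # left intact ("aligned").
    labels = {}
    for wi, pi in word_patch_idx.items():
        if wi in masked_words:
            continue  # masked words are excluded from the objective
        labels[wi] = "unaligned" if pi in masked_patches else "aligned"
    return labels
```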

6. GraphDoc

Input Features

Text – Sentence‑BERT embeddings of each Region’s text, combined with layout encoding.

Image – a Swin‑Transformer + FPN backbone produces a feature map, and RoIAlign pools region‑specific visual features from the P2 level.

Feature Fusion

Fusion occurs at two levels: (1) intra‑Region fusion using an attention gate to combine text and visual features, and (2) inter‑Region fusion via a Graph Neural Network (GNN) that propagates information across Regions using a learned adjacency matrix enriched with 2D positional encodings.
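A minimal sketch of the intra‑Region attention gate, reduced to a single scalar gate over hypothetical learned weight vectors; the trained model learns these parameters jointly with the rest of the network:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_fuse(text_feat, visual_feat, w_text, w_vis, bias):
    # A gate g in (0, 1), computed from both modalities, blends the
    # text and visual features of one Region into a fused vector.
    g = sigmoid(sum(a * b for a, b in zip(w_text, text_feat))
                + sum(a * b for a, b in zip(w_vis, visual_feat)) + bias)
    return [g * t + (1 - g) * v for t, v in zip(text_feat, visual_feat)]
```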

Self‑Supervised Task

Randomly mask a Region’s text with a special token, forward it through the GNN, and compute a Smooth‑L1 loss between the GNN‑produced Region representation and the original Sentence‑BERT embedding.
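The Smooth‑L1 objective itself is standard; a per‑dimension sketch of the loss between the GNN output for a masked Region and its original Sentence‑BERT embedding:

```python
def smooth_l1(pred, target, beta=1.0):
    # Smooth-L1 (Huber-style) loss: quadratic for small errors,
    # linear for large ones, averaged over embedding dimensions.
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)
```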

7. Summary

The three surveyed models demonstrate that effective multimodal feature fusion and capturing relationships among Segments/Words/Regions enable the learning of robust representations for downstream tasks such as classification and key‑value extraction in intelligent document processing.

These techniques form the backbone of Laiye’s document‑intelligence product, and future blog posts will detail internal adaptations of these architectures.

References

https://arxiv.org/abs/2108.02923

https://arxiv.org/abs/2204.08387

https://arxiv.org/abs/2203.13530

https://mp.weixin.qq.com/s/WrCDYuvHw-QPMzzRHdOHeA

https://arxiv.org/abs/1706.03762

https://arxiv.org/abs/2106.08254

https://arxiv.org/abs/1805.07445

https://github.com/ibm-aur-nlp/PubLayNet

https://zhuanlan.zhihu.com/p/73138740

https://arxiv.org/abs/2012.14740

https://arxiv.org/abs/2103.14470

https://baike.baidu.com/item/邻接矩阵/9796080

https://blog.csdn.net/luzaijiaoxia0618/article/details/104718146/

https://mage.laiye.com/

Tags: multimodal, self-supervised learning, graph neural networks, layout analysis, Document AI