Graph Convolutional Networks for Intelligent Document Processing: Principles, Feature Engineering, and Applications
This article presents a comprehensive overview of using graph convolutional networks in intelligent document processing, covering basic GCN theory, adjacency matrix construction, feature engineering—including text, image, and handcrafted features—model architecture, self-supervised training, and real-world applications such as semantic entity recognition and relation extraction.
1. Introduction
We previously introduced the core scenarios of intelligent document processing and representative deep learning models. Laiye Technology's IDP product achieved the highest level in the China Academy of Information and Communications Technology's Trusted AI evaluation.
To address Semantic Entity Recognition (SER) and Relation Extraction (RE) in documents, we evaluated various solutions and selected a graph convolutional model based on interpretability, feature injection, inference speed, and pre‑trained models.
2. Basic Principles of Graph Convolution
The basic idea is that a node's feature is aggregated from its neighbors; in the simplest case, neighbor features are summed. For a graph with n nodes and feature dimension d, we define the adjacency matrix A, the feature matrix X, and the degree matrix D (illustrated in the figures of the original article).
Mathematically, node features are updated iteratively as X^{(k+1)} = D^{-1/2} (A+I) D^{-1/2} X^{(k)} W^{(k)}, where the added identity I preserves each node's own features, D here denotes the degree matrix of A+I, and the symmetric normalization prevents feature magnitudes from exploding as layers stack.
After applying a non-linear activation σ (e.g., tanh), the full propagation rule becomes X^{(k+1)} = σ( D^{-1/2} (A+I) D^{-1/2} X^{(k)} W^{(k)} ).
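This propagation rule can be sketched in a few lines of framework-agnostic NumPy (the function name `gcn_layer` and the toy dimensions are illustrative, not taken from the production system):

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN step: sigma(D^-1/2 (A+I) D^-1/2 X W), with tanh as sigma."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                                    # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))   # D^-1/2 of A+I
    X_new = D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W          # normalized aggregation
    return np.tanh(X_new)                                    # non-linear activation
```

For a two-node graph with a single edge and identity features, each output entry becomes tanh(0.5), illustrating how the symmetric normalization keeps values bounded.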
3. Adjacency Matrix Construction
Three strategies are used:
1) Rule-based matrix: nodes correspond to text regions; edges are created between spatially overlapping regions, scanning from top-left to bottom-right, optionally weighted by inverse distance.
2) Learned positional matrix: features derived from bounding-box coordinates (x, y, w, h) are fed through dense layers to produce an adjacency matrix.
3) GAT-based matrix: attention scores computed from node features (including positional information) generate a lightweight, dynamic adjacency matrix that is re-normalized at each layer.
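A minimal sketch of the first, rule-based strategy (the projection-overlap test and the helper name `rule_based_adjacency` are illustrative assumptions about how spatial overlap and inverse-distance weighting might be realized):

```python
import numpy as np

def rule_based_adjacency(boxes, eps=1e-6):
    """Connect text regions whose x- or y-projections overlap,
    weighting each edge by the inverse distance between box centers.

    boxes: list of (x, y, w, h) tuples, top-left origin.
    Returns a symmetric [n, n] weighted adjacency matrix.
    """
    n = len(boxes)
    A = np.zeros((n, n))
    centers = [(x + w / 2, y + h / 2) for x, y, w, h in boxes]
    for i in range(n):
        for j in range(i + 1, n):
            xi, yi, wi, hi = boxes[i]
            xj, yj, wj, hj = boxes[j]
            x_overlap = xi < xj + wj and xj < xi + wi   # horizontal projections meet
            y_overlap = yi < yj + hj and yj < yi + hi   # vertical projections meet
            if x_overlap or y_overlap:
                dist = np.hypot(centers[i][0] - centers[j][0],
                                centers[i][1] - centers[j][1])
                A[i, j] = A[j, i] = 1.0 / (dist + eps)  # closer regions weigh more
    return A
```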
4. Node Features
We combine multiple modalities:
• Text features: RoBERTa embeddings (truncated to 48 tokens) with digits replaced by a special token.
• Image features: UNet extracts pixel‑level semantics for each text region.
• Hand‑crafted features: ratios of digits, letters, punctuation, Chinese characters, and flags for special entities (person, amount, email, date, URL).
• Index features: embedding of the node order after sorting by top‑left coordinate.
• Positional features: normalized coordinates and size relative to the whole page.
All modalities are fused via an attention‑weighted combination rather than simple concatenation.
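One way the attention-weighted fusion could look, as a sketch assuming all modalities are first projected to a shared dimension d (the names `fuse_modalities` and `w_att` are hypothetical; in practice the scoring vector would be learned):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_modalities(feats, w_att):
    """feats: [n_nodes, n_modalities, d] stacked modality embeddings.
    w_att: [d] scoring vector producing one attention weight per modality.
    Returns [n_nodes, d] fused node features."""
    scores = feats @ w_att                           # [n_nodes, n_modalities]
    alpha = softmax(scores)                          # attention over modalities
    return (alpha[..., None] * feats).sum(axis=1)    # weighted combination
```

With a zero scoring vector the weights degenerate to a uniform average, which makes the behavior easy to sanity-check.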
5. Application Scenarios
In Laiye’s IDP system the GCN is applied to:
• SER – extracting structured entities (e.g., name, gender) from IDs and licenses.
• RE – discovering key‑value pairs in custom forms by reconstructing a directed adjacency matrix and selecting edges with confidence > 0.5.
• Multi‑task settings – simultaneously handling SER and RE on receipts, identifying items, prices, quantities, and their relationships.
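The RE edge-selection step can be sketched as follows, assuming the model outputs an [n, n] matrix of directed edge confidences (`extract_kv_pairs` is an illustrative helper, not the production API):

```python
import numpy as np

def extract_kv_pairs(scores, threshold=0.5):
    """scores: [n, n] directed confidences, where scores[i, j] is the
    model's confidence that node i is the key for value node j.
    Returns (key_idx, value_idx) pairs above the confidence threshold."""
    keys, values = np.where(scores > threshold)
    return list(zip(keys.tolist(), values.tolist()))
```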
6. Self‑Supervised Learning
Due to limited labeled data, we adopt a pre-train-then-fine-tune paradigm using a graph contrastive method (NNCLR) that selects each sample's nearest neighbor in embedding space as its positive example, mitigating over-fitting compared with SimCLR or MoCo.
Node‑level pooling (max, average, and attention) is evaluated, with attention pooling yielding the best global representation.
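Attention pooling over node embeddings can be sketched as below (the scoring vector w would be learned in practice; this NumPy version is purely illustrative):

```python
import numpy as np

def attention_pool(H, w):
    """H: [n, d] node embeddings; w: [d] scoring vector.
    Returns one [d] graph-level vector as an attention-weighted sum."""
    s = H @ w                         # one score per node
    s = s - s.max()                   # numerical stability
    a = np.exp(s) / np.exp(s).sum()   # softmax over nodes
    return a @ H                      # weighted sum of node embeddings
```

A zero scoring vector reduces attention pooling to plain average pooling, which is a convenient sanity check.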
Self‑supervised pre‑training dramatically improves downstream F1 scores (e.g., from 18 % to 90 % on a 26‑image PO‑form SER task).
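The nearest-neighbor positive selection at the heart of NNCLR can be sketched as follows (cosine similarity against a support queue; `nn_positive` is a hypothetical helper, not the actual training code):

```python
import numpy as np

def nn_positive(z, queue):
    """For each embedding in z, pick its nearest neighbor (by cosine
    similarity) from a support queue as the contrastive positive.

    z: [b, d] batch embeddings; queue: [q, d] support set.
    Returns [b, d] positive examples drawn from the queue."""
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)          # unit-normalize batch
    q_n = queue / np.linalg.norm(queue, axis=1, keepdims=True)  # unit-normalize queue
    sim = z_n @ q_n.T                                           # [b, q] cosine similarities
    return queue[sim.argmax(axis=1)]                            # nearest neighbor per sample
```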
7. Additional Optimizations
To alleviate over‑smoothing and computational cost we employ:
1) Highway connections that add residual features across layers.
2) Sparse adjacency matrices retaining only the top 30 % of values.
3) Drop‑edge regularization that randomly removes edges during forward passes.
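Items 2) and 3) can be sketched as below (NumPy; the helper names and the exact thresholding scheme are illustrative assumptions):

```python
import numpy as np

def sparsify_topk(A, keep_ratio=0.3):
    """Keep only the largest ~30% of adjacency values, zeroing the rest."""
    flat = A.flatten()
    k = max(1, int(len(flat) * keep_ratio))   # number of entries to keep
    thresh = np.sort(flat)[-k]                # k-th largest value
    return np.where(A >= thresh, A, 0.0)

def drop_edge(A, drop_prob=0.1, rng=None):
    """DropEdge: randomly zero out edges during each forward pass."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(A.shape) >= drop_prob   # Bernoulli keep-mask per edge
    return A * mask
```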
8. Reference Implementation
The following TensorFlow 2 layer implements a learnable adjacency matrix for graph learning:
import tensorflow as tf

class GraphAdjLearningLayer(tf.keras.layers.Layer):
    """Learns a soft adjacency matrix from pairwise node features.

    Expects inputs of shape [batch, n*n, feat_dim] — one feature row per
    ordered node pair — and returns a [batch, n, n] adjacency matrix.
    """

    def __init__(self, name="graph_learning", **kwargs):
        super().__init__(name=name, **kwargs)
        self.dense1 = tf.keras.layers.Dense(32, activation=tf.keras.layers.LeakyReLU(0.18))
        self.dense2 = tf.keras.layers.Dense(16, activation=tf.keras.layers.LeakyReLU(0.18))
        self.dense3 = tf.keras.layers.Dense(1, use_bias=False)
        self.act = tf.keras.layers.Activation("sigmoid")

    def call(self, inputs, training=False):
        x = tf.nn.l2_normalize(inputs, axis=-1)   # normalize each pair's features
        x = self.dense1(x)
        x = self.dense2(x)
        x = self.dense3(x)                        # one scalar score per pair
        x = self.act(x)                           # squash scores into (0, 1)
        # The pair axis has length n*n, so recover n from its square root.
        n_pairs = tf.shape(inputs)[1]
        node_num = tf.cast(tf.sqrt(tf.cast(n_pairs, tf.float32)), tf.int32)
        return tf.reshape(x, [-1, node_num, node_num])
Laiye Technology Team