Document Analytics & Anti‑Fraud Support Platform for Hong Kong Virtual Banking

This article describes the design and implementation of a Document Analytics & Anti‑Fraud Support platform for Hong Kong virtual banking, detailing its OCR/NLP‑driven pipeline, dynamic rule engine, multi‑template PDF processing, model training, and the resulting improvements in fraud detection and operational efficiency.

ZhongAn Tech Team

Oct 20, 2023

Background

ZhongAn Bank, Hong Kong's first virtual bank, needed a reliable way to verify the income documents (bank statements, payroll slips, tax receipts) that loan applicants submit, because many of them were suspected forgeries. Manual checks alone were insufficient, which prompted the development of an automated document analytics platform.

Introduction

The platform combines OCR, NLP, and rule‑based engines to extract and validate data from various bank statement templates, supporting five major Hong Kong banks and 22 document templates as of March 2023.

Core Function Description

Dynamic rule set to confirm the authenticity of PDF/IMG files.

Similarity calculation between homogeneous documents.

Feature‑based analysis of PDF/IMG attributes.

Large‑scale OCR/NLP processing to generate rules and models from desensitized PDFs.

Advanced OCR algorithms to extract text and tables from complex, multi‑font, multi‑language layouts.

NLP‑based text similarity to detect tampered or reused statements.

File lineage construction, template classification, and anti‑fraud rule triggering.

Human‑labeling platform for new bank templates and cold‑start experiments.

Manual verification of OCR/NLP and rule outputs.

Asynchronous file analysis scheduling for task compensation.

Iterative model training that improves accuracy and recall.

1. Platform Business Process

The workflow starts with large‑scale data ingestion, desensitization, and labeling via LabelStudio, followed by OCR/NLP extraction, data structuring (JSON/CSV), and template‑specific dictionary mapping. Extracted key‑value pairs are then used for downstream fraud analysis.
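The dictionary-mapping step can be illustrated with a minimal sketch. The template identifiers, raw labels, and canonical field names below are hypothetical and not taken from the platform; the point is only that each template carries its own label dictionary, so downstream fraud analysis sees the same keys regardless of the issuing bank.

```python
import json

# Hypothetical per-template dictionaries: raw OCR label -> canonical field name.
TEMPLATE_DICTIONARIES = {
    "bank_a_statement_v1": {
        "A/C No.": "account_number",
        "Statement Date": "statement_date",
        "Closing Balance": "closing_balance",
    },
    "bank_b_statement_v2": {
        "Account Number": "account_number",
        "Date of Statement": "statement_date",
        "Balance C/F": "closing_balance",
    },
}


def map_to_canonical(template_id, raw_pairs):
    """Rename raw OCR labels to canonical keys and drop labels the template does not define."""
    mapping = TEMPLATE_DICTIONARIES[template_id]
    return {
        mapping[label]: value.strip()
        for label, value in raw_pairs.items()
        if label in mapping
    }


# Raw key-value pairs as they might come out of OCR for one page.
raw = {"A/C No.": " 123-456789-001 ", "Statement Date": "31 Mar 2023", "Page": "1 of 3"}
record = map_to_canonical("bank_a_statement_v1", raw)
print(json.dumps(record, indent=2))
```

Keeping the mapping as data rather than code means a new bank template can be onboarded by adding a dictionary entry, without touching the extraction logic.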

1.1 Model Training

A semi‑supervised approach separates clean and noisy samples, applying label correction to mitigate bias and improve model robustness.
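One common way to realize this kind of split, shown here only as a sketch, is to compare each human label with the current model's predicted probabilities. The thresholds, class names, and function name are assumptions for illustration; the article does not state which criterion the platform uses.

```python
def split_and_correct(samples, clean_threshold=0.5, correct_threshold=0.9):
    """Partition labeled samples using the current model's predictions.

    samples: list of (label, probs), where probs maps class name -> probability.
    Returns three lists:
      clean      (index, label)     the model agrees with the given label
      corrected  (index, new_label) the model confidently disagrees, label replaced
      unlabeled  index              too uncertain, treated as unlabeled data
    """
    clean, corrected, unlabeled = [], [], []
    for i, (label, probs) in enumerate(samples):
        predicted = max(probs, key=probs.get)
        if probs.get(label, 0.0) >= clean_threshold:
            clean.append((i, label))
        elif probs[predicted] >= correct_threshold:
            corrected.append((i, predicted))
        else:
            unlabeled.append(i)
    return clean, corrected, unlabeled


# Hypothetical model outputs for three labeled documents.
samples = [
    ("genuine", {"genuine": 0.95, "forged": 0.05}),
    ("genuine", {"genuine": 0.03, "forged": 0.97}),
    ("forged", {"genuine": 0.60, "forged": 0.40}),
]
clean, corrected, unlabeled = split_and_correct(samples)
print(clean, corrected, unlabeled)
```

The clean and corrected sets can then be used as supervised data, while the uncertain remainder feeds the unsupervised part of a semi-supervised objective.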

1.2 Data Extraction

API‑driven content extraction based on pre‑built document models.

Plain text extraction.

Table extraction.

Conversion of unstructured data to structured formats.

Custom data dictionaries per bank template.

Global and local layout analysis to capture meaningful phrases.

Boundary‑based extraction to exclude irrelevant data.

Regex‑based smart regions for higher accuracy.

Linking statements to orders for fraud analysis.
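Boundary-based extraction and regex regions from the list above can be combined in a few lines. The sketch below assumes a hypothetical template whose transaction area sits between an "Opening Balance" line and a "Closing Balance" line, and whose rows follow a date, description, amount layout; real templates would each need their own markers and pattern.

```python
import re
from decimal import Decimal

# Hypothetical boundary markers and row pattern for a single template.
START_MARKER = "Opening Balance"
END_MARKER = "Closing Balance"
ROW_PATTERN = re.compile(
    r"^(?P<date>\d{2} [A-Z][a-z]{2} \d{4})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?[\d,]+\.\d{2})$"
)


def extract_transactions(lines):
    """Return transaction rows found strictly between the two boundary markers."""
    inside = False
    rows = []
    for line in lines:
        text = line.strip()
        if text.startswith(START_MARKER):
            inside = True
            continue
        if text.startswith(END_MARKER):
            break  # everything after the closing boundary is ignored
        if not inside:
            continue
        match = ROW_PATTERN.match(text)
        if match:  # lines inside the region that are not rows are skipped
            rows.append({
                "date": match["date"],
                "description": match["description"],
                "amount": Decimal(match["amount"].replace(",", "")),
            })
    return rows


ocr_lines = [
    "Statement of Account   Page 1",
    "Opening Balance 10,000.00",
    "03 Mar 2023  SALARY ACME LTD  25,000.00",
    "05 Mar 2023  ATM WITHDRAWAL  -2,000.00",
    "Promotional message: apply now",
    "Closing Balance 33,000.00",
    "06 Mar 2023  FOOTER NOISE  1.00",
]
rows = extract_transactions(ocr_lines)
for row in rows:
    print(row)
```

Amounts are parsed as `Decimal` rather than `float` so that later balance checks compare exact values.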

2. Overall Technical Solution & Core Algorithms

2.1 Overall Architecture

The system integrates OCR, NLP, and rule engines, continuously training on desensitized PDFs to refine models and handle multi‑language, multi‑font documents.

2.2 Core Algorithms

2.2.1 Text Similarity

NLP techniques compute TF‑based cosine similarity between document vectors, supplemented by neural language models to capture semantic relationships.
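The term-frequency part of this comparison is straightforward to sketch with the standard library. The tokenizer below (lowercased alphanumeric runs) is a simplifying assumption, and the neural language model component is not shown.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical statement snippets: the second differs from the first only in the amount.
doc_1 = "03 Mar 2023 salary credit ACME LTD 25,000.00"
doc_2 = "03 Mar 2023 salary credit ACME LTD 45,000.00"
doc_3 = "utility bill payment reference number"

score_near = cosine_similarity(term_frequencies(doc_1), term_frequencies(doc_2))
score_far = cosine_similarity(term_frequencies(doc_1), term_frequencies(doc_3))
print(f"near-duplicate: {score_near:.3f}, unrelated: {score_far:.3f}")
```

A very high score between statements submitted by different applicants is the kind of signal that would point to a reused or lightly edited document.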

2.2.2 Layout‑Based Classification

SVTR, a single visual model for text recognition, decomposes a text image into small patches called character components and mixes them locally and globally to capture intra‑character and inter‑character patterns, recognizing characters without sequential modeling.

2.2.3 Multi‑Language & Multi‑Font Recognition

Attention‑based font mapping aligns source fonts to a fixed set, enabling accurate extraction across diverse scripts.

2.2.4 Complex Table Processing

A combination of local‑global pyramid mask alignment and graph neural networks reconstructs table structures, handling merged cells, missing grid lines, and multi‑row/column spans.

2.3 Dynamic Rule Expansion

Rules cover static template text, dynamic calculations (balances, totals), page‑level consistency, commonsense checks (e.g., payroll dates on non‑working days), and PDF attribute security policies.
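Two of these rule types, a dynamic balance calculation and a commonsense date check, can be sketched as plain functions held in a registry. The document structure, field names, and rule names are hypothetical, and the working-day rule assumes that Saturday and Sunday are non-working days and does not model public holidays.

```python
from datetime import date
from decimal import Decimal


def running_balance_ok(doc):
    """Dynamic calculation: opening balance plus all amounts must equal the closing balance."""
    total = sum((t["amount"] for t in doc["transactions"]), Decimal("0"))
    return doc["opening_balance"] + total == doc["closing_balance"]


def payroll_on_working_day(doc):
    """Commonsense check: payroll should not be credited on a Saturday or Sunday."""
    return all(
        t["date"].weekday() < 5  # Monday is 0, Saturday is 5
        for t in doc["transactions"]
        if t.get("is_payroll")
    )


# Registry of (rule name, check function); new rules are added here.
RULES = [
    ("running_balance", running_balance_ok),
    ("payroll_on_working_day", payroll_on_working_day),
]


def failed_rules(doc):
    """Return the names of all rules the document violates."""
    return [name for name, check in RULES if not check(doc)]


genuine = {
    "opening_balance": Decimal("10000.00"),
    "closing_balance": Decimal("33000.00"),
    "transactions": [
        {"date": date(2023, 3, 3), "amount": Decimal("25000.00"), "is_payroll": True},
        {"date": date(2023, 3, 6), "amount": Decimal("-2000.00")},
    ],
}
# Same statement with an inflated closing balance and the salary moved to a Sunday.
tampered = {
    "opening_balance": Decimal("10000.00"),
    "closing_balance": Decimal("43000.00"),
    "transactions": [
        {"date": date(2023, 3, 5), "amount": Decimal("25000.00"), "is_payroll": True},
        {"date": date(2023, 3, 6), "amount": Decimal("-2000.00")},
    ],
}
print(failed_rules(genuine))
print(failed_rules(tampered))
```

Because each rule is an independent function over the structured document, the set can grow per template without changing the evaluation loop.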

3. Results Demonstration

3.1 Transition from Manual to Automated Recognition

As model accuracy improves, each document can be tagged with quantifiable indicators, which allows semi‑automatic approval workflows in place of purely manual review.

3.2 Continuous Closed‑Loop Optimization

Ongoing desensitized data ingestion, human labeling, and similarity‑based training continuously refine detection capabilities.

4. Conclusion

ZhongAn Bank leverages AI and ML to enhance operational efficiency and fraud prevention, integrating 22 templates covering 80% of Hong Kong's effective statement formats, with plans to expand to additional banks and provide a standardized solution for the industry.
