Document Analytics & Anti‑Fraud Support Platform for Hong Kong Virtual Banking
This article describes the design and implementation of a Document Analytics & Anti‑Fraud Support platform for Hong Kong virtual banking, detailing its OCR/NLP‑driven pipeline, dynamic rule engine, multi‑template PDF processing, model training, and the resulting improvements in fraud detection and operational efficiency.
Background
ZhongAn Bank, Hong Kong's first virtual bank, needed a reliable way to verify income documents (bank statements, payroll slips, tax receipts) submitted by loan applicants, as many were suspected of being forged. Traditional manual checks were insufficient, prompting the development of an automated document analytics platform.
Introduction
The platform combines OCR, NLP, and rule‑based engines to extract and validate data from various bank statement templates, supporting five major Hong Kong banks and 22 document templates as of March 2023.
Core Function Description
Dynamic rule set to confirm the authenticity of PDF/IMG files.
Similarity calculation between homogeneous documents.
Feature‑based analysis of PDF/IMG attributes.
Large‑scale OCR/NLP processing to generate rules and models from desensitized PDFs.
Advanced OCR algorithms to extract text and tables from complex, multi‑font, multi‑language layouts.
NLP‑based text similarity to detect tampered or reused statements.
File lineage construction, template classification, and anti‑fraud rule triggering.
Human‑labeling platform for new bank templates and cold‑start experiments.
Manual verification of OCR/NLP and rule outputs.
Asynchronous file analysis scheduling for task compensation.
Iterative model training that improves accuracy and recall.
1. Platform Business Process
The workflow starts with large‑scale data ingestion, desensitization, and labeling via LabelStudio, followed by OCR/NLP extraction, data structuring (JSON/CSV), and template‑specific dictionary mapping. Extracted key‑value pairs are then used for downstream fraud analysis.
1.1 Model Training
A semi‑supervised approach separates clean and noisy samples, applying label correction to mitigate bias and improve model robustness.
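The article does not say how clean and noisy samples are told apart, so the following is only a minimal sketch of one common confidence-based scheme. The `Sample` type, the thresholds, and the class names are all assumptions made for illustration. A label the model agrees with is treated as clean, a label the model confidently contradicts is corrected, and everything else is set aside as noisy.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    doc_id: str
    label: str                # label assigned by annotators (may be wrong)
    probs: dict               # model's predicted probability per class


def split_and_correct(samples, clean_threshold=0.5, correct_threshold=0.9):
    """Partition samples into clean, label-corrected, and noisy groups.

    Both thresholds are illustrative values, not ones from the article.
    """
    clean, corrected, noisy = [], [], []
    for s in samples:
        predicted = max(s.probs, key=s.probs.get)
        if s.probs.get(s.label, 0.0) >= clean_threshold:
            # The model supports the given label: keep it as a clean sample.
            clean.append(s)
        elif s.probs[predicted] >= correct_threshold:
            # The model confidently disagrees: replace the label.
            corrected.append(Sample(s.doc_id, predicted, s.probs))
        else:
            # Neither side is convincing: hold back for human review.
            noisy.append(s)
    return clean, corrected, noisy
```

In a setup like this, the clean and corrected groups would feed the next training round, while the noisy group could go back to the labeling platform.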
1.2 Data Extraction
API‑driven content extraction based on pre‑built document models.
Plain text extraction.
Table extraction.
Conversion of unstructured data to structured formats.
Custom data dictionaries per bank template.
Global and local layout analysis to capture meaningful phrases.
Boundary‑based extraction to exclude irrelevant data.
Regex‑based smart regions for higher accuracy.
Linking statements to orders for fraud analysis.
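Several of the steps above (boundary-based extraction, regex-based regions, and per-template dictionaries) can be sketched together. This is an illustration under stated assumptions, not the platform's actual code: the boundary markers, the label-to-field dictionary, and the "Label: value" line layout are all hypothetical.

```python
import re
from decimal import Decimal

# Hypothetical per-template dictionary: printed label -> canonical field name.
TEMPLATE_DICTIONARY = {
    "Account Number": "account_number",
    "Statement Date": "statement_date",
    "Closing Balance": "closing_balance",
}

# Hypothetical boundary markers for the region that holds the fields.
START_MARKER = "ACCOUNT SUMMARY"
END_MARKER = "IMPORTANT NOTICE"


def clip_to_boundaries(text):
    """Keep only the text between the two markers, dropping irrelevant data."""
    start = text.find(START_MARKER)
    start = 0 if start == -1 else start + len(START_MARKER)
    end = text.find(END_MARKER, start)
    if end == -1:
        end = len(text)
    return text[start:end]


def extract_fields(text):
    """Turn OCR text into a structured dict using the template dictionary."""
    region = clip_to_boundaries(text)
    fields = {}
    for label, key in TEMPLATE_DICTIONARY.items():
        # Assumes each field is printed as "Label: value" on a single line.
        match = re.search(rf"{re.escape(label)}\s*:\s*(.+)", region)
        if match:
            fields[key] = match.group(1).strip()
    if "closing_balance" in fields:
        # Assumes a plain number with thousands separators, no currency code.
        fields["closing_balance"] = Decimal(
            fields["closing_balance"].replace(",", "")
        )
    return fields
```

Restricting the regex search to the clipped region is what keeps look-alike text elsewhere on the page, such as a sample figure in a footer notice, from being picked up.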
2. Overall Technical Solution & Core Algorithms
2.1 Overall Architecture
The system integrates OCR, NLP, and rule engines, continuously training on desensitized PDFs to refine models and handle multi‑language, multi‑font documents.
2.2 Core Algorithms
2.2.1 Text Similarity
NLP techniques compute TF‑based cosine similarity between document vectors, supplemented by neural language models to capture semantic relationships.
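As a rough illustration of the term-frequency part, the sketch below computes cosine similarity between two texts using only the Python standard library. The tokenizer is an assumption: it splits on word characters, which is a simplification, and Chinese text would need a proper word segmenter. The neural language model component is not shown.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase word tokens (simplified tokenizer, see lead-in)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)
```

A score close to 1.0 between statements from supposedly unrelated applicants is the kind of signal that could flag a reused or lightly edited document.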
2.2.2 Layout‑Based Classification
SVTR, a single visual model for text recognition, decomposes a text image into small patches called character components and applies local and global mixing blocks to capture patterns within and between characters, so it recognizes text without a separate sequential model.
2.2.3 Multi‑Language & Multi‑Font Recognition
Attention‑based font mapping aligns source fonts to a fixed set, enabling accurate extraction across diverse scripts.
2.2.4 Complex Table Processing
A combination of local‑global pyramid mask alignment and graph neural networks reconstructs table structures, handling merged cells, missing grid lines, and multi‑row/column spans.
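The detection models themselves are beyond a short example, but the step that follows them can be sketched: once cells and their spans are known, merged cells are expanded into a dense grid so that each row can be read as a record. The `Cell` type and its fields are assumptions made for illustration, not the platform's data model.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    row: int            # top row index of the cell
    col: int            # left column index of the cell
    text: str
    row_span: int = 1   # number of rows a merged cell covers
    col_span: int = 1   # number of columns a merged cell covers


def to_grid(cells):
    """Expand cells with spans into a dense row-by-column grid of strings."""
    if not cells:
        return []
    n_rows = max(c.row + c.row_span for c in cells)
    n_cols = max(c.col + c.col_span for c in cells)
    grid = [[None] * n_cols for _ in range(n_rows)]
    for c in cells:
        # Copy the text into every position covered by the span.
        for r in range(c.row, c.row + c.row_span):
            for k in range(c.col, c.col + c.col_span):
                grid[r][k] = c.text
    return grid
```

Positions that no detected cell covers stay `None`, which makes gaps in the reconstruction easy to spot during manual verification.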
2.3 Dynamic Rule Expansion
Rules cover static template text, dynamic calculations (balances, totals), page‑level consistency, commonsense checks (e.g., payroll dates on non‑working days), and PDF attribute security policies.
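Two of these rule types lend themselves to a short sketch: a dynamic calculation check on the running balance, and a commonsense check on payroll dates. The `Transaction` type, the "salary" keyword match, and the holiday set are hypothetical; real rules would depend on each bank template.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class Transaction:
    posted_on: date
    amount: Decimal          # positive = deposit, negative = withdrawal
    balance_after: Decimal   # balance printed on the statement row
    description: str = ""


def check_running_balance(opening_balance, transactions):
    """Flag rows whose printed balance does not follow from the previous one."""
    violations = []
    expected = opening_balance
    for i, tx in enumerate(transactions):
        expected += tx.amount
        if expected != tx.balance_after:
            violations.append(
                f"row {i}: expected {expected}, statement shows {tx.balance_after}"
            )
            expected = tx.balance_after  # resync so one edit is reported once
    return violations


def check_payroll_on_working_day(transactions, public_holidays=frozenset()):
    """Flag salary credits posted on a weekend or a listed public holiday."""
    violations = []
    for i, tx in enumerate(transactions):
        is_payroll = "salary" in tx.description.lower()  # illustrative match
        non_working = tx.posted_on.weekday() >= 5 or tx.posted_on in public_holidays
        if is_payroll and non_working:
            violations.append(f"row {i}: payroll posted on {tx.posted_on}")
    return violations
```

Amounts are held as `Decimal` rather than `float` so that the balance comparison is exact. Each function returns a list of violations, so new rules can be added without changing how results are collected.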
3. Results Demonstration
3.1 Transition from Manual to Automated Recognition
Improved model accuracy enables quantifiable document tagging, allowing semi‑automatic approval workflows.
3.2 Continuous Closed‑Loop Optimization
Ongoing desensitized data ingestion, human labeling, and similarity‑based training continuously refine detection capabilities.
4. Conclusion
ZhongAn Bank uses AI and ML to improve operational efficiency and fraud prevention. The platform currently integrates 22 templates, covering 80% of the statement formats in effective use in Hong Kong, and the team plans to expand to additional banks and offer a standardized solution for the industry.
ZhongAn Tech Team