
Data Classification and Grading Architecture for Enterprise Data Security

The article details a practical, reusable enterprise architecture for data classification and grading that combines scanning tools, a rule‑engine with hot‑updates, a high‑performance identification service, and a security enforcement platform, addressing massive real‑time data volumes, diverse storage types, cross‑department isolation, and compliance with China’s data security laws.

Tencent Cloud Developer

This article, selected from the Tencent Cloud Developer Community’s original series, shares the author’s experience in implementing data security classification and grading in a large‑scale enterprise.

Background

With the enactment of the Data Security Law and the Personal Information Protection Law, data security has risen to a national strategic level. Data classification and grading have become mandatory for enterprise data governance, but many pain points exist:

Complex rule definition across multiple dimensions and industries.

High coordination cost among many departments and business groups.

Massive data volume that challenges real‑time, high‑efficiency coverage.

Diverse storage components (relational, NoSQL, object storage) each with different protocols and structures.

Existing literature often only explains the concepts and standards, lacking concrete technical implementations. This article therefore focuses on a practical, reusable architecture for data classification and grading.

Business Layer

Classification and grading serve as the foundation for data security controls such as encryption, masking, watermarking, permission management, and audit.
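As a minimal illustration of how a grade drives a downstream control, the sketch below masks a field value according to its sensitivity grade. The grade names and masking rules are hypothetical, not taken from the article:

```python
def mask_value(value: str, grade: str) -> str:
    """Mask a field value according to its sensitivity grade (illustrative)."""
    if grade == "public":
        return value                                      # no control needed
    if grade == "internal":
        return value[:2] + "*" * max(len(value) - 2, 0)   # partial mask
    return "*" * len(value)                               # highest grade: full mask

print(mask_value("13800138000", "internal"))  # -> 13*********
```

In practice the grade would come from the identification service described below, and the control (masking, encryption, watermarking) would be selected by the data middle-platform.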

Technical Layer

Data is scanned and reported, then processed by a data identification engine. In practice, challenges include numerous storage component types, high reporting traffic, and the need for timely, accurate, and comprehensive coverage.

Overall Architecture

The architecture consists of five core blocks:

Tools for scanning and reporting data from various storage components.

A data identification service cluster that receives reports and performs identification.

A rule engine that centrally manages identification rules and supports hot updates.

A data middle‑platform that enforces security controls based on classification results.

Underlying framework capabilities (monitoring, alerting, logging, elastic scaling) to ensure high availability.

Key focus areas are the first three components.

Massive Real‑Time Data Identification

Enterprises generate massive data volumes; the system must achieve high performance, low latency, high accuracy, and broad coverage.

Data Storage

The platform currently supports nearly twenty storage component types (e.g., MySQL, Redis, TiDB) and over thirty million tables. Tables exceeding five million rows must be sharded or partitioned. The chosen storage must support large capacity, high concurrency, high availability, and ACID transactions.
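One common way to split an oversized table is stable hash-based routing of each record's key to a shard. The article does not specify the sharding scheme, so the following is only a generic sketch:

```python
import hashlib

def shard_for(key: str, num_shards: int = 16) -> int:
    """Route a record to a shard by a stable hash of its key (illustrative)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process) keeps the routing consistent across service restarts.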

After evaluating Hadoop, TiDB, and the internally maintained tdsql‑c, tdsql‑c was selected for storing classification results.

Data Ingestion

Multiple ingestion methods (HTTP, TRPC, Kafka) are provided to meet different performance and latency requirements.

Kafka is recommended for large‑scale data transmission because it supports consumer retries and traffic shaping.

Configuration examples:

max.request.size=1048576          // producer: 1 MB cap per request
batch.size=262144                 // producer: 0.25 MB batches
linger.ms=0                       // producer: send without artificial delay
request.timeout.ms=30000          // producer: 30 s request timeout
fetch.max.bytes=1048576           // consumer: 1 MB per fetch
fetch.max.wait.ms=1000            // consumer: wait up to 1 s per fetch
max.partition.fetch.bytes=262144  // consumer: 0.25 MB per partition
max.poll.records=5                // consumer: small poll batches
topic.partition>=20               // topic sizing: at least 20 partitions
retention.ms=2

These settings limit message size, reduce producer memory pressure, and control consumer polling behavior to avoid CPU spikes.
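The producer and consumer settings above can be grouped into plain config maps, as one might pass them to a Kafka client. The values are the ones listed in the article; no particular client library is assumed:

```python
# Producer-side limits: bound request size and memory pressure.
PRODUCER_CONFIG = {
    "max.request.size": 1_048_576,   # cap each request at 1 MB
    "batch.size": 262_144,           # 0.25 MB batches
    "linger.ms": 0,                  # send immediately, no artificial delay
    "request.timeout.ms": 30_000,    # fail fast on a stuck broker
}

# Consumer-side limits: small fetches and poll batches avoid CPU spikes.
CONSUMER_CONFIG = {
    "fetch.max.bytes": 1_048_576,          # 1 MB per fetch
    "fetch.max.wait.ms": 1_000,            # wait at most 1 s when idle
    "max.partition.fetch.bytes": 262_144,  # 0.25 MB per partition
    "max.poll.records": 5,                 # few records per poll
}
```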

Optimization Techniques

Elastic scaling on Kubernetes to distribute load across multiple containers.

Multi‑core parallelism with semaphore‑based rate limiting.

Regular expression optimization to prevent CPU exhaustion.

MapReduce‑style parallelism is employed to split the workload across cores.
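The techniques above can be sketched together: precompile regex rules once (avoiding per-call compilation cost), fan field values out across a bounded worker pool (which caps concurrency much like a semaphore would), then reduce the per-field labels into counts. The rule and labels here are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
import re

# Precompiled once at startup; recompiling per call wastes CPU.
PHONE = re.compile(r"1\d{10}")  # illustrative rule: 11-digit mobile number

def identify(field: str) -> str:
    """Classify a single field value (illustrative single-rule version)."""
    return "phone" if PHONE.fullmatch(field) else "unknown"

def identify_batch(fields, workers: int = 4) -> dict:
    """Map: classify fields in parallel; reduce: count per classification."""
    counts: dict = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:  # bounds concurrency
        for label in pool.map(identify, fields):
            counts[label] = counts.get(label, 0) + 1
    return counts
```

In a real deployment the map step would run across containers scaled by Kubernetes, with per-host rate limiting guarding the CPU.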

Rule Management

Approximately 400 classification definitions and 800 identification rules (regex, NLP, ML, fuzzy matching, blacklists, etc.) are maintained. The rule engine is decoupled from the identification logic, supports hot updates, and can be enabled/disabled without service disruption.
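A hot-updatable, decoupled rule registry can be as simple as a lock-protected map that the identification service reads on every pass. The class and field names below are illustrative, not the article's actual implementation:

```python
import threading

class RuleEngine:
    """Minimal sketch of a hot-swappable rule registry."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rules = {}  # rule_id -> {"pattern": ..., "enabled": ...}

    def upsert(self, rule_id: str, rule: dict) -> None:
        """Add or replace a rule without restarting the service."""
        with self._lock:
            self._rules[rule_id] = rule

    def set_enabled(self, rule_id: str, enabled: bool) -> None:
        """Enable/disable a rule in place, with no service disruption."""
        with self._lock:
            self._rules[rule_id]["enabled"] = enabled

    def active_rules(self) -> list:
        with self._lock:
            return [r for r in self._rules.values() if r.get("enabled", True)]
```

In production the updates would typically arrive from a central rule-management service rather than direct method calls, but the read path looks the same.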

Weight Calculation

Because a field may match multiple rules (e.g., an identifier could be both a QQ ID and a WeChat ID), each rule produces a weight. The system aggregates weights and selects the classification with the highest weight.
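A minimal version of this aggregation sums the weights of all matched rules per candidate classification and picks the maximum. The weight values below are invented for illustration:

```python
def classify(matches: dict) -> str:
    """Pick the classification whose matched-rule weights sum highest.

    `matches` maps a candidate classification to the weights of every
    rule that matched the field.
    """
    totals = {label: sum(ws) for label, ws in matches.items()}
    return max(totals, key=totals.get)

# A field matching both QQ-ID rules (0.4 + 0.3) and a WeChat-ID rule (0.6):
print(classify({"qq_id": [0.4, 0.3], "wechat_id": [0.6]}))  # -> qq_id
```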

Data Validation

Encryption status, table/instance deletion, and instance offline status are validated using a decision‑tree model.
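A decision tree over these statuses can be expressed as an ordered chain of checks, where the first matching branch decides the outcome. The field names and outcomes below are hypothetical:

```python
def validate(record: dict) -> str:
    """Decision-tree style validity check over a scan record (illustrative)."""
    if record.get("instance_offline"):
        return "skip: instance offline"
    if record.get("table_deleted"):
        return "skip: table deleted"
    if record.get("encrypted"):
        return "ok: already encrypted"
    return "flag: plaintext sensitive data"
```

Ordering matters: an offline instance short-circuits the remaining checks, so stale tables never reach the encryption branch.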

Cross‑Department and Platform Integration

To serve multiple departments while protecting sensitive classification results, physical isolation of data per department/platform is enforced.

Conclusion

Data classification and grading involve both business and architectural complexities. This article shares architectural decisions (storage selection, scanning capabilities) and ongoing optimizations (massive data identification, resource cost considerations). The presented framework aims to be reusable across the company, contributing to data security compliance.

Tags: rule engine, cloud-native, architecture, Kafka, Big Data, data security, data classification
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
