Data Classification and Grading Architecture for Enterprise Data Security
The article details a practical, reusable enterprise architecture for data classification and grading that combines scanning tools, a rule‑engine with hot‑updates, a high‑performance identification service, and a security enforcement platform, addressing massive real‑time data volumes, diverse storage types, cross‑department isolation, and compliance with China’s data security laws.
This article, selected from the Tencent Cloud Developer Community’s original series, shares the author’s experience in implementing data security classification and grading in a large‑scale enterprise.
Background
With the enactment of the Data Security Law and the Personal Information Protection Law, data security has risen to a national strategic level. Data classification and grading have become mandatory for enterprise data governance, but many pain points exist:
Complex rule definition across multiple dimensions and industries.
High coordination cost among many departments and business groups.
Massive data volume that challenges real‑time, high‑efficiency coverage.
Diverse storage components (relational, NoSQL, object storage) each with different protocols and structures.
Existing literature often only explains the concepts and standards, lacking concrete technical implementations. This article therefore focuses on a practical, reusable architecture for data classification and grading.
Business Layer
Classification and grading serve as the foundation for data security controls such as encryption, masking, watermarking, permission management, and audit.
Technical Layer
Data is scanned and reported, then processed by a data identification engine. In practice, challenges include numerous storage component types, high reporting traffic, and the need for timely, accurate, and comprehensive coverage.
Overall Architecture
The architecture consists of five core blocks:
Tools for scanning and reporting data from various storage components.
A data identification service cluster that receives reports and performs identification.
A rule engine that centrally manages identification rules and supports hot updates.
A data middle‑platform that enforces security controls based on classification results.
Underlying framework capabilities (monitoring, alerting, logging, elastic scaling) to ensure high availability.
Key focus areas are the first three components.
Massive Real‑Time Data Identification
Enterprises generate massive data volumes; the system must achieve high performance, low latency, high accuracy, and broad coverage.
Data Storage
The platform currently supports nearly twenty storage component types (e.g., MySQL, Redis, TiDB) and over thirty million tables. Tables exceeding five million rows require sharding or partitioning. The chosen storage must support large capacity, high concurrency, high availability, and ACID transactions.
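Shard routing for oversized tables can be kept simple by hashing a stable key. A minimal sketch (the shard count, key format, and `shard_for` helper are illustrative assumptions, not the platform's actual scheme):

```python
import zlib

SHARD_COUNT = 16  # illustrative; the real shard count depends on table size


def shard_for(primary_key: str, shard_count: int = SHARD_COUNT) -> int:
    """Route a row to a shard by a stable CRC32 hash of its primary key."""
    return zlib.crc32(primary_key.encode("utf-8")) % shard_count


# The same key always lands on the same shard, so reads stay consistent.
assert shard_for("user:12345") == shard_for("user:12345")
assert 0 <= shard_for("user:12345") < SHARD_COUNT
```

Stable hashing keeps routing deterministic without a lookup table; rebalancing on shard-count changes is out of scope for this sketch.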
After evaluating Hadoop, TiDB, and the internally maintained tdsql‑c, tdsql‑c was selected for storing classification results.
Data Ingestion
Multiple ingestion methods (HTTP, TRPC, Kafka) are provided to meet different performance and latency requirements.
Kafka is recommended for large‑scale data transmission because it supports consumer retries and traffic shaping.
Configuration examples:

max.request.size=1048576          // 1 MB
batch.size=262144                 // 0.25 MB
linger.ms=0
request.timeout.ms=30000
fetch.max.bytes=1048576           // 1 MB
fetch.max.wait.ms=1000
max.partition.fetch.bytes=262144  // 0.25 MB
max.poll.records=5
topic.partition>=20
retention.ms=2

These settings limit message size, reduce producer memory pressure, and control consumer polling behavior to avoid CPU spikes.
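Expressed as client-side configuration dictionaries in the kafka-python naming style (the library choice and snake_case option spellings are an assumption; the numeric values mirror the settings above):

```python
# Producer side: cap request and batch sizes to reduce memory pressure,
# and send immediately rather than lingering for larger batches.
producer_config = {
    "max_request_size": 1_048_576,   # 1 MB upper bound per request
    "batch_size": 262_144,           # 0.25 MB per batch
    "linger_ms": 0,                  # no artificial batching delay
    "request_timeout_ms": 30_000,
}

# Consumer side: bound fetch sizes and poll counts so each poll loop
# does a small, predictable amount of work and avoids CPU spikes.
consumer_config = {
    "fetch_max_bytes": 1_048_576,          # 1 MB per fetch
    "fetch_max_wait_ms": 1_000,
    "max_partition_fetch_bytes": 262_144,  # 0.25 MB per partition
    "max_poll_records": 5,                 # few records per poll
}
```

These dictionaries would be passed to a `KafkaProducer` / `KafkaConsumer` constructor; partition count and retention are broker-side topic settings and are configured separately.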
Optimization Techniques
Elastic scaling on Kubernetes to distribute load across multiple containers.
Multi‑core parallelism with semaphore‑based rate limiting.
Regular expression optimization to prevent CPU exhaustion.
MapReduce‑style parallelism is employed to split the workload across cores.
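The semaphore-based rate limiting above can be sketched with a thread pool whose effective concurrency is capped by a semaphore; the record content and `identify` function are placeholders for the real identification step:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 4  # illustrative cap; tune to the container's CPU quota
limiter = threading.Semaphore(MAX_CONCURRENT)


def identify(record: str) -> str:
    """Placeholder for rule matching; the semaphore bounds how many
    records are being processed at any moment, even if the pool is larger."""
    with limiter:
        return record.upper()  # stand-in for the real identification work


# The pool fans work out across cores; the semaphore shapes the load.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(identify, ["name", "phone", "email"]))
```

Separating the pool size (throughput) from the semaphore count (instantaneous load) lets each be tuned independently during elastic scaling.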
Rule Management
Approximately 400 classification definitions and 800 identification rules (regex, NLP, ML, fuzzy matching, blacklists, etc.) are maintained. The rule engine is decoupled from the identification logic, supports hot updates, and can be enabled/disabled without service disruption.
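A decoupled, hot-updatable rule registry can be sketched as follows; the class name, rule names, and regex pattern are illustrative, not the production engine:

```python
import re
import threading


class RuleEngine:
    """Minimal sketch: rules can be added, replaced, enabled, or disabled
    at runtime, so updates never require restarting the identification service."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rules = {}  # name -> (compiled pattern, enabled flag)

    def upsert(self, name: str, pattern: str, enabled: bool = True) -> None:
        with self._lock:
            self._rules[name] = (re.compile(pattern), enabled)

    def set_enabled(self, name: str, enabled: bool) -> None:
        with self._lock:
            pattern, _ = self._rules[name]
            self._rules[name] = (pattern, enabled)

    def match(self, value: str) -> list:
        with self._lock:  # snapshot so hot updates don't race with matching
            rules = list(self._rules.items())
        return [name for name, (pat, on) in rules if on and pat.fullmatch(value)]


engine = RuleEngine()
engine.upsert("cn_mobile", r"1[3-9]\d{9}")  # illustrative regex rule
assert engine.match("13800138000") == ["cn_mobile"]
engine.set_enabled("cn_mobile", False)       # hot-disable, no restart
assert engine.match("13800138000") == []
```

Only regex rules are shown; NLP, ML, fuzzy-matching, and blacklist rules would plug into the same registry behind a common match interface.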
Weight Calculation
Because a field may match multiple rules (e.g., an identifier could be both a QQ ID and a WeChat ID), each rule produces a weight. The system aggregates weights and selects the classification with the highest weight.
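The aggregation step can be sketched as summing weights per candidate classification and taking the maximum; the field names and weight values are illustrative:

```python
from collections import defaultdict


def classify(matches):
    """matches: (classification, weight) pairs from every rule a field hit.
    Weights are summed per classification and the highest total wins."""
    totals = defaultdict(float)
    for classification, weight in matches:
        totals[classification] += weight
    return max(totals, key=totals.get) if totals else None


# A field that looks like both a QQ ID and a WeChat ID:
hits = [("qq_id", 0.6), ("wechat_id", 0.3), ("qq_id", 0.2)]
assert classify(hits) == "qq_id"  # 0.8 total outweighs 0.3
```

Summing (rather than taking each rule's maximum) lets several weak signals for the same classification reinforce each other.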
Data Validation
Encryption status, table/instance deletion, and instance offline status are validated using a decision‑tree model.
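The decision-tree validation can be sketched as an ordered chain of checks where the first disqualifying condition short-circuits; the metadata field names and skip messages are assumptions for illustration:

```python
def validate(meta: dict) -> str:
    """Ordered validation chain: offline instances are checked first,
    then deleted tables, then encryption status; anything else proceeds."""
    if meta.get("instance_offline"):
        return "skip: instance offline"
    if meta.get("table_deleted"):
        return "skip: table deleted"
    if meta.get("encrypted"):
        return "skip: already encrypted"
    return "ok"


assert validate({"instance_offline": True}) == "skip: instance offline"
assert validate({"encrypted": True}) == "skip: already encrypted"
assert validate({}) == "ok"
```

Ordering the branches from cheapest and most common to rarest keeps the per-table validation cost low at thirty-million-table scale.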
Cross‑Department and Platform Integration
To serve multiple departments while protecting sensitive classification results, physical isolation of data per department/platform is enforced.
Conclusion
Data classification and grading involve both business and architectural complexities. This article shares architectural decisions (storage selection, scanning capabilities) and ongoing optimizations (massive data identification, resource cost considerations). The presented framework aims to be reusable across the company, contributing to data security compliance.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.