
Inside Wing Pay’s Scalable Big Data Platform: Architecture & Governance

This article details how Wing Pay built a comprehensive data development and governance platform, covering company background, business scenarios, goals, challenges, task development workflow, task types, SparkSQL editor features, dual-environment (production and local development) deployment, Airflow scheduling, the DataX data bus, resource isolation, compute optimization, data quality monitoring, cloud-native practices, future outlook, and a Q&A on data permissions and governance.

Data Thinking Notes

1. Company Overview and Business Scenario

China Telecom's subsidiary Wing Pay serves 70 million monthly active users with services such as bill payment, shopping, and finance, leveraging blockchain, cloud computing, big data, and AI to empower over 10 million offline merchants and 170 online e‑commerce platforms.

Business Scenario

The data development and governance platform supports data warehouse, rapid development across business units, offline computation, data integration, real‑time data development, and data services to improve development and governance efficiency.

Goal

Build an integrated platform that unifies data integration, offline and real‑time computation, and data services, providing a one‑stop solution for data engineers.

Challenges

Massive data volume, high concurrency, low‑latency requirements, diverse business scenarios, and complex use cases.

2. Data Development and Governance Platform

Task Development Process

Developers create a business flow (Flow) that groups tasks for offline scheduling, data integration, publishing, and SparkSQL jobs. After scripting, they set core parameters (priority, dependencies, runtime options), test execution, and submit for review. Approved tasks are released to production and scheduled by the engine.
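The Flow-and-task model above can be sketched in a few lines of Python. This is an illustrative data model, not Wing Pay's actual schema; the class names, fields, and priority labels are assumptions drawn from the workflow description.

```python
# Hypothetical sketch of a Flow grouping tasks with core scheduling
# parameters (priority, dependencies). Names and fields are illustrative.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    task_type: str            # e.g. "spark_sql", "data_sync"
    priority: str = "normal"  # core / important / normal
    depends_on: list = field(default_factory=list)

@dataclass
class Flow:
    name: str
    tasks: list = field(default_factory=list)

    def add(self, task: Task) -> "Flow":
        self.tasks.append(task)
        return self

    def execution_order(self) -> list:
        """Topologically sort tasks by their declared dependencies."""
        done, order = set(), []
        pending = {t.name: t for t in self.tasks}
        while pending:
            ready = [t for t in pending.values() if set(t.depends_on) <= done]
            if not ready:
                raise ValueError("cyclic dependency in flow")
            for t in ready:
                order.append(t.name)
                done.add(t.name)
                del pending[t.name]
        return order

flow = (Flow("daily_dw")
        .add(Task("ods_load", "data_sync"))
        .add(Task("dw_agg", "spark_sql", "core", ["ods_load"])))
```

Once the review step approves a flow like this, the scheduler only needs the topological order and the per-task parameters to release it to production.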

Task Types

Data Sync: Connects and publishes data from Oracle, OceanBase, MySQL, SFTP, HBase, etc.

Spark Task: Executes offline SparkSQL jobs.

Machine Learning: Runs AI model tasks.

Kylin Task: Schedules Kylin data-warehouse jobs.

Trigger Task: Starts platform tasks via external system callbacks (e.g., audience segmentation, data push).

SparkSQL Task Development Editor Features

New SparkSQL task: drag‑and‑drop node creation, script editing, syntax validation.

Single execution: submit to cluster, test, approve, and deploy.

Automatic dependency configuration: parses SQL to identify source and target tables, generating fine‑grained lineage.

Manual dependency addition: search and attach tasks.

SparkSQL permission parsing: enforces user‑level table permissions from metadata.

Lineage visualization: shows cross‑Flow task dependencies.
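The automatic dependency configuration above hinges on parsing SQL to find source and target tables. A minimal sketch of that idea, using regular expressions rather than the full SQL grammar a production parser would use (the function name and patterns are illustrative):

```python
# Naive regex-based lineage extraction from a SparkSQL statement.
# A real implementation would use a proper SQL parser; this only
# illustrates the source/target identification step.
import re

def extract_lineage(sql: str):
    target = re.search(r"insert\s+(?:overwrite\s+table|into)\s+([\w.]+)",
                       sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    tgt = target.group(1) if target else None
    return tgt, sorted(set(s for s in sources if s != tgt))

sql = """
INSERT OVERWRITE TABLE dw.user_daily
SELECT u.id, o.amount
FROM ods.users u JOIN ods.orders o ON u.id = o.uid
"""
```

Feeding the extracted (target, sources) pairs into a graph is what makes the fine-grained lineage and cross-Flow dependency visualization possible.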

3. Platform Technical Architecture Practices

Overall Architecture

The upper layer is the application tier; the lower layer is the scheduling tier. Airflow serves as the core scheduler, extended with custom operators for SparkSQL and data‑exchange tasks.

Scheduling Engine

After evaluating Zeus, Airflow, and Azkaban, Airflow 1.10 was chosen for its Python extensibility, stability, and community activity. Production uses the Celery executor with Redis as the task queue; local development uses the Local executor.

To simplify DAG management, a REST API extension was built to generate DAG files and metadata, allowing the application layer to hide scheduling details from users.
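Generating DAG files from metadata might look like the sketch below: the application layer posts flow metadata to the REST extension, which renders a Python DAG file for Airflow to pick up. The template, field names, and use of `BashOperator` are assumptions for illustration (the import path matches Airflow 1.10).

```python
# Sketch of rendering an Airflow 1.10 DAG file from flow metadata, so end
# users never touch scheduler internals. Template fields are illustrative.
DAG_TEMPLATE = '''\
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

with DAG("{dag_id}", start_date=datetime(2023, 1, 1),
         schedule_interval="{cron}", catchup=False) as dag:
{tasks}
'''

def render_dag(dag_id: str, cron: str, tasks: dict) -> str:
    lines = [f'    {name} = BashOperator(task_id="{name}", '
             f'bash_command="{cmd}")'
             for name, cmd in tasks.items()]
    return DAG_TEMPLATE.format(dag_id=dag_id, cron=cron,
                               tasks="\n".join(lines))

dag_py = render_dag("daily_dw", "0 2 * * *",
                    {"dw_agg": "spark-submit job.py"})
```

Writing the rendered string into Airflow's `dags/` folder (and recording the metadata alongside it) is enough for the scheduler to discover and run the flow.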

Data Bus

DataX is the core data‑bus module. Templates generated from user‑provided parameters are submitted to a Yarn cluster for execution, enabling scalable batch processing.
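Template generation from user parameters can be sketched as follows. The JSON skeleton follows DataX's job format (`job.setting.speed.channel`, `job.content` with reader/writer plugins); the specific plugin parameters shown are placeholders, not Wing Pay's actual configuration.

```python
# Sketch of rendering a DataX job JSON from user-supplied reader/writer
# parameters; the resulting config is what gets submitted for execution.
import json

def datax_job(reader: dict, writer: dict, channel: int = 3) -> str:
    job = {
        "job": {
            "setting": {"speed": {"channel": channel}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }
    return json.dumps(job, indent=2)

cfg = datax_job(
    {"name": "mysqlreader", "parameter": {"username": "u", "connection": []}},
    {"name": "hdfswriter", "parameter": {"path": "/dw/ods/users"}},
)
```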

Resource Isolation & Compute Optimization

Separate queues isolate real‑time and offline workloads. A three‑level priority system (core, important, normal) and dynamic throttling ensure critical tasks receive resources while low‑priority jobs are delayed.
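The three-level priority scheme with throttling can be illustrated with a small admission sketch: when capacity is scarce, jobs are admitted in priority order and the remainder are delayed. The slot model and labels are simplifications of whatever the platform actually uses.

```python
# Minimal sketch of three-level priority admission with throttling:
# core > important > normal; jobs beyond the free-slot budget are delayed.
PRIORITY_RANK = {"core": 0, "important": 1, "normal": 2}

def admit(jobs, free_slots):
    """Admit jobs in priority order until slots run out."""
    admitted, delayed = [], []
    for job in sorted(jobs, key=lambda j: PRIORITY_RANK[j["priority"]]):
        if free_slots > 0:
            admitted.append(job["name"])
            free_slots -= 1
        else:
            delayed.append(job["name"])
    return admitted, delayed

jobs = [{"name": "report", "priority": "normal"},
        {"name": "billing", "priority": "core"},
        {"name": "features", "priority": "important"}]
```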

Spark job optimizations include small‑file handling, resource tuning, data skew mitigation, join optimization, and task splitting.
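The small-file handling mentioned above usually comes down to simple arithmetic: choose a partition count so each output file approaches a target size rather than producing thousands of tiny files. The 128 MB target below is a common HDFS-block-sized default, assumed here for illustration.

```python
# Sketch of small-file mitigation arithmetic: derive a partition count
# from the output volume so files approach a target size (128 MB assumed).
def target_partitions(total_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    return max(1, -(-total_bytes // target_file_bytes))  # ceiling division

# A 10 GB output coalesces to 80 partitions; in Spark this number would
# drive df.coalesce(n) or spark.sql.shuffle.partitions.
```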

Data Quality Monitoring

Quality is evaluated across timeliness, accuracy, completeness, consistency, and validity. Rules are defined as strong or weak; strong rules trigger task circuit‑breakers, while weak rules are analyzed post‑execution.

Quality jobs generate SparkSQL jobs that produce rule reports; a failure of a strong rule aborts the workflow.
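The strong/weak rule semantics described above can be sketched directly: a failed strong rule raises and breaks the circuit, aborting the workflow, while weak-rule failures are merely recorded for post-execution analysis. Rule names, the exception type, and the metrics dict are illustrative.

```python
# Sketch of strong/weak quality-rule evaluation: a failed strong rule
# aborts the workflow (circuit breaker); weak failures are only reported.
class QualityGateError(Exception):
    pass

def evaluate(rules, metrics):
    """rules: (name, level, predicate) triples; metrics: measured values."""
    report = []
    for name, level, predicate in rules:
        ok = predicate(metrics)
        report.append((name, level, ok))
        if level == "strong" and not ok:
            raise QualityGateError(f"strong rule failed: {name}")
    return report

rules = [
    ("row_count_nonzero", "strong", lambda m: m["rows"] > 0),
    ("null_rate_low", "weak", lambda m: m["null_rate"] < 0.01),
]
```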

Cloud‑Native Practices

The platform is decomposed into microservices (real‑time, offline, data bus, quality jobs, monitoring, AI models) and deployed via a CI/CD pipeline with unified monitoring, alerting, and auto‑scaling capabilities.

4. Future Outlook

Key focus areas include improving performance and scalability of the data‑development services, enhancing observability to detect instability, further optimizing SparkSQL workloads, and implementing multi‑site disaster recovery for both batch and real‑time clusters.

5. Q&A

Q1: How is data permission controlled?

Permissions are linked to the user’s organization and space management, with metadata‑level table privileges and UDF encryption controls. SQL parsing validates user access to referenced tables during task creation.
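The SQL-parsing permission check can be sketched as follows: tables referenced by the statement are compared against the user's metadata grants at task creation, and a mismatch rejects the task. The grant store, regex, and user are placeholder assumptions.

```python
# Sketch of metadata-level table permission validation at task creation.
# The grant store and table extraction are simplified placeholders.
import re

GRANTS = {"alice": {"ods.users", "dw.user_daily"}}  # hypothetical metadata

def referenced_tables(sql: str):
    return set(re.findall(r"(?:from|join|table)\s+([\w.]+)", sql, re.I))

def check_access(user: str, sql: str) -> bool:
    missing = referenced_tables(sql) - GRANTS.get(user, set())
    if missing:
        raise PermissionError(f"{user} lacks access to: {sorted(missing)}")
    return True
```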

Q2: How is data governance performed and improved?

Early siloed development incurred high compute costs; the integrated platform now streamlines workflow, reduces resource consumption, and accelerates model feature computation and dashboard query performance.

Tags: cloud-native, big data, data platform, data governance, Spark, Airflow

Written by Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
