
Case Study: Migrating Spark Thinking Education's Big Data Architecture from EMR to Serverless

This article details Spark Thinking Education's comprehensive migration from EMR to a serverless big‑data architecture, outlining the challenges of elasticity, cost accounting, and resource contention, the step‑by‑step implementation of serverless compute, storage, and integration services, and the resulting performance, cost, and stability gains.


Introduction: Spark Thinking Education, an online education company for children, faced explosive data growth and the limitations of its self‑built Hadoop and EMR clusters, prompting a move to a serverless big‑data platform to improve elasticity, cost control, and operational efficiency.

Challenges: the legacy clusters lacked elastic scaling, could not allocate costs precisely per business line, suffered from resource contention, tightly coupled storage with compute, and carried high operational complexity.

Serverless Overview: serverless computing provides on‑demand resource provisioning, automatic scaling, reduced operational overhead, and pay‑per‑use cost control, making it well suited to bursty big‑data workloads.
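The elasticity at the heart of this pitch can be pictured as a scale-by-queue-depth policy: compute units (CUs) grow with pending work and shrink when it drains. The thresholds, CU bounds, and function below are illustrative assumptions for this sketch, not DLC's actual scaling algorithm.

```python
# Illustrative sketch of an elastic-scaling rule a serverless SQL engine
# might apply: size the CU allocation to the pending-query queue, within
# a floor and a quota cap. All numbers here are hypothetical.

def target_cus(queued_queries: int, min_cus: int = 16, max_cus: int = 256,
               queries_per_cu: int = 2) -> int:
    """Return the CU allocation for the current queue depth."""
    # Ceiling division: each CU is assumed to absorb `queries_per_cu` queries.
    needed = -(-queued_queries // queries_per_cu)
    return max(min_cus, min(max_cus, needed))

print(target_cus(0))     # idle: stays at the floor -> 16
print(target_cus(100))   # burst: scales with demand -> 50
print(target_cus(1000))  # spike: capped at the quota -> 256
```

The same shape of rule, run in reverse as queues drain, is what lets a serverless engine avoid the fixed peak-sized clusters the article's EMR setup required.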

Solution Architecture: adopted Tencent Cloud Data Lake Compute (DLC) serverless services—DLC‑Presto for OLAP, DLC‑Spark‑SQL for offline processing, Oceanus for real‑time streaming, and object storage (COS) for decoupled storage. Integrated with the Athena data‑factory, the Hybris metadata service, and custom SeaTunnel data‑integration pipelines.
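The workload split across the three engines can be summarized as a small routing rule; the classification below is a simplification I am assuming from the architecture description, not an API of any of these products.

```python
# Illustrative routing of workloads onto the engines named in the
# architecture: interactive SQL -> DLC-Presto, offline batch SQL ->
# DLC-Spark-SQL, streaming jobs -> Oceanus. The rule is a sketch.

def route_workload(kind: str, interactive: bool = False) -> str:
    """Map a workload type onto the engine the architecture assigns it."""
    if kind == "stream":
        return "Oceanus"           # real-time streaming
    if kind == "sql" and interactive:
        return "DLC-Presto"        # ad-hoc / OLAP queries
    if kind == "sql":
        return "DLC-Spark-SQL"     # offline batch processing
    raise ValueError(f"unknown workload kind: {kind}")

print(route_workload("sql", interactive=True))  # DLC-Presto
print(route_workload("sql"))                    # DLC-Spark-SQL
print(route_workload("stream"))                 # Oceanus
```

All three engines read the same COS-backed tables, which is what the article means by decoupling storage from compute: routing a query to a different engine does not move any data.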

Implementation Steps:
1) Planned the migration order (compute → storage → integration).
2) Enabled dual‑run in the data factory for a smooth transition.
3) Migrated the Presto, Hive‑Tez, and Flink engines to their serverless equivalents, configuring CU quotas, elastic‑scaling rules, and queue management.
4) Performed consistency checks by comparing query results and MD5 hashes between old and new engines.
5) Re‑engineered storage by converting ORC to Parquet, upgrading file versions, and merging small files.
6) Replaced Sqoop/DataX with SeaTunnel running on DLC‑Spark‑Job.
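The dual-run consistency check in step 4 can be sketched as hashing a canonicalized result set from each engine and comparing digests. The canonicalization below (sorted rows, delimiter-joined columns) is an assumption for illustration; a real pipeline must also normalize types, NULL encodings, and floating-point precision before hashing.

```python
import hashlib

# Sketch of a dual-run consistency check: canonicalize each engine's
# result set (sort rows, join columns with a delimiter unlikely to
# appear in the data), then compare MD5 digests. Row layout and
# normalization rules are assumptions for this example.

def result_md5(rows) -> str:
    """Order-insensitive MD5 of a result set given as a list of tuples."""
    canonical = "\n".join(sorted("\x1f".join(map(str, r)) for r in rows))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

old_engine = [(1, "a"), (2, "b"), (3, "c")]          # e.g. Hive-Tez output
new_engine = [(3, "c"), (1, "a"), (2, "b")]          # same rows, new order

# Matching digests mean the engines agree on this query's result.
print(result_md5(old_engine) == result_md5(new_engine))  # True
```

Hashing keeps the comparison cheap at scale: only one digest per query per engine crosses the network, rather than full result sets.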

Benefits: reduced single‑SQL compute cost by 29%, overall storage cost by 43%, and data‑integration/real‑time cost by 42%; cut task latency by ~35%; raised peak compute capacity from 1,600 CU to 7,600 CU; eliminated many operational bottlenecks and lowered SLA incidents.

Lessons Learned: watch for cold‑start latency, file‑format compatibility issues, and resource contention in DLC‑Spark‑SQL; plan for environment isolation; budget for the cost of the dual‑run period; and lean on vendor support throughout the migration.

Future Outlook: serverless big‑data products from the major cloud providers will continue to evolve, offering more choices for enterprises seeking elastic, cost‑effective analytics platforms.

Tags: serverless, big data, cloud computing, cost optimization, architecture migration
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
