Evolution of the Big Data Technology Stack Over the Past Five Years

This article reviews the evolution of big data technologies in the last five years, covering streaming and batch processing frameworks, column‑store NoSQL databases, programming language trends, the cloud‑native multi‑model database Lindorm, and practical Flink/Blink usage with code examples.

ByteDance ADFE Team

The author, who has worked on Sina Ads DMP, Alibaba's high‑traffic Red Packet system, and Alibaba's task engine, shares personal experiences to outline the development of the big data technology stack over the past five years.

1. Big Data Technology Classification

The article begins with a high‑level classification of big data components, illustrated by a diagram (image omitted).

2. History of Streaming Computing

Three mainstream streaming frameworks are discussed: Storm/JStorm, Spark Streaming, and Flink/Blink. Storm models a job as a topology of spouts (data sources) and bolts (processing steps); Spark Streaming processes data in micro-batches via DStreams; Flink processes events one at a time with low latency, provides exactly-once state consistency through checkpointing, and unifies stream and batch processing.
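
The difference between micro-batch and per-event processing can be sketched in plain Python. This is a toy model and not the API of Spark Streaming or Flink: the function names, the count-based batch size, and the sample events are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group an event stream into fixed-size batches (micro-batch style).

    Real micro-batching cuts batches by time interval; a count is used
    here only to keep the sketch deterministic.
    """
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield batch


def running_sum_per_event(events: Iterable[int]) -> Iterator[int]:
    """Emit an updated result after every single event (per-event style)."""
    total = 0
    for event in events:
        total += event
        yield total


def running_sum_per_batch(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Emit an updated result only once per completed batch."""
    total = 0
    for batch in micro_batches(events, batch_size):
        total += sum(batch)
        yield total


events = [3, 1, 4, 1, 5]
per_event = list(running_sum_per_event(events))
per_batch = list(running_sum_per_batch(events, batch_size=2))
```

Both versions reach the same final total. The per-event version publishes a result after each of the five events, while the micro-batch version publishes only three, which is the latency trade-off in miniature.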

A comparison table (image omitted) shows Flink/Blink as the current dominant choice.

3. History of Offline Computing

Offline processing includes Hadoop MapReduce, Spark, and Hive/ODPS. MapReduce requires hand-written map and reduce functions, typically in Java (or in Python via Hadoop Streaming); Spark keeps intermediate data in memory and exposes it through Spark SQL; Hive provides a SQL layer on Hadoop, lowering the development barrier.
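
To make the MapReduce programming model concrete, here is a minimal single-process sketch of a word count. It only imitates the map, shuffle, and reduce phases in memory; it does not use Hadoop, and all function names and the sample input are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in order over an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big table", "data lake"])
```

In a SQL layer such as Hive, the same job is a single query with COUNT(*) and GROUP BY, which is why SQL lowers the barrier compared with writing the phases by hand.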

The article highlights industry usage: Alibaba's ODPS, Huawei's Hadoop-based ETL, and ByteDance's Hive.

4. Evolution of Column‑Store NoSQL Databases

The focus is on HBase and its cloud‑native successor Lindorm. HBase offers massive column‑family tables with linear scalability, while Lindorm adds multi‑model support, storage‑compute separation, and better performance.
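
The column-family table model can be pictured as a sorted, nested map: row key, then column family, then column qualifier, then timestamped versions of a value. The sketch below models that layout with plain dictionaries. It is not the HBase or Lindorm client API; the helper names, the row-key format, and the sample data are assumptions for illustration.

```python
from typing import Dict, List, Optional

# row key -> column family -> qualifier -> {timestamp: value}
Table = Dict[str, Dict[str, Dict[str, Dict[int, str]]]]


def put(table: Table, row: str, family: str, qualifier: str, ts: int, value: str) -> None:
    """Write one cell version; older versions of the cell are kept."""
    cell = table.setdefault(row, {}).setdefault(family, {}).setdefault(qualifier, {})
    cell[ts] = value


def get_latest(table: Table, row: str, family: str, qualifier: str) -> Optional[str]:
    """Return the value with the highest timestamp, or None if absent."""
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    return versions[max(versions)] if versions else None


def scan(table: Table, start_row: str, stop_row: str) -> List[str]:
    """Return row keys in [start_row, stop_row), in sorted key order."""
    return [row for row in sorted(table) if start_row <= row < stop_row]


orders: Table = {}
put(orders, "buyer42#0001", "info", "status", ts=1, value="created")
put(orders, "buyer42#0001", "info", "status", ts=2, value="paid")
put(orders, "buyer42#0002", "info", "status", ts=1, value="created")
put(orders, "buyer77#0001", "info", "status", ts=1, value="created")

latest = get_latest(orders, "buyer42#0001", "info", "status")
# '$' sorts immediately after '#', so this range covers every "buyer42#" key.
buyer42_rows = scan(orders, "buyer42#", "buyer42$")
```

Because rows are ordered by key, all orders of one buyer sit next to each other and can be read with one range scan. That is also why row-key design matters so much in this family of databases.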

5. Big Data Development Language Trends

Scala once dominated big data (Kafka, Spark). Today, SQL is prevalent for data‑warehouse tasks, JVM languages (Java/Scala) remain core for Hadoop ecosystems, Python is favored for AI, and R for modeling.

6. Lindorm Introduction

Lindorm is a cloud-native multi-model database supporting wide-table, time-series, search, and file models, with interfaces compatible with HBase, Cassandra, Phoenix, OpenTSDB, Solr, and SQL. It addresses HBase's read/write latency spikes, removes the need for manual sharding, and offers ACID-like guarantees for single-row operations.

Comparisons with MySQL and HBase highlight Lindorm’s unlimited storage scalability, avoidance of sharding, and simplified cross‑partition queries.

Practical pitfalls include the cost of maintaining secondary indexes, deep pagination (offset-based paging gets slower as the offset grows), and region hot-spot management.
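
Two of these pitfalls have common workarounds that can be sketched without a database: paging by cursor instead of by offset, and salting row keys to spread writes. The code below works on an in-memory sorted list of keys. It is not Lindorm's API; the function names, the bucket count, and the key format are assumptions for illustration.

```python
import bisect
import zlib
from typing import List, Optional, Tuple


def page_by_offset(sorted_keys: List[str], offset: int, limit: int) -> List[str]:
    """Offset paging: a scan-based store must walk past `offset` rows
    before returning anything, so deep pages cost more."""
    return sorted_keys[offset:offset + limit]


def page_by_cursor(
    sorted_keys: List[str], last_key: Optional[str], limit: int
) -> Tuple[List[str], Optional[str]]:
    """Cursor paging: seek directly to the first key after `last_key`."""
    start = 0 if last_key is None else bisect.bisect_right(sorted_keys, last_key)
    page = sorted_keys[start:start + limit]
    next_cursor = page[-1] if page else None
    return page, next_cursor


def salted_key(row_key: str, buckets: int) -> str:
    """Prefix the key with a stable hash bucket so sequential keys
    spread across key ranges instead of piling onto one region."""
    bucket = zlib.crc32(row_key.encode("utf-8")) % buckets
    return f"{bucket:02d}_{row_key}"


keys = [f"order{n:04d}" for n in range(10)]
first_page, cursor = page_by_cursor(keys, None, 3)
second_page, cursor = page_by_cursor(keys, cursor, 3)
salted = salted_key("order0001", buckets=16)
```

Cursor paging returns the same rows as offset paging but starts each request with a seek. Salting trades away cheap range scans: reading a contiguous range of original keys now means querying every bucket.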

7. Flink and Blink Overview

Flink follows the Dataflow model, treating batch as a special case of streaming: a batch is simply a bounded stream. Blink is Alibaba's enhanced fork of Flink, tuned for higher performance under Alibaba's Double 11 shopping festival traffic (up to 40 billion events per second).
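
The idea that batch is a special case of streaming can be illustrated with a single windowing function that accepts either a finite list or a generator. This is a plain-Python sketch and not Flink's API; the function names, the 10-second window, and the sample events are assumptions, and the sketch assumes events arrive in event-time order.

```python
from typing import Iterable, Iterator, Optional, Tuple

Event = Tuple[int, float]  # (event time in seconds, amount)


def tumbling_window_sums(
    events: Iterable[Event], window_size: int
) -> Iterator[Tuple[int, float]]:
    """Sum amounts per fixed window, emitting a window once it closes.

    Real engines handle out-of-order data with watermarks, which this
    sketch omits.
    """
    current_start: Optional[int] = None
    total = 0.0
    for event_time, amount in events:
        window_start = event_time - event_time % window_size
        if current_start is not None and window_start != current_start:
            yield current_start, total  # the previous window is complete
            total = 0.0
        current_start = window_start
        total += amount
    if current_start is not None:  # end of a bounded input closes the last window
        yield current_start, total


def live_source() -> Iterator[Event]:
    """Stand-in for a streaming source; it ends so the example terminates."""
    yield from [(1, 10.0), (4, 5.0), (11, 2.5), (12, 7.5), (25, 1.0)]


batch_input = [(1, 10.0), (4, 5.0), (11, 2.5), (12, 7.5), (25, 1.0)]
batch_result = list(tumbling_window_sums(batch_input, window_size=10))
stream_result = list(tumbling_window_sums(live_source(), window_size=10))
```

The same operator produces identical results for both inputs. The only difference is that a bounded input eventually ends, which closes the final window.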

A comparison image (omitted) contrasts Flink and Blink.

8. Using Flink/Blink

A Blink job is written in three parts: a source table, a sink table, and SQL business logic that connects them. The source table reads from DataHub:

CREATE TABLE dwd_tb_trd_pay_ri(
    biz_order_id VARCHAR, -- order ID
    auction_id VARCHAR, -- product ID
    auction_title VARCHAR, -- product title
    buyer_id VARCHAR, -- buyer ID
    buyer_nick VARCHAR, -- buyer nickname
    pay_time VARCHAR, -- payment time
    gmt_create VARCHAR, -- creation time
    gmt_modified VARCHAR, -- modification time
    biz_type VARCHAR, -- transaction type
    pay_status VARCHAR, -- payment status
    `attributes` VARCHAR, -- order flags
    from_group VARCHAR, -- order source
    div_idx_actual_total_fee DOUBLE -- transaction amount
) WITH (
    type='datahub',
    endPoint='http://dh-cn-hangzhou.aliyun-inc.com',
    project='yourProjectName',
    topic='yourTopicName',
    roleArn='yourRoleArn',
    batchReadSize='500'
);

The sink table writes to an RDS database:

CREATE TABLE tddl_output(
    gmt_create VARCHAR, -- creation time
    gmt_modified VARCHAR, -- modification time
    buyer_id BIGINT, -- buyer ID
    cumulate_amount BIGINT, -- amount
    effect_time BIGINT, -- payment time
    PRIMARY KEY(buyer_id, effect_time)
) WITH (
    type='rds',
    url='yourDatabaseURL',
    tableName='yourTableName',
    userName='yourUserName',
    password='yourDatabasePassword'
);

Business logic is expressed in SQL:

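INSERT INTO tddl_output
SELECT gmt_create, gmt_modified,
       CAST(buyer_id AS BIGINT),                  -- source column is VARCHAR
       CAST(div_idx_actual_total_fee AS BIGINT),  -- cumulate_amount; drops the fractional part
       CAST(pay_time AS BIGINT)                   -- effect_time; assumes pay_time holds an epoch timestamp
FROM dwd_tb_trd_pay_ri
WHERE div_idx_actual_total_fee > 0;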
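
The filter-and-project logic can be tried locally without a Blink cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in engine, keeps only the source columns the statement reads, and maps them onto the five sink columns with explicit casts. The sample rows, the epoch-seconds format of pay_time, and the SQLite column types are assumptions for illustration; the DataHub and RDS connectors are not modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dwd_tb_trd_pay_ri (
    buyer_id TEXT,
    pay_time TEXT,
    gmt_create TEXT,
    gmt_modified TEXT,
    div_idx_actual_total_fee REAL
);
CREATE TABLE tddl_output (
    gmt_create TEXT,
    gmt_modified TEXT,
    buyer_id INTEGER,
    cumulate_amount INTEGER,
    effect_time INTEGER,
    PRIMARY KEY (buyer_id, effect_time)
);
""")

# Hypothetical sample rows; pay_time is assumed to be an epoch-seconds string.
rows = [
    ("1001", "1600000000", "2020-09-13 12:26:40", "2020-09-13 12:26:40", 25.0),
    ("1002", "1600000060", "2020-09-13 12:27:40", "2020-09-13 12:27:40", 0.0),
    ("1003", "1600000120", "2020-09-13 12:28:40", "2020-09-13 12:28:40", 99.9),
]
conn.executemany("INSERT INTO dwd_tb_trd_pay_ri VALUES (?, ?, ?, ?, ?)", rows)

# Same shape as the streaming job: keep paid orders, cast to the sink types.
conn.execute("""
INSERT INTO tddl_output
SELECT gmt_create,
       gmt_modified,
       CAST(buyer_id AS INTEGER),
       CAST(div_idx_actual_total_fee AS INTEGER),
       CAST(pay_time AS INTEGER)
FROM dwd_tb_trd_pay_ri
WHERE div_idx_actual_total_fee > 0
""")

result = conn.execute(
    "SELECT buyer_id, cumulate_amount, effect_time FROM tddl_output ORDER BY buyer_id"
).fetchall()
```

The zero-amount order is filtered out, and the cast to an integer amount drops the fractional part. If the sink should keep fractional amounts, it would need a decimal column instead.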

Performance tuning involves SQL optimization and parameter tuning; Blink provides auto‑tuning but manual tuning can yield up to 4× throughput improvements.

9. Big Data at ByteDance

ByteDance’s data platform supports all business lines, offering products such as Fengshen (BI), Dorado (data integration & development), and Libra (A/B testing). The underlying storage is ClickHouse, an OLAP column‑store, heavily optimized for massive analytical workloads.

Additional storage systems include ByteKV (a distributed key-value store with strong consistency), ByteSQL (a distributed table store with global secondary indexes), and ByteGraph (a distributed graph database supporting Gremlin).

Conclusion

The article provides a concise overview of big data technology evolution, enriched with personal anecdotes, and encourages developers to join the big data and AI revolutions.

References

Links to the cited articles and images are listed at the end of the original document.

Official account of ByteDance Advertising Frontend Team
