Tagged articles
3697 articles
The Dominant Programmer
Aug 2, 2021 · Big Data

How to Build a Beginner Hadoop Cluster on CentOS 7

This article introduces Apache Hadoop’s open‑source framework, explains its core components such as HDFS, MapReduce, ZooKeeper, HBase, Hive, Pig, Mahout, Sqoop, Flume, Chukwa, Oozie, Ambari and YARN, and outlines the steps to set up a beginner‑level Hadoop cluster on CentOS 7.

Big Data · CentOS 7 · HBase
0 likes · 11 min read
Big Data Technology & Architecture
Aug 2, 2021 · Big Data

Comprehensive Big Data Interview Question Guide for Major Tech Companies

This article compiles extensive interview questions and topics covering Hadoop, Spark, Flink, Hive, Kafka, MySQL, Redis, Java fundamentals, and algorithms, organized by companies such as Xiaomi, ByteDance, Alibaba, Shopee, Tencent, Meituan, NetEase, and Baidu, to help candidates prepare effectively for big‑data engineering roles.

Big Data · Flink · Hadoop
0 likes · 22 min read
ByteDance SE Lab
Jul 30, 2021 · Operations

Inside Salesforce’s Global Outage: What Went Wrong and How to Prevent It

The article examines Salesforce’s five‑hour global outage caused by a shortcut DNS deployment and the subsequent recovery challenges, then explores a viral experiment where twenty smartphones generated artificial traffic congestion, illustrating how real‑time data feeds and operational safeguards can prevent large‑scale service disruptions.

Big Data · Cloud Computing · Incident Management
0 likes · 7 min read
JD Tech
Jul 30, 2021 · Databases

Practical Use of HBase in a Logistics HR Data Preprocessing Platform

This article details how the logistics HR data preprocessing platform processes around 20 million daily records by adopting HBase for high‑performance, scalable, column‑oriented storage, covering its architecture, read/write mechanisms, best practices, and performance considerations.

Big Data · HBase · NoSQL
0 likes · 10 min read
DataFunTalk
Jul 29, 2021 · Big Data

Real-Time Data Warehouse Construction at TAL Using DorisDB

This article details TAL's transition from offline to real-time data warehousing, describing business drivers, pain points, architectural evolution through Hive, Flink+Kudu, and DorisDB, and outlining the system design, data flow, scheduling, monitoring, and the resulting business and cost benefits.

Airflow · Big Data · DorisDB
0 likes · 14 min read
Airbnb Technology Team
Jul 29, 2021 · Big Data

Airbnb’s Data Quality Improvement Plan: Organizational, Architectural, and Governance Practices

Airbnb’s 2019 Data Quality Improvement Plan reorganized its data‑engineering workforce, introduced a dedicated data‑engineer role, adopted a decentralized Minerva‑based architecture with Spark pipelines, instituted rigorous testing, governance, and certification processes, and established SLAs and monitoring to ensure timely, trustworthy, well‑documented data across the enterprise.

Airbnb · Big Data · Data Architecture
0 likes · 13 min read
DataFunTalk
Jul 27, 2021 · Big Data

Building a Real‑Time Data Warehouse with Apache Doris at Shuhai Supply Chain

This article describes how Shuhai Supply Chain upgraded its data warehouse from a complex, high‑cost 1.0 architecture to a streamlined, real‑time solution built around Apache Doris, detailing the motivations, design choices, zero‑code ingestion, metadata management, Flink connector, and the resulting performance gains.

Apache Doris · Big Data · Flink
0 likes · 13 min read
Big Data Technology Architecture
Jul 27, 2021 · Big Data

Key Components of the Big Data Ecosystem: Hadoop, Hive, HBase, Spark, Kafka, and Elasticsearch

This article introduces the most important and still mainstream components of the big data ecosystem—including Hadoop’s storage and compute framework, Hive data warehouse, HBase NoSQL database, Spark unified engine, Kafka messaging platform, and Elasticsearch search engine—explaining their core concepts, architectures, and typical use cases.

Big Data · Elasticsearch · HBase
0 likes · 9 min read
DataFunTalk
Jul 26, 2021 · Big Data

Accelerating Hive Daily Tables with Flink: A SmartNews Case Study

This article describes how SmartNews integrated Flink into its Airflow‑driven Hive batch pipeline to cut the actions table generation latency from three hours to about thirty‑four minutes, detailing the technical challenges, design decisions, and production results.

Big Data · Flink · Hive
0 likes · 12 min read
dbaplus Community
Jul 21, 2021 · Big Data

Youzan’s Blueprint: Data Governance, Quality Scoring, and Cost Reduction for AI

At Youzan, data governance evolves from massive data assets to AI readiness through systematic data assetization, quantitative quality scoring, cost measurement, and targeted operational tactics, enabling precise quality monitoring, cost allocation, and continuous improvement that drive both data value and cost efficiency.

AI readiness · Big Data · cost optimization
0 likes · 18 min read
Tencent Cloud Developer
Jul 21, 2021 · Big Data

Bloom Filter: Introduction, Theory, Construction, Query, and Applications

The article explains Bloom filters—a probabilistic, space‑efficient data structure using multiple hash functions on a bit array to answer set‑membership queries with controllable false‑positive rates, detailing their construction, query process, optimal parameters, and common uses such as URL deduplication, cache protection, and spam filtering.
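
The construction and query steps named in this summary can be sketched roughly as follows. This is a minimal illustration, not the article's own code: the class name `BloomFilter`, the use of SHA-256 with double hashing to derive the bit positions, and the example URL are all assumptions made here. The sizing uses the standard formulas m = -n·ln(p)/(ln 2)² for the bit-array length and k = (m/n)·ln 2 for the number of hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter; names and hashing scheme are assumptions."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m bits and k hash functions for n items at rate p.
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        # Construction: set every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # Query: "possibly present" only if all of the item's bits are set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# URL deduplication, one of the uses the summary mentions (hypothetical URL).
seen = BloomFilter(expected_items=1000, false_positive_rate=0.01)
seen.add("https://example.com/a")
print("https://example.com/a" in seen)  # True: added items are never reported missing
```

A membership test on an item that was never added usually returns False but can occasionally return True; that is the false-positive rate the sizing parameters control.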

Big Data · Cache Optimization · bloom-filter
0 likes · 8 min read
IT Architects Alliance
Jul 20, 2021 · Big Data

Understanding Data Middle Platform: Layers, Architecture, and Implementation Methodology

The article explains the concept of a data middle platform, detailing its three-layer structure—data model, data service, and data development—illustrates how data modeling enables cross-domain integration, how services encapsulate data for flexible consumption, and how development tools support customized data applications, using a telecom operator example.

Big Data · Data Architecture · Data Platform
0 likes · 2 min read
Huawei Cloud Developer Alliance
Jul 20, 2021 · Backend Development

From Non‑Tech Student to Cloud MVP: Go, AI, and Startup Insights

In this interview, Huawei Cloud MVP Wang Ming shares how a non‑computer‑science background led him to a successful IT career, discusses the advantages of interdisciplinary skills, offers entrepreneurship advice, predicts future tech trends, and explains the key concepts of his popular Go concurrency book.

Artificial Intelligence · Big Data · Entrepreneurship
0 likes · 7 min read
Xianyu Technology
Jul 20, 2021 · Big Data

Design and Implementation of a Content Flow Control System for Xianyu Community

The Xianyu “Play” tab flow‑control system combines task‑specific and rule‑based strategies with a dynamic strategy‑, control‑, and distribution‑chain architecture that integrates real‑time data processing into the recommendation engine, delivering guaranteed exposure, boosting daily posts by 14.4% and paving the way for multi‑objective, zero‑code control.

Big Data · Flow Control · Real-time Streaming
0 likes · 6 min read
21CTO
Jul 18, 2021 · Databases

Why Your MySQL Queries Are Slow and How ElasticSearch & HBase Can Help

This article examines common causes of slow MySQL queries, explains index mechanics and failures, then compares ElasticSearch’s fast tokenized search and HBase’s column‑oriented storage, offering practical guidance on when and how to use each technology.

Big Data · Database Performance · HBase
0 likes · 21 min read
Open Source Linux
Jul 17, 2021 · Big Data

Master Kafka Basics: Topics, Partitions, Producers & Consumers Explained

This article provides a clear, visual guide to Kafka’s core concepts—including producers, consumers, topics, partitions, consumer groups, message ordering, and the underlying ZooKeeper‑managed cluster architecture—helping readers grasp how Kafka enables reliable, scalable stream processing.

Big Data · Consumers · Partitions
0 likes · 6 min read
Architects' Tech Alliance
Jul 15, 2021 · Cloud Computing

Edge Computing: Challenges, Research Focus, and Related Paradigms

The article explains edge computing as a decentralized computing model that addresses high‑reliability, low‑latency demands, data‑center energy consumption, big‑data processing pressure, low resource utilization, intelligent front‑ends, and security‑privacy concerns, and it outlines key research areas and related paradigms such as fog, mobile edge, sea, and intelligent edge computing.

Big Data · Fog Computing · edge computing
0 likes · 8 min read
Xianyu Technology
Jul 13, 2021 · Big Data

Design and Implementation of Xianyu Real-Time Data Warehouse

To meet Xianyu’s billion‑event‑per‑day real‑time analysis needs, the team built a petabyte‑scale warehouse using Hologres for storage and Alibaba‑enhanced Flink (Blink) for streaming, organized into ODS, DWD, DWS, ADS and DIM layers, enabling minute‑level aggregations, rapid anomaly detection, and instant product‑team insights.

Big Data · Hologres · Stream Processing
0 likes · 12 min read
dbaplus Community
Jul 11, 2021 · Big Data

Scaling Real‑Time & Offline Analytics with Druid: Architecture, Optimizations, and Lessons

This article explains how Beike adopted the Druid OLAP engine to handle massive real‑time and offline query workloads, detailing its four‑component architecture, key technologies such as deep storage and metadata storage, practical optimizations for data ingestion, query caching, dynamic throttling, timeout control, and a roadmap for future enhancements.

Big Data · Druid · OLAP
0 likes · 19 min read
Tech Musings
Jul 8, 2021 · Big Data

Building a Simple Single-Node MapReduce System: From Theory to Code

This article walks through implementing a lightweight single‑machine MapReduce framework inspired by the original MapReduce paper, covering the abstract Map/Reduce model, task scheduling between master and workers, core Go code for map, reduce, worker, and coordinator, and a brief reflection on its limitations.

Big Data · Distributed Systems · Lab
0 likes · 10 min read
DataFunTalk
Jul 7, 2021 · Big Data

Solving Data Island Challenges and Enabling Advanced OLAP Analysis on Heterogeneous Big Data Platforms – Kyligence Solution Overview

This article explains the growing analytical demands in the big‑data era, the limitations of traditional OLAP, and how Kyligence’s distributed OLAP engine addresses data‑island issues, multi‑dimensional and many‑to‑many analysis, unified security, and performance optimization with MDX on Spark, delivering a seamless Excel‑like experience.

Analytics · Big Data · Data Integration
0 likes · 9 min read
dbaplus Community
Jul 4, 2021 · Big Data

How Didi Scales MySQL‑to‑Hive Sync with Real‑Time Binlog Capture

This article explains Didi's end‑to‑end architecture for ingesting MySQL data into Hive using real‑time Binlog collection, a customized Canal component, message queues, HDFS storage, Dquality monitoring, and strategies for handling data drift and sharding in large‑scale big‑data environments.

Big Data · Canal · Hive
0 likes · 13 min read
TAL Education Technology
Jul 1, 2021 · Big Data

Optimization of A/B Test Metric Computation Using Spark and ClickHouse

This article details the design and multi‑stage optimization of an A/B testing metric system, describing its product architecture, Spark‑based computation engine, ClickHouse OLAP layer, cumulative calculation improvements, and batch processing techniques that reduced processing time from hours to a few minutes for hundreds of experiments and metrics.

A/B testing · Big Data · ClickHouse
0 likes · 8 min read
Architect
Jul 1, 2021 · Big Data

Data Governance Practices at Meituan Hotel Travel Platform

This article presents a comprehensive case study of Meituan's hotel‑travel data governance, covering the background, challenges, strategic goals, standardized processes, technical systems, cost and security optimizations, measurable outcomes, and future plans for automated governance.

Big Data · Data Quality · Data Security
0 likes · 29 min read
Youzan Coder
Jun 30, 2021 · Big Data

Online Monitoring Practices for Offline and Real-Time Data at Youzan

Youzan Data Report Center monitors offline batch and real‑time data pipelines using accuracy and timeliness rules, cross‑table checks, upstream‑downstream comparisons, and scheduled alerts to detect anomalies early; since 2021 it has generated over 25 alerts, and plans a unified data‑quality dashboard.

Big Data · Data Quality · Flink
0 likes · 12 min read
JD Retail Technology
Jun 29, 2021 · Big Data

The Value of Data and Data Products: From Concept to Practice

This article explains how data has become a critical production resource, outlines the limitations of traditional data‑analysis workflows, defines data products and their components, describes their advantages and key characteristics, and shares practical case studies of data‑product implementations in a large e‑commerce environment.

Big Data · Data Analysis · Data Product
0 likes · 16 min read
DataFunTalk
Jun 26, 2021 · Big Data

Building a Scalable Big Data Service System at Didi: Practices and Lessons

Zhang Liang shares Didi's four-stage journey of constructing and governing large‑scale open‑source big‑data engine services—including engine selection, hardware sizing, PaaS platform building, proxy architecture, and governance—highlighting practical challenges, solutions, and ROI‑driven best practices for Kafka, Elasticsearch, Flink, and related technologies.

Big Data · Data Infrastructure · Elasticsearch
0 likes · 16 min read
Laravel Tech Community
Jun 25, 2021 · Big Data

Apache Kudu 1.15.0 – New Features and Improvements

Apache Kudu 1.15.0 adds experimental multi‑row transaction support (currently INSERT and INSERT_IGNORE), Raft‑based master configuration tools, table comment synchronization with Hive Metastore, per‑table size and row‑count limits configurable via flags or the kudu table set_limit tool, a customizable Kerberos principal flag, and TLS v1.3 with optional cipher‑suite selection, collectively enhancing low‑latency random access and analytical capabilities in the Hadoop ecosystem.

Apache Kudu · Big Data · Hadoop
0 likes · 3 min read
Yuewen Technology
Jun 25, 2021 · Big Data

Building Yuewen Group’s Overseas Big Data Platform: Architecture, Offline & Real‑Time Processing

This article details how Yuewen Group designed and implemented an overseas big data platform, covering overall system architecture, offline data‑warehouse construction with dimensional modeling, real‑time streaming using Oceanus and ClickHouse, and future plans for cost reduction and data quality assurance.

Big Data · Cloud Computing · Real-time Processing
0 likes · 12 min read
Architecture Digest
Jun 24, 2021 · Big Data

Kuaishou's Big Data Service Platform: Architecture, Key Technologies, and Future Outlook

This article introduces Kuaishou's effort to turn its data platform into a service, outlining the challenges data engineers faced, the platform's architecture and key technologies such as configuration‑driven development, multi‑mode APIs, data acceleration, and high‑availability mechanisms, and concludes with a summary of achievements and future directions.

Big Data · Data Acceleration · Data Platform
0 likes · 12 min read
DevOps
Jun 22, 2021 · Operations

Building Digital Champion Capabilities: Integrating Customer Solutions, Operations, Technology, and Talent Ecosystems

The article outlines how digital‑champion enterprises achieve superior performance by integrating four core ecosystems—customer solutions, operations, technology, and talent—through strategic planning, partnership, and advanced technologies such as AI, big data, and industrial IoT, while highlighting maturity stages and practical implementation steps.

Artificial Intelligence · Big Data · Digital Transformation
0 likes · 28 min read
DataFunTalk
Jun 21, 2021 · Big Data

Flink + Iceberg 0.11 Practices in Qunar Data Platform

This article shares Qunar's experience using Flink together with Apache Iceberg 0.11 to address real‑time data warehouse challenges, covering background pain points, Iceberg architecture, solutions for Kafka data loss and Hive latency, and optimization practices such as small‑file handling, sorting, and checkpoint management.

Big Data · Data Lake · Flink
0 likes · 13 min read
Architecture Digest
Jun 21, 2021 · Databases

Using HBase for HR Performance Data Preprocessing Platform: Architecture, Concepts, and Best Practices

This article introduces the HR performance data preprocessing platform’s requirements, explains why HBase was selected as the storage solution, details its core concepts, architecture, data write/read processes, best practices, limitations, and presents performance metrics demonstrating its suitability for large‑scale, high‑throughput workloads.

Big Data · Database Architecture · HBase
0 likes · 12 min read
DataFunTalk
Jun 20, 2021 · Databases

Xiaohongshu’s OLAP Architecture Evolution and DorisDB Adoption

This article details Xiaohongshu’s multi‑stage evolution of its OLAP infrastructure—from Redshift to Presto, ClickHouse, and finally DorisDB—describing the data pipeline, tool comparisons, advertising use‑case implementation, and the resulting performance and operational benefits.

Big Data · ClickHouse · DorisDB
0 likes · 12 min read
ITFLY8 Architecture Home
Jun 20, 2021 · Big Data

Why HBase Is the Ideal Choice for Large‑Scale HR Data Preprocessing

This article explains how HBase’s distributed column‑oriented architecture, high‑performance read/write capabilities, and flexible schema make it a cost‑effective solution for handling massive, unstructured HR performance data, covering its core concepts, cluster operation, best practices, and performance metrics.

Big Data · HBase · data preprocessing
0 likes · 11 min read
DevOps
Jun 16, 2021 · Operations

Understanding Digital Transformation: Definitions, Strategic Questions, Drivers, Frameworks, Roadmaps, Benefits and Pitfalls

The article provides a comprehensive overview of digital transformation, covering its definition, essential strategic questions, key drivers such as customer expectations, cloud and AI, priority areas in the value chain, practical frameworks, roadmap steps, expected benefits and common reasons for failure.

Artificial Intelligence · Big Data · Digital Transformation
0 likes · 20 min read
IT Architects Alliance
Jun 15, 2021 · Industry Insights

How Cloud Computing, Big Data, and AI Intertwine to Power Modern Services

This article explains the evolution of cloud computing from resource management to elastic virtualization, the emergence of IaaS, PaaS and SaaS service models, how big‑data processing relies on distributed cloud platforms, and why artificial intelligence now depends on massive data and cloud‑scale compute to deliver intelligent services.

Artificial Intelligence · Big Data · Cloud Computing
0 likes · 37 min read
Baidu Geek Talk
Jun 15, 2021 · Industry Insights

What Baidu Unveiled at QCon 2021: Key Takeaways from 7 Cutting‑Edge Sessions

This article compiles Baidu experts' presentations at QCon 2021, covering unified quality‑efficiency delivery for feed recommendation, software engineering capabilities, AIOps fault‑management practices, Apache Doris real‑time analytics, large‑scale Service Mesh deployment, massive service‑governance techniques, and deep‑learning platform innovations, with speaker details and audience benefits.

AI · Baidu · Big Data
0 likes · 12 min read
Big Data Technology & Architecture
Jun 10, 2021 · Big Data

User Profiling: Concepts, Tag Classification, Tag‑System Construction, Applications and Implementation Steps

This article provides a comprehensive overview of user profiling, covering its definition, the five‑dimensional framework (goal, method, organization, standards, validation), various tag classifications, tag‑system architecture, modeling techniques, practical applications such as precise marketing and product innovation, and a step‑by‑step guide for building a profiling system using big‑data and AI methods.

Big Data · Customer Segmentation · data tagging
0 likes · 24 min read
Architecture Digest
Jun 10, 2021 · Big Data

NetEase Game Streaming ETL Architecture and Practices Based on Flink

This article presents NetEase Game's streaming ETL solution built on Flink, covering business background, log characteristics, specialized and generic ETL services, architectural evolution, Python UDF integration, runtime optimizations, fault‑tolerance mechanisms, and future roadmap for unified real‑time and offline data warehouses.

Big Data · Flink · Log Processing
0 likes · 19 min read
58 Tech
Jun 9, 2021 · Big Data

Designing and Implementing a Unified Data Metric System for 58 Commercial Data Team

This article explains how 58's commercial data team built a comprehensive data metric system—from identifying common metric definition issues to establishing a domain‑driven hierarchy, distinguishing atomic and derived metrics, implementing a unified metric management platform, and providing APIs and examples for querying and visualizing metrics.

Big Data · Java · data governance
0 likes · 17 min read
Xianyu Technology
Jun 8, 2021 · Big Data

Longgong Data Analysis Platform: Architecture and Solutions for Large‑Scale Structured Data

The Longgong Data Analysis Platform enables Xianyu to capture, store, and analyze billions of structured product attributes in real time across more than 8,000 categories, using TableStore, MySQL, ODPS, and a distributed scheduler to achieve over 50% query speedup, 80% category coverage, and rapid support for search and recommendation teams.

Alibaba · Big Data · Data Platform
0 likes · 9 min read
Alibaba Cloud Developer
Jun 8, 2021 · Artificial Intelligence

Can Low‑Code Bridge the Gap Between Business and AI? Insights on Its Future

The article explores how low‑code platforms can complement traditional algorithm development, enhance collaboration between business users and engineers, and accelerate big‑data and AI initiatives by improving data cleaning, modular design, and feedback loops, while highlighting the trade‑offs of abstraction and flexibility.

AI · Algorithm Development · Big Data
0 likes · 9 min read
Big Data Technology & Architecture
Jun 6, 2021 · Big Data

Understanding Data Warehouses: Concepts, Architecture, Modeling, and Governance

This article provides a comprehensive overview of data warehouses, explaining their purpose, differences from databases, OLTP vs OLAP, traditional versus internet data warehouse models, layered architecture, modeling theories, metric dictionaries, date dimensions, naming conventions, data governance, and incremental synchronization techniques with practical SQL examples.

Big Data · ETL · data governance
0 likes · 24 min read
DataFunTalk
Jun 6, 2021 · Big Data

Understanding Apache Pulsar: Cloud‑Native Messaging, Storage‑Compute Separation, and Batch‑Stream Fusion with Flink

This article explains Apache Pulsar’s cloud‑native, storage‑compute separated architecture, its data model and scalability features, and how it integrates with Flink to provide a unified platform for both real‑time streaming and batch processing in big‑data applications.

Apache Pulsar · Batch-Stream Integration · Big Data
0 likes · 17 min read
DataFunTalk
Jun 5, 2021 · Big Data

Building and Evolving a Data Service Platform for NetEase Cloud Music

The article details how NetEase Cloud Music co‑built a unified data service platform with NetEase YouShu, describing its architecture, phased development from internal use to online high‑concurrency services, feature enhancements such as API marketplace, multi‑source support, parameter conversion, and future roadmap for broader data products.

API Platform · Backend · Big Data
0 likes · 16 min read
dbaplus Community
Jun 5, 2021 · Big Data

How Flink + Iceberg Transform Data Lakes for Real‑Time Streaming

This article explains the concept of data lakes, outlines a four‑layer open‑source architecture, presents several classic Flink‑Iceberg use cases, details why Iceberg was chosen, and describes the design of Flink’s streaming sink and upcoming community roadmap.

Apache Flink · Apache Iceberg · Big Data
0 likes · 14 min read
MaGe Linux Operations
Jun 3, 2021 · Big Data

Why Kafka Handles Billions of Messages: Architecture, Use Cases, and Fast Performance

This article introduces Kafka, LinkedIn’s high‑throughput distributed messaging system, explains its core concepts such as brokers, topics, partitions, offsets, producers, consumers, and consumer groups, outlines common use cases like asynchronous decoupling and data‑stream processing, and details its fast performance mechanisms, fault‑tolerance, installation, and configuration steps.

Big Data · Data Streaming · Installation
0 likes · 11 min read
dbaplus Community
Jun 2, 2021 · Databases

How to Build a Mature Data Warehouse: 7 Essential Steps and Best Practices

This article explains why data warehouses are critical for decision‑making, outlines the challenges of immature warehouses, and provides a step‑by‑step framework—including goal setting, technology selection, problem identification, domain modeling, layer design, modeling principles, and governance standards—to help teams build a robust, maintainable data warehouse.

Big Data · Data Architecture · Data Warehouse
0 likes · 22 min read
Big Data Technology Architecture
Jun 2, 2021 · Big Data

Practical Operations of NetEase Big Data Platform: Architecture, EasyOps, Monitoring, and Experience Sharing

The presentation details NetEase's big data platform operations, covering current usage, the internally built EasyOps control system, a generic service‑operation framework based on Ansible, Prometheus‑Grafana monitoring, configuration management, network and storage optimizations, and lessons learned from cloud migration.

Ansible · Big Data · EasyOps
0 likes · 9 min read
Tencent Advertising Technology
Jun 2, 2021 · Big Data

Tencent Advertising Real-Time Strategy Data Framework: Architecture, Performance, and High Availability

The article presents a detailed overview of Tencent Advertising's real‑time strategy data framework, explaining its role in the ad system, the challenges of massive log volumes, and the architectural, performance, and high‑availability solutions implemented to achieve fast, reliable, and scalable ad decision making.

Big Data · Distributed Systems · Real-Time Strategy
0 likes · 24 min read
dbaplus Community
Jun 1, 2021 · Big Data

How Didi Boosted SQL Performance by 40%: Migrating 10k Hive Jobs to Spark

Didi migrated over 10,000 Hive SQL tasks to Spark SQL, achieving 85% Spark task share, cutting execution time by 40%, and reducing CPU and memory usage by 21% and 49% respectively, through a systematic migration process that addressed syntax, UDF, performance, and functional differences between the two engines.

Big Data · Hive · Performance Optimization
0 likes · 20 min read
Qunar Tech Salon
Jun 1, 2021 · Big Data

Integrating TensorFlow for Java with Spark‑Scala for Distributed Machine Learning Prediction

This article shares practical experience in building a high‑performance distributed prediction service by combining TensorFlow for Java with Spark‑Scala, covering framework selection, performance comparison, model training, loading, inference, deployment, and optimization techniques for large‑scale data processing.

Big Data · Java · Performance Optimization
0 likes · 16 min read
Top Architect
May 31, 2021 · Databases

How to Achieve Fast Queries: MySQL Index Optimization, Large‑Table Strategies, Elasticsearch Basics, and HBase Overview

This article explains common causes of slow MySQL queries, how proper indexing and lock handling can improve performance, introduces Elasticsearch’s inverted‑index advantages and suitable use cases, and outlines HBase’s column‑family storage model and row‑key design for large‑scale data.

Big Data · Database Optimization · HBase
0 likes · 18 min read
IT Architects Alliance
May 30, 2021 · Big Data

NetEase Game Streaming ETL Architecture and Practices Based on Flink

This article presents NetEase Game's Flink‑based streaming ETL system, detailing business background, log classifications, specialized and generic ETL services, Python UDF integration, runtime optimizations, HDFS write tuning, SLA metrics, fault‑tolerance mechanisms, and future roadmap for unified data lakes and PyFlink support.

Big Data · Data Integration · ETL
0 likes · 19 min read
DataFunTalk
May 28, 2021 · Artificial Intelligence

JD's Open‑Source Federated Learning Solution 9N‑FL: Architecture, Features, Timeline, and Business Impact

This article introduces JD's open‑source federated learning platform 9N‑FL, explaining the data‑island problem, the fundamentals and classifications of federated learning, its four key features, the system’s layered architecture, development timeline, real‑world advertising use case results, and future enhancements.

9N-FL · Big Data · Data Security
0 likes · 15 min read
58 Tech
May 28, 2021 · Big Data

Practical Upgrade Experience of Hadoop 3.2.1 in 58.com Data Platform: HDFS, YARN, and MR3

This article details the end‑to‑end upgrade of a 5000‑node Hadoop 2.6.0 cluster to Hadoop 3.2.1 at 58.com, covering HDFS migration, RBF and EC adoption, YARN federation and rolling upgrades, MR3 integration, extensive compatibility testing, and operational lessons learned for large‑scale big‑data platforms.

Big Data · Cluster Upgrade · HDFS
0 likes · 19 min read
dbaplus Community
May 27, 2021 · Big Data

How Vipshop Scales Billion‑Row OLAP with ClickHouse, Presto, and Flink

This article details Vipshop's OLAP evolution, describing how Presto, Kylin, and ClickHouse are integrated, the deployment architecture with HAProxy and chproxy, containerization on Kubernetes, and the Flink‑ClickHouse pipeline that enables self‑service analysis of hundred‑billion‑row datasets while addressing performance challenges and future roadmap.

Big Data · ClickHouse · Data Warehouse
0 likes · 28 min read
Tencent Cloud Developer
May 27, 2021 · Big Data

An Introduction to Kafka: Architecture, Core Components, Service Governance, Performance Optimizations, and Installation Guide

Kafka is a high‑throughput distributed publish‑subscribe system that uses brokers, topics, partitions, offsets, producers, consumers, and ZooKeeper for metadata and leader election. It offers fast sequential disk writes, page‑cache zero‑copy transfers, and ISR‑based replication, and the article includes step‑by‑step installation of the JDK, ZooKeeper, and Kafka.

Big Data · Distributed Messaging · Installation
0 likes · 11 min read
IT Architects Alliance
May 25, 2021 · Big Data

How Modern Data Middle Platforms Power Real‑Time and Offline Analytics

This article provides a comprehensive technical overview of data middle platforms, covering data aggregation, offline and real‑time development, smart operations, data asset management, governance, service layers, platform implementations, warehouse layering, and key differences between offline and real‑time data warehouses.

Big Data · Data Platform · Data Warehouse
0 likes · 26 min read
Alibaba Terminal Technology
May 25, 2021 · Frontend Development

Inside Alibaba’s Front‑End Visualization Showcase: Insights from CSIG’s Campus‑to‑Enterprise Event

The CSIG Visualization and Visual Analysis Committee’s visit to Alibaba’s Xixi Campus on May 21, 2021 brought together leading academics and industry experts to discuss graph data, big‑data research, spatio‑temporal data, low‑code design, and cutting‑edge visualization techniques, fostering deep industry‑academia collaboration.

Big Data · industry‑academia · low‑code
0 likes · 7 min read
Full-Stack Internet Architecture
May 25, 2021 · Backend Development

Comprehensive Interview Experience Summary and Preparation Guide for Major Tech Companies

This article compiles detailed interview experiences, question lists, and practical advice for candidates targeting backend, big‑data, and cloud positions at leading Chinese tech firms, offering timelines, personal background, preparation tips, and reflections to help job seekers navigate multi‑round technical interviews efficiently.

Big Data · Interview · System Design
0 likes · 28 min read
Architects Research Society
May 23, 2021 · Big Data

Data Architecture Trends: From Chaos to an Organized Era – Insights from Anthony J. Algmin

The article reviews Anthony J. Algmin’s reflections on past data‑architecture predictions, current hot topics such as cloud, AI/ML, data governance, and real‑time analytics, and forecasts future trends including metadata management, blockchain, and the evolving role of data architects within enterprises.

Artificial Intelligence · Big Data · Data Architecture
0 likes · 13 min read
DataFunTalk
May 22, 2021 · Databases

Combining HBase and Elasticsearch: Challenges and the Lindorm Searchindex Solution

The article examines the strengths and weaknesses of combining HBase and Elasticsearch for massive data storage and retrieval, outlines three integration patterns and their challenges, and presents Alibaba Cloud's Lindorm Searchindex as a SQL‑driven, low‑cost, strongly consistent solution that simplifies development and improves performance.

Big Data · Elasticsearch · HBase
0 likes · 11 min read
DeWu Technology
May 22, 2021 · Big Data

Unified Semantic Layer for Data Development: Addressing Pain Points and Optimizing Queries

A unified semantic layer for data development solves metric‑change ripple effects, developer burden, and large‑scale query performance problems by offering consistent metric definitions, multi‑view access, concise auto‑generated SQL, instant propagation of updates, and engine‑driven optimal query selection, thereby bridging business and engineering and cutting maintenance effort.

Big Data · Data engineering · OLAP
0 likes · 5 min read
Top Architect
May 22, 2021 · Big Data

Kafka Basics: Topics, Partitions, Producers, Consumers, and Cluster Architecture

This article provides a comprehensive introduction to Kafka, covering its role as a message system, core concepts such as topics, partitions, producers, consumers, messages, the cluster architecture with replicas and controllers, performance optimizations, log segmentation, and network design, all illustrated with diagrams and code examples.

Big Data · Kafka · Message queue
0 likes · 13 min read
Programmer DD
May 22, 2021 · Big Data

What Is a Data Lake? Origins, Architecture, and How It Powers Modern Big Data

This article explains the concept of a data lake—its origin in 2011, how it differs from traditional databases and data warehouses, its core characteristics such as raw data storage, on‑demand computing, and schema‑on‑read, as well as its advantages, challenges, architectural components, and future outlook within the big‑data ecosystem.

Big Data · Data Architecture · Data Lake
0 likes · 20 min read
IT Architects Alliance
May 22, 2021 · Big Data

Flink-Based Real‑Time Recommendation System: Architecture, Logic, and Docker Deployment Guide

This article presents a comprehensive walkthrough of a Flink‑powered recommendation system, detailing its v2.0 architecture, module functions, recommendation algorithms (hotness, product similarity, collaborative filtering), front‑end and back‑end UI, and step‑by‑step Docker deployment of MySQL, Redis, HBase, and Kafka services.

Big Data · Docker · Flink
0 likes · 11 min read
NetEase Game Operations Platform
May 22, 2021 · Big Data

Comprehensive Overview and Source Code Analysis of NetEase Spark Kyuubi

This article systematically introduces NetEase Kyuubi, an open‑source high‑performance JDBC and SQL execution engine built on Apache Spark, covering its background, core architecture, service discovery, session and operation management, startup processes, and key source‑code implementations with detailed code examples.

Apache Thrift · Big Data · Distributed computing
0 likes · 47 min read
Tencent Cloud Developer
May 21, 2021 · Big Data

Tencent Cloud Oceanus: Flink SQL Optimization and Extension Practices

Tencent Cloud Oceanus is a computing service that powers internal apps like WeChat and external partners such as Bilibili, and scales to over 30,000 cores handling 5 PB daily and 500,000 jobs. It tackles Flink SQL’s syntax, function, and operational limits with table‑valued functions, incremental and enhanced tumble windows, and caching‑based retraction optimization that cuts downstream data volume by up to 30× and improves join performance by about 20%.

Big Data · Flink SQL · Oceanus
0 likes · 19 min read
UCloud Tech
May 21, 2021 · Big Data

How US3 Hadoop Adapter Cuts Big Data Storage Costs and Boosts Performance

This article explains how UCloud's US3 object storage, combined with a custom Hadoop adapter, separates compute and storage, optimizes file system operations, and leverages caching and specialized APIs to dramatically reduce storage costs and improve read/write performance for large‑scale Hadoop workloads.

Big Data · Cache · Hadoop
0 likes · 13 min read
iQIYI Technical Product Team
May 21, 2021 · Big Data

Design and Implementation of iQIYI's User Feedback Analysis System

iQIYI built an in‑house user‑feedback analysis system that automatically ingests multi‑channel data, classifies and clusters issues, assesses feedback quality, localizes problems, and streamlines repair closure, boosting recall accuracy, alarm precision, closure rates and reducing cycle time across business lines to enhance user experience.

AI · Big Data · classification
0 likes · 15 min read
Byte Quality Assurance Team
May 19, 2021 · Big Data

Streaming 102: The World Beyond Batch

This article extends the concepts introduced in Streaming 101 by deeply exploring data processing patterns for unbounded data, covering windowing, watermarks, triggers, accumulation modes, and their practical implications for building robust low‑latency streaming pipelines.

Big Data · Streaming · Triggers
0 likes · 14 min read
Big Data Technology & Architecture
May 19, 2021 · Big Data

Comprehensive Guide to Data Governance: Metadata, Data Quality, Standards, and Asset Management

This article provides an extensive overview of data governance in the big‑data era, covering common pitfalls, the role of metadata, data quality management, data standardization, and data asset management, and offers practical recommendations for organizations to implement effective governance practices.

Big Data · Data Asset Management · Data Quality
0 likes · 42 min read
Tencent Cloud Developer
May 19, 2021 · Industry Insights

How Cloud‑Native Principles Transform Big Data Infrastructure

The article analyzes how cloud‑native concepts such as DevOps, micro‑services, continuous delivery, and containerization can be applied to big‑data foundations, outlining four guiding principles—industrialized delivery, cost quantification, load‑adaptive scaling, and data‑centric design—and describing concrete Hadoop‑based architectures and Tencent Cloud solutions that lower cost while boosting performance.

Big Data · Data Infrastructure · Hadoop
0 likes · 22 min read
UCloud Tech
May 18, 2021 · Big Data

Step‑by‑Step Guide to Deploy UCloud’s Free USDP for Big Data

This article provides a comprehensive tutorial on installing UCloud's free USDP version for private big‑data deployments, covering environment preparation, minimum node specifications, resource download, configuration files, one‑click initialization scripts, server startup, web UI access, license acquisition, and optional manual setup procedures.

Big Data · Linux · UCloud
0 likes · 16 min read
Alibaba Cloud Native
May 17, 2021 · Big Data

How Vineyard Accelerates Cloud‑Native Big Data Workflows with Zero‑Copy Memory Sharing

Vineyard, an open‑source distributed memory data‑sharing engine, tackles the inefficiencies of traditional file‑system based big‑data pipelines by enabling zero‑copy, in‑memory object exchange, Kubernetes‑aware scheduling, and plug‑in operators, delivering up to 1.34× faster end‑to‑end execution.

Big Data · Memory Sharing · Vineyard
0 likes · 10 min read
Beijing SF i-TECH City Technology Team
May 17, 2021 · Artificial Intelligence

AIOps Overview: Concepts, Applications, and Case Studies

This article provides a comprehensive overview of AIOps, covering its definition, evolution from manual to AI-driven operations, core capabilities, and real-world applications in capacity prediction, anomaly detection, and alarm merging, illustrated with case studies from a food‑retail giant and internal logistics.

Anomaly Detection · Artificial Intelligence · Big Data
0 likes · 13 min read
Architecture Digest
May 17, 2021 · Big Data

Technical Architecture Overview of Toutiao: Data Pipeline, User Modeling, Recommendation System, and Microservices

The article provides a comprehensive technical overview of Toutiao's rapid growth, detailing its massive user base, data collection and processing pipelines, user modeling, cold‑start strategies, recommendation engines, storage solutions, push notification mechanisms, and the underlying microservice and PaaS architecture.

Big Data · Hadoop · Kafka
0 likes · 8 min read