Tag: big data

Articles collected around this technical topic.

Architect's Guide
Jun 14, 2025 · Big Data

Mastering Data Warehouse Design: From Fact Tables to Dimensional Modeling

This article explains the core components of a data warehouse ecosystem, distinguishes fact and dimension tables, outlines synchronization strategies, introduces star, snowflake, and constellation schemas, and details the layered architecture from ODS to data marts for effective big‑data analytics.
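
The fact/dimension split this article covers can be sketched as a minimal star schema, here in SQLite (table and column names are illustrative, not taken from the article):

```python
import sqlite3

# Minimal star schema: one fact table referencing two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL            -- additive measure
);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_date VALUES (10, '2025-06-14');
INSERT INTO fact_sales VALUES (1, 10, 9.5), (2, 10, 20.0), (1, 10, 3.0);
""")

# A typical dimensional query: aggregate the fact table, group by a dimension.
rows = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)  # [('gadget', 20.0), ('widget', 12.5)]
```

A snowflake schema would further normalize the dimension tables; a constellation shares dimensions across several fact tables.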

ETL · big data · data warehouse
0 likes · 15 min read
Sohu Tech Products
Jun 11, 2025 · Big Data

How We Transformed a Microservice Finance System into a Scalable Big Data Warehouse

This article details the evolution of a fast‑growing finance reporting system from a monolithic microservice architecture plagued by data inconsistency, low efficiency, and scalability limits to a robust, high‑performance big‑data warehouse built with layered data models, SparkSQL processing, and unified scheduling, highlighting design decisions, technical trade‑offs, and measurable performance gains.

Microservices · architecture evolution · big data
0 likes · 23 min read
vivo Internet Technology
Jun 11, 2025 · Big Data

How Vivo Built a Scalable Pulsar Monitoring System for Trillion‑Message Workloads

This article details Vivo's end‑to‑end Pulsar observability solution, covering the challenges of Prometheus‑based monitoring, the architecture of the alerting pipeline, adaptor development, metric optimizations for subscription backlog and bundle load, and fixes for KoP lag reporting issues.

Metrics · Observability · Prometheus
0 likes · 12 min read
JD Retail Technology
Jun 10, 2025 · Artificial Intelligence

How JD Builds a Scalable AI‑Powered Recommendation Data System with Flink

This article explains JD's complex recommendation system data pipeline—from indexing, sampling, and feature engineering to explainability and real‑time metrics—highlighting challenges such as data consistency, latency, and the use of Flink for massive, low‑latency processing.

big data · explainability · feature engineering
0 likes · 23 min read
DataFunSummit
Jun 10, 2025 · Big Data

How OpenLake Redefines Data Lake Infrastructure for the AI Era

This article explores OpenLake's evolution as a data lake platform for AI, covering the transition from Hive to modern lake formats like Iceberg and Paimon, performance benchmarks, metadata management advances, intelligent storage optimization, and the integration of multimodal support with the Lance file format.

AI · Data Lake · OpenLake
0 likes · 22 min read
Lobster Programming
Jun 9, 2025 · Databases

How to Add a Column to Billion‑Row Tables Without Downtime

This article explains a metadata‑driven approach for extending massive tables—using a separate extension table, sharding, and Elasticsearch sync—to add new fields to billion‑row databases without locking the primary table or disrupting online services.
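
The extension‑table idea summarized above looks roughly like this (a schematic SQLite sketch with invented names; the article's actual design also involves sharding and an Elasticsearch sync):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The huge primary table is never ALTERed.
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
-- New fields land in a separate extension table keyed by the same id.
CREATE TABLE orders_ext (order_id INTEGER PRIMARY KEY, coupon_code TEXT);
INSERT INTO orders VALUES (1, 50.0), (2, 75.0);
INSERT INTO orders_ext VALUES (2, 'SAVE10');  -- only rows that need the new field
""")

# Reads merge the two tables; rows without an extension row simply get NULL.
row = conn.execute("""
    SELECT o.id, o.amount, e.coupon_code
    FROM orders o LEFT JOIN orders_ext e ON e.order_id = o.id
    WHERE o.id = 2
""").fetchone()
print(row)  # (2, 75.0, 'SAVE10')
```

Because no `ALTER TABLE` ever touches the billion‑row table, the primary table stays unlocked and online traffic is unaffected.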

Sharding · big data · database schema
0 likes · 6 min read
DataFunSummit
Jun 6, 2025 · Big Data

How Unicom Digital’s Integrated Data Platform Revolutionizes Metadata Management

This article details Unicom Digital’s metadata management practice on its integrated data platform, covering the strategic background, key challenges, award-winning capabilities, and a three-pronged solution of automation, linking+, and AI, along with practical implementations, full‑chain lineage, data responsibility, lifecycle management, and future AI‑driven enhancements.

AI · automation · big data
0 likes · 18 min read
DataFunSummit
Jun 3, 2025 · Big Data

BiFang: A Unified Lake‑Stream Storage Engine for Real‑Time and Batch Data Processing

BiFang is a lake‑stream integrated storage engine that merges Apache Pulsar message‑queue capabilities with Iceberg data‑lake features, providing a single unified data store with full‑incremental queries, sub‑second visibility, exactly‑once semantics, and seamless integration with Flink, Spark, and StarRocks for both real‑time analytics and batch processing.

Apache Iceberg · Apache Pulsar · Lakehouse
0 likes · 13 min read
DataFunSummit
Jun 1, 2025 · Big Data

Scaling WeChat’s Big Data and AI Workloads on Kubernetes: Challenges and Optimizations

This article details WeChat's migration of large‑scale big data and AI workloads to a cloud‑native Kubernetes platform, discussing performance bottlenecks, API server and etcd overload protection, scheduler enhancements, observability solutions, resource utilization gains, and future serverless directions.

AI · Kubernetes · Observability
0 likes · 11 min read
Kuaishou Tech
May 28, 2025 · Databases

Optimizing Kuaishou's Photo Object Storage: Reducing Size and Boosting Cache Hit Rate

This article details how Kuaishou dramatically cut storage costs and improved cache efficiency for its core Photo data object by cleaning up redundant JSON fields, applying selective serialization, and performing large‑scale data cleaning, achieving a 25% size reduction, a 2% cache‑hit increase, and multi‑hundred‑TB savings.

Cache Hit Rate · Kuaishou · Photo Object
0 likes · 20 min read
Full-Stack Internet Architecture
May 27, 2025 · Big Data

Understanding Event Streaming in Kafka: Core Concepts, Architecture, and Use Cases

This article explains Kafka's event streaming model: what events and streams are, core components such as producers, topics, partitions, consumers, and persistence, and typical use cases including real‑time data pipelines, event‑driven architecture, stream processing, and log aggregation, highlighting Kafka's role as foundational big‑data infrastructure.
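
The concepts the article names fit in a few lines of code. The toy model below is not the Kafka API, just an illustration of topics, keyed partitioning, and offset-based consumption:

```python
# Toy model of Kafka's core abstractions (illustrative only, not the Kafka API):
# a topic is a set of partitions, each an append-only log consumed by offset.
class Topic:
    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Keyed messages hash to a fixed partition, preserving per-key order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # Consumers track their own offsets; the log itself is immutable.
        return self.partitions[partition][offset:]

clicks = Topic("page-clicks")
p, off = clicks.produce("user-1", "/home")
clicks.produce("user-1", "/cart")
print(clicks.consume(p, off))  # [('user-1', '/home'), ('user-1', '/cart')]
```

Persistence and replication, which real Kafka layers on top of exactly this log abstraction, are what the article's use cases depend on.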

Event Streaming · Kafka · Message Queues
0 likes · 7 min read
Alibaba Cloud Infrastructure
May 26, 2025 · Big Data

Comparative Guide to Apache Airflow and Argo Workflows for Distributed Task Scheduling

This article provides a comprehensive comparison of Apache Airflow and Argo Workflows, covering their core features, architectures, use cases, code examples, and recommendations for selecting the appropriate distributed workflow engine in data engineering, big‑data, and AI pipelines.

Apache Airflow · Argo Workflows · Data Engineering
0 likes · 23 min read
Full-Stack Internet Architecture
May 23, 2025 · Big Data

Step-by-Step Guide to Installing and Using Apache Kafka 3.8.1 on Linux

This tutorial walks through installing Apache Kafka 3.8.1 on a Linux system: downloading and extracting the release, configuring and starting the broker, creating topics, producing and consuming messages, and finally shutting everything down, with all the necessary command‑line instructions.

Installation · Kafka · Linux
0 likes · 4 min read
DataFunSummit
May 22, 2025 · Operations

Automated Fault Detection and Repair System for Grab's Data Pipelines (Hugo) – Architecture, Implementation, and Impact

This article presents Grab's Hugo platform, an automated fault‑detection and self‑healing system for over 4,000 data pipelines that combines multi‑source signal collection, intelligent diagnosis, layered auto‑repair, and a health API to dramatically improve data visibility, reduce manual intervention, and boost operational efficiency across the company.

DataOps · automation · big data
0 likes · 12 min read
Python Programming Learning Circle
May 22, 2025 · Big Data

Introduction to PySpark: Features, Core Components, Sample Code, and Use Cases

This article introduces PySpark as the Python API for Apache Spark, explains Spark's core concepts and advantages, details PySpark's main components and a simple code example, compares it with Pandas, and outlines typical big‑data scenarios and further learning directions.

Apache Spark · DataFrames · PySpark
0 likes · 5 min read
Full-Stack Internet Architecture
May 20, 2025 · Big Data

Why Learn Kafka? Core Benefits, Use Cases, and a Summary

This article explains why Kafka is widely adopted by top companies, outlines its high throughput, scalability, and durability, and describes key real‑time data pipeline, stream processing, and big‑data integration scenarios, concluding that mastering Kafka is essential for modern backend and data engineering roles.

Data Engineering · Kafka · Real-time Processing
0 likes · 4 min read
iQIYI Technical Product Team
May 15, 2025 · Big Data

Introducing AMD and ARM Bare‑Metal Instances for iQIYI Big Data Computing: Cloud Selection, Performance Evaluation, and Heterogeneous Scheduling

To reduce costs and boost compute density, iQIYI's big data team migrated from aging private‑cloud Intel servers to public‑cloud AMD and ARM bare‑metal instances, establishing a systematic machine‑selection process, performance testing framework, and YARN‑based heterogeneous scheduling to fully leverage the new hardware.

AMD · ARM · Cloud Computing
0 likes · 16 min read
Bilibili Tech
May 13, 2025 · Big Data

Live Streaming Ecosystem Governance Architecture and Data Mining Engine Design

The article outlines a comprehensive live‑streaming ecosystem governance framework that combines data‑mining engines, tagging platforms, rule‑based disposal mechanisms, and multi‑stage user touchpoints to improve content quality, compliance, and platform sustainability.

Live Streaming · Tagging System · big data
0 likes · 14 min read
macrozheng
May 12, 2025 · Big Data

Master DataX: Efficient Data Synchronization for Massive MySQL Datasets

Learn how to overcome inaccurate reporting and cross-database challenges by using Alibaba’s open-source DataX tool to efficiently synchronize massive MySQL datasets, covering its architecture, job scheduling, installation, configuration, full- and incremental sync, and practical command-line examples.
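
The incremental-sync pattern such DataX jobs implement reduces to a watermark query. The sketch below shows the idea in plain Python with SQLite (in DataX itself this is expressed as a JSON job config; table and column names here are invented):

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT);
INSERT INTO orders VALUES (1, '2025-05-01'), (2, '2025-05-10'), (3, '2025-05-12');
""")

def incremental_sync(conn, watermark):
    # Pull only rows modified since the last successful sync, then
    # advance the watermark to the newest timestamp seen.
    rows = conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark

rows, wm = incremental_sync(source, "2025-05-05")
print(rows, wm)  # [(2, '2025-05-10'), (3, '2025-05-12')] 2025-05-12
```

A full sync is the degenerate case with the watermark set to the minimum value; persisting the watermark between runs is what makes the job restartable.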

Data Synchronization · DataX · ETL
0 likes · 15 min read
360 Zhihui Cloud Developer
May 9, 2025 · Big Data

Mastering Multi‑AZ Replication in HDFS with AZ Mover

This article introduces AZ Mover, a lightweight HDFS client‑side tool that intelligently scans, schedules, and migrates block replicas across multiple availability zones, detailing its design goals, core workflow, command‑line options, concurrency controls, and future enhancements for robust big‑data disaster recovery.

AZ Mover · Cluster Operations · HDFS
0 likes · 9 min read