Tagged articles
3697 articles
Page 10 of 37
DataFunSummit
Nov 1, 2023 · Artificial Intelligence

DataFunCon2023 Shenzhen: Program Overview and Session Highlights

DataFunCon2023 Shenzhen showcases a comprehensive program featuring expert talks on building Data+LLM applications, large-scale storage, cloud‑native architectures, metric systems, data governance, AB testing, and industry‑specific large language model use cases across finance, gaming, advertising, and more, providing valuable insights for practitioners and researchers alike.

@Data · AIGC · Artificial Intelligence
0 likes · 50 min read
ByteDance Data Platform
Nov 1, 2023 · Big Data

How a Leading E‑Commerce Platform Solves EB‑Scale Data Governance Challenges

Facing massive data volumes and strict SLA requirements during the Double 11 shopping festival, a major e‑commerce platform built a systematic data‑governance framework that addresses quality, stability, cost, and efficiency through multi‑layered grading, digital cost models, automated tools, and full‑lifecycle management.

Big Data · SLA management · cost optimization
0 likes · 23 min read
DataFunSummit
Oct 31, 2023 · Big Data

Customer Data Platform (CDP) at Qunar Travel: Business Background, Construction Practice, Applications, and Future Outlook

This article details Qunar Travel's multi‑year development of a Customer Data Platform (CDP), covering its business motivations, architectural design, tag‑based data processing, real‑time and offline pipelines, user segmentation, marketing automation, performance optimizations, and future directions for model‑driven personalization.

Big Data · Real-time Analytics · Tagging
0 likes · 18 min read
StarRocks
Oct 31, 2023 · Databases

How Ctrip Accelerated Report Queries 10× with StarRocks: A Real‑World Lakehouse Migration

Ctrip migrated its Artnova reporting platform from Hive‑based queries to StarRocks, first loading data into OLAP tables and then using StarRocks as a lakehouse with Hive catalog, Data Cache and materialized views, achieving average query latency reductions from 20 seconds to 1.5 seconds, over 7× speed‑up versus Trino and up to 40× acceleration for complex workloads.

Big Data · Data Cache · Lakehouse
0 likes · 15 min read
Inke Technology
Oct 31, 2023 · Operations

How We Re‑engineered Our Log Platform to Cut Costs by 60% with ClickHouse

This article details the redesign of a company’s logging infrastructure—from an ELK‑based solution to a ClickHouse‑powered architecture—highlighting the motivations, key requirements, component choices, configuration examples, performance optimizations, and the resulting cost and storage benefits.

Big Data · ClickHouse · Logging
0 likes · 13 min read
Big Data Technology & Architecture
Oct 30, 2023 · Big Data

New Features in Flink 1.18: Operator-Level State TTL, Watermark Alignment, Idle Detection, and Dynamic Scaling

Flink 1.18 introduces several production‑critical enhancements, including per‑operator state TTL configuration, watermark alignment and idle‑timeout settings, as well as dynamic fine‑grained scaling of task parallelism via the Web UI and REST API, improving resource efficiency and job stability.

Big Data · Dynamic Scaling · Flink
0 likes · 6 min read
DataFunTalk
Oct 28, 2023 · Big Data

Data Lake Architecture, Ingestion Options, Real-time Optimization, and Query Practices

This article presents a comprehensive overview of a unified data lake architecture, evaluates three ingestion solutions, details real‑time ingestion optimizations for Flink‑Hudi pipelines, and describes how Kyuubi enables unified query access across multiple engines, offering practical guidance for large‑scale data processing.

Big Data · Data Lake · Flink
0 likes · 14 min read
DataFunSummit
Oct 25, 2023 · Big Data

Data Serviceization at JD: From Zero to One and Beyond

This technical presentation describes JD's data service platform, covering its origin, performance optimizations, flexible API generation, scaling to massive metrics, caching strategies, service orchestration, governance, and a Q&A on security and data‑source flexibility.

API generation · Big Data · Caching
0 likes · 11 min read
DataFunTalk
Oct 25, 2023 · Databases

Apache Doris Summit Asia 2023: Highlights, Innovations, and Industry Use Cases

The Apache Doris Summit Asia 2023 showcased the milestone 2.0 release, impressive performance gains, rapid community growth, and diverse industry deployments, while outlining future cloud‑native and unified analytics directions that position Doris as a leading real‑time data warehouse solution.

Apache Doris · Big Data · Data Warehouse
0 likes · 13 min read
DevOps
Oct 25, 2023 · Big Data

An Introduction to Big Data: Origins, Definitions, 5V Characteristics, Applications, Hadoop Architecture, and Testing Strategies

This article provides a comprehensive overview of big data, covering its origins, definitions, 5V characteristics, data formats, real‑world applications, Hadoop architecture, testing challenges, functional and performance testing strategies, and the skills required for effective big data testing.

5V Characteristics · Big Data · Data Formats
0 likes · 35 min read
Data Thinking Notes
Oct 24, 2023 · Big Data

Unlocking Retail Success: Key Data Metrics and Analysis Methods for the New Era

This article explores how retailers can leverage big‑data analytics across people, products, and places—both offline and online—to build comprehensive indicator systems, apply methods like ABC, RFM, association and funnel analysis, and drive smarter decision‑making in the evolving retail landscape.

ABC analysis · Big Data · Customer Segmentation
0 likes · 9 min read
DataFunSummit
Oct 24, 2023 · Big Data

Practices of Data Fabric in Data Integration Scenarios

The presentation by Aloudata Vice President Yu Jun introduces his extensive background in large‑scale internet and big‑data platforms and outlines how Data Fabric and data virtualization can be applied to data integration, highlighting the differences from traditional solutions and the business value of logical data warehouses.

Big Data · Data Fabric · Data Integration
0 likes · 2 min read
DataFunSummit
Oct 24, 2023 · Big Data

Using Apache Arrow to Quickly Build Modern Data Systems

This announcement introduces Li Chenxi, a big‑data R&D engineer, and outlines his talk on leveraging Apache Arrow’s columnar in‑memory format to efficiently construct modern, read‑time modeling data systems, highlighting key features, ecosystem, and practical implementation benefits for the audience.

Apache Arrow · Big Data · Columnar Memory
0 likes · 2 min read
DataFunSummit
Oct 24, 2023 · Big Data

DataOps & DataFabric in the Era of Large Models

In this presentation, Guo Wei, CEO of Baijiang Open Source and seasoned big‑data expert, explores how large‑model AI reshapes DataOps and DataFabric, detailing efficiency gains, intelligent deployment, and future enterprise architectures for big‑data and AI integration.

Artificial Intelligence · Big Data · DataFabric
0 likes · 3 min read
DataFunTalk
Oct 23, 2023 · Big Data

Alibaba Cloud DataWorks Intelligent Data Modeling: Practices, Challenges, and Solutions

This article introduces Alibaba Cloud DataWorks' intelligent data modeling tool, outlines the data demand flow, shares best practices and hands‑on demonstrations for data warehouse modeling, discusses common challenges and their solutions, and provides Q&A and product details for developers and data engineers.

Alibaba Cloud · Best Practices · Big Data
0 likes · 12 min read
Big Data Technology & Architecture
Oct 23, 2023 · Big Data

Bilibili Data Quality Assurance: Architecture, Goals, Core Capabilities, and Future Outlook

This article outlines Bilibili's data quality assurance framework, detailing its evolution across four development stages, the current data platform architecture, identified pain points, four key quality objectives, core capabilities such as a quality data warehouse, comprehensive monitoring, digital optimization, fault handling, and future directions.

Big Data · Data Platform · Data Quality
0 likes · 22 min read
Data Thinking Notes
Oct 22, 2023 · Big Data

Boosting Big Data Governance Capabilities for Digital Transformation

This article outlines how enterprises can enhance their big data governance capabilities during digital transformation, covering the background and challenges of data governance, the emergence of data capability as a core competency with implementation paths, and practical suggestions for governance projects, illustrated with national-level examples.

Big Data · Digital Transformation · data governance
0 likes · 3 min read
DataFunSummit
Oct 22, 2023 · Big Data

How Kuaishou E‑commerce Leverages OLAP and a Unified Data Architecture to Solve Business Data Challenges

This article explains how Kuaishou's e‑commerce team built a unified OLAP‑based data platform—covering data ingestion, consistent dimensional and fact layers, metric management, and real‑time services—to address rapid growth, metric inconsistency, and operational inefficiencies across multiple business scenarios.

Big Data · Data Architecture · Data Warehouse
0 likes · 20 min read
DataFunTalk
Oct 22, 2023 · Operations

Bilibili Data Quality Assurance System: Architecture, Practices, and Case Study

This article presents Bilibili's data quality assurance system, detailing its evolution across four data platform stages, the multi‑layer architecture, core capabilities such as a quality data warehouse, digital‑driven continuous optimization, and efficient incident handling, and concludes with a real‑world case study and future outlook.

Big Data · Data Warehouse · quality assurance
0 likes · 21 min read
dbaplus Community
Oct 18, 2023 · Databases

Doris vs ClickHouse: Which Database Delivers Faster Writes and Queries?

This article presents a systematic performance comparison between Doris and ClickHouse, covering data ingestion speed, SQL syntax differences, hardware impact, and detailed query benchmarks across multiple scenarios, ultimately revealing that each system excels in different use cases.

Big Data · ClickHouse · Doris
0 likes · 15 min read
DataFunSummit
Oct 18, 2023 · Big Data

Kuaishou Data Lake Construction with Apache Hudi: Architecture, Challenges, and Solutions

This article explains why Kuaishou built a data lake, outlines the shortcomings of its previous Lambda architecture, describes the adoption of Apache Hudi for unified batch‑stream processing, and details the five major technical challenges and the corresponding solutions implemented to improve performance, consistency, and operational reliability.

Apache Hudi · Big Data · Data Architecture
0 likes · 17 min read
DataFunSummit
Oct 16, 2023 · Big Data

Bilibili's Iceberg‑Based Lakehouse Platform: Technical Practices for Sub‑Second Query Response

This article details Bilibili's implementation of an Iceberg‑based lakehouse platform that unifies storage and analytics, addressing Hive’s performance and latency issues through multidimensional sorting, various file‑level indexes, cube pre‑aggregation, star‑tree structures, and an automated Magnus service for intelligent optimization, achieving near‑second query responses.

Big Data · Iceberg · Lakehouse
0 likes · 14 min read
DataFunSummit
Oct 16, 2023 · Big Data

Elegant Dimensional Modeling and Multi‑Dimensional Analysis Design Practice

In this presentation, Qiu Shengchang shares his 13‑year experience designing elegant data‑warehouse architectures, detailing a highly generic dimensional model, extreme partitioned tables, and a universal multi‑dimensional analysis framework that enables rapid, comprehensive reporting on massive datasets.

Big Data · Data Warehouse · Multi-dimensional Analysis
0 likes · 3 min read
DataFunSummit
Oct 15, 2023 · Big Data

Construction and Architecture of JD One-Service Data Service System

This article details JD's three‑stage evolution of its data service platform, explains thematic (topic‑based) data services, introduces the One‑Service unified architecture, and outlines future plans for standardization, low‑code front‑end, and operational improvements.

Big Data · Data Platform · Data Service
0 likes · 13 min read
dbaplus Community
Oct 14, 2023 · Big Data

What Is a Data Warehouse? From Basics to Modern Practices

This article explains what a data warehouse is, contrasts it with traditional databases, outlines the evolution from classic to internet‑scale warehouses, details modeling approaches and layered architectures, discusses KPI dictionaries, date dimensions, naming standards, data governance, incremental loading techniques, and upstream/downstream coordination.

Big Data · ETL · data governance
0 likes · 25 min read
DataFunSummit
Oct 13, 2023 · Big Data

Practical Experience of Flink on Kubernetes at Kuaishou

This article presents Kuaishou's comprehensive journey of adopting Flink on Kubernetes, covering its background, evolution, architecture, production migration, observability, testing, and future plans, and demonstrates how large‑scale streaming workloads are transformed to a cloud‑native environment.

Big Data · Flink · Kubernetes
0 likes · 14 min read
DataFunTalk
Oct 13, 2023 · Big Data

Design Principles, Architecture, and Applications of the Open‑Source LakeSoul Lakehouse Framework

This article provides a comprehensive technical overview of LakeSoul, an open‑source, cloud‑native lakehouse framework, covering its design philosophy, core features, architecture, performance benchmarks, real‑time ingestion, incremental computation, multi‑stream joining, security, community progress, and future roadmap.

Big Data · Data Lakehouse · Flink
0 likes · 16 min read
Data Thinking Notes
Oct 11, 2023 · Big Data

How ByteDance Optimized Its E‑Commerce Data Lake to Cut Costs and Boost Real‑Time Accuracy

ByteDance revamped its traditional Lambda architecture for e‑commerce traffic data by introducing a new lake ingestion solution that reduces development and operational costs, ensures timely and stable data, and outlines future plans covering business background, ODS lake design, archiving tags, delayed data handling, and real‑time stability.

Big Data · Data Lake · Flink
0 likes · 7 min read
政采云技术
Oct 10, 2023 · Artificial Intelligence

Predicting Membership Purchase with Logistic Regression: Feature Engineering, Model Training, Evaluation, and Deployment

This article presents a complete workflow for predicting whether users will purchase a membership using logistic regression, covering data collection, feature selection, handling imbalanced samples, model training, hyper‑parameter tuning, threshold optimization, evaluation metrics such as accuracy, precision, recall, AUC, lift, and finally deployment on a big‑data platform with PySpark.

Big Data · Feature Engineering · logistic regression
0 likes · 17 min read
Past Memory Big Data
Oct 10, 2023 · Big Data

2023 Big Data Interview Guide: Hadoop, Hive, Doris, Data Warehouse Essentials

This comprehensive 2023 guide covers essential big‑data interview topics, providing detailed explanations and step‑by‑step processes for Hadoop HDFS read/write, YARN, Hive table types and optimizations, Doris architecture and data models, data‑warehouse layers, modeling techniques, quality monitoring, and classic algorithm design questions such as TOP‑K and duplicate detection.

Big Data · Data Warehouse · Doris
0 likes · 54 min read
Alibaba Cloud Big Data AI Platform
Oct 9, 2023 · Big Data

How We Cut MaxCompute Costs Using Information Schema Insights

This article details how a fast‑growing HR SaaS company analyzed MaxCompute billing spikes, identified five key cost drivers, leveraged tenant‑level Information Schema to extract task metadata, applied SQL‑based cost formulas, and implemented targeted optimizations that stabilized their cloud data‑processing expenses.

Big Data · Information Schema · MaxCompute
0 likes · 10 min read
MaGe Linux Operations
Oct 8, 2023 · Big Data

Understanding Kafka: Core Concepts, Architecture, and Performance Secrets

This article explains Kafka’s fundamental role as a message system, detailing topics, partitions, producers, consumers, replica management, consumer groups, the controller, Zookeeper coordination, and performance optimizations such as sequential writes, zero‑copy, log segmentation, and network design, providing a comprehensive overview for big‑data practitioners.

Big Data · Distributed Systems · Kafka
0 likes · 11 min read
DataFunTalk
Oct 8, 2023 · Big Data

Full-Process DataOps Practices for Large-Scale Business Data Reporting at Baidu

This article reveals how Baidu implements end‑to‑end DataOps for its commercial data products, covering challenges of massive report generation, the design of a layered data architecture, platform‑wide automation, serverless deployment, risk control, monitoring, and optimization to achieve scalable, reliable data pipelines.

Big Data · DataOps · Optimization
0 likes · 13 min read
Efficient Ops
Oct 7, 2023 · Big Data

Master Kafka Basics: Topics, Partitions, Producers, and Cluster Architecture

This article explains Kafka's role as a messaging system, covering core concepts such as topics, partitions, producers, consumers, messages, cluster architecture, replicas, consumer groups, controller coordination with Zookeeper, and performance optimizations like sequential writes and zero‑copy networking.

Big Data · Distributed Systems · Kafka
0 likes · 11 min read
DataFunTalk
Oct 7, 2023 · Big Data

Alibaba DataWorks Data Stability Governance: Challenges, Solutions, and Practices

This article presents Alibaba's experience in addressing large‑scale data stability challenges by outlining common problems, governance principles, baseline monitoring, team collaboration methods, practical implementations, and proactive measures to ensure reliable and accurate data production on the DataWorks platform.

Alibaba · Big Data · DataWorks
0 likes · 12 min read
Efficient Ops
Oct 6, 2023 · Operations

How China Post’s Next‑Gen IT Monitoring Platform Drives Smart Operations

The article details China Post’s new generation IT infrastructure intelligent operation monitoring platform, highlighting its architecture, data collection, stream‑batch processing, AI‑driven algorithms, and one‑stop portal, and explains how the solution exemplifies cutting‑edge digital transformation practices showcased at the 2023 China International Service Trade Fair.

AI · Big Data · Digital Transformation
0 likes · 9 min read
DataFunTalk
Oct 5, 2023 · Big Data

Building a Unified Streaming‑Batch Lakehouse with Amoro Mixed Iceberg

This article describes how Shanghai Steel Union leveraged Amoro Mixed Iceberg on top of Apache Iceberg to create a unified streaming‑batch lakehouse, addressing small‑file and upsert challenges, simplifying architecture, improving data freshness, and providing a scalable solution for real‑time and batch analytics.

Amoro · Apache Iceberg · Big Data
0 likes · 13 min read
ITPUB
Oct 4, 2023 · Backend Development

How to Speed Up Slow Elasticsearch Aggregations with execution_hint "map"

In a high‑traffic e‑commerce system, sharding makes cross‑shop queries inefficient, and adding terms aggregations in Elasticsearch caused queries to take dozens of seconds, but using the "execution_hint":"map" option dramatically reduces aggregation latency.

Big Data · Elasticsearch · Performance Optimization
0 likes · 7 min read
DataFunTalk
Oct 4, 2023 · Big Data

Understanding Power Law Distributions in Content Ecosystems: Data Science Insights and Applications

This article explores how data scientists at Tencent analyze and model the shape of data in content ecosystems, focusing on normal and power‑law distributions, their prevalence, theoretical mechanisms, practical implications for traffic and compensation strategies, and methods such as integer programming, graph analysis, and causal inference.

Big Data · Power Law · Statistical Distribution
0 likes · 19 min read
DataFunSummit
Oct 1, 2023 · Big Data

Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans

This presentation introduces Iceberg's core capabilities, details Xiaomi's practical applications—including log ingestion, near‑real‑time warehousing, offline challenges, column‑level encryption, and Hive migration—and outlines future development directions such as materialized views and cloud migration, providing a comprehensive view of modern data‑lake engineering.

Big Data · Data Lake · Flink
0 likes · 22 min read
DataFunTalk
Sep 30, 2023 · Big Data

Building a Marketing‑Oriented Data Middle Platform: Concepts and Practices

This article outlines how a marketing‑focused data middle platform can be constructed by integrating online and offline behavior data, business data, and third‑party sources, then applying data integration, modeling, processing, and application capabilities to enable data‑driven user journeys and personalized marketing strategies.

Big Data · Data Integration · Marketing Analytics
0 likes · 13 min read
ITPUB
Sep 29, 2023 · Big Data

How Vivo Scaled Hive Metastore Using TiDB: A Deep Dive into Big Data Metadata

This article recounts Vivo’s journey to horizontally scale its Hive Metastore service by evaluating MySQL sharding, the open‑source Waggle‑Dance gateway, and ultimately selecting TiDB, detailing the migration process, configuration tweaks, performance benchmarks, encountered issues such as primary‑key conflicts, index choices, memory spikes, and the solutions implemented to ensure stable, high‑performance metadata storage for massive data volumes.

Big Data · Hive Metastore · Performance Optimization
0 likes · 22 min read
DataFunSummit
Sep 28, 2023 · Big Data

Real‑time Risk Control Practices at NetEase Games Using Apache Flink

The article details NetEase Games' challenges in payment‑environment risk control and explains how they transformed a T+1 batch workflow into a fully real‑time risk‑control system with Apache Flink, describing the platform architecture, data modeling, session windows, joins, and future development plans.

Big Data · Flink · Real-time Risk Control
0 likes · 19 min read
vivo Internet Technology
Sep 27, 2023 · Big Data

Horizontal Scaling of Hive Metastore Service at Vivo: Evaluation, TiDB Migration, and Lessons Learned

Vivo’s big‑data team horizontally scaled its Hive Metastore by evaluating MySQL sharding (Waggle‑Dance) against a TiDB migration, ultimately adopting TiDB, which after a synchronized cut‑over delivered ~15% faster queries, 80% DDL latency reduction, linear scaling, low resource use, and valuable operational lessons.

Big Data · Hive Metastore · TiDB
0 likes · 19 min read
DataFunTalk
Sep 25, 2023 · Big Data

Tag System Construction Practice at 58: Pain Points, Solutions, Architecture, and Management Platform

This article details the practical implementation of a tag system at 58, covering business stages that require tagging, common challenges and solutions, a three‑layer architecture, lifecycle management, evaluation metrics, and a unified tag management platform to support scalable, scenario‑driven data products.

Big Data · Label Architecture · Tag Management
0 likes · 17 min read
Huolala Tech
Sep 21, 2023 · Big Data

How We Built a Scalable Data Migration Framework for Billions of Transactions

This article details the design and implementation of a custom, high‑throughput data migration framework that handles petabyte‑scale transaction data, supports heterogeneous source/target schemas, ensures zero‑downtime operation, and provides robust scheduling, checkpointing, and fault‑tolerance mechanisms.

Big Data · Data Migration · Distributed Systems
0 likes · 17 min read
Architect
Sep 19, 2023 · Big Data

How Tianyan Beats ELK: Inside a High‑Performance Distributed Log Service

This article analyzes the challenges of logging in distributed services, compares the traditional ELK stack with Baidu's Tianyan platform, and details Tianyan's architecture, data collection, high‑throughput transmission, storage, retrieval, resource isolation, dynamic cleanup, and best‑practice recommendations, complete with code examples and performance insights.

Big Data · Distributed Systems · ELK
0 likes · 30 min read
DataFunTalk
Sep 16, 2023 · Big Data

StarRocks Data Lake Analysis, Materialized Views, and Lakehouse Architecture

This article explains how StarRocks 3.0 extends real‑time data‑warehouse capabilities to support data‑lake analysis, external catalog integration, Trino compatibility, extensive I/O optimizations, and powerful materialized‑view features that together enable a unified, cloud‑native Lakehouse solution with high performance and flexible resource isolation.

Big Data · Data Lake · Lakehouse
0 likes · 20 min read
Bilibili Tech
Sep 15, 2023 · Big Data

Introducing Bilibili's SQLScan: Architecture, Key Technologies, and Production Impact

Bilibili's SQLScan is a static‑code analysis tool that parses Hive, Spark, Presto and Flink SQL via Antlr4, builds a unified AST, applies engine‑specific metadata plugins for rule enforcement, provides field‑lineage and cost‑analysis services, and has processed hundreds of thousands of daily queries, intercepting thousands of problematic statements to improve data quality and operational efficiency.

Big Data · Bilibili · Data Lineage
0 likes · 11 min read
Programmer DD
Sep 15, 2023 · Big Data

How Alluxio Manages Massive Metadata: Inode, Block, MountTable, and Worker Insights

This article examines Alluxio's open-source distributed file system, detailing the core types of metadata—inode, block, mount table, and worker—along with the mechanisms for their storage, management, and optimization in both HEAP and ROCKS modes, and provides practical configuration guidance for scaling large-scale data environments.

Alluxio · Big Data · Distributed File System
0 likes · 15 min read
DataFunTalk
Sep 13, 2023 · Big Data

Design and Implementation of a Lakehouse Data Platform Based on Apache Hudi at Taikang Life Insurance

This article details Taikang Life Insurance's end‑to‑end technical selection, architecture design, implementation, and custom enhancements of an Apache Hudi‑driven lakehouse platform for large‑scale health‑insurance data, covering background, component evaluation, performance benchmarking, multi‑layer architecture, and real‑world results.

Apache Hudi · Big Data · Data Lakehouse
0 likes · 44 min read
DataFunTalk
Sep 12, 2023 · Big Data

Building an Intelligent Data Governance Platform at NetEase Cloud Music: Architecture, Practices, and Future Plans

This article presents a comprehensive case study of NetEase Cloud Music’s metadata‑driven intelligent governance platform, detailing its scale, construction background, modular architecture, rule‑based automation, practical deployment, and future roadmap for sustainable data ecosystem management.

Big Data · Rule Engine · automation
0 likes · 22 min read
DataFunTalk
Sep 10, 2023 · Big Data

Ping An Life Insurance’s Data Middle Platform Construction Practice

The presentation details Ping An Life’s four‑stage data middle‑platform initiative—defining data capability as the foundation of digital transformation, outlining the platform’s architecture and governance, showcasing business‑value applications, and discussing talent and cultural considerations—to illustrate how a large insurer builds a scalable, real‑time data ecosystem.

Big Data · Digital Transformation · Insurance
0 likes · 9 min read
DataFunTalk
Sep 9, 2023 · Big Data

Presto + Tencent DOP (Alluxio) Architecture and Optimization Practices for Financial OLAP

This article presents the practical implementation of Presto combined with Tencent DOP (Alluxio) in a financial OLAP scenario, detailing background and architectural evolution, the Presto‑Alluxio design, optimization techniques for caching, storage scalability, ORC handling, and performance results, followed by conclusions and future directions.

Alluxio · Big Data · OLAP
0 likes · 15 min read
21CTO
Sep 8, 2023 · Big Data

Why Real-Time Data Processing Is the Next Frontier for Data Engineers

Real-time data processing transforms traditional batch pipelines by delivering fresh, low‑latency data to millions of concurrent users, leveraging event‑driven architectures, streaming engines, and real‑time databases, with use cases ranging from fraud detection to personalized e‑commerce and operational dashboards, and includes reference architectures and tool recommendations.

Big Data · Data Engineering · Real-time Processing
0 likes · 16 min read
DataFunSummit
Sep 8, 2023 · Big Data

Tianqiong OLAP Real‑time Lakehouse Fusion Platform Architecture Practice

This article explains why lake‑warehouse fusion is needed, describes the challenges of integrating real‑time data warehouses with data lakes, introduces a new StarRocks‑based architecture that supports real‑time ingestion, cooling, offline loading, and adaptive hot‑cold query rewriting, and outlines future plans and Q&A.

Big Data · Data Integration · Data Warehouse
0 likes · 21 min read
Huolala Tech
Sep 7, 2023 · Big Data

How Huolala Ensures Doris Stability: Real-World Big Data Practices

This article details Huolala's big‑data architecture and the practical measures—ranging from background analysis and stability challenges to case studies, discovery mechanisms, capacity planning, high‑availability, and automation—that the company employs to guarantee Doris's reliability and performance across its rapidly growing logistics platform.

Big Data · Doris · OLAP
0 likes · 15 min read
StarRocks
Sep 6, 2023 · Big Data

How Paimon + StarRocks Revolutionize Lakehouse Analytics

This article reviews traditional Lambda and Kappa data‑warehouse architectures, then details four Paimon‑StarRocks lakehouse solutions—including a data‑lake center, accelerated query with materialized views, hot‑cold data separation, and the JNI connector—while also outlining StarRocks’ future roadmap for lakehouse analytics.

Big Data · Lakehouse · Paimon
0 likes · 11 min read
Xiaohongshu Tech REDtech
Sep 6, 2023 · Databases

REDck: A Cloud‑Native Real‑Time OLAP Data Warehouse Built on ClickHouse

REDck is a cloud‑native, real‑time OLAP data warehouse built on ClickHouse that adds elastic compute and storage scaling, object‑storage optimizations, multi‑level caching, and exactly‑once ingestion, delivering petabyte‑scale interactive analytics with ten‑fold CPU efficiency, ten‑fold cost reduction, and 99.9% availability.

Big Data · ClickHouse · Real-time OLAP
0 likes · 21 min read
JD Retail Technology
Sep 4, 2023 · Big Data

JD Mini Program Data Center: Architecture, Milestones, and Real‑time Analytics Solutions

The article details the JD Mini Program platform, its data‑center development milestones, comprehensive business panorama, technical architecture, data collection, storage, and analysis pipelines—including Flink‑based real‑time monitoring, ClickHouse custom analytics, and Elasticsearch user‑behavior insights—while outlining current challenges and future AI‑driven enhancements.

Big Data · ClickHouse · Data Warehouse
0 likes · 16 min read
Data Thinking Notes
Sep 3, 2023 · Big Data

How to Build an Effective Data Governance Framework: Steps & Best Practices

This article outlines a comprehensive data governance framework for Chinese enterprises, covering organizational structures, data asset inventory, six‑stage methodology, and the creation of unified data standards and quality rules to support effective digital transformation and data‑driven decision making.

Big Data · Data Management · Data Quality
0 likes · 13 min read
dbaplus Community
Sep 3, 2023 · Big Data

How NetEase Yanxuan Migrated from Lambda to Iceberg for Seamless Batch‑Stream Integration

This article explains how NetEase Yanxuan upgraded its legacy Lambda architecture to an Iceberg‑based batch‑stream unified platform, detailing the original data pipeline, the challenges faced, the evaluation of Iceberg versus Hudi and DeltaLake, and the concrete engineering optimizations and governance measures implemented to achieve lower latency and higher query performance.

Batch-Stream Integration · Big Data · Flink
0 likes · 14 min read
DataFunTalk
Sep 3, 2023 · Big Data

Evolution of OLAP at Xingyun Retail Credit Using Apache Doris

This article details how Xingyun Retail Credit transitioned from traditional data warehouses to an Apache Doris‑based OLAP solution, covering data demand generation, OLAP engine selection challenges, multi‑stage implementation, performance optimizations, data‑warehouse construction, real‑world use cases, and future roadmap.

Apache Doris · Big Data · Data Warehouse
0 likes · 16 min read
DataFunSummit
Sep 2, 2023 · Big Data

Practical Experience of Bilibili's Big Data Cluster Mixed Deployment Architecture

This article details Bilibili's offline big‑data cluster challenges, the mixed‑deployment architecture that combines offline and online resources, the Amiya service's over‑commit and eviction mechanisms, performance optimizations, monitoring strategies, and future plans to further improve resource utilization and scheduling.

Amiya · Big Data · Bilibili
0 likes · 14 min read
DataFunTalk
Aug 30, 2023 · Big Data

Design and Implementation of Baidu Cloud Block Storage EC System for Large‑Scale Data

This article presents Baidu Cloud's block storage architecture, comparing replication and erasure‑coding fault‑tolerance methods, detailing the challenges of applying EC to mutable block data, and describing a two‑layer append‑engine solution with selective 3‑replica caching, cost‑benefit compaction, and performance optimizations for low‑cost, high‑throughput storage.

Big Data · Storage Architecture · append engine
0 likes · 14 min read
ByteDance Data Platform
Aug 30, 2023 · Big Data

How We Cut Offline Data Warehouse SLA Delay from 13 Days to Zero with DataLeap

The article details how the "Xingfu Li" real‑estate platform tackled a 13‑day offline data‑warehouse SLA delay by adopting Volcano Engine's DataLeap suite, outlining the challenges, the three‑step governance process, and the measurable improvements achieved across task coverage, alert reduction, and data stability.

Big Data · Data Warehouse · DataLeap
0 likes · 10 min read
JD Tech
Aug 30, 2023 · Databases

A Comprehensive Overview of Database Evolution, Types, and Data Structure Design Techniques

This article explains key database terminology, traces the history of database technologies, compares relational, NoSQL, NewSQL, OLTP/OLAP, columnar, time‑series and graph databases, and demonstrates practical data‑structure designs such as zipper tables, bit operations, bitmaps, bloom filters, and ring queues for software development.

Big Data · Data Structures · Databases
0 likes · 27 min read
Alibaba Cloud Big Data AI Platform
Aug 30, 2023 · Big Data

How Transaction Table2.0 Cuts Data Deduplication Costs by 98% in MaxCompute

This article explains how Renliji's data warehouse team leveraged MaxCompute's Transaction Table2.0 to dramatically reduce incremental data deduplication costs and execution time, while also introducing efficient small‑file merging, time‑travel queries, and future data‑sync strategies for a high‑growth HR SaaS platform.

Big Data · MaxCompute · Transaction Table2.0
0 likes · 11 min read
DataFunTalk
Aug 29, 2023 · Big Data

MaxCompute Incremental Update, Processing Architecture, and Intelligent Data Warehouse Optimizations

This article presents a comprehensive overview of MaxCompute's incremental update and processing architecture, the design of intelligent materialized views, and the engine's adaptive execution optimizations, detailing the integrated near‑real‑time and batch pipelines, transactional table 2.0, and practical Q&A.

Big Data · Data Warehouse · MaxCompute
0 likes · 21 min read
DataFunSummit
Aug 28, 2023 · Big Data

Building Data Production Pipelines with DataOps: Concepts, Practices, and a Six‑Stage Workflow

This article introduces DataOps, outlines its background and the problems it addresses, describes NetEase’s big‑data product ecosystem, and details a six‑stage data production pipeline (coding, orchestration, testing, code review, release approval, and deployment), along with insights from two pipeline explorations.

Big Data · Data Quality · DataOps
0 likes · 15 min read
DataFunTalk
Aug 28, 2023 · Big Data

Practical Experience of an E‑commerce Platform’s Offline and Real‑time Data Warehouse

This article shares the practical architecture, technology selection, implementation details, and evolution of an e‑commerce platform’s offline and real‑time data warehouses, covering data modeling, processing pipelines, system components such as Hive, Spark, Flink, ClickHouse, Doris, and Hudi, and the lessons learned from multiple production deployments.

Big Data · ClickHouse · Data Warehouse
0 likes · 18 min read
Volcano Engine Developer Services
Aug 25, 2023 · Cloud Native

How ByteDance Scaled with Multi‑Cloud: Lessons from Their Cloud‑Native Journey

ByteDance’s multi‑cloud evolution, driven by rapid business growth, cost control, and compliance needs, showcases a distributed cloud‑native platform built on open‑source orchestration, unified resource management, and advanced data‑lake solutions, while addressing operational complexity, interoperability, and emerging AI‑driven challenges.

AI · Big Data · Kubernetes
0 likes · 14 min read
iQIYI Technical Product Team
Aug 25, 2023 · Big Data

Venus Log Platform Architecture Evolution: From ELK to Data Lake

The Venus log platform at iQIYI migrated from an Elasticsearch‑Kibana architecture to an Iceberg‑based data lake with Trino, cutting storage and compute costs by over 70%, boosting stability by 85%, and efficiently supporting billions of daily logs in a write‑heavy, low‑query workload.

Big Data · Elasticsearch · Iceberg
0 likes · 22 min read
Tencent Cloud Developer
Aug 23, 2023 · Big Data

WeChat Experiment Platform: Architecture Design and Iceberg Lakehouse Optimization

The WeChat Experiment Platform migrated its data pipeline (60,000 metrics, 200,000 cores, 30+ PB) to an Iceberg‑based lakehouse, leveraging three‑layer metadata, fine‑grained partitioning, MERGE INTO writes, time‑travel snapshots, and skew‑handling UDFs, which cut core time by 69%, saved ~100 PB of storage, and reduced latency by up to 70%.

Big Data · Data Warehouse · Iceberg
0 likes · 18 min read
Big Data Technology & Architecture
Aug 22, 2023 · Big Data

DataOps Practices and Challenges at ByteDance: From Model to Productization

The article summarizes ByteDance's DataOps journey, detailing its mid‑platform tool and Data BP model, core performance metrics, quality, hardware and human efficiency challenges, concrete DataOps implementation, productization through DataLeap, best‑practice promotion, and future outlook for data‑driven business value.

Big Data · ByteDance · Data Platform
0 likes · 17 min read
JD Retail Technology
Aug 21, 2023 · Artificial Intelligence

ChatGPT-4 Enhances Data Analysis Efficiency and Insight Across Big Data Scenarios

This article examines how ChatGPT-4, as an advanced natural‑language‑processing model, can streamline data analysis tasks—from generating Hive table definitions and sample data to crafting complex HiveSQL queries, visualizing results, and implementing ClickHouse and Flink solutions—thereby improving efficiency, insight, and problem‑solving in big‑data environments.

Artificial Intelligence · Big Data · ChatGPT-4
0 likes · 7 min read
DataFunTalk
Aug 21, 2023 · Databases

Case Study: Building a Real‑Time Log Data Analysis Platform with Apache Doris at China Unicom

This article describes how China Unicom’s Western Innovation Research Institute designed and deployed a centralized, real‑time log analytics platform using Apache Doris, detailing the migration from Hive and ClickHouse, performance optimizations, storage cost reductions, and the resulting improvements in data ingestion, query speed, and operational efficiency.

Apache Doris · Big Data · Cold‑Hot Data Management
0 likes · 18 min read
DataFunSummit
Aug 20, 2023 · Big Data

Kuaishou Data Service System: Modeling, Architecture, and Future Directions

This article presents Kuaishou's comprehensive data service system, covering its domain modeling, evolution from custom to unified services, the Octo query engine and data preparation platform architecture, the dual data API and analysis services, and future plans for intelligence and serverless high‑performance capabilities.

Big Data · Data Platform · Data Service
0 likes · 16 min read
DataFunTalk
Aug 20, 2023 · Databases

Best Practices for Building Low‑Cost Data Lake Analytics with AnalyticDB MySQL and Serverless Spark

This article presents a comprehensive technical overview of Alibaba Cloud AnalyticDB MySQL and its Serverless Spark integration, detailing architecture, core optimizations, security enhancements, and real‑world case studies that demonstrate how to achieve cost‑effective, high‑performance data lake analytics.

AnalyticDB · Big Data · Data Lake
0 likes · 19 min read
Model Perspective
Aug 19, 2023 · Artificial Intelligence

Unlocking Hidden Patterns: How Tensor Decomposition Powers Modern AI

This article introduces tensors and tensor decomposition, explains core operations, explores CP and other factorization methods, and demonstrates Python implementations for music and movie recommendation systems, highlighting how these techniques reveal hidden structures in large‑scale data.

Big Data · CP decomposition · Python
0 likes · 15 min read