Tagged articles

Mar 9, 2023 · Big Data

Implementing Exactly-Once Semantics with Flink and Kafka: Utility Classes, Character Count Example, and Transactional Consumer

This article demonstrates how to achieve exactly‑once processing in Flink by providing Kafka I/O utility classes, a character‑count streaming example, and a transactional consumer implementation, while also discussing configuration nuances and common pitfalls.

Big Data · Exactly-Once · Flink

0 likes · 11 min read

政采云技术

Mar 9, 2023 · Fundamentals

Redesigning Data Warehouse Models: When and How to Use Dimensional Modeling

This article explains what data models are and why warehouse models need to be rebuilt, compares normalized and dimensional modeling approaches, and provides a step‑by‑step guide, covering information gathering, design, and implementation, to building efficient, maintainable data warehouse architectures.

Big Data · Data Warehouse · Database Design

0 likes · 12 min read

Architect's Tech Stack

Mar 9, 2023 · Big Data

Improving Data Warehouse Performance: From Clusters and Pre‑Computation to esProc SPL

The article analyzes the growing performance challenges of data warehouses, evaluates traditional solutions such as clustering, pre‑computation and optimization engines, and presents esProc SPL as a non‑SQL, low‑complexity alternative that delivers orders‑of‑magnitude speedups on modest hardware.

Big Data · Data Warehouse · Performance Optimization

0 likes · 16 min read

Architects Research Society

Mar 8, 2023 · Big Data

Understanding DataOps: Principles, Benefits, and Implementation

DataOps, rooted in agile and DevOps philosophies, uses automation and collaborative practices to streamline data processing, improve quality, and align analytics with business goals, delivering continuous analytics and faster insights while breaking down data silos for better decision‑making across organizations.

Big Data · Continuous Analytics · DataOps

0 likes · 10 min read

Alimama Tech

Mar 8, 2023 · Artificial Intelligence

Secure Data Hub: Alibaba's Marketing Privacy Computing Platform

Alibaba’s Secure Data Hub (SDH) is a privacy‑preserving data clean‑room platform that uses secure multi‑party computation and privacy‑enhancing machine learning to let advertisers, ad platforms, and auditors jointly analyze marketing data via a simple SQL API while keeping raw data encrypted, column‑level protected, and confined to each party’s private domain.

Big Data · data clean room · sql

0 likes · 13 min read

DataFunTalk

Mar 8, 2023 · Artificial Intelligence

Applying AI Algorithms to Big Data Governance: Use Cases and Future Directions

This article presents Datacake's experience of integrating AI algorithms into big data governance, covering the bidirectional relationship between AI and big data, health‑score assessment of data tasks, intelligent Spark parameter tuning, SQL engine selection, and future application scenarios across the data lifecycle.

AI · Big Data · Spark

0 likes · 18 min read

Architects Research Society

Mar 7, 2023 · Big Data

Best Open‑Source ETL Tools: Detailed Comparison and Recommendations

This article provides an overview of the most popular ETL tools—both open‑source and commercial—explaining their core features, use cases, and how they simplify data extraction, transformation, and loading for modern data‑driven applications.

Big Data · Data Integration · Data Warehouse

0 likes · 10 min read

Big Data Technology & Architecture

Mar 7, 2023 · Big Data

Implementing Exactly-Once Kafka-to-Redis with Flink: Two-Phase Commit Sink and Bug Fixes

This tutorial explains how to achieve exactly‑once semantics when streaming data from Kafka to Redis using Apache Flink's TwoPhaseCommitSinkFunction, covering Redis transaction basics, utility classes, sink implementation, testing steps, and solutions to common connection and transaction bugs.

Big Data · Exactly-Once · Flink

0 likes · 11 min read

政采云技术

Mar 7, 2023 · Databases

Data Warehouse Modeling: Concepts, Methods, and Implementation

This article explains what data models are, why model refactoring is necessary, compares normalized and dimensional data warehouse modeling approaches, and details a three‑step implementation process—including information research, model design, and model deployment—while highlighting best‑practice naming conventions and practical examples.

Big Data · Data Warehouse · Database Design

0 likes · 14 min read

Baidu Geek Talk

Mar 6, 2023 · Big Data

Accelerating Data Production and Consumption in Baidu's Performance Platform

Baidu's Performance Platform speeds data production and consumption by adopting a unified stream‑batch architecture with TM and Spark, leveraging the Turing warehouse, introducing tiered service grading, robust governance and compliance measures, and offering self‑service analytics, cutting latency from minutes or days to milliseconds while handling billions of daily records and boosting SLA adherence, data accuracy, and user satisfaction.

Big Data · Data engineering · Real-time Processing

0 likes · 12 min read

Architects Research Society

Mar 5, 2023 · Big Data

Best Open‑Source and Commercial ETL Tools: Detailed Comparison

This article introduces the concept of ETL, explains its importance for modern data‑driven applications, and provides a comprehensive comparison of the most popular open‑source and commercial ETL platforms—including their key features, supported data sources, and deployment options—helping readers choose the right tool for their data integration needs.

Big Data · Data Integration · Data Warehouse

0 likes · 19 min read

DataFunSummit

Mar 3, 2023 · Artificial Intelligence

Intelligent Risk Control System Architecture and Development Trends

This article introduces the architecture of intelligent risk control, detailing its four-layer structure, the underlying data, feature, model, and decision components, platform interactions, and future development trends, highlighting how AI and big data enhance risk management efficiency and accuracy.

Big Data · Decision Systems · Feature Engineering

0 likes · 12 min read

Alibaba Cloud Big Data AI Platform

Mar 3, 2023 · Big Data

How Alibaba Cloud EMR Evolved from Open‑Source Compatibility to Enterprise‑Grade Performance

This article outlines Alibaba Cloud EMR's three‑stage evolution—compatibility, contribution, and beyond open source—detailing its early Hadoop adoption, Flink and Spark innovations, cloud‑native optimizations, and enterprise‑grade features such as Remote Shuffle Service, performance benchmarks, and integrated diagnostics.

Alibaba Cloud · Big Data · EMR

0 likes · 13 min read

Huolala Tech

Mar 2, 2023 · Big Data

Building a Unified Data Warehouse for Moving Services: Boosting Efficiency and Data Quality

This article details the challenges of fragmented ODS data in the moving‑service domain and explains how a dedicated public‑layer data warehouse, with layered architecture and quality monitoring, was designed and implemented to improve data reuse, reduce redundancy, and stabilize downstream analytics.

Big Data · Data Quality · Data Warehouse

0 likes · 15 min read

DataFunSummit

Mar 2, 2023 · Big Data

Huya's Data Self‑Service Product: Challenges, Design, and Practice

The article presents Huya's data‑self‑service product, describing the problems of traditional data services, the principles of a good data service, the MVP implementation, architectural components, project outcomes, and future evolution, while also addressing common Q&A scenarios.

Big Data · Data Product · Data engineering

0 likes · 12 min read

Programmer DD

Mar 2, 2023 · Backend Development

Why DolphinScheduler Is the Next Powerhouse for Distributed Task Management

DolphinScheduler is an open‑source distributed task scheduling system that supports multiple task types, offers visual workflow orchestration and monitoring, and scales to thousands of servers, making it a robust solution for backend and big‑data processing scenarios.

Big Data · Distributed Scheduling · DolphinScheduler

0 likes · 4 min read

DataFunTalk

Mar 2, 2023 · Artificial Intelligence

DataFun Summit 2023 – Knowledge Graph Online Summit

DataFun Summit 2023’s Knowledge Graph Online Summit, held on March 18, brings together leading experts from academia and industry to present six forums covering unified knowledge representation, large‑scale graph construction, massive knowledge storage, KG‑based QA, KG‑AIGC integration, and best‑practice industry applications, with free live streaming registration via QR code.

AI · Big Data · DataFun

0 likes · 36 min read

DataFunSummit

Mar 1, 2023 · Big Data

Data Governance: Challenges, Framework, and Implementation Practices

This article explains the problems that data governance addresses, outlines a comprehensive governance framework—including system architecture, processes, and policies—and describes practical implementation steps such as integrated tooling, standardized modeling, metadata management, lake‑in and lake‑out governance, and organizational structures for sustainable data management.

Big Data · Governance Framework · metadata management

0 likes · 12 min read

DataFunTalk

Mar 1, 2023 · Databases

Evolution and Optimization of Tencent Music Content Library Data Platform: From Architecture 1.0 to 4.0

This article details the evolution of Tencent Music's content library data platform from version 1.0 to 4.0, describing business requirements, architectural redesigns—including migration from ClickHouse to Apache Doris, introduction of a semantic layer, and extensive write, query, and cost optimizations—while sharing practical lessons and future directions.

Apache Doris · Big Data · Data Warehouse

0 likes · 21 min read

Big Data Technology & Architecture

Feb 28, 2023 · Big Data

Comprehensive Guide to Dual‑Stream Join in Flink CDC with Java DataStream API

This article provides a detailed tutorial on implementing various dual‑stream join techniques—including processing‑time, event‑time, and interval joins—using Flink CDC 2.2 and Flink 1.14 with the Java DataStream API, complete with code examples, SQL setup, and execution results.

Big Data · CDC · DataStream

0 likes · 31 min read

macrozheng

Feb 28, 2023 · Big Data

How Tencent Music Scaled Its Content Data Platform with Apache Doris: From ClickHouse to 4.0 Architecture

This article details the evolution of Tencent Music's content data platform from version 1.0 to 4.0, describing the migration from ClickHouse to Apache Doris, the introduction of a semantic layer, optimization of data ingestion, query performance, and cost reduction strategies that dramatically improved data timeliness, operational efficiency, and storage costs.

Apache Doris · Big Data · Data Architecture

0 likes · 23 min read

DataFunTalk

Feb 27, 2023 · Big Data

Comprehensive Overview of Data Middle Platform Architecture and Its Core Frameworks

This article provides a detailed overview of data middle platform concepts, describing a decoupled six‑subsystem architecture—including storage, collection, processing, governance, security, and operation frameworks—while illustrating typical enterprise implementations, industry‑specific solutions, and best‑practice considerations for building scalable, secure, and value‑driven data platforms.

Big Data · Data Integration · Data Platform

0 likes · 25 min read

Programmer DD

Feb 27, 2023 · Big Data

Why Hadoop/Spark Feel Heavy and How SPL Offers a Lightweight Big Data Solution

With data volumes soaring, traditional Hadoop and Spark clusters become costly and cumbersome for small to medium workloads, prompting many to seek lighter alternatives; this article examines the technical, operational, and financial burdens of Hadoop/Spark and introduces the open‑source SPL engine as a fast, low‑cost, easy‑to‑use big‑data solution.

Big Data · Hadoop · Spark

0 likes · 16 min read

NetEase Yanxuan Technology Product Team

Feb 27, 2023 · Big Data

How NetEase Yanxuan Migrated from Lambda to Iceberg for Real‑Time Batch‑Stream Integration

This article details how NetEase Yanxuan transformed its data platform from a dual Lambda architecture to a unified batch‑stream solution built on Apache Iceberg, covering the original challenges, the evaluation of Iceberg versus Hudi and Delta Lake, implementation of stream‑batch pipelines, message ordering fixes, snapshot generation, and extensive table‑governance optimizations.

Apache Flink · Apache Spark · Batch-Stream Integration

0 likes · 14 min read

DataFunTalk

Feb 26, 2023 · Big Data

Design, Optimization, and Use Cases of Data Lineage in ByteDance's DataLeap Platform

This article presents an in‑depth overview of DataLeap's data lineage capabilities, covering the challenges, multi‑layer model design, implementation with Apache Atlas and JanusGraph, performance optimizations, diverse use cases across asset, development, governance and security domains, and future trends for lineage technology.

Apache Atlas · Big Data · Data Platform

0 likes · 19 min read

21CTO

Feb 25, 2023 · Big Data

Which IT Skills Earn Over $140K? 2023’s Top-Paying Tech Expertise Revealed

Based on Dice’s 2023 Tech Salary Report, the article lists the ten highest‑earning IT skill sets in the U.S., detailing average salaries—often exceeding $140,000—and explains why expertise in areas such as containers, Kubernetes, PaaS, Redis, Teradata, Kafka, Elasticsearch, and Go commands premium pay.

2023 · Big Data · Cloud Computing

0 likes · 10 min read

DataFunTalk

Feb 25, 2023 · Big Data

T3 Travel’s Modern Data Stack and Feature Platform: Architecture and Practices

This article details T3 Travel’s exploration of the Modern Data Stack, describing its four‑point overview, business scenarios, the initial MDS implementation using Apache Hudi and Kyuubi, and the design of a feature platform that integrates Metricflow, Feast, and other components to support data processing, analytics, and machine‑learning workflows.

Apache Hudi · Big Data · Data Lake

0 likes · 22 min read

DeWu Technology

Feb 24, 2023 · Big Data

Real-Time Data Architecture Evolution for a Complex Supply Chain

The article traces Dewu’s supply‑chain data platform from slow MySQL reporting through early CDC‑based wide tables to a Flink‑Kafka‑ClickHouse 1.0 design, then to a more scalable Flink‑Kafka‑Hologres 2.0 architecture that solves upsert and compute‑storage separation, while detailing key operational tricks, code‑generation tools, and future plans for lake‑house integration.

Big Data · ClickHouse · Flink

0 likes · 10 min read

StarRing Big Data Open Lab

Feb 24, 2023 · Big Data

What Makes MPP Databases the Powerhouse Behind Modern Data Analytics?

MPP (Massively Parallel Processing) databases, designed for large‑scale analytical workloads, use distributed, shared‑nothing architectures with multiple control and compute nodes, offering high scalability, diverse data‑sharding strategies, and powerful SQL compatibility, as illustrated by vendors like Teradata, Vertica, Greenplum, and emerging open‑source solutions.

Big Data · Distributed computing · Greenplum

0 likes · 15 min read

DataFunTalk

Feb 24, 2023 · Big Data

Presto and Alluxio Integration for Iceberg: Architecture, Best Practices, and Future Work

This article explains how Presto and Alluxio work together to query Iceberg tables, describes their architectures, deployment options, best‑practice recommendations such as using Iceberg native catalogs and local caches, and outlines future research directions for improving CPU usage and off‑heap caching.

Alluxio · Big Data · Cache

0 likes · 14 min read

Big Data Technology & Architecture

Feb 24, 2023 · Big Data

Common Flink Task Submission Issues and Solutions on YARN

This article compiles frequent Flink job submission problems on YARN, including WordCount jar errors, HBase dependency conflicts, MySQL timeouts, checkpoint restoration failures, parallelism limits, and unexpected container termination, and provides root‑cause analysis and step‑by‑step remediation instructions.

Big Data · Checkpoint · Flink

0 likes · 21 min read

JD Cloud Developers

Feb 23, 2023 · Big Data

How to Build a Local Hadoop & Spark Cluster from Scratch (Step‑by‑Step Guide)

This comprehensive tutorial walks you through setting up a three‑node Hadoop 3.3.4 and Spark 3.3.1 environment on CentOS 7 virtual machines, covering system preparation, JDK and Scala installation, ZooKeeper configuration, Hadoop and Spark deployment, and verification with practical command‑line examples.

Big Data · Cluster Setup · Hadoop

0 likes · 10 min read

Architects Research Society

Feb 21, 2023 · Big Data

Comparing Apache Spark and Apache Flink: Origins, Architecture, and Processing Models

This article examines the evolution, architectural differences, data and processing models, stateful handling, and programming APIs of Apache Spark and Apache Flink, highlighting their strengths, limitations, and the challenges of big‑data development and operations in the modern data‑driven era.

Batch processing · Big Data · Data Engine

0 likes · 18 min read

DataFunTalk

Feb 21, 2023 · Databases

Building a Stream‑Batch Integrated Data Architecture with Apache Doris at SelectDB

This article details how SelectDB’s data technology architect designed and implemented a new stream‑batch unified data platform using Apache Doris, covering the shortcomings of the early CDH‑based architecture, the selection process, data modeling, ingestion pipelines, performance testing, operational optimizations, and future plans.

Apache Doris · Batch processing · Big Data

0 likes · 17 min read

dbaplus Community

Feb 20, 2023 · Databases

Why Teradata Is Leaving China and Which Domestic Data Warehouses Can Fill the Gap

Teradata announced its withdrawal from China due to geopolitical uncertainty and rising competition from mature domestic data‑warehouse solutions, prompting a detailed analysis of its architecture, the main Chinese warehouse designs, global market positioning, and migration tools for replacing Teradata.

Big Data · Data Warehouse · GBase

0 likes · 10 min read

ITPUB

Feb 20, 2023 · Databases

Why Teradata Is Leaving China and What It Means for the Domestic Data Warehouse Market

Teradata's withdrawal from China, driven by geopolitical tensions and the rise of mature domestic data‑warehouse solutions, prompts a detailed look at its MPP architecture, the three main Chinese warehouse designs, Gartner market positioning, and migration tools for alternatives like GBase 8a and GaussDB DWS.

Big Data · Data Warehouse · GBase

0 likes · 9 min read

DataFunSummit

Feb 20, 2023 · Product Management

Evaluating the Value of Data Products: Scenarios, Frameworks, and Improvement Methods

This article explains why data product value assessment is essential, outlines common usage scenarios and a DBA evaluation framework, describes quantitative methods such as usage, business, and data‑driven metrics, and offers practical ways to enhance data product value through metric optimization, high‑value direction selection, and resource allocation.

Big Data · Data Product · Metrics

0 likes · 13 min read

DataFunTalk

Feb 20, 2023 · Big Data

Understanding Data Lakes and Their Application at iQIYI: Concepts, Scenarios, and Iceberg Implementation

This article explains the definition of data lakes (public‑cloud and non‑public‑cloud), outlines their key characteristics, presents three typical business scenarios—real‑time event analysis, change‑data analysis, and stream‑batch integration—summarizes required product features, evaluates open‑source lake formats, and details iQIYI's adoption of Apache Iceberg across multiple services to achieve low‑latency, large‑scale, cost‑effective analytics.

Big Data · Data Lake · Iceberg

0 likes · 23 min read

Alibaba Cloud Big Data AI Platform

Feb 20, 2023 · Big Data

How Alibaba’s DataWorks Transforms Data Governance for Efficiency, Security, and Cost Savings

This article explores Alibaba's DataWorks platform and its comprehensive data governance practices, covering application efficiency, security controls, cost optimization, organizational structure, and cultural initiatives that together enable scalable, secure, and cost‑effective data management across the enterprise.

Big Data · DataWorks · cost optimization

0 likes · 31 min read

DataFunTalk

Feb 18, 2023 · Big Data

Xiaomi Data Governance Evolution: Cost Governance Practices for HDFS and HBase

The article outlines Xiaomi's data governance journey, focusing on storage‑service cost governance, describing the transition from simple cost‑centered governance to big‑data‑driven asset management, and detailing concrete HDFS and HBase practices that achieved significant resource and cost reductions.

Big Data · HBase · HDFS

0 likes · 15 min read

DataFunSummit

Feb 17, 2023 · Big Data

Data Governance Practices and Platform Construction with Alibaba DataWorks

Alibaba’s DataWorks team shares extensive experiences in building and operating a large‑scale data platform, covering data governance across stages—from data stability and quality to security, cost control, and organizational culture—illustrating how systematic practices and tools drive efficiency, reliability, and value for enterprises.

Big Data · Data Platform · cost optimization

0 likes · 55 min read

DataFunTalk

Feb 17, 2023 · Big Data

Tencent Alluxio (DOP) Deployment and Optimization in Financial Data Analytics

This article describes how Tencent's Alluxio-based Data Orchestration Platform (DOP) was applied to financial analytics, detailing the business background, challenges of large‑scale OLAP workloads, the Alluxio architecture and usage modes, performance results, and the series of optimizations and tuning performed to achieve significant speedups.

Alluxio · Big Data · Data Orchestration

0 likes · 15 min read

Tencent Advertising Technology

Feb 17, 2023 · Big Data

Cost Optimization and Mixed‑Resource Deployment in Tencent's Taiji Machine Learning Platform

The article details how Tencent's Taiji machine‑learning platform reduces training costs and improves efficiency for large‑scale advertising models by leveraging cloud‑native mixed‑resource strategies—including online idle, offline elastic, and compute‑resource sharing—while maintaining high service stability through advanced scheduling, fault‑tolerance, and resource‑prediction techniques.

Big Data · Machine Learning Platform · Tencent

0 likes · 16 min read

DataFunSummit

Feb 16, 2023 · Artificial Intelligence

Curated Collection of Articles on AI‑Powered Smart Medicine

This guide introduces the challenges in healthcare, explains how artificial intelligence is already reshaping the field, and provides a curated list of recent articles on smart medicine for readers to explore the emerging AI‑healthcare integration.

AI · Big Data · Healthcare

0 likes · 4 min read

DataFunSummit

Feb 16, 2023 · Big Data

JD Real-Time Data Product Practice: Overview, Low‑Code Platform, Stream‑Batch Integration, and Operations

This article summarizes JD's real‑time data product practice, covering product overview, low‑code real‑time platform construction, stream‑batch integrated architecture, and the three‑layer operational defense model, while highlighting challenges, evolution, user distribution, and future directions.

Big Data · Low‑code platform · Real-time Data

0 likes · 13 min read

Kuaishou Big Data

Feb 15, 2023 · Big Data

Kuaishou’s Data Application Factory: Boosting BI with Low‑Code & Unified Queries

This article details how Kuaishou’s Data Application Factory tackles the challenges of rapid BI delivery, data accuracy, and service stability by leveraging low‑code development, unified query services, standardized configurations, and service isolation to achieve efficient, high‑quality data products across multiple business lines.

BI · Big Data · Low‑code

0 likes · 16 min read

Alimama Tech

Feb 15, 2023 · Big Data

Dolphin: Alibaba's Hyper‑Converged Multi‑Modal Big Data Engine Overview

Dolphin, Alibaba’s hyper‑converged multi‑modal big‑data engine, unifies OLAP, AI, streaming, and batch workloads on a decoupled compute‑storage MPP foundation, offering a Dolphin SQL layer, advanced bitmap/GroupTable/AFile indexes, intelligent materialization, and one‑write‑multiple‑read storage that cuts costs over 70% while delivering sub‑millisecond queries on trillion‑row datasets.

AI · Big Data · OLAP

0 likes · 14 min read

Big Data Technology & Architecture

Feb 15, 2023 · Big Data

Flink Multi-Stream Union Operations and Event-Time Sorting

This article explains how to use Flink's DataStream.union() to combine multiple streams of the same type, demonstrates Maven project setup and code examples for simple unions and for unions with custom event-time sorting, and shows the resulting ordered output.

Big Data · DataStream · EventTime

0 likes · 15 min read

DataFunTalk

Feb 15, 2023 · Big Data

Alluxio Deployment at Ant Group: Stability Building, Performance Optimization, and Scale‑up for Large‑Scale Model Training

This article summarizes how Ant Group introduced Alluxio to address storage I/O, capacity, and latency challenges in large‑scale model training, detailing stability improvements through worker‑register follower and master migration, performance gains via follower‑only reads, and horizontal scaling using metadata sharding and multi‑cluster deployment.

Alluxio · Big Data · Performance Optimization

0 likes · 15 min read

ByteDance Data Platform

Feb 15, 2023 · Databases

How ByteHouse Powers Real‑Time Data Warehousing at Scale

ByteHouse, a cloud‑native data warehouse built on ClickHouse, delivers ultra‑fast real‑time and massive offline analytics with elastic scaling, addressing business needs in ByteDance and the financial sector through optimized architecture, ROI‑driven monitoring, and comprehensive operational tools.

Big Data · ByteHouse · ClickHouse

0 likes · 16 min read

Data Thinking Notes

Feb 14, 2023 · Big Data

How Cloud Music Turned 60k Tables into Valuable Data Assets

This article details Cloud Music's year‑long data assetization journey, covering the background, practical achievements, governance methods, and future roadmap for turning massive data warehouses into high‑value, well‑governed assets that drive cost reduction and business insight.

Big Data · Data Platform · Data Warehouse

0 likes · 10 min read

Alibaba Terminal Technology

Feb 14, 2023 · Artificial Intelligence

How ChatGPT Is Reshaping Front‑End Development and Data Engineering

This article reflects on the rapid rise of ChatGPT, reviews key AI concepts and high‑quality external resources, analyzes its current limitations, and explores how the technology is transforming front‑end development, big‑data workflows, and engineers' daily practices, offering practical advice for adapting to the AI‑driven future.

Big Data · productivity

0 likes · 18 min read

DataFunSummit

Feb 13, 2023 · Big Data

ClickHouse in Self‑Service Analytics: Architecture, Optimization Practices and Future Roadmap at ZuanZuan Platform

This article details how ZuanZuan leveraged ClickHouse as the core OLAP engine for its massive self‑service analytics platform, covering OLAP engine selection criteria, system architecture, real‑world use cases, performance tuning, operational challenges, and future development plans.

Analytics · Big Data · ClickHouse

0 likes · 16 min read

DataFunSummit

Feb 12, 2023 · Big Data

Applying Erasure Coding in HDFS: Strategies, Performance, and Repair Techniques

This article explains how Zhihu adopted HDFS erasure coding to reduce storage costs, outlines cold‑hot file tiering policies, describes the EC conversion workflow and the custom EC Worker tool, and details methods for detecting and repairing damaged EC files in a Hadoop environment.

Big Data · Data Storage · HDFS

0 likes · 16 min read

DataFunTalk

Feb 12, 2023 · Big Data

Optimizing Bilibili Presto Cluster Query Performance with Alluxio and Local Cache

This article presents a comprehensive technical overview of Bilibili's Presto cluster architecture, the challenges of query performance on Hadoop, and the systematic optimizations—including Alluxio integration, local cache mechanisms, multi‑active coordinators, label‑based scheduling, and real‑time penalties—that together improve availability, stability, and latency for large‑scale analytics workloads.

Alluxio · Big Data · Cache

0 likes · 23 min read

Big Data Technology & Architecture

Feb 10, 2023 · Big Data

The Most Comprehensive Big Data Interview Preparation Handbook

This article presents a curated collection of big‑data learning resources, including interview guides, in‑depth articles on Flink, Spark, Hive, ClickHouse, data governance, and personal growth, offering readers a one‑stop reference to boost their big‑data expertise and interview readiness.

Big Data · Flink · Hive

0 likes · 5 min read

Big Data Technology & Architecture

Feb 9, 2023 · Big Data

The Most Comprehensive Big Data Interview Preparation Handbook and Resource Collection

This article presents a curated collection of the most comprehensive big‑data interview preparation resources, including expert guides, tutorials, and deep‑dive articles on Flink, Spark, Hive, ClickHouse, data governance, and related topics, accompanied by a call to engage with the content.

Big Data · ClickHouse · Flink

0 likes · 4 min read

Sohu Tech Products

Feb 8, 2023 · Big Data

Design and Implementation of a General H5 User Behavior Tracking and Data Warehouse Model

This article presents a comprehensive H5 (HTML5) tracking solution that details the planning of event‑collection points, the full data‑warehouse modeling process—including schema design, retention calculations, and SQL implementations—and the automatic data‑capture mechanisms needed to improve user‑behavior analysis efficiency across the product lifecycle.

Big Data · Data Warehouse · H5 analytics

0 likes · 17 min read

Architects' Tech Alliance

Feb 8, 2023 · Artificial Intelligence

Computing‑in‑Memory (CiM) Technology: Concepts, History, Advantages, Classifications and Application Scenarios

This article provides a comprehensive overview of Computing‑in‑Memory technology, covering its definition, historical evolution, performance advantages over traditional von Neumann architectures, various technical classifications, storage‑media choices, market drivers, and its pivotal role in AI and big‑data workloads across edge, cloud and automotive domains.

AI acceleration · Big Data · computing-in-memory

0 likes · 17 min read

DataFunSummit

Feb 8, 2023 · Product Management

Content‑Driven Data Product Management: Challenges, Governance Frameworks, and Implementation Strategies

This article shares practical insights from a data product expert on the problems faced by content‑oriented data products, outlines a comprehensive governance methodology—including DAMA, Huawei, and Alibaba frameworks—and demonstrates how to operationalize these ideas through concrete examples such as event‑tracking and metric governance.

Big Data · Data Product Management · Methodology

0 likes · 16 min read

StarRing Big Data Open Lab

Feb 8, 2023 · Big Data

Why MapReduce and Spark Still Matter: A Deep Dive into Distributed Computing

Distributed computing splits massive tasks across multiple servers, and this article explains the classic MapReduce batch engine and the modern Spark framework, covering their architectures, strengths, limitations, and evolution, while highlighting key features like fault tolerance, in‑memory processing, and real‑time streaming capabilities.

Big Data · Distributed computing · MapReduce

0 likes · 12 min read

Alibaba Cloud Big Data AI Platform

Feb 8, 2023 · Big Data

How Alibaba Cloud EMR 2.0 Redefines Open‑Source Big Data Platforms

This article summarizes Alibaba Cloud senior product expert He Yuan's presentation on EMR 2.0, outlining the challenges of open‑source big data, the evolution of EMR, and the new features—including cloud‑native architecture, enhanced performance, diverse resource models, and expanded analysis scenarios—aimed at reducing cost and complexity.

Alibaba Cloud · Big Data · Data Lake

0 likes · 11 min read

Youzan Coder

Feb 7, 2023 · Big Data

Automated Offline Data Cost Optimization in Youzan's Data Platform

Youzan built an automated offline data cost‑optimization platform that gathers accurate metadata, mines unused or failing tables and tasks, and safely decommissions them through a backend‑frontend workflow with owner validation, notifications, rollback safeguards, and plans to extend lineage coverage and real‑time asset handling.

Big Data · Cost Reduction · Pipeline Automation

0 likes · 11 min read

Data Thinking Notes

Feb 6, 2023 · Big Data

How Tencent Tackles Data Governance Challenges with the WeData Platform

This article outlines Tencent's data governance challenges, its internal three‑stage practice, detailed case studies such as Tencent News and PCG cost governance, and introduces the WeData platform's architecture and tools for standardization, quality, security, and metadata management, concluding with a Q&A session.

Big Data · Data Platform · Tencent

0 likes · 17 min read

Python Programming Learning Circle

Feb 6, 2023 · Big Data

Reproducing Google Ngram Viewer Trends with Python, NumPy, and PyTubes

This article demonstrates how to download the Google 1‑gram dataset, load the ~1.4 billion rows with Python and NumPy (using the PyTubes library), compute yearly word frequencies, visualize the rise of "Python" and compare it with Pascal and Perl, while discussing performance challenges and future improvements.

Big Data · Data Analysis · Google Ngram

0 likes · 8 min read

Big Data Technology & Architecture

Feb 6, 2023 · Big Data

Real-Time Data Warehouse Solutions with Hudi: Scenarios, Challenges, and Optimizations

This article presents an in‑depth overview of real‑time data‑warehouse scenarios, discusses challenges such as timeliness, update efficiency, and resource consumption, and details practical solutions using Apache Hudi, Flink, Presto, and related optimizations for ingestion, indexing, compaction, and query performance.

Big Data · Data Lake · Flink

0 likes · 17 min read

Big Data Technology & Architecture

Feb 4, 2023 · Big Data

Apache Linkis Graduates to Top-Level Project – Overview, Core Features, Roadmap, and Ecosystem

The article announces Apache Linkis’s graduation to an Apache top‑level project, explains its role as a computing middleware linking applications to engines like Spark, Hive, and Flink, details its core capabilities, roadmap, ecosystem integrations, and provides official resources for the community.

Apache · Big Data · Computing Middleware

0 likes · 8 min read

DataFunTalk

Feb 4, 2023 · Big Data

Design and Practice of Tencent Lighthouse Fusion Analysis Engine

This article presents the design and implementation of Tencent Lighthouse's Fusion Analysis Engine, covering its background, challenges, fusion architecture, kernel optimizations, acceleration techniques, practical outcomes, and future evolution directions for high‑performance data access.

Big Data · Fusion Engine · Lighthouse

0 likes · 12 min read

Kuaishou Big Data

Feb 3, 2023 · Big Data

Inside Kuaishou’s Company‑Wide Metric Platform: Architecture, Lessons & Best Practices

This article details Kuaishou’s three‑year evolution of its metric middle platform, covering the data infrastructure, key challenges of data inconsistency and low analysis efficiency, the enterprise‑level OneMetric solution, architectural design, development phases, practical lessons, system implementation, and real‑world applications.

Big Data · Data engineering · Kuaishou

0 likes · 23 min read

Java High-Performance Architecture

Feb 3, 2023 · Big Data

How to Use Alibaba DataX for Efficient MySQL Data Synchronization

This guide explains how to install DataX, set up MySQL environments, configure JSON job files, and run both full and incremental data synchronization between heterogeneous databases using DataX's Reader/Writer framework and job scheduling features.

Big Data · DataX · ETL

0 likes · 14 min read

DataFunTalk

Feb 2, 2023 · Big Data

SeaTunnel: Design Goals, Current Status, Architecture, and Future Roadmap

This article provides a comprehensive overview of Apache SeaTunnel, covering its design objectives, current capabilities such as multi‑engine support and extensive connector ecosystem, detailed architecture including engine‑independent APIs and execution flows, and outlines the upcoming roadmap to expand connectors, launch a visual web UI, and introduce a dedicated SeaTunnel Engine.

Apache · Batch processing · Big Data

0 likes · 12 min read

DataFunTalk

Jan 31, 2023 · Big Data

Tencent's Data Governance Practices and Technical Implementation

This article presents Tencent's comprehensive data governance framework, covering its definition, objectives, challenges, methodology, organizational structure, metadata management, data asset lifecycle, security measures, and technical implementation details such as microservice architecture, data collection, lineage analysis, and storage solutions.

Big Data · Tencent · data governance

0 likes · 19 min read

DataFunTalk

Jan 31, 2023 · Big Data

SPI Refactoring Practice in Apache InLong Manager to Reduce Maintenance Cost and Enhance Extensibility

This article presents the SPI-based refactoring of Apache InLong Manager, describing the project's background, existing maintenance challenges, the concept of Java Service Provider Interface, the concrete implementation steps, code restructuring, and the resulting benefits such as higher code reuse, easier extension, and reduced DDL changes.

Apache InLong · Big Data · Code Refactoring

0 likes · 10 min read

Bilibili Tech

Jan 31, 2023 · Big Data

Design and Optimization of Real-Time Data Quality Control (DQC) Platform on Bilibili's Big Data System

Bilibili redesigned its real-time data-quality control platform by replacing per-rule Flink jobs with a unified, dynamically-configured architecture that classifies Kafka topics, aggregates via InfluxDB full-table and continuous queries, mitigates data inflation, adds a high-performance proxy, and implements robust monitoring and recovery to ensure scalable, reliable data quality for its big-data services.

Big Data · DQC · Flink

0 likes · 22 min read

DataFunTalk

Jan 30, 2023 · Big Data

Data Governance Strategies: Principles, Practices, and Real‑World Case Studies

The article explains why data governance is essential for high‑quality data in big‑data organizations, outlines narrow and broad governance scopes, presents strategic principles, and shares eight detailed case studies from leading Chinese tech companies illustrating practical implementation and lessons learned.

Big Data · data governance

0 likes · 7 min read

Data Thinking Notes

Jan 29, 2023 · Big Data

How to Turn Data Assets into Business Value: A Roadmap for Enterprises

Enterprises must shift their perception of data assets and embed data‑value into every digital process, establishing governance, unified asset catalogs, operational metrics, security controls, integration, services, and visualization to transform raw data into strategic business outcomes.

Big Data · Data Integration · Data Security

0 likes · 12 min read

DataFunSummit

Jan 29, 2023 · Big Data

Data Serviceization at JD: From Zero to One and Beyond

This article presents JD's data service platform, describing its origin, performance optimizations, flexible API generation, caching strategies, service orchestration, and governance, and includes a Q&A that addresses security, performance, and multi‑source data handling challenges.

API · Big Data · Caching

0 likes · 11 min read

DataFunTalk

Jan 28, 2023 · Big Data

Data Lake vs Data Warehouse: Differences, Evolution, and Integrated Lakehouse Design

This article explores the ongoing debate between data lakes and data warehouses, clarifies their distinct purposes and technologies, discusses how they can coexist or complement each other, and introduces the concept of an integrated lakehouse architecture while promoting a comprehensive data intelligence knowledge map.

Big Data · Data Lake · Data Warehouse

0 likes · 5 min read

DataFunSummit

Jan 27, 2023 · Databases

StarRocks in Youzu's Multi-Dimensional Analytics: Architecture, Advantages, and Future Plans

This article presents Youzu Network’s adoption of StarRocks for multi-dimensional analytics, detailing the historical OLAP challenges, StarRocks’ features and advantages, its application scenarios, data modeling choices, ingestion methods, performance benchmarks, and future roadmap for unified analytics.

Big Data · Flink · Kafka

0 likes · 18 min read

DataFunSummit

Jan 27, 2023 · Big Data

Data Governance Strategies: Principles, Practices, and Case Studies

The article explains the importance of data governance, distinguishes narrow and broad governance, outlines strategic principles such as systemic engineering and prioritization, and presents eight case studies from leading Chinese tech companies illustrating practical implementations and effective strategies.

Big Data · Data Management · case study

0 likes · 8 min read

Tencent Cloud Developer

Jan 26, 2023 · Operations

Technical Article Digest: Operations, AI, Web3, Rust, Big Data, and More

This technical digest surveys Tencent’s health‑code operations architecture, dissects ChatGPT’s training pipeline, contrasts Web 2.0 and Web 3.0 on Ethereum, explains AI‑generated art, details WeChat’s overload controls and QQ Music’s high‑availability design, examines the rapid scaling of the “Sheep Sheep” mini‑game, introduces Rust for front‑end developers, showcases big‑data football prediction models, and outlines common C++ pitfalls and best‑practice recommendations.

Big Data · C++ · Rust

0 likes · 7 min read

DataFunTalk

Jan 26, 2023 · Big Data

Tencent Data Governance Practices and the WeData Platform

This article outlines Tencent's data governance challenges, internal practices across three maturity stages, and introduces the WeData platform that provides comprehensive capabilities for data assetization, cost control, quality assurance, security, and metadata management to support large‑scale big‑data operations.

Big Data · Tencent · WeData

0 likes · 15 min read

DataFunTalk

Jan 26, 2023 · Big Data

Data Governance Strategies: Principles, Practices, and Real‑World Case Studies

This article explains why data is a company's most valuable asset, distinguishes narrow and broad data‑governance approaches, outlines strategic design principles, and presents eight detailed case studies from leading Chinese tech firms illustrating practical governance implementations and lessons learned.

Big Data · data governance

0 likes · 8 min read

DataFunSummit

Jan 24, 2023 · Databases

Practical Experience of Using Apache Doris for Real‑Time Data Warehouse at Tongcheng Data Science

This article details how Tongcheng Data Science built a real‑time analytical data warehouse using Apache Doris, covering business scenarios, the evolution from a legacy 1.0 architecture to a Doris‑based 2.0 design, deployment topology, development workflow, operational benefits, and future roadmap.

Apache Doris · Big Data · Data Architecture

0 likes · 10 min read

DataFunSummit

Jan 23, 2023 · Big Data

Design and Practice of the 58 Agile BI System (Starfire)

This article presents a comprehensive overview of the 58 Agile BI platform called Starfire, covering its background, technical architecture, core permission and query engine challenges, MPP cache acceleration, visualization resource library, developer services, and future development directions.

BI · Big Data · ClickHouse

0 likes · 13 min read

DataFunSummit

Jan 22, 2023 · Big Data

Applying Spark SQL at Ping An Insurance: Business Background, Deployment Choices, Migration Process, and Lessons Learned

This article details how Ping An Insurance migrated its offline Hive SQL workloads to Spark SQL, covering business background, deployment mode selection, migration workflow, typical challenges, optimization measures, and the resulting performance and resource utilization improvements.

Big Data · Cluster Migration · Deployment Modes

0 likes · 16 min read

DataFunSummit

Jan 21, 2023 · Big Data

Building and Evolving Data Management Systems: From IT to DT Era, Standards, Models, and Marketization

This article outlines the evolution of data management in the big‑data era, covering the history of the industry, key governance frameworks such as DMBOK, DCMM and DMM, the steps to construct a data‑management system, the requirements for a data‑factor market, and an introduction to the DataEasy company and its services.

Big Data · DCMM · DMBOK

0 likes · 15 min read

DataFunTalk

Jan 20, 2023 · Big Data

Introduction to Flink CDC: Incremental Snapshot Algorithm and Framework

This article introduces Flink CDC, explains its incremental snapshot algorithm and the 2.0 framework design, compares it with traditional CDC pipelines, discusses the core API and dialect concept, and outlines community growth and future plans, providing a comprehensive technical overview for data engineers.

Apache Flink · Big Data · Change Data Capture

0 likes · 13 min read

DataFunTalk

Jan 19, 2023 · Big Data

Tencent Alluxio: Accelerating the Next Generation of Big Data and AI

This article presents a comprehensive overview of Tencent's Alluxio project, covering the evolution of big‑data architecture, recent Alluxio research progress, typical deployment cases, and future work, while highlighting performance improvements, integration with cloud and AI workloads, and community contributions.

AI · Alluxio · Big Data

0 likes · 21 min read

NetEase Cloud Music Tech Team

Jan 17, 2023 · Big Data

How NetEase Cloud Music Cut Data Pipeline Delays by 60% with Full‑Link Baseline Governance

This case study details NetEase Cloud Music's full‑link baseline governance initiative, outlining the challenges of massive data pipelines, the metrics used to measure success, the three‑pronged action plan (infrastructure, task optimization, and standards), and the resulting improvements in availability, resource utilization, and monitoring accuracy.

Big Data · baseline governance · data ops

0 likes · 11 min read

Data Thinking Notes

Jan 16, 2023 · Big Data

How Kuaishou Scaled Its Big Data Platform to Handle EB‑Level Data and Millions of Daily Tasks

This article details Kuaishou's one‑stop big data development platform, covering its massive scale, low‑code and real‑time capabilities, multi‑layer architecture, SLA guarantees, diagnostic tools, and future plans to further lower development barriers and democratize data engineering.

Big Data · Data Platform · Low-Code Development

0 likes · 21 min read

Huolala Tech

Jan 16, 2023 · Big Data

How Leading Logistics Companies Master Data Governance for Cost and Stability

At the 2022 DataFun Summit, data governance experts from Huolala, Zhongtong, and SF Express shared comprehensive practices—including governance drivers, quality monitoring, model management, master data processes, platform architecture, cost control, and stability measures—illustrating how large logistics firms implement end‑to‑end data governance to boost efficiency, compliance, and business value.

Big Data · Cost Management · Data Quality

0 likes · 13 min read

JD Tech

Jan 13, 2023 · Big Data

UData: Solving the Last Mile of Data Usage – Architecture, Query Engine Design, and Federated Query Enhancements

This article introduces the UData platform, explains its data‑integration architecture, details the StarRocks‑based query engine workflow from SQL parsing to distributed execution, and describes recent optimizations such as computation push‑down, support for JSF/HTTP/ClickHouse external tables, and a proxy‑based federated query framework.

Big Data · Data Integration · Query Engine

0 likes · 20 min read

DataFunSummit

Jan 12, 2023 · Big Data

Data Governance Strategies: Systemic Engineering and Practical Cases from Leading Companies

This article explains the importance of data governance, distinguishes narrow and broad governance, outlines its systemic and selective nature, and presents eight practical case studies from companies like Tencent, NetEase, and MobTech, offering actionable strategies for high‑quality data across its lifecycle.

Big Data · Data Management · Enterprise Strategy

0 likes · 8 min read

DataFunSummit

Jan 12, 2023 · Big Data

Industrial IoT Data Collection Platform: Neuron v2.0 Architecture, Design, and Case Studies

This article presents a comprehensive overview of EMQ's Neuron industrial IoT data collection platform, detailing the lessons learned from version 1.x, the redesigned v2.0 architecture, core modules, plugin mechanisms, data‑tag management, eKuiper integration, and two real‑world case studies in oil‑field and smart‑factory environments.

Big Data · Data Collection · IoT

0 likes · 16 min read

Ctrip Technology

Jan 12, 2023 · Big Data

Evolution of Ctrip's Log System: From Elasticsearch to ClickHouse and Log 3.0

This article details the evolution of Ctrip's log infrastructure, describing the shift from fragmented departmental logging to a unified Elasticsearch-based platform, the migration to ClickHouse for cost‑effective, high‑performance storage, and the subsequent Log 3.0 redesign that leverages Kubernetes, sharding, and a unified query governance layer to handle petabyte‑scale data.

Big Data · ClickHouse · ETL

0 likes · 16 min read

Alibaba Cloud Big Data AI Platform

Jan 12, 2023 · Operations

What Is DataOps and How Can It Transform Your Data Management?

DataOps, the data‑centric counterpart of DevOps, combines agile principles, standardized tools, and cross‑team collaboration to manage the full data lifecycle—from integration and development to storage, governance, and service—enabling organizations to handle massive, diverse datasets efficiently, reduce silos, and turn data into actionable value.

Big Data · Data Integration · Data Management

0 likes · 15 min read

vivo Internet Technology

Jan 11, 2023 · Cloud Native

Practices of Distributed Message Middleware at vivo: From RocketMQ to Kafka and Pulsar

vivo’s Internet Storage team details how it operates RocketMQ for low‑latency online services and Kafka for massive big‑data pipelines, outlines resource isolation, traffic balancing, intelligent throttling, and governance practices, and describes its migration from RabbitMQ and planned shift from Kafka to cloud‑native Pulsar.

Big Data · Kafka · Message Middleware

0 likes · 22 min read

Data Thinking Notes

Jan 10, 2023 · Big Data

How Bilibili Built a Scalable Data Quality Platform for Billions of Events

This article describes Bilibili’s data quality platform, outlining its background, objectives, theoretical models, workflow stages (recording, checking, alerting), DSL for metrics, root‑cause analysis, scheduling strategies, heterogeneous source integration, rule coverage, intelligent monitoring, and future plans to achieve automated, real‑time, high‑reliability data assurance for massive daily workloads.

Big Data · Data Quality · Root Cause Analysis

0 likes · 21 min read

dbaplus Community

Jan 10, 2023 · Big Data

Choosing the Right OLAP Engine: Druid vs ClickHouse and Optimization Tips

This article introduces OLAP concepts, compares major OLAP solutions such as Druid, Kylin, Doris, and ClickHouse, outlines their features and suitable scenarios, and shares practical optimization techniques—including materialized views, caching, node tiering, and query tuning—to improve performance for high‑concurrency analytical workloads.

Big Data · ClickHouse · Data Warehouse

0 likes · 16 min read
