Tagged articles
3697 articles
DataFunTalk
Jun 2, 2023 · Big Data

Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of the Iceberg data lake, covering its OLAP architecture, reasons for a lake, Iceberg table format advantages over Hive, platform construction, extensive performance optimizations, and real‑world business use cases such as ad‑flow unification, log analysis, audit, and CDC pipelines.

Big Data · Data Lake · Flink
0 likes · 18 min read
WeChat Backend Team
Jun 1, 2023 · Big Data

How WeChat Boosted Flink Stability with TaskManager Recovery and Load Balancing

This article details WeChat’s Gemini‑2.0 real‑time streaming platform built on Flink, explaining two key stability enhancements: a TaskManager‑level partial failure recovery that avoids data loss during node crashes, and a load‑balancing scheduler that evenly distributes tasks across TaskManagers to improve resource utilization and reduce latency.

Big Data · Flink · Stream Processing
0 likes · 16 min read
DataFunTalk
May 30, 2023 · Big Data

Optimizing Chart Query Performance in YouShu BI: Data Query Principles, Intelligent Caching, Query Merging, and Diagnostics

This article explains the data query fundamentals of YouShu BI charts, introduces intelligent caching design, describes query merging and various optimization techniques—including partition filters, value acceleration, and SQL generation—and outlines performance diagnosis methods to improve BI chart responsiveness.

BI · Big Data · Chart Performance
0 likes · 16 min read
Architects Research Society
May 28, 2023 · Big Data

Understanding Azure Synapse Analytics: An Integrated Data Lake and Data Warehouse Platform

This article examines Microsoft Azure Synapse Analytics, explaining how its unified framework combines data lake and data warehouse capabilities through components such as Pipelines, Dedicated SQL pools, Spark pools, and Serverless SQL, and evaluates its advantages over separate tools like Snowflake and Databricks.

Azure Synapse · Big Data · Cloud Analytics
0 likes · 7 min read
Architects Research Society
May 28, 2023 · Big Data

Databricks vs Snowflake: Comparing Data Lake and Data Warehouse Cloud Solutions

This article compares the cloud‑based analytics platforms Databricks and Snowflake, examining how Databricks serves as a data‑lake processing tool with emerging warehouse features while Snowflake operates as a scalable data‑warehouse that incorporates lake‑style capabilities, and discusses their complementary use cases.

Big Data · Cloud Analytics · Databricks
0 likes · 7 min read
StarRocks
May 26, 2023 · Big Data

How SeaTunnel’s StarRocks Connector Enables High‑Performance Data Sync

This article explains SeaTunnel’s architecture and its StarRocks connector, detailing source and sink features such as field projection, predicate push‑down, parallel reading, state recovery, data type mapping, Stream Load writes, CDC support, configuration examples, and future roadmap for exactly‑once semantics.

Big Data · Connector · Data Integration
0 likes · 16 min read
DataFunTalk
May 23, 2023 · Big Data

Building a Millisecond‑Response Lakehouse Platform with Apache Iceberg: Architecture, Query Acceleration, and Intelligent Optimization

This article details Bilibili's technical practice of constructing a millisecond‑response lake‑warehouse platform using Apache Iceberg, covering the background challenges, unified architecture, multi‑dimensional sorting and indexing for query acceleration, the Magnus service for intelligent optimization, and the current production deployment and performance metrics.

Big Data · Cube · Iceberg
0 likes · 14 min read
DataFunTalk
May 22, 2023 · Big Data

Alibaba Cloud Data Lake: Unified Metadata and Storage Management Practices

This article explains Alibaba Cloud's data lake architecture, unified metadata services, storage management optimizations, and format handling techniques, illustrating how lakehouse concepts, multi‑engine support, and lifecycle policies enable efficient, secure, and cost‑effective big data processing in the cloud.

Big Data · Cloud Services · Data Lake
0 likes · 22 min read
Data Thinking Notes
May 21, 2023 · Information Security

Why Government Data Sharing Stalls and How a “Three‑Rights” Model Can Unlock It

The article analyzes why government data sharing often fails—citing legal, technical, security, and organizational hurdles—then outlines one‑to‑one and centralized sharing models, highlights four critical success factors, and proposes a “three‑rights” framework supported by blockchain to create trustworthy, sustainable inter‑departmental data exchange.

Big Data · Blockchain · Information Security
0 likes · 11 min read
Data Thinking Notes
May 17, 2023 · Big Data

Inside Wing Pay’s Scalable Big Data Platform: Architecture & Governance

This article details how Wing Pay built a comprehensive data development and governance platform, covering company background, business scenarios, goals, challenges, task development workflow, task types, SparkSQL editor features, double‑environment deployment, Airflow scheduling, DataX data bus, resource isolation, compute optimization, data quality monitoring, cloud‑native practices, future outlook, and a Q&A on data permissions and governance.

Airflow · Big Data · Data Platform
0 likes · 17 min read
DataFunTalk
May 17, 2023 · Databases

Evolution of 360 Commercial Real-Time Data Warehouse and Apache Doris Deployment

This article details the three‑stage evolution of 360's real‑time data warehouse—from Storm + Druid + MySQL to Flink + Druid + TiDB and finally to Flink + Apache Doris—explaining architectural pain points, the reasons for choosing Doris, and how the new system delivers sub‑second query latency, strong consistency, and simplified operations across advertising scenarios.

Apache Doris · Big Data · Data Consistency
0 likes · 17 min read
Tongcheng Travel Technology Center
May 17, 2023 · Databases

StarRocks Production Practice at Tongcheng Travel: Architecture, Use Cases, and Technical Evaluation

This article details Tongcheng Travel’s production deployment of the StarRocks OLAP database, covering background, business scenarios, technical evaluation against ClickHouse and Greenplum, implementation with Flink SQL, real‑time analytics, offline reporting, CDP use cases, performance optimizations, and future cloud‑native plans.

Big Data · Data Warehouse · Flink
0 likes · 12 min read
WeChat Backend Team
May 17, 2023 · Big Data

Boosting Real-Time Recommendations: Apache Pulsar Optimizations at WeChat

This article details how WeChat's Gemini‑2.0 big‑data platform leverages Apache Pulsar, outlining cloud‑native advantages, load‑balancing refinements, cache and SSD tuning, high‑availability safeguards, and cost‑saving strategies that together enable large‑scale, real‑time, deep‑learning recommendation workloads.

Apache Pulsar · Big Data · Message queue
0 likes · 17 min read
Laravel Tech Community
May 15, 2023 · Big Data

Introducing DataEase: An Easy‑to‑Use Open‑Source BI Tool with Rich Features and Quick Deployment

The article reviews DataEase, a Chinese open‑source business‑intelligence platform that offers a low‑learning‑curve interface, extensive data‑source support, built‑in template marketplace, and Docker‑based one‑command installation, making data visualization and dashboard creation accessible to a broad range of users.

BI · Big Data · Data visualization
0 likes · 7 min read
Data Thinking Notes
May 14, 2023 · Big Data

Why Data Governance Matters: Boosting Data Quality and Business Value

Data governance, the overarching framework for evaluating, guiding, and supervising an organization’s data lifecycle—from collection to utilization—ensures high data quality, compliance, and security, ultimately maximizing data value and supporting AI-driven initiatives, while distinguishing itself from data management and data control through a strategic, top‑down approach.

Big Data · Data Management · Data Quality
0 likes · 8 min read
DataFunTalk
May 11, 2023 · Big Data

Scaling ByteDance Feature Store to EB‑Level with Apache Iceberg: Architecture, Practices, and Future Roadmap

This article describes how ByteDance tackled petabyte‑scale feature storage by adopting Apache Iceberg, detailing the problem background, design choices, implementation of COW and MOR back‑fill strategies, performance optimizations, and future plans such as lake‑cold‑layering and materialized views.

Apache Iceberg · Big Data · Data Lake
0 likes · 16 min read
Amap Tech
May 11, 2023 · Artificial Intelligence

A 20‑Year Review of AI Infrastructure Milestones

Over the past two decades, AI infrastructure has evolved from early distributed storage and MapReduce to GPU programming, modern package managers, in‑memory processing, deep‑learning frameworks, parameter servers, AI compilers, synthetic data pipelines, open‑source model hubs, and today’s large‑scale Kubernetes‑based clusters, forming the essential foundation for every breakthrough.

AI Compilers · AI Infrastructure · Big Data
0 likes · 29 min read
Big Data Technology & Architecture
May 11, 2023 · Big Data

Remote State Backend for Flink: Design, Optimization, and Deployment with Taishan KV Store

This article describes the motivation, challenges, design, and performance optimizations of a remote state backend for Flink that leverages Bilibili's Taishan distributed KV store to achieve storage‑compute separation, lighter checkpoints, faster rescaling, and improved resource utilization in large‑scale streaming jobs.

Big Data · Flink · Performance Optimization
0 likes · 20 min read
DataFunTalk
May 9, 2023 · Databases

High‑Performance Inverted Index in Apache Doris for Log Data Storage and Analysis

This article explains how Apache Doris implements a high‑performance, column‑oriented inverted index to address the challenges of massive, real‑time log data storage and analysis, delivering dramatically higher write throughput, lower storage costs, and faster query performance than traditional Elasticsearch and Loki solutions.

Apache Doris · Big Data · Log Analytics
0 likes · 19 min read
Data Thinking Notes
May 7, 2023 · Big Data

How Financial Institutions Can Master Data‑Driven Transformation in 2024

This article examines two decades of data warehouse evolution in the financial sector, identifies persistent pain points such as platform lag, data quality, and low service efficiency, and proposes a cloud‑native, data‑centric framework—including a unified blueprint, three‑layer architecture, and six core capabilities—to accelerate enterprise‑wide data capability building and drive high‑quality digital growth.

Big Data · Data Platform · Digital Transformation
0 likes · 18 min read
DataFunSummit
May 7, 2023 · Big Data

Tencent SuperSQL: A Unified Adaptive Big Data Computing Platform

The article presents Tencent's SuperSQL platform, detailing the big‑data challenges of heterogeneous data sources and fragmented SQL experiences, describing its multi‑layer adaptive architecture, core technologies such as unified SQL parsing, cost‑based and history‑based optimization, federated computation, materialized views and security, and summarizing its performance gains, industry impact and community contributions.

Big Data · SQL optimization · SuperSQL
0 likes · 16 min read
DataFunTalk
May 5, 2023 · Big Data

NetEase Cloud Music Real-Time Data Warehouse Architecture and Low-Code Platform Practices

This article presents NetEase Cloud Music's real-time data warehouse architecture, covering its streaming and batch scenarios, layered design (ODS, CDM, ADS), technology stack choices, consistency mechanisms, the FastX low-code platform, and future development plans, offering a comprehensive technical overview for data engineers and architects.

Big Data · ClickHouse · Flink
0 likes · 18 min read
Big Data Technology & Architecture
May 5, 2023 · Big Data

Strategies for Handling Small Files in Hive and Spark

This article examines the causes and impacts of small file proliferation in Hive and Spark environments, and presents multiple mitigation techniques—including Spark 3 adaptive query execution, reducing reduce tasks, using DISTRIBUTE BY RAND(), post‑processing clean‑up, Hive and Spark configuration tweaks, and automated tooling—to improve performance and storage efficiency.

Big Data · Hive · Small Files
0 likes · 9 min read
Top Architect
May 4, 2023 · Big Data

Data Middle Platform: General Architecture and Core Components

The article explains the concept, benefits, and detailed modular architecture of a data middle platform, covering data storage, acquisition, processing, governance, security, and operation frameworks, and illustrates how enterprises can build and evolve such platforms to turn data into valuable services.

Big Data · Data Architecture · Data Integration
0 likes · 19 min read
DataFunTalk
May 3, 2023 · Big Data

Shuttle2.0: Enhancing Spark and Flink Shuffle with Distributed Sorting and Adaptive Broadcast

Shuttle2.0 extends OPPO's open‑source high‑availability Spark Remote Shuffle Service to support Flink, introduces a unified stream‑batch data model, pipelines shuffle with distributed sorting, and provides an Adaptive BroadcastJoin solution that dramatically improves performance and stability for large‑scale big‑data workloads.

Adaptive Broadcast · Big Data · Distributed Sorting
0 likes · 11 min read
Data Thinking Notes
Apr 25, 2023 · Operations

Why Data Quality Matters: A Practical Guide to Governance and Seven‑Dimensional Evaluation

This article explains why data quality is critical for businesses, outlines common data quality problems, their root causes, and presents a comprehensive governance framework—including monitoring rules, alerting, full‑link monitoring, and a seven‑dimensional evaluation model—to ensure high‑quality data delivery.

Big Data · Data Quality · Operations
0 likes · 12 min read
ITPUB
Apr 25, 2023 · Big Data

Top 8 Open‑Source ETL Tools for Data Migration and Integration

This article reviews eight widely used ETL and data‑migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, supported data sources, and typical usage scenarios to help practitioners choose the right solution.

Big Data · Data Integration · Data Migration
0 likes · 13 min read
Python Programming Learning Circle
Apr 23, 2023 · Big Data

Parallel Processing of Large CSV Files in Python with multiprocessing, joblib, and tqdm

This tutorial demonstrates how to accelerate processing of a 2.8‑million‑row CSV dataset by using Python's multiprocessing, joblib, and tqdm libraries, covering serial, parallel, and batch processing techniques, performance measurements, and best‑practice code examples for efficient large‑scale data handling.

Big Data · Data Engineering · Python
0 likes · 9 min read
Data Thinking Notes
Apr 19, 2023 · Big Data

How Bilibili Transformed Big Data Governance: From Reactive Storage Management to Proactive Multi‑Dimensional Control

This article details Bilibili's evolution of big data governance, describing the early data growth challenges, the launch of the "Wanglou" project, the development of asset metadata and governance indicator frameworks, storage cost reduction strategies, scoring models, and the shift from passive, single‑point fixes to proactive, multi‑dimensional governance across the organization.

Big Data · Bilibili · Cost Management
0 likes · 22 min read
Big Data Technology Architecture
Apr 19, 2023 · Big Data

Why the Big Data Era Is Over

The article argues that the era of big data is ending, showing that most organizations store only modest amounts of data, that storage costs outweigh benefits, and that modern cloud and analytics tools allow efficient processing without needing massive datasets.

Analytics · Big Data · Data Management
0 likes · 16 min read
Code Ape Tech Column
Apr 19, 2023 · Databases

Comparative Analysis of Elasticsearch and ClickHouse: Architecture, Query Performance, and Practical Benchmarks

This article compares Elasticsearch and ClickHouse by outlining their architectures, detailing deployment configurations, presenting benchmark queries and performance results, and concluding that ClickHouse generally outperforms Elasticsearch in many basic search and aggregation scenarios, while also noting each system's strengths and limitations.

Big Data · ClickHouse · Elasticsearch
0 likes · 13 min read
DataFunTalk
Apr 18, 2023 · Big Data

Real-time OLAP with Apache Doris: Architecture, Use Cases, and Optimization at Dingdong Maicai

This article details Dingdong Maicai's adoption of Apache Doris as a real‑time OLAP engine, covering business requirements, comparative evaluation with ClickHouse, system architecture, practical applications such as real‑time analytics, B‑end queries, tag systems, and performance‑boosting techniques like Colocate Join, bitmap, prefix and Bloom‑filter indexes, materialized views, and streamlined Broker Load workflows.

Apache Doris · Big Data · Data Warehouse
0 likes · 19 min read
Huolala Tech
Apr 17, 2023 · Big Data

How HuoLala Accelerated Ad‑hoc Queries with a Hybrid Offline Engine

This article describes how HuoLala identified slow ad‑hoc query performance in its Hive‑on‑Tez stack, surveyed comparable industry solutions, and built a multi‑engine hybrid offline service that dramatically improves query latency, outlines its architecture, key design decisions, production impact, and future roadmap.

Big Data · Distributed Systems · SQL Routing
0 likes · 12 min read
Big Data Technology & Architecture
Apr 17, 2023 · Big Data

Comprehensive Guide to Data Governance and Data Asset Management

This article presents a detailed roadmap for enterprise data governance, covering business digitization goals, data governance construction, typical digital platform architecture, core governance actions, implementation pathways, data asset inventory techniques, and real‑world case studies to illustrate practical execution.

Big Data · Data Asset Management · Data Quality
0 likes · 18 min read
Data Thinking Notes
Apr 16, 2023 · Big Data

Mastering Data Asset Management: From Inventory to Value Realization

This article outlines a complete data asset management lifecycle—starting with data inventory, moving through governance, classification, responsibility, permission, and security, and culminating in value realization via basic services, profiling, and algorithmic models—providing practical guidance for building a robust big‑data platform.

Big Data · Data Quality · Master Data
0 likes · 10 min read
Efficient Ops
Apr 16, 2023 · Operations

How Capability Platforms Empower Intelligent Container Cloud Operations

At the 20th GOPS Global Operations Conference, China Mobile Jiangsu showcased how its capability platform leverages AI, big data, and blockchain to automate health scoring and intelligent inspection, dramatically improving container‑cloud operational efficiency and paving the way for smarter, SRE‑driven DevOps practices.

Artificial Intelligence · Big Data · Capability Platform
0 likes · 5 min read
ITPUB
Apr 15, 2023 · Big Data

How Bilibili Turned Big Data Governance from Reactive to Proactive

This article details Bilibili's journey from a late‑started, reactive big‑data platform to a mature, proactive governance system that combines asset metadata, metric‑driven strategies, cost‑aware billing, and automated tooling to achieve massive storage savings and operational efficiency across the organization.

Big Data · Operational Efficiency · Storage Management
0 likes · 22 min read
JD Retail Technology
Apr 14, 2023 · Big Data

Understanding Data Skew and Its Mitigation in Hive and Spark

This article explains the concept of data skew, its symptoms such as slow tasks and OOM errors, and provides comprehensive mitigation techniques and configuration examples for Hive and Spark, including custom partitioning, map joins, adaptive execution, and key detection methods.

Adaptive Execution · Big Data · Data Skew
0 likes · 15 min read
DataFunSummit
Apr 14, 2023 · Big Data

An Overview of User Profiling: Definitions, Elements, Types, Dimensions, Applications, and Development Process

This article provides a comprehensive introduction to user profiling, covering its definition, key elements, classification types, common dimensions, practical application scenarios, lifecycle considerations, development workflow, and validation methods for building effective data‑driven user models.

Big Data · Data Analysis · Marketing
0 likes · 10 min read
DataFunTalk
Apr 13, 2023 · Big Data

Four Paradigms of StarRocks Lakehouse Integration and an Overview of StarRocks 3.0

This article explains why lake‑warehouse integration is needed, outlines its challenges, describes StarRocks' four integration paradigms—including query acceleration, layered modeling, real‑time warehouse‑lake fusion, and the cloud‑native 3.0 solution—and previews the upcoming StarRocks 3.0 release.

Big Data · Data Lake · Data Warehouse
0 likes · 18 min read
DataFunSummit
Apr 10, 2023 · Big Data

Spark on Kubernetes: Practices and Optimizations at Eggplant Technology

This article explains how Spark can be effectively deployed on Kubernetes, covering its advantages over traditional Hadoop clusters, the principles of Spark on K8s, dynamic allocation, reuse PVC enhancements, scheduling optimizations, and real‑world performance results from Eggplant Technology's production use.

Big Data · Scheduling · performance-optimization
0 likes · 21 min read
Big Data Technology & Architecture
Apr 10, 2023 · Big Data

Fine‑grained Configuration, State Migration, and Debugging Techniques for Flink SQL at Meituan

This article describes how Meituan addresses the rapid growth of Flink SQL jobs by introducing fine‑grained TTL and concurrency settings, an editable execution plan for state migration, pre‑analysis compatibility checks, and a bytecode‑instrumented debugging system that captures operator data and streams it to Kafka for analysis.

Big Data · Debugging · Flink
0 likes · 24 min read
DataFunTalk
Apr 10, 2023 · Big Data

Interview on Data Lakehouse: Current Applications, Challenges, and Evolution

This interview with NetEase data‑lake technology manager Ma Jin explains the distinction between data lakes and lakehouses, reviews the evolution of table‑format technologies such as Iceberg, Hudi and Delta Lake, evaluates feature maturity and performance trade‑offs, and discusses systematic versus non‑systematic adoption in enterprises.

Big Data · Data Lakehouse · Delta Lake
0 likes · 13 min read
Data Thinking Notes
Apr 9, 2023 · Big Data

Why Data Quality Is the Hidden Driver of Big Data Success

In the big‑data era, high‑quality data are essential for reliable analytics, and this article explains data‑quality concepts, key dimensions, analysis methods for missing values, outliers, inconsistencies and duplicates, as well as practical management practices to ensure data assets become a competitive advantage.

Big Data · Data Analysis · Data Management
0 likes · 15 min read
DataFunSummit
Apr 9, 2023 · Big Data

Expert Interview: Architecture and Trends of Big Data Platforms

This article presents a comprehensive interview with several big‑data platform experts, outlining the core components such as data integration, storage and computation, distributed scheduling, and query analysis, while also highlighting current challenges, best‑practice tools, and future trends in big‑data architecture.

Big Data · Data Integration · Distributed computing
0 likes · 10 min read
DataFunTalk
Apr 9, 2023 · Big Data

Building an Agile Business Intelligence Platform at Zhongyuan Bank: Architecture, Practices, and Future Outlook

The article details Zhongyuan Bank's end‑to‑end agile BI platform construction, covering business goals, a step‑by‑step development timeline, core architecture, eight key functionalities, low‑code data processing, real‑time streaming, visualization dashboards, intelligent Q&A, and future directions for platform intelligence and openness.

BI · Big Data · Data Platform
0 likes · 19 min read
ITPUB
Apr 8, 2023 · Big Data

How Bilibili Cut Data Pipeline Costs by 20% with Flink Real‑Time Incremental Computing

Facing daily terabyte‑scale data ingestion and costly duplicate reads in its ODS‑to‑DWD pipeline, Bilibili introduced a Flink‑based real‑time incremental computation and multi‑level partition shuffling, dramatically reducing read amplification, cutting resource usage by ~20%, improving latency to minutes, and enhancing scalability.

Big Data · Flink · Real-time Processing
0 likes · 19 min read
DataFunTalk
Apr 7, 2023 · Big Data

Introducing Apache Paimon: An Open‑Source Streaming Lakehouse Storage Engine

Apache Paimon is an open‑source streaming data lake storage system that combines LSM‑based real‑time updates, open file formats, and deep integration with Flink, Spark, and Trino to deliver high‑throughput ingestion, low‑latency queries, and unified batch‑stream processing for modern big‑data workloads.

Apache Paimon · Big Data · Flink
0 likes · 7 min read
Data Thinking Notes
Apr 5, 2023 · Big Data

Mastering Data Governance: From Challenges to End‑to‑End Solutions

This article explores the key problems data governance aims to solve, outlines a comprehensive governance framework, and details practical implementation steps—including tool integration, metadata management, lake‑in and lake‑out processes, and governance policies—to achieve a closed‑loop, value‑driven data ecosystem.

Big Data · Data Lake · Data Quality
0 likes · 13 min read
Big Data Technology & Architecture
Apr 4, 2023 · Big Data

Understanding Flink’s Data Flow: Buffer Pools, Network Transfer, and Credit‑Based Flow Control

This article explains Flink’s internal data abstraction and transfer mechanisms, detailing how data moves between operators via network buffers, the role of ByteBuffer and NetworkBufferPool, the serialization process, Netty integration, and credit‑based flow control to handle backpressure.

Big Data · Credit-based Flow Control · Data Flow
0 likes · 10 min read
DataFunTalk
Apr 4, 2023 · Big Data

Upgrading Hangzhou Bank Consumer Finance Big Data Platform with Apache Doris 1.2: Architecture, Performance Gains, and Integration

This article details how Hangzhou Bank Consumer Finance modernized its big‑data platform by introducing Apache Doris 1.2, replacing the original Greenplum + CDH architecture, unifying data sources via Multi‑Catalog, achieving second‑level query latency, reducing storage and compute costs, and outlining the integration workflow with DolphinScheduler, SeaTunnel, and Spark.

Apache Doris · Big Data · Data Integration
0 likes · 20 min read
DataFunTalk
Apr 4, 2023 · Big Data

Compass: An Open‑Source Big Data Task Diagnosis Platform for DolphinScheduler, Airflow and Spark

Compass is an open‑source big‑data diagnostic platform developed by OPPO that provides non‑intrusive, real‑time monitoring and root‑cause analysis for offline and streaming tasks on schedulers such as DolphinScheduler and Airflow, covering workflow‑level failures, Spark engine anomalies, resource usage, and offering one‑click reports and extensible rule‑based diagnostics.

Big Data · DolphinScheduler · Spark
0 likes · 13 min read
Bilibili Tech
Apr 4, 2023 · Big Data

How Bilibili’s Flink‑Based Real‑Time Incremental Pipeline Cuts Costs and Boosts Latency

This article details Bilibili’s migration from a Spark‑based offline ODS‑to‑DWD sharding process to a Flink real‑time incremental pipeline, explaining the background challenges, the design of multi‑level partitioning, small‑file optimizations, stability enhancements, and the measurable performance gains achieved.

Big Data · Data Warehouse · Flink
0 likes · 19 min read
DataFunSummit
Apr 3, 2023 · Big Data

Evolution and Architecture of Data Lineage in Volcano Engine DataLeap

This article outlines the background, development stages, architectural evolution, key features such as incremental updates and quality metrics, and future directions of the data lineage capability within Volcano Engine's DataLeap big‑data governance platform.

Big Data · DataLeap · Metadata
0 likes · 18 min read
dbaplus Community
Apr 2, 2023 · Big Data

Unlock Faster ODPS SQL: Proven UNION, COUNT DISTINCT, and Join Optimizations

This article walks through common ODPS SQL scenarios—union, count distinct, large‑table joins, mapjoin, and predicate placement—explains why naïve implementations can be inefficient, shows how to read and interpret execution plans, and provides concrete rewritten queries that dramatically improve performance and resource usage.

Big Data · COUNT DISTINCT · MapJoin
0 likes · 17 min read
DataFunSummit
Mar 31, 2023 · Big Data

Data Governance Practices and Implementation at DataCake

The article outlines DataCake's data governance journey, describing the challenges of data silos and cost inefficiencies, the strategic thinking behind a unified metadata platform, the implementation of governance tools, cost analysis modules, and asset inventory, and concludes with results, future plans, and a Q&A session.

Big Data · Operational Efficiency · Cost Analysis
0 likes · 14 min read
HomeTech
Mar 31, 2023 · Artificial Intelligence

Digital Transformation of Used‑Car Buying: Integrated Data, AI Valuation, and VR Visualization

The article describes how a comprehensive digital platform combines structured, semi‑structured, and panoramic data with machine‑learning valuation models, natural‑language processing, and VR technology to make used‑car condition information transparent, improve estimation accuracy, and enhance user decision‑making in the Chinese second‑hand car market.

AI Valuation · Big Data · Data Integration
0 likes · 15 min read
Big Data Technology & Architecture
Mar 30, 2023 · Big Data

Apache Paimon (Incubating): A Streaming Lakehouse Storage Project Overview

Apache Paimon, newly incubated by the Apache Software Foundation, combines Flink's real‑time streaming capabilities with open lakehouse storage formats, offering high‑throughput, low‑latency data ingestion, partial‑update merges, and seamless integration with engines like Flink, Spark, and Trino for unified batch and streaming analytics.

Apache Paimon · Big Data · Data Lake
0 likes · 7 min read
ITPUB
Mar 28, 2023 · Big Data

How We Turned a Hive Data Warehouse into a Real‑Time Lakehouse with Apache Hudi

This article details the migration from a traditional Hive‑based data warehouse to a lakehouse architecture using Apache Hudi, covering the original Lambda setup, its pain points, lake‑vs‑warehouse differences, Hudi features, integration challenges, practical solutions, and future roadmap.

Apache Hudi · Big Data · Data Warehouse
0 likes · 11 min read
DataFunTalk
Mar 28, 2023 · Big Data

Big Data Challenges and Serverless Data Solutions: Insights from an AWS Data Architect

The article examines the evolution of big‑data technologies, outlines the operational, cost and security challenges enterprises face, and presents serverless data—particularly AWS’s cloud‑native services—as a scalable, low‑cost solution that eliminates maintenance while enabling real‑time processing and advanced analytics.

Big Data · Cloud Computing · Serverless
0 likes · 16 min read
Baidu Geek Talk
Mar 27, 2023 · Big Data

Precise Watermark Design and Implementation in Baidu's Unified Streaming-Batch Data Warehouse

The article details Baidu's precise watermark design for its unified streaming‑batch data warehouse, describing how a centralized watermark server and client ensure end‑to‑end data completeness, align real‑time and batch windows with 99.9‑99.99% precision, and support accurate anti‑fraud calculations within the broader big‑data ecosystem.

Apache Flink · Baidu · Big Data
0 likes · 14 min read
macrozheng
Mar 27, 2023 · Big Data

Top 8 Open-Source ETL Tools for Efficient Data Migration

This guide reviews eight popular ETL and data migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, and use cases to help engineers choose the right solution for reliable data integration.

Big Data · Data Integration · Data Migration
0 likes · 14 min read
Data Thinking Notes
Mar 26, 2023 · Big Data

Why Data Governance Is the Key to Unlocking Your Data’s True Value

This article explains how effective data governance transforms raw data into a trusted enterprise asset, outlines common pitfalls such as backward and passive governance, and presents a structured, four‑phase approach—including organizational setup, standards, platform selection, and continuous operations—to successfully implement data governance at scale.

Big Data · Data Management · Data Quality
0 likes · 10 min read
ITPUB
Mar 25, 2023 · Big Data

Mastering Efficient SQL in ODPS: Union, Count‑Distinct, and Join Optimizations

This article walks through common SQL development scenarios on ODPS, examining why naïve UNION and COUNT DISTINCT can be slow, how to rewrite queries with GROUP BY, UNION ALL, JSON aggregation, and map‑join techniques, and shows the resulting execution‑plan improvements with concrete code and performance numbers.

Big Data · CountDistinct · MapJoin
0 likes · 17 min read
Su San Talks Tech
Mar 24, 2023 · Big Data

Top 8 Open-Source ETL Tools You Should Know for Efficient Data Migration

Explore a comprehensive overview of eight popular ETL and data migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their features, architectures, and use cases to help you choose the right solution for efficient data integration.

Big Data · Data Integration · Data Migration
0 likes · 13 min read
Volcano Engine Developer Services
Mar 22, 2023 · Fundamentals

How ByteDance Scales Data Governance: Challenges, Distributed Solutions, and Best Practices

This article examines ByteDance's data governance journey, outlining business, organizational, and cultural challenges, the six-stage evolution framework, real‑world case studies, and the shift from centralized to distributed autonomous governance to improve quality, security, cost, and team efficiency.

Big Data · Data Quality · Operations
0 likes · 18 min read
DataFunTalk
Mar 21, 2023 · Databases

Design and Technical Details of Apache Doris for Lakehouse Architecture

This article explains how Apache Doris extends its real‑time OLAP capabilities to support Lakehouse architectures, covering unified metadata, query acceleration, elastic compute, performance benchmarks, and future roadmap for richer data‑source integration and resource isolation.

Apache Doris · Big Data · Data Warehouse
0 likes · 20 min read
Data Thinking Notes
Mar 19, 2023 · Big Data

Why Data Quality Is the Key to Successful Big Data Initiatives

The article explains that while big data aims to boost organizational insight and innovation, its true value depends on high data quality, outlines industry standards, identifies technical, business, and management causes of poor quality, and proposes a three‑phase strategy of prevention, monitoring, and post‑improvement to ensure reliable data for decision‑making.

Big Data · Data Quality · Standards
0 likes · 21 min read
DataFunSummit
Mar 16, 2023 · Artificial Intelligence

Construction of Real‑World Medical Knowledge Graphs and Clinical Event Graphs

The article describes how YiduCloud builds real‑world medical knowledge graphs and clinical event graphs from heterogeneous hospital systems (EMR, HIS, LIS, RIS) through data aggregation, de‑identification, quality control, NLP‑driven entity extraction, standardisation, graph construction, cleaning, and embedding, and how the resulting graphs support AI‑powered applications such as decision support, intelligent diagnosis, automated medical‑record generation, and patient recruitment.

AI · Big Data · Medical Knowledge Graph
0 likes · 21 min read
Alibaba Cloud Developer
Mar 16, 2023 · Big Data

How SLS’s Schema‑on‑Read Scanning Boosts Log Analytics Flexibility and Cuts Costs

This article explains the motivation, design, and implementation of Alibaba Cloud's SLS Schema‑on‑Read scanning mode, showing how it enables SQL analysis on raw log data without pre‑built indexes, improves flexibility for evolving schemas, and reduces storage and index costs in various log‑analysis scenarios.

Big Data · Columnar Storage · Log Analytics
0 likes · 27 min read
Bilibili Tech
Mar 14, 2023 · Big Data

Bilibili HDFS Erasure Coding Strategy and Implementation

Bilibili reduced petabyte‑scale storage costs by back‑porting erasure‑coding patches to its HDFS 2.8.4 cluster, deploying a parallel EC‑enabled cluster, adding a data‑proxy service, intelligent routing and block‑checking, and automating cold‑data migration, while noting write overhead and planning native acceleration.

Big Data · Data Reliability · Distributed Systems
0 likes · 14 min read
ITPUB
Mar 13, 2023 · Big Data

What’s New in Apache Kyuubi 1.6.0? Server, Client, and Engine Enhancements

Apache Kyuubi 1.6.0 introduces major server‑side upgrades such as batch JAR task submission with RESTful APIs and a metadata store for HA, client‑side improvements including a unified JDBC driver and enhanced Beeline, plus mature Spark, Flink, Trino, and Hive engine plugins, while outlining the community’s roadmap.

Big Data · Engine Plugins · Flink
0 likes · 13 min read
Alibaba Cloud Big Data AI Platform
Mar 13, 2023 · Big Data

Unlocking Big Data with Alibaba Cloud’s Native Data Lake Solution

Alibaba Cloud’s cloud‑native data lake analysis solution combines fully managed storage (OSS‑HDFS), a one‑stop lake management platform (Data Lake Formation), and multimodal compute capabilities, delivering high performance, massive scalability, and low cost for big‑data and AI workloads across offline, real‑time, and lake‑house scenarios.

Analytics · Big Data · Data Lake
0 likes · 11 min read
Data Thinking Notes
Mar 12, 2023 · Big Data

Why Data Middle Platforms Are Evolving: New Trends in Data Governance and DataOps

The article examines how China's data middle platform concept is reshaping enterprise data strategy, highlighting a shift toward value‑driven adoption, the intertwined relationship with data governance, and emerging trends such as fine‑grained business governance, full‑link monitoring, integrated platforms, and DataOps.

Big Data · Data Middle Platform · DataOps
0 likes · 9 min read
DataFunTalk
Mar 12, 2023 · Big Data

Apache Kyuubi 1.6.0 Feature Overview and Enhancements

The article provides a comprehensive walkthrough of Apache Kyuubi 1.6.0, detailing server‑side enhancements such as batch (JAR) task submission, metadata store and unified API/authentication, client‑side improvements to the built‑in JDBC driver and Beeline, as well as engine plugins for Spark, Flink, Trino and Hive, and concludes with the community’s roadmap and statistics.

Apache Kyuubi · Batch Processing · Big Data
0 likes · 12 min read
DataFunSummit
Mar 11, 2023 · Databases

Graph Database Storage and Knowledge Graph Practices – Forum Overview

The forum explores the rapid growth and complexity of knowledge graphs, addressing storage and computation challenges through expert talks on graph database storage, query languages, practical implementation, and large‑scale financial knowledge graph platforms, offering attendees deep technical insights and hands‑on guidance.

Big Data · Data Storage · Knowledge Graph
0 likes · 8 min read
DataFunSummit
Mar 9, 2023 · Big Data

Designing Efficient and Agile Real-Time Big Data Analytics Platforms for Enterprises

The article explains how enterprises can build a comprehensive big data analytics platform—covering data collection, storage, computation, and decision layers—by clarifying business scenarios, choosing appropriate on‑premise or cloud deployment, selecting suitable architectures such as Lambda/Kappa, and addressing component choices and emerging technical trends.

Big Data · Data Architecture · Real-time Analytics
0 likes · 9 min read