Tagged articles
3697 articles
Baidu Geek Talk
Apr 10, 2024 · Big Data

TDA: A One‑Stop Self‑Service BI Platform – Architecture, Challenges, and Solutions

The article presents Turing Data Analysis (TDA), a self‑service BI platform that replaces fragile traditional pipelines with a unified DWD‑based data model, drag‑and‑drop analytics, multi‑engine query optimization and caching, delivering sub‑10‑second queries on billions of rows, fine‑grained permissions, and rapid dashboard creation, while reporting significant usage growth and outlining AI‑driven future enhancements.

BI · Big Data · Data Platform
0 likes · 15 min read
Data Thinking Notes
Apr 9, 2024 · Big Data

What Is a Data Middle Platform and Why It’s Essential for Modern Enterprises

Data middle platforms transform raw enterprise data into reusable assets by integrating collection, storage, processing, governance, and service layers, enabling faster deployment, consistent metrics, improved data quality, and business value across digital transformation, while addressing challenges like siloed data, low efficiency, and inconsistent standards.

Big Data · Data Integration · Data Platform
0 likes · 23 min read
DataFunTalk
Apr 9, 2024 · Big Data

Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform

This article shares Xiaomi's real‑world challenges and solutions when building a new Spark 3.1‑based data platform, covering Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and future roadmap for vectorized execution.

Apache Spark · Big Data · Data Migration
0 likes · 14 min read
Baidu Geek Talk
Apr 8, 2024 · Big Data

How RTS Platform Turns Real‑Time Data Streams into Reliable Business Value

This article analyzes the challenges of commercial real‑time data processing—such as stability, multi‑stage computation, and frequent schema changes—and explains how the RTS platform provides end‑to‑end managed solutions, auto schema handling, primary‑secondary redundancy, experiment‑first deployment, and metadata generation to unlock high‑velocity data value for advertising operations.

Big Data · Cloud Computing · RTS platform
0 likes · 17 min read
DataFunSummit
Apr 7, 2024 · Big Data

Li Auto’s Flink on Kubernetes Data Integration Practice

This article presents Li Auto’s end‑to‑end data integration journey, detailing the evolution of its data platform, the challenges of heterogeneous sources, and how a unified Flink‑on‑K8s solution with cloud‑native architecture, operator management, monitoring, and checkpointing addresses batch‑stream convergence and future scalability.

Batch processing · Big Data · Data Integration
0 likes · 12 min read
Rare Earth Juejin Tech Community
Apr 6, 2024 · Big Data

Deep Dive into Kafka’s Underlying Mechanisms: Sequential Writes, Sparse Indexing, Segment Storage, and Replication

This article explores Apache Kafka’s core storage architecture, explaining how sequential append‑only writes, sparse indexing, segmented log files, and a leader‑based replication mechanism together enable high‑throughput, reliable, and scalable event streaming for massive data workloads.

Big Data · Event Streaming · Kafka
0 likes · 11 min read
DataFunSummit
Apr 4, 2024 · Big Data

Design Principles and Future Directions of DataOps

This article outlines the core capabilities of data-driven development, the background and architecture of DataOps, its research challenges and focus areas, and explores future directions such as data virtualization, platform governance, and data value assessment, providing a comprehensive overview of DataOps practices.

Big Data · Data Platform
0 likes · 8 min read
Practical DevOps Architecture
Apr 4, 2024 · Databases

ClickHouse Training Course Overview and Curriculum

This article introduces a comprehensive ClickHouse training program that covers fundamental concepts, architecture, installation, distributed cluster design, data import, performance tuning, and includes a detailed list of 33 video modules and additional recommended reading resources for large‑scale data analytics.

Big Data · ClickHouse · Columnar Database
0 likes · 4 min read
DataFunTalk
Apr 3, 2024 · Artificial Intelligence

DataFunCon 2024 Shanghai: AI, Big Data, Cloud and Industry Forum Program

DataFunCon 2024 Shanghai brings together leading experts from AI, big data, cloud computing, and industry sectors to discuss cutting‑edge technologies, large‑model applications, intelligent operations, and digital transformation across automotive, healthcare, finance, retail, and entertainment.

Big Data · Cloud Computing · Digital Transformation
0 likes · 69 min read
DataFunSummit
Apr 1, 2024 · Artificial Intelligence

DataFunCon 2024 Shanghai Conference Program Overview

The DataFunCon 2024 Shanghai conference brings together leading experts from academia and industry to discuss cutting‑edge topics such as large language models, AI‑driven operations, data governance, digital transformation, and emerging applications across automotive, finance, retail, and entertainment sectors.

AI · Big Data · Cloud Computing
0 likes · 69 min read
DataFunSummit
Apr 1, 2024 · Big Data

DataOps at ByteDance: Challenges, Implementation, and Future Outlook

This article examines ByteDance's DataOps journey, detailing the data‑engineering challenges faced, the concrete solutions and productization through the DataLeap platform, the metrics and best‑practice framework adopted, and the future directions involving AI‑assisted development and open‑source collaboration.

Big Data · Data Platform · Metrics
0 likes · 20 min read
ITPUB
Mar 29, 2024 · Databases

How to Import 1 Billion Records into MySQL at Lightning Speed

This guide explains how to efficiently load one billion 1‑KB log entries from HDFS or S3 into MySQL by analyzing B‑tree limits, using batch inserts, choosing the right storage engine, sharding tables, optimizing file reading, and coordinating tasks with Redis, Redisson, and Zookeeper.

Batch Insert · Big Data · Distributed Tasks
0 likes · 19 min read
DataFunSummit
Mar 29, 2024 · Artificial Intelligence

DataFunCon2024 Shanghai: AI, Big Data, Cloud and Industry Innovation Conference

DataFunCon2024 Shanghai brings together leading experts from AI, big data, cloud computing and various industries such as automotive, biotech, retail, finance and entertainment to share cutting‑edge research, practical case studies and future trends through a series of keynote speeches, panels and technical sessions.

AI · Big Data · Cloud Computing
0 likes · 70 min read
Didi Tech
Mar 28, 2024 · Big Data

How We Unified Real‑Time and Batch Features with StarRocks in Financial Risk Control

This article analyzes the challenges of building real‑time and batch risk‑control features, compares Lambda and Kappa architectures, evaluates storage‑unified and compute‑unified solutions, and details how StarRocks was selected, validated, and deployed to achieve high‑performance, low‑latency feature serving in a financial context.

Big Data · Data Warehouse · Feature Engineering
0 likes · 19 min read
Data Thinking Notes
Mar 27, 2024 · Big Data

How to Build and Optimize a Scalable User Profiling Platform from Scratch

This article explains the value of user profiling platforms, outlines their core functions, presents a layered architecture with open‑source options, and details engineering optimizations—from wide‑table design to BitMap caching and task‑mode execution—while also discussing current industry trends.

Big Data · Data Engineering · Performance Optimization
0 likes · 18 min read
DataFunTalk
Mar 27, 2024 · Big Data

Data Collection Quality Review: From Compliance to Reasonableness and Toolchain Overview

This article explores data collection governance by distinguishing data quality compliance from reasonableness, introduces a comprehensive quality review tool suite—including visual inspection, intelligent judgment, and self‑diagnosis—detailing its architecture, key techniques, and practical case studies for ensuring reliable data metrics.

Big Data · Intelligent Judgment · Quality Review Tools
0 likes · 19 min read
DataFunTalk
Mar 26, 2024 · Big Data

Building an Enterprise Real-Time Data Warehouse with Hologres and Flink at Cao Cao Mobility

This article presents a comprehensive case study of Cao Cao Mobility's transition from a traditional Lambda architecture to an enterprise‑grade real‑time data warehouse built on Hologres and Flink, detailing business background, pain points, architectural design, performance optimizations, metadata management, and future development directions.

Big Data · Data Engineering · Flink
0 likes · 20 min read
StarRocks
Mar 26, 2024 · Big Data

How Replacing Spark with StarRocks Cut Data Refresh Time by 90% and Saved 99% Cost

The article details how the Xiaohongshu data warehouse team integrated StarRocks into their offline processing pipeline, replacing Spark for heavy Cube calculations, which reduced job execution from hours to minutes, cut resource consumption by over 90%, advanced daily data output by 1.5 hours, and lowered refresh cost by more than 99%.

Big Data · OLAP · Performance Optimization
0 likes · 18 min read
DataFunTalk
Mar 24, 2024 · Big Data

Didi's Big Data Asset Governance Practices: Hadoop and Elasticsearch Governance

This article details Didi's comprehensive big‑data asset governance platform, covering its architectural layers, Hadoop and Elasticsearch governance practices, health‑score models, lifecycle recommendations, and future plans for automated and intelligent data governance to reduce cost and manual effort.

Big Data · Elasticsearch · Hadoop
0 likes · 17 min read
DataFunSummit
Mar 20, 2024 · Big Data

Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's large‑scale evolution of Spark Shuffle to a cloud‑native architecture, describing background, stability and mixed‑resource scenarios, challenges such as CPU and I/O limits, custom ESS enhancements, shuffle throttling, spill‑split mechanisms, and the Cloud Shuffle Service with its push‑based design and performance gains.

Big Data · Kubernetes · Performance Optimization
0 likes · 21 min read
StarRocks
Mar 19, 2024 · Databases

How StarRocks Powers Data‑Driven Financial Marketing at Ping An Bank

This article explains how Ping An Bank transformed its retail finance model from product‑centric to customer‑centric using a five‑in‑one data‑driven approach, the KYC/KYP/KYATO methodology, and the StarRocks analytics platform to build the Smart Bank 3.0 architecture, CDP, and real‑time metric layers.

Big Data · Customer 360 · Financial Marketing
0 likes · 14 min read
Alipay Experience Technology
Mar 19, 2024 · Big Data

How Alipay Cut Merchant Bill Complexity by 60% Using a Five‑Step Method

This article details how Alipay's data engineering team applied Elon Musk's five‑step work method to completely refactor a decade‑old merchant billing system, reducing overall complexity by over 60%, improving timeliness by an hour, cutting storage and compute costs by a third, and dramatically lowering operational and maintenance burdens.

Big Data · Cost Reduction · Data Engineering
0 likes · 23 min read
DataFunTalk
Mar 19, 2024 · Big Data

High‑Performance Vehicle IoT Big Data Platform Solution Based on DolphinDB

This article presents a comprehensive vehicle‑IoT big‑data platform solution that outlines required capabilities, describes a DolphinDB‑based architecture, shares a real‑world case of 1.8 × 10⁸ writes per second, and provides step‑by‑step deployment and query scripts for rapid verification.

Big Data · Data Analytics · DolphinDB
0 likes · 18 min read
DataFunSummit
Mar 18, 2024 · Big Data

Scenario‑Based Data Governance Practices in the Securities Industry

This article presents a comprehensive, scenario-driven data governance practice at Guoxin Securities, covering the industry's pain points, a three‑layer governance framework, detailed implementations for data standards, metadata, data quality, data modeling, and data security, and outlines future directions for intelligent and measurable governance.

Big Data · Data Quality · Data Security
0 likes · 30 min read
DataFunTalk
Mar 16, 2024 · Big Data

Performance Optimization Practices for KwaiBI Big Data Analysis Platform

This article introduces KwaiBI, the internal data analysis product of Kuaishou, outlines its five major functional areas, details the performance challenges of large‑scale analytics, and presents a comprehensive set of optimization techniques—including cache warming, query rewriting, materialized acceleration, and the Bleem lake‑house engine—along with future directions and a brief Q&A.

Big Data · Data Analytics · KwaiBI
0 likes · 15 min read
Didi Tech
Mar 12, 2024 · Big Data

Understanding Flink Metrics System: Core Concepts, Elastic Design, and Practical Usage

The article explains Flink’s metrics architecture—core concepts, reporter interfaces, built‑in and custom metric types, elastic plugin design, and scheduled reporting—illustrated with a consumption‑latency example, and shows how Didi uses these metrics for real‑time UI curves, alerts, and intelligent task diagnosis.

Big Data · Flink · Metrics
0 likes · 11 min read
Open Source Linux
Mar 11, 2024 · Big Data

Step‑by‑Step Guide to Deploying Flink on Standalone, Yarn, and Kubernetes

This tutorial explains how to install and configure Apache Flink in three deployment modes—Standalone, Hadoop YARN, and Kubernetes—covering node preparation, configuration files, package distribution, job submission, and monitoring through the Flink Web UI, with full command‑line examples and code snippets.

Big Data · Flink · Kubernetes
0 likes · 12 min read
DataFunSummit
Mar 8, 2024 · Databases

Ant TuGraph Computing Engine Architecture and Applications

An introduction to Ant TuGraph’s open‑source graph computing engine, a project led by Fang Zhihong, covering its development history, architectural design, technical principles, integrated stream‑batch‑graph processing capabilities, real‑world large‑scale graph use cases, and future roadmap.

Big Data · Distributed Systems · TuGraph
0 likes · 2 min read
Huolala Tech
Mar 7, 2024 · Big Data

Integrating Apache Tez with Remote Shuffle Service via Uniffle: HuoLala’s Experience

Facing exploding data volumes and rising cluster costs, HuoLala integrated Apache Tez with a Remote Shuffle Service built on Apache Uniffle, redesigning the Tez client to operate without source modifications. The article details the architecture, implementation challenges, testing, stability measures, and future plans to improve big‑data shuffle performance and cost efficiency.

Apache Tez · Big Data · Data Engineering
0 likes · 14 min read
Sohu Tech Products
Mar 6, 2024 · Big Data

Building Data Systems with Apache Arrow: Architecture, Memory Format, and Execution

The article explains how Apache Arrow’s columnar, cross‑language in‑memory format enables high‑performance, interoperable data systems—replacing traditional row‑oriented databases—by supporting dynamic schemas, zero‑copy data exchange, efficient indexing, Acero‑based query execution, and Flight/ADBC connectivity, while offering practical guidance and highlighting challenges.

Apache Arrow · Big Data · Columnar Storage
0 likes · 20 min read
Didi Tech
Mar 5, 2024 · Databases

Migrating Didi's Log Retrieval from Elasticsearch to ClickHouse: Architecture, Challenges, and Performance Optimizations

Didi replaced its Elasticsearch‑based log platform with ClickHouse, redesigning architecture into isolated Log and Trace clusters, using hourly‑partitioned MergeTree tables and aggregating views to handle petabyte‑scale writes, diverse low‑latency queries, and high QPS, achieving over 400 nodes, 40 GB/s throughput, 30 % cost savings and four‑fold query latency reduction.

Big Data · ClickHouse · Data Storage
0 likes · 15 min read
DataFunTalk
Mar 5, 2024 · Big Data

Changan Automotive Big Data Platform: Challenges and Practices in Connected Vehicle Scenarios

This article outlines the rapid growth of data in the smart automotive sector and details Changan's big data platform challenges—high cost, data accessibility, and operational complexity—and the practical migration from a Lambda to a unified Kappa architecture that delivers significant storage, compute, and maintenance efficiencies.

Big Data · Connected Vehicles · Data Platform
0 likes · 14 min read
DataFunTalk
Mar 4, 2024 · Big Data

Design and Implementation of a Lakehouse‑Integrated Data Platform for Financial Innovation by Shuxin Network

This article presents Shuxin Network's practical experience in building a cloud‑native, lakehouse‑integrated data platform for the financial sector, covering architecture evolution, challenges of domestic‑innovation (信创), the DataCyber solution, core components, deployment roadmap, and real‑world case studies.

Big Data · Data Platform · Financial Innovation
0 likes · 21 min read
DataFunSummit
Mar 2, 2024 · Big Data

OPPO's Application Distribution: Leveraging Big Data, AI, and Intelligent Computing for Cost and Efficiency

This article presents OPPO's practical use of algorithms, big‑data infrastructure, intelligent compute systems, and unified modeling to improve cost efficiency and performance across its application distribution platform, while outlining future plans for edge‑cloud collaboration and large‑model deployment.

Application Distribution · Artificial Intelligence · Big Data
0 likes · 14 min read
DataFunTalk
Mar 1, 2024 · Big Data

Understanding Data Fabric and Data Virtualization: Concepts, Practices, and Real‑World Case Study

This article explains the fundamentals of Data Fabric and data virtualization, highlights the limitations of traditional centralized data warehouses, describes the three‑layer virtualization architecture, and presents a detailed securities‑industry case study that demonstrates cost, efficiency, and compliance benefits.

Big Data · Data Fabric · Data Integration
0 likes · 17 min read
DataFunSummit
Feb 29, 2024 · Big Data

Trino at Xiaomi: Architecture, Practices, and Future Plans

This article details Xiaomi’s practical deployment of Trino, covering its architectural role, core and extended capabilities, performance comparisons, integration with Iceberg and Spark, operational enhancements, multi‑cluster and ad‑hoc query scenarios, future cloud‑storage plans, and a Q&A session.

Big Data · Iceberg · OLAP
0 likes · 20 min read
Sohu Tech Products
Feb 28, 2024 · Big Data

How SimHash and Cosine Similarity Accelerate Large‑Scale Text Deduplication

This article explains why massive news feeds need efficient deduplication, compares cosine similarity and SimHash for measuring text similarity, walks through a step‑by‑step implementation with Java code, and shows how a space‑for‑time indexing strategy can reduce duplicate‑detection complexity from O(n²) to near O(1).

Big Data · Cosine Similarity · Near-Duplicate Detection
0 likes · 14 min read
Baidu Tech Salon
Feb 28, 2024 · Big Data

Design, Optimization, and Practice of Baidu's Fusion Compute Engine for Data Warehouse

Baidu’s Fusion Compute Engine, built on Spark with a one‑layer wide‑table model, combines data‑skipping, push‑down, code‑generation, vectorization and extensive tuning to cut ad‑hoc query latency to seconds, shrink storage by ~30 %, and accelerate ETL workloads while maintaining stability for massive data‑warehouse workloads.

Baidu · Big Data · Fusion Compute Engine
0 likes · 10 min read
Baidu Geek Talk
Feb 28, 2024 · Big Data

How Baidu’s Fusion Compute Engine Cuts Query Time to Seconds on Petabyte‑Scale Data

This article analyzes Baidu's fusion compute engine for its data warehouse, detailing its architecture, optimization techniques such as data skipping, Parquet column indexing, ProjectLimit and CodeGen, and demonstrates how these innovations reduce query latency to seconds while cutting storage costs by about 30% on multi‑petabyte workloads.

Baidu · Big Data · Data Warehouse
0 likes · 12 min read
DataFunTalk
Feb 28, 2024 · Big Data

Building a Data System with Apache Arrow: Design, Modeling, and Execution

This article explains why new data systems are needed, introduces Apache Arrow and its columnar in‑memory format, describes read‑time modeling and dynamic schema handling, and shows how Arrow can be used to build a complete data processing pipeline with indexing, SQL planning, and zero‑copy data exchange.

Apache Arrow · Big Data · Columnar Storage
0 likes · 20 min read
Didi Tech
Feb 27, 2024 · Big Data

Real-time Precise Deduplication Using StarRocks Materialized Views at Didi

Didi leverages StarRocks materialized views with a global dictionary and bitmap aggregation to perform real‑time, high‑cardinality precise deduplication, automatically rewriting queries and refreshing views, cutting query latency by ~80%, reducing resource use ~95%, and boosting concurrent QPS up to 100‑fold, while planning further automation and bitmap optimizations.

Big Data · Materialized Views · OLAP
0 likes · 19 min read
StarRocks
Feb 27, 2024 · Databases

How StarRocks Materialized Views Enable High‑Concurrency Precise Deduplication

StarRocks’ materialized view feature lets Didi replace costly fuzzy deduplication with precise, high‑concurrency deduplication for real‑time dashboards, using global dictionary mapping, layered ODS/DWD/ADS views, synchronous and asynchronous refreshes, and transparent query rewrite to cut query latency by 80% and boost QPS dramatically.

Big Data · Materialized Views · OLAP
0 likes · 20 min read
DataFunTalk
Feb 27, 2024 · Big Data

Best Practices of Cloud‑Native OLAP Architecture and Logistics Warning at Jushuitan

This article presents Jushuitan's cloud‑native OLAP architecture, detailing its evolution, current big‑data stack—including DataWorks, MaxCompute, Flink, Hologres, and Aerospike—along with logistics warning workflows, rule‑matching mechanisms, real‑time processing challenges, and future scalability plans.

Big Data · Data Warehouse · Flink
0 likes · 20 min read
DataFunSummit
Feb 26, 2024 · Big Data

Building a New Lakehouse Analytics Paradigm with StarRocks and Paimon

This article introduces a new lakehouse analytics paradigm by combining StarRocks and Paimon, covering the evolution of data lake technologies, key integration scenarios, core technical mechanisms such as JNI connectors, materialized views, and future roadmap for enhanced lakehouse capabilities.

Analytics · Big Data · Data Lake
0 likes · 16 min read
DataFunTalk
Feb 25, 2024 · Big Data

Implementation Practice of Bilibili's Tag System: Evolution, Architecture, and Future Plans

This article details Bilibili's tag system from its 2021 inception through successive redesigns, describing the three‑layer architecture, data flow pipelines using Hive, Iceberg, Spark and ClickHouse, crowd selection DSL, online services with Redis, performance optimizations, and upcoming governance and quality initiatives.

Big Data · ClickHouse · Data Engineering
0 likes · 12 min read
NewBeeNLP
Feb 25, 2024 · Interview Experience

Comprehensive Interview Question Cheat Sheet for Top Tech Companies

This article compiles a detailed list of interview question topics from leading tech firms—including search, algorithm engineering, NLP, multimodal LLMs, advertising, recommendation, risk control, and big‑data domains—covering algorithms, system design, machine‑learning concepts, and practical coding challenges.

Algorithms · Big Data · Interview Questions
0 likes · 10 min read
DataFunTalk
Feb 22, 2024 · Big Data

Flink on Kubernetes: Kuaishou’s Practice, Migration, and Future Refactoring

This article details Kuaishou’s five‑year evolution of Flink, covering its background, production refactoring to Kubernetes, migration practices, and future improvements, highlighting architecture layers, resource management, observability, and testing strategies for large‑scale stream processing.

Big Data · Flink · Kubernetes
0 likes · 12 min read
JavaEdge
Feb 20, 2024 · Big Data

Designing a Scalable Data Quality Center for Offline Big‑Data Pipelines

This article describes the design and implementation of a platform‑wide Data Quality Center for offline big‑data pipelines, covering research of existing solutions, design goals, system architecture based on DolphinScheduler, rule definition language, binding and execution mechanisms, and future enhancements such as lineage monitoring and real‑time checks.

Apache Griffin · Big Data · Data Quality
0 likes · 18 min read
DataFunSummit
Feb 20, 2024 · Big Data

BitSail Open‑Source Data Integration Engine: Architecture, New Features, CDC Solutions and Future Outlook

This article introduces ByteDance's open‑source data integration engine BitSail, covering its background, layered architecture, recent feature enhancements, automated testing framework, CDC‑based full‑library synchronization solutions, and future development plans for connectors and real‑time data consistency.

Big Data · CDC · Data Integration
0 likes · 12 min read
DataFunSummit
Feb 19, 2024 · Big Data

Yipay Data Warehouse Construction and Data Governance Practices

This presentation by senior data warehouse engineer Huang Luo details Yipay's end‑to‑end data warehouse build, covering background challenges, governance framework, platform development, layered architecture, naming standards, monitoring, and future plans, offering practical insights for data engineers, architects, and business stakeholders.

Big Data · Data Architecture · Data Quality
0 likes · 14 min read
Big Data Technology & Architecture
Feb 18, 2024 · Big Data

Understanding Apache Paimon Table Modes and Their Use Cases

Apache Paimon provides multiple table modes—including primary key tables with fixed or dynamic buckets, Append scalable and queue tables—each with specific configurations, compaction behavior, and suitable scenarios, and the article explains their structures, performance considerations, and how to use them with Flink.

Apache Paimon · Append Table · Big Data
0 likes · 12 min read
DataFunTalk
Feb 17, 2024 · Big Data

JD Logistics One‑Stop Agile BI Solution: Architecture, Challenges, and Optimization

This article presents JD Logistics' one‑stop agile BI platform, detailing the complex data sources, rapid requirement changes, and Chinese‑style reporting challenges it addresses, while outlining the UData solution, product methodology, performance enhancements, and real‑world case studies that demonstrate significant efficiency gains.

Agile Analytics · BI · Big Data
0 likes · 26 min read
DataFunTalk
Feb 15, 2024 · Big Data

Data Quality Review: From Compliance to Reasonableness and Toolchain Overview

This article explores data collection governance by distinguishing compliance from reasonableness, introduces a comprehensive quality review tool system—including visual inspection, intelligent judgement, and self‑diagnosis—details key techniques such as comparison operators and sampling, and outlines a three‑layer architecture and future directions for data quality assurance.

Big Data · data governance · data sampling
0 likes · 18 min read
DataFunTalk
Feb 9, 2024 · Big Data

Alluxio’s Role in Lakehouse Architecture: Benefits, Challenges, and Real‑World Use Cases

This article explains how Alluxio enables lake‑warehouse integration by providing a data orchestration layer that caches data near compute, reduces storage‑compute separation costs, improves performance, and addresses challenges such as security, scalability, and multi‑cloud deployment, illustrated with several industry case studies.

AI · Alluxio · Big Data
0 likes · 16 min read
DataFunTalk
Feb 8, 2024 · Big Data

Design and Practice of Ant Group's Metric System

This talk by Ant Group’s senior technical expert Wang Gaohang details the definition, design, mechanism, productization, and future outlook of the company’s metric system, covering concept consensus, semantic layers, workflow, AI assistance, performance optimization, and practical case studies.

AI · Big Data · Data Platform
0 likes · 28 min read
DataFunSummit
Feb 7, 2024 · Big Data

Evolution of OLAP with Apache Doris at Xingyun Retail Credit

Facing rapid data growth, Xingyun Retail Credit moved from traditional OLTP systems to an Apache Doris‑based OLAP solution. The article covers how the data demands arose, the challenges of selecting an OLAP engine, the multi‑stage implementation, performance gains, data‑warehouse construction, and the future roadmap for scalable analytics.

Apache Doris · Big Data · Data Warehouse
0 likes · 17 min read
DataFunSummit
Feb 6, 2024 · Big Data

Exploring ByteDance's EB‑Scale HDFS: Architecture, Multi‑Datacenter Challenges, Tiered Storage, and Data Protection Practices

This article presents an in‑depth overview of ByteDance's EB‑scale HDFS, covering its new features, multi‑datacenter architecture, tiered storage implementation, data management services, capacity and fault‑tolerance strategies, as well as practical data‑protection mechanisms and related Q&A.

Big Data · Data Protection · HDFS
0 likes · 22 min read
Amap Tech
Feb 5, 2024 · Artificial Intelligence

Gaode Tech 2023 Highlights: 15 Popular Articles on AI, Data, Mapping, and Navigation Technologies

Gaode Technology’s 2023 roundup showcases fifteen of its most-read articles, spanning AI infrastructure evolution, cloud‑native data optimization, BEV‑based perception, real‑time crowdsourced mapping, ETA prediction, lane‑level navigation, AR HUD, architecture design, low‑code platforms, and high‑performance Android testing.

AI · Big Data · Data Engineering
0 likes · 9 min read
DataFunTalk
Feb 3, 2024 · Big Data

Alluxio: Introduction, Architecture, and Practical Experience for Big Data Construction

This article introduces Alluxio as an open‑source data orchestration layer, explains its architecture and core features such as unified namespace, caching strategies, and cloud‑native deployment, and shares practical experiences on using Alluxio to simplify data lakehouse construction, migration, and hot‑cold data separation in complex big‑data environments.

Alluxio · Big Data · Caching
0 likes · 13 min read
Sohu Tech Products
Jan 31, 2024 · Industry Insights

How Didi Scaled Real‑Time Dashboards with StarRocks Materialized Views

This article details Didi's evolution from a multi‑engine OLAP stack to a unified StarRocks solution, explains the design of global dictionaries and materialized views for real‑time dashboard acceleration, and shares performance results, challenges, and future optimization directions.

Big Data · Didi · Materialized Views
0 likes · 19 min read
Efficient Ops
Jan 31, 2024 · Databases

Why ClickHouse Beats Elasticsearch for High‑Performance Log Analytics

Facing data security and cost challenges in SaaS, the author evaluates ClickHouse versus Elasticsearch, highlighting ClickHouse’s superior write throughput, query speed, lower storage and CPU usage, and provides detailed deployment guides for ZooKeeper, Kafka, Filebeat, and ClickHouse to build a cost‑effective private analytics platform.

Big Data · ClickHouse · Database Deployment
0 likes · 8 min read
Big Data Technology & Architecture
Jan 31, 2024 · Big Data

2023 Data Development Trends and Outlook for 2024

The article reviews how data development accelerated in 2023—with mature offline computing, rapid adoption of real‑time and lake‑warehouse solutions, and a clearer technical layering—while offering practical insights and future directions for professionals entering 2024.

Big Data · Data Engineering · Real-Time Computing
0 likes · 8 min read
DataFunSummit
Jan 31, 2024 · Big Data

iQIYI Magic Mirror: Evolution of a Big Data Analysis Platform

iQIYI's Magic Mirror platform, evolving from 1.0 to 3.0, addresses the growing data analysis demands of the internet industry by empowering self‑service analytics, introducing multi‑stage architectures, advanced computation engines, customizable SQL, and visual dashboards, thereby improving efficiency, scalability, and data security for business users.

Big Data · Data Platform · Self-Service Analytics
0 likes · 18 min read
StarRocks
Jan 30, 2024 · Big Data

How InLong Guarantees Exactly‑Once Real‑Time Writes to StarRocks

This article explains how Apache InLong provides automatic, secure, high‑performance real‑time data transfer to StarRocks, detailing the transactional Stream Load API, the two‑phase commit process, Flink‑based ingestion architecture, exactly‑once guarantees, and performance test results across different parallelism levels.

Big Data · Exactly-Once · InLong
0 likes · 11 min read
Big Data Technology & Architecture
Jan 29, 2024 · Databases

Practical Experience of StarRocks Materialized Views at Didi

This article details Didi's evolution of OLAP systems, the adoption of StarRocks for high‑performance MPP analytics, and how materialized views, global dictionary mapping, and transparent acceleration were engineered to boost real‑time dashboard queries while outlining performance gains, challenges, and future optimization plans.

Big Data · Didi · OLAP
0 likes · 16 min read
DataFunTalk
Jan 28, 2024 · Databases

Practical Experience of StarRocks Materialized Views at Didi

This article presents Didi's practical experience with StarRocks materialized views, covering the evolution of its OLAP architecture, the challenges of previous engines, the adoption of StarRocks, the design of materialized view acceleration for real‑time dashboards, and future optimization directions.

Big Data · Data Platform · OLAP
0 likes · 17 min read
DataFunTalk
Jan 27, 2024 · Big Data

JuiceFS: A Cloud‑Native Distributed File System for Data Lake and Lakehouse

This article presents JuiceFS, a cloud‑native distributed file system that bridges the gaps between HDFS and object storage, explaining Data Lake and Lakehouse concepts, comparing storage options, detailing JuiceFS's architecture and performance benefits, and showcasing real‑world user case studies.

Big Data · Distributed File System · JuiceFS
0 likes · 23 min read
DataFunSummit
Jan 26, 2024 · Big Data

Data Governance Practices for E‑commerce Platforms: Challenges, Frameworks, and Solutions

This article details Volcano Engine DataLeap's comprehensive data governance system for e‑commerce platforms, covering the key challenges of SLA quality, model stability, cost control, and low efficiency, and presenting a five‑part framework that includes top‑level architecture, systematic stability and cost governance, tool‑driven automation, SLA assurance processes, and future outlooks.

Big Data · Stability · automation
0 likes · 18 min read
DataFunSummit
Jan 25, 2024 · Big Data

Best Practices of Jushuitan Cloud‑Native OLAP Architecture and Logistics Warning

This article presents Jushuitan's cloud‑native OLAP architecture, covering business background, data‑warehouse evolution, real‑time processing with Flink, Hologres, and Aerospike, and detailed logistics‑warning use cases, followed by technical challenges, future outlook, and a Q&A on implementation details.

Big Data · Data Warehouse · Flink
0 likes · 20 min read
Huawei Cloud Developer Alliance
Jan 25, 2024 · Fundamentals

Inside China’s 2024 National Advanced Computer Teaching Training: Highlights and Insights

The 2024 National Advanced Computer Teaching Training held in Dongguan brought together over 200 university teachers from 119 schools to explore cutting‑edge topics such as cloud data warehouses, AI platforms, digital logic, and OpenHarmony, showcasing industry‑academic collaboration and practical hands‑on sessions.

Big Data · Cloud Computing · computer education
0 likes · 11 min read
DataFunSummit
Jan 24, 2024 · Big Data

Trends, Challenges, and Technical Practices of Modern Data Analysis and Indicator Platforms

This article reviews the evolution of data analysis and business intelligence, highlights current trends such as precision, agility, and real‑time needs, discusses common challenges, and presents the design and implementation of a unified semantic layer and indicator platform to enable agile, accurate, and real‑time analytics.

Big Data · Data Analysis · Metrics Platform
0 likes · 14 min read
政采云技术
Jan 23, 2024 · Big Data

Design and Implementation of a Big Data Permission Management System

This article outlines the background, importance, scenarios, challenges, objectives, and architectural design—including RBAC and ABAC models, metadata integration, data classification, and verification mechanisms—of a comprehensive big data permission management system for secure and fine‑grained data access.

ABAC · Big Data · Data Security
0 likes · 14 min read
MaGe Linux Operations
Jan 21, 2024 · Big Data

Master Kafka: Core Concepts, Metrics, and Troubleshooting Guide

This article explains Kafka's fundamental components, version evolution, key monitoring metrics for producers, brokers, consumers, and ZooKeeper, and provides step‑by‑step troubleshooting methods for common issues such as slow topic throughput and message backlog.

Big Data · Kafka · Message queue
0 likes · 8 min read
DataFunTalk
Jan 20, 2024 · Big Data

How ByteDance Leverages the Data Flywheel in Large‑Scale Projects

This article explains how ByteDance (Douyin) transforms its data infrastructure from isolated workshops to a unified middle platform and finally to a data flywheel, detailing the three development stages, the Data BP organizational model, real‑time analytics, A/B testing, and the resulting business benefits for large‑scale event projects.

Big Data · Data Engineering · Data Flywheel
0 likes · 13 min read
Test Development Learning Exchange
Jan 20, 2024 · Big Data

Practical Data Analysis Code Samples for Business Decision Making

This article presents ten practical Python code examples that demonstrate common data analysis techniques—such as handling missing values, sorting, pivot tables, visualization, association rules, outlier detection, time‑series forecasting, clustering, feature selection, and cross‑validation—to help improve business decision effectiveness.

Big Data · Business Intelligence · Python
0 likes · 4 min read
JD Tech
Jan 18, 2024 · Databases

Understanding ClickHouse: Architecture, Principles, and Performance

This article introduces ClickHouse, an open‑source columnar OLAP database, explains its architecture—including columnar storage, block processing, LSM, indexing and vectorized execution—highlights its performance advantages over other engines, and discusses its limitations such as write‑amplification, concurrency constraints, and ZooKeeper dependency.

Big Data · ClickHouse · Columnar Database
0 likes · 12 min read
Bitu Technology
Jan 17, 2024 · Artificial Intelligence

Rosetta Stone: Scalable ID Mapping System for Tubi's Content Library Using LLMs and Embeddings

This article describes how Tubi built the Rosetta Stone system—a flexible ID mapping workflow that leverages large language models, embedding similarity ranking, and K‑nearest‑neighbors to unify and enrich metadata across a 200,000‑title library, improve content recommendation, and streamline operations.

Big Data · Embeddings · LLM
0 likes · 10 min read
Past Memory Big Data
Jan 17, 2024 · Big Data

How WeChat Implements a StarRocks‑Powered Lakehouse Across Multiple Business Scenarios

WeChat evolved its data platform from Hadoop to ClickHouse and finally to a StarRocks‑based lakehouse, solving data fragmentation and storage redundancy while achieving sub‑second to minute‑level query latency, cutting storage costs by over 65%, halving operational tasks, and reducing offline job time by two hours across several business lines.

Big Data · Lakehouse · Materialized Views
0 likes · 16 min read
360 Smart Cloud
Jan 15, 2024 · Big Data

Design and Optimization of the Ozone Distributed Object Storage System

This article presents a comprehensive overview of Ozone, a Hadoop‑based distributed object storage system, detailing its architecture, metadata management, scalability enhancements, small‑file handling, erasure coding, lifecycle policies, and future improvements aimed at boosting performance and reliability for large‑scale unstructured data workloads.

Big Data · Distributed Systems · Hadoop
0 likes · 15 min read
dbaplus Community
Jan 14, 2024 · Operations

How AI-Driven Event Intelligence Transforms Data Center Fault Management

The article explains the design and functionality of an AI‑enhanced event intelligent analysis system that automates fault identification, analysis, and remediation in data‑center operations, detailing its architecture, integration with monitoring, CMDB, ITSM, big‑data platforms, and the AI techniques that enable automatic modeling, clustering, and knowledge‑base retrieval.

AI · Big Data · automation
0 likes · 18 min read
DataFunTalk
Jan 14, 2024 · Big Data

Optimizing Object Storage and Impala Engine in NetEase NDH: Performance Enhancements and Feature Additions

This presentation outlines NetEase's NDH big‑data platform, detailing its background, object‑storage upload and rename optimizations, Impala engine adaptations—including file‑handle caching, transparent URI handling, and getFileBlockLocations improvements—and a suite of operational enhancements such as dynamic proxy user configuration and audit‑log extensions.

Alluxio · Big Data · Impala
0 likes · 14 min read