Tagged articles
3684 articles
ITPUB
Jul 29, 2025 · Big Data

How to Deduplicate 4 Billion QQ IDs Using a Bitmap Within 1 GB Memory

Learn how to efficiently remove duplicates from 4 billion QQ numbers using a memory‑friendly Bitmap approach that fits within a 1 GB limit, including calculations, step‑by‑step implementation, Java code, and a discussion of its advantages and drawbacks.

Big Data · Data Structures · Deduplication
0 likes · 9 min read
360 Tech Engineering
Jul 29, 2025 · Information Security

How AI and Big Data Are Redefining Global Cybersecurity – Insights from Zhou Hongyi

In his 2025 World Internet Conference Digital Silk Road Forum keynote, Zhou Hongyi warned that the programmable, AI‑driven, data‑centric world amplifies cyber vulnerabilities, described the rise of state‑level cyber warfare and AI‑powered attacks, and outlined 360’s security‑as‑a‑service strategy and global cooperation plans to protect nations and enterprises.

AI · Big Data · Security Operations
0 likes · 5 min read
Alibaba Cloud Big Data AI Platform
Jul 29, 2025 · Big Data

How GoTerra Cut Costs and Boosted Speed: BigQuery‑to‑MaxCompute Performance Secrets

This article details the real‑world migration of a leading Southeast Asian tech group from BigQuery to MaxCompute, exposing the three major challenges, the data‑driven performance‑optimization methodology, and the concrete techniques—Auto Partition, UNNEST redesign, large‑query graph optimizations, and intelligent tuning—that delivered dramatic cost reductions and query‑speed gains.

Auto Partition · Big Data · Data Warehouse Migration
0 likes · 17 min read
Bilibili Tech
Jul 25, 2025 · Big Data

How Unified Metadata Lineage Transforms Big Data Governance and Security

This article introduces the comprehensive design and evolution of a unified metadata lineage platform for big data, covering background, data processing chain, lineage models, system architecture, quality metrics, application scenarios, and future plans to enhance data governance, quality, and security.

Big Data · Data Governance · Data Security
0 likes · 27 min read
Alibaba Cloud Big Data AI Platform
Jul 25, 2025 · Big Data

Cross-Contrastive Learning Cuts Flink Anomaly Detection Errors by 12%

The paper “Noise Matters: Cross Contrastive Learning for Flink Anomaly Detection”, accepted at VLDB 2025, introduces a novel cross‑contrastive method that leverages attention‑based representations and a boundary‑aware loss to detect Flink‑specific hotspot anomalies, achieving a 12.1% F1 improvement over state‑of‑the‑art techniques.

Big Data · Cross-Contrastive Learning · Flink
0 likes · 6 min read
ITFLY8 Architecture Home
Jul 20, 2025 · Big Data

Exploring the Architecture of a Data Lake and Application Platform

This article outlines the overall architecture, data architecture, logical project structure, and the construction of a data resource center for a data lake and application platform, illustrated through a series of diagrams that depict each component and their interconnections.

Big Data · Data Platform · Data Resource Center
0 likes · 1 min read
DataFunSummit
Jul 19, 2025 · Artificial Intelligence

Big Data Meets Generative AI: Industry Transformations from Prof. Dou

Prof. Dou Dejing shares his journey into Fudan University's Data Intelligence Lab, outlines the history and synergy of big data and AI, reviews generative AI breakthroughs, evaluates large‑model strengths and weaknesses, and explores their expanding industrial applications and market potential.

Artificial Intelligence · Big Data · Generative AI
0 likes · 13 min read
DataFunSummit
Jul 18, 2025 · Databases

Boosting ClickHouse on WeChat: Performance Tools, Lakehouse Hacks & AI

This article explores how ClickHouse is deployed across WeChat for real‑time analytics, introduces a suite of performance‑monitoring tools, details lakehouse read and bitmap optimizations, and describes the integration of AI‑driven vector search, showcasing substantial speedups and scalability improvements.

AI · Big Data · ClickHouse
0 likes · 12 min read
Youzan Coder
Jul 18, 2025 · Cloud Native

How Mixed Workloads Boost Kubernetes CPU Utilization by Over 40%

This article explains how Youzan transformed its Kubernetes clusters from static over‑commit scheduling to load‑balanced mixed workloads using Koordinator and the Anolis (Longxi) kernel, achieving higher CPU utilization, lower costs, and better resource management for both online and offline services.

Big Data · Cloud Native · Koordinator
0 likes · 10 min read
DataFunSummit
Jul 18, 2025 · Big Data

Data Lake & Lakehouse Innovations: Real-Time Analytics and Industry Case Studies

This article presents a curated collection of cutting‑edge data lake and lakehouse case studies—including real‑time analytics, cloud‑native architectures, industry implementations from sales platforms to automotive IoT, and the latest advancements in open‑source projects—offering insights into modern big‑data strategies and governance.

Big Data · Lakehouse · cloud architecture
0 likes · 2 min read
Big Data Technology & Architecture
Jul 16, 2025 · Big Data

Master Flink Optimizations: TTL, Mini‑Batch, Two‑Phase Aggregation, Lookup Join & More

This article reviews the most effective Flink optimization techniques since 2022, including operator‑level TTL, mini‑batch processing, two‑phase aggregation, multi‑dimensional DISTINCT with FILTER, lookup join caching strategies, and TopN implementations, each rated with recommendation stars for production use.

Big Data · Flink · Lookup Join
0 likes · 8 min read
Alibaba Cloud Big Data AI Platform
Jul 15, 2025 · Big Data

How MaxCompute’s Append DeltaTable Transforms BigQuery Migration

This article details the complex migration of a leading Southeast Asian tech group's data warehouse from Google BigQuery to Alibaba Cloud MaxCompute, outlining challenges such as storage format differences, SQL compatibility, and performance tuning, and explains how the new Append DeltaTable format with dynamic bucketing and incremental reclustering resolves these issues.

Big Data · Data Migration · Data Warehouse
0 likes · 19 min read
IT Architects Alliance
Jul 10, 2025 · Cloud Native

Inside Alibaba’s Tech Stack: Cloud‑Native Architecture Behind Billions of Transactions

This article examines Alibaba's extensive cloud‑native technology stack—including distributed computing, storage, middleware, real‑time data processing, AI platforms, performance engineering, and security—revealing how its architects design systems that handle massive transaction volumes during events like Double 11.

Big Data · Distributed Systems · Security
0 likes · 12 min read
IT Architects Alliance
Jul 8, 2025 · Cloud Native

Why Do Big‑Tech Architects Earn Six Figures? The Skills That Set Them Apart

The article explores why architects at leading tech firms command six‑figure salaries while those in traditional companies earn far less, highlighting gaps in technical depth, massive data handling, performance optimization, business insight, continuous learning, and the scarcity of true senior architects.

Big Data · Distributed Systems · Performance Optimization
0 likes · 9 min read
Model Perspective
Jul 8, 2025 · Big Data

Why Historical Data Can Mislead Your Forecasts—and What to Do Instead

The article explains how relying solely on historical data for prediction often leads to large errors because future structural changes and missing variables are ignored, and it proposes causal modeling, scenario simulation, and real‑time signals as more reliable forecasting approaches.

Big Data · causal modeling · forecasting
0 likes · 9 min read
Big Data Technology & Architecture
Jul 8, 2025 · Big Data

Flink’s AI Agents and Disaggregated State: Transforming Big Data

The article reviews key topics from the Flink Forward Asia (FFA) 2025 conference in Singapore, highlighting Flink’s new AI‑focused Agents framework, the breakthrough Flink 2.0 disaggregated state architecture, emerging lake storage solutions like Paimon, and the Fluss streaming table store, illustrating how big‑data platforms are evolving for AI workloads.

AI agents · Big Data · Disaggregated State
0 likes · 6 min read
DataFunTalk
Jul 7, 2025 · Big Data

Unlock Real-Time Analytics with Cloud Lakehouse: A Complete Guide

This article presents a curated list of sessions covering cloud Lakehouse technology for real-time, multidimensional data analysis, including case studies from SalesEasy, Changan Auto, Tencent, and JD, as well as discussions on data lake adoption, streaming lake Paimon, and the relevance of metadata‑driven data governance in the digital economy.

Big Data · Data Governance · IoT
0 likes · 2 min read
DataFunTalk
Jul 6, 2025 · Big Data

How Cloud Lakehouse Is Redefining Real-Time Multi-Dimensional Data Analytics

This article presents a curated list of case studies and insights on cloud Lakehouse technology, covering real-time intelligent analytics, data architecture simplification, IoT big‑data platforms, integrated data platforms, and the evolving role of metadata‑driven data governance in the digital economy.

Big Data · Data Governance · Lakehouse
0 likes · 2 min read
FunTester
Jul 5, 2025 · Big Data

Master Kafka: Core Concepts and Performance Testing Strategies

This article explains Kafka’s high‑performance distributed streaming architecture, key components such as topics, partitions, producers, consumers, brokers, offsets, and ZooKeeper, and provides step‑by‑step workflows for producers and consumers along with performance‑testing tips and Maven setup.

Big Data · Java · Kafka
0 likes · 9 min read
360 Tech Engineering
Jul 4, 2025 · Artificial Intelligence

How AI is Revolutionizing Security Operations: Insights from the 2025 Global Digital Economy Conference

The 2025 Global Digital Economy Conference highlighted the fusion of big data and AI in security, revealing both the transformative potential of large‑model technologies for operational efficiency and the critical challenges they pose, while showcasing 360's AI‑native platform and measurable performance gains.

AI security · Big Data · Security Operations
0 likes · 5 min read
Big Data Technology & Architecture
Jul 4, 2025 · Big Data

Spark 4.0: New Features, Performance Gains, and Why It Still Leads Big Data

Despite the hype around Flink and AI models, Spark 4.0’s release brings a lightweight Python client, Spark Connect GA, enhanced SQL optimization, vectorized execution, and AI integration, reaffirming its leading position in the big‑data ecosystem while hinting at future challenges and innovations.

Big Data · Data Engineering · Performance Optimization
0 likes · 6 min read
Baidu Geek Talk
Jul 2, 2025 · Big Data

Baidu’s Secret to Faster Search Data: Wide‑Table Modeling & Fusion Engine

This article outlines Baidu’s innovative approach to building its search data platform, detailing the design of wide‑table models, the upgrade to a Spark‑based fusion computation engine, and the new Turing 3.0 service delivery framework, which together deliver higher efficiency, lower cost, and faster, more reliable analytics.

Big Data · Data Warehouse · Fusion Engine
0 likes · 21 min read
Mike Chen's Internet Architecture
Jul 1, 2025 · Big Data

Master ElasticSearch: Core Concepts, Architecture, and Search Workflow Explained

This article provides a comprehensive overview of ElasticSearch, covering its definition, core components such as indexes, shards and replicas, the analysis pipeline, inverted index mechanics, and the two‑stage search process that enables scalable, fault‑tolerant full‑text search in big‑data environments.

Analyzers · Big Data · Distributed Search
0 likes · 7 min read
Big Data Technology & Architecture
Jul 1, 2025 · Big Data

What’s New in Apache Hive 4.0? Key Features and Industry Outlook

Drawing on a weekend dive into Apache Hive’s official Wiki and GitHub, this article notes Hive’s declining visibility compared to Spark and Flink and examines the major features of its 4.0 release, including Iceberg integration, enhanced ACID, cost‑based optimizer upgrades, and Ozone support, while reflecting on its role in modern data ecosystems.

Apache Hive · Big Data · Data Warehouse
0 likes · 4 min read
DataFunSummit
Jun 22, 2025 · Databases

Unlocking Apache Doris: How Lakehouse Integration Supercharges Data Analytics

This article walks through Apache Doris’s unified lakehouse architecture, explains its core value and paradigm, details the system’s components and use cases, and examines technical challenges such as file‑format diversity and I/O stability. It then presents a suite of optimizations, from predicate push‑down and partition pruning to metadata caching and dynamic scheduling, that dramatically improve query performance and resource utilization, and outlines future roadmap plans.

Apache Doris · Big Data · Data Warehouse
0 likes · 22 min read
ITFLY8 Architecture Home
Jun 13, 2025 · Artificial Intelligence

Designing AI-Ready Data Architecture: Key Features and Future Trends

AI-era data architecture must handle massive, multimodal datasets with real-time processing, prioritize data quality over quantity, support scalability, provenance, and native ML/AI integration, while addressing governance, security, and ethical challenges through emerging technologies like data fabric, mesh, and federated learning.

AI · Big Data · Data Architecture
0 likes · 6 min read
DataFunSummit
Jun 10, 2025 · Big Data

How OpenLake Redefines Data Lake Infrastructure for the AI Era

This article explores OpenLake's evolution as a data lake platform for AI, covering the transition from Hive to modern lake formats like Iceberg and Paimon, performance benchmarks, metadata management advances, intelligent storage optimization, and the integration of multimodal support with the Lance file format.

AI · Big Data · OpenLake
0 likes · 22 min read
Alibaba Cloud Big Data AI Platform
Jun 10, 2025 · Big Data

Boosting Automotive Data Processing with Alibaba Cloud EMR Serverless Spark

This article details how a leading automotive parts supply‑chain platform migrated from a traditional Hadoop stack to Alibaba Cloud EMR Serverless Spark and DataWorks, achieving faster, more elastic, and cost‑effective data processing, enhanced AI integration, and significant operational improvements across multiple business scenarios.

Big Data · Cloud Native · EMR Serverless
0 likes · 12 min read
Lobster Programming
Jun 9, 2025 · Databases

How to Add a Column to Billion‑Row Tables Without Downtime

This article explains a metadata‑driven approach for extending massive tables—using a separate extension table, sharding, and Elasticsearch sync—to add new fields to billion‑row databases without locking the primary table or disrupting online services.

Big Data · Database Schema · Elasticsearch
0 likes · 6 min read
DataFunSummit
Jun 6, 2025 · Big Data

How Unicom Digital’s Integrated Data Platform Revolutionizes Metadata Management

This article details Unicom Digital’s metadata management practice on its integrated data platform, covering the strategic background of data, key challenges, award-winning capabilities, three-pronged solutions—automation, linking+, and AI—along with practical implementations, full‑chain lineage, data responsibility, lifecycle management, and future AI‑driven enhancements.

AI · Automation · Big Data
0 likes · 18 min read
Instant Consumer Technology Team
Jun 5, 2025 · Big Data

Mastering Kafka in Production: Boost Throughput, Ensure Reliability, and Avoid Data Loss

This article shares practical Kafka production insights, covering architecture overview, producer throughput tuning, message loss prevention, broker and consumer configurations, duplicate consumption avoidance, backlog mitigation, ordering guarantees, and the mechanics of consumer group rebalancing, helping engineers build stable, high‑performance streaming pipelines.

Big Data · Kafka · Message Queue
0 likes · 15 min read
DataFunSummit
May 22, 2025 · Operations

Automated Fault Detection and Repair System for Grab's Data Pipelines (Hugo) – Architecture, Implementation, and Impact

This article presents Grab's Hugo platform, an automated fault‑detection and self‑healing system for over 4,000 data pipelines that combines multi‑source signal collection, intelligent diagnosis, layered auto‑repair, and a health API to dramatically improve data visibility, reduce manual intervention, and boost operational efficiency across the company.

Automation · Big Data · DataOps
0 likes · 12 min read
Zhuanzhuan Tech
May 21, 2025 · Big Data

How We Turned a Microservice Finance System into a Scalable Big‑Data Warehouse

This article details the evolution of a fast‑growing e‑commerce finance platform from a monolithic microservice architecture plagued by data inconsistency, low processing efficiency, and scalability limits to a robust, distributed big‑data warehouse using SparkSQL, layered data models, and optimized scheduling, achieving ten‑fold performance gains and near‑zero failure rates.

Big Data · Data Warehouse · ETL
0 likes · 21 min read
Java Backend Technology
May 21, 2025 · Big Data

Master DataX: Fast Offline Data Sync for MySQL without mysqldump

This guide explains how to use Alibaba's open‑source DataX tool to perform high‑performance offline synchronization between heterogeneous MySQL databases, covering installation, framework design, job configuration, full‑ and incremental sync, and practical command‑line examples.

Big Data · DataX · ETL
0 likes · 15 min read
Big Data Technology & Architecture
May 21, 2025 · Big Data

Interview Experience: Flink Task Resource Allocation, Issues, and Monitoring

This article shares an interviewee's experience discussing core Flink interview questions, including typical resource allocation for large online tasks, common problems such as data, performance, stability, and resource issues, and the monitoring practices for clusters and tasks, while also containing a brief self‑promotion.

Big Data · Flink · Interview
0 likes · 7 min read
Xiaohongshu Tech REDtech
May 19, 2025 · Industry Insights

How Xiaohongshu Built a Minute‑Level Near‑Real‑Time Data Warehouse with Incremental Computing

Facing billions of daily logs and the need for minute‑level experiment metrics, Xiaohongshu partnered with Yunqi Tech to design a generic incremental‑compute solution that delivers near‑real‑time data warehousing with lower cost, higher accuracy, simplified pipelines, and improved query performance.

Big Data · Flink · Iceberg
0 likes · 24 min read
Huolala Tech
May 14, 2025 · Big Data

How Lalamove Scaled Real‑Time Data Warehousing with Flink and Paimon

Lalamove’s international logistics platform transformed its real‑time data warehouse by leveraging Apache Flink and the Paimon lakehouse, addressing challenges of multi‑region data centers, time‑zone diversity, frequent upstream changes, and high costs, while improving scalability, latency, and operational efficiency across global markets.

Big Data · Flink · Paimon
0 likes · 13 min read
JD Tech
May 13, 2025 · Databases

Unlock ClickHouse’s Lightning‑Fast Queries: Architecture, Storage, and Index Secrets

This article examines ClickHouse’s high‑performance OLAP design, covering its MPP architecture, columnar storage, vectorized execution, pre‑sorting, table engines, extensive data‑type system, sharding and replication strategies, as well as its sparse and skip‑index mechanisms that together enable ultra‑fast analytics on massive datasets.

Big Data · ClickHouse · Columnar Storage
0 likes · 16 min read
macrozheng
May 12, 2025 · Big Data

Master DataX: Efficient Data Synchronization for Massive MySQL Datasets

Learn how to overcome inaccurate reporting and cross-database challenges by using Alibaba’s open-source DataX tool to efficiently synchronize massive MySQL datasets, covering its architecture, job scheduling, installation, configuration, full- and incremental sync, and practical command-line examples.

Big Data · DataX · ETL
0 likes · 15 min read
Top Architect
May 7, 2025 · Big Data

Using DataX for Efficient MySQL Data Synchronization

This article provides a comprehensive guide on using Alibaba's open‑source DataX tool for efficient offline synchronization between heterogeneous databases such as MySQL, covering its architecture, installation on Linux, job configuration, full‑ and incremental data transfer, and practical code examples.

Big Data · DataX · ETL
0 likes · 18 min read
DataFunSummit
May 4, 2025 · Big Data

Iceberg Table Format Practice in Huawei Terminal Cloud

This article explains how Huawei's terminal cloud adopts the Apache Iceberg table format to efficiently manage large-scale datasets, detailing its architecture, feature engineering, merge operations, LSM-based storage, schema versioning, AB testing support, catalog enhancements, and future roadmap for full lifecycle data governance.

Big Data · Huawei Cloud · Iceberg
0 likes · 13 min read
JD Tech
Apr 30, 2025 · Artificial Intelligence

TimeHF: A Billion‑Scale Time Series Forecasting Model Guided by Human Feedback

The JD Supply Chain algorithm team introduces TimeHF, a billion‑parameter time‑series large model that leverages RLHF to boost demand‑forecast accuracy by over 10%, detailing dataset construction, the PCTLM architecture, a custom RLHF framework (TPO), and extensive SOTA experimental results.

Big Data · Large Language Models · RLHF
0 likes · 10 min read
Big Data Tech Team
Apr 28, 2025 · Big Data

Mastering Metadata, Master Data, and Data Governance: A Complete Guide

This article explains the core concepts of metadata, master data, data resources, data governance, and data management, outlines their roles, compares governance with management, and provides practical steps and best‑practice recommendations for building a robust enterprise data framework.

Big Data · Data Governance · Master Data
0 likes · 15 min read
Alibaba Cloud Big Data AI Platform
Apr 27, 2025 · Big Data

Scaling Property Services: StarRocks‑Powered Storage‑Compute Separation for 8000+ Communities

Facing a flood of data from over 8,000 communities, the Bifeng service team migrated from a monolithic storage‑compute architecture to a StarRocks‑based storage‑compute separation solution, achieving lower costs, higher resource utilization, faster queries, and improved SLA across their property management platform.

Big Data · Data Warehouse · Infrastructure Migration
0 likes · 11 min read
Alibaba Cloud Big Data AI Platform
Apr 24, 2025 · Big Data

Boosting Product Recommendations with Serverless Spark and Milvus: A Real‑World Case Study

Chanmama (蝉妈妈) migrated its recommendation platform to Alibaba Cloud Serverless Spark and Milvus, replacing traditional vector search and Spark clusters, achieving 40% faster offline tasks, 80% lower failure rates, significant cost savings, and scalable, low‑latency similar‑product retrieval for personalized marketing.

Big Data · Milvus · recommendation system
0 likes · 8 min read
Big Data Tech Team
Apr 20, 2025 · Industry Insights

Essential Skills & Tech Stacks for Every Data Team Role

This guide breaks down the main positions in a data team, from data development and analysis engineers to product managers and operations specialists, detailing each role’s key responsibilities, essential skill sets, and the typical technology stack they rely on.

Big Data · Data Analytics · Data Engineering
0 likes · 7 min read
dbaplus Community
Apr 20, 2025 · Databases

Why Wide Tables Fail and How to Design Them Efficiently

This article explains what wide tables are, why they are controversial, outlines three common design pitfalls with practical avoidance tips, and introduces three key technologies—ClickHouse, Cassandra, and Hudi/Iceberg—to help engineers build performant, maintainable wide‑table solutions in data warehouses.

Big Data · ClickHouse · Database Design
0 likes · 7 min read
macrozheng
Apr 18, 2025 · Big Data

How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data

This article explains why traditional databases like MySQL struggle with massive data, introduces Elasticsearch’s advantages, and details a practical architecture using Hive, Canal, and Otter to achieve near real‑time indexing of petabyte‑scale datasets with minimal latency.

Big Data · Canal · Data Transfer Service
0 likes · 20 min read
AntTech
Apr 17, 2025 · Artificial Intelligence

Data+AI Forum at the 18th China Electronics Information Conference (2025) – Speaker Bios and Session Summaries

The 18th China Electronics Information Conference will be held in Chengdu from April 17‑21, 2025, featuring the DATA+AI forum that gathers leading academicians and industry experts to discuss data‑AI integration, with detailed speaker biographies, presentation titles, and abstracts covering topics such as large‑model inference, cloud‑edge ultrasound diagnostics, and the future of databases in the AI era.

@Data · AI · Big Data
0 likes · 12 min read
Big Data Technology & Architecture
Apr 17, 2025 · Big Data

MaxCompute: Intelligent Data Warehouse Platform for the Data+AI Era

This article, based on a meetup presentation, details Alibaba Cloud's MaxCompute platform—its evolution, serverless architecture, AI integration, distributed Python framework, Object Table, near‑real‑time processing, and intelligent warehouse features—addressing the challenges of data warehouses in the Data+AI era.

Big Data · Data Warehouse · Distributed computing
0 likes · 11 min read
vivo Internet Technology
Apr 16, 2025 · Big Data

Offline Mixed Deployment of Spark Tasks on Kubernetes: Containerization, Scheduling, and Elastic Resource Management

The article explains how the vivo Internet Big Data team containerized offline Spark jobs and deployed them with the Spark Operator on a mixed online‑offline Kubernetes cluster, using elastic scheduling and resource‑over‑subscription to boost CPU utilization by 30‑40% and handle over 100,000 daily tasks.

Big Data · Containerization · Kubernetes
0 likes · 36 min read
dbaplus Community
Apr 15, 2025 · Big Data

How Xiaohongshu Boosted Data Warehouse Performance with Logical Datasets and Materialized Views

Xiaohongshu introduced logical datasets and materialized views to overcome low reuse of APP tables, limited scalability of single‑table BI datasets, and poor dashboard query performance, achieving higher data processing efficiency and faster query responses through optimized data flow, query pruning, and accelerated ETL scheduling.

Big Data · logical dataset · query optimization
0 likes · 24 min read
DataFunSummit
Apr 13, 2025 · Big Data

Data Governance at Didi: Interview with Liu Chao on Big Data Asset Management

In this interview, Didi data governance lead Liu Chao discusses his career journey, the unique technical architecture of Didi’s big‑data governance system, cost‑driven pricing models, metadata management, lineage extraction, automation practices, and offers practical advice for enterprises seeking effective data governance.

Automation · Big Data · Cost-based Pricing
0 likes · 12 min read
JD Cloud Developers
Apr 11, 2025 · Artificial Intelligence

How a Billion-Parameter Time Series Model Beats GPT4TS: The PCTLM Breakthrough

This article introduces PCTLM, a pioneering billion‑parameter pure time‑series large model that outperforms existing solutions like GPT4TS across multiple benchmarks, detailing its massive high‑quality dataset, novel patch‑based architecture, and a tailored RLHF framework (TPO) that enhances zero‑shot forecasting accuracy.

Big Data · PCTLM · RLHF
0 likes · 11 min read
DataFunTalk
Apr 9, 2025 · Big Data

Highlights of the Apache Hudi Asia Technical Salon Hosted by Kuaishou – Practices and Innovations from Leading Companies

The Kuaishou‑hosted Apache Hudi Asia technical salon gathered over 230 attendees and featured seven experts from Kuaishou, Meituan, Douyin, Huawei, JD and others, who shared best practices, architecture designs, and performance optimizations for large‑scale data lake applications across AI, BI, and real‑time workloads.

AI · Apache Hudi · Batch Processing
0 likes · 14 min read
JD Retail Technology
Apr 8, 2025 · Databases

ClickHouse Architecture and Core Technologies Overview

ClickHouse is an open‑source, massively parallel, column‑oriented OLAP database that integrates its own columnar storage, vectorized batch processing, pre‑sorted data, diverse table engines, extensive data types, sharding with replication, sparse primary‑key and skip indexes, and a multithreaded query engine, delivering high‑throughput real‑time analytics on massive datasets.

Big Data · ClickHouse · Columnar Storage
0 likes · 15 min read
DataFunSummit
Apr 3, 2025 · Big Data

Apache Hudi Asia Technical Salon Highlights: Practices and Innovations from Kuaishou, Meituan, Douyin, Huawei, and JD

The Apache Hudi Asia technical salon held in Beijing on March 29 gathered over 230 on‑site participants and 16,000 online viewers, featuring expert talks from leading Chinese tech companies that showcased real‑world Hudi implementations, performance optimizations, and the future roadmap for data‑lake technologies.

Apache Hudi · Big Data · Flink
0 likes · 13 min read
Kuaishou Tech
Apr 2, 2025 · Big Data

Apache Hudi Asia Summit Successfully Held

The first Apache Hudi Asia Summit in Beijing attracted over 230 attendees and featured technical discussions on data lake optimization and case studies from companies like Kuaishou and Meituan.

Apache Hudi · Big Data · Data Engineering
0 likes · 12 min read
Big Data Technology & Architecture
Apr 2, 2025 · Databases

Replacing Elasticsearch with Apache Doris for Real‑Time Big Data Analytics: Architecture, Performance, and Enterprise Cases

This article analyzes why Elasticsearch struggles with large‑scale, complex real‑time analytics and demonstrates how Apache Doris’s MPP, columnar storage, and native SQL support provide a cost‑effective, high‑performance alternative, illustrated with detailed enterprise case studies.

Apache Doris · Big Data · Elasticsearch
0 likes · 11 min read
Mingyi World Elasticsearch
Apr 1, 2025 · Big Data

Elasticsearch Unveiled: Learn Search Engine Basics Through Comics

This visual guide uses comic‑style illustrations to walk readers through Elasticsearch fundamentals, from architecture and indexing to clustering, query DSL, aggregations, and performance tuning, and also touches on security considerations, multilingual support, and real‑time search capabilities.

Big Data · Distributed Systems · Elasticsearch
0 likes · 2 min read
DataFunSummit
Apr 1, 2025 · Big Data

Understanding Flink CDC 3.3: Features, Improvements, and Future Plans

This article provides a comprehensive overview of Flink CDC 3.3, detailing its CDC fundamentals, new connectors, Transform module enhancements, asynchronous snapshot splitting, community adoption, and upcoming roadmap for broader ecosystem support and batch‑mode execution.

Big Data · CDC · Change Data Capture
0 likes · 15 min read
IT Architects Alliance
Mar 30, 2025 · Backend Development

Douyin’s Architectural Evolution: From Simple Beginnings to Scalable Cloud‑Native System

The article chronicles Douyin’s journey from a modest early‑stage architecture to a sophisticated, distributed, micro‑service and cloud‑native infrastructure that leverages load balancing, caching, big‑data frameworks, CDN, edge computing, and automated operations to support billions of users and massive traffic spikes.

Big Data · Douyin · cloud-native
0 likes · 12 min read
vivo Internet Technology
Mar 26, 2025 · Big Data

Reading Encrypted ORC Files in StarRocks: Architecture and Implementation Details

The article details how StarRocks extends the Apache ORC C++ library to decrypt column‑level encrypted ORC files, describing the file hierarchy, AES‑128‑CTR key handling, the query‑time master‑key retrieval, a decorator‑based decryption/decompression pipeline, and the block‑skip‑read mechanism that enables efficient predicate push‑down.

Big Data · Database · File Format
0 likes · 19 min read
Big Data Technology Architecture
Mar 25, 2025 · Big Data

Kafka 4.0 Release: KRaft Architecture, Consumer Group Optimizations, and New Queue Features

Kafka 4.0 marks a milestone release that replaces ZooKeeper with the KRaft consensus engine, improves scalability and performance, introduces a server‑side consumer‑group protocol, adds shared‑group queue capabilities, and updates Java requirements and documentation, delivering a more robust and flexible streaming platform.

Big Data · Distributed Streaming · Java11
0 likes · 6 min read
Baidu Geek Talk
Mar 24, 2025 · Big Data

How Turing Data Finder Transforms Growth Analysis with a Unified Data Platform

The article provides a detailed technical overview of the Turing Data Finder (TDF) platform, describing its background, core components, data schema, ingestion workflow, and a suite of growth‑analysis features such as event, retention, funnel, path, component, distribution, and attribution analysis, while also outlining performance‑optimisation techniques and future development directions.

Big Data · Data Engineering · Data Platform
0 likes · 17 min read
Didi Tech
Mar 20, 2025 · Big Data

Key Questions and Value Assessment in Data Warehouse Modeling and Development

The article explores nine fundamental questions about data‑warehouse modeling—why and when to model, how to evaluate and compare models, the warehouse’s unique role versus business systems, modern architectural shifts, a quantitative value‑proof scoring framework, industry‑standard versus custom approaches, demonstrating business impact, and career insights—concluding that true value lies in enabling informed decisions rather than technology hype.

AI · Big Data · Data Value
0 likes · 12 min read
Model Perspective
Mar 20, 2025 · Big Data

How to Sample Effectively in the Big Data Era: Methods and Best Practices

This article explores essential sampling strategies for big‑data environments—including simple random, reservoir, stratified, oversampling, undersampling, and weighted sampling—detailing their principles, algorithmic steps, advantages, drawbacks, and suitable application scenarios to help analysts choose the right method.

Big Data · Sampling · oversampling
0 likes · 8 min read
AntData
Mar 20, 2025 · Big Data

Design and Optimization of Real‑time Data Lake Tables with Paimon and Flink for Advertising Diagnostics

This article presents a comprehensive exploration of using Apache Paimon and Flink to design lake tables that support minute‑level latency, low cost, and unified batch‑stream processing for advertising data, covering schema design, partitioning strategies, performance trade‑offs, cost analysis, and operational best practices.

Big Data · Flink · Paimon
0 likes · 34 min read
Alibaba Cloud Big Data AI Platform
Mar 20, 2025 · Big Data

How to Read and Write StarRocks Data with EMR Serverless Spark

This step‑by‑step guide explains how to use EMR Serverless Spark together with the StarRocks Spark Connector to create a workspace, upload the connector JAR, configure network connections, create databases and tables in StarRocks, and perform read/write operations via SQL sessions, Notebook sessions, or batch Spark jobs, complete with code examples and UI screenshots.

Big Data · EMR Serverless · Spark
0 likes · 14 min read
Data Thinking Notes
Mar 19, 2025 · Big Data

How to Maximize Data Asset Value: From DataOps to Monetization

This report outlines a comprehensive framework for turning raw data into valuable assets, introducing DataOps and panoramic data architecture, and detailing practical methods for data value assessment, asset circulation, and operational mechanisms to help enterprises build a solid value baseline and expand data asset applications.

Big Data · Data Asset Management · Data Governance
0 likes · 4 min read
Alibaba Cloud Big Data AI Platform
Mar 17, 2025 · Big Data

How MaxFrame Enables Scalable Python AI Workloads on MaxCompute

This article introduces MaxFrame, a cloud‑native distributed Python compute service built on MaxCompute, detailing its architecture, seamless integration with the Python ecosystem, and real‑world use cases ranging from large‑scale data analysis and machine learning to offline LLM inference and custom image deployments.

Big Data · Data Warehouse · Distributed computing
0 likes · 18 min read
JD Tech
Mar 13, 2025 · Operations

Ensuring Stability of the Double 11 Supply‑Chain Dashboard: Full‑Link Process, Risk Points, and Technical Safeguards

This article details how JD Logistics guarantees the stability of its Double 11 supply‑chain dashboard by mapping the entire data‑flow, identifying risk points across ingestion, processing, storage, service, and monitoring layers, and applying targeted technical and organizational safeguards.

Big Data · Supply Chain · dashboard
0 likes · 10 min read
DataFunSummit
Mar 12, 2025 · Big Data

Principles and Common Optimization Techniques of the Spark SQL Optimizer

This article explains the underlying principles of the Spark SQL optimizer and presents three classic optimization paradigms—push‑down optimization, operator elimination/merging, and expression elimination/replacement—illustrating each with concrete rule implementations and code examples.

Big Data · Rule Engine · Spark SQL
0 likes · 12 min read
JD Tech Talk
Mar 12, 2025 · Big Data

Ensuring Stability of the Double‑11 Supply Chain Dashboard: Full‑Chain Process, Risk Points, and Technical Safeguard Strategies

This article details how the supply‑chain big‑screen dashboard for Double‑11 maintains high stability by mapping the full data‑flow, identifying risk points across ingestion, processing, storage and service layers, and applying comprehensive technical safeguards such as high‑availability design, fault‑tolerance, monitoring, and coordinated operational procedures.

Big Data · Supply Chain · dashboard
0 likes · 11 min read