Tagged articles
3697 articles
ShiZhen AI
Sep 7, 2022 · Big Data

Getting Started with DataHub: A One‑Stop Guide to Metadata Governance

This article walks you through the fundamentals of data governance, explains metadata management concepts, compares traditional tools with DataHub, and provides a step‑by‑step tutorial for installing Docker, Python, and DataHub 0.8.20 on CentOS 7, ingesting MySQL metadata, and exploring the UI.

Big Data · DataHub · Docker
0 likes · 19 min read
Huawei Cloud Developer Alliance
Sep 6, 2022 · Big Data

How China’s Universities Are Redesigning Big Data Education: Insights from the 2nd Virtual Research Meeting

The second virtual research meeting of China’s Data Science Curriculum Group gathered nearly a hundred educators and industry partners in Beijing to discuss new models for big‑data course design, curriculum construction, industry‑academia collaboration, and digital teaching platforms across multiple universities.

Big Data · Curriculum Design · Industry-Academia Collaboration
0 likes · 5 min read
DaTaobao Tech
Sep 6, 2022 · Big Data

SQL Optimization Techniques for ODPS (Open Data Processing Service)

The article presents practical ODPS SQL optimization strategies—including explicit column selection, partition limiting, multi‑insert, proper handling of nulls, join‑type choices, map‑join and skew hints, bucketed tables, and tuned task parameters—illustrated with three real‑world cases that dramatically cut execution time and resource usage.

Big Data · Data Skew · Hive
0 likes · 23 min read
Bilibili Tech
Sep 6, 2022 · Big Data

Lancer: Evolution of Bilibili's Real-Time Streaming Architecture

Lancer, Bilibili’s real‑time streaming backbone, has evolved from a monolithic Flume pipeline to a log‑id‑isolated, Kubernetes‑native architecture where Go edge agents feed synchronous Kafka‑proxied gateways into per‑logid topics processed by dedicated Flink‑SQL jobs, delivering exactly‑once, back‑pressured, highly scalable data ingestion for billions of daily requests.

Big Data · Flink · Kafka
0 likes · 29 min read
DevOps
Sep 5, 2022 · Big Data

Why Informationization Is Not Equal to Digitalization: Insights for Enterprise Digital Transformation

The article explains the fundamental differences between informationization and digitalization, outlines how enterprises can bridge the gap through data‑driven strategies, and provides practical frameworks and case studies such as Netflix and Huawei to guide traditional manufacturers in successful digital transformation.

Big Data · Digital Transformation · data driven
0 likes · 13 min read
DataFunTalk
Sep 4, 2022 · Big Data

Design and Implementation of Bilibili's Offline Multi‑Datacenter Solution

This article describes Bilibili's offline multi‑datacenter architecture, explaining why a scale‑out approach was chosen over scale‑up, and detailing the unit‑based design, job placement, data replication, routing, versioning, bandwidth throttling, traffic analysis, and the operational results and future directions.

Big Data · HDFS · Job Scheduling
0 likes · 24 min read
DataFunSummit
Sep 2, 2022 · Big Data

ZhongAn Insurance Data Platform: Digital Transformation, 4633 Framework, and Real‑time Data Warehouse with StarRocks

This article details ZhongAn Insurance's digital transformation through its 4633 data‑centric framework, the architecture of its JiZhi data platform, the challenges of its original ClickHouse‑based real‑time warehouse, and how migrating to StarRocks improved performance, scalability, and operational efficiency across advertising and insurance use cases.

Big Data · Data Platform · Digital Transformation
0 likes · 13 min read
Shopee Tech Team
Sep 2, 2022 · Big Data

Shopee Data System Challenges and Apache Hudi Practices

Shopee tackled its data‑system bottlenecks by customizing Apache Hudi to provide unified stream‑batch integration, efficient state‑detail snapshots, and low‑latency wide‑table generation, using CDC‑based bootstrapping, COW/MOR tables, savepoints and partial updates, which cut latency to ten minutes, lowered resource use, and yielded several community‑backed enhancements.

Apache Hudi · Big Data · Data Integration
0 likes · 18 min read
Aikesheng Open Source Community
Aug 31, 2022 · Big Data

Tencent's Big Data Construction: Philosophy, Architecture Evolution, and Open‑Source Strategy

The article introduces Tencent's big‑data platform philosophy and overall architecture, detailing three generations of evolution from offline Hadoop‑based processing to real‑time Spark/Storm integration and finally AI‑driven machine‑learning platforms, while also highlighting the team, book publication, and a related giveaway event.

Big Data · Data Platform · Tencent
0 likes · 12 min read
IT Architects Alliance
Aug 30, 2022 · Big Data

Understanding Kafka: Architecture, Topics, Partitions, Producers, Consumers, Offsets, Transactions, and Configuration

This article provides a comprehensive overview of Apache Kafka, explaining its distributed message‑queue architecture, the role of topics and partitions, producer and consumer workflows, leader election, offset management, consumer‑group rebalancing, delivery semantics, transaction processing, file organization, and key configuration settings.

Big Data · Distributed Messaging · Kafka
0 likes · 17 min read
DataFunSummit
Aug 30, 2022 · Operations

CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms

This article presents the design, implementation, and evaluation of CloudRCA, an intelligent root cause analysis framework for Alibaba Cloud's big‑data computing services, detailing challenges such as heterogeneous data, sample imbalance, and real‑time constraints, and describing the multi‑stage data processing, hierarchical Bayesian modeling, and deployment results that reduce MTTR by 20%.

Big Data · Operations · Root Cause Analysis
0 likes · 16 min read

How to Build a Unified Big Data Security Platform with Ranger and Custom Authorization

This article explains the design and implementation of a unified data security control platform that protects user privacy and corporate data across multiple big‑data components (Hive, Hetu, GaussDB) by integrating Apache Ranger, custom authorization APIs, asynchronous processing, distributed locking, and SDK‑based authentication to achieve fine‑grained, one‑stop permission management.

Authorization · Big Data · Data Security
0 likes · 17 min read
Architects' Tech Alliance
Aug 28, 2022 · Databases

Data Replication: Fundamentals, Technologies, and Industry Trends

The article explains data replication concepts, processes, and technologies across storage hardware, operating system, and database layers, outlines synchronous, asynchronous, and hybrid methods, discusses industry applications, trends such as hardware‑software decoupling, cloud replication, and big‑data real‑time copying, and highlights challenges and future directions.

Big Data · cloud · data replication
0 likes · 14 min read
Baidu Intelligent Cloud Tech Hub
Aug 26, 2022 · Cloud Computing

How Baidu Cloud Flow Log Boosts Network Visibility and Cuts Costs

Baidu Intelligent Cloud's Flow Log product provides real‑time, high‑throughput network flow collection, visualization, and analysis for VPC, dedicated line, and NAT gateways, enabling fault diagnosis, cost allocation, elephant‑flow management, and security inspection across ultra‑large scale cloud environments.

Big Data · Cloud Computing · Cost Management
0 likes · 10 min read
ByteDance Data Platform
Aug 24, 2022 · Big Data

How ByteDance Guarantees Real‑Time Data Point Quality with Scalable Validation

This article explains ByteDance's end‑to‑end data‑point (埋点) validation system, covering its technical challenges—usability, accuracy, real‑time visibility, stability, and extensibility—along with SDK integration, QR‑code workflow, JSON‑Schema verification, push‑service architecture, SLA metrics, and future automation plans.

Big Data · Push Service · SDK
0 likes · 11 min read
Python Programming Learning Circle
Aug 22, 2022 · Big Data

20 Data Visualization Tools: From Entry‑Level to Expert Solutions

This article surveys twenty data‑visualization tools—covering entry‑level options like Excel, online JavaScript libraries such as D3 and Google Chart API, interactive GUI utilities, map frameworks, advanced desktop environments, and expert‑grade platforms like R, Weka and Gephi—highlighting their key features, supported formats, and typical use cases.

Big Data · JavaScript · Mapping
0 likes · 11 min read
DataFunSummit
Aug 21, 2022 · Big Data

Alluxio Stress Testing Methods and Practices

This article explains the purpose, sources, and manifestations of pressure in Alluxio, describes its built‑in stress testing framework, outlines how to run and configure stress tools, and provides guidance on result calculation, reporting, common issues, and debugging for effective performance evaluation.

Alluxio · Big Data · Performance evaluation
0 likes · 11 min read
DataFunSummit
Aug 19, 2022 · Big Data

Taobao Data Model Governance: Challenges, Analysis, and Solutions

This article presents a comprehensive overview of Taobao's data model governance, detailing the background and problems of the current data architecture, analyzing root causes, proposing a structured governance framework with DataWorks automation, and outlining future plans to improve efficiency, standardization, and product tooling.

Alibaba · Big Data · DataWorks
0 likes · 13 min read
DeWu Technology
Aug 19, 2022 · Big Data

DeWu Reach Strategy Platform and HBase Buffer Pool Architecture

The DeWu Reach Strategy platform uses a task‑strategy‑action model and an HBase‑backed buffer pool that temporarily stores billions of user records, enabling large‑scale algorithmic push, AB testing, and dynamic horizontal scaling while ensuring even data distribution and low‑latency processing.

Big Data · HBase · Reach Strategy
0 likes · 9 min read
DataFunSummit
Aug 17, 2022 · Big Data

Data Governance Practices and Frameworks: Insights from Alibaba

This article presents an overview of data governance concepts, common enterprise challenges, and Alibaba's comprehensive data governance framework, covering theory, demand layers, practical solutions for stability, quality, standards, security, cost control, and the supporting platforms and operational practices.

Alibaba · Big Data · Data Security
0 likes · 13 min read
Python Programming Learning Circle
Aug 17, 2022 · Big Data

Game Industry User Data Analysis: Registration Distribution, Payment Metrics, and Consumption Patterns

This article presents a comprehensive Python-based analysis of a large game dataset (2.29 million records, 109 fields), covering user registration trends, payment rates, ARPU/ARPPU calculations, level‑based spending behavior, and consumption patterns of resources and acceleration items, with visualizations and actionable conclusions.

Big Data · Game Analytics · Python
0 likes · 11 min read
Volcano Engine Developer Services
Aug 15, 2022 · Big Data

How ByteDance Scales Event Tracking: Inside a Billion‑Events‑Per‑Second Data Pipeline

This article explains how ByteDance’s event‑tracking (埋点) data flow handles billions of events per second using Flink‑based real‑time ETL, dynamic rule engines, data sharding, and multi‑datacenter disaster‑recovery to ensure stability, low latency, and cost‑effective processing for diverse downstream services.

Big Data · Flink · Scalability
0 likes · 16 min read
Baidu Intelligent Cloud Tech Hub
Aug 15, 2022 · Cloud Computing

How Baidu’s Canghai Storage Tackles Massive Data Challenges in the Cloud

This article outlines the four major storage challenges of the ABC era—massive scale, cost efficiency, stability, and diversity—and explains how Baidu’s Canghai storage suite, including BOS, CDS, CFS, PFS, RapidFS, CloudFlow, and storage gateways, addresses each through multi‑cloud migration, tiered lifecycle management, and robust disaster‑recovery solutions.

AI · Big Data · Data Migration
0 likes · 15 min read
High Availability Architecture
Aug 15, 2022 · Big Data

Comprehensive Guide to Event Tracking Governance and the One‑Stop Tracking Management Platform

This article explains why event‑tracking (埋点) governance is essential, outlines the methodology and practice of full‑link tracking management, and introduces the one‑stop tracking platform with its innovative features such as standardized processes, verification tools, real‑time dashboards, cross‑platform data unification, and future roadmap.

Analytics · Big Data · Platform
0 likes · 15 min read
Past Memory Big Data
Aug 15, 2022 · Big Data

How Pinterest Scaled a Hadoop Upgrade Across 17k Nodes

Pinterest’s Monarch batch‑processing platform, built on over 17 k YARN nodes in AWS, was upgraded from Hadoop 2.7.1 to 2.10.0 using a phased, cluster‑by‑cluster strategy that balanced minimal downtime, extensive validation, and custom patches to handle compatibility and dependency issues.

AWS EC2 · Big Data · Cluster Upgrade
0 likes · 18 min read
DataFunTalk
Aug 13, 2022 · Big Data

Data Governance Practices and Logical Closed‑Loop at KuaiKan

The talk outlines KuaiKan's data governance journey, describing the rapid business growth challenges, the three‑step logical closed‑loop framework, practical experiences in business scope management, data asset governance, collaboration techniques, and future outlook, highlighting evaluation metrics and ongoing improvements.

Big Data · Data Quality · data governance
0 likes · 16 min read
ITPUB
Aug 13, 2022 · Big Data

How Alibaba Uses Flink to Power Massive Real‑Time Risk Control

This article explains how Alibaba leverages Flink to handle over 40 billion events per second across all business units, detailing risk‑control concepts, rule types, architectural stages, resource tuning, dynamic CEP, shared computing, and the FY23 roadmap for large‑scale streaming risk management.

Alibaba · Big Data · CEP
0 likes · 16 min read
Python Programming Learning Circle
Aug 13, 2022 · Big Data

Parallel Processing of Large CSV Files in Python Using multiprocessing, joblib, and tqdm

This tutorial demonstrates how to accelerate processing of a multi‑million‑row CSV dataset by splitting the work into sub‑tasks and applying Python's multiprocessing, joblib, and tqdm libraries for serial, parallel, and batch processing, showing significant speed‑ups and best‑practice code snippets.

Big Data · Data cleaning · Python
0 likes · 10 min read
DataFunTalk
Aug 11, 2022 · Databases

Fundamentals of Knowledge Graphs, Graph Databases, and Their Applications in AI and Big Data

This article introduces the basic concepts of knowledge graphs, explores their research dimensions across knowledge engineering, natural language processing, databases and machine learning, discusses graph database storage models and their integration with artificial intelligence and big data, and presents related projects and real‑world case studies.

Big Data · Graph Database · Knowledge Graph
0 likes · 13 min read
DataFunSummit
Aug 10, 2022 · Artificial Intelligence

Leveraging Cross-Industry Data and Quantum-Inspired Feature Engineering for SME Supply Chain Finance

This article presents Huace Data Science's practical approaches to digital supply‑chain finance for SMEs, detailing challenges of cross‑industry data, the SME engine for authentic business assessment, graph‑based fraud detection, and quantum‑inspired feature‑engineering methods that enhance credit‑risk models.

Big Data · Feature Engineering · Quantum-Inspired Algorithms
0 likes · 15 min read
Baidu Geek Talk
Aug 9, 2022 · Big Data

How to Build a Real-Time Data Warehouse with Unified Stream‑Batch Architecture

This article examines the evolution of big‑data architectures, identifies the latency and maintenance issues of classic Lambda designs, and presents a hybrid Lambda‑Kappa solution that unifies streaming and batch processing to achieve minute‑level data freshness and second‑level query latency while reducing development cost.

Big Data · Kappa architecture · Lambda architecture
0 likes · 13 min read
DataFunTalk
Aug 9, 2022 · Databases

Graph Database Storage Technologies and Practices: Concepts, Core Goals, Technical Solutions, and Galaxybase Case Study

This article introduces graph database fundamentals, explains why graph databases are needed, outlines core storage goals such as index‑free adjacency, compares array, linked‑list and LSM‑tree storage schemes, and presents the design, performance advantages, and real‑world applications of the Galaxybase distributed graph database.

Big Data · Distributed Systems · Galaxybase
0 likes · 20 min read
Past Memory Big Data
Aug 9, 2022 · Big Data

Master the Complete Big Data Ecosystem in One Article

This article provides a comprehensive overview of the big data ecosystem, detailing nine core technology categories—from data collection and storage to computation, analysis, scheduling, and underlying infrastructure—along with tool comparisons and selection guidelines, to help readers quickly build a complete big data knowledge system.

Big Data · Data Analysis · Data Collection
0 likes · 12 min read
Alibaba Cloud Big Data AI Platform
Aug 9, 2022 · Big Data

Unlocking MaxCompute: How Alibaba’s Big Data Platform Secures Your Data

This article provides a comprehensive overview of Alibaba Cloud MaxCompute, covering its product features, architecture, ecosystem integrations, and in‑depth data security mechanisms such as authentication, RAM roles, access control policies, label‑based security, project protection, audit logging, encryption, backup, disaster recovery, and the complementary DataWorks security capabilities.

Big Data · Data Security · MaxCompute
0 likes · 31 min read
IT Services Circle
Aug 7, 2022 · Artificial Intelligence

How Smart Pens and AI Surveillance Are Monitoring Students' Homework

The article examines the rise of smart pens, point‑matrix technology, and other AI‑driven monitoring tools in Chinese schools, detailing how they record handwriting, emotions, screen activity, and even biometric data, while raising privacy concerns and highlighting the massive market for educational surveillance.

AI surveillance · Big Data · Education Technology
0 likes · 9 min read
Snowball Engineer Team
Aug 5, 2022 · Big Data

Snowball Data Warehouse Modeling and OneData System Implementation

This article outlines Snowball's data warehouse background, compares major modeling approaches such as ER, dimensional, DataVault and Anchor models, describes the current challenges of their dimensional model, and details the OneData methodology—including OneModel, OneID, and OneService—along with its practical implementation, results, and future plans.

Big Data · Data Warehouse · ETL
0 likes · 23 min read
High Availability Architecture
Aug 5, 2022 · Big Data

Innovative Marketing Practices on the Cloud: How an Intelligent Data Lake Enables Flexible and Efficient Marketing Capabilities

The presentation details how Amazon Web Services’ intelligent data lake architecture integrates big data and machine learning to overcome marketing challenges, improve data governance, and provide scalable, real‑time analytics for personalized, data‑driven marketing across enterprises.

Big Data · Cloud Computing · Data Lake
0 likes · 13 min read
Alibaba Cloud Big Data AI Platform
Aug 5, 2022 · Big Data

Why Alibaba Cloud Dominates China’s Big Data Public Cloud Market in 2021

A recent IDC report reveals that Alibaba Cloud captured 14.9 billion yuan in revenue, securing the top spot in China’s big data platform public‑cloud market in 2021, driven by rapid 53.8 % growth and emerging technologies such as real‑time data warehouses, lake‑house integration, streaming‑batch convergence, and AI‑enabled analytics.

Alibaba Cloud · Big Data · IDC
0 likes · 4 min read
Big Data Technology & Architecture
Aug 4, 2022 · Big Data

Comprehensive Guide to DataX: Introduction, Architecture, Usage, and Deployment

This article provides a detailed overview of DataX, covering its purpose, framework design, core architecture, scheduling process, practical examples of MySQL-to-MySQL synchronization, step‑by‑step installation and configuration of DataX‑WEB, UI usage, routing strategies, task types, and advanced task building techniques.

Big Data · Data Integration · DataX
0 likes · 14 min read
IT Architects Alliance
Aug 3, 2022 · Big Data

Understanding Kafka Architecture: Topics, Partitions, Replication, Log Segmentation, Zero‑Copy, and Zookeeper Integration

This article explains Kafka's core concepts—including topics, partitions and replicas, log segment storage, leader‑follower mechanics, consumer groups, network threading model, zero‑copy I/O, and the essential role of Zookeeper for broker, topic, consumer, and offset management—providing a comprehensive overview for developers and architects.

Big Data · Kafka · Streaming
0 likes · 10 min read

Understanding Spark Streaming Checkpoint Mechanism for Real‑Time Feature Computation

The article explains how Spark Streaming's checkpoint mechanism works, detailing the four-step process—from setting the checkpoint directory to writing RDD data and finalizing the checkpoint—highlighting its role in ensuring fault‑tolerant, fast recovery for real‑time recommendation feature pipelines.

Big Data · Checkpoint · Real-time Processing
0 likes · 7 min read
DataFunSummit
Aug 2, 2022 · Big Data

Tencent PCG Real‑Time Data Warehouse and Operations Architecture Overview

This article presents Tencent's PCG data platform evolution, detailing the challenges of integrating multiple business groups, the design of a unified big‑data architecture, real‑time and batch processing pipelines, MQ and ATTA systems, and comprehensive operational practices for reliability and scalability.

ATTA · Big Data · MQ
0 likes · 17 min read
Open Source Linux
Aug 2, 2022 · Cloud Computing

How China Telecom Is Building the Nation’s First “National Cloud” and Its Global Impact

China Telecom is creating a state‑backed “national cloud” by partnering with multiple central‑enterprise investors, consolidating resources, accelerating indigenous cloud technology, and setting ambitious infrastructure targets, while similar initiatives emerge worldwide in the US, Russia, India, France and Italy.

Big Data · China Telecom · Cloud Computing
0 likes · 7 min read
ITPUB
Aug 1, 2022 · Big Data

How Bilibili Scaled Offline Computing: Migrating from Hive to Spark and Boosting Performance

This article details Bilibili's evolution from a Hadoop‑based offline platform to a Spark‑driven architecture, covering the Hive‑to‑Spark migration, automated SQL conversion, result validation, stability enhancements, performance tuning, meta‑store federation, and future directions for large‑scale data processing.

Big Data · Data Skipping · Hive
0 likes · 31 min read
Baidu Geek Talk
Aug 1, 2022 · Artificial Intelligence

Sugar BI: AI-Powered Business Intelligence Platform Architecture and Intelligent Visualization

Sugar BI, Baidu Cloud’s AI‑powered business intelligence platform, lets users create professional, zero‑code dashboards in minutes by connecting to 30+ data sources, leveraging Apache ECharts, intelligent chart recommendation, and natural‑language voice interaction to deliver automated analysis, visualization, and predictive insights.

AI-Powered Analytics · Big Data · Business Intelligence
0 likes · 15 min read
Architecture Digest
Aug 1, 2022 · Big Data

Understanding Data Lakes: Concepts, Features, Architectures, and Vendor Solutions

This article provides a comprehensive overview of data lakes, explaining their definition, key characteristics, architectural evolution, and detailed comparisons of major cloud providers' solutions, while also presenting typical use cases, construction processes, and future development directions for this emerging big‑data infrastructure.

Alibaba Cloud · Azure · Big Data
0 likes · 52 min read
DataFunTalk
Jul 31, 2022 · Big Data

Design, Evolution, and Optimization of NetEase's Log Collection and Transmission Service (Datastream‑NG)

This article presents a comprehensive overview of NetEase's log collection and transmission platform, detailing its evolution from 2011 to the current Datastream‑NG architecture, the system's design goals, core component optimizations, operational monitoring, and future plans for intelligent scaling and diagnostics.

Big Data · Data Streaming · Distributed Systems
0 likes · 23 min read
Baidu Intelligent Cloud Tech Hub
Jul 28, 2022 · Big Data

How Baidu Cloud Accelerates Data Lakes with Compute‑Storage Separation

This article explains Baidu Intelligent Cloud’s data lake acceleration solution, covering the evolution of big‑data technologies, the benefits and challenges of compute‑storage separation, the architecture of BOS object storage, and the native hierarchical namespace and RapidFS cache mechanisms that boost performance and reduce costs.

BOS · Big Data · Compute-Storage Separation
0 likes · 18 min read
SQB Blog
Jul 28, 2022 · Frontend Development

How AntV Powers Data Visualization: From Charts to Graph Analysis

This article explores data visualization fundamentals, compares scientific, information, and analytical visualization, reviews popular frontend libraries like ECharts and AntV/G2, showcases real-world case studies, and details technical choices for building interactive charts and graph‑based analytics in modern applications.

AntV · Big Data · Frontend Development
0 likes · 13 min read
Big Data Technology Architecture
Jul 28, 2022 · Big Data

Reflections on Data Governance Challenges and Approaches

The author shares a candid account of transitioning from a non‑data role to confronting data‑centric bottlenecks, describing the current state of data projects, common pitfalls, and practical thoughts on simplifying data governance within limited resources and budget constraints.

Big Data · DAMA · Data Management
0 likes · 7 min read
DataFunTalk
Jul 27, 2022 · Big Data

Building a Big Data Platform at FenbeiTong: Architecture, Practices, and Lessons Learned

This article shares FenbeiTong's experience in building a big data platform, covering company background, data construction challenges, technology selection, architecture design, implementation details, data modeling tools, and real-world application scenarios such as CDP and CEM, offering practical insights for similar enterprises.

AI · Big Data · Cloud Computing
0 likes · 19 min read
Laravel Tech Community
Jul 26, 2022 · Big Data

Red Hat 2019 Enterprise Open Source Survey: Overview of Popular Open Source Projects Across Web Servers, Big Data, Cloud, Storage, Operating Systems, Databases, and Development Tools

The Red Hat 2019 Enterprise Open Source Survey summarizes the most widely adopted open‑source projects in enterprises, covering web servers, big‑data frameworks, cloud platforms, distributed storage, operating systems, databases, development tools, and middleware, and highlights their strategic importance for modern IT infrastructure.

Big Data · Cloud Computing · Databases
0 likes · 18 min read
DataFunTalk
Jul 26, 2022 · Big Data

Feature Platform Architecture and Stream‑Batch Integrated Solutions

This talk presents Shuhe Technology’s feature platform, detailing its four‑layer architecture, feature storage services, stream‑batch integrated processing, event‑center design, consistency models, and four model‑strategy invocation schemes, illustrating data flows from MySQL through Sqoop, Kafka, Flink, HBase and ClickHouse.

Big Data · ClickHouse · Flink
0 likes · 17 min read
Alibaba Cloud Big Data AI Platform
Jul 26, 2022 · Big Data

How Alibaba’s Big Data Model Governance Boosted Efficiency and Cut Costs

This report details Alibaba’s large‑scale data model governance initiative for the DaTao ecosystem, analyzing current data issues such as naming inconsistencies, low reuse, and application‑layer inefficiencies, and presents a comprehensive solution—including a model evaluation system, DataWorks co‑development, intelligent modeling, data map enhancements, and future roadmap—to improve data health, reduce costs, and increase operational efficiency.

Big Data · DataWorks · data governance
0 likes · 15 min read
JavaEdge
Jul 25, 2022 · Big Data

Choosing Between Lambda and Kappa: Real‑Time Data Warehouse Strategies

The article uses an acorn‑moving analogy to highlight latency and traceability challenges in enterprise data warehouses, then explains offline versus real‑time approaches, compares Lambda and Kappa architectures, discusses Iceberg integration, and shares a detailed e‑commerce real‑time warehouse case study with optimization tips.

Big Data · Flink · Iceberg
0 likes · 15 min read
DataFunTalk
Jul 25, 2022 · Big Data

Taobao Data Model Governance and Intelligent Modeling with DataWorks

This article summarizes Guo Jinshi's presentation on Taobao's data model governance, covering the current data landscape, identified problems, analysis of root causes, proposed governance solutions—including DataWorks intelligent modeling—and future plans, while also providing a Q&A session on practical implementation.

Alibaba · Big Data · DataWorks
0 likes · 13 min read

Probability Algorithms in Big Data: BloomFilter and Count-min Sketch Applications

The article explains how space‑efficient probabilistic structures such as BloomFilter and Count‑min Sketch enable large‑scale data deduplication, join pruning, real‑time idempotent filtering, and approximate top‑K analytics by trading modest accuracy loss for dramatically reduced storage and faster computation.

Big Data · BloomFilter · Count-Min Sketch
0 likes · 12 min read
ITPUB
Jul 24, 2022 · Databases

How Apache Doris Enables Real‑Time Queries on Hudi Data Lakes

This article explains Apache Doris’s architecture, introduces the Hudi data‑lake format, compares Lambda and Kappa approaches, and details the design and implementation of Doris’s Hudi external table support, including practical steps, code examples, and future roadmap.

Apache Doris · Big Data · Data Lake
0 likes · 10 min read
DataFunTalk
Jul 24, 2022 · Big Data

Real-time Data Warehouse Empowering Fine-grained Intelligent Operations in Finance – A Practical Case Study

This talk by Shi Xingtian, senior data director at Zhongan Insurance, outlines the company’s digital transformation, detailing the 4633 framework, the real-time data warehouse architecture, the migration from ClickHouse to StarRocks, and how these technologies support fine‑grained, intelligent financial operations and advertising analytics.

Big Data · StarRocks · Zhongan Insurance
0 likes · 14 min read
DataFunTalk
Jul 23, 2022 · Artificial Intelligence

Graph Algorithm Deployment and Practices on the DataFun Security Spark Cluster

This article presents a comprehensive overview of deploying and running graph learning algorithms—both inductive and transductive—on the secure Spark cluster, covering framework choices, data sampling strategies, distributed training techniques, model evaluation metrics, and future directions.

Big Data · Spark · distributed training
0 likes · 13 min read
Bilibili Tech
Jul 23, 2022 · Backend Development

API Gateway Evolution and Engineering Practices; Applying ClickHouse for Massive Data Processing

The talk traces the evolution of API Gateway architectures and the engineering practices—design patterns, deployment strategies, and operational considerations—required for scalable, reliable services, then demonstrates how ClickHouse can be leveraged for massive data workloads, highlighting practical scenarios, performance optimizations, and key lessons learned.

Backend Development · Big Data · ClickHouse
0 likes · 1 min read
ITPUB
Jul 22, 2022 · Big Data

From Client‑Side to Server‑Side: How NetEase Built StreamflySQL on Flink SQL

This article chronicles NetEase Games' evolution of its real‑time StreamflySQL platform, detailing the transition from a client‑side Flink SQL implementation to a server‑side architecture powered by SQL Gateway, and discusses the motivations, design choices, challenges, and performance improvements achieved.

Big Data · Flink · SQL Gateway
0 likes · 19 min read
StarRocks
Jul 22, 2022 · Big Data

How 37 Mobile Games Boosted Analytics with StarRocks: A Real‑World Performance Case Study

37 Mobile Games, a leading mobile game publisher, migrated its user‑profile analytics from a Hadoop‑Hudi‑Kafka‑Hive‑Flink stack to StarRocks, achieving sub‑second query latency on billion‑row tables, simplifying operations, reducing storage costs, and enabling real‑time data sync, as detailed in this technical case study.

Big Data · OLAP · Performance Optimization
0 likes · 12 min read
DataFunTalk
Jul 21, 2022 · Big Data

Large-Scale Offline‑Online Mixed Deployment at Huya: Architecture, Challenges, and Solutions

This article describes Huya's large‑scale offline‑online mixed deployment, detailing the low resource‑utilization problems, the time‑sharing and elastic scheduling solutions, the containerized architecture, multi‑datacenter isolation, heterogeneous resource handling, stability safeguards, and the resulting performance improvements and future directions.

Big Data · Containerization · Huya
0 likes · 13 min read
政采云技术
Jul 21, 2022 · Fundamentals

Insights and Principles for Designing Data Visualization Dashboards

This article shares practical experiences and foundational concepts for creating data‑visualization dashboards, covering screen types, design principles, characteristics, audience analysis, and the broader role of visualization in turning massive data into actionable insights while enhancing human cognition.

Big Data · Data visualization · dashboard design
0 likes · 3 min read
JD Retail Technology
Jul 19, 2022 · Backend Development

Design and Architecture of JD Retail Product Selection Platform

This article details the design and implementation of JD Retail’s product selection platform, covering its business background, core data retrieval capabilities, domain model, system architecture—including frontend configurability, backend query engine, ClickHouse indexing, and both offline and real-time data processing pipelines.

Big Data · System architecture · data indexing
0 likes · 14 min read
ByteDance Data Platform
Jul 18, 2022 · Big Data

Unlocking Real‑Time Data Quality: ByteDance’s Dynamic Exploration Solution

This article explains how ByteDance’s dynamic data exploration tool improves data quality assurance by replacing time‑consuming SQL validation with real‑time, sample‑based profiling, detailing its problem background, core features, technical architecture, front‑end rendering techniques, operation‑stack management, and future enhancements.

Big Data · SQL Generation · data exploration
0 likes · 13 min read
DataFunSummit
Jul 17, 2022 · Big Data

Elasticsearch and Big Data: Architecture, Use Cases, and Advantages

This article explains what Elasticsearch is, how it solves database acceleration, log observability, and data analysis problems, details its core components and underlying engine features, compares its strengths and weaknesses, and presents classic application scenarios and a real‑world case study integrating Elasticsearch with Flink for large‑scale log analytics.

Big Data · Elasticsearch · Flink
0 likes · 13 min read
DataFunTalk
Jul 16, 2022 · Big Data

Deep Dive into Apache Hudi 0.11.0: Multi‑Level Index, Spark SQL Enhancements, Flink Integration, and Other Improvements

The article provides an in‑depth overview of Apache Hudi 0.11.0, covering its new multi‑level index design, Spark SQL enhancements, Flink integration improvements, and additional performance and usability features aimed at boosting read/write efficiency in large‑scale data lake environments.

Apache Hudi · Big Data · Data Lake
0 likes · 15 min read
DataFunSummit
Jul 15, 2022 · Big Data

Apache DolphinScheduler Practice at Xinwang Bank

Xinwang Bank leverages Apache DolphinScheduler to handle over 9,000 daily task instances across real‑time, near‑real‑time, and offline batch scenarios, detailing background, application scenarios, optimizations, workflow improvements, import/export enhancements, alert system upgrades, and future plans to expand data‑ops capabilities.

Apache DolphinScheduler · Big Data · DataOps
0 likes · 13 min read
IT Architects Alliance
Jul 14, 2022 · Big Data

Elasticsearch Overview: Core Concepts, Architecture, and Practical Usage

This article provides a comprehensive introduction to Elasticsearch, covering data types, Lucene fundamentals, cluster architecture, node roles, shard and replica mechanisms, mapping, installation, health monitoring, indexing principles, storage strategies, refresh and translog handling, segment merging, performance tuning, and JVM optimization for large‑scale search applications.

Big Data · Elasticsearch · Performance Optimization
0 likes · 35 min read
GuanYuan Data Tech Team
Jul 14, 2022 · Big Data

How to Train Massive GBDT Models on Spark: A Complete Step‑by‑Step Guide

This article walks through using Apache Spark for large‑scale GBDT training, covering the challenges of massive data, Spark deployment, PySpark code examples, differences from Pandas, feature engineering, mmlspark installation, early‑stopping tricks, performance bottlenecks, and a systematic evaluation of alternative frameworks.

Big Data · GBDT · Performance Optimization
0 likes · 38 min read
Top Architect
Jul 14, 2022 · Big Data

A Comprehensive Introduction to Elasticsearch: Architecture, Core Concepts, and Practical Usage

This article provides a detailed overview of Elasticsearch, covering its data model, Lucene foundation, cluster architecture, shard and replica mechanisms, index mapping, installation steps, health monitoring, write and storage processes, segment management, and performance tuning techniques for large‑scale search applications.

Big Data · Elasticsearch · Performance tuning
0 likes · 35 min read
Programmer DD
Jul 14, 2022 · Big Data

Master Fast Data Synchronization with Alibaba DataX: A Step‑by‑Step Guide

This article explains why traditional mysqldump and file‑based methods struggle with massive tables, introduces Alibaba DataX as a high‑performance offline data integration tool, details its architecture, and provides comprehensive installation and configuration steps for full and incremental MySQL‑to‑MySQL synchronization using JSON job files.

Big Data · DataX · ETL
0 likes · 15 min read
Sohu Tech Products
Jul 13, 2022 · Fundamentals

Digital Economy and Digital Transformation: Trends, Strategies, and Enabling Technologies

The article outlines how the COVID‑19‑driven shift to remote work accelerated digitalization, describes the rapid growth of the digital economy, explains the two‑step process of industry digitization and digital industrialization, and highlights the strategic role of AI, cloud computing, big data, 5G and digital twins in reshaping enterprises across sectors.

5G · Artificial Intelligence · Big Data
0 likes · 15 min read
dbaplus Community
Jul 13, 2022 · Big Data

Unpacking the Core Technologies Behind Modern Big Data Platforms

From data ingestion to real‑time analytics, this guide breaks down the essential layers of a typical big‑data platform—covering collection methods, HDFS storage, Hive/Spark analysis, data sharing mechanisms, application use‑cases, streaming with Spark Streaming, and the need for robust scheduling and monitoring.

Big Data · Data Integration · Data Warehouse
0 likes · 9 min read
Alibaba Cloud Native
Jul 12, 2022 · Big Data

How to Troubleshoot Kafka Message Loss with the Managed Retrieval Component

This article explains common Kafka message‑loss and duplicate‑consumption issues, introduces Alibaba Cloud's fully managed Kafka Retrieval Component, and provides step‑by‑step guidance—including enabling the service, using Tablestore for multi‑index and SQL searches—to help engineers quickly locate and verify missing or duplicated messages.

Big Data · Kafka · Message Retrieval
0 likes · 7 min read
Big Data Technology & Architecture
Jul 12, 2022 · Big Data

Analyzing Spark's Iceberg Data Reading Process and Small‑File Merging

This article explains how Spark reads data from Apache Iceberg tables by parsing snapshots and manifest files into DataFile objects, creates Batch and InputPartition objects, uses readers to materialize InternalRows, and then demonstrates how Iceberg's RewriteDataFilesAction can merge tiny Parquet files into larger ones through Spark‑driven tasks.

Big Data · Data Lake · Iceberg
0 likes · 17 min read
DataFunTalk
Jul 11, 2022 · Big Data

Predictive Maintenance (PdM): Value, Technical Roadmaps, Time‑Series Database Selection, and Real‑World Cases

This article explores the value and evolution of predictive maintenance (PdM), outlines common technical approaches—including signal processing, mechanism + big‑data, digital twin, and AI—examines time‑series database choices such as MatrixDB, presents case studies and practical insights, and concludes with reflections on industrial digital transformation.

Big Data · Digital Twin · Industrial IoT
0 likes · 15 min read
DataFunTalk
Jul 10, 2022 · Big Data

Serverless Technologies Empowering Big Data Analytics: An Overview of Amazon EMR Serverless

This article presents a comprehensive overview of how Amazon EMR Serverless leverages serverless technology to simplify, scale, and cost‑optimize big data analytics, covering the evolution of serverless services, the intelligent lakehouse architecture, core concepts, key benefits, common use cases, and available documentation.

Amazon EMR · Analytics · Big Data
0 likes · 17 min read
DataFunTalk
Jul 8, 2022 · Information Security

DataFun 2022 Summit on Privacy Computing and Data Security

DataFun's 2022 summit brings together leading experts from academia and industry to discuss privacy computing, federated learning, secure data sharing, and their applications across finance, healthcare, telecom, and blockchain, offering insights into technologies, standards, and real-world implementations that enable data utility while protecting privacy.

Big Data · Data Security · Privacy Computing
0 likes · 43 min read
Ctrip Technology
Jul 7, 2022 · Big Data

Design and Implementation of a Unified Data Service Platform for Reducing Development Cost and Enhancing Efficiency

The article describes how Ctrip built a unified data service platform that standardizes API development, leverages multiple storage engines, introduces token‑based security, Sentinel rate‑limiting, caching, and automatic contract generation to dramatically cut development cycles and improve reliability for big‑data workloads.

API · Big Data · Caching
0 likes · 10 min read
Hulu Beijing
Jul 7, 2022 · Big Data

How Hulu Upgraded Hadoop 2.6 to 3.0: Lessons in Compatibility and Migration

This article details Hulu's five‑year journey from Hadoop 2.6 to 3.3.2, covering major feature evolutions, the original cluster architecture, a comprehensive upgrade plan, compatibility challenges across HDFS, YARN, Hive, Spark and Flink, and the testing and rollout strategies that ensured a smooth migration.

Big Data · Cluster Upgrade · Flink
0 likes · 17 min read