Tagged articles
3697 articles
Big Data Technology & Architecture
Jan 2, 2020 · Big Data

Structured Streaming: Design, Challenges, Programming Model, and Performance Evaluation

This article provides a comprehensive overview of Apache Spark Structured Streaming, describing its declarative API, the challenges of stream processing, the programming model with code examples, query planning, execution modes, production use cases, and performance benchmarks compared with other streaming systems.

Big Data · Spark · Streaming
0 likes · 42 min read
DataFunTalk
Jan 2, 2020 · Big Data

ByteDance’s HDFS Architecture and Evolution: Design, Challenges, and Optimizations

This article presents an in‑depth overview of ByteDance’s large‑scale HDFS deployment, describing its unique access layer, metadata and data layers, the evolution through multiple growth stages, and the key architectural improvements such as NNProxy, DanceNN, lock redesign, startup acceleration, and slow‑node mitigation techniques.

Big Data · ByteDance · Federation
0 likes · 18 min read
Tongcheng Travel Technology Center
Dec 31, 2019 · Big Data

Apache Kylin Overview and Model Optimization Practices for Trajectory Analytics

This article introduces Apache Kylin, details its deployment at Tongcheng Yilong, explains the design of a large‑scale trajectory model, and provides step‑by‑step optimization techniques (cube dimension reduction, HBase rowkey tuning, build parameter tweaks, high‑cardinality handling, and disabling query compression) to achieve sub‑second OLAP queries on multi‑terabyte data.

Apache Kylin · Big Data · Cube
0 likes · 17 min read
DataFunTalk
Dec 30, 2019 · Databases

Cassandra: Past, Present, and Future – History, Architecture, Features, and Use Cases

This article summarizes a Cassandra meetup presentation that traces the database's origins from BigTable and Dynamo, outlines its key milestones, explains its peer‑to‑peer and LSM architecture, highlights current features, real‑world deployments, performance advantages, and previews upcoming 4.0 releases and community projects.

Big Data · Gossip Protocol · LSM
0 likes · 14 min read
Java High-Performance Architecture
Dec 29, 2019 · Fundamentals

Which Technologies Will Dominate Software Development in 2020? A Trend Forecast

This article forecasts the 2020 software development landscape, highlighting the rise of cloud adoption, Kubernetes, micro‑services, Python, Java, emerging languages like Rust and Kotlin, JavaScript frameworks, API standards, SQL dominance, big‑data engines Spark and Flink, and the growing impact of WebAssembly.

Big Data · Cloud Computing · microservices
0 likes · 9 min read
Efficient Ops
Dec 28, 2019 · Operations

What the 2019 IT Operations Whitepaper Reveals About Enterprise Ops Trends

The 2019 Enterprise IT Operations Whitepaper, released at the national Operations Conference, systematically examines the definition, value, key capabilities, industry applications, challenges, and future trends of IT operations across telecom, finance, Internet, and manufacturing sectors.

Artificial Intelligence · Big Data · IT Operations
0 likes · 6 min read
ITPUB
Dec 27, 2019 · Big Data

How Facebook Scaled Entity Ranking from Hive to Spark: Lessons and Performance Gains

Facebook replaced a multi‑stage Hive pipeline for real‑time entity ranking with a single Spark job, applying extensive reliability fixes and performance tweaks that reduced CPU usage by up to six times, cut latency fivefold, and demonstrated the feasibility of shuffling over 90 TB of data in production.

Big Data · Hive · Performance Optimization
0 likes · 16 min read
21CTO
Dec 26, 2019 · Artificial Intelligence

Will AI and Machine Learning Redefine Software Testing in 2020?

The article outlines five major 2020 software testing trends—including the surge of AI/ML, digital transformation, cloud and IoT adoption, the shift from performance testing to performance engineering, and the growing importance of big‑data testing—highlighting their impact on quality assurance practices.

AI · Big Data · Cloud Computing
0 likes · 7 min read
Big Data Technology & Architecture
Dec 25, 2019 · Big Data

Understanding Flink StreamPartitioner and Its Implementations

Flink’s StreamPartitioner abstracts data routing in DataStream and offers eight built‑in partitioners (Global, Shuffle, Rebalance, KeyGroup, Broadcast, Rescale, Forward, and Custom), each with distinct channel‑selection logic. The article illustrates them with source code snippets and explains their runtime behavior.

Big Data · DataStream · Flink
0 likes · 8 min read
DataFunTalk
Dec 24, 2019 · Big Data

Deep Dive into PySpark Implementation: Multi‑Process Architecture, Java Integration, RDD/SQL Interfaces, Executor Communication, and Pandas UDF

This article explains PySpark's multi‑process architecture, how the Python driver uses Py4J to call Java/Scala APIs, the implementation of RDD and DataFrame interfaces, executor‑side process communication and serialization with Arrow, and the design of Pandas UDFs, while also discussing current limitations and future directions.

Arrow · Big Data · Distributed computing
0 likes · 13 min read
dbaplus Community
Dec 23, 2019 · Databases

How to Deploy, Scale, and Monitor ClickHouse for High‑Performance Big Data Analytics

This article explains ClickHouse's deployment architecture, read‑write separation, shard expansion steps, write‑batch strategies, a three‑layer monitoring model, and its practical application in Tencent's game analytics platform, offering concrete guidance for building a stable, high‑throughput analytics service.

Big Data · Database · Game Analytics
0 likes · 21 min read
DataFunTalk
Dec 23, 2019 · Databases

Cassandra Deployment and Optimization at 360 Cloud Storage

This article details how 360 adopted Cassandra for its cloud drive, describing Cassandra’s decentralized architecture, the reasons for its selection over HBase, large‑scale deployment challenges, performance optimizations, reliability improvements, disk utilization techniques, and the evolution of the system from 2010 to present.

Big Data · Data Reliability · Scalability
0 likes · 15 min read
Big Data Technology & Architecture
Dec 22, 2019 · Big Data

Dynamic Resource Allocation in Spark Streaming: Problems, Mechanisms, and Practical Guidelines

The article explains Spark's default static resource allocation, analyzes the limitations of its Dynamic Resource Allocation (DRA) for streaming workloads, describes the internal Spark components and code paths involved, and proposes concrete design and configuration recommendations for implementing more responsive executor scaling.

Big Data · Dynamic Resource Allocation · Executor Management
0 likes · 11 min read
Big Data Technology & Architecture
Dec 21, 2019 · Big Data

Kafka Offset Management and Replication Mechanisms Explained

This article provides a comprehensive technical overview of Kafka's offset handling, covering the request entry point, in‑memory offset sources, offset commit and fetch implementations, file storage layout, and the leader‑follower synchronization process that ensures data replication and high‑watermark updates.

Big Data · Distributed Systems · High Watermark
0 likes · 16 min read
macrozheng
Dec 20, 2019 · Big Data

How to Supercharge Elasticsearch for Billion‑Row Queries: Practical Optimization Guide

This article explains the architecture of Elasticsearch and Lucene, outlines common performance bottlenecks, and provides concrete indexing and search optimization techniques—including bulk writes, shard routing, doc values tuning, and pagination strategies—to achieve sub‑second query responses on billions of records.

Big Data · Elasticsearch · Performance tuning
0 likes · 14 min read
Qunar Tech Salon
Dec 20, 2019 · Big Data

Understanding Flink Cluster Startup and Job Execution Process

This article explains the architecture of a Flink cluster, detailing the startup procedures for JobManager and TaskManager, the three deployment modes, and the end‑to‑end flow of a Flink job from client code through StreamGraph, JobGraph, ExecutionGraph to the physical execution on TaskManagers.

Big Data · Cluster Architecture · Flink
0 likes · 10 min read
Big Data Technology & Architecture
Dec 19, 2019 · Big Data

Apache Kafka 2.4.0 Release: New Features and Improvements

Apache Kafka 2.4.0 introduces a range of new capabilities—including consumer replica fetching, incremental cooperative rebalancing, MirrorMaker 2.0, a new Java authorization API, KTable non‑key joins, administrative replica reassignment, protected REST endpoints, and offset deletion—along with numerous performance and stability improvements.

Apache Kafka · Big Data · Distributed Systems
0 likes · 3 min read
vivo Internet Technology
Dec 18, 2019 · Big Data

Comprehensive Overview of Big Data Architecture, Lambda/Kappa Models, and End-to-End Data Platform Design

The article surveys modern big‑data architecture, contrasting Lambda and Kappa models, highlights common governance and integration pain points, and proposes an end‑to‑end platform featuring unified metadata, stream‑batch processing, one‑click ingestion, standardized modeling, intelligent query abstraction, and a comprehensive development IDE.

Big Data · Data Platform · ETL
0 likes · 13 min read
Big Data Technology & Architecture
Dec 17, 2019 · Big Data

Understanding Flink Sliding Windows and Performance Optimizations

This article explains Flink's sliding window mechanism, shows how the WindowAssigner and WindowOperator work with code examples, analyzes the performance impact of fine‑grained sliding windows, and proposes a practical workaround using tumbling windows combined with external storage such as Redis for efficient PV/UV aggregation.

Big Data · Flink · Performance Optimization
0 likes · 8 min read
DataFunTalk
Dec 13, 2019 · Databases

Lindorm: High‑Performance Distributed NoSQL Database for Big Data

Lindorm, an Alibaba‑derived distributed NoSQL database built on HBase, delivers multi‑model hybrid storage, five‑fold throughput gains, sub‑millisecond latency, advanced indexing, cloud‑native elasticity, strong/adjustable consistency, and comprehensive security and multi‑tenant features for massive data workloads.

Big Data · NoSQL · Performance Optimization
0 likes · 25 min read
HomeTech
Dec 12, 2019 · Big Data

Architecture and Design of the Home Data Integration Governance Platform

The article describes the background, architecture, and design principles of a unified big‑data scheduling and data‑exchange platform, detailing its data ingestion “direct‑train”, centralized scheduling engine, and DataX‑based data‑exchange components along with monitoring, alerting, and security features.

Big Data · Data Integration · DataX
0 likes · 7 min read
Product Technology Team
Dec 11, 2019 · Big Data

How a Data Middle Platform Transforms Business: Design, Architecture, and Modeling Insights

This article explains what a data middle platform is, why it matters, its core components—including storage, compute, IDE, workflow, API services, and data asset management—and details the layered architecture of ODS, DWD, DWT, DIM, and DWA, as well as dimensional modeling using Kimball’s methodology.

Big Data · Data Platform · Data Warehouse
0 likes · 6 min read
Programmer DD
Dec 11, 2019 · Big Data

Big Data Architecture Secrets: Storage-Compute Separation & Spark in Action

This article explores how enterprises can tackle the explosive growth of data by adopting modern big‑data architectures, including storage‑compute separation, data‑driven workflows, risk‑control frameworks, and real‑world Spark optimizations, offering practical guidance for scalable, high‑performance analytics.

Big Data · Data Architecture · Spark
0 likes · 12 min read
dbaplus Community
Dec 10, 2019 · Backend Development

How to Optimize Elasticsearch for Billions of Records: Practical Tuning Guide

An in‑depth guide walks through Elasticsearch’s underlying Lucene architecture, explains shard routing and DocValues, then presents concrete index‑ and search‑performance tweaks—bulk writes, refresh intervals, memory allocation, SSD usage, field mapping, pagination strategies—and shows benchmark results that reduce query latency to seconds for billions of records.

Big Data · Elasticsearch · Index Optimization
0 likes · 13 min read
21CTO
Dec 9, 2019 · Big Data

China’s Big Data Crackdown: Legal Risks Every Developer Should Know

The article examines the sweeping regulatory crackdown on China’s big‑data and financial‑risk companies, detailing the dissolution of major crawler firms, new legal restrictions on data collection, and practical guidance on what data‑scraping activities are illegal and how to protect personal information.

Big Data · Data Privacy · Legal Compliance
0 likes · 11 min read
Big Data Technology & Architecture
Dec 9, 2019 · Big Data

Building a Real‑Time ETL Pipeline with Apache Flink: Kafka to HDFS with Exactly‑Once Guarantees

This article explains how to develop a real‑time ETL application using Apache Flink that reads events from Kafka, partitions them by event time into HDFS directories, and achieves exactly‑once processing through checkpointing, custom bucket assigners, and proper state backend configuration.

Apache Flink · Big Data · Exactly-Once
0 likes · 11 min read
Architecture Digest
Dec 8, 2019 · Big Data

Technical Feasibility of a Nationwide WeChat Group with 1.4 Billion Users

The article analyses whether it is technically possible to place all 1.4 billion Chinese users into a single WeChat group, examining population data, message volume, CPU and network requirements, hardware costs, physical space, and human visual limits to assess scalability and practicality.

Big Data · Network Bandwidth · Server Architecture
0 likes · 11 min read
Big Data Technology & Architecture
Dec 4, 2019 · Big Data

Comprehensive Flink Interview Guide: Core Concepts, Advanced Topics, and Source‑Code Insights

This article provides an in‑depth Flink interview guide covering the framework’s core concepts, advanced features such as fault‑tolerance, state management, and checkpointing, as well as detailed explanations of its architecture, APIs, partitioning strategies, and source‑code flow, complete with code examples.

Big Data · Distributed Systems · Flink
0 likes · 29 min read
Yanxuan Tech Team
Dec 2, 2019 · Big Data

Why Modern Enterprises Need a Data Middle Platform: Lessons from NetEase Yanxuan

Drawing on NetEase Yanxuan’s experience, this article explains what a data middle platform is, why companies are building one for digital transformation and fine‑grained operations, and details its core components—including the data warehouse, data services, and BI platform—illustrated with real‑world diagrams.

BI · Big Data · Data Middle Platform
0 likes · 12 min read
Big Data Technology & Architecture
Dec 1, 2019 · Big Data

Understanding Flink LatencyMarker: End-to-End Delay Measurement and Implementation Details

This article explains the background, source‑code analysis, and practical implementation of Flink's LatencyMarker feature for measuring end‑to‑end job latency, including metric exposure, configuration options, and code snippets illustrating how latency markers are emitted and processed within the streaming pipeline.

Big Data · End-to-End Latency · Flink
0 likes · 6 min read
58 Tech
Nov 29, 2019 · Big Data

Application of Big Data and Algorithms in the Real‑Estate Internet

The talk presented at the Shanghai Computer Society Annual Meeting details how big data and algorithms are leveraged in the real‑estate internet sector to enhance user personalization, improve agent matching, and assess video quality, illustrating practical implementations and performance gains across data collection, modeling, and recommendation pipelines.

AI · Big Data · Real Estate
0 likes · 10 min read
Mafengwo Technology
Nov 28, 2019 · Big Data

Why NiFi Beats Flink: Practical Data Flow for Recommendation Engines

This article explains why the team prefers Apache NiFi over Flink or Storm for data‑flow handling in information‑stream recommendation systems, outlines NiFi’s core components, features, cluster setup, custom processor development, and real‑world use cases such as HDFS, Elasticsearch, and RocketMQ integrations.

Big Data · NiFi · Processor Development
0 likes · 17 min read
58 Tech
Nov 27, 2019 · Information Security

Evolution and Architecture of a Big Data‑Driven Security Portrait System at 58.com

The article details the design, multi‑stage evolution, and operational impact of a big‑data‑based security portrait platform built by 58.com, describing its data pipelines, real‑time risk tagging, strategy scheduling, configuration management, and overall architecture that enable large‑scale threat detection and mitigation.

Big Data · Security · risk management
0 likes · 15 min read
Big Data Technology & Architecture
Nov 26, 2019 · Big Data

Understanding Flink SQL Window Functions: Types, Implementation, and Emit Triggers

This article provides a comprehensive overview of Flink SQL window functions, detailing time‑based window types, their underlying implementation in the StreamExecGroupWindowAggregate operator, the processing flow of WindowOperator, timer handling, emit/trigger strategies, and practical code examples for Tumble, Hop, and Session windows.

Big Data · Emit · Flink
0 likes · 20 min read
Architecture Digest
Nov 25, 2019 · Big Data

Introduction to Apache Kafka: Core Concepts, Architecture, and APIs

This article provides a comprehensive overview of Apache Kafka, covering its fundamental capabilities, typical use cases, core components, key APIs, and essential concepts such as topics, partitions, segments, brokers, producers, and consumers, illustrated with diagrams.

APIs · Big Data · Distributed Systems
0 likes · 8 min read
Big Data Technology & Architecture
Nov 24, 2019 · Big Data

Common Apache Kafka Exceptions and Their Causes

This article lists frequent Apache Kafka exceptions such as UnknownTopicOrPartitionException, LEADER_NOT_AVAILABLE, NotLeaderForPartitionException, TimeoutException, RecordTooLargeException, and others, explaining each error message, typical reasons, and practical troubleshooting steps for producers and consumers.

Big Data · Consumer · Error Handling
0 likes · 5 min read
Tianxing Digital Tech User Experience
Nov 22, 2019 · Product Management

Can Tesla’s Shadow‑Mode Revolutionize Product Design Evaluation?

This article examines the shortcomings of traditional usability testing, explains Tesla’s shadow‑mode data collection and high‑precision mapping, and proposes how the same AI‑driven, data‑rich approach can be adapted to create a self‑learning, automated product‑design evaluation and iteration cycle.

AI · Big Data · data-driven iteration
0 likes · 14 min read
Architecture Digest
Nov 22, 2019 · Big Data

Elasticsearch Optimization Practices for Large‑Scale Data Platforms

This article presents a comprehensive guide to optimizing Elasticsearch for massive data volumes, covering Lucene fundamentals, index and shard design, practical performance‑tuning techniques, and real‑world testing results that enable cross‑month queries and sub‑second response times.

Big Data · Elasticsearch · Index Optimization
0 likes · 14 min read
Meituan Technology Team
Nov 21, 2019 · Big Data

Designing a Platformized Jupyter Service Integrated with Spark for Meituan

Meituan Homestay created a platform‑wide Jupyter service built on JupyterHub and Kubernetes that integrates Spark, scheduling, documentation and storage, providing seamless, reproducible notebooks with custom extensions, magics and container isolation to unify data analysis, model training and production workflows.

Big Data · Data Analysis · Jupyter
0 likes · 19 min read
Xianyu Technology
Nov 21, 2019 · Big Data

Event-Driven Rule Engine for User Growth at Xianyu

To accelerate growth on Xianyu’s 20 million‑DAU platform, the team built an event‑driven rule engine with a SQL‑like DSL that translates user‑behavior streams into real‑time Flink/Blink queries, cutting rule development from four days to half a day and achieving sub‑5‑second processing latency.

Big Data · DSL · Event Stream
0 likes · 9 min read
JD Retail Technology
Nov 19, 2019 · Industry Insights

How JD.com Is Building an Open, Integrated Tech Ecosystem Across Retail, Logistics, and Cloud

JD.com's 2019 JDDiscovery conference revealed a comprehensive, cloud‑native technology landscape that spans AI, big data, IoT, and blockchain, detailing how the company has transformed its integrated retail, logistics, and finance systems into modular, open‑service solutions for external partners.

Artificial Intelligence · Big Data · Cloud Computing
0 likes · 9 min read
Big Data Technology & Architecture
Nov 18, 2019 · Big Data

Understanding JVM Garbage Collection and Flink Memory Management

This article explains the fundamentals of JVM garbage collection, its generational algorithms and associated performance issues, and then details Apache Flink's memory management architecture, including MemorySegment, off‑heap buffers, serialization mechanisms, and type information for efficient big‑data processing.

Big Data · Flink · Garbage Collection
0 likes · 7 min read
Big Data Technology & Architecture
Nov 14, 2019 · Big Data

Comparison of Flink and Spark Structured Streaming: Joins, State Management, Fault Tolerance, and Backpressure

This article compares Flink and Spark Structured Streaming, detailing their differences in join capabilities, state management, fault‑tolerance mechanisms, exactly‑once semantics, back‑pressure handling, and table registration, while providing code examples and practical insights for real‑time big‑data processing.

Big Data · Flink · Join
0 likes · 13 min read
Tencent Cloud Developer
Nov 14, 2019 · Big Data

Tencent Announces Open‑Source High‑Performance Graph Computing Framework Plato

Tencent has open‑sourced its high‑performance graph computing framework Plato, which can process billion‑node graphs in minutes on as few as ten servers, outpacing Spark GraphX by up to two orders of magnitude, and supports offline computation, representation learning, and integration with Kubernetes/YARN for social, recommendation, and biomedical applications.

Big Data · Distributed Systems · Open Source
0 likes · 7 min read
Big Data Technology & Architecture
Nov 13, 2019 · Databases

ClickHouse Engines: Use Cases, Syntax, and Limitations

This article provides a comprehensive overview of ClickHouse, covering its typical application scenarios, inherent limitations, common SQL syntax, default values, data types, materialized and expression columns, and detailed explanations of its various storage engines such as TinyLog, Log, Memory, Merge, Distributed, Null, Buffer, Set, MergeTree, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, and CollapsingMergeTree, accompanied by practical code examples.

Big Data · ClickHouse · Database Engines
0 likes · 25 min read
DataFunTalk
Nov 13, 2019 · Big Data

ByteDance’s Core Optimization Practices on Spark SQL

ByteDance’s data warehouse team shares comprehensive optimizations for Spark SQL, covering architecture overview, bucket join enhancements, materialized columns and views, and shuffle stability and performance improvements, illustrating practical techniques that boost query efficiency and job reliability in large‑scale big‑data environments.

Big Data · Materialized Columns · Shuffle Optimization
0 likes · 20 min read
DevOps
Nov 11, 2019 · Operations

Capital One DevOps Transformation: Data‑Driven Innovation, Cloud Migration, and AI‑Enabled Services

This case study details Capital One’s evolution from a regional credit‑card unit to a data‑centric financial giant, highlighting its vision, data‑driven product strategy, big‑data analytics, AI‑powered customer service, cloud migration to AWS, and the DevOpsSec practices that enabled rapid, secure, and scalable innovation across banking, automotive finance, and digital services.

Big Data · DevOps · FinTech
0 likes · 19 min read
Big Data Technology & Architecture
Nov 9, 2019 · Big Data

Comparative Study of Apache Flink and Spark Streaming at Xiaomi: Architecture, Performance, and Serialization

This article examines Xiaomi's migration from Spark Streaming to Apache Flink, comparing scheduling strategies, mini‑batch versus true streaming, resource utilization, latency, and serialization mechanisms, and concludes with practical insights and custom optimization techniques for large‑scale data processing.

Big Data · Flink · Mini-Batch
0 likes · 17 min read
JD Retail Technology
Nov 7, 2019 · Industry Insights

How JD’s Advertising Architecture Scaled for 11.11: Lessons in Cost‑Cutting and Performance

The article details how JD’s advertising division tackled the massive traffic surge of the 11.11 shopping festival by expanding shard capacity, optimizing models and data pipelines, migrating workloads to the cloud, and implementing cost‑saving measures that together ensured stable, high‑performance ad delivery.

Advertising · Big Data · Performance Optimization
0 likes · 7 min read
DataFunTalk
Nov 7, 2019 · Big Data

Real-Time Computing Engine at Beike: Architecture, Practices, and Future Plans

This article details Beike's real‑time computing engine, covering its background, streaming platform built on Spark Streaming and Flink, data ingestion via Kafka, metadata handling, SQL‑based task development, monitoring, storage solutions, and future roadmap for resource management and AI‑enhanced monitoring.

Big Data · Flink · Kafka
0 likes · 14 min read
Xianyu Technology
Nov 7, 2019 · Big Data

Sequence Pattern Mining for User Behavior Analysis in Xianyu

By applying sequence pattern mining and unsupervised clustering to Xianyu’s massive event logs, the study abstracts high‑level user behaviors, discovers frequent subsequences, uncovers unknown fraudulent account patterns, expands known fraud cohorts with 99 % precision, and enables richer analyses such as PCA‑based cross‑group comparisons.

Big Data · clustering · data mining
0 likes · 8 min read
360 Zhihui Cloud Developer
Nov 5, 2019 · Operations

How 360 Scaled AIOps: From Data to Self‑Healing Operations

At the 360 Internet Technology Training Camp, experts detailed how AI-driven AIOps can transform large‑scale operations, covering data collection, model‑based anomaly detection, alert correlation, self‑healing workflows, and visual dashboards, and presented a practical end‑to‑end framework that other companies can adopt quickly.

Big Data · Operations · aiops
0 likes · 15 min read
Architecture Digest
Nov 5, 2019 · Big Data

Architecture Overview of Taobao, Meituan, and Didi Big Data Platforms

This article examines the big‑data architectures of three leading Chinese internet companies—Taobao, Meituan, and Didi—detailing their data sources, synchronization mechanisms, batch and streaming processing layers, and the common scheduling components that unify their Hadoop‑based ecosystems.

Big Data · Data Architecture · Didi
0 likes · 7 min read
Efficient Ops
Nov 3, 2019 · Operations

How Beijing Mobile Achieved Tier‑3 DevOps Maturity: A Deep Dive into Continuous Delivery

This article details Beijing Mobile's successful Tier‑3 DevOps standard assessment, showcasing their micro‑service, container‑based performance management system, the role of standards and tooling in boosting efficiency, and insights from a Q&A with senior engineers on implementation challenges and future DevOps prospects.

AI · Big Data · Containerization
0 likes · 11 min read
Efficient Ops
Nov 3, 2019 · Operations

How Zhejiang Mobile Is Pioneering AIOps to Reach NoOps

Zhejiang Mobile’s IT department chronicles its journey from a 2015 cloud‑native initiative to a cutting‑edge AIOps transformation, detailing a six‑level NoOps roadmap, digital fault‑governance, middle‑platform consolidation, organizational agility, and measurable operational gains that position it as a telecom industry leader.

Artificial Intelligence · Big Data · Digital Transformation
0 likes · 7 min read
Big Data Technology & Architecture
Nov 3, 2019 · Big Data

Understanding Spark Shuffle and Smart Shuffle: Design, Implementation, and Performance Analysis

This article explains the evolution of Spark Shuffle from hash‑based to sort‑based, introduces the Smart Shuffle optimization, details their implementations and configurations, and presents performance comparisons using TPC‑DS benchmarks, highlighting significant speedups and reduced I/O overhead.

Big Data · Shuffle · Smart Shuffle
0 likes · 7 min read
Big Data Technology & Architecture
Nov 2, 2019 · Big Data

Evolution of Elasticsearch Cluster Architecture for JD Daojia Order Center

This article details how JD Daojia's order center migrated its Elasticsearch cluster through multiple architectural stages—from an initial loosely configured setup to a real‑time dual‑cluster solution—addressing scalability, high availability, data synchronization, and performance optimization for billions of documents and hundreds of millions of daily queries.

Big Data · Cluster Architecture · Elasticsearch
0 likes · 12 min read
Big Data Technology & Architecture
Oct 30, 2019 · Big Data

Building a Real‑Time Data Processing Pipeline with Apache Kafka, Spark Streaming, and Cassandra

This tutorial explains how to create a highly scalable, fault‑tolerant real‑time data processing platform by configuring a Kafka topic and a Cassandra keyspace, adding Spark and connector dependencies, developing a Java‑based Spark Streaming pipeline, enabling checkpoints, and deploying the application with spark‑submit.

Big Data · Java · Kafka
0 likes · 8 min read
Alibaba Cloud Developer
Oct 30, 2019 · Big Data

How Real-Time Big Data Pipelines Detect E‑Commerce Ad Misplacements

This article explains how a large‑scale e‑commerce search advertising system uses real‑time big‑data pipelines, log synchronization, NoSQL storage, and proactive verification to automatically discover and correct ad placement errors across the entire data processing chain, protecting both advertisers and the platform.

Big Data · ad verification · data pipeline
0 likes · 13 min read
Big Data Technology & Architecture
Oct 28, 2019 · Big Data

Big Data Technology and Architecture: Leveraging Spark and HBase for Real‑Time and Offline Processing

This article outlines the challenges of various big‑data scenarios such as financial risk control, recommendation systems, and social feeds, explains why Spark is chosen over alternatives, describes a one‑stop data platform architecture with Spark‑HBase integration, and shares best‑practice tips and case studies.

Big Data · Data Architecture · HBase
0 likes · 7 min read
DataFunTalk
Oct 25, 2019 · Big Data

Migrating Data from HBase to Kafka Using MapReduce

This article explains how to reverse the typical data flow by extracting massive Rowkeys from HBase with MapReduce, storing them on HDFS, and then using batch Get operations to retrieve the full records and write them into Kafka, while handling retries and monitoring progress.

Big Data · Data Migration · HBase
0 likes · 9 min read
dbaplus Community
Oct 22, 2019 · Big Data

How Weibo Built a Billion‑Log Real‑Time Data Platform with Flink

This article details how Weibo’s advertising team designed and implemented a real‑time data platform capable of processing over a hundred billion daily logs, covering technology selection, Flink advantages, architecture evolution, data processing pipelines, component libraries, fault‑tolerance strategies, and the construction of a multi‑layer real‑time data warehouse.

Big Data · Checkpoint · Data Architecture
0 likes · 25 min read
Big Data Technology & Architecture
Oct 22, 2019 · Big Data

Real-Time Data Verification: Building a Log Comparison Solution with Flink, Elasticsearch, and Hive

This article explains how to design and implement a real‑time data verification framework using Flink to generate wide tables, storing detailed records in Elasticsearch or HDFS with Hive for cross‑checking against offline data, ensuring trustworthy metrics for dashboards and stakeholders.

Big Data · Data verification · Elasticsearch
0 likes · 7 min read
58 Tech
Oct 21, 2019 · Big Data

Improving Information Exposure Measurement: Visible Ad Metrics and Data Processing Practices at 58 Platform

To address inaccuracies in traditional information exposure metrics, this article proposes adopting advertising visibility standards—defining visible exposure by pixel and time thresholds, implementing client-side logging, unique TID tracking, and ETL pipelines—to provide more reliable data for product strategy and user behavior analysis.

Big Data · Data Quality · ad visibility
0 likes · 8 min read
dbaplus Community
Oct 20, 2019 · Big Data

Mastering Kafka: Concepts, Installation, Optimization, and Security

This comprehensive guide covers Kafka's core concepts, design principles, installation steps, configuration tweaks, performance optimizations, permission management, common operational commands, cluster scaling, log retention settings, and monitoring scripts to help you build and maintain a robust Kafka ecosystem.

Big Data · Installation · Kafka
0 likes · 20 min read
Architects' Tech Alliance
Oct 17, 2019 · Big Data

Understanding Alibaba's Data Middle Platform: Concepts, Architecture, and Differences from Data Warehouses and Data Lakes

The article explains Alibaba's data middle platform—its definition, methodology, organizational structure, key tools, and how it differs from traditional data warehouses and data lakes—while highlighting its role in supporting scalable, business‑centric data services and digital transformation.

Alibaba · Big Data · Data Architecture
0 likes · 16 min read
Big Data Technology & Architecture
Oct 17, 2019 · Big Data

Delta Lake: Architecture, Features, and Hands‑On Tutorial

This article explains the origins and motivations of Delta Lake, details its ACID transaction support, schema enforcement, metadata handling, versioning, and unified batch‑and‑stream processing, and provides a step‑by‑step Maven and Spark code tutorial for creating, updating, and querying Delta tables.

ACID · Apache Spark · Big Data
0 likes · 10 min read
Meituan Technology Team
Oct 17, 2019 · Big Data

OneData Methodology: Building a Unified Data Warehouse Architecture and Governance Framework

By adapting Alibaba’s OneData methodology, the project establishes a unified data‑warehouse architecture, standards, and governance framework—including consolidated business intake, standardized design layers, naming conventions, and delivery metrics—that resolves data‑quality issues, enhances scalability and reusability, and delivers faster, reliable data support for evolving business needs.

Big Data · Data Architecture · Data Warehouse
0 likes · 15 min read
Youku Technology
Oct 16, 2019 · Artificial Intelligence

Building an Entertainment Content Cognition Brain: AI and Big Data for the Full Content Lifecycle

The talk outlines how Alibaba’s Entertainment Brain leverages AI, big-data analytics, and psychological modeling to map content attributes and user emotions across the entire production-to-distribution lifecycle, enabling data-driven talent selection, script evaluation, real-time feedback, and predictive traffic forecasting for hit-making.

AI · Big Data · Content Analytics
0 likes · 11 min read