Tagged articles

3697 articles

Jan 2, 2020 · Big Data

Structured Streaming: Design, Challenges, Programming Model, and Performance Evaluation

This article provides a comprehensive overview of Apache Spark Structured Streaming, describing its declarative API, the challenges of stream processing, the programming model with code examples, query planning, execution modes, production use cases, and performance benchmarks compared with other streaming systems.

Big Data · Spark · Streaming

0 likes · 42 min read

Mafengwo Technology

Jan 2, 2020 · Big Data

How We Scaled Kafka for Real‑Time Big Data at Mafengwo: Lessons and Practices

This article details Mafengwo's practical experience using Kafka within its big‑data platform, covering application scenarios, evolution through version upgrades, resource isolation, security and monitoring enhancements, and future plans for data duplication handling and consumer throttling.

Big Data · Data Streaming · Kafka

0 likes · 16 min read

DataFunTalk

Jan 2, 2020 · Big Data

ByteDance’s HDFS Architecture and Evolution: Design, Challenges, and Optimizations

This article presents an in‑depth overview of ByteDance’s large‑scale HDFS deployment, describing its unique access layer, metadata and data layers, the evolution through multiple growth stages, and the key architectural improvements such as NNProxy, DanceNN, lock redesign, startup acceleration, and slow‑node mitigation techniques.

Big Data · ByteDance · Federation

0 likes · 18 min read

dbaplus Community

Jan 1, 2020 · Big Data

How Facebook Replaced Hundreds of Hive Jobs with a Single Spark Pipeline

This article describes how Facebook migrated a massive, multi-stage Hive-based entity ranking pipeline to a single Spark job, covering the challenges of scaling to 20 TB inputs, the reliability fixes, the performance optimizations, and the resulting 4-6× CPU speedup and reduced latency.

Big Data · Hive · Reliability

0 likes · 16 min read

Tongcheng Travel Technology Center

Dec 31, 2019 · Big Data

Apache Kylin Overview and Model Optimization Practices for Trajectory Analytics

This article introduces Apache Kylin, details its deployment at Tongcheng Elong, explains the design of a large-scale trajectory model, and provides step-by-step optimization techniques (cube dimension reduction, HBase rowkey tuning, build parameter tweaks, high-cardinality handling, and disabling query compression) to achieve sub-second OLAP queries on multi-terabyte data.

Apache Kylin · Big Data · Cube

0 likes · 17 min read

Cloud Native Technology Community

Dec 30, 2019 · Big Data

Kafka 2.4.0 Release Summary: New Features, Improvements, and Bug Fixes

The article provides a comprehensive overview of Apache Kafka 2.4.0, detailing its major new capabilities such as consumer replica fetching, incremental cooperative rebalancing, MirrorMaker 2.0, a new Java Authorizer API, and extensive bug fixes, along with upgrade considerations and related resources.

Apache Kafka · Big Data · Release Notes

0 likes · 26 min read

DataFunTalk

Dec 30, 2019 · Databases

Cassandra: Past, Present, and Future – History, Architecture, Features, and Use Cases

This article summarizes a Cassandra meetup presentation that traces the database's origins from BigTable and Dynamo, outlines its key milestones, explains its peer‑to‑peer and LSM architecture, highlights current features, real‑world deployments, performance advantages, and previews upcoming 4.0 releases and community projects.

Big Data · Gossip Protocol · LSM

0 likes · 14 min read

dbaplus Community

Dec 29, 2019 · Databases

What New Database Versions and Trends Shaped 2019? A Comprehensive Review

The 2019 dbaplus Newsletter compiles a detailed overview of major RDBMS, NoSQL, NewSQL, big-data, Chinese, and cloud database releases, highlighting key features, performance improvements, security enhancements, and future roadmaps for each product.

Big Data · Cloud Computing · NewSQL

0 likes · 40 min read

Java High-Performance Architecture

Dec 29, 2019 · Fundamentals

Which Technologies Will Dominate Software Development in 2020? A Trend Forecast

This article forecasts the 2020 software development landscape, highlighting the rise of cloud adoption, Kubernetes, micro‑services, Python, Java, emerging languages like Rust and Kotlin, JavaScript frameworks, API standards, SQL dominance, big‑data engines Spark and Flink, and the growing impact of WebAssembly.

Big Data · Cloud Computing · microservices

0 likes · 9 min read

Efficient Ops

Dec 28, 2019 · Operations

What the 2019 IT Operations Whitepaper Reveals About Enterprise Ops Trends

The 2019 Enterprise IT Operations Whitepaper, released at the national Operations Conference, systematically examines the definition, value, key capabilities, industry applications, challenges, and future trends of IT operations across telecom, finance, Internet, and manufacturing sectors.

Artificial Intelligence · Big Data · IT Operations

0 likes · 6 min read

360 Tech Engineering

Dec 27, 2019 · Big Data

Introduction to ElasticSearch: Core Concepts, Architecture, and Common Operations

This article provides a comprehensive overview of ElasticSearch, covering its distributed architecture, fundamental components such as nodes, shards, and indices, as well as practical guidance on index design, mapping, bulk operations, query processing, scroll searches, alias management, and performance tuning tips.

Big Data · Cluster · Mapping

0 likes · 11 min read

ITPUB

Dec 27, 2019 · Big Data

How Facebook Scaled Entity Ranking from Hive to Spark: Lessons and Performance Gains

Facebook replaced a multi‑stage Hive pipeline for real‑time entity ranking with a single Spark job, applying extensive reliability fixes and performance tweaks that reduced CPU usage by up to six times, cut latency fivefold, and demonstrated the feasibility of shuffling over 90 TB of data in production.

Big Data · Hive · Performance Optimization

0 likes · 16 min read

Huawei Cloud Developer Alliance

Dec 27, 2019 · Big Data

How to Compile and Install CDH Hadoop on Kunpeng Cloud: Step‑by‑Step Guide

This article walks through the full‑stack process of migrating and compiling the CDH Hadoop distribution on Kunpeng cloud servers, covering environment setup, dependency installation, source code adjustments, common build errors, and final packaging for a production‑ready big‑data platform.

Big Data · CDH · Hadoop

0 likes · 14 min read

21CTO

Dec 26, 2019 · Artificial Intelligence

Will AI and Machine Learning Redefine Software Testing in 2020?

The article outlines five major 2020 software testing trends—including the surge of AI/ML, digital transformation, cloud and IoT adoption, the shift from performance testing to performance engineering, and the growing importance of big‑data testing—highlighting their impact on quality assurance practices.

AI · Big Data · Cloud Computing

0 likes · 7 min read

Big Data Technology & Architecture

Dec 25, 2019 · Big Data

Understanding Flink StreamPartitioner and Its Implementations

Flink’s StreamPartitioner abstracts data routing in DataStream, offering eight built‑in partitioners—including Global, Shuffle, Rebalance, KeyGroup, Broadcast, Rescale, Forward, and Custom—each with distinct channel selection logic, illustrated with source code snippets and explanations of their runtime behavior.

Big Data · DataStream · Flink

0 likes · 8 min read

Tongcheng Travel Technology Center

Dec 25, 2019 · Big Data

Recap of Tongcheng Elong 5th Big Data Technology and Application Salon (2019)

The article reviews the 2019 Tongcheng Elong Big Data Technology and Application Salon, summarizing six expert talks on data middle platforms, intelligent marketing, real‑time recommendation, Apache Pulsar, Chinese entity recognition, and hotel ranking models, plus event highlights and future plans.

Apache Pulsar · Big Data · Data Platform

0 likes · 5 min read

DataFunTalk

Dec 24, 2019 · Big Data

Deep Dive into PySpark Implementation: Multi‑Process Architecture, Java Integration, RDD/SQL Interfaces, Executor Communication, and Pandas UDF

This article explains PySpark's multi‑process architecture, how the Python driver uses Py4J to call Java/Scala APIs, the implementation of RDD and DataFrame interfaces, executor‑side process communication and serialization with Arrow, and the design of Pandas UDFs, while also discussing current limitations and future directions.

Arrow · Big Data · Distributed computing

0 likes · 13 min read

dbaplus Community

Dec 23, 2019 · Databases

How to Deploy, Scale, and Monitor ClickHouse for High‑Performance Big Data Analytics

This article explains ClickHouse's deployment architecture, read‑write separation, shard expansion steps, write‑batch strategies, a three‑layer monitoring model, and its practical application in Tencent's game analytics platform, offering concrete guidance for building a stable, high‑throughput analytics service.

Big Data · Database · Game Analytics

0 likes · 21 min read

DataFunTalk

Dec 23, 2019 · Databases

Cassandra Deployment and Optimization at 360 Cloud Storage

This article details how 360 adopted Cassandra for its cloud drive, describing Cassandra’s decentralized architecture, the reasons for its selection over HBase, large‑scale deployment challenges, performance optimizations, reliability improvements, disk utilization techniques, and the evolution of the system from 2010 to present.

Big Data · Data Reliability · Scalability

0 likes · 15 min read

Big Data Technology & Architecture

Dec 22, 2019 · Big Data

Dynamic Resource Allocation in Spark Streaming: Problems, Mechanisms, and Practical Guidelines

The article explains Spark's default static resource allocation, analyzes the limitations of its Dynamic Resource Allocation (DRA) for streaming workloads, describes the internal Spark components and code paths involved, and proposes concrete design and configuration recommendations for implementing more responsive executor scaling.

Big Data · Dynamic Resource Allocation · Executor Management

0 likes · 11 min read

Big Data Technology & Architecture

Dec 22, 2019 · Big Data

Implementing Multi‑threaded Kafka Consumer and Producer with Partition Management

This article explains how to build a multi‑threaded Kafka consumer and producer in Java, covering partition concepts, consumer group offsets, thread‑pool configuration, and code examples that demonstrate proper use of Kafka streams, partition keys, and batch message sending for improved throughput.

Big Data · Consumer · Kafka

0 likes · 15 min read

Big Data Technology & Architecture

Dec 21, 2019 · Big Data

Kafka Offset Management and Replication Mechanisms Explained

This article provides a comprehensive technical overview of Kafka's offset handling, covering the request entry point, in‑memory offset sources, offset commit and fetch implementations, file storage layout, and the leader‑follower synchronization process that ensures data replication and high‑watermark updates.

Big Data · Distributed Systems · High Watermark

0 likes · 16 min read

macrozheng

Dec 20, 2019 · Big Data

How to Supercharge Elasticsearch for Billion‑Row Queries: Practical Optimization Guide

This article explains the architecture of Elasticsearch and Lucene, outlines common performance bottlenecks, and provides concrete indexing and search optimization techniques—including bulk writes, shard routing, doc values tuning, and pagination strategies—to achieve sub‑second query responses on billions of records.

Big Data · Elasticsearch · Performance tuning

0 likes · 14 min read

Qunar Tech Salon

Dec 20, 2019 · Big Data

Understanding Flink Cluster Startup and Job Execution Process

This article explains the architecture of a Flink cluster, detailing the startup procedures for JobManager and TaskManager, the three deployment modes, and the end‑to‑end flow of a Flink job from client code through StreamGraph, JobGraph, ExecutionGraph to the physical execution on TaskManagers.

Big Data · Cluster Architecture · Flink

0 likes · 10 min read

Big Data Technology & Architecture

Dec 20, 2019 · Big Data

Understanding Hadoop YARN Schedulers: FIFO, Capacity, and Fair Scheduler

This article explains the role of YARN's Scheduler, compares the FIFO, Capacity, and Fair schedulers, details their configurations (including XML snippets for the Capacity and Fair schedulers, queue hierarchy, and preemption settings), and provides practical guidance for resource allocation in Hadoop clusters.

Big Data · Capacity Scheduler · Fair Scheduler

0 likes · 13 min read

Big Data Technology & Architecture

Dec 19, 2019 · Big Data

Apache Kafka 2.4.0 Release: New Features and Improvements

Apache Kafka 2.4.0 introduces a range of new capabilities—including consumer replica fetching, incremental cooperative rebalancing, MirrorMaker 2.0, a new Java authorization API, KTable non‑key joins, administrative replica reassignment, protected REST endpoints, and offset deletion—along with numerous performance and stability improvements.

Apache Kafka · Big Data · Distributed Systems

0 likes · 3 min read

vivo Internet Technology

Dec 18, 2019 · Big Data

Comprehensive Overview of Big Data Architecture, Lambda/Kappa Models, and End-to-End Data Platform Design

The article surveys modern big‑data architecture, contrasting Lambda and Kappa models, highlights common governance and integration pain points, and proposes an end‑to‑end platform featuring unified metadata, stream‑batch processing, one‑click ingestion, standardized modeling, intelligent query abstraction, and a comprehensive development IDE.

Big Data · Data Platform · ETL

0 likes · 13 min read

Big Data Technology & Architecture

Dec 17, 2019 · Big Data

Understanding Flink Sliding Windows and Performance Optimizations

This article explains Flink's sliding window mechanism, shows how the WindowAssigner and WindowOperator work with code examples, analyzes the performance impact of fine‑grained sliding windows, and proposes a practical workaround using tumbling windows combined with external storage such as Redis for efficient PV/UV aggregation.

Big Data · Flink · Performance Optimization

0 likes · 8 min read

Alibaba Cloud Developer

Dec 16, 2019 · Big Data

Why Apache Flink Became the Fastest‑Growing Open‑Source Big Data Engine in 2019

Apache Flink, the open‑source stream‑and‑batch processing engine, has surged to become one of the most active Apache projects, with rapid community growth in China, unified SQL capabilities, AI‑focused extensions, Kubernetes integration, and benchmark results that outperform Hive by up to seven times.

AI · Apache Flink · Big Data

0 likes · 14 min read

DataFunTalk

Dec 13, 2019 · Databases

Lindorm: High‑Performance Distributed NoSQL Database for Big Data

Lindorm, an Alibaba‑derived distributed NoSQL database built on HBase, delivers multi‑model hybrid storage, five‑fold throughput gains, sub‑millisecond latency, advanced indexing, cloud‑native elasticity, strong/adjustable consistency, and comprehensive security and multi‑tenant features for massive data workloads.

Big Data · NoSQL · Performance Optimization

0 likes · 25 min read

Architecture Digest

Dec 13, 2019 · Big Data

Understanding Data Middle Platform: Concepts, Architecture, and Real‑Time Implementation

The article explains the data middle platform concept, its distinction from traditional big‑data platforms, the architectural principles behind Alibaba's implementation, and how real‑time ingestion, processing, and service layers enable efficient, collaborative, and scalable data-driven applications.

Alibaba · Big Data · Data Middle Platform

0 likes · 13 min read

HomeTech

Dec 12, 2019 · Big Data

Architecture and Design of the Home Data Integration Governance Platform

The article describes the background, architecture, and design principles of a unified big‑data scheduling and data‑exchange platform, detailing its data ingestion “direct‑train”, centralized scheduling engine, and DataX‑based data‑exchange components along with monitoring, alerting, and security features.

Big Data · Data Integration · DataX

0 likes · 7 min read

Sohu Tech Products

Dec 11, 2019 · Mobile Development

Technical Q&A: Android Dex Encryption and User Profiling for Content Recall

The article announces the new “Expert Talk” column, shares technical answers on Android dex‑based app hardening and user profiling for content recall, and promotes a giveaway event with prize details and participation instructions for readers.

Android · Big Data · Mobile Development

0 likes · 4 min read

Product Technology Team

Dec 11, 2019 · Big Data

How a Data Middle Platform Transforms Business: Design, Architecture, and Modeling Insights

This article explains what a data middle platform is, why it matters, its core components—including storage, compute, IDE, workflow, API services, and data asset management—and details the layered architecture of ODS, DWD, DWT, DIM, and DWA, as well as dimensional modeling using Kimball’s methodology.

Big Data · Data Platform · Data Warehouse

0 likes · 6 min read

Programmer DD

Dec 11, 2019 · Big Data

Big Data Architecture Secrets: Storage-Compute Separation & Spark in Action

This article explores how enterprises can tackle the explosive growth of data by adopting modern big‑data architectures, including storage‑compute separation, data‑driven workflows, risk‑control frameworks, and real‑world Spark optimizations, offering practical guidance for scalable, high‑performance analytics.

Big Data · Data Architecture · Spark

0 likes · 12 min read

dbaplus Community

Dec 10, 2019 · Backend Development

How to Optimize Elasticsearch for Billions of Records: Practical Tuning Guide

An in‑depth guide walks through Elasticsearch’s underlying Lucene architecture, explains shard routing and DocValues, then presents concrete index‑ and search‑performance tweaks—bulk writes, refresh intervals, memory allocation, SSD usage, field mapping, pagination strategies—and shows benchmark results that reduce query latency to seconds for billions of records.

Big Data · Elasticsearch · Index Optimization

0 likes · 13 min read

21CTO

Dec 9, 2019 · Big Data

China’s Big Data Crackdown: Legal Risks Every Developer Should Know

The article examines the sweeping regulatory crackdown on China’s big‑data and financial‑risk companies, detailing the dissolution of major crawler firms, new legal restrictions on data collection, and practical guidance on what data‑scraping activities are illegal and how to protect personal information.

Big Data · Data Privacy · Legal Compliance

0 likes · 11 min read

Big Data Technology & Architecture

Dec 9, 2019 · Big Data

Building a Real‑Time ETL Pipeline with Apache Flink: Kafka to HDFS with Exactly‑Once Guarantees

This article explains how to develop a real‑time ETL application using Apache Flink that reads events from Kafka, partitions them by event time into HDFS directories, and achieves exactly‑once processing through checkpointing, custom bucket assigners, and proper state backend configuration.

Apache Flink · Big Data · Exactly-Once

0 likes · 11 min read

Architecture Digest

Dec 8, 2019 · Big Data

Technical Feasibility of a Nationwide WeChat Group with 1.4 Billion Users

The article analyses whether it is technically possible to place all 1.4 billion Chinese users into a single WeChat group, examining population data, message volume, CPU and network requirements, hardware costs, physical space, and human visual limits to assess scalability and practicality.

Big Data · Network Bandwidth · Server Architecture

0 likes · 11 min read

ITPUB

Dec 5, 2019 · Big Data

How to Achieve Sub‑Second Queries on Billions of Records with Elasticsearch

This article explains how a data platform handling billions of daily records can be optimized for cross‑month queries and sub‑second response times by tuning Elasticsearch indexing, shard routing, Lucene structures, and hardware configurations.

Big Data · Performance tuning · indexing

0 likes · 13 min read

Big Data Technology & Architecture

Dec 4, 2019 · Big Data

Comprehensive Flink Interview Guide: Core Concepts, Advanced Topics, and Source‑Code Insights

This article provides an in‑depth Flink interview guide covering the framework’s core concepts, advanced features such as fault‑tolerance, state management, and checkpointing, as well as detailed explanations of its architecture, APIs, partitioning strategies, and source‑code flow, complete with code examples.

Big Data · Distributed Systems · Flink

0 likes · 29 min read

AntTech

Dec 4, 2019 · Artificial Intelligence

Ant Financial’s Online Learning System Built on Ray: Architecture, Challenges, and Future Plans

The interview details how Ant Financial transitioned from offline to online machine learning by adopting the Ray distributed engine, describing their open architecture, fusion computing approach, technical advantages, encountered pitfalls, and plans to open‑source the system for broader AI and big‑data use.

AI · Ant Financial · Big Data

0 likes · 15 min read

Big Data Technology & Architecture

Dec 2, 2019 · Big Data

Implementing Custom Flink Sources and Sinks for RocketMQ and HBase Streaming

This article explains how to create custom Flink SourceFunction and SinkFunction implementations, demonstrates a RocketMQ source and an HBase sink with full code examples, and discusses checkpointing, event‑time handling, and deployment of the streaming job on a Flink‑on‑YARN cluster.

Big Data · Flink · HBase

0 likes · 16 min read

Yanxuan Tech Team

Dec 2, 2019 · Big Data

Why Modern Enterprises Need a Data Middle Platform: Lessons from NetEase Yanxuan

Drawing on NetEase Yanxuan’s experience, this article explains what a data middle platform is, why companies are building one for digital transformation and fine‑grained operations, and details its core components—including the data warehouse, data services, and BI platform—illustrated with real‑world diagrams.

BI · Big Data · Data Middle Platform

0 likes · 12 min read

Big Data Technology & Architecture

Dec 1, 2019 · Big Data

Dynamic Configuration Updates in Real-Time Streaming with Spark Broadcast Variables and Flink Broadcast State

This article explains how to dynamically update configuration data in real‑time Spark Streaming and Flink jobs using broadcast variables and broadcast state, providing Java code examples and discussing the limitations and practical considerations of each approach.

Big Data · Flink · Real-time Streaming

0 likes · 8 min read

Big Data Technology & Architecture

Dec 1, 2019 · Big Data

Understanding Flink LatencyMarker: End-to-End Delay Measurement and Implementation Details

This article explains the background, source‑code analysis, and practical implementation of Flink's LatencyMarker feature for measuring end‑to‑end job latency, including metric exposure, configuration options, and code snippets illustrating how latency markers are emitted and processed within the streaming pipeline.

Big Data · End-to-End Latency · Flink

0 likes · 6 min read

Big Data Technology & Architecture

Nov 29, 2019 · Big Data

Understanding Flink's Memory Management and Data Flow Architecture

This article explains how Flink manages memory through its MemorySegment abstraction, the implementations of HeapMemorySegment and HybridMemorySegment, the role of ByteBuffer, NetworkBufferPool and LocalBufferPool, and details the end‑to‑end data flow from RecordWriter to Netty transport, including key code snippets.

Big Data · Data Flow · Flink

0 likes · 16 min read

58 Tech

Nov 29, 2019 · Big Data

Application of Big Data and Algorithms in the Real‑Estate Internet

The talk presented at the Shanghai Computer Society Annual Meeting details how big data and algorithms are leveraged in the real‑estate internet sector to enhance user personalization, improve agent matching, and assess video quality, illustrating practical implementations and performance gains across data collection, modeling, and recommendation pipelines.

AI · Big Data · Real Estate

0 likes · 10 min read

Efficient Ops

Nov 28, 2019 · Operations

Master Modern IT Operations: Skill Maps, ELK Architectures & Big Data Monitoring

This article explores the evolving landscape of IT operations, detailing role specializations, comprehensive skill maps for system, web, big data, and container ops, and compares three ELK logging architectures while emphasizing a data‑driven approach to monitoring and incident response.

Big Data · ELK · IT Operations

0 likes · 11 min read

Mafengwo Technology

Nov 28, 2019 · Big Data

Why NiFi Beats Flink: Practical Data Flow for Recommendation Engines

This article explains why the team prefers Apache NiFi over Flink or Storm for data‑flow handling in information‑stream recommendation systems, outlines NiFi’s core components, features, cluster setup, custom processor development, and real‑world use cases such as HDFS, Elasticsearch, and RocketMQ integrations.

Big Data · NiFi · Processor Development

0 likes · 17 min read

YooTech Youzu Tech Team

Nov 28, 2019 · Big Data

How Data Ingestion Evolved at Youzu: From HTTP to Real‑Time DTS & ETL

This article traces the evolution of Youzu's data platform ingestion, comparing early HTTP/script methods with modern DTS and real‑time ETL solutions, evaluating middleware choices, detailing core system architectures, and outlining future improvements for reliable, scalable data access.

Big Data · DTS · ETL

0 likes · 6 min read

Big Data Technology & Architecture

Nov 28, 2019 · Big Data

Resolving Unsupported Oracle Data Types in Spark SQL via Custom JdbcDialects

This article explains how to overcome Spark SQL's inability to handle certain Oracle data types, such as TIMESTAMP WITH LOCAL TIME ZONE and FLOAT(126), by creating and registering a custom JdbcDialect that remaps unsupported types to compatible Spark types.

Big Data · Custom Dialect · ETL

0 likes · 8 min read

58 Tech

Nov 27, 2019 · Information Security

Evolution and Architecture of a Big Data‑Driven Security Portrait System at 58.com

The article details the design, multi‑stage evolution, and operational impact of a big‑data‑based security portrait platform built by 58.com, describing its data pipelines, real‑time risk tagging, strategy scheduling, configuration management, and overall architecture that enable large‑scale threat detection and mitigation.

Big Data · Security · risk management

0 likes · 15 min read

Big Data Technology & Architecture

Nov 26, 2019 · Big Data

Understanding Flink SQL Window Functions: Types, Implementation, and Emit Triggers

This article provides a comprehensive overview of Flink SQL window functions, detailing time‑based window types, their underlying implementation in the StreamExecGroupWindowAggregate operator, the processing flow of WindowOperator, timer handling, emit/trigger strategies, and practical code examples for Tumble, Hop, and Session windows.

Big Data · Emit · Flink

0 likes · 20 min read

Java High-Performance Architecture

Nov 26, 2019 · Fundamentals

How Bloom Filters Efficiently Detect Element Presence in Massive Datasets

This article explains the Bloom filter concept, typical use cases such as preventing database misses and cache penetration, and the underlying hash-based implementation with examples, then shows how to deploy a Bloom filter using RedisBloom as a practical guide for handling huge data sets.

Big Data · RedisBloom · bloom-filter

0 likes · 6 min read

Architecture Digest

Nov 25, 2019 · Big Data

Introduction to Apache Kafka: Core Concepts, Architecture, and APIs

This article provides a comprehensive overview of Apache Kafka, covering its fundamental capabilities, typical use cases, core components, key APIs, and essential concepts such as topics, partitions, segments, brokers, producers, and consumers, illustrated with diagrams.

APIs · Big Data · Distributed Systems

0 likes · 8 min read

Big Data Technology & Architecture

Nov 24, 2019 · Big Data

Common Apache Kafka Exceptions and Their Causes

This article lists frequent Apache Kafka exceptions such as UnknownTopicOrPartitionException, LEADER_NOT_AVAILABLE, NotLeaderForPartitionException, TimeoutException, RecordTooLargeException, and others, explaining each error message, typical reasons, and practical troubleshooting steps for producers and consumers.

Big Data · Consumer · Error Handling

0 likes · 5 min read

Tianxing Digital Tech User Experience

Nov 22, 2019 · Product Management

Can Tesla’s Shadow‑Mode Revolutionize Product Design Evaluation?

This article examines the shortcomings of traditional usability testing, explains Tesla’s shadow‑mode data collection and high‑precision mapping, and proposes how the same AI‑driven, data‑rich approach can be adapted to create a self‑learning, automated product‑design evaluation and iteration cycle.

AI · Big Data · data-driven iteration

0 likes · 14 min read

Architecture Digest

Nov 22, 2019 · Big Data

Elasticsearch Optimization Practices for Large‑Scale Data Platforms

This article presents a comprehensive guide to optimizing Elasticsearch for massive data volumes, covering Lucene fundamentals, index and shard design, practical performance‑tuning techniques, and real‑world testing results that enable cross‑month queries and sub‑second response times.

Big Data · Elasticsearch · Index Optimization

0 likes · 14 min read

Meituan Technology Team

Nov 21, 2019 · Big Data

Designing a Platformized Jupyter Service Integrated with Spark for Meituan

Meituan Homestay created a platform‑wide Jupyter service built on JupyterHub and Kubernetes that integrates Spark, scheduling, documentation and storage, providing seamless, reproducible notebooks with custom extensions, magics and container isolation to unify data analysis, model training and production workflows.

Big Data · Data Analysis · Jupyter

0 likes · 19 min read

DataFunTalk

Nov 21, 2019 · Big Data

Evolution of 58.com Real-Time Computing Platform and the One-Stop Streaming Data Processing System Wstream

The article details the technical evolution of 58.com’s real-time computing platform—from Storm and Spark Streaming to a Flink‑based one‑stop solution called Wstream—covering use cases, architecture, stability measures, migration from Storm, operational diagnostics, and future development plans.

Big Data · Data Processing · Flink

0 likes · 11 min read

Xianyu Technology

Nov 21, 2019 · Big Data

Event-Driven Rule Engine for User Growth at Xianyu

To accelerate growth on Xianyu’s 20 million‑DAU platform, the team built an event‑driven rule engine with a SQL‑like DSL that translates user‑behavior streams into real‑time Flink/Blink queries, cutting rule development from four days to half a day and achieving sub‑5‑second processing latency.

Big Data · DSL · Event Stream

0 likes · 9 min read

JD Retail Technology

Nov 19, 2019 · Industry Insights

How JD.com Is Building an Open, Integrated Tech Ecosystem Across Retail, Logistics, and Cloud

JD.com's 2019 JDDiscovery conference revealed a comprehensive, cloud‑native technology landscape that spans AI, big data, IoT, and blockchain, detailing how the company has transformed its integrated retail, logistics, and finance systems into modular, open‑service solutions for external partners.

Artificial Intelligence · Big Data · Cloud Computing

0 likes · 9 min read

Big Data Technology & Architecture

Nov 18, 2019 · Big Data

Understanding JVM Garbage Collection and Flink Memory Management

This article explains the fundamentals of JVM garbage collection, its generational algorithms and associated performance issues, and then details Apache Flink's memory management architecture, including MemorySegment, off‑heap buffers, serialization mechanisms, and type information for efficient big‑data processing.

Big Data · Flink · Garbage Collection

0 likes · 7 min read

Big Data Technology & Architecture

Nov 16, 2019 · Big Data

Understanding SparkSQL Join Algorithms: Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join

This article explains SparkSQL's three join strategies—Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join—detailing their mechanisms, when to use each based on table size, and their relative performance costs in distributed big‑data environments.

Big Data · Broadcast Join · Hash Join

0 likes · 5 min read

iQIYI Technical Product Team

Nov 15, 2019 · Industry Insights

How iQIYI’s Big Data Middle Platform Fuels Scalable Entertainment Innovation

The article analyzes iQIYI’s big‑data middle‑platform strategy, detailing its origins, architecture, digital‑asset management, governance principles and how a unified, transparent, and compatible data platform enables user‑centric, scalable innovation across the entertainment ecosystem.

Analytics · Big Data · Data Platform

0 likes · 9 min read

Big Data Technology & Architecture

Nov 14, 2019 · Big Data

Comparison of Flink and Spark Structured Streaming: Joins, State Management, Fault Tolerance, and Backpressure

This article compares Flink and Spark Structured Streaming, detailing their differences in join capabilities, state management, fault‑tolerance mechanisms, exactly‑once semantics, back‑pressure handling, and table registration, while providing code examples and practical insights for real‑time big‑data processing.

Big Data · Flink · Join

0 likes · 13 min read

Tencent Cloud Developer

Nov 14, 2019 · Big Data

Tencent Announces Open‑Source High‑Performance Graph Computing Framework Plato

Tencent has open‑sourced its high‑performance graph computing framework Plato, which can process billion‑node graphs in minutes on as few as ten servers, outpacing Spark GraphX by up to two orders of magnitude, and supports offline computation, representation learning, and integration with Kubernetes/YARN for social, recommendation, and biomedical applications.

Big Data · Distributed Systems · Open Source

0 likes · 7 min read

Big Data Technology & Architecture

Nov 13, 2019 · Databases

ClickHouse Engines: Use Cases, Syntax, and Limitations

This article provides a comprehensive overview of ClickHouse, covering its typical application scenarios, inherent limitations, common SQL syntax, default values, data types, materialized and expression columns, and detailed explanations of its various storage engines such as TinyLog, Log, Memory, Merge, Distributed, Null, Buffer, Set, MergeTree, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, and CollapsingMergeTree, accompanied by practical code examples.

Big Data · ClickHouse · Database Engines

0 likes · 25 min read

DataFunTalk

Nov 13, 2019 · Big Data

ByteDance’s Core Optimization Practices on Spark SQL

ByteDance’s data warehouse team shares comprehensive optimizations for Spark SQL, covering architecture overview, bucket join enhancements, materialized columns and views, and shuffle stability and performance improvements, illustrating practical techniques that boost query efficiency and job reliability in large‑scale big‑data environments.

Big Data · Materialized Columns · Shuffle Optimization

0 likes · 20 min read

DevOps

Nov 11, 2019 · Operations

Capital One DevOps Transformation: Data‑Driven Innovation, Cloud Migration, and AI‑Enabled Services

This case study details Capital One’s evolution from a regional credit‑card unit to a data‑centric financial giant, highlighting its vision, data‑driven product strategy, big‑data analytics, AI‑powered customer service, cloud migration to AWS, and the DevOpsSec practices that enabled rapid, secure, and scalable innovation across banking, automotive finance, and digital services.

Big Data · DevOps · FinTech

0 likes · 19 min read

Big Data Technology & Architecture

Nov 9, 2019 · Big Data

OneData: A Comprehensive Big Data Architecture and Governance Framework

This article presents the OneData methodology for building a robust big‑data platform, detailing background challenges, goals, unified input and output strategies, model design, naming conventions, data‑cleaning rules, and the resulting business benefits and future outlook.

Big Data · Data Warehouse · Onedata

0 likes · 13 min read

Suning Technology

Nov 9, 2019 · Operations

Suning’s 2019 Smart Retail White Paper Unveils Digital Store Trends

The 2019 Suning Smart Retail White Paper analyzes the digital transformation of Chinese retail stores, highlighting AI, big data, O2O integration, and operational efficiencies that give retailers a competitive edge in the evolving market.

AI · Big Data · O2O

0 likes · 5 min read

Big Data Technology & Architecture

Nov 9, 2019 · Big Data

Comparative Study of Apache Flink and Spark Streaming at Xiaomi: Architecture, Performance, and Serialization

This article examines Xiaomi's migration from Spark Streaming to Apache Flink, comparing scheduling strategies, mini‑batch versus true streaming, resource utilization, latency, and serialization mechanisms, and concludes with practical insights and custom optimization techniques for large‑scale data processing.

Big Data · Flink · Mini-Batch

0 likes · 17 min read

JD Retail Technology

Nov 7, 2019 · Industry Insights

How JD’s Advertising Architecture Scaled for 11.11: Lessons in Cost‑Cutting and Performance

The article details how JD’s advertising division tackled the massive traffic surge of the 11.11 shopping festival by expanding shard capacity, optimizing models and data pipelines, migrating workloads to the cloud, and implementing cost‑saving measures that together ensured stable, high‑performance ad delivery.

Advertising · Big Data · Performance Optimization

0 likes · 7 min read

DataFunTalk

Nov 7, 2019 · Big Data

Real-Time Computing Engine at Beike: Architecture, Practices, and Future Plans

This article details Beike's real‑time computing engine, covering its background, streaming platform built on Spark Streaming and Flink, data ingestion via Kafka, metadata handling, SQL‑based task development, monitoring, storage solutions, and future roadmap for resource management and AI‑enhanced monitoring.

Big Data · Flink · Kafka

0 likes · 14 min read

Xianyu Technology

Nov 7, 2019 · Big Data

Sequence Pattern Mining for User Behavior Analysis in Xianyu

By applying sequence pattern mining and unsupervised clustering to Xianyu’s massive event logs, the study abstracts high‑level user behaviors, discovers frequent subsequences, uncovers unknown fraudulent account patterns, expands known fraud cohorts with 99% precision, and enables richer analyses such as PCA‑based cross‑group comparisons.

Big Data · clustering · data mining

0 likes · 8 min read

360 Zhihui Cloud Developer

Nov 5, 2019 · Operations

How 360 Scaled AIOps: From Data to Self‑Healing Operations

At the 360 Internet Technology Training Camp, experts detailed how AIOps can transform large‑scale operations, covering data collection, model‑based anomaly detection, alert correlation, self‑healing workflows, and visual dashboards, and presented a practical end‑to‑end framework that other companies can adopt quickly.

Big Data · Operations · aiops

0 likes · 15 min read

Architecture Digest

Nov 5, 2019 · Big Data

Architecture Overview of Taobao, Meituan, and Didi Big Data Platforms

This article examines the big‑data architectures of three leading Chinese internet companies—Taobao, Meituan, and Didi—detailing their data sources, synchronization mechanisms, batch and streaming processing layers, and the common scheduling components that unify their Hadoop‑based ecosystems.

Big Data · Data Architecture · Didi

0 likes · 7 min read

Big Data Technology & Architecture

Nov 4, 2019 · Big Data

Understanding Spark Checkpoint: Purpose, Mechanism, and Best Practices

This article explains why Spark checkpoints are needed for large or complex RDD pipelines, how they work by persisting data to reliable storage such as HDFS, and outlines practical steps and best‑practice recommendations for using checkpoints effectively in production environments.

Big Data · Checkpoint · HDFS

0 likes · 6 min read

Efficient Ops

Nov 3, 2019 · Operations

How Beijing Mobile Achieved Tier‑3 DevOps Maturity: A Deep Dive into Continuous Delivery

This article details Beijing Mobile's successful Tier‑3 DevOps standard assessment, showcasing their micro‑service, container‑based performance management system, the role of standards and tooling in boosting efficiency, and insights from a Q&A with senior engineers on implementation challenges and future DevOps prospects.

AI · Big Data · Containerization

0 likes · 11 min read

Efficient Ops

Nov 3, 2019 · Operations

How Zhejiang Mobile Is Pioneering AIOps to Reach NoOps

Zhejiang Mobile’s IT department chronicles its journey from a 2015 cloud‑native initiative to a cutting‑edge AIOps transformation, detailing a six‑level NoOps roadmap, digital fault‑governance, middle‑platform consolidation, organizational agility, and measurable operational gains that position it as a telecom industry leader.

Artificial Intelligence · Big Data · Digital Transformation

0 likes · 7 min read

Big Data Technology & Architecture

Nov 3, 2019 · Big Data

Understanding Spark Shuffle and Smart Shuffle: Design, Implementation, and Performance Analysis

This article explains the evolution of Spark Shuffle from hash‑based to sort‑based, introduces the Smart Shuffle optimization, details their implementations and configurations, and presents performance comparisons using TPC‑DS benchmarks, highlighting significant speedups and reduced I/O overhead.

Big Data · Shuffle · Smart Shuffle

0 likes · 7 min read

Big Data Technology & Architecture

Nov 2, 2019 · Big Data

Evolution of Elasticsearch Cluster Architecture for JD Daojia Order Center

This article details how JD Daojia's order center migrated its Elasticsearch cluster through multiple architectural stages—from an initial loosely configured setup to a real‑time dual‑cluster solution—addressing scalability, high availability, data synchronization, and performance optimization for billions of documents and hundreds of millions of daily queries.

Big Data · Cluster Architecture · Elasticsearch

0 likes · 12 min read

Big Data Technology & Architecture

Oct 30, 2019 · Big Data

Building a Real‑Time Data Processing Pipeline with Apache Kafka, Spark Streaming, and Cassandra

This tutorial explains how to create a highly scalable, fault‑tolerant real‑time data processing platform by configuring a Kafka topic, a Cassandra keyspace, adding Spark and connector dependencies, developing a Java‑based Spark Streaming pipeline, enabling checkpoints, and deploying the application with spark‑submit.

Big Data · Java · Kafka

0 likes · 8 min read

Alibaba Cloud Developer

Oct 30, 2019 · Big Data

How Real-Time Big Data Pipelines Detect E‑Commerce Ad Misplacements

This article explains how a large‑scale e‑commerce search advertising system uses real‑time big‑data pipelines, log synchronization, NoSQL storage, and proactive verification to automatically discover and correct ad placement errors across the entire data processing chain, protecting both advertisers and the platform.

Big Data · ad verification · data pipeline

0 likes · 13 min read

Big Data Technology & Architecture

Oct 28, 2019 · Big Data

Big Data Technology and Architecture: Leveraging Spark and HBase for Real‑Time and Offline Processing

This article outlines the challenges of various big‑data scenarios such as financial risk control, recommendation systems, and social feeds, explains why Spark is chosen over alternatives, describes a one‑stop data platform architecture with Spark‑HBase integration, and shares best‑practice tips and case studies.

Big Data · Data Architecture · HBase

0 likes · 7 min read

Big Data Technology & Architecture

Oct 27, 2019 · Databases

ClickHouse Architecture and Performance Optimization for Large-Scale OLAP

This article outlines ClickHouse’s columnar OLAP architecture, dual‑center design, storage and write stability strategies, performance testing results, and practical query and system optimizations for handling petabyte‑scale data with high throughput and low latency requirements.

Big Data · ClickHouse · Database Architecture

0 likes · 4 min read

DataFunTalk

Oct 25, 2019 · Big Data

Migrating Data from HBase to Kafka Using MapReduce

This article explains how to reverse the typical data flow by extracting massive Rowkeys from HBase with MapReduce, storing them on HDFS, and then using batch Get operations to retrieve the full records and write them into Kafka, while handling retries and monitoring progress.

Big Data · Data Migration · HBase

0 likes · 9 min read

Big Data Technology & Architecture

Oct 24, 2019 · Big Data

Real-Time Search Engine Indexing with Flink: Architecture and Implementation

This article explains how to build a real-time search engine indexing pipeline using Flink, covering background, batch versus incremental indexing strategies, a hybrid architecture that merges both approaches, and a concrete cloud‑based implementation involving MySQL binlog, Logtail, SLS, and Elasticsearch.

Big Data · Elasticsearch · Flink

0 likes · 5 min read

dbaplus Community

Oct 22, 2019 · Big Data

How Weibo Built a Billion‑Log Real‑Time Data Platform with Flink

This article details how Weibo’s advertising team designed and implemented a real‑time data platform capable of processing over a hundred billion daily logs, covering technology selection, Flink advantages, architecture evolution, data processing pipelines, component libraries, fault‑tolerance strategies, and the construction of a multi‑layer real‑time data warehouse.

Big Data · Checkpoint · Data Architecture

0 likes · 25 min read

Big Data Technology & Architecture

Oct 22, 2019 · Big Data

Real-Time Data Verification: Building a Log Comparison Solution with Flink, Elasticsearch, and Hive

This article explains how to design and implement a real‑time data verification framework using Flink to generate wide tables, storing detailed records in Elasticsearch or HDFS with Hive for cross‑checking against offline data, ensuring trustworthy metrics for dashboards and stakeholders.

Big Data · Data verification · Elasticsearch

0 likes · 7 min read

58 Tech

Oct 21, 2019 · Big Data

Improving Information Exposure Measurement: Visible Ad Metrics and Data Processing Practices at 58 Platform

To address inaccuracies in traditional information exposure metrics, this article proposes adopting advertising visibility standards—defining visible exposure by pixel and time thresholds, implementing client-side logging, unique TID tracking, and ETL pipelines—to provide more reliable data for product strategy and user behavior analysis.

Big Data · Data Quality · ad visibility

0 likes · 8 min read

Big Data Technology & Architecture

Oct 20, 2019 · Big Data

Converting Spark RDD to DataSet/DataFrame: Two Methods and Handling Serialization Issues

This article explains two approaches—reflection‑based schema inference and programmatic schema definition—to transform a Spark RDD into a DataSet or DataFrame, demonstrates the required code, and discusses common Task‑not‑serializable errors with practical solutions.

Big Data · RDD · Scala

0 likes · 8 min read

dbaplus Community

Oct 20, 2019 · Big Data

Mastering Kafka: Concepts, Installation, Optimization, and Security

This comprehensive guide covers Kafka's core concepts, design principles, installation steps, configuration tweaks, performance optimizations, permission management, common operational commands, cluster scaling, log retention settings, and monitoring scripts to help you build and maintain a robust Kafka ecosystem.

Big Data · Installation · Kafka

0 likes · 20 min read

Architects' Tech Alliance

Oct 17, 2019 · Big Data

Understanding Alibaba's Data Middle Platform: Concepts, Architecture, and Differences from Data Warehouses and Data Lakes

The article explains Alibaba's data middle platform—its definition, methodology, organizational structure, key tools, and how it differs from traditional data warehouses and data lakes—while highlighting its role in supporting scalable, business‑centric data services and digital transformation.

Alibaba · Big Data · Data Architecture

0 likes · 16 min read

Big Data Technology & Architecture

Oct 17, 2019 · Big Data

Delta Lake: Architecture, Features, and Hands‑On Tutorial

This article explains the origins and motivations of Delta Lake, details its ACID transaction support, schema enforcement, metadata handling, versioning, and unified batch‑and‑stream processing, and provides a step‑by‑step Maven and Spark code tutorial for creating, updating, and querying Delta tables.

ACID · Apache Spark · Big Data

0 likes · 10 min read

Meituan Technology Team

Oct 17, 2019 · Big Data

OneData Methodology: Building a Unified Data Warehouse Architecture and Governance Framework

By adapting Alibaba’s OneData methodology, the project establishes a unified data‑warehouse architecture, standards, and governance framework—including consolidated business intake, standardized design layers, naming conventions, and delivery metrics—that resolves data‑quality issues, enhances scalability and reusability, and delivers faster, reliable data support for evolving business needs.

Big Data · Data Architecture · Data Warehouse

0 likes · 15 min read

Efficient Ops

Oct 16, 2019 · Artificial Intelligence

How AIOps Is Revolutionizing IT Operations – Insights from Sina Expert Peng Dong

This interview explores the rise of AIOps, its business drivers, and practical implementation at Sina Weibo, while sharing Peng Dong’s career journey, technical challenges, and management philosophies that illustrate how AI‑driven automation is reshaping large‑scale IT operations.

Big Data · IT Operations · aiops

0 likes · 12 min read

Youku Technology

Oct 16, 2019 · Artificial Intelligence

Building an Entertainment Content Cognition Brain: AI and Big Data for the Full Content Lifecycle

The talk outlines how Alibaba’s Entertainment Brain leverages AI, big-data analytics, and psychological modeling to map content attributes and user emotions across the entire production-to-distribution lifecycle, enabling data-driven talent selection, script evaluation, real-time feedback, and predictive traffic forecasting for hit-making.

AI · Big Data · Content Analytics

0 likes · 11 min read
