Tagged articles
3697 articles
Alibaba Cloud Developer
Apr 18, 2019 · Big Data

How MaxCompute Evolved: 10 Years of Big Data Innovation at Alibaba

This article reviews a decade of MaxCompute development, covering its origins, core technologies, performance gains, ecosystem integration, intelligent features, competitive positioning, and commercialization, while highlighting the platform's role as Alibaba's central big‑data compute engine.

AI Integration · Big Data · Data Storage
0 likes · 21 min read
Big Data Technology & Architecture
Apr 17, 2019 · Big Data

Step-by-Step Guide to Installing Hive 2.1.0 on a Hadoop 2.7.1 Cluster (Ubuntu 14.04)

This tutorial provides a comprehensive, step-by-step procedure for setting up Hive 2.1.0 on a Hadoop 2.7.1 cluster running Ubuntu 14.04, covering environment preparation, Hive installation, configuration of environment variables, MySQL metastore integration, client setup, service startup, and basic verification commands.

Big Data · Hadoop · Hive
0 likes · 8 min read
DataFunTalk
Apr 17, 2019 · Artificial Intelligence

Evolution of Ctrip Financial Risk Control Models: From Data Platform to AI‑Driven Scoring and Anti‑Fraud Systems

This report details Ctrip Financial's end‑to‑end risk control development, covering business overview, a three‑layer data platform, the progression of credit scoring and anti‑fraud models from rule‑based to advanced AI techniques, and the evaluation, monitoring, and social‑network‑based fraud detection strategies employed.

Big Data · Financial AI · anti-fraud
0 likes · 16 min read
dbaplus Community
Apr 16, 2019 · Big Data

Scaling Elasticsearch for Billions of Daily Events: Cluster Planning, Routing & Hot‑Warm Tips

This article explains how to handle a real‑time OLAP monitoring platform processing 10‑12 billion daily events and 400 billion yearly records by optimizing Elasticsearch 5.3.3 through cluster planning, storage strategies, index sharding, compression, hot‑warm architecture, routing, index templates, rollover, and cross‑cluster search, providing concrete configurations and code examples.

Big Data · Cluster Planning · Elasticsearch
0 likes · 23 min read
21CTO
Apr 15, 2019 · Big Data

Mastering High‑Concurrency Big Data: Sharding, Partitioning, and Index Strategies

This article explores practical techniques for handling massive, high‑concurrency data workloads, covering relational database limits, read/write separation, vertical and horizontal sharding, key selection, archival to NoSQL stores, and the use of heterogeneous index tables to maintain performance.

Big Data · Partitioning · Sharding
0 likes · 6 min read
Alibaba Cloud Developer
Apr 15, 2019 · Artificial Intelligence

Why Deep Learning Finally Succeeded and What Challenges Lie Ahead

This article reviews Jia Yangqing’s insights on why deep learning finally succeeded—highlighting the roles of big data and high‑performance computing—while examining its current limitations, emerging challenges, and future opportunities across AI engineering, AutoML, and hardware‑software co‑design.

AI Challenges · AI Engineering · AutoML
0 likes · 9 min read
JD Retail Technology
Apr 10, 2019 · Databases

HBase at JD.com: Architecture, Use Cases, and Evolution

This article explains how JD.com leverages the open‑source HBase database for massive, low‑latency data storage across various business lines, detailing its architecture, multi‑tenant isolation, disaster‑recovery mechanisms, and integration with Phoenix SQL for OLTP workloads.

Big Data · Database Architecture · HBase
0 likes · 13 min read
Java Captain
Apr 9, 2019 · Big Data

Kafka FAQs: Zookeeper Dependency, Retention Policies, Cleanup Rules, Performance Bottlenecks, and Cluster Best Practices

This article answers common Kafka questions, explaining why Kafka cannot operate without Zookeeper, describing its two retention strategies based on time and size, detailing how simultaneous time‑ and size‑based cleanup works, listing performance bottlenecks, and offering practical guidelines for sizing and configuring Kafka clusters.

Big Data · Cluster Design · Kafka
0 likes · 2 min read
Youzan Coder
Apr 7, 2019 · Industry Insights

How Youzan Scaled Order Search: Hot‑State Indexing and AKF Expansion

This article reviews the evolution of Youzan's order search architecture over two years, detailing challenges from data growth, the creation of a hot‑state index covering half of search traffic, time‑sharded indexes, and the AKF expansion cube that guides multi‑axis scalability.

Backend Development · Big Data · Elasticsearch
0 likes · 10 min read
Big Data Technology & Architecture
Apr 3, 2019 · Big Data

Understanding RAID and Its Role in HDFS Architecture

This article explains the storage challenges of big data, introduces RAID technologies and their variants, and shows how the principles of RAID are applied in the Hadoop Distributed File System (HDFS) to achieve scalable, reliable, and high‑performance data storage and processing.

Big Data · HDFS · RAID
0 likes · 10 min read
Alibaba Cloud Developer
Apr 3, 2019 · Cloud Computing

What Alibaba Cloud’s New President Reveals About the Future of Cloud Computing

In a candid interview, Alibaba Cloud’s new president discusses how pricing is just a starting point, the shift from open‑source to self‑developed data platforms, the rapid growth of hybrid cloud, security priorities, the role of AI, the evolution of the middle‑platform concept, ecosystem integration, and the strategic focus on scaling, public‑cloud share, and partner collaboration to drive Alibaba Cloud’s future growth.

AI · Alibaba Cloud · Big Data
0 likes · 31 min read
Alibaba Cloud Developer
Apr 3, 2019 · Cloud Computing

What’s Next for Cloud Computing? Insights from Alibaba Cloud’s New President

In a detailed interview, Alibaba Cloud’s new president discusses the future of cloud computing, emphasizing the shift from price competition to core value, the importance of hybrid cloud, data processing platforms, open‑source challenges, AI integration, ecosystem strategy, and the evolving role of the cloud as a platform and integrated service.

Alibaba Cloud · Artificial Intelligence · Big Data
0 likes · 28 min read
Big Data Technology & Architecture
Apr 2, 2019 · Big Data

Understanding Hadoop MapReduce: Programming Model, WordCount Example, and Job Execution Mechanism

The article explains Hadoop's MapReduce framework as both a programming model and execution engine, detailing its map and reduce phases, the WordCount example code, job startup components, data shuffling, partitioning, and how large‑scale distributed computations are orchestrated across a cluster.

Big Data · Distributed computing · Hadoop
0 likes · 10 min read
Programmer DD
Apr 2, 2019 · Backend Development

From Freshman to Senior Engineer: A Developer’s Journey Through Java, Spring, and Big Data

This article chronicles a Chinese computer science graduate’s step‑by‑step evolution from learning basic C and Java in university to building campus apps, winning software contests, mastering Spring, Hadoop, Elasticsearch, and Neo4j, and ultimately landing offers from top tech firms, illustrating the challenges and perseverance required for a successful software engineering career.

Big Data · career · java
0 likes · 13 min read
Big Data Technology & Architecture
Apr 1, 2019 · Big Data

Comprehensive Overview of Hadoop: Core Modules, HDFS Architecture, MapReduce, YARN, and a Scala WordCount Example

This article provides a detailed introduction to Hadoop's ecosystem—including its core modules (Common, HDFS, YARN, MapReduce), the design of a high‑availability HDFS cluster, the principles of distributed file systems, and a complete Scala WordCount MapReduce program—offering a solid foundation for big‑data practitioners.

Big Data · HDFS · Hadoop
0 likes · 15 min read
Big Data Technology & Architecture
Mar 29, 2019 · Big Data

Weekly Knowledge Digest: Apache Flink Deep Dives on JOIN LATERAL, TimeInterval, Temporal Table, and State Management

This week's digest shares a personal anecdote and a series of technical deep‑dives into Apache Flink, covering JOIN LATERAL, TimeInterval JOIN, Temporal Table JOIN, state management, and related code examples, while also previewing upcoming work schedules and recommended Flink reference articles.

Apache Flink · Big Data · SQL Join
0 likes · 5 min read
dbaplus Community
Mar 27, 2019 · Big Data

How eBay Cut Hadoop Job Runtime by 60%: Real‑World CAL Log Optimization

This article explains how eBay's CAL team reduced Hadoop MapReduce job execution time and resource consumption by over 60% through targeted GC tuning, data‑skew mitigation, and algorithmic improvements, boosting job success rates to nearly 100% while handling petabyte‑scale log data.

Big Data · Data Skew · GC tuning
0 likes · 12 min read
Tencent Cloud Developer
Mar 27, 2019 · Industry Insights

How AI and Big Data Drive New Engineering Education: Insights from the 2019 IT Alliance Conference

The 2019 Information Technology New Engineering Alliance conference in Beijing gathered academia, research institutes, and industry leaders to discuss AI, big data, and curriculum innovation, highlighting Tencent's contributions to digital education, cloud certification, and the broader push for industry‑university collaboration in shaping future IT talent.

AI · Big Data · Cloud Computing
0 likes · 6 min read
NetEase Game Operations Platform
Mar 27, 2019 · Big Data

Embedding Python in Java with Jython for Real‑Time Big Data Jobs

This article explains why and how to embed Python code in Java using Jython for real‑time big‑data processing, covering performance benefits, memory‑leak pitfalls, singleton interpreter patterns, function factories, Java‑object conversion, and importing external PyPI packages with practical code examples.

Big Data · Dynamic Language · Embedding
0 likes · 11 min read
Big Data Technology & Architecture
Mar 22, 2019 · Big Data

Weekly Knowledge Points: Apache Flink Continuous Queries, Kafka Connectors, SQL Overview, JOIN Operator, and Table API

This weekly briefing introduces Apache Flink's continuous query mechanism, demonstrates how to integrate Kafka as a DataStream connector, provides an overview of Flink SQL features, explains the implementation and optimization of dual‑stream JOIN operators, and showcases the Table API with end‑to‑end examples.

Apache Flink · Big Data · Table API
0 likes · 3 min read
Big Data Technology & Architecture
Mar 21, 2019 · Big Data

Apache Flink Table API Tutorial and End‑to‑End Examples

This article provides a comprehensive tutorial on Apache Flink's Table API, explaining its concepts, core features, and a wide range of operators such as SELECT, WHERE, GROUP BY, UNION, JOIN, and various window functions, while offering complete Scala code examples, custom sources, sinks, and an end‑to‑end job that computes page‑view counts per region using event‑time tumbling windows.

Big Data · Flink · Scala
0 likes · 36 min read
Architects' Tech Alliance
Mar 21, 2019 · Cloud Computing

Understanding the Chinese Enterprise IT Landscape: Market Structure, Demand Drivers, and Technology Trends

This article analyzes China's massive enterprise ecosystem, the composition of its IT market, the human and political factors shaping demand, and how cloud computing, big data, and artificial intelligence are driving a new wave of digital transformation across state‑owned, internet, and other enterprises.

Artificial Intelligence · Big Data · China
0 likes · 14 min read
Tencent Cloud Developer
Mar 20, 2019 · Big Data

TVP Training Camp: Exploring Big Data Technologies and Trends

The inaugural TVP Training Camp on March 16 2019 in Beijing gathered Tencent Cloud’s TVP members and leading big‑data experts to discuss emerging technologies such as Greenplum, PMEM‑driven infrastructure, data‑operation optimization, and next‑generation cloud databases, while a round‑table addressed practical challenges and affirmed Tencent’s commitment to ongoing expert collaboration.

Big Data · Cloud Computing · Data Analytics
0 likes · 11 min read
Youzan Coder
Mar 20, 2019 · Big Data

Evolution of Real-Time Computing at Youzan: From Storm to Flink and Future Directions

Youzan’s real‑time computing platform progressed from early Storm deployments through Spark Streaming to a Flink‑based architecture, adding unified task management, monitoring, and dedicated streaming clusters, while now pursuing SQL‑driven jobs, a Druid OLAP engine, and a future real‑time data warehouse.

Big Data · Flink · Spark Streaming
0 likes · 14 min read
Big Data Technology & Architecture
Mar 19, 2019 · Big Data

Comprehensive Overview of SQL and Apache Flink SQL Features with Practical Code Examples

This article provides an in-depth introduction to SQL, its history and ANSI standards, then details Apache Flink's SQL capabilities—including SELECT, WHERE, GROUP BY, UNION, JOIN, window functions, and user-defined functions—accompanied by extensive code examples and a complete end‑to‑end Flink job implementation.

Apache Flink · Big Data · Streaming
0 likes · 34 min read
Big Data Technology & Architecture
Mar 17, 2019 · Big Data

Understanding Continuous Queries in Apache Flink: From Static Queries to Dynamic Tables and Trigger Simulations

This article explains how Apache Flink implements continuous queries for unbounded stream processing, compares static and continuous query semantics, demonstrates how MySQL triggers can simulate continuous queries in append‑only and update scenarios, and discusses Flink's connector, source, sink, and retraction mechanisms for correct incremental computation.

Apache Flink · Big Data · Continuous Query
0 likes · 18 min read
Big Data Technology & Architecture
Mar 13, 2019 · Big Data

Understanding Fault Tolerance and Exactly-Once Semantics in Apache Flink

This article explains Apache Flink's fault‑tolerance mechanisms, including checkpointing, barrier alignment, the differences between At‑Least‑Once and Exactly‑Once semantics, configuration options, incremental checkpointing, and the requirements for external sources and sinks to achieve end‑to‑end exactly‑once processing.

Apache Flink · Big Data · Exactly-Once
0 likes · 15 min read
JD Tech
Mar 13, 2019 · Operations

Evolution of JD Digital Technology’s Host Monitoring System “DiTing”: From V1 to V3

The article chronicles the design, evolution, and lessons learned of JD Digital Technology’s self‑built host monitoring platform “DiTing”, detailing its initial requirements, V1 architecture, subsequent V2 and V3 redesigns, encountered challenges, and future directions toward intelligent operations.

Big Data · Operations · System architecture
0 likes · 12 min read
dbaplus Community
Mar 12, 2019 · Databases

Mastering HBase Cross‑Datacenter Migration: Snapshots, Architecture, and Real‑World Tips

This article provides a comprehensive technical guide on HBase, covering its core concepts, advantages and drawbacks, architecture layers, practical use cases, and a detailed step‑by‑step process for large‑scale cross‑datacenter migration using snapshot‑based strategies, with commands, diagrams, and lessons learned.

Big Data · Data Migration · Database Architecture
0 likes · 19 min read
DataFunTalk
Mar 11, 2019 · Artificial Intelligence

Practical Implementation of Personalized Recommendation Systems: Overview, Algorithms, Challenges, and Architecture

This article presents a comprehensive overview of personalized recommendation systems, covering their purpose, common algorithms, development challenges, the multi‑layer architecture used at DataGrand, optimization techniques, and the range of services offered to enterprise customers.

Big Data · collaborative filtering · machine learning
0 likes · 18 min read
DataFunTalk
Mar 7, 2019 · Big Data

Design and Evolution of Didi's Real‑Time Data Computing Platform

The article details how Didi built and iterated its real‑time data platform, describing the shift from MySQL‑based batch processing to a Kafka‑Samza‑Druid architecture with Spark Streaming and Flink, the challenges addressed, and the current capabilities and operational metrics.

Big Data · Druid · Flink
0 likes · 9 min read
58 Tech
Mar 7, 2019 · Big Data

In-Memory Inverted Index Compression Algorithms: Overview and MILC Optimization for High‑Performance Search

This article reviews major in‑memory inverted index compression techniques such as PForDelta, PEF, and MILC, explains their principles and trade‑offs, and details practical optimizations applied at 58.com to achieve query performance comparable to uncompressed indexes while reducing memory usage by about 35 percent.

Big Data · Compression · MILC
0 likes · 17 min read
AntTech
Mar 6, 2019 · Databases

How Ant Financial Scaled the 2019 Alipay New Year Red Envelope Event with GeaBase Graph Database and Real‑Time Data Intelligence

The 2019 Alipay New Year "Five Blessings" red‑envelope campaign, serving 450 million users, leveraged Ant Financial's GeaBase distributed graph database, a real‑time data‑intelligence platform, and OceanBase elastic resources to achieve millisecond‑level ranking, seconds‑level transaction audit, and seamless high‑concurrency performance.

Alipay · Backend · Big Data
0 likes · 10 min read
HomeTech
Feb 28, 2019 · Artificial Intelligence

How to Systematically Test and Monitor AI Models in Large‑Scale Production

This article presents a comprehensive approach to testing, automating, and monitoring AI prediction models in a high‑traffic environment, covering background, challenges, evaluation metrics, data sampling methods, automated test scripts, and online monitoring to ensure model accuracy, performance, and reliability.

AI testing · Big Data · Metrics
0 likes · 13 min read
Xianyu Technology
Feb 28, 2019 · Big Data

NVID Recommendation System Architecture and Technical Solutions

The NVID recommendation system for Taobao is built on a four‑layer architecture—activity material, configuration, business process, and application—and solves environment isolation, performance, audience management, and A/B testing challenges through optimized data schemas, ID mapping, multi‑level caching with database fallback, and real‑time user targeting, while future work aims at personalized audiences and automated ad optimization.

A/B testing · Big Data · Caching
0 likes · 11 min read
AntTech
Feb 27, 2019 · Big Data

Ant Financial Data Governance: Practices and Challenges in Data Quality Management

The article details Ant Financial’s comprehensive data quality governance framework, covering its architecture, challenges, implementation strategies, and real‑world case studies, illustrating how the company integrates data monitoring, AI‑driven self‑healing, and rigorous release controls to ensure high‑quality data across its platform.

Ant Financial · Big Data · Data Platform
0 likes · 17 min read
Qunar Tech Salon
Feb 27, 2019 · Databases

Evolution of Meituan’s Database Platform: From Manual Operations to Intelligent Automation

This article outlines Meituan’s transition of its database platform from manual, script‑based operations through tool‑ and product‑centric stages to a private‑cloud and automation era, discusses current challenges such as root‑cause analysis and staffing, and shares insights on moving toward fully intelligent, data‑driven database operations.

Big Data · Cloud Computing · Databases
0 likes · 13 min read
Big Data Technology & Architecture
Feb 26, 2019 · Big Data

Deploying Apache Flink Clusters: Standalone and YARN Modes

This guide explains how to set up an Apache Flink cluster on CentOS 7 using three deployment methods—Local, Standalone, and Flink on YARN/Kubernetes—including host configuration, SSH setup, package distribution, configuration file editing, cluster start/stop commands, YARN resource manager concepts, session commands, job submission, fault‑tolerance settings, and log inspection.

Big Data · Cluster Deployment · Flink
0 likes · 11 min read
Big Data Technology & Architecture
Feb 25, 2019 · Big Data

Understanding Flink DataSetAPI and DataStreamAPI

This article introduces Apache Flink's DataSetAPI and DataStreamAPI, explains their source, transformation, and sink concepts, highlights the key differences in transformation handling, and notes the series' goal of publishing over 500 big‑data tutorials for learners from beginner to expert.

Big Data · DataSetAPI · DataStreamAPI
0 likes · 2 min read
Vipshop Quality Engineering
Feb 22, 2019 · Artificial Intelligence

How Vipshop Built an AI‑Powered Sentiment Analysis System for Real‑Time Customer Feedback

Vipshop's in‑house sentiment monitoring platform integrates web‑scraped reviews, WeChat comments and internal service messages, applying lexical sentiment scoring, dictionary‑based Chinese word segmentation, TF‑IDF keyword ranking and lightweight classification to deliver real‑time insights, alerts and actionable reports for thousands of daily user comments.

Big Data · NLP · Sentiment Analysis
0 likes · 17 min read
Beike Product & Technology
Feb 21, 2019 · Big Data

DATABUS Data Integration Platform: Architecture, Capabilities, and TiDB Ecosystem

The article presents an in‑depth overview of the DATABUS data integration platform, detailing its background, current challenges, core capabilities such as data syncing, metadata automation, real‑time subscriptions, and its reliance on TiDB, TiSpark, Hudi, and related big‑data technologies to enable near‑real‑time data warehousing.

Big Data · Data Integration · Hive
0 likes · 13 min read
Big Data Technology & Architecture
Feb 20, 2019 · Big Data

Zookeeper: The Core Coordination Service in Big Data Systems

Zookeeper, originally a side‑project of Hadoop, is a Yahoo‑developed distributed coordination framework that provides high‑availability services such as configuration management, distributed locks, and failure handling, and has become a foundational component for many big‑data systems like Hadoop, Kafka, and Dubbo.

Big Data · Configuration Management · Coordination Service
0 likes · 3 min read
Sohu Tech Products
Feb 13, 2019 · Big Data

Evolution and Implementation Details of Spark Shuffle Mechanisms

This article examines the historical evolution of Spark's shuffle implementations—from early Hash‑Based Shuffle to modern SortShuffleWriter, BypassMergeSortShuffleWriter, and UnsafeShuffleWriter—explaining their design choices, selection criteria, and the corresponding shuffle reader architecture in a production‑grade Spark 2.1.1 environment.

Big Data · Distributed computing · Shuffle
0 likes · 13 min read
Ctrip Technology
Feb 13, 2019 · R&D Management

Ctrip’s Technology Evolution: From Call‑Center Era to Big Data and AI

The article outlines Ctrip’s three‑phase technology evolution—from a simple call‑center architecture to layered internet and mobile platforms, and finally to a cloud‑based big‑data and AI‑driven ecosystem—highlighting architectural changes, operational challenges, and strategic lessons for fast‑growing internet companies.

Big Data · Ctrip · R&D Management
0 likes · 13 min read
Youzan Coder
Feb 1, 2019 · Big Data

Design and Implementation of Log Parsing for a Big Data Offline Task Platform

The article describes a log‑parsing feature for Youzan’s big‑data offline platform that captures runtime logs from Hive, Spark, DataX, MapReduce and HBase jobs, categorizes scheduling types, extracts metrics such as read/write bytes, shuffle volume and GC time, and processes them in real time via a Filebeat‑Logstash‑Kafka‑Spark‑Streaming pipeline storing results in Redis for monitoring, optimization and resource‑usage ranking.

Big Data · Resource Monitoring · YARN
0 likes · 7 min read
Didi Tech
Jan 31, 2019 · Big Data

Router-Based Federation in Hadoop: Architecture, Components, and Didi’s Deployment

Router‑Based Federation replaces Hadoop’s single‑point HDFS bottleneck with a server‑side global namespace managed by Routers and a State Store, enabling scalable, highly available sub‑clusters; Didi back‑ported the feature, deployed five Routers, fixed numerous bugs, and contributed patches to improve stability and functionality.

Big Data · HDFS · Hadoop
0 likes · 11 min read
DataFunTalk
Jan 30, 2019 · Artificial Intelligence

Real‑Time Metrics Processing Technology for Financial Risk Control and Anti‑Fraud

This article outlines the challenges of financial risk control in the internet era and presents a comprehensive real‑time metrics processing system, covering data leakage, fraud, big‑data opportunities, AI model deployment, and the technical architecture of the Bangsheng real‑time indicator platform.

AI · Big Data · Stream Processing
0 likes · 17 min read
ITFLY8 Architecture Home
Jan 29, 2019 · Operations

How to Optimize Large-Scale Log Systems for Real-Time Monitoring and Scalability

This article examines the design, deployment, and optimization of massive log systems, comparing architectures, discussing real‑time versus near‑real‑time requirements, and presenting practical improvements such as memory, CPU, network tuning, data partitioning, storage reduction, and component upgrades using ELK, Kafka, Fluentd, and HBase.

Big Data · ELK · Fluentd
0 likes · 18 min read
21CTO
Jan 26, 2019 · Big Data

Data Lake vs Data Warehouse: Which One Powers Your Business?

This article explains the core differences between data lakes and data warehouses, their respective strengths, and how they complement each other to support both exploratory analytics and routine business reporting.

Analytics · Big Data · Data Lake
0 likes · 5 min read
NetEase Game Operations Platform
Jan 25, 2019 · Big Data

Understanding Exactly-Once Semantics in Apache Flink: Challenges and Implementation

This article analyzes the difficulties of achieving exactly-once delivery in Apache Flink, explains the distinction between state and end‑to‑end semantics, and details how idempotent and transactional sinks—illustrated with the Bucketing File Sink—realize exactly‑once guarantees through checkpoint‑based two‑phase commit.

Big Data · Exactly-Once · Flink
0 likes · 13 min read
dbaplus Community
Jan 23, 2019 · Big Data

How Zhihu Built a Scalable Data‑Sync Platform with Sqoop and DataX

This article explains Zhihu's journey from ad‑hoc MySQL‑Hive sync using Oozie + Sqoop to a unified, platform‑based data synchronization service that now handles thousands of tables, over 10 TB daily, with load‑aware scheduling, incremental pulls, schema change handling, and tight integration with their offline job scheduler.

Big Data · DataX · ETL
0 likes · 14 min read
21CTO
Jan 23, 2019 · Big Data

Can 1.4 Billion Users Fit Into One WeChat Group? A Technical Feasibility Study

This article analyzes whether the entire Chinese population could be added to a single WeChat group, examining user statistics, message volume, required bandwidth, CPU processing limits, Moore's law projections, supercomputer alternatives, hardware costs, storage demands, and practical challenges, concluding that it is theoretically possible but practically infeasible.

Big Data, Performance, Server
0 likes · 10 min read
MaGe Linux Operations
Jan 23, 2019 · Big Data

How Bloom Filters Power Fast Big Data Searches with Python

This tutorial walks through building a simple Python search engine for big data, covering Bloom filter basics, tokenization with major and minor segmentation, inverted index creation, and implementing both simple and complex (AND/OR) queries, complete with code examples and visual illustrations.

AND/OR queries, Big Data, Python
0 likes · 15 min read
Tencent Cloud Developer
Jan 17, 2019 · Artificial Intelligence

Deep Learning for Big Data Recommendation Systems: Tencent's Industrial Practice

Tencent’s industrial practice shows how a large‑scale offline‑nearline‑online “Shield” recommendation architecture, powered by the DeepR framework built on RCaffe, uses deep semantic embeddings, massive neural networks and reinforcement‑learning decisions to handle billions of daily requests, demonstrating that data richness and engineering capability, not model depth alone, drive performance in big‑data recommendation systems.

Big Data, Deep Learning, Neural Network
0 likes · 13 min read
JD Tech
Jan 17, 2019 · Operations

Technical Overview of JD's Archimedes Resource Scheduling System

The article presents a detailed technical analysis of JD's Archimedes project, describing its evolution from JDOS 2.0 to a large‑scale container scheduling platform that dramatically improves resource utilization, deployment speed, and cost efficiency across JD’s data centers.

AI, Big Data, JD
0 likes · 6 min read
Youzan Coder
Jan 16, 2019 · Big Data

How Youzan Scaled Real‑Time Analytics with Flink: Architecture, Pitfalls, and Lessons

This article walks through Youzan's real‑time platform architecture, explains why Flink was chosen over Spark Structured Streaming, details practical challenges such as container over‑provisioning and monitoring overhead, shares solutions for Spring integration and async caching, and outlines future directions for SQL‑based streaming and scheduler improvements.

Big Data, Flink, Real-time Streaming
0 likes · 19 min read
StarRing Big Data Open Lab
Jan 16, 2019 · Big Data

What’s New in Transwarp TDH 5.2.3? Key Performance and Stability Enhancements

TDH 5.2.3 introduces a series of stability and performance upgrades—including transaction and compaction optimizations, enhanced error handling, SQL length protection, improved Oracle‑compatible UDFs, default resource pool support, Guardian caching, TxSQL monitoring, and workflow and OLAP engine fixes—aimed at delivering a more reliable big‑data platform.

Big Data, Database, Optimization
0 likes · 10 min read
dbaplus Community
Jan 13, 2019 · Databases

January 2019 DB-Engines Newsletter: Latest Database Releases & Key Features

The January 2019 DB-Engines newsletter compiles the newest releases, feature highlights, and performance improvements across RDBMS, NoSQL, NewSQL, time‑series, big‑data, domestic, and cloud database families, while also explaining the ranking methodology and providing download links for the full issue.

Big Data, Cloud Computing, Databases
0 likes · 41 min read
Youzan Coder
Jan 9, 2019 · Big Data

How Youzan Scaled 5,000 Daily SparkSQL Jobs: Migration Lessons from Hive

This article details Youzan's transition from Hive to SparkSQL, covering platform architecture, usability and performance enhancements, migration strategies, automated engine selection, and future plans that together reduced resource consumption by up to 67% while handling thousands of daily jobs.

Availability, Big Data, Data Platform
0 likes · 13 min read
dbaplus Community
Jan 3, 2019 · Backend Development

Supercharging Elasticsearch for Billion-Row Queries: Practical Tips

This guide details how to optimize Elasticsearch for handling billions of daily records, covering core Lucene concepts, index and shard configuration, performance‑tuning parameters, and practical testing methods to achieve sub‑second query responses and long‑term data retention.

Big Data, Elasticsearch, Performance Optimization
0 likes · 13 min read
Big Data Technology & Architecture
Jan 2, 2019 · Big Data

Optimizing Spark Direct Kafka Consumption: Subpartition Concurrency and Repartition Strategies

To address the long processing time caused by uneven Spark partitions when reading Kafka via the Direct approach, this article explains the SPARK‑22056 solution that modifies KafkaRDD.getPartitions to support a configurable 'topic.partition.subconcurrency' parameter, discusses its trade‑offs, and presents alternative repartition and multithreading techniques.

Big Data, Partitioning, Scala
0 likes · 6 min read
Big Data Technology & Architecture
Jan 2, 2019 · Big Data

Understanding Spark Streaming Backpressure Mechanism

The article explains how Spark Streaming backpressure, introduced in version 1.5, automatically adjusts data ingestion rates based on processing delays, replaces manual rate limits, and details its architecture, configuration parameters, and usage for preventing data backlog and executor OOM.

Big Data, Rate Control, Spark
0 likes · 6 min read
Big Data Technology & Architecture
Jan 1, 2019 · Big Data

Insights from the Real-Time Big Data Meetup: Spark Structured Streaming Overview

The meetup on September 8, co‑hosted by InfoQ and Huawei Cloud, featured Databricks engineer Tathagata Das explaining Spark Structured Streaming’s concepts, fault‑tolerance, performance, event‑time handling, and real‑world use cases such as Apple’s security platform, highlighting its scalability and integration with various data sources.

Big Data, Spark, Structured Streaming
0 likes · 8 min read
Architects Research Society
Dec 30, 2018 · Big Data

Overview of Major Apache Big Data Processing Frameworks

This article provides a concise overview of numerous Apache open‑source projects—including Ignite, MapReduce, Pig, JAQL, Spark, Storm, Flink, Apex, REEF, Twill, and Beam—that enable distributed in‑memory storage, real‑time and batch processing, and advanced analytics for large‑scale data workloads.

Apache, Big Data, Distributed computing
0 likes · 22 min read
Tencent Cloud Developer
Dec 28, 2018 · Big Data

Intelligent Operations for Tencent Cloud Big Data Platform: Challenges, Practices, and Future Directions

Tencent Cloud’s big‑data platform manages massive, multi‑component clusters with an AIOps framework that aggregates logs and metrics, applies statistical and machine‑learning anomaly detection, uses regression and reinforcement learning to optimize job parameters, and integrates offline and online pipelines. The framework achieves over 88% precision, and the team plans to add automated root‑cause analysis, productized tools, platform‑level algorithm integration, and cross‑domain model reuse.

Big Data, Cloud Computing, Intelligent Operations
0 likes · 20 min read