Tagged articles

3697 articles

Apr 18, 2019 · Big Data

How MaxCompute Evolved: 10 Years of Big Data Innovation at Alibaba

This article reviews a decade of MaxCompute development, covering its origins, core technologies, performance gains, ecosystem integration, intelligent features, competitive positioning, and commercialization, while highlighting the platform's role as Alibaba's central big‑data compute engine.

AI Integration · Big Data · Data Storage

0 likes · 21 min read

Big Data Technology & Architecture

Apr 17, 2019 · Big Data

Step-by-Step Guide to Installing Hive 2.1.0 on a Hadoop 2.7.1 Cluster (Ubuntu 14.04)

This tutorial provides a comprehensive, step-by-step procedure for setting up Hive 2.1.0 on a Hadoop 2.7.1 cluster running Ubuntu 14.04, covering environment preparation, Hive installation, configuration of environment variables, MySQL metastore integration, client setup, service startup, and basic verification commands.

Big Data · Hadoop · Hive

0 likes · 8 min read

DataFunTalk

Apr 17, 2019 · Artificial Intelligence

Evolution of Ctrip Financial Risk Control Models: From Data Platform to AI‑Driven Scoring and Anti‑Fraud Systems

This report details Ctrip Financial's end‑to‑end risk control development, covering business overview, a three‑layer data platform, the progression of credit scoring and anti‑fraud models from rule‑based to advanced AI techniques, and the evaluation, monitoring, and social‑network‑based fraud detection strategies employed.

Big Data · Financial AI · anti-fraud

0 likes · 16 min read

dbaplus Community

Apr 16, 2019 · Big Data

Scaling Elasticsearch for Billions of Daily Events: Cluster Planning, Routing & Hot‑Warm Tips

This article explains how to handle a real‑time OLAP monitoring platform processing 10‑12 billion daily events and 400 billion yearly records by optimizing Elasticsearch 5.3.3 through cluster planning, storage strategies, index sharding, compression, hot‑warm architecture, routing, index templates, rollover, and cross‑cluster search, providing concrete configurations and code examples.

Big Data · Cluster Planning · Elasticsearch

0 likes · 23 min read

Big Data Technology & Architecture

Apr 15, 2019 · Big Data

Map‑Side Join and Reduce‑Side Join Examples in Hadoop MapReduce (Java)

This article provides two reusable Java code samples that demonstrate how to perform a map‑side join and a reduce‑side join in Hadoop MapReduce, enabling efficient joining of a large dataset with a smaller reference table.

Big Data · Hadoop · Join

0 likes · 8 min read

21CTO

Apr 15, 2019 · Big Data

Mastering High‑Concurrency Big Data: Sharding, Partitioning, and Index Strategies

This article explores practical techniques for handling massive, high‑concurrency data workloads, covering relational database limits, read/write separation, vertical and horizontal sharding, key selection, archival to NoSQL stores, and the use of heterogeneous index tables to maintain performance.

Big Data · Partitioning · Sharding

0 likes · 6 min read

Alibaba Cloud Developer

Apr 15, 2019 · Artificial Intelligence

Why Deep Learning Finally Succeeded and What Challenges Lie Ahead

This article reviews Jia Yangqing’s insights on why deep learning finally succeeded—highlighting the roles of big data and high‑performance computing—while examining its current limitations, emerging challenges, and future opportunities across AI engineering, AutoML, and hardware‑software co‑design.

AI Challenges · AI Engineering · AutoML

0 likes · 9 min read

Big Data Technology & Architecture

Apr 12, 2019 · Big Data

Weekly Knowledge Summary: Yarn Resource Scheduler, Hadoop Rack Awareness, HDFS Data Flow, and Small File Solutions

This weekly note shares personal updates and a concise technical overview covering Yarn's resource scheduling, Hadoop's rack‑aware architecture, HDFS data flow, and practical solutions to the HDFS small‑file problem, along with links to further reading and upcoming work plans.

Big Data · HDFS · Hadoop

0 likes · 5 min read

System Architect Go

Apr 11, 2019 · Big Data

Introduction to Apache Kafka: Core Concepts, Message Delivery, Partition Storage, and Consumption

This article introduces Apache Kafka as a distributed streaming platform, explaining its three core capabilities, key concepts such as producers, topics, brokers, partitions and consumers, and detailing how messages are delivered, stored in partitions, and consumed by consumer groups.

Big Data · Distributed Streaming · Kafka

0 likes · 8 min read

Architecture Digest

Apr 11, 2019 · Big Data

Understanding Hadoop and HBase: Installation, Configuration, and Basic Operations

This guide introduces Hadoop and HBase fundamentals, explains their architectures and advantages, and provides step‑by‑step instructions for setting up a multi‑node Hadoop cluster, configuring core services, installing HBase, and performing basic HBase shell operations.

Big Data · HBase · Hadoop

0 likes · 18 min read

JD Retail Technology

Apr 10, 2019 · Databases

HBase at JD.com: Architecture, Use Cases, and Evolution

This article explains how JD.com leverages the open‑source HBase database for massive, low‑latency data storage across various business lines, detailing its architecture, multi‑tenant isolation, disaster‑recovery mechanisms, and integration with Phoenix SQL for OLTP workloads.

Big Data · Database Architecture · HBase

0 likes · 13 min read

Java Captain

Apr 9, 2019 · Big Data

Kafka FAQs: Zookeeper Dependency, Retention Policies, Cleanup Rules, Performance Bottlenecks, and Cluster Best Practices

This article answers common Kafka questions, explaining why Kafka cannot operate without Zookeeper, describing its two retention strategies based on time and size, detailing how simultaneous time‑ and size‑based cleanup works, listing performance bottlenecks, and offering practical guidelines for sizing and configuring Kafka clusters.

Big Data · Cluster Design · Kafka

0 likes · 2 min read

Big Data Technology & Architecture

Apr 8, 2019 · Big Data

Understanding HDFS Data Blocks, Rack Awareness, and Dynamic Node Addition

This article explains how HDFS stores files in replicated data blocks, implements rack awareness to improve reliability and performance, shows the necessary configuration in core-site.xml, provides sample scripts, and demonstrates how to add new DataNode machines without restarting the NameNode.

Big Data · Data Block · Dynamic Node Addition

0 likes · 10 min read

Big Data Technology & Architecture

Apr 7, 2019 · Big Data

Understanding YARN: Background, Architecture, and Execution Process

This article explains why YARN was created to overcome the limitations of MapReduce 1.x, describes its architecture—including ResourceManager, NodeManager, ApplicationMaster, Container, and Client—and outlines the step‑by‑step execution flow that enables multiple computation frameworks to run on Hadoop.

Big Data · Distributed computing · Hadoop

0 likes · 11 min read

Youzan Coder

Apr 7, 2019 · Industry Insights

How Youzan Scaled Order Search: Hot‑State Indexing and AKF Expansion

This article reviews the evolution of Youzan's order search architecture over two years, detailing challenges from data growth, the creation of a hot‑state index covering half of search traffic, time‑sharded indexes, and the AKF expansion cube that guides multi‑axis scalability.

Backend Development · Big Data · Elasticsearch

0 likes · 10 min read

Big Data Technology & Architecture

Apr 4, 2019 · Big Data

Weekly Knowledge Points: Interview Reflections, Hadoop Introduction, MapReduce and HDFS Overview

This weekly briefing shares five curated resources covering interview reflections, a concise Hadoop introduction, the principles of MapReduce, an overview of HDFS, and upcoming plans to study Hive and HBase, emphasizing the distributed nature of big‑data processing.

Big Data · HDFS · Hadoop

0 likes · 3 min read

Big Data Technology & Architecture

Apr 3, 2019 · Big Data

Understanding RAID and Its Role in HDFS Architecture

This article explains the storage challenges of big data, introduces RAID technologies and their variants, and shows how the principles of RAID are applied in the Hadoop Distributed File System (HDFS) to achieve scalable, reliable, and high‑performance data storage and processing.

Big Data · HDFS · RAID

0 likes · 10 min read

Alibaba Cloud Developer

Apr 3, 2019 · Cloud Computing

What Alibaba Cloud’s New President Reveals About the Future of Cloud Computing

In a candid interview, Alibaba Cloud’s new president discusses how pricing is just a starting point, the shift from open‑source to self‑developed data platforms, the rapid growth of hybrid cloud, security priorities, the role of AI, the evolution of the middle‑platform concept, ecosystem integration, and the strategic focus on scaling, public‑cloud share, and partner collaboration to drive Alibaba Cloud’s future growth.

AI · Alibaba Cloud · Big Data

0 likes · 31 min read

Alibaba Cloud Developer

Apr 3, 2019 · Cloud Computing

What’s Next for Cloud Computing? Insights from Alibaba Cloud’s New President

In a detailed interview, Alibaba Cloud’s new president discusses the future of cloud computing, emphasizing the shift from price competition to core value, the importance of hybrid cloud, data processing platforms, open‑source challenges, AI integration, ecosystem strategy, and the evolving role of the cloud as a platform and integrated service.

Alibaba Cloud · Artificial Intelligence · Big Data

0 likes · 28 min read

Big Data Technology & Architecture

Apr 2, 2019 · Big Data

Understanding Hadoop MapReduce: Programming Model, WordCount Example, and Job Execution Mechanism

The article explains Hadoop's MapReduce framework as both a programming model and execution engine, detailing its map and reduce phases, the WordCount example code, job startup components, data shuffling, partitioning, and how large‑scale distributed computations are orchestrated across a cluster.

Big Data · Distributed computing · Hadoop

0 likes · 10 min read

Alibaba Cloud Native

Apr 2, 2019 · Big Data

Inside Spark Operator: How Kubernetes Manages Spark Jobs End‑to‑End

This article explains the internal architecture of Spark Operator, covering Kubernetes operator fundamentals, CRD definitions, code layout, job submission flow, state machine handling, monitoring integration, and troubleshooting techniques for reliable Spark workloads on Kubernetes.

Big Data · CRD · Go

0 likes · 11 min read

Programmer DD

Apr 2, 2019 · Backend Development

From Freshman to Senior Engineer: A Developer’s Journey Through Java, Spring, and Big Data

This article chronicles a Chinese computer science graduate’s step‑by‑step evolution from learning basic C and Java in university to building campus apps, winning software contests, mastering Spring, Hadoop, Elasticsearch, and Neo4j, and ultimately landing offers from top tech firms, illustrating the challenges and perseverance required for a successful software engineering career.

Big Data · career · java

0 likes · 13 min read

Big Data Technology & Architecture

Apr 1, 2019 · Big Data

Comprehensive Overview of Hadoop: Core Modules, HDFS Architecture, MapReduce, YARN, and a Scala WordCount Example

This article provides a detailed introduction to Hadoop's ecosystem—including its core modules (Common, HDFS, YARN, MapReduce), the design of a high‑availability HDFS cluster, the principles of distributed file systems, and a complete Scala WordCount MapReduce program—offering a solid foundation for big‑data practitioners.

Big Data · HDFS · Hadoop

0 likes · 15 min read

Big Data Technology & Architecture

Mar 29, 2019 · Big Data

Weekly Knowledge Digest: Apache Flink Deep Dives on JOIN LATERAL, TimeInterval, Temporal Table, and State Management

This week's digest shares a personal anecdote and a series of technical deep‑dives into Apache Flink, covering JOIN LATERAL, TimeInterval JOIN, Temporal Table JOIN, state management, and related code examples, while also previewing upcoming work schedules and recommended Flink reference articles.

Apache Flink · Big Data · SQL Join

0 likes · 5 min read

dbaplus Community

Mar 27, 2019 · Big Data

How eBay Cut Hadoop Job Runtime by 60%: Real‑World CAL Log Optimization

This article explains how eBay's CAL team reduced Hadoop MapReduce job execution time and resource consumption by over 60% through targeted GC tuning, data‑skew mitigation, and algorithmic improvements, boosting job success rates to nearly 100% while handling petabyte‑scale log data.

Big Data · Data Skew · GC tuning

0 likes · 12 min read

Tencent Cloud Developer

Mar 27, 2019 · Industry Insights

How AI and Big Data Drive New Engineering Education: Insights from the 2019 IT Alliance Conference

The 2019 Information Technology New Engineering Alliance conference in Beijing gathered academia, research institutes, and industry leaders to discuss AI, big data, and curriculum innovation, highlighting Tencent's contributions to digital education, cloud certification, and the broader push for industry‑university collaboration in shaping future IT talent.

AI · Big Data · Cloud Computing

0 likes · 6 min read

NetEase Game Operations Platform

Mar 27, 2019 · Big Data

Embedding Python in Java with Jython for Real‑Time Big Data Jobs

This article explains why and how to embed Python code in Java using Jython for real‑time big‑data processing, covering performance benefits, memory‑leak pitfalls, singleton interpreter patterns, function factories, Java‑object conversion, and importing external PyPI packages with practical code examples.

Big Data · Dynamic Language · Embedding

0 likes · 11 min read

Big Data Technology & Architecture

Mar 25, 2019 · Big Data

Understanding Apache Flink Interval Join: Syntax, Semantics, and Implementation

This article explains how Apache Flink's Interval Join solves time‑bounded join requirements more efficiently than unbounded joins, covering its syntax, semantics, state‑management considerations, and providing a complete Scala example with code and execution results.

Apache Flink · Big Data · Interval Join

0 likes · 11 min read

Big Data Technology & Architecture

Mar 22, 2019 · Big Data

Weekly Knowledge Points: Apache Flink Continuous Queries, Kafka Connectors, SQL Overview, JOIN Operator, and Table API

This weekly briefing introduces Apache Flink's continuous query mechanism, demonstrates how to integrate Kafka as a DataStream connector, provides an overview of Flink SQL features, explains the implementation and optimization of dual‑stream JOIN operators, and showcases the Table API with end‑to‑end examples.

Apache Flink · Big Data · Table API

0 likes · 3 min read

Big Data Technology & Architecture

Mar 21, 2019 · Big Data

Apache Flink Table API Tutorial and End‑to‑End Examples

This article provides a comprehensive tutorial on Apache Flink's Table API, explaining its concepts, core features, and a wide range of operators such as SELECT, WHERE, GROUP BY, UNION, JOIN, and various window functions, while offering complete Scala code examples, custom sources, sinks, and an end‑to‑end job that computes page‑view counts per region using event‑time tumbling windows.

Big Data · Flink · Scala

0 likes · 36 min read

Architects' Tech Alliance

Mar 21, 2019 · Cloud Computing

Understanding the Chinese Enterprise IT Landscape: Market Structure, Demand Drivers, and Technology Trends

This article analyzes China's massive enterprise ecosystem, the composition of its IT market, the human and political factors shaping demand, and how cloud computing, big data, and artificial intelligence are driving a new wave of digital transformation across state‑owned, internet, and other enterprises.

Artificial Intelligence · Big Data · China

0 likes · 14 min read

Xianyu Technology

Mar 21, 2019 · Big Data

Design and Implementation of the Mahé Real-Time Product Selection System Using Blink Stream Computing

Mahé, Xianyu’s real‑time product selection platform, uses Alibaba’s Blink stream engine to merge, evaluate roughly 300 rule‑based filters per item and emit only changed results, processing 1.4 billion daily messages at up to 50 k TPS through a four‑layer, stateful architecture.

Big Data · Flink · Rule Engine

0 likes · 15 min read

Tencent Cloud Developer

Mar 20, 2019 · Big Data

TVP Training Camp: Exploring Big Data Technologies and Trends

The inaugural TVP Training Camp on March 16 2019 in Beijing gathered Tencent Cloud’s TVP members and leading big‑data experts to discuss emerging technologies such as Greenplum, PMEM‑driven infrastructure, data‑operation optimization, and next‑generation cloud databases, while a round‑table addressed practical challenges and affirmed Tencent’s commitment to ongoing expert collaboration.

Big Data · Cloud Computing · Data Analytics

0 likes · 11 min read

Youzan Coder

Mar 20, 2019 · Big Data

Evolution of Real-Time Computing at Youzan: From Storm to Flink and Future Directions

Youzan’s real‑time computing platform progressed from early Storm deployments through Spark Streaming to a Flink‑based architecture, adding unified task management, monitoring, and dedicated streaming clusters, while now pursuing SQL‑driven jobs, a Druid OLAP engine, and a future real‑time data warehouse.

Big Data · Flink · Spark Streaming

0 likes · 14 min read

Big Data Technology & Architecture

Mar 19, 2019 · Big Data

Comprehensive Overview of SQL and Apache Flink SQL Features with Practical Code Examples

This article provides an in-depth introduction to SQL, its history and ANSI standards, then details Apache Flink's SQL capabilities—including SELECT, WHERE, GROUP BY, UNION, JOIN, window functions, and user-defined functions—accompanied by extensive code examples and a complete end‑to‑end Flink job implementation.

Apache Flink · Big Data · Streaming

0 likes · 34 min read

Architects' Tech Alliance

Mar 18, 2019 · Big Data

Understanding HDFS Architecture, NameNode HA, and Read/Write Processes

This article explains the concepts and architecture of HDFS, the high‑availability mechanisms of NameNode including quorum‑based shared storage, the detailed read and write workflows of the distributed file system, and discusses its typical use cases and limitations.

Big Data · HA · HDFS

0 likes · 16 min read

Big Data Technology & Architecture

Mar 17, 2019 · Big Data

Understanding Continuous Queries in Apache Flink: From Static Queries to Dynamic Tables and Trigger Simulations

This article explains how Apache Flink implements continuous queries for unbounded stream processing, compares static and continuous query semantics, demonstrates how MySQL triggers can simulate continuous queries in append‑only and update scenarios, and discusses Flink's connector, source, sink, and retraction mechanisms for correct incremental computation.

Apache Flink · Big Data · Continuous Query

0 likes · 18 min read

dbaplus Community

Mar 14, 2019 · Operations

How Top Internet Companies Scale Spark CI/CD Across Tens of Thousands of Nodes

This article details a practical, production‑grade Spark CI/CD workflow using GitLab and Jenkins, covering source management, multi‑branch release strategies, automated testing, gray‑release, hot‑fix handling, and rollback mechanisms for large‑scale deployments.

Big Data · Continuous Delivery · GitLab

0 likes · 17 min read

Big Data Technology & Architecture

Mar 13, 2019 · Big Data

Understanding Fault Tolerance and Exactly-Once Semantics in Apache Flink

This article explains Apache Flink's fault‑tolerance mechanisms, including checkpointing, barrier alignment, the differences between At‑Least‑Once and Exactly‑Once semantics, configuration options, incremental checkpointing, and the requirements for external sources and sinks to achieve end‑to‑end exactly‑once processing.

Apache Flink · Big Data · Exactly-Once

0 likes · 15 min read

JD Tech

Mar 13, 2019 · Operations

Evolution of JD Digital Technology’s Host Monitoring System “DiTing”: From V1 to V3

The article chronicles the design, evolution, and lessons learned of JD Digital Technology’s self‑built host monitoring platform “DiTing”, detailing its initial requirements, V1 architecture, subsequent V2 and V3 redesigns, encountered challenges, and future directions toward intelligent operations.

Big Data · Operations · System architecture

0 likes · 12 min read

dbaplus Community

Mar 12, 2019 · Databases

Mastering HBase Cross‑Datacenter Migration: Snapshots, Architecture, and Real‑World Tips

This article provides a comprehensive technical guide on HBase, covering its core concepts, advantages and drawbacks, architecture layers, practical use cases, and a detailed step‑by‑step process for large‑scale cross‑datacenter migration using snapshot‑based strategies, with commands, diagrams, and lessons learned.

Big Data · Data Migration · Database Architecture

0 likes · 19 min read

DataFunTalk

Mar 11, 2019 · Artificial Intelligence

Practical Implementation of Personalized Recommendation Systems: Overview, Algorithms, Challenges, and Architecture

This article presents a comprehensive overview of personalized recommendation systems, covering their purpose, common algorithms, development challenges, the multi‑layer architecture used at DataGrand, optimization techniques, and the range of services offered to enterprise customers.

Big Data · collaborative filtering · machine learning

0 likes · 18 min read

JD Tech Talk

Mar 11, 2019 · Operations

Evolution of JD Digital Technology’s Host Monitoring System “Diting”: Architecture from V1 to V3

The article chronicles the design, implementation, and iterative evolution of JD Digital Technology’s in‑house host monitoring platform Diting, detailing its V1, V2, and V3 architectures, the challenges encountered at each stage, and future directions toward intelligent, automated operations.

Big Data · Distributed Systems · Operations

0 likes · 14 min read

DataFunTalk

Mar 7, 2019 · Big Data

Design and Evolution of Didi's Real‑Time Data Computing Platform

The article details how Didi built and iterated its real‑time data platform, describing the shift from MySQL‑based batch processing to a Kafka‑Samza‑Druid architecture with Spark Streaming and Flink, the challenges addressed, and the current capabilities and operational metrics.

Big Data · Druid · Flink

0 likes · 9 min read

58 Tech

Mar 7, 2019 · Big Data

In-Memory Inverted Index Compression Algorithms: Overview and MILC Optimization for High‑Performance Search

This article reviews major in‑memory inverted index compression techniques such as PForDelta, PEF, and MILC, explains their principles and trade‑offs, and details practical optimizations applied at 58.com to achieve query performance comparable to uncompressed indexes while reducing memory usage by about 35 percent.

Big Data · Compression · MILC

0 likes · 17 min read

Big Data Technology & Architecture

Mar 6, 2019 · Big Data

Using Flink Redis Sink for Streaming WordCount from Kafka to Redis

This tutorial demonstrates how to integrate Apache Flink with Redis as a sink, showing the Maven dependency, a custom RedisMapper implementation, and a complete Flink job that reads Kafka messages, performs word count, and stores results in Redis, with plans for HBase and MySQL extensions.

Big Data · Flink · Redis

0 likes · 4 min read

AntTech

Mar 6, 2019 · Databases

How Ant Financial Scaled the 2019 Alipay New Year Red Envelope Event with GeaBase Graph Database and Real‑Time Data Intelligence

The 2019 Alipay New Year "Five Blessings" red‑envelope campaign, serving 450 million users, leveraged Ant Financial's GeaBase distributed graph database, a real‑time data‑intelligence platform, and OceanBase elastic resources to deliver rankings in milliseconds, transaction audits in seconds, and stable performance under high concurrency.

Alipay · Backend · Big Data

0 likes · 10 min read

Big Data Technology & Architecture

Mar 4, 2019 · Big Data

Apache Flink Table API and SQL Tutorial with Code Examples

This article introduces Apache Flink’s Table API and SQL, explains the TableEnvironment programming model, shows how to register tables and sinks, and provides two Java examples (WordCount and a file‑based aggregation) with complete code that can be downloaded for local testing.

Big Data · DataStream · Flink

0 likes · 7 min read

Big Data Technology & Architecture

Mar 3, 2019 · Big Data

Getting Started with Flink Kafka Connector: Concepts, Setup, and Sample Code

This article introduces the Flink‑Kafka connector, explains essential Kafka concepts, shows how to configure checkpointing, provides Maven dependencies, and includes complete Java examples for both producing to and consuming from Kafka within a Flink streaming job.

Big Data · Connector · Flink

0 likes · 8 min read

Big Data Technology & Architecture

Mar 2, 2019 · Big Data

Understanding and Using Broadcast Variables in Apache Flink

This article explains the concept, usage, precautions, and a practical example of broadcast variables in Apache Flink, illustrating how to initialize, broadcast, retrieve, and apply shared data across parallel operators with Java code snippets.

Big Data · Broadcast Variable · Distributed computing

0 likes · 4 min read

Big Data Technology & Architecture

Mar 1, 2019 · Big Data

Understanding Watermarks in Apache Flink for Handling Out-of-Order Events

This article explains how Apache Flink uses Watermarks to manage event‑time windows, describes the three time semantics, details periodic and punctuated Watermark generation methods with their Java interfaces, and shows practical DDL examples for handling late and out‑of‑order data in stream processing.

Apache Flink · Big Data · EventTime

0 likes · 11 min read

DataFunTalk

Mar 1, 2019 · Big Data

Renrenche Mobile Data Platform: Architecture, Real‑Time Computing, and BI Solutions

The article presents Renrenche’s end‑to‑end mobile data platform, detailing its overall architecture, real‑time Spark‑based computation engine, Web IDE, metadata management, BI reporting built on ClickHouse, and how data‑driven practices empower both online and offline business operations.

BI reporting · Big Data · ClickHouse

0 likes · 15 min read

Big Data Technology & Architecture

Feb 28, 2019 · Big Data

Understanding Time Semantics in Apache Flink: Processing Time, Event Time, and Ingestion Time

This article introduces Apache Flink's three time semantics—Processing Time, Event Time, and Ingestion Time—explaining their definitions, differences, and practical implications for windowing and stream processing, while also providing links to introductory Flink tutorials.

Big Data · Event Time · Flink

0 likes · 7 min read

Big Data Technology & Architecture

Feb 28, 2019 · Big Data

Understanding Flink Window Types and Their Implementations

This article explains Flink's window concepts—including time‑based, count‑based, tumbling, sliding, and session windows—provides practical Scala code examples for each type, and links to related resources on Flink basics, APIs, deployment, and advanced features.

Big Data · Flink · Scala

0 likes · 5 min read

HomeTech

Feb 28, 2019 · Artificial Intelligence

How to Systematically Test and Monitor AI Models in Large‑Scale Production

This article presents a comprehensive approach to testing, automating, and monitoring AI prediction models in a high‑traffic environment, covering background, challenges, evaluation metrics, data sampling methods, automated test scripts, and online monitoring to ensure model accuracy, performance, and reliability.

AI testing · Big Data · Metrics

0 likes · 13 min read

Xianyu Technology

Feb 28, 2019 · Big Data

NVID Recommendation System Architecture and Technical Solutions

The NVID recommendation system for Taobao is built on a four‑layer architecture—activity material, configuration, business process, and application—and solves environment isolation, performance, audience management, and A/B testing challenges through optimized data schemas, ID mapping, multi‑level caching with database fallback, and real‑time user targeting, while future work aims at personalized audiences and automated ad optimization.

A/B testing · Big Data · Caching

0 likes · 11 min read

Big Data Technology & Architecture

Feb 27, 2019 · Big Data

Understanding Flink Restart Strategies: Configuration and Code Examples

This article explains Flink's restart strategies—including fixed‑delay, failure‑rate, and no‑restart—how to configure them globally via flink‑conf.yaml or programmatically in code, and provides complete Java examples demonstrating each approach.

Big Data · Flink · Restart Strategy

0 likes · 4 min read

Big Data Technology & Architecture

Feb 27, 2019 · Big Data

Using Flink Distributed Cache: Overview and Example

This article explains Flink's distributed cache feature, describes its registration and retrieval mechanisms, and provides a complete Java example that demonstrates how to register a file, access it within a RichMapFunction, and print the processed results.

Big Data · Dataset API · Distributed Cache

0 likes · 4 min read

AntTech

Feb 27, 2019 · Big Data

Ant Financial Data Governance: Practices and Challenges in Data Quality Management

The article details Ant Financial’s comprehensive data quality governance framework, covering its architecture, challenges, implementation strategies, and real‑world case studies, illustrating how the company integrates data monitoring, AI‑driven self‑healing, and rigorous release controls to ensure high‑quality data across its platform.

Ant Financial · Big Data · Data Platform

0 likes · 17 min read

Qunar Tech Salon

Feb 27, 2019 · Databases

Evolution of Meituan’s Database Platform: From Manual Operations to Intelligent Automation

This article outlines Meituan’s transition of its database platform from manual, script‑based operations through tool‑ and product‑centric stages to a private‑cloud and automation era, discusses current challenges such as root‑cause analysis and staffing, and shares insights on moving toward fully intelligent, data‑driven database operations.

Big Data · Cloud Computing · Databases

0 likes · 13 min read

Big Data Technology & Architecture

Feb 26, 2019 · Big Data

Deploying Apache Flink Clusters: Standalone and YARN Modes

This guide explains how to set up an Apache Flink cluster on CentOS 7 using three deployment methods—Local, Standalone, and Flink on YARN/Kubernetes—including host configuration, SSH setup, package distribution, configuration file editing, cluster start/stop commands, YARN resource manager concepts, session commands, job submission, fault‑tolerance settings, and log inspection.

Big Data · Cluster Deployment · Flink

0 likes · 11 min read

Big Data Technology & Architecture

Feb 25, 2019 · Big Data

Understanding Flink DataSetAPI and DataStreamAPI

This article introduces Apache Flink's DataSetAPI and DataStreamAPI, explains their source, transformation, and sink concepts, highlights the key differences in transformation handling, and notes the series' goal of publishing over 500 big‑data tutorials for learners from beginner to expert.

Big Data · DataSetAPI · DataStreamAPI

0 likes · 2 min read

Efficient Ops

Feb 24, 2019 · Databases

Why Row vs Column Storage Matters: Understanding HBase’s Column‑Family Model

This article explains the differences between row‑oriented and column‑oriented storage, compares their trade‑offs, and introduces HBase’s column‑family architecture, including row keys, column qualifiers, timestamps, cells, and how it maps to a multi‑dimensional map structure.

Big Data · Columnar Storage · Databases

0 likes · 7 min read

Vipshop Quality Engineering

Feb 22, 2019 · Artificial Intelligence

How Vipshop Built an AI‑Powered Sentiment Analysis System for Real‑Time Customer Feedback

Vipshop's in‑house sentiment monitoring platform integrates web‑scraped reviews, WeChat comments and internal service messages, applying lexical sentiment scoring, dictionary‑based Chinese word segmentation, TF‑IDF keyword ranking and lightweight classification to deliver real‑time insights, alerts and actionable reports for thousands of daily user comments.

Big Data · NLP · Sentiment Analysis

0 likes · 17 min read

Beike Product & Technology

Feb 21, 2019 · Big Data

DATABUS Data Integration Platform: Architecture, Capabilities, and TiDB Ecosystem

The article presents an in‑depth overview of the DATABUS data integration platform, detailing its background, current challenges, core capabilities such as data syncing, metadata automation, real‑time subscriptions, and its reliance on TiDB, TiSpark, Hudi, and related big‑data technologies to enable near‑real‑time data warehousing.

Big Data · Data Integration · Hive

0 likes · 13 min read

Big Data Technology & Architecture

Feb 20, 2019 · Big Data

ZooKeeper: The Core Coordination Service in Big Data Systems

ZooKeeper, originally a Hadoop sub‑project developed at Yahoo, is a distributed coordination service that provides configuration management, distributed locks, and failure handling with high availability, and it has become a foundational component for many distributed systems such as Hadoop, Kafka, and Dubbo.

Big Data · Configuration Management · Coordination Service

0 likes · 3 min read

dbaplus Community

Feb 19, 2019 · Big Data

Mastering HDFS Monitoring on JD Cloud: Key Metrics, Tools, and Best Practices

This article presents a comprehensive guide to monitoring Hadoop Distributed File System (HDFS) on JD Cloud, covering challenges, recommended toolchains, essential metrics, configuration tips, and real‑world case studies to help engineers ensure reliability and performance of large‑scale data clusters.

Big Data · ELK · HDFS

0 likes · 14 min read

Big Data Technology & Architecture

Feb 18, 2019 · Big Data

Big Data Mastery Series – Distributed Theory Foundations and Principles

This article introduces the foundational concepts and principles of distributed systems—including basic concepts, consistency models, CAP theorem, logical clocks, and advanced protocols like Paxos, Raft, and Zab—serving as the first part of a comprehensive Big Data mastery series.

Big Data · CAP theorem · Consistency

0 likes · 4 min read

Tencent Cloud Developer

Feb 14, 2019 · Industry Insights

Turning IoT Data into Fully Automated Smart Parks: Key Stages & Architecture

The article outlines how rapid urban growth drives smart park initiatives that leverage IoT, big‑data analytics, digital twins, and full‑process visualization to evolve from efficient management to ecosystem integration and ultimately to fully automated, self‑governing urban micro‑environments.

Big Data · Digital Twin · Industry Insights

0 likes · 11 min read

Sohu Tech Products

Feb 13, 2019 · Big Data

Evolution and Implementation Details of Spark Shuffle Mechanisms

This article examines the historical evolution of Spark's shuffle implementations—from early Hash‑Based Shuffle to modern SortShuffleWriter, BypassMergeSortShuffleWriter, and UnsafeShuffleWriter—explaining their design choices, selection criteria, and the corresponding shuffle reader architecture in a production‑grade Spark 2.1.1 environment.

Big Data · Distributed computing · Shuffle

0 likes · 13 min read

dbaplus Community

Feb 13, 2019 · Big Data

How Zhihu Scaled Its Real-Time Analytics with Druid and Smart Redis Caching

Zhihu built a self‑service analytics platform on Druid, introduced a multi‑level Redis caching strategy, split long‑duration queries across multiple brokers, and added automatic cache invalidation to dramatically improve query latency and resource usage for massive daily request volumes.

Analytics · Big Data · Caching

0 likes · 13 min read

Ctrip Technology

Feb 13, 2019 · R&D Management

Ctrip’s Technology Evolution: From Call‑Center Era to Big Data and AI

The article outlines Ctrip’s three‑phase technology evolution—from a simple call‑center architecture to layered internet and mobile platforms, and finally to a cloud‑based big‑data and AI‑driven ecosystem—highlighting architectural changes, operational challenges, and strategic lessons for fast‑growing internet companies.

Big Data · Ctrip · R&D Management

0 likes · 13 min read

Youzan Coder

Feb 1, 2019 · Big Data

Design and Implementation of Log Parsing for a Big Data Offline Task Platform

The article describes a log‑parsing feature for Youzan’s big‑data offline platform. It captures runtime logs from Hive, Spark, DataX, MapReduce, and HBase jobs, categorizes scheduling types, and extracts metrics such as read/write bytes, shuffle volume, and GC time. The logs are processed in real time through a Filebeat, Logstash, Kafka, and Spark Streaming pipeline, and the results are stored in Redis for monitoring, optimization, and resource‑usage ranking.

Big Data · Resource Monitoring · YARN

0 likes · 7 min read

Didi Tech

Jan 31, 2019 · Big Data

Router-Based Federation in Hadoop: Architecture, Components, and Didi’s Deployment

Router‑Based Federation replaces Hadoop’s single‑point HDFS bottleneck with a server‑side global namespace managed by Routers and a State Store, enabling scalable, highly available sub‑clusters; Didi back‑ported the feature, deployed five Routers, fixed numerous bugs, and contributed patches to improve stability and functionality.

Big Data · HDFS · Hadoop

0 likes · 11 min read

DataFunTalk

Jan 30, 2019 · Artificial Intelligence

Real‑Time Metrics Processing Technology for Financial Risk Control and Anti‑Fraud

This article outlines the challenges of financial risk control in the internet era and presents a comprehensive real‑time metrics processing system, covering data leakage, fraud, big‑data opportunities, AI model deployment, and the technical architecture of the Bangsheng real‑time indicator platform.

AI · Big Data · Stream Processing

0 likes · 17 min read

ITFLY8 Architecture Home

Jan 29, 2019 · Operations

How to Optimize Large-Scale Log Systems for Real-Time Monitoring and Scalability

This article examines the design, deployment, and optimization of massive log systems, comparing architectures, discussing real‑time versus near‑real‑time requirements, and presenting practical improvements such as memory, CPU, network tuning, data partitioning, storage reduction, and component upgrades using ELK, Kafka, Fluentd, and HBase.

Big Data · ELK · Fluentd

0 likes · 18 min read

Alibaba Cloud Developer

Jan 28, 2019 · Big Data

How Alibaba’s Blink Supercharges Flink for Massive Stream and Batch Processing

Alibaba’s Blink, an internal enhancement of Apache Flink, is now open‑sourced, bringing advanced runtime, SQL/TableAPI, Hive compatibility, Zeppelin integration, and a revamped Flink Web UI to dramatically boost performance and scalability for both streaming and batch workloads.

Batch processing · Big Data · Flink

0 likes · 16 min read

21CTO

Jan 26, 2019 · Big Data

Data Lake vs Data Warehouse: Which One Powers Your Business?

This article explains the core differences between data lakes and data warehouses, their respective strengths, and how they complement each other to support both exploratory analytics and routine business reporting.

Analytics · Big Data · Data Lake

0 likes · 5 min read

NetEase Game Operations Platform

Jan 25, 2019 · Big Data

Understanding Exactly-Once Semantics in Apache Flink: Challenges and Implementation

This article analyzes the difficulties of achieving exactly-once delivery in Apache Flink, explains the distinction between state and end‑to‑end semantics, and details how idempotent and transactional sinks—illustrated with the Bucketing File Sink—realize exactly‑once guarantees through checkpoint‑based two‑phase commit.

Big Data · Exactly-Once · Flink

0 likes · 13 min read

dbaplus Community

Jan 23, 2019 · Big Data

How Zhihu Built a Scalable Data‑Sync Platform with Sqoop and DataX

This article explains Zhihu's journey from ad‑hoc MySQL‑Hive sync using Oozie + Sqoop to a unified, platform‑based data synchronization service that now handles thousands of tables, over 10 TB daily, with load‑aware scheduling, incremental pulls, schema change handling, and tight integration with their offline job scheduler.

Big Data · DataX · ETL

0 likes · 14 min read

21CTO

Jan 23, 2019 · Big Data

Can 1.4 Billion Users Fit Into One WeChat Group? A Technical Feasibility Study

This article analyzes whether the entire Chinese population could be added to a single WeChat group, examining user statistics, message volume, required bandwidth, CPU processing limits, Moore's law projections, supercomputer alternatives, hardware costs, storage demands, and practical challenges, concluding that it is theoretically possible but practically infeasible.

Big Data · Performance · Server

0 likes · 10 min read

MaGe Linux Operations

Jan 23, 2019 · Big Data

How Bloom Filters Power Fast Big Data Searches with Python

This tutorial walks through building a simple Python search engine for big data, covering Bloom filter basics, tokenization with major and minor segmentation, inverted index creation, and implementing both simple and complex (AND/OR) queries, complete with code examples and visual illustrations.

AND/OR queries · Big Data · Python

0 likes · 15 min read

Tencent Cloud Developer

Jan 17, 2019 · Artificial Intelligence

Deep Learning for Big Data Recommendation Systems: Tencent's Industrial Practice

Tencent’s industrial practice shows how a large‑scale offline‑nearline‑online “Shield” recommendation architecture, powered by the DeepR framework built on RCaffe, uses deep semantic embeddings, massive neural networks and reinforcement‑learning decisions to handle billions of daily requests, demonstrating that data richness and engineering capability, not model depth alone, drive performance in big‑data recommendation systems.

Big Data · Deep Learning · Neural Network

0 likes · 13 min read

JD Tech

Jan 17, 2019 · Operations

Technical Overview of JD's Archimedes Resource Scheduling System

The article presents a detailed technical analysis of JD's Archimedes project, describing its evolution from JDOS 2.0 to a large‑scale container scheduling platform that dramatically improves resource utilization, deployment speed, and cost efficiency across JD’s data centers.

AI · Big Data · JD

0 likes · 6 min read

Youzan Coder

Jan 16, 2019 · Big Data

How Youzan Scaled Real‑Time Analytics with Flink: Architecture, Pitfalls, and Lessons

This article walks through Youzan's real‑time platform architecture, explains why Flink was chosen over Spark Structured Streaming, details practical challenges such as container over‑provisioning and monitoring overhead, shares solutions for Spring integration and async caching, and outlines future directions for SQL‑based streaming and scheduler improvements.

Big Data · Flink · Real-time Streaming

0 likes · 19 min read

StarRing Big Data Open Lab

Jan 16, 2019 · Big Data

What’s New in Transwarp TDH 5.2.3? Key Performance and Stability Enhancements

TDH 5.2.3 introduces a series of stability and performance upgrades—including transaction and compaction optimizations, enhanced error handling, SQL length protection, improved Oracle‑compatible UDFs, default resource pool support, Guardian caching, TxSQL monitoring, and workflow and OLAP engine fixes—aimed at delivering a more reliable big‑data platform.

Big Data · Database · Optimization

0 likes · 10 min read

dbaplus Community

Jan 13, 2019 · Databases

January 2019 DB-Engines Newsletter: Latest Database Releases & Key Features

The January 2019 DB-Engines newsletter compiles the newest releases, feature highlights, and performance improvements across RDBMS, NoSQL, NewSQL, time‑series, big‑data, domestic, and cloud database families, while also explaining the ranking methodology and providing download links for the full issue.

Big Data · Cloud Computing · Databases

0 likes · 41 min read

Youzan Coder

Jan 9, 2019 · Big Data

How Youzan Scaled 5,000 Daily SparkSQL Jobs: Migration Lessons from Hive

This article details Youzan's transition from Hive to SparkSQL, covering platform architecture, usability and performance enhancements, migration strategies, automated engine selection, and future plans that together reduced resource consumption by up to 67% while handling thousands of daily jobs.

Availability · Big Data · Data Platform

0 likes · 13 min read

360 Quality & Efficiency

Jan 4, 2019 · Big Data

Overview of Big Data Processing Engines: MapReduce, Tez, Spark, and Flink

This article reviews the evolution and characteristics of major big‑data processing engines—from first‑generation Hadoop MapReduce to second‑generation DAG‑based Tez, third‑generation in‑memory Spark, and fourth‑generation real‑time Flink—highlighting their batch and streaming use cases.

Big Data · Flink · MapReduce

0 likes · 9 min read

dbaplus Community

Jan 3, 2019 · Backend Development

Supercharging Elasticsearch for Billion-Row Queries: Practical Tips

This guide details how to optimize Elasticsearch for handling billions of daily records, covering core Lucene concepts, index and shard configuration, performance‑tuning parameters, and practical testing methods to achieve sub‑second query responses and long‑term data retention.

Big Data · Elasticsearch · Performance Optimization

0 likes · 13 min read

Big Data Technology & Architecture

Jan 3, 2019 · Big Data

Deploying Apache Flink on YARN and Running Flink Jobs

This tutorial explains how to deploy Apache Flink on a Hadoop YARN cluster, covering both YARN session mode and direct job submission, and demonstrates running the built‑in WordCount example with command‑line options for input, output, and resource configuration.

Apache Flink · Big Data · Flink Deployment

0 likes · 8 min read

Big Data Technology & Architecture

Jan 3, 2019 · Big Data

Reading Kafka Topics with Flink: A Step‑by‑Step Guide

This tutorial demonstrates how to use Apache Flink's Kafka connector to read data from Kafka topics with exactly‑once semantics, covering Maven dependencies, consumer configuration, checkpointing for fault tolerance, and a complete Scala example that writes the streamed data to HDFS.

Big Data · Flink · KafkaConnector

0 likes · 5 min read

360 Quality & Efficiency

Jan 2, 2019 · Big Data

Understanding ETL and Data Warehouses: A Beginner’s Guide

This article introduces the fundamentals of Business Intelligence, explains what ETL and data warehouses are, compares them with traditional databases, and outlines the main characteristics and popular tools such as Hive used in modern big‑data environments.

BI · Big Data · Data Integration

0 likes · 5 min read

Big Data Technology & Architecture

Jan 2, 2019 · Big Data

Optimizing Spark Direct Kafka Consumption: Subpartition Concurrency and Repartition Strategies

To address the long processing time caused by uneven Spark partitions when reading Kafka via the Direct approach, this article explains the SPARK‑22056 solution that modifies KafkaRDD.getPartitions to support a configurable 'topic.partition.subconcurrency' parameter, discusses its trade‑offs, and presents alternative repartition and multithreading techniques.

Big Data · Partitioning · Scala

0 likes · 6 min read

Big Data Technology & Architecture

Jan 2, 2019 · Big Data

Understanding Spark Streaming Backpressure Mechanism

The article explains how Spark Streaming backpressure, introduced in version 1.5, automatically adjusts data ingestion rates based on processing delays, replaces manual rate limits, and details its architecture, configuration parameters, and usage for preventing data backlog and executor OOM.

Big Data · Rate Control · Spark

0 likes · 6 min read

Big Data Technology & Architecture

Jan 1, 2019 · Big Data

Insights from the Real-Time Big Data Meetup: Spark Structured Streaming Overview

The meetup on September 8, co‑hosted by InfoQ and Huawei Cloud, featured Databricks engineer Tathagata Das explaining Spark Structured Streaming’s concepts, fault‑tolerance, performance, event‑time handling, and real‑world use cases such as Apple’s security platform, highlighting its scalability and integration with various data sources.

Big Data · Spark · Structured Streaming

0 likes · 8 min read

Big Data Technology & Architecture

Dec 31, 2018 · Big Data

Overview of the Big Data Ecosystem and Core Technologies

This article provides a comprehensive overview of the big data ecosystem, explaining key components such as Hadoop, HDFS, Spark, Hive, Pig, HBase, and related tools, and describes how they work together to store, process, and analyze massive datasets efficiently.

Big Data · Hadoop · Hive

0 likes · 16 min read

Architects Research Society

Dec 30, 2018 · Big Data

Overview of Major Apache Big Data Processing Frameworks

This article provides a concise overview of numerous Apache open‑source projects—including Ignite, MapReduce, Pig, JAQL, Spark, Storm, Flink, Apex, REEF, Twill, and Beam—that enable distributed in‑memory storage, real‑time and batch processing, and advanced analytics for large‑scale data workloads.

Apache · Big Data · Distributed computing

0 likes · 22 min read

Youzan Coder

Dec 28, 2018 · Big Data

Quantifying HBase Write Path: Disk and Network Costs for High‑Throughput Scenarios

This article analytically breaks down HBase's write pipeline, quantifies disk and network overheads for massive random writes, derives formulas for resource consumption under realistic assumptions, and offers concrete tuning recommendations to optimize throughput and reduce cost.

Big Data · HBase · Performance

0 likes · 16 min read

Tencent Cloud Developer

Dec 28, 2018 · Big Data

Intelligent Operations for Tencent Cloud Big Data Platform: Challenges, Practices, and Future Directions

Tencent Cloud’s big‑data platform manages massive, multi‑component clusters with an AIOps framework that aggregates logs and metrics, applies statistical and machine‑learning anomaly detection, uses regression and reinforcement learning for job‑parameter optimization, and integrates offline and online pipelines, achieving over 88% precision. Planned work includes automated root‑cause analysis, productized tools, platform‑level algorithm integration, and cross‑domain model reuse.

Big Data · Cloud Computing · Intelligent Operations

0 likes · 20 min read
